mobicham

580 posts

@mobicham

I like to shrink dem models 🤏 ML/AI @dropbox Prev. Co-Founder & Principal Scientist @mobius_labs (acquired by @dropbox) PhD @inria

Berlin, Germany · Joined November 2023
114 Following · 707 Followers
mobicham @mobicham
@QuixiAI Isn't that just the standard DeepSeek-style block quant?
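For reference, the DeepSeek-style block quant mentioned above assigns each tile of the weight matrix its own scale so the tile fits the FP8 E4M3 range (±448). A minimal stdlib-only sketch, assuming the usual DeepSeek-V3 recipe (128×128 weight tiles); `block_scales` is a hypothetical helper, not Eric's script:

```python
# Illustrative sketch of DeepSeek-style block quantization (NOT the
# reverse-engineered Qwen script): each BLOCK x BLOCK tile of the
# weight matrix gets its own scale = absmax / E4M3_MAX.
E4M3_MAX = 448.0  # largest representable FP8 E4M3 magnitude

def block_scales(w, block=128):
    """w: 2D list of floats; returns a 2D grid of per-tile scales."""
    rows, cols = len(w), len(w[0])
    scales = []
    for r0 in range(0, rows, block):
        row = []
        for c0 in range(0, cols, block):
            absmax = max(
                abs(w[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row.append(absmax / E4M3_MAX)
        scales.append(row)
    return scales
```

Dequantization multiplies each FP8 tile back by its scale, so only one extra float per tile is stored.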
Eric Hartford @QuixiAI
I reverse engineered Qwen 3.5's FP8 format, and provide a script to recreate it.
[image]
mobicham @mobicham
@MainzOnX I think they were compiled with the same flags, same Triton version, different ptxas version
Adam Mainz @MainzOnX
@mobicham Oh wtf, that's ridiculous. PTX-to-ptxas being a black box always sucks. Probably the NV PTX optimization flags under the hood not giving the results we all expect
mobicham @mobicham
Triton ships ptxas 12.9 for Blackwell, but CUDA 13.0 ptxas adds support for e2m1x2.f16x2 which makes activation quant go brrr. However, it seems that ptxas 13.0 actually generates worse kernels, typically with large M 🤔
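The e2m1 in `e2m1x2.f16x2` is the 4-bit FP4 format (1 sign, 2 exponent, 1 mantissa bit) that the instruction converts two-at-a-time to FP16 pairs. A small decoder, following the OCP Microscaling FP4 value table (my own illustration, not Triton or ptxas code):

```python
# Decode a 4-bit e2m1 (FP4) code: sign | exp(2 bits) | mantissa(1 bit).
# The eight magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6.
def e2m1_value(code):
    sign = -1.0 if code & 0b1000 else 1.0
    e = (code >> 1) & 0b11   # 2 exponent bits
    m = code & 0b1           # 1 mantissa bit
    if e == 0:               # subnormal: 0 or 0.5
        return sign * 0.5 * m
    return sign * (1.0 + 0.5 * m) * 2.0 ** (e - 1)
```

With only these 16 values, per-block scales carry most of the dynamic range, which is why fast activation quant matters so much for NVFP4 throughput.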
mobicham @mobicham
@MainzOnX 3.6.0, but this is not a Triton issue, it's a ptxas issue; the PTX emitted by Triton is identical (except .version)
Kimbo @kimbochen
@mobicham Wow, this is cursed. Different numerics for the same instr on different GPUs?
mobicham @mobicham
This bit flip might be the issue
[image]
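As a general illustration of why a single differing bit in generated code can change numerics (the screenshot shows the actual instruction diff, which this does not reproduce), here is a one-bit flip in an FP32 encoding:

```python
import struct

# Reinterpret a float's IEEE-754 bits, flip one, and reinterpret back.
def float_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

bits = float_to_bits(1.0)             # 0x3F800000
y = bits_to_float(bits ^ (1 << 22))   # flip the top mantissa bit -> 1.5
```

One mantissa bit is a 50% relative change here; in a fused multiply-add chain, such differences compound across the whole GEMM.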
mobicham @mobicham
@snowclipsed Another piece of disappointment: the RTX PRO 6000 is 2000 TFLOPS for FP4 on paper, but if you factor in everything (activation quant, etc.), the best you can reach is ~1300 TFLOPS
mobicham @mobicham
@snowclipsed Yeah, it is actually worse than I thought, sometimes more than 10% slower on sm_120 for large shapes. You can try it (back up first): cp /usr/local/cuda13.0/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas-blackwell; rm -rf ~/.triton/cache;
mobicham @mobicham
@sweinoid Yes, custom kernel, but you can patch vLLM to use it
swein @sweinoid
@mobicham Are you having to modify a kernel to do this? I doubt it's as simple as a vLLM arg!
mobicham @mobicham
Little trick to outperform Cutlass for NVFP4 on sm_120: use mixed TMA: because TMA requires padding to 128, I don't use it for the activation scales, resulting in a huge bump for decoding speed🫡!
[image]
mobicham @mobicham
🫡
[image]
mobicham @mobicham
@gaunernst What I mean is that you can patch the static cache class in Hugging Face to return only the right slice, so it will not run attention over the whole max seq len. I remember there is some trickery with torch.compile to make slicing work without breaking it
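The idea above, as a minimal stdlib sketch (not the actual HF `StaticCache` API; the class and method names here are hypothetical): keep the preallocated fixed-size buffer, but hand attention only the filled prefix.

```python
# Sketch: a static (preallocated) cache that returns only the valid
# slice, so attention runs over cur_len instead of max_seq_len.
class SlicedStaticCache:
    def __init__(self, max_seq_len):
        self.buf = [0.0] * max_seq_len  # fixed-size storage, never grows
        self.cur_len = 0

    def update(self, new_keys):
        for k in new_keys:              # write in place, no reallocation
            self.buf[self.cur_len] = k
            self.cur_len += 1

    def view(self):
        return self.buf[: self.cur_len]  # only the right slice
```

The static shape keeps compilation/allocation happy, while the sliced view avoids attending over tens of thousands of empty positions.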
Thien Tran @gaunernst
@mobicham My problem with it is that, for example, if I want max context = 40k, the static cache impl in HF will do attention on all 40k all the time (iiuc), which makes it very slow. I could limit it to 1-2k context, for example, but I feel that's "cheating"
Thien Tran @gaunernst
Small update to the Qwen3-0.6B "megakernel". Managed to hit ~700tok/s on 5090 (including tokenizer decode + print to screen). Quite far from @AlpinDale's 1k tok/s. And speed drops significantly at longer context due to missing "flash" decoding i.e. split-K for attention
[image]
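The missing split-K ("flash") decoding mentioned above partitions the KV cache across blocks, computes a partial softmax per chunk, and merges the partials with a log-sum-exp rescale. A scalar toy version of the merge (my own illustration, not the megakernel's code):

```python
import math

# Per-chunk pass: numerically stable partial softmax over one KV split.
def partial(q, ks, vs):
    logits = [q * k for k in ks]
    m = max(logits)
    s = sum(math.exp(l - m) for l in logits)
    o = sum(math.exp(l - m) * v for l, v in zip(logits, vs))
    return m, s, o  # (chunk max, sum of exps, weighted value sum)

# Merge: rescale every partial to the global max, then combine.
def merge(parts):
    m = max(p[0] for p in parts)
    s = sum(p[1] * math.exp(p[0] - m) for p in parts)
    o = sum(p[2] * math.exp(p[0] - m) for p in parts)
    return o / s  # identical result to one pass over the whole cache
```

Because the merge is exact, the chunks can run on different SMs in parallel, which is what recovers throughput at long context.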
mobicham @mobicham
@gaunernst You can technically patch static cache in huggingface and make it work. I did something similar to use arbitrary batch-sizes with the same static cache instance
Thien Tran @gaunernst
Using the same setup, HF eager reaches ~130tok/s. I know there is torch.compile support with static cache, but that requires attention over the full cache size, which makes for a fancy demo if you limit the context, but I think it's "cheating" if the setup can't support longer context.
mobicham @mobicham
@xeophon It's worse than Qwen3 4B Instruct 2507 in terms of instruction following
Xeophon @xeophon
qwen3.5 4b + disabled thinking is so good, man
mobicham @mobicham
@BarneyFlames I think so 👀! (code not merged yet, mega PR yet to be made)
mobicham @mobicham
Finally squeezing some time to revisit GemLite 🫡
[image]
Alpin @AlpinDale
I really want SOTA OSS models to be multimodal. It sucks that you have to use a downgraded model to get vision.