mobicham

580 posts

@mobicham

I like to shrink dem models 🤏 ML/AI @dropbox Prev. Co-Founder & Principal Scientist @mobius_labs (acquired by @dropbox) PhD @inria

Berlin, Germany · Joined November 2023
114 Following · 707 Followers
mobicham @mobicham
@QuixiAI Isn't that just the standard DeepSeek-style block quant?
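For reference, the DeepSeek-style block quant mentioned above assigns each tile of the weight matrix its own scale so the tile fits the FP8 E4M3 range (±448). A minimal stdlib-only sketch, assuming the usual DeepSeek-V3 recipe (128×128 weight tiles); `block_scales` is a hypothetical helper, not Eric's script:

```python
# Illustrative sketch of DeepSeek-style block quantization (NOT the
# reverse-engineered Qwen script): each BLOCK x BLOCK tile of the
# weight matrix gets its own scale = absmax / E4M3_MAX.
E4M3_MAX = 448.0  # largest representable FP8 E4M3 magnitude

def block_scales(w, block=128):
    """w: 2D list of floats; returns a 2D grid of per-tile scales."""
    rows, cols = len(w), len(w[0])
    scales = []
    for r0 in range(0, rows, block):
        row = []
        for c0 in range(0, cols, block):
            absmax = max(
                abs(w[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row.append(absmax / E4M3_MAX)
        scales.append(row)
    return scales
```

Dequantization multiplies each FP8 tile back by its scale, so only one extra float per tile is stored.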
Eric Hartford @QuixiAI
I reverse engineered Qwen 3.5's FP8 format, and provide a script to recreate it.
[image]
mobicham @mobicham
@MainzOnX I think they were compiled with the same flags, same Triton version, different ptxas version
Adam Mainz @MainzOnX
@mobicham Oh wtf, that's ridiculous. PTX-to-ptxas being a black box always sucks. Probably the NV PTX optimization flags under the hood not giving the results we all expect
mobicham @mobicham
Triton ships ptxas 12.9 for Blackwell, but CUDA 13.0 ptxas adds support for e2m1x2.f16x2 which makes activation quant go brrr. However, it seems that ptxas 13.0 actually generates worse kernels, typically with large M 🤔
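The e2m1 in `e2m1x2.f16x2` is the 4-bit FP4 format (1 sign, 2 exponent, 1 mantissa bit) that the instruction converts two-at-a-time to FP16 pairs. A small decoder, following the OCP Microscaling FP4 value table (my own illustration, not Triton or ptxas code):

```python
# Decode a 4-bit e2m1 (FP4) code: sign | exp(2 bits) | mantissa(1 bit).
# The eight magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6.
def e2m1_value(code):
    sign = -1.0 if code & 0b1000 else 1.0
    e = (code >> 1) & 0b11   # 2 exponent bits
    m = code & 0b1           # 1 mantissa bit
    if e == 0:               # subnormal: 0 or 0.5
        return sign * 0.5 * m
    return sign * (1.0 + 0.5 * m) * 2.0 ** (e - 1)
```

With only these 16 values, per-block scales carry most of the dynamic range, which is why fast activation quant matters so much for NVFP4 throughput.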
mobicham @mobicham
@MainzOnX 3.6.0, but this is not a Triton issue, it's a ptxas issue; the PTX emitted by Triton is identical (except .version)
Kimbo @kimbochen
@mobicham Wow, this is cursed. Different numerics for the same instr on different GPUs?
mobicham @mobicham
This bit flip might be the issue
[image]
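As a general illustration of why a single differing bit in generated code can change numerics (the screenshot shows the actual instruction diff, which this does not reproduce), here is a one-bit flip in an FP32 encoding:

```python
import struct

# Reinterpret a float's IEEE-754 bits, flip one, and reinterpret back.
def float_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]

bits = float_to_bits(1.0)             # 0x3F800000
y = bits_to_float(bits ^ (1 << 22))   # flip the top mantissa bit -> 1.5
```

One mantissa bit is a 50% relative change here; in a fused multiply-add chain, such differences compound across the whole GEMM.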
mobicham @mobicham
@snowclipsed Another piece of disappointment: the RTX PRO 6000 is 2000 TFLOPS for FP4 on paper, but if you factor in everything (activation quant, etc.), the best you can reach is ~1300 TFLOPS
mobicham @mobicham
@snowclipsed Yeah, it is actually worse than I thought, sometimes more than 10% slower on sm_120 for large shapes. You can try it (back up first): cp /usr/local/cuda13.0/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas-blackwell; rm -rf ~/.triton/cache;
mobicham @mobicham
@sweinoid Yes, custom kernel, but you can patch vLLM to use it
swein @sweinoid
@mobicham Are you having to modify a kernel to do this? I doubt it's as simple as a vLLM arg!
mobicham @mobicham
Little trick to outperform Cutlass for NVFP4 on sm_120: use mixed TMA: because TMA requires padding to 128, I don't use it for the activation scales, resulting in a huge bump for decoding speed🫡!
[image]
mobicham @mobicham
🫡
[image]
mobicham @mobicham
@gaunernst What I mean is that you can patch the static cache class in Hugging Face to return only the right slice, so it will not run attention over the whole max seq len. I remember there is some trickery with torch.compile to make slicing work without breaking it
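The idea above, as a minimal stdlib sketch (not the actual HF `StaticCache` API; the class and method names here are hypothetical): keep the preallocated fixed-size buffer, but hand attention only the filled prefix.

```python
# Sketch: a static (preallocated) cache that returns only the valid
# slice, so attention runs over cur_len instead of max_seq_len.
class SlicedStaticCache:
    def __init__(self, max_seq_len):
        self.buf = [0.0] * max_seq_len  # fixed-size storage, never grows
        self.cur_len = 0

    def update(self, new_keys):
        for k in new_keys:              # write in place, no reallocation
            self.buf[self.cur_len] = k
            self.cur_len += 1

    def view(self):
        return self.buf[: self.cur_len]  # only the right slice
```

The static shape keeps compilation/allocation happy, while the sliced view avoids attending over tens of thousands of empty positions.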
Thien Tran @gaunernst
@mobicham My problem with it is that, for example, if I want max context = 40k, the static cache impl in HF will do attention on all 40k all the time (iiuc), which makes it very slow. I could limit it to 1-2k context, for example, but I feel that's "cheating"
Thien Tran @gaunernst
Small update to the Qwen3-0.6B "megakernel". Managed to hit ~700tok/s on 5090 (including tokenizer decode + print to screen). Quite far from @AlpinDale's 1k tok/s. And speed drops significantly at longer context due to missing "flash" decoding i.e. split-K for attention
[image]
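The missing split-K ("flash") decoding mentioned above partitions the KV cache across blocks, computes a partial softmax per chunk, and merges the partials with a log-sum-exp rescale. A scalar toy version of the merge (my own illustration, not the megakernel's code):

```python
import math

# Per-chunk pass: numerically stable partial softmax over one KV split.
def partial(q, ks, vs):
    logits = [q * k for k in ks]
    m = max(logits)
    s = sum(math.exp(l - m) for l in logits)
    o = sum(math.exp(l - m) * v for l, v in zip(logits, vs))
    return m, s, o  # (chunk max, sum of exps, weighted value sum)

# Merge: rescale every partial to the global max, then combine.
def merge(parts):
    m = max(p[0] for p in parts)
    s = sum(p[1] * math.exp(p[0] - m) for p in parts)
    o = sum(p[2] * math.exp(p[0] - m) for p in parts)
    return o / s  # identical result to one pass over the whole cache
```

Because the merge is exact, the chunks can run on different SMs in parallel, which is what recovers throughput at long context.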
mobicham @mobicham
@gaunernst You can technically patch static cache in huggingface and make it work. I did something similar to use arbitrary batch-sizes with the same static cache instance
Thien Tran @gaunernst
Using the same setup, HF eager reaches ~130tok/s. I know there is torch.compile support with static cache, but that requires attention over the full cache size, which makes for a fancy demo if you limit the context, but I think it's "cheating" if the setup can't support longer context.
mobicham @mobicham
@xeophon It's worse than Qwen3 4B Instruct 2507 in terms of instruction following
Xeophon @xeophon
qwen3.5 4b + disabled thinking is so good, man
mobicham @mobicham
@BarneyFlames I think so 👀! (code not merged yet, mega PR yet to be made)
mobicham @mobicham
Finally squeezing some time to revisit GemLite 🫡
[image]
Alpin @AlpinDale
I really want SOTA OSS models to be multimodal. It sucks that you have to use a downgraded model to get vision.