driss guessous

476 posts

@drisspg

bytes and nuggets @pytorch https://t.co/gWVJmW741f

Joined December 2023
251 Following · 1.5K Followers
driss guessous
driss guessous@drisspg·
To hell with big TMA long live ld/st
English
3
4
35
3.1K
driss guessous
driss guessous@drisspg·
@maharshii > it's unfortunate that torch scaled mm api does not provide a global scale dequantization argument
Can you elaborate here? docs.pytorch.org/docs/2.12/gene… (torch.nn.functional.scaled_mm) This does support global scales. We should probably expand a lil in the docs but here is gist: gist.github.com/drisspg/c97e3c…
English
1
1
6
398
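For context on what a "global scale" means in this exchange, here is a minimal plain-Python sketch of a two-level scaled matmul, where each operand carries per-block scales plus one per-tensor global scale. It does not call torch.nn.functional.scaled_mm and does not model FP4/FP8 rounding; the block size, function names, and values are all illustrative assumptions. It only shows that the two global scales factor out of the product, so they can be applied once on the output.

```python
# Hypothetical sketch: NOT the torch.nn.functional.scaled_mm API.
# Values stay as Python floats; no FP4/FP8 rounding is modelled.

BLOCK = 2  # illustrative block size along K (an assumption for this toy)


def dequant(q, block_scales, global_scale):
    """Dequantize a row-major matrix: value * block scale * global scale."""
    return [
        [v * block_scales[r][c // BLOCK] * global_scale for c, v in enumerate(row)]
        for r, row in enumerate(q)
    ]


def matmul(a, b_t):
    """a is M x K, b_t is N x K (B stored transposed); returns M x N."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b_t] for ra in a]


def scaled_mm_reference(qa, sa, ga, qb_t, sb, gb):
    """Apply block scales inside the product, global scales once at the end."""
    partial = matmul(dequant(qa, sa, 1.0), dequant(qb_t, sb, 1.0))
    return [[ga * gb * v for v in row] for row in partial]


qa = [[1.0, -2.0, 0.5, 3.0]]    # 1 x 4 "quantized" values
sa = [[0.5, 2.0]]               # one scale per block of 2 along K
qb_t = [[2.0, 1.0, -4.0, 0.5]]  # 1 x 4, B stored transposed
sb = [[4.0, 0.25]]
ga, gb = 0.25, 8.0              # per-tensor global scales

# Global scales folded into every element vs. applied once on the output.
full = matmul(dequant(qa, sa, ga), dequant(qb_t, sb, gb))
fused = scaled_mm_reference(qa, sa, ga, qb_t, sb, gb)
```

The toy values are powers of two so both paths agree exactly; with arbitrary floats they would agree only up to rounding.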
maharshi
maharshi@maharshii·
comparison of my CuTeDSL NVFP4 gemm kernel with cublas (via torch): it's unfortunate that torch scaled mm api does not provide a global scale dequantization argument and that makes the FLOPS come down. also, not sure why the FLOPS are this low, maybe power throttling?
[image]
maharshi@maharshii

I wrote a custom NVFP4 GEMM kernel in CuTeDSL stripping away almost all the fancy CuTe layouts "headache" in the official examples and doing the PTX, TMA, and Tcgen05 manually. It's crazy how low-level you can go with this and still be performant! My notes and code are below:

English
4
4
112
8.5K
tender
tender@tenderizzation·
I have a stacktrace right here. This is stable diffusion in pytorch, right after flash-attention was updated. The only difference between clean wholesome image generation and this compute-sanitizer IMA was that flash-attention upgrade removing head dim 160 kernels. This is what stable diffusion now looks like in pytorch.
[image]
English
1
0
83
3.7K
driss guessous
driss guessous@drisspg·
@_seemethere @difficultyang Yeah big caveat is that when you first use it it's gunna suck, but if you stick with it and actually muck around with the system prompt + extensions you end up with something that feels very tailored to your preferences
English
1
0
0
84
eli
eli@_seemethere·
@difficultyang @drisspg I usually use pi as a direct replacement for codex + Claude code. When I was still using opus I found pi to be a much better harness for opus directly. The nicest thing about pi is if the harness doesn’t do a thing that you need it to do it’s so easy to extend it.
English
1
0
1
121
driss guessous reposted
Han Guo
Han Guo@HanGuo97·
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
[image]
English
15
100
675
188.9K
driss guessous
driss guessous@drisspg·
Omni looks really cool, everything else is so mehh
English
0
0
4
571
driss guessous reposted
Degen CPA
Degen CPA@DrewVento·
BREAKING: Victor Wembanyama has joined Anthropic.
[2 images]
English
58
319
7.1K
355.4K
driss guessous
driss guessous@drisspg·
@typedfemale @ezyang
```
function sanitize() {
  CUTE_DSL_LINEINFO=1 \
  CUDA_LAUNCH_BLOCKING=1 \
  PYTORCH_NO_CUDA_MEMORY_CACHING=1 \
  compute-sanitizer --tool memcheck "$@"
}
```
A very handy zsh function
English
0
0
3
88
typedfemale
typedfemale@typedfemale·
@ezyang interesting, ok thanks i'll give it a try
English
1
0
2
143
typedfemale
typedfemale@typedfemale·
the CUDA caching allocator is such a great way to create extremely "interesting" bugs for yourself
English
8
1
125
13.8K
driss guessous
driss guessous@drisspg·
If you couldn’t be bothered to write it why in the world would I read it
English
1
4
65
5.8K
driss guessous
driss guessous@drisspg·
Yooooooooo so like what did OAI do with all those sora face scans
English
0
0
9
1K
driss guessous
driss guessous@drisspg·
the death of rigor happens one prompt at a time
English
0
0
7
394
typedfemale
typedfemale@typedfemale·
@drisspg i have something cool for you if you want to accelerate shift invariant mask mods, but i don't think outside of translation and reflection there is one
English
1
0
0
824
driss guessous
driss guessous@drisspg·
fun problem: given an integer function f and a domain D, which functions preserve ordered contiguity for every consecutive subspan of D? winner gets 4x faster FlexFlash Attention (for some mask mods)
English
2
1
13
2.2K
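One way to poke at this puzzle is brute force. The sketch below assumes a reading of "preserves ordered contiguity" that the post does not spell out: for every consecutive subspan of D with at least two elements, the images under f form a run of consecutive integers in a single direction (all ascending or all descending). The function name and the test functions are hypothetical.

```python
def preserves_ordered_contiguity(f, domain, min_len=2):
    """Check every consecutive subspan of `domain` with at least `min_len` items.

    Assumed definition (not from the post): the images of each span must be
    consecutive integers in one direction (all +1 steps or all -1 steps).
    """
    n = len(domain)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            images = [f(x) for x in domain[start:end]]
            steps = {b - a for a, b in zip(images, images[1:])}
            if steps not in ({1}, {-1}):
                return False
    return True


D = list(range(8))


def translation(x):
    return x + 5


def reflection(x):
    return 7 - x


def doubling(x):
    return 2 * x


def zigzag(x):
    return x ^ 1  # swaps neighbours: 1, 0, 3, 2, ...


results = {
    g.__name__: preserves_ordered_contiguity(g, D)
    for g in (translation, reflection, doubling, zigzag)
}
```

Under this assumed definition only the translation and the reflection pass, which is in line with the reply in the thread that names those two.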
driss guessous
driss guessous@drisspg·
@henrylhtsang ahh that is a fair but important detail; I didn't specify the subspan size of D. bigger is better, but > 1 is better than none
English
0
0
1
146