I wrote a matmul kernel on B200 in pure CUDA/PTX that beats cuBLAS by 6% at M=N=K=8192.
Inspired by @gaunernst's blog on Blackwell instructions, with benchmarking done on @modal.
Blog: paulwillchan.com/articles/outpe…
Repo: github.com/Better-Call-Pa…


