Jędrzej Maczan

85 posts

@jedmaczan

Low level deep learning: torch-webgpu / tiny-vllm / pytorch compiler / @pagedout_zine https://t.co/zM6gQ2L7VL cover art by https://t.co/rs0nMCNnN6

Kaszëbë, Poland · Joined May 2023
253 Following · 513 Followers
Pinned Tweet
Jędrzej Maczan @jedmaczan
I built a tiny-vllm in C++ and CUDA
- paged attention
- continuous batching
- educational
- 100% human-written™
And now I'm writing a course where you will build your own vLLM yourself. Still a work in progress; I'll finish by the end of April. All for free ofc, just a GitHub repo
15 replies · 30 reposts · 593 likes · 17.9K views
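For readers wondering what "paged attention" refers to above: the KV cache is stored in fixed-size physical blocks addressed through a per-sequence block table, rather than one contiguous buffer per sequence, which is also what makes continuous batching practical. Below is a minimal host-side sketch of that bookkeeping in C++; every name in it (KVBlockPool, BLOCK_TOKENS, etc.) is invented here for illustration and is not taken from tiny-vllm.

    // Sketch of a paged KV-cache block table (hypothetical names, not
    // tiny-vllm's actual API). Each sequence maps its logical token
    // positions to fixed-size physical blocks drawn from a shared pool,
    // so sequences can grow and free memory independently -- the core
    // bookkeeping behind paged attention and continuous batching.
    #include <cstdio>
    #include <utility>
    #include <vector>

    constexpr int BLOCK_TOKENS = 16; // tokens per physical KV block

    struct KVBlockPool {
        std::vector<int> free_blocks; // indices of unused physical blocks
        explicit KVBlockPool(int num_blocks) {
            for (int i = num_blocks - 1; i >= 0; --i) free_blocks.push_back(i);
        }
        int alloc() { int b = free_blocks.back(); free_blocks.pop_back(); return b; }
        void release(int b) { free_blocks.push_back(b); }
    };

    struct Sequence {
        std::vector<int> block_table; // logical block -> physical block
        int num_tokens = 0;

        // Append one token, grabbing a new physical block when the last one fills.
        void append_token(KVBlockPool& pool) {
            if (num_tokens % BLOCK_TOKENS == 0) block_table.push_back(pool.alloc());
            ++num_tokens;
        }
        // Where token `pos`'s K/V lives: (physical block, offset within block).
        std::pair<int, int> locate(int pos) const {
            return {block_table[pos / BLOCK_TOKENS], pos % BLOCK_TOKENS};
        }
        void free_all(KVBlockPool& pool) {
            for (int b : block_table) pool.release(b);
            block_table.clear();
            num_tokens = 0;
        }
    };

    int main() {
        KVBlockPool pool(64);
        Sequence seq;
        for (int t = 0; t < 40; ++t) seq.append_token(pool);
        auto [blk, off] = seq.locate(37);
        std::printf("token 37 -> block %d, offset %d (%zu blocks used)\n",
                    blk, off, seq.block_table.size());
        seq.free_all(pool); // blocks return to the pool for other sequences
    }

An attention kernel then gathers K and V through the block table instead of assuming contiguity, so the scheduler can admit new sequences and reclaim finished ones block by block.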
ICE @ICE257_
@jedmaczan Followed, keep me updated
1 reply · 0 reposts · 1 like · 30 views
Yuvraj Singh @YuvrajS9886
@jedmaczan What are the exact prerequisites, like C or CUDA? Cus I know nothing ☠️
1 reply · 0 reposts · 2 likes · 244 views
Jędrzej Maczan @jedmaczan
Andrej, I got inspired by your llm.c and how you explain things from scratch. I'd love it if you took a look at my project and the course (I won't lie, I try to complement LLM101n a bit) @karpathy
0 replies · 0 reposts · 3 likes · 548 views
Yacine Mahdid @yacinelearning
yo guys what you working on this week???
[image attached]
47 replies · 0 reposts · 124 likes · 10K views
G K @gauravkaul
@jedmaczan Great work 👏 very well documented
1 reply · 0 reposts · 2 likes · 219 views
anirudh bv @anirudhbv_ce
Finally got my Softmax kernels running on a @nvidia Blackwell B300 today! A single-pass tiled Softmax and a two-pass streaming Online Softmax. Writing ct.load() feels like cheating compared to manual Triton pointer math when mapping directly to TMA hardware.
[4 images attached]
9 replies · 12 reposts · 122 likes · 6.3K views
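The two-pass streaming Online Softmax mentioned above can be sketched in scalar C++ (no tiling, TMA, or warp reductions; the names here are mine, not from the tweet's kernels): the first pass streams the row once while maintaining a running max and a running sum that is rescaled whenever the max changes, and the second pass normalizes.

    // Scalar sketch of two-pass streaming online softmax (the
    // Milakov & Gimelshein formulation). Pass 1 keeps a running max m
    // and a running sum s of exp(x[i] - m), rescaling s whenever a new
    // max appears; pass 2 normalizes. One read per pass, with no
    // separate max-reduction sweep up front.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    void online_softmax(const std::vector<float>& x, std::vector<float>& y) {
        float m = -INFINITY; // running max
        float s = 0.0f;      // running sum of exp(x[i] - m)
        for (float v : x) {  // pass 1: single streaming pass
            float m_new = std::max(m, v);
            s = s * std::exp(m - m_new) + std::exp(v - m_new); // rescale old sum
            m = m_new;
        }
        y.resize(x.size());
        for (size_t i = 0; i < x.size(); ++i) // pass 2: normalize
            y[i] = std::exp(x[i] - m) / s;
    }

    int main() {
        std::vector<float> logits = {1.0f, 3.0f, 2.0f, 5.0f}, probs;
        online_softmax(logits, probs);
        for (float p : probs) std::printf("%.4f ", p); // sums to 1
        std::printf("\n");
    }

A single-pass tiled variant instead folds the normalization into the accumulation itself, the same trick FlashAttention-style kernels rely on.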
Jędrzej Maczan reposted
Tri Dao @tri_dao
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax that avoids 90% of softmax rescaling, and 2-CTA MMA instructions that let two thread blocks share operands to reduce smem traffic.
Ted Zadouri @tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

30 replies · 230 reposts · 1.8K likes · 185.4K views
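Of the tricks listed, "exponential emulation with polynomials" is the easiest to illustrate: split x into an integer part, which 2^n handles exactly through the float exponent bits, and a fractional part in [0, 1) approximated by a low-degree polynomial, trading the SFU's ex2 instruction for a few FMAs. The degree and coefficients below are a generic textbook-style construction, not FA4's actual kernel code.

    // Sketch of exp2 emulation with a polynomial (generic construction,
    // not FA4's coefficients). Split x = n + f with n an integer and f
    // in [0, 1); 2^n is exact via the float exponent bits, and 2^f is
    // approximated by a degree-3 polynomial in Horner form.
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    float exp2_poly(float x) {
        float n = std::floor(x);
        float f = x - n; // fractional part in [0, 1)
        // Degree-3 polynomial approximating 2^f on [0, 1)
        // (illustrative near-minimax coefficients).
        float p = 0.0792f;
        p = p * f + 0.2249f;
        p = p * f + 0.6958f;
        p = p * f + 1.0000f;
        // Build 2^n by placing (n + 127) into the exponent field of a float.
        uint32_t bits = static_cast<uint32_t>(static_cast<int>(n) + 127) << 23;
        float scale;
        std::memcpy(&scale, &bits, sizeof(scale));
        return scale * p;
    }

    int main() {
        for (float x : {-3.7f, -0.5f, 0.0f, 1.25f, 4.2f})
            std::printf("x=%5.2f  poly=%9.5f  exact=%9.5f\n",
                        x, exp2_poly(x), std::exp2(x));
    }

This construction is good to roughly three to four decimal digits on the unit interval, the kind of accuracy budget a low-precision attention pipeline can plausibly tolerate.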
Jędrzej Maczan reposted
Leonardo de Moura @Leonard41111588
AI is writing a growing share of the world's software. No one is formally verifying any of it. New essay: "When AI Writes the World's Software, Who Verifies It?" leodemoura.github.io/blog/2026/02/2…
41 replies · 248 reposts · 1.6K likes · 420.4K views