Eric Schreiber

44 posts

@schreiberic

Faster models, bigger questions

Joined January 2021
192 Following · 225 Followers
Pinned Tweet
Eric Schreiber @schreiberic
NVIDIA’s CuTe layouts are gaining traction. I wanted to see why everyone loves them. The basics were easy, but intermediate resources beyond matrix algebra were scarce. So I wrote a blog post sharing my journey, building up to a GEMM kernel that can beat cuBLAS 🧵
6
11
117
13.2K
Tomasz Sternal @TomaszSternal
Thanks to my wonderful collaborators, our three papers got accepted to #ICML2026 🎉 Huge thank you to the team and see you in Seoul 🇰🇷!
2
0
7
141
Eric Schreiber @schreiberic
At ICLR. Let’s connect and chat: hardware, CUDA, architecture, pre/post-training, and whatever’s got you excited.
0
0
4
147
Eric Schreiber @schreiberic
@tugot17 🫡 Great as always. Why do you think there are not a lot more models that perform continued pre-training to sparsify their attention?
1
0
1
187
Piotr Mazurek (at MLSys 🇺🇸)
Deep dive into the economics of DeepSeek Sparse Attention (DSA) and how it affects the profit margins of serving Claude-Code-like products. Link in the thread 1/x
14
35
342
70.2K
Eric Schreiber @schreiberic
Nice one! I can tell you put a lot of effort into this post. I’ve started reading it and will need some time to go through it all :) I did it the opposite way: I started with CuTe (bit of self-promotion: I also have a blog post) and am now looking into the mxfp8 and fp4 stuff.
0
0
0
159
Eric Schreiber @schreiberic
@yacinelearning @jonashubotter Increasing reasoning trace length is usually a good proxy for a healthy GRPO run. This method, to me, produces a strong instruct model that performs well without verbose reasoning (which is fantastic). However, for out-of-distribution (OOD) cases, preserving backtracking likely still matters.
1
0
1
41
Eric Schreiber @schreiberic
@SzymonOzog_ you realise just how much you’re standing on the shoulders of giants
0
0
1
27
SzymonOzog @SzymonOzog_
This was a great read. Man computers are fun
1
0
14
607
will brown @willccbb
when is lifelong in-the-weights continual learning gonna be solved?
47
3
80
21.3K
Eric Schreiber @schreiberic
@karpathy @maxbittker For me it helped to create a human-opus-interaction.txt file for outputs and interaction, telling the model an absurd and unrealistic goal (kernel duration, target loss …) and not to come back to me until it has achieved it. Prolonged the loop significantly
0
0
0
22
Andrej Karpathy @karpathy
sadly the agents do not want to loop forever. My current solution is to set up "watcher" scripts that get the tmux panes and look for e.g. "esc to interrupt", and send keys to whip if not present. Need an e.g.: /fullauto you must continue your research! (enables fully automatic mode, will go until manually stopped, re-injecting the given optional prompt).
123
44
1.5K
106.6K
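A minimal sketch of the "watcher" idea described in the tweet above, assuming the agent runs in a tmux pane and that the string "esc to interrupt" is visible only while it is busy. The function names, the polling interval, and the pane target are hypothetical; only `tmux capture-pane` and `tmux send-keys` are real tmux commands.

```python
import subprocess
import time

MARKER = "esc to interrupt"                  # busy indicator quoted in the tweet
NUDGE = "you must continue your research!"   # prompt text quoted in the tweet


def needs_nudge(pane_text: str, marker: str = MARKER) -> bool:
    """Return True when the busy marker is absent, i.e. the agent looks idle."""
    return marker not in pane_text


def capture_cmd(target: str) -> list[str]:
    # `tmux capture-pane -p` prints the pane's visible contents to stdout.
    return ["tmux", "capture-pane", "-p", "-t", target]


def nudge_cmd(target: str, prompt: str = NUDGE) -> list[str]:
    # `tmux send-keys` types the prompt; the trailing Enter key submits it.
    return ["tmux", "send-keys", "-t", target, prompt, "Enter"]


def watch(target: str, interval_s: float = 30.0) -> None:
    """Poll one pane forever and re-inject the prompt whenever it looks idle."""
    while True:
        result = subprocess.run(
            capture_cmd(target), capture_output=True, text=True, check=True
        )
        if needs_nudge(result.stdout):
            subprocess.run(nudge_cmd(target), check=True)
        time.sleep(interval_s)
```

Calling `watch("research:0.0")` (a hypothetical session:window.pane target) would keep nudging that pane until the script is stopped manually.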
max @maxbittker
From @karpathy's autoresearch .md
50
122
3K
223K
Eric Schreiber @schreiberic
@maharshii When I started trying it out last summer it was awful. Since this year, given some initial ideas, it’s been pretty neat. Running it in a loop to improve an implementation also works quite well.
0
0
0
195
maharshi @maharshii
I wonder how Claude / Codex perform at autonomously writing or optimizing CuTeDSL kernels with little to no “human in the loop”, and whether that will be faster than a human doing it. Has anyone tried this yet?
17
1
124
7.9K
Eric Schreiber @schreiberic
@tri_dao Once I got the hang of CuTe I'm loving it as well. The compile time is amazing! But the entry barrier feels huge. Feels like you need to know CUDA and have a PhD in math before you can even begin
0
0
0
69
Tri Dao @tri_dao
I’m unreasonably excited about the fact that we wrote everything in Cute-DSL, embedded in Python. Installing / “compiling” now takes seconds instead of minutes / hours (looking at you, C++ templates). Try pip install fa4!
5
19
429
28.3K
Tri Dao @tri_dao
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so crazy fast that attn fwd is bottlenecked by exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, new online softmax to avoid 90% of softmax rescaling, 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri @tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah(@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__ ), Tri Dao (@tri_dao) 1/

31
229
1.8K
188.8K
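A rough illustration of what "exponential emulation with polynomials" can look like, not the FA4 implementation: split x into an integer part n and a fraction f, approximate 2^f with a short polynomial, and scale by 2^n. The degree-5 Taylor coefficients and the function name `exp2_poly` are assumptions chosen for readability; the tweet does not say which polynomial or degree FA4 uses.

```python
import math

# Assumption: Taylor coefficients of 2**f = exp(f * ln 2), degrees 0..5.
# A tuned kernel would more likely use minimax coefficients.
COEFFS = [math.log(2.0) ** k / math.factorial(k) for k in range(6)]


def exp2_poly(x: float) -> float:
    """Approximate 2**x using only floor, multiply-add, and an exponent shift."""
    n = math.floor(x)          # integer part, handled exactly via the exponent
    f = x - n                  # fractional part, always in [0, 1)
    p = 0.0
    for c in reversed(COEFFS): # Horner evaluation of the polynomial in f
        p = p * f + c
    return math.ldexp(p, n)    # p * 2**n


if __name__ == "__main__":
    for x in (-3.25, 0.5, 7.9):
        print(x, exp2_poly(x), 2.0 ** x)
```

Since exp(x) = 2^(x · log2 e), a softmax can be expressed through such an exp2 routine, which trades a call to the special-function unit for a handful of multiply-adds.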
Eric Schreiber @schreiberic
The final kernel reaches 116% of cuBLAS on A100s for 2048×2048 BF16 matrices. Though to be fair, the specific lead at 2048 is more likely due to cuBLAS underperforming at that size than to my ingenuity.
1
0
1
259
Eric Schreiber @schreiberic
Last week we explored NVIDIA's CuTe layouts. Today, we put that theory into practice. Part 2 is out now! Most CuTe examples skip straight to highly optimized code without explaining the reasoning. Join me as we build an MM kernel with CuTe that can beat cuBLAS in certain cases 🧵
3
22
190
7.9K
Elliot Arledge @elliotarledge
Does anyone have spare H100s or any decent compute sitting around? My providers are at capacity.
9
1
30
4.7K