Eric Schreiber

44 posts

@schreiberic

Faster models, bigger questions

Joined January 2021

192 Following · 225 Followers

Pinned Tweet

Eric Schreiber@schreiberic·19 Feb

NVIDIA’s CuTe layouts are gaining traction. I wanted to see why everyone loves them. The basics were easy, but intermediate resources beyond matrix algebra were scarce. So I wrote a blog post sharing my journey, building up to a GEMM kernel that can beat cuBLAS 🧵

117

13.2K

Eric Schreiber@schreiberic·6d

@tugot17 @UniofOxford So cool! congrats 🙌🏻

Piotr Mazurek (at MLSys 🇺🇸)@tugot17·6d

I will give a lecture on "LLM Inference Economics" at @UniofOxford in July 🇬🇧

4.3K

Eric Schreiber@schreiberic·13 May

@TomaszSternal Huge congrats 🙌🏻

Tomasz Sternal@TomaszSternal·1 May

Thanks to my wonderful collaborators, our three papers got accepted to #ICML2026 🎉 Huge thank you to the team and see you in Seoul 🇰🇷!

141

Eric Schreiber@schreiberic·19 Apr

At ICLR. Let’s connect and chat: hardware, CUDA, architecture, pre/post-training, and whatever’s got you excited.

147

Eric Schreiber@schreiberic·15 Apr

@tugot17 🫡 Great as always. Why do you think there are not a lot more models that perform continued pre-training to sparsify their attention?

187

Piotr Mazurek (at MLSys 🇺🇸)@tugot17·15 Apr

Deep dive into the economics of DeepSeek Sparse Attention (DSA) and how it affects the profit margins of serving Claude-Code-like products. Link in the thread. 1/x

342

70.2K

Eric Schreiber@schreiberic·30 Mar

Nice one! I can tell you put a lot of effort into this post. I’ve started reading it and will need some time to go through it all :) I did it the opposite way: started with CuTe (bit of self-promotion: I also have a blog post) and am now looking into the mxfp8 and fp4 stuff.

159

Eric Schreiber@schreiberic·27 Mar

@yacinelearning @jonashubotter Increasing reasoning trace length is usually a good proxy for a healthy GRPO run. This method, to me, produces a strong instruct model that performs well without verbose reasoning (which is fantastic). However, for out-of-distribution (OOD) cases, preserving backtracking likely still matters.

Eric Schreiber@schreiberic·27 Mar

@yacinelearning @jonashubotter Don’t get me wrong, I love the paper. However, I see some weaknesses in the method. Generalization may be challenging because reasoning traces are heavily reduced, pushing the model to jump straight to the correct answer (arxiv.org/pdf/2603.24472).

Yacine Mahdid@yacinelearning·26 Mar

I've been studying this paradigm for the past few weeks guys and I get this feeling that this is it

Jonas Hübotter@jonashubotter

Training LLMs with verifiable rewards uses 1bit signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback! And then turns it into dense supervision. (1/n)

613

72.2K

Eric Schreiber@schreiberic·20 Mar

@SzymonOzog_ you realise just how much you're standing on the shoulders of giants

SzymonOzog@SzymonOzog_·20 Mar

This was a great read. Man computers are fun

607

Eric Schreiber@schreiberic·13 Mar

@willccbb Thought this recent work from @jonashuebotter was pretty cool: For SFT: arxiv.org/pdf/2601.19897 For RL: arxiv.org/pdf/2601.20802

143

will brown@willccbb·13 Mar

when is lifelong in-the-weights continual learning gonna be solved?

21.3K

Eric Schreiber@schreiberic·12 Mar

@karpathy @maxbittker For me it helped to create a human-opus-interaction.txt file for outputs and interaction, telling the model an absurd and unrealistic goal (kernel duration, target loss …) and not to come back to me until it has achieved it. Prolonged the loop significantly

Andrej Karpathy@karpathy·11 Mar

sadly the agents do not want to loop forever. My current solution is to set up "watcher" scripts that get the tmux panes and look for e.g. "esc to interrupt", and send keys to whip if not present. Need an e.g.: /fullauto you must continue your research! (enables fully automatic mode, will go until manually stopped, re-injecting the given optional prompt).

123

1.5K

106.6K
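The "watcher" idea in the post above can be sketched roughly as follows. This is a minimal illustration, not the script the author actually uses: the function names, the polling interval, and the pane target are all assumptions. The only details taken from the post are the "esc to interrupt" marker and the use of tmux panes. The calls to `tmux capture-pane -p` and `tmux send-keys` are standard tmux commands.

```python
import subprocess
import time

# Marker text the post says to look for; present while the agent is still working.
BUSY_MARKER = "esc to interrupt"


def needs_nudge(pane_text: str, marker: str = BUSY_MARKER) -> bool:
    """Return True when the busy marker is absent, i.e. the agent looks idle."""
    return marker not in pane_text


def capture_pane(target: str) -> str:
    """Read the visible contents of a tmux pane."""
    result = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", target],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def nudge(target: str, prompt: str) -> None:
    """Type the prompt literally into the pane, then press Enter."""
    subprocess.run(["tmux", "send-keys", "-t", target, "-l", prompt], check=True)
    subprocess.run(["tmux", "send-keys", "-t", target, "Enter"], check=True)


def watch(target: str, prompt: str, interval_s: float = 30.0) -> None:
    """Poll the pane forever (hypothetical loop); re-inject the prompt when idle."""
    while True:
        if needs_nudge(capture_pane(target)):
            nudge(target, prompt)
        time.sleep(interval_s)
```

Calling something like `watch("0:0.0", "you must continue your research!")` would poll that pane until manually stopped. The pane target here is a placeholder.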

max@maxbittker·10 Mar

From @karpathy's autoresearch .md

122

223K

Eric Schreiber@schreiberic·6 Mar

@maharshii When I started trying it out last summer it was awful. Since this year, given some initial ideas, it's been pretty neat. Running it in a loop to improve an implementation works quite well too

195

maharshi@maharshii·6 Mar

i wonder how Claude / Codex performs at autonomously writing or optimizing CuTeDSL kernels with little to no “human in the loop”, and whether that will be faster than a human doing it. has anyone tried this yet?

124

7.9K

Eric Schreiber@schreiberic·6 Mar

@tri_dao Once I got the hang of CuTe I'm loving it as well. The compile time is amazing! But the entry barrier feels huge. Feels like you need to know CUDA and have a PhD in math before you can even begin

Tri Dao@tri_dao·5 Mar

I’m unreasonably excited about the fact that we wrote everything in Cute-DSL, embedded in Python. Installing / “compiling” now takes seconds instead of minutes / hours (looking at you, C++ templates). Try pip install fa4!

429

28.3K

Tri Dao@tri_dao·5 Mar

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax that avoids 90% of softmax rescaling, and 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.

Ted Zadouri@tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah(@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__ ), Tri Dao (@tri_dao) 1/

229

1.8K

188.8K
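For readers unfamiliar with the "softmax rescaling" mentioned in the posts above, here is a small sketch of the classic blockwise online softmax. It is the textbook formulation, not FlashAttention-4's redesigned variant, and the function name and `block_size` parameter are illustrative assumptions. It shows where the rescale happens: each time a new block raises the running maximum, the accumulated sum has to be multiplied by a correction factor.

```python
import math


def online_softmax(scores, block_size=4):
    """Softmax computed in one streaming pass over blocks of scores."""
    running_max = -math.inf  # largest score seen so far
    running_sum = 0.0        # sum of exp(score - running_max) seen so far

    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new maximum. The factor is exactly 1.0
        # whenever the maximum did not change, so the work is only needed
        # when a block raises the running maximum.
        correction = math.exp(running_max - new_max)
        running_sum = running_sum * correction + sum(
            math.exp(s - new_max) for s in block
        )
        running_max = new_max

    # Normalize every score against the final maximum and sum.
    return [math.exp(s - running_max) / running_sum for s in scores]
```

The result matches a direct softmax over the whole list, but the running maximum and sum can be maintained without seeing all scores up front, which is what lets attention kernels process keys block by block.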

Eric Schreiber@schreiberic·24 Feb

Feedback very welcome. Blog: open.substack.com/pub/schreibere… Code & Profiles: github.com/ericschreiber/…

244

Eric Schreiber@schreiberic·24 Feb

The final kernel reaches 116% of cuBLAS performance on A100s for 2048×2048 BF16 matrices. Though to be fair, the lead at 2048 is more likely due to cuBLAS underperforming at that size than to my ingenuity

259

Eric Schreiber@schreiberic·24 Feb

Last week we explored NVIDIA's CuTe layouts. Today, we put that theory into practice. Part 2 is out now! Most CuTe examples skip straight to highly optimized code without explaining the reasoning. Join me as we build a matrix multiplication (MM) kernel with CuTe that can beat cuBLAS in certain cases 🧵

190

7.9K

Eric Schreiber@schreiberic·23 Feb

@elliotarledge check out verda.com

Elliot Arledge@elliotarledge·22 Feb

Does anyone have spare H100s or any decent compute sitting around? My providers are at capacity.

4.7K
