Pinned Tweet
Eric Schreiber
44 posts

Eric Schreiber
@schreiberic
Faster models, bigger questions
Joined January 2021
192 Following 225 Followers


Thanks to my wonderful collaborators, our three papers got accepted to #ICML2026 🎉
Huge thank you to the team and see you in Seoul 🇰🇷!

@tugot17 🫡 Great as always.
Why do you think so few models use continued pre-training to sparsify their attention?

@yacinelearning @jonashubotter Increasing reasoning trace length is usually a good proxy for a healthy GRPO run. This method, to me, produces a strong instruct model that performs well without verbose reasoning (which is fantastic). However, for out-of-distribution (OOD) cases, preserving backtracking likely still matters.

@yacinelearning @jonashubotter Don’t get me wrong, I love the paper. However, I see some weaknesses in the method. Generalization may be challenging because reasoning traces are heavily reduced, pushing the model to jump straight to the correct answer (arxiv.org/pdf/2603.24472).

I've been studying this paradigm for the past few weeks guys and I get this feeling that this is it
Jonas Hübotter @jonashubotter
Training LLMs with verifiable rewards uses a 1-bit signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback and then turns it into dense supervision. (1/n)

@SzymonOzog_ you realise just how much you’re standing on the shoulders of giants

@willccbb Thought this recent work from @jonashuebotter was pretty cool:
For SFT: arxiv.org/pdf/2601.19897
For RL: arxiv.org/pdf/2601.20802

@karpathy @maxbittker For me it helped to create a human-opus-interaction.txt file for outputs and interaction, giving the model an absurd and unrealistic goal (kernel duration, target loss …) and telling it not to come back to me until it has achieved it. That prolonged the loop significantly.

sadly the agents do not want to loop forever. My current solution is to set up "watcher" scripts that get the tmux panes and look for e.g. "esc to interrupt", and send keys to whip if not present. Need an e.g.:
/fullauto you must continue your research!
(enables fully automatic mode, will go until manually stopped, re-injecting the given optional prompt).
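
A minimal sketch of the "watcher" idea described above, assuming the agent runs in a tmux pane and shows the text "esc to interrupt" while it is busy. The function names, the polling interval, and the pane target are hypothetical; only `tmux capture-pane -p` and `tmux send-keys` are real tmux commands. This is an illustration, not the script the tweet's author uses.

```python
import subprocess
import time

BUSY_MARKER = "esc to interrupt"            # text the tweet says to look for
NUDGE = "you must continue your research!"  # prompt re-injected when idle


def needs_nudge(pane_text: str, marker: str = BUSY_MARKER) -> bool:
    """Return True when the busy marker is absent, i.e. the agent looks idle."""
    return marker not in pane_text


def capture_pane(target: str) -> str:
    # `tmux capture-pane -p` prints the visible pane contents to stdout.
    result = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", target],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def send_nudge(target: str, prompt: str = NUDGE) -> None:
    # `-l` sends the prompt as literal text; the second call presses Enter.
    subprocess.run(["tmux", "send-keys", "-t", target, "-l", prompt], check=True)
    subprocess.run(["tmux", "send-keys", "-t", target, "Enter"], check=True)


def watch(target, interval=30.0, capture=capture_pane, nudge=send_nudge,
          max_polls=None):
    """Poll the pane and nudge whenever it looks idle. Returns the nudge count.

    `capture` and `nudge` are injectable so the loop can be tried without tmux;
    `max_polls=None` means run until manually stopped.
    """
    polls = 0
    nudges = 0
    while max_polls is None or polls < max_polls:
        if needs_nudge(capture(target)):
            nudge(target)
            nudges += 1
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
    return nudges
```

Calling `watch("agent:0.0")` (a hypothetical session:window.pane target) would poll every 30 seconds until interrupted. Matching on a single marker string is fragile, since the marker depends on the agent's UI and may change.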

@maharshii When I started trying it out last summer it was awful. Since this year, given some initial ideas, it’s been pretty neat. Also running in a loop to improve an implementation works quite well too

@tri_dao Once I got the hang of CuTe I'm loving it as well. The compile time is amazing!
But the entry barrier feels huge. Feels like you need to know CUDA and have a PhD in math before you can even begin

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so crazy fast that attn fwd is bottlenecked by exponential, and attn bwd is bottlenecked by shared memory bandwidth.
Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, new online softmax to avoid 90% of softmax rescaling, 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
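
To make "exponential emulation with polynomials" concrete, here is a rough sketch of the general idea in plain Python: split the input into an integer and a fractional part, approximate 2^f on [0, 1) with a low-degree polynomial, and apply the integer part as an exponent shift. The degree and the Taylor coefficients are assumptions chosen for illustration; they are not the coefficients or the scheme used in FlashAttention-4, and the name `exp2_poly` is hypothetical.

```python
import math

LN2 = math.log(2.0)
# Assumed coefficients: degree-5 Taylor expansion of 2**f = exp(f * ln 2).
COEFFS = [LN2 ** k / math.factorial(k) for k in range(6)]


def exp2_poly(x: float) -> float:
    """Approximate 2**x using only multiplies, adds, and an exponent shift."""
    i = math.floor(x)      # integer part, handled exactly by the exponent
    f = x - i              # fractional part in [0, 1)
    p = 0.0
    for c in reversed(COEFFS):   # Horner evaluation of the polynomial in f
        p = p * f + c
    return math.ldexp(p, i)      # p * 2**i


for x in (-3.5, 0.0, 0.75, 4.2):
    approx, exact = exp2_poly(x), 2.0 ** x
    print(f"x={x:+.2f}  approx={approx:.6f}  exact={exact:.6f}")
```

The appeal on hardware is that the polynomial runs on ordinary multiply-add units instead of the special-function unit that normally computes exp2. A production kernel would more likely use minimax-fitted coefficients than a Taylor series.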

Feedback very welcome.
Blog: open.substack.com/pub/schreibere…
Code & Profiles: github.com/ericschreiber/…