rohan anil

10.1K posts

@_arohan_

member of technical staff & co-founder of @coreautoai - and continuing to aspire to understand deep learning.

Joined December 2017
2.3K Following · 42.1K Followers
Lakshya A Agrawal @LakshyAAAgrawal
Figure 14 (in the appendix) provides a breakdown of various ablations and comparative experiments. First, we see that the FST-trained model, even without the co-optimized prompt (“Slow only (FST w/o prompt)”, green), gains +4.2pp over the RL-only model. Next, FST with the co-optimized prompt (“Slow + Fast (FST)”, green) outperforms the RL-only model with a GEPA-optimized prompt (“Slow + Fast (RL + GEPA)”) by +3.8pp, and the gap widens further if we run GEPA again on the FST-optimized model instead of just taking the co-optimized prompt. We see the same pattern on other datasets:
⦁ On HoVer: FST 30.8% vs RL+GEPA 24.9% (+5.9pp), even while using 3x fewer data points.
⦁ In easy-to-hard generalization (training on Polaris, testing on HMMT25): FST 39.7% vs RL+GEPA 37.1% (+2.6pp), using 1.4x fewer data points. This is with the co-optimized prompt; running GEPA on top of the FST model achieves 41.4% (+4.3pp).
[image attached]
rohan anil retweeted
Rishabh Agarwal @agarwl_
Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason to restrict learning to being either in-context or in-weights, so a natural idea is "Learning, Fast and Slow" (FST): slow learning is the LLM's weights trained with RL, while fast learning is the context / prompt (fast weights) optimized with GEPA. Compared to RL alone, FST performs better while being more data-efficient, more adaptable (plasticity), and forgetting less (it stays closer to the base model).

I think this idea of learning both fast and slow weights would be a good foundation for continual learning.

PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it was a very cool idea. See more details here: gepa-ai.github.io/gepa/blog/2026…
[image attached]
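A minimal, self-contained sketch of the fast/slow loop described above, illustrating the idea rather than the actual FST or GEPA implementation; every function here is a hypothetical placeholder:

```python
# Sketch of "Learning, Fast and Slow" (FST), per the description above:
# slow weights are the LLM parameters trained with RL; fast weights are
# the prompt, re-optimized with a GEPA-style prompt optimizer.

def rl_update(weights, prompt, tasks):
    """Placeholder for a few RL steps on the slow weights (hypothetical)."""
    return weights  # a real trainer would run e.g. policy-gradient updates

def gepa_optimize_prompt(weights, prompt, tasks):
    """Placeholder for GEPA-style prompt optimization (hypothetical)."""
    return prompt  # a real optimizer would propose and select prompt edits

def fst_train(weights, prompt, tasks, n_rounds=10):
    """Alternate slow (weight) and fast (prompt) learning."""
    for _ in range(n_rounds):
        # Slow learning: update the weights with RL, conditioned on the
        # current co-optimized prompt.
        weights = rl_update(weights, prompt, tasks)
        # Fast learning: freeze the weights and co-optimize the prompt.
        prompt = gepa_optimize_prompt(weights, prompt, tasks)
    return weights, prompt  # slow weights + co-optimized fast prompt

weights, prompt = fst_train(weights={}, prompt="Solve step by step.", tasks=[])
```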
rohan anil @_arohan_
Collector's edition
[image attached]
Daksh Malik @DakshMalik47
@_arohan_ @CoreAutoAI I bet you read this first on Slack, then Jerry approved the post, and here you are trying to get more likes than the original. I understand the game, Rohan
Core Automation @CoreAutoAI
We have an agent that looks like a dog with a surfboard, and that's making me uneasy
typedfemale @typedfemale
@_arohan_ I have a Chrome extension that replaces "muon" with "shampoo"
[image attached]
typedfemale @typedfemale
they just created a million muon variants
rohan anil retweeted
will depue @willdepue
golden age for optimizers right now. every day another SoapyShampooGluon^-1/2 (RMSMatched) drops
Tony S.F. @tonysilveti
@_arohan_ And DASGO for the instantaneous version; now I remember this paper by @dakovalev1, who made this connection explicit too (with analysis).
[image attached]
Tony S.F. @tonysilveti
Wow! More theoretical analysis linking the spectral norm and the row norm. They make a nice argument using "row-block diagonal dominance" of the layer-wise Hessian to say that the spectral LMO and the row-norm LMO should give equivalent asymptotic dynamics (as width grows).
Shenyang Deng ✈️ ICML2026 @DengShenyang24:

1/n Please stop by 👋. This is not just another ICML 2026 optimizer paper. We have rich intuition to share on why simple preconditioners like orthogonalization and row-normalization specifically benefit NN optimization. Quick overview below 🧵
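To make the two preconditioners in this thread concrete, here is a small self-contained numpy sketch (an illustration, not code from the paper): the LMO under the spectral norm orthogonalizes the gradient by replacing its singular values with 1 (the polar factor, as in Muon), while the row-norm LMO simply normalizes each row of the gradient.

```python
import numpy as np

def spectral_lmo(g):
    """LMO direction under the spectral norm: the polar factor of G.

    All singular values are replaced by 1 (Muon-style orthogonalization).
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def row_norm_lmo(g, eps=1e-12):
    """LMO direction under a max-row-l2 norm: normalize each row of G."""
    return g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))  # toy gradient matrix

print(np.linalg.svd(spectral_lmo(g), compute_uv=False))  # all ~1
print(np.linalg.norm(row_norm_lmo(g), axis=1))           # all ~1
```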
rohan anil retweeted
Shenyang Deng ✈️ ICML2026 @DengShenyang24
This conclusion isn't unique to our work. In fact, a number of recent concurrent works have arrived at the same finding:
[7] A Minimalist Optimizer Design for LLM Pretraining.
[8] SRON: State-Free LLM Training via Row-Wise Gradient Normalization.
[9] Mano: Restriking Manifold Optimization for LLM Training.
[10] On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer.
[13] Aurora: A Leverage-Aware Optimizer for Rectangular Matrices.
Shenyang Deng ✈️ ICML2026 @DengShenyang24
Prior work proposed fairly abstract frameworks and only adopted the diagonal-dominance approximation at implementation time, due to compute constraints, so you can't say they understood why this trick works for neural networks. Common optimization-theory setups are not equivalent to the neural-network optimization problem: nonconvex smooth optimization, OCO problem setups, and many overly abstract theoretical frameworks simply cannot explain why various tricks work specifically on neural networks. You can read our blog to understand why prior optimization-theory work fails to predict Muon's success.
Tony S.F. @tonysilveti
@_arohan_ Yes, I've now found it here too, and indeed the name used was "simple" rather than "one-sided".
[image attached]
Tony S.F. @tonysilveti
@_arohan_ ...I didn't see that connection before. Yes, it makes sense if you take as the convention that one-sided shampoo changes the power from -1/4 to -1/2. But I don't find one-sided shampoo mentioned in either the original paper or your follow-up paper. Is there a different name used?
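For context on the -1/4 vs -1/2 convention: Shampoo preconditions a gradient matrix G as L^{-1/4} G R^{-1/4} with L = G Gᵀ and R = Gᵀ G, and the one-sided variant drops one factor while doubling the exponent, giving G (Gᵀ G)^{-1/2}, which is exactly the polar factor / orthogonalization of G (the Muon direction). A small numpy check of that identity, as a sketch using a single gradient rather than Shampoo's accumulated statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=(6, 4))  # toy gradient; full column rank almost surely

# One-sided Shampoo with power -1/2: G (G^T G)^{-1/2}
r = g.T @ g
evals, evecs = np.linalg.eigh(r)
r_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
one_sided = g @ r_inv_sqrt

# Polar factor of G (orthogonalization, as in Muon): U V^T
u, _, vt = np.linalg.svd(g, full_matrices=False)
polar = u @ vt

print(np.allclose(one_sided, polar))  # True: the two coincide
```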