rohan anil

10.1K posts

@_arohan_

member of technical staff & co-founder of @coreautoai - and continuing to aspire to understand deep learning.

Joined December 2017
2.3K Following · 42.1K Followers
Lakshya A Agrawal @LakshyAAAgrawal
Figure 14 (in the appendix) provides a breakdown of various ablations and comparative experiments. First, we see that the FST-trained model, even without the co-optimized prompt (“Slow only (FST w/o prompt)”, green), gains +4.2pp over the RL-only model. Next, FST with the co-optimized prompt (“Slow + Fast (FST)”, green) outperforms the RL-only model with a GEPA-optimized prompt (“Slow + Fast (RL + GEPA)”) by +3.8pp, and the gap widens further if we run GEPA again on the FST-optimized model instead of just taking the co-optimized prompt. We see the same pattern on other datasets:
⦁ On HoVer: FST 30.8% vs RL+GEPA 24.9% (+5.9pp), even while using 3x fewer data points.
⦁ In easy-to-hard generalization (training on Polaris, testing on HMMT25): FST 39.7% vs RL+GEPA 37.1% (+2.6pp), using 1.4x fewer data points. This is with the co-optimized prompt; running GEPA on top of the FST model achieves 41.4% (+4.3pp).
[image attached]
rohan anil retweeted
Rishabh Agarwal @agarwl_
Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason to restrict learning to being either in-context or in-weights, so a natural idea is "Learning, Fast and Slow" (FST): slow learning is the LLM's weights trained with RL, while fast learning is the context / prompt (fast weights) optimized with GEPA. Compared to RL alone, FST performs better while being more data-efficient, more adaptable (plasticity), and forgetting less (it stays closer to the base model).

I think this idea of learning both fast and slow weights would be a good foundation for continual learning.

PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it was a very cool idea. See more details here: gepa-ai.github.io/gepa/blog/2026…
[image attached]
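A minimal, self-contained sketch of the fast/slow loop described above, illustrating the idea rather than the actual FST or GEPA implementation; every function here is a hypothetical placeholder:

```python
# Sketch of "Learning, Fast and Slow" (FST), per the description above:
# slow weights are the LLM parameters trained with RL; fast weights are
# the prompt, re-optimized with a GEPA-style prompt optimizer.

def rl_update(weights, prompt, tasks):
    """Placeholder for a few RL steps on the slow weights (hypothetical)."""
    return weights  # a real trainer would run e.g. policy-gradient updates

def gepa_optimize_prompt(weights, prompt, tasks):
    """Placeholder for GEPA-style prompt optimization (hypothetical)."""
    return prompt  # a real optimizer would propose and select prompt edits

def fst_train(weights, prompt, tasks, n_rounds=10):
    """Alternate slow (weight) and fast (prompt) learning."""
    for _ in range(n_rounds):
        # Slow learning: update the weights with RL, conditioned on the
        # current co-optimized prompt.
        weights = rl_update(weights, prompt, tasks)
        # Fast learning: freeze the weights and co-optimize the prompt.
        prompt = gepa_optimize_prompt(weights, prompt, tasks)
    return weights, prompt  # slow weights + co-optimized fast prompt

weights, prompt = fst_train(weights={}, prompt="Solve step by step.", tasks=[])
```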
rohan anil @_arohan_
Collector's edition
[image attached]
Daksh Malik @DakshMalik47
@_arohan_ @CoreAutoAI I bet you read this first on Slack, then Jerry approved the post, and here you are trying to get more likes than the original. I understand the game, Rohan
Core Automation @CoreAutoAI
We have an agent that looks like a dog with a surfboard, and that's making me uneasy
typedfemale @typedfemale
@_arohan_ I have a Chrome extension that replaces "muon" with "shampoo"
[image attached]
typedfemale @typedfemale
they just created a million muon variants
rohan anil retweeted
will depue @willdepue
golden age for optimizers right now. every day another SoapyShampooGluon^-1/2 (RMSMatched) drops
Tony S.F. @tonysilveti
@_arohan_ And DASGO for the instantaneous version; now I remember this paper by @dakovalev1, who made this connection explicit too (with analysis).
[image attached]
Tony S.F. @tonysilveti
Wow! More theoretical analysis linking the spectral norm and the row norm. They make a nice argument using "row-block diagonal dominance" of the layer-wise Hessian to say that the spectral LMO and the row-norm LMO should give equivalent asymptotic dynamics (as width grows).
Shenyang Deng ✈️ ICML2026 @DengShenyang24:

1/n Please stop by 👋. This is not just another ICML 2026 optimizer paper. We have rich intuition to share on why simple preconditioners like orthogonalization and row-normalization specifically benefit NN optimization. Quick overview below 🧵
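To make the two preconditioners in this thread concrete, here is a small self-contained numpy sketch (an illustration, not code from the paper): the LMO under the spectral norm orthogonalizes the gradient by replacing its singular values with 1 (the polar factor, as in Muon), while the row-norm LMO simply normalizes each row of the gradient.

```python
import numpy as np

def spectral_lmo(g):
    """LMO direction under the spectral norm: the polar factor of G.

    All singular values are replaced by 1 (Muon-style orthogonalization).
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def row_norm_lmo(g, eps=1e-12):
    """LMO direction under a max-row-l2 norm: normalize each row of G."""
    return g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))  # toy gradient matrix

print(np.linalg.svd(spectral_lmo(g), compute_uv=False))  # all ~1
print(np.linalg.norm(row_norm_lmo(g), axis=1))           # all ~1
```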
rohan anil retweeted
Shenyang Deng ✈️ ICML2026 @DengShenyang24
This conclusion isn't unique to our work. In fact, a number of recent concurrent works have arrived at the same finding:
[7] A Minimalist Optimizer Design for LLM Pretraining.
[8] SRON: State-Free LLM Training via Row-Wise Gradient Normalization.
[9] Mano: Restriking Manifold Optimization for LLM Training.
[10] On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer.
[13] Aurora: A Leverage-Aware Optimizer for Rectangular Matrices.
Shenyang Deng ✈️ ICML2026 @DengShenyang24
Prior work proposed fairly abstract frameworks and only adopted the diagonal-dominance approximation at implementation time, due to compute constraints, so you can't say they understood why this trick works for neural networks. Common optimization-theory setups are not equivalent to the neural-network optimization problem: nonconvex smooth optimization, OCO problem setups, and many overly abstract theoretical frameworks simply cannot explain why various tricks work specifically on neural networks. You can read our blog to understand why prior optimization-theory work fails to predict Muon's success.
Tony S.F. @tonysilveti
@_arohan_ Yes, I've now found it here too, and indeed the name used was "simple" rather than "one-sided".
[image attached]
Tony S.F. @tonysilveti
@_arohan_ ...I didn't see that connection before. Yes, it makes sense if you take as the convention that one-sided shampoo changes the power from -1/4 to -1/2. But I don't find one-sided shampoo mentioned in either the original paper or your follow-up paper. Is there a different name used?
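For context on the -1/4 vs -1/2 convention: Shampoo preconditions a gradient matrix G as L^{-1/4} G R^{-1/4} with L = G Gᵀ and R = Gᵀ G, and the one-sided variant drops one factor while doubling the exponent, giving G (Gᵀ G)^{-1/2}, which is exactly the polar factor / orthogonalization of G (the Muon direction). A small numpy check of that identity, as a sketch using a single gradient rather than Shampoo's accumulated statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=(6, 4))  # toy gradient; full column rank almost surely

# One-sided Shampoo with power -1/2: G (G^T G)^{-1/2}
r = g.T @ g
evals, evecs = np.linalg.eigh(r)
r_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
one_sided = g @ r_inv_sqrt

# Polar factor of G (orthogonalization, as in Muon): U V^T
u, _, vt = np.linalg.svd(g, full_matrices=False)
polar = u @ vt

print(np.allclose(one_sided, polar))  # True: the two coincide
```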