Robert M. Gower

568 posts


@gowerrobert

Often found scribbling down math with intermittent bursts of bashing out code.

New York City, USA · Joined June 2011
347 Following · 1.7K Followers
Pinned Tweet
Robert M. Gower@gowerrobert·
Do you want to do a Postdoc developing new methods/theory in Optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions on blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics simonsfoundation.org/flatiron/caree… 1/3
Robert M. Gower tweet media
Robert M. Gower@gowerrobert·
@ruuustem_10 Yes good point! It still irks me that we don't fully understand non-Euclidean methods on quadratics. This is a must if we are to rely on smoothness assumptions to understand Muon
Rustem@ruuustem_10·
So to me, the question of which set of assumptions gives the best predictive power is still unclear. In fact, convergence of non-Euclidean descent methods is not studied well enough even on quadratics (see our work arxiv.org/pdf/2603.05002).
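For context, "non-Euclidean descent" here refers (in its standard textbook form; the linked paper may study a variant) to steepest descent with respect to a general norm:

\[
x_{k+1} \;=\; x_k \;-\; t_k\, d_k,
\qquad
d_k \in \arg\max_{\|d\|\le 1} \langle \nabla f(x_k),\, d\rangle .
\]

Taking the Euclidean norm recovers normalized gradient descent, while the spectral norm on matrix-shaped parameters gives Muon-style updates; even on a quadratic $f(x)=\tfrac12 x^\top A x - b^\top x$, the behaviour depends on how the chosen norm interacts with $A$.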
Rustem@ruuustem_10·
This paper arxiv.org/pdf/2605.08980 studies spectral methods for nonsmooth functions, while most prior work requires smoothness in the analysis. @gowerrobert and the team provide counterexamples showing that spectral methods might not converge in the nonsmooth setting.
Robert M. Gower@gowerrobert·
@Ji_Ha_Kim @YouJiacheng @noahamsel @ejarlebring What problem are you referring to? This example just shows that the optimal polynomial approx to sign under the L2 norm does not satisfy the equioscillation theorem. The equioscillation theorem is about the L infinity norm.
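For background, the two characterizations being contrasted are standard approximation-theory facts (stated here as a reminder, in their usual form for a continuous target on a closed interval): the equioscillation theorem characterizes the best uniform (L-infinity) approximation, whereas the best L2 approximation is characterized by orthogonality of the error:

\[
p^\* = \arg\min_{p\in\mathcal{P}_n}\|f-p\|_\infty
\;\iff\;
f-p^\* \text{ attains } \pm\|f-p^\*\|_\infty \text{ with alternating signs at} \ge n+2 \text{ points},
\]
\[
p^\* = \arg\min_{p\in\mathcal{P}_n}\|f-p\|_2
\;\iff\;
\langle f-p^\*,\, q\rangle = 0 \ \text{ for all } q\in\mathcal{P}_n,
\]

so there is no reason for the L2-optimal polynomial approximation of sign to equioscillate.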
Robert M. Gower@gowerrobert·
@elon_lit Nice work, this looks very interesting. Curiously, we showed that Adam explicitly tracks this same centered gradient variance, and this SNR threshold looks very similar to the square of the Adam update, see here x.com/gowerrobert/st… Does this mean Adam is tracking this noise?
Robert M. Gower@gowerrobert

When β_1=β_2, we can first re-write Adam as below, where instead of the standard uncentered second moment, we have something that looks like a weird variance estimator. Fun fact, it is an online estimate of the variance! Let me explain ...
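For reference, one standard way to maintain an exponentially weighted mean and a *centered* second moment online (the generic recursion, not necessarily the exact rewrite in the rest of the quoted thread) is, with smoothing factor $\alpha = 1-\beta$:

\[
\delta_t = g_t - m_{t-1}, \qquad
m_t = m_{t-1} + \alpha\,\delta_t, \qquad
v_t = (1-\alpha)\bigl(v_{t-1} + \alpha\,\delta_t^{\,2}\bigr),
\]

so $m_t$ is the usual exponential moving average of the gradients and $v_t$ tracks their variance around that moving mean, in contrast to Adam's standard uncentered $v_t = \beta v_{t-1} + (1-\beta)\, g_t^{\,2}$.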

Elon Litman@elon_lit·
This is all good theory, but we wanted a usable tool for training. We derived an exact measure of how much noise is leaking into the signal channel, showing it is the only source of overfitting. Even better, we can compute this directly on the optimizer's current batch via a specific Wiener filter / SNR threshold, letting your neural network do population risk minimization! 🔥 🔥
Elon Litman tweet media
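As background on the terminology (the paper's exact construction may differ): for a scalar observation $g = s + n$ with independent zero-mean signal and noise, the Wiener (linear MMSE) estimate of the signal shrinks the observation by a gain determined by the signal-to-noise ratio,

\[
\hat{s} \;=\; \frac{\sigma_s^2}{\sigma_s^2 + \sigma_n^2}\, g \;=\; \frac{\mathrm{SNR}}{1+\mathrm{SNR}}\, g,
\qquad \mathrm{SNR} = \frac{\sigma_s^2}{\sigma_n^2},
\]

so the update is suppressed precisely where the estimated noise dominates the signal.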
Elon Litman@elon_lit·
We developed a unified theory of generalization in deep learning. It explains grokking, double descent, benign overfitting, and implicit bias. But theory is only half the story. It turns out that optimizing the population risk of any neural network amounts to a small change to your optimizer. 🧵
Elon Litman tweet media
Robert M. Gower@gowerrobert·
And now we are very proud and humbled to have received the ICLR 2026 Honorable Mention award for this work blog.iclr.cc/2026/04/23/ann… Very fun to have found this useful math nugget that can actually speed up LLM training.
Robert M. Gower@gowerrobert

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x
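For readers new to this, a minimal sketch of the kind of iteration being referred to: the classical cubic Newton-Schulz iteration for the polar factor (the matrix sign applied to the singular values) of a momentum matrix. The fixed coefficients below are only the textbook baseline; PolarExpress is about choosing an optimal sequence of such polynomials, and its coefficients are not reproduced here.

```python
import torch

def newton_schulz_polar(M, steps=10, eps=1e-7):
    """Approximate the polar factor U V^T of M = U S V^T using the classical
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.

    Illustrative baseline only: after Frobenius normalization the singular
    values lie in (0, 1], where this scalar map converges (slowly) to 1;
    optimized polynomial sequences converge much faster.
    """
    X = M / (M.norm() + eps)            # Frobenius normalization: singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the short-and-wide orientation
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

# Usage sketch: orthogonalize the momentum matrix before a Muon-style update.
# O = newton_schulz_polar(momentum_buffer)
```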

Robert M. Gower@gowerrobert·
@_arohan_ @tonysilveti Let me save you some time. If you keep following this logic of a closed-form prox and regularized secant equation, you get a new quasi-Newton method that works for non-convex problems. But it turns out this was already done here: arxiv.org/pdf/2403.02448
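For context, the secant equation mentioned here is the standard quasi-Newton condition (stated generically; a "regularized" secant equation typically penalizes violations of this condition rather than enforcing it exactly):

\[
B_{k+1}\, s_k = y_k, \qquad s_k = x_{k+1}-x_k, \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),
\]

or equivalently $H_{k+1} y_k = s_k$ for the inverse-Hessian approximation $H = B^{-1}$; classical quasi-Newton updates such as BFGS and DFP pick, among all symmetric matrices satisfying it, the one closest to the previous approximation in a weighted norm.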
rohan anil@_arohan_·
@gowerrobert @tonysilveti +1, this was also my main observation! Let me see if I can share it sooner as well. I think this is the single point of improvement that can make it mainstream.
Robert M. Gower@gowerrobert·
@tonysilveti I thought only criteria 1 and 2 were directly motivated through the secant equation; I don't see any such direct link from criterion 3 to the secant equation. In any case, it's simply $\mathbb{E}\,\|P\,dg - d\theta\|_{P^{-1}}^2$.
Tony S.F.@tonysilveti·
@gowerrobert The secant equation connection is there in the original paper when discussing the 3 criteria (unless I'm missing what you actually mean?)
Robert M. Gower@gowerrobert·
@jeffreycider @CV_novel_plume Using the optimal polynomials instead would improve exactly the iteration complexity, that is, it would require slightly fewer iterations to reach a desired loss.
Yuxin Fang@CV_novel_plume·
I’ve run a lot of experiments on Muon and its variants, and I’d bet that in this setting, the Muon baseline will be very hard to beat.
Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark

Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.

To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.

The rules are simple: the optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (this is the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers which would need kernel work or larger scale to become wallclock-efficient. Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast.

Beyond just improving the step-count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.

This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.

To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.
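To make the rule concrete, here is a hypothetical harness illustrating what "only the optimizer can change" amounts to; `build_model`, `train_loader`, `compute_loss`, and `val_loss` are placeholder names, not the benchmark's actual API.

```python
import torch

TARGET_VAL_LOSS = 3.28  # same target as the main speedrun

def run_benchmark(make_optimizer, build_model, train_loader, compute_loss, val_loss):
    """Count training steps until the validation loss reaches the target.

    Hypothetical harness: the model, data pipeline, and loss are fixed by the
    benchmark; a submission only supplies make_optimizer (plus its
    hyperparameters and schedules).
    """
    model = build_model()                             # architecture is fixed
    optimizer = make_optimizer(model.parameters())    # the only free choice
    for step, batch in enumerate(train_loader, start=1):
        loss = compute_loss(model, batch)             # loss and dataloader are fixed
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        if step % 100 == 0 and val_loss(model) <= TARGET_VAL_LOSS:
            return step                               # fewer steps = better
    return None  # target not reached

# Example submission: an AdamW baseline with some chosen hyperparameters.
# steps = run_benchmark(lambda p: torch.optim.AdamW(p, lr=3e-4, weight_decay=0.1),
#                       build_model, train_loader, compute_loss, val_loss)
```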

Robert M. Gower@gowerrobert·
@CV_novel_plume I agree with this statement. Over-tuning an optimizer to one problem doesn't really teach us anything. This is also why I find the AlgoPerf benchmark interesting for comparing optimizers mlcommons.org/benchmarks/alg… especially the self-tuning track.
Yuxin Fang@CV_novel_plume·
This is a very meaningful benchmark, but there is one caveat worth keeping in mind. In speedrun settings, there is now a clear trend toward using different optimizers and hyperparameters for different modules. I have to admit that this can bring real gains. But when comparing optimizers, we should not give hyperparameters unlimited freedom.

For example, if I first run a strong optimizer, then reverse-engineer an SGD hyperparameter schedule that tunes every neuron at every step to match it, SGD may appear to “simulate” Adam, Muon, or almost any optimizer. But that would not tell us much about SGD. It only means the optimizer has been hidden inside the hyperparameter schedule.

To me, the value of a good optimizer is the opposite: it should adapt internally, require fewer hand-tuned knobs, and transfer robustly across model scales. This kind of invariance across model scales is exactly what makes hyperparameter scaling laws meaningful. If we over-optimize the recipe for one particular scale, we may win that benchmark while losing the cross-scale structure we actually want to understand. @kellerjordan0 @tonysilveti @wen_kaiyue
Yuxin Fang@CV_novel_plume

I’ve run a lot of experiments on Muon and its variants, and I’d bet that in this setting, the Muon baseline will be very hard to beat.

Robert M. Gower@gowerrobert·
@FengzhuoZhang About their Hybrid Newton-Schulz in the v4 report, I understand they change the polynomials after 8 steps to ensure convergence. But it would converge even faster if they just used the *optimal* sequence of 10 polynomials, as we proposed here: x.com/gowerrobert/st…
Robert M. Gower@gowerrobert

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x

Fengzhuo Zhang@FengzhuoZhang·
The Newton–Schulz iteration coefficients optimized by DeepSeek-V4 are surprisingly strong: they effectively normalize all singular values to 1. This matches our previous intuition: a well-balanced spectrum may help strike a better balance across long-tail knowledge. Plot code: github.com/FengzhuoZhang/…
Fengzhuo Zhang tweet media
Fengzhuo Zhang@FengzhuoZhang

Why does Muon outperform Adam—and how? 🚀 Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning.

Three key findings:
> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.
> Muon yields more isotropic weights than Adam.
> In heavy-tailed tasks, Muon significantly improves tail-class learning compared to Adam.

Paper link: arxiv.org/pdf/2509.26030 A thread 🧵
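Background on why such coefficients can be read off from a scalar plot of the spectrum (a standard fact about these iterations, independent of the linked code): each step $X \mapsto aX + b\,(XX^\top)X + c\,(XX^\top)^2X$ acts on the singular values of $X$ one at a time. If $X = U\,\mathrm{diag}(\sigma_i)\,V^\top$, then

\[
aX + b\,(XX^\top)X + c\,(XX^\top)^2X \;=\; U\,\mathrm{diag}\bigl(p(\sigma_i)\bigr)\,V^\top,
\qquad p(\sigma) = a\sigma + b\sigma^3 + c\sigma^5,
\]

so composing the steps applies the composed scalar polynomial to every singular value, and "normalizing the spectrum" means choosing coefficients whose composition maps the whole range of $\sigma$ close to 1.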

Robert M. Gower@gowerrobert·
@torchcompiled Nice idea! For the retraction map, you may want to try the optimal polynomials instead. For instance, you could just apply the first two optimal polynomials to correct the approximation. We had an ICLR paper on this: iclr.cc/virtual/2026/o…
Robert M. Gower@gowerrobert

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x

Ethan@torchcompiled·
What if we could do Muon with just one Newton step while also achieving better loss?
Ethan tweet media
Robert M. Gower@gowerrobert·
@bozavlado @giffmana Yeah, this is mad, and it's the same issue as always with ADMM methods applied in this way. Unless these copies of the model are distributed across different machines, it makes no sense!
Vlado Boza@bozavlado·
Nobody talks about the fact that these optimizers actually need M times more memory (where M is the number of data chunks). Also: There is a huge advantage in wall-clock time on the XL model (comes from parallelization). Small advantage on loss vs number of tokens. No advantage on Nano. Maybe baselines for XL were taken from Nano.
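For context, the factor-of-$M$ memory cost is built into the consensus splitting that ADMM-style methods use (generic formulation; whether the optimizer under discussion matches it exactly is a separate question): with the data split into $M$ chunks with losses $f_1,\dots,f_M$, one solves

\[
\min_{x_1,\dots,x_M,\; z}\;\; \sum_{i=1}^{M} f_i(x_i)
\quad\text{subject to}\quad x_i = z,\;\; i=1,\dots,M,
\]

so every chunk keeps its own copy $x_i$ of the parameters (plus dual variables), roughly $M$ times the parameter memory, unless those copies live on different machines, which is exactly the distributed setting where the splitting pays off.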
Lucas Beyer (bl16)@giffmana·
New optimizer with earth-shattering plots making the rounds, and published in Nature too (Machine Intelligence, but let's just drop that part.) So of course I had to take a quick look. A few things I noticed that make me a bit sus, though I'm not saying to outright discard it. Each point is the caption of the corresponding screenshot below:

1. What on earth are these SGDM vs AdamW gaps? They are not normal -> untuned baselines? (Also: what good is a Nature MI editor, if they approve plots with "0M" everywhere on the x-axis???)
2. For vision models they tune lr's, good. But not wd or other optim hparams, meh/sus.
3. For LLM, they select hparams on test. At least epochs, but given this and that they seem to use "validation" and "testing" as synonyms in the paper, probably everything.
4. I am not sure a Medium blogpost tutorial with an arbitrary hparam selection is a good starting point for the baseline of a Nature MI paper??

Maybe this new optimizer is as amazing as promised, but I'll need to see less suspicious evidence. I wish the reviewers had asked for that. Maybe someone put it to test on the nanogpt speedrun? At least that has heavily-tuned baselines, including optimizers.
Lucas Beyer (bl16) tweet media
Ji-Ha@Ji_Ha_Kim

Woah, how did I never hear of this? An optimizer paper that got published in Nature, looks quite substantial

Mher Safaryan@mher_safaryan·
Life update! I’m excited to share that I’ve started a new role as an Assistant Professor in the School of Mathematical Sciences at @LancasterUni. Our section MARS (Mathematics for AI in Real-world Systems) is currently recruiting PhD students and Senior Research Associates. 👇
Mher Safaryan tweet media
Soufiane Hayou@hayou_soufiane·
@FrancoisChauba1 @agupta It would make more sense to look at GPUs/(students who use GPUs), not all students (e.g. most social sciences students don't use GPUs). At Hopkins, that ratio is much higher and compute is very available currently.
Francois Chaubard@FrancoisChauba1·
Last night, @agupta and I hosted a great dinner with 14 professors at #NeurIPS2025 from leading academic labs across the US, and many described compute in academia as "abhorrent". Out of curiosity I just pulled these stats. This is insane. To do meaningful AI research today you need at least 1 GPU/student. Likely 8+ to be honest. The best university (Princeton) is at 0.8 GPUs/student. Stanford is at 0.14 GPUs/student. Marlowe (Stanford's "super cluster") has only 248 H100s for the whole CS Dept to use. Every frontier lab has >100k. This needs to be fixed.
Francois Chaubard tweet media
Robert M. Gower retweeted
Diana Cai@dianarycai·
Check out my poster today (Thurs) at the 11am--2pm session, Exhibit Hall C,D,E, poster location #602. "Fisher meets Feynman: score-based variational inference with a product of experts" (NeurIPS spotlight) with @gowerrobert David Blei and Lawrence Saul @FlatironInst #NeurIPS2025
Diana Cai tweet media
Robert M. Gower retweeted
Jiequn Han@JiequnH·
We’re recruiting for both postdoc and open-rank positions. Learn more about ML@CCM 👉 users.flatironinstitute.org/~lsaul/ml_ccm.… I’ll also be in San Diego for NeurIPS — feel free to DM if you’re interested in #AIforScience or #GenerativeAI
Alberto Bietti@albertobietti

Want to do fundamental ML research in NYC? 🧠 The Center for Computational Mathematics @FlatironInst @SimonsFdn is hiring! – Flatiron Research Fellow (postdoc, by Dec 1): apply.interfolio.com/173401 – Open Rank (by Jan 15): apply.interfolio.com/173640
