Robert M. Gower

568 posts


@gowerrobert

Often found scribbling down math with intermittent bursts of bashing out code.

New York City, USA · Joined June 2011
347 Following · 1.7K Followers
Pinned Tweet
Robert M. Gower@gowerrobert·
Do you want to do a Postdoc developing new methods/theory in Optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions on blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics simonsfoundation.org/flatiron/caree… 1/3
Robert M. Gower tweet media
Robert M. Gower@gowerrobert·
@ruuustem_10 Yes good point! It still irks me that we don't fully understand non-Euclidean methods on quadratics. This is a must if we are to rely on smoothness assumptions to understand Muon
Rustem@ruuustem_10·
So to me, the question of which set of assumptions gives the best predictive power is still unclear. In fact, convergence of non-Euclidean descent methods is not studied well enough even on quadratics (see our work arxiv.org/pdf/2603.05002).
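For context, "non-Euclidean descent" here refers (in its standard textbook form; the linked paper may study a variant) to steepest descent with respect to a general norm:

\[
x_{k+1} \;=\; x_k \;-\; t_k\, d_k,
\qquad
d_k \in \arg\max_{\|d\|\le 1} \langle \nabla f(x_k),\, d\rangle .
\]

Taking the Euclidean norm recovers normalized gradient descent, while the spectral norm on matrix-shaped parameters gives Muon-style updates; even on a quadratic $f(x)=\tfrac12 x^\top A x - b^\top x$, the behaviour depends on how the chosen norm interacts with $A$.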
Rustem@ruuustem_10·
This paper arxiv.org/pdf/2605.08980 studies spectral methods for nonsmooth functions, while most prior work requires smoothness in the analysis. @gowerrobert and the team provide counterexamples showing that spectral methods might not converge in the nonsmooth setting.
Robert M. Gower@gowerrobert·
@Ji_Ha_Kim @YouJiacheng @noahamsel @ejarlebring What problem are you referring to? This example just shows that the optimal polynomial approx to sign under the L2 norm does not satisfy the equioscillation theorem. The equioscillation theorem is about the L infinity norm.
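For background, the two characterizations being contrasted are standard approximation-theory facts (stated here as a reminder, in their usual form for a continuous target on a closed interval): the equioscillation theorem characterizes the best uniform (L-infinity) approximation, whereas the best L2 approximation is characterized by orthogonality of the error:

\[
p^\* = \arg\min_{p\in\mathcal{P}_n}\|f-p\|_\infty
\;\iff\;
f-p^\* \text{ attains } \pm\|f-p^\*\|_\infty \text{ with alternating signs at} \ge n+2 \text{ points},
\]
\[
p^\* = \arg\min_{p\in\mathcal{P}_n}\|f-p\|_2
\;\iff\;
\langle f-p^\*,\, q\rangle = 0 \ \text{ for all } q\in\mathcal{P}_n,
\]

so there is no reason for the L2-optimal polynomial approximation of sign to equioscillate.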
Robert M. Gower@gowerrobert·
@elon_lit Nice work, this looks very interesting. Curiously, we showed that Adam explicitly tracks this same centered gradient variance, and this SNR threshold looks very similar to the square of the Adam update, see here x.com/gowerrobert/st… Does this mean Adam is tracking this noise?
Robert M. Gower@gowerrobert

When β_1=β_2, we can first re-write Adam as below, where instead of the standard uncentered second moment, we have something that looks like a weird variance estimator. Fun fact, it is an online estimate of the variance! Let me explain ...
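For reference, one standard way to maintain an exponentially weighted mean and a *centered* second moment online (the generic recursion, not necessarily the exact rewrite in the rest of the quoted thread) is, with smoothing factor $\alpha = 1-\beta$:

\[
\delta_t = g_t - m_{t-1}, \qquad
m_t = m_{t-1} + \alpha\,\delta_t, \qquad
v_t = (1-\alpha)\bigl(v_{t-1} + \alpha\,\delta_t^{\,2}\bigr),
\]

so $m_t$ is the usual exponential moving average of the gradients and $v_t$ tracks their variance around that moving mean, in contrast to Adam's standard uncentered $v_t = \beta v_{t-1} + (1-\beta)\, g_t^{\,2}$.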

Elon Litman@elon_lit·
This is all good theory, but we wanted a usable tool for training. We derived an exact measure of how much noise is leaking into the signal channel, showing it is the only source of overfitting. Even better, we can compute this directly on the optimizer's current batch via a specific Wiener filter / SNR threshold, letting your neural network do population risk minimization! 🔥 🔥
Elon Litman tweet media
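As background on the terminology (the paper's exact construction may differ): for a scalar observation $g = s + n$ with independent zero-mean signal and noise, the Wiener (linear MMSE) estimate of the signal shrinks the observation by a gain determined by the signal-to-noise ratio,

\[
\hat{s} \;=\; \frac{\sigma_s^2}{\sigma_s^2 + \sigma_n^2}\, g \;=\; \frac{\mathrm{SNR}}{1+\mathrm{SNR}}\, g,
\qquad \mathrm{SNR} = \frac{\sigma_s^2}{\sigma_n^2},
\]

so the update is suppressed precisely where the estimated noise dominates the signal.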
Elon Litman@elon_lit·
We developed a unified theory of generalization in deep learning. It explains grokking, double descent, benign overfitting, and implicit bias. But theory is only half the story. It turns out that optimizing the population risk of any neural network amounts to a small change to your optimizer. 🧵
Elon Litman tweet media
Robert M. Gower@gowerrobert·
And now we are very proud and humbled to have received the ICLR 2026 Honorable Mention award for this work blog.iclr.cc/2026/04/23/ann… Very fun to have found this useful math nugget that can actually speed up LLM training.
Robert M. Gower@gowerrobert

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x
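For readers new to this, a minimal sketch of the kind of iteration being referred to: the classical cubic Newton-Schulz iteration for the polar factor (the matrix sign applied to the singular values) of a momentum matrix. The fixed coefficients below are only the textbook baseline; PolarExpress is about choosing an optimal sequence of such polynomials, and its coefficients are not reproduced here.

```python
import torch

def newton_schulz_polar(M, steps=10, eps=1e-7):
    """Approximate the polar factor U V^T of M = U S V^T using the classical
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.

    Illustrative baseline only: after Frobenius normalization the singular
    values lie in (0, 1], where this scalar map converges (slowly) to 1;
    optimized polynomial sequences converge much faster.
    """
    X = M / (M.norm() + eps)            # Frobenius normalization: singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the short-and-wide orientation
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

# Usage sketch: orthogonalize the momentum matrix before a Muon-style update.
# O = newton_schulz_polar(momentum_buffer)
```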

Robert M. Gower@gowerrobert·
@_arohan_ @tonysilveti Let me save you some time. If you keep following this logic of a closed-form prox and regularized secant equation, you get a new quasi-Newton method that works for non-convex problems. But it turns out this was already done here: arxiv.org/pdf/2403.02448
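For context, the secant equation mentioned here is the standard quasi-Newton condition (stated generically; a "regularized" secant equation typically penalizes violations of this condition rather than enforcing it exactly):

\[
B_{k+1}\, s_k = y_k, \qquad s_k = x_{k+1}-x_k, \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),
\]

or equivalently $H_{k+1} y_k = s_k$ for the inverse-Hessian approximation $H = B^{-1}$; classical quasi-Newton updates such as BFGS and DFP pick, among all symmetric matrices satisfying it, the one closest to the previous approximation in a weighted norm.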
rohan anil@_arohan_·
@gowerrobert @tonysilveti +1, this was also my main observation! Let me see if I can share it sooner as well. I think this is the single point of improvement that can make it mainstream.
Robert M. Gower@gowerrobert·
@tonysilveti I thought only criteria 1 and 2 were directly motivated through the secant equation; I don't see any such direct link from criterion 3 to the secant equation. In any case, it's simply $\mathbb{E}\,\|P\,dg - d\theta\|_{P^{-1}}^2$.
Tony S.F.@tonysilveti·
@gowerrobert The secant equation connection is there in the original paper when discussing the 3 criteria (unless I'm missing what you actually mean?)
Robert M. Gower@gowerrobert·
@jeffreycider @CV_novel_plume Using the optimal polynomials instead would improve exactly the iteration complexity, that is, it would require slightly fewer iterations to reach a desired loss.
Yuxin Fang@CV_novel_plume·
I’ve run a lot of experiments on Muon and its variants, and I’d bet that in this setting, the Muon baseline will be very hard to beat.
Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark

Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.

To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.

The rules are simple: the optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (this is the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers which would need kernel work or larger scale to become wallclock-efficient. Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast.

Beyond just improving the step-count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.

This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.

To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.
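To make the rule concrete, here is a hypothetical harness illustrating what "only the optimizer can change" amounts to; `build_model`, `train_loader`, `compute_loss`, and `val_loss` are placeholder names, not the benchmark's actual API.

```python
import torch

TARGET_VAL_LOSS = 3.28  # same target as the main speedrun

def run_benchmark(make_optimizer, build_model, train_loader, compute_loss, val_loss):
    """Count training steps until the validation loss reaches the target.

    Hypothetical harness: the model, data pipeline, and loss are fixed by the
    benchmark; a submission only supplies make_optimizer (plus its
    hyperparameters and schedules).
    """
    model = build_model()                             # architecture is fixed
    optimizer = make_optimizer(model.parameters())    # the only free choice
    for step, batch in enumerate(train_loader, start=1):
        loss = compute_loss(model, batch)             # loss and dataloader are fixed
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        if step % 100 == 0 and val_loss(model) <= TARGET_VAL_LOSS:
            return step                               # fewer steps = better
    return None  # target not reached

# Example submission: an AdamW baseline with some chosen hyperparameters.
# steps = run_benchmark(lambda p: torch.optim.AdamW(p, lr=3e-4, weight_decay=0.1),
#                       build_model, train_loader, compute_loss, val_loss)
```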

Robert M. Gower@gowerrobert·
@CV_novel_plume I agree with this statement. Over-tuning an optimizer to one problem doesn't really teach us anything. This is also why I find the AlgoPerf benchmark interesting for comparing optimizers mlcommons.org/benchmarks/alg… especially the self-tuning track.
Yuxin Fang@CV_novel_plume·
This is a very meaningful benchmark, but there is one caveat worth keeping in mind. In speedrun settings, there is now a clear trend toward using different optimizers and hyperparameters for different modules. I have to admit that this can bring real gains. But when comparing optimizers, we should not give hyperparameters unlimited freedom.

For example, if I first run a strong optimizer, then reverse-engineer an SGD hyperparameter schedule that tunes every neuron at every step to match it, SGD may appear to “simulate” Adam, Muon, or almost any optimizer. But that would not tell us much about SGD. It only means the optimizer has been hidden inside the hyperparameter schedule.

To me, the value of a good optimizer is the opposite: it should adapt internally, require fewer hand-tuned knobs, and transfer robustly across model scales. This kind of invariance across model scales is exactly what makes hyperparameter scaling laws meaningful. If we over-optimize the recipe for one particular scale, we may win that benchmark while losing the cross-scale structure we actually want to understand. @kellerjordan0 @tonysilveti @wen_kaiyue
Yuxin Fang@CV_novel_plume

I’ve run a lot of experiments on Muon and its variants, and I’d bet that in this setting, the Muon baseline will be very hard to beat.

Robert M. Gower@gowerrobert·
@FengzhuoZhang About their Hybrid Newton-Schulz in the v4 report, I understand they change the polynomials after 8 steps to ensure convergence. But it would converge even faster if they just used the *optimal* sequence of 10 polynomials, as we proposed here: x.com/gowerrobert/st…
Robert M. Gower@gowerrobert

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x

Fengzhuo Zhang@FengzhuoZhang·
The Newton–Schulz iteration coefficients optimized by DeepSeek-V4 are surprisingly strong: they effectively normalize all singular values to 1. This matches our previous intuition: a well-balanced spectrum may help strike a better balance across long-tail knowledge. Plot code: github.com/FengzhuoZhang/…
Fengzhuo Zhang tweet media
Fengzhuo Zhang@FengzhuoZhang

Why does Muon outperform Adam—and how? 🚀 Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning.

Three key findings:
> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.
> Muon yields more isotropic weights than Adam.
> In heavy-tailed tasks, Muon significantly improves tail-class learning compared to Adam.

Paper link: arxiv.org/pdf/2509.26030 A thread 🧵
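Background on why such coefficients can be read off from a scalar plot of the spectrum (a standard fact about these iterations, independent of the linked code): each step $X \mapsto aX + b\,(XX^\top)X + c\,(XX^\top)^2X$ acts on the singular values of $X$ one at a time. If $X = U\,\mathrm{diag}(\sigma_i)\,V^\top$, then

\[
aX + b\,(XX^\top)X + c\,(XX^\top)^2X \;=\; U\,\mathrm{diag}\bigl(p(\sigma_i)\bigr)\,V^\top,
\qquad p(\sigma) = a\sigma + b\sigma^3 + c\sigma^5,
\]

so composing the steps applies the composed scalar polynomial to every singular value, and "normalizing the spectrum" means choosing coefficients whose composition maps the whole range of $\sigma$ close to 1.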

Robert M. Gower@gowerrobert·
@torchcompiled Nice idea! For the retraction map, you may want to try the optimal polynomials instead. For instance, you could just apply the first two optimal polynomials to correct the approximation. We had an ICLR paper on this: iclr.cc/virtual/2026/o…
Robert M. Gower@gowerrobert

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x

Ethan@torchcompiled·
What if we could do Muon with just one Newton step while also achieving better loss?
Ethan tweet media
Robert M. Gower@gowerrobert·
@bozavlado @giffmana Yeah, this is mad, and it's the same issue as always with ADMM methods applied in this way. Unless these copies of the model are distributed across different machines, it makes no sense!
Vlado Boza@bozavlado·
Nobody talks about the fact that these optimizers actually need M times more memory (where M is the number of data chunks). Also: There is a huge advantage in wall-clock time on the XL model (comes from parallelization). Small advantage on loss vs number of tokens. No advantage on Nano. Maybe baselines for XL were taken from Nano.
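For context, the factor-of-$M$ memory cost is built into the consensus splitting that ADMM-style methods use (generic formulation; whether the optimizer under discussion matches it exactly is a separate question): with the data split into $M$ chunks with losses $f_1,\dots,f_M$, one solves

\[
\min_{x_1,\dots,x_M,\; z}\;\; \sum_{i=1}^{M} f_i(x_i)
\quad\text{subject to}\quad x_i = z,\;\; i=1,\dots,M,
\]

so every chunk keeps its own copy $x_i$ of the parameters (plus dual variables), roughly $M$ times the parameter memory, unless those copies live on different machines, which is exactly the distributed setting where the splitting pays off.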
Lucas Beyer (bl16)@giffmana·
New optimizer with earth-shattering plots making the rounds, and published in Nature too (Machine Intelligence, but let's just drop that part.) So of course I had to take a quick look. A few things I noticed that make me a bit sus, though I'm not saying to outright discard it. Each point is the caption of the corresponding screenshot below:

1. What on earth are these SGDM vs AdamW gaps? They are not normal -> untuned baselines? (Also: what good is a Nature MI editor, if they approve plots with "0M" everywhere on the x-axis???)
2. For vision models they tune lr's, good. But not wd or other optim hparams, meh/sus.
3. For LLM, they select hparams on test. At least epochs, but given this and that they seem to use "validation" and "testing" as synonyms in the paper, probably everything.
4. I am not sure a Medium blogpost tutorial with an arbitrary hparam selection is a good starting point for the baseline of a Nature MI paper??

Maybe this new optimizer is as amazing as promised, but I'll need to see less suspicious evidence. I wish the reviewers had asked for that. Maybe someone put it to test on the nanogpt speedrun? At least that has heavily-tuned baselines, including optimizers.
Lucas Beyer (bl16) tweet media
Ji-Ha@Ji_Ha_Kim

Woah, how did I never hear of this? An optimizer paper that got published in Nature, looks quite substantial

Mher Safaryan@mher_safaryan·
Life update! I’m excited to share that I’ve started a new role as an Assistant Professor in the School of Mathematical Sciences at @LancasterUni. Our section MARS (Mathematics for AI in Real-world Systems) is currently recruiting PhD students and Senior Research Associates. 👇
Mher Safaryan tweet media
Soufiane Hayou@hayou_soufiane·
@FrancoisChauba1 @agupta It would make more sense to look at GPUs/(students who use GPUs), not all students (e.g. most social sciences students don't use GPUs). At Hopkins, that ratio is much higher and compute is very available currently.
Francois Chaubard@FrancoisChauba1·
Last night, @agupta and I hosted a great dinner with 14 professors at #NeurIPS2025 from leading academic labs across the US, and many described compute in academia as "abhorrent". Out of curiosity I just pulled these stats. This is insane. To do meaningful AI research today you need at least 1 GPU/student. Likely 8+ to be honest. The best university (Princeton) is at 0.8 GPUs/student. Stanford is at 0.14 GPUs/student. Marlowe (Stanford's "super cluster") has only 248 H100s for the whole CS Dept to use. Every frontier lab has >100k. This needs to be fixed.
Francois Chaubard tweet media
Robert M. Gower retweeted
Diana Cai@dianarycai·
Check out my poster today (Thurs) at the 11am--2pm session, Exhibit Hall C,D,E, poster location #602. "Fisher meets Feynman: score-based variational inference with a product of experts" (NeurIPS spotlight) with @gowerrobert David Blei and Lawrence Saul @FlatironInst #NeurIPS2025
Diana Cai tweet media
Robert M. Gower retweeted
Jiequn Han@JiequnH·
We’re recruiting for both postdoc and open-rank positions. Learn more about ML@CCM 👉 users.flatironinstitute.org/~lsaul/ml_ccm.… I’ll also be in San Diego for NeurIPS — feel free to DM if you’re interested in #AIforScience or #GenerativeAI
Alberto Bietti@albertobietti

Want to do fundamental ML research in NYC? 🧠 The Center for Computational Mathematics @FlatironInst @SimonsFdn is hiring! – Flatiron Research Fellow (postdoc, by Dec 1): apply.interfolio.com/173401 – Open Rank (by Jan 15): apply.interfolio.com/173640
