Robert M. Gower @ Neurips 2025

@gowerrobert

Often found scribbling down math with intermittent bursts of bashing out code.

New York City, USA · Joined June 2011
346 Following · 1.6K Followers
Pinned Tweet
Robert M. Gower @gowerrobert
Do you want to do a postdoc developing new methods/theory in optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions on blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics simonsfoundation.org/flatiron/caree… 1/3
Robert M. Gower @gowerrobert
@bozavlado @giffmana Yeah, this is mad, and it's the same issue as always with ADMM methods applied in this way. Unless these copies of the model are distributed across different machines, it makes no sense!
Vlado Boza @bozavlado
Nobody talks about the fact that these optimizers actually need M times more memory (where M is the number of data chunks). Also:
- There is a huge advantage in wall-clock time on the XL model (comes from parallelization).
- Small advantage on loss vs number of tokens.
- No advantage on Nano.
Maybe baselines for XL were taken from Nano.
Lucas Beyer (bl16) @giffmana
New optimizer with earth-shattering plots making the rounds, and published in Nature too (Machine Intelligence, but let's just drop that part). So of course I had to take a quick look. A few things I noticed that make me a bit sus, though I'm not saying to outright discard it. Each point is the caption of the corresponding screenshot below:
1. What on earth are these SGDM vs AdamW gaps? They are not normal -> untuned baselines? (Also: what good is a Nature MI editor if they approve plots with "0M" everywhere on the x-axis???)
2. For vision models they tune lrs, good. But not wd or other optim hparams, meh/sus.
3. For LLMs, they select hparams on test. At least epochs, but given this and that they seem to use "validation" and "testing" as synonyms in the paper, probably everything.
4. I am not sure a Medium blogpost tutorial with an arbitrary hparam selection is a good starting point for the baseline of a Nature MI paper??
Maybe this new optimizer is as amazing as promised, but I'll need to see less suspicious evidence. I wish the reviewers had asked for that. Maybe someone could put it to the test on the nanogpt speedrun? At least that has heavily-tuned baselines, including optimizers.
Ji-Ha @Ji_Ha_Kim

Woah, how did I never hear of this? An optimizer paper that got published in Nature, looks quite substantial

Mher Safaryan @mher_safaryan
Life update! I’m excited to share that I’ve started a new role as an Assistant Professor in the School of Mathematical Sciences at @LancasterUni. Our section MARS (Mathematics for AI in Real-world Systems) is currently recruiting PhD students and Senior Research Associates. 👇
Soufiane Hayou @hayou_soufiane
@FrancoisChauba1 @agupta It would make more sense to look at GPUs/(students who use GPUs), not all students (e.g. most social sciences students don't use GPUs). At Hopkins, that ratio is much higher and compute is very available currently.
Francois Chaubard @FrancoisChauba1
Last night, @agupta and I hosted a great dinner with 14 professors at #NeurIPS2025 from leading academic labs across the US, and many cited compute in academia as "abhorrent". Out of curiosity I just pulled these stats. This is insane. To do meaningful AI research today you need at least 1 GPU/student. Likely 8+ to be honest. The best university (Princeton) is at 0.8 GPUs/student. Stanford is at 0.14 GPUs/student. Marlowe (Stanford's "super cluster") has only 248 H100s for the whole CS Dept to use. Every frontier lab has >100k. This needs to be fixed.
Robert M. Gower retweeted
Diana Cai @dianarycai
Check out my poster today (Thurs) at 11am--2pm session. Exhibit Hall C,D,E Poster Location: #602 "Fisher meets Feynman: score-based variational inference with a product of experts" (NeurIPS spotlight) with @gowerrobert David Blei and Lawrence Saul @FlatironInst #NeurIPS2025
Robert M. Gower retweeted
Jiequn Han @JiequnH
We’re recruiting for both postdoc and open-rank positions. Learn more about ML@CCM 👉 users.flatironinstitute.org/~lsaul/ml_ccm.… I’ll also be in San Diego for NeurIPS — feel free to DM if you’re interested in #AIforScience or #GenerativeAI
Alberto Bietti @albertobietti

Want to do fundamental ML research in NYC? 🧠 The Center for Computational Mathematics @FlatironInst @SimonsFdn is hiring! – Flatiron Research Fellow (postdoc, by Dec 1): apply.interfolio.com/173401 – Open Rank (by Jan 15): apply.interfolio.com/173640

Tony S.F. @tonysilveti
@gowerrobert 0, which is not in the set of directions with norm = 1. If you take the inequality, you can define the LMO at 0 to just be 0.
Robert M. Gower @gowerrobert
We've just finished some work on reducing Muon's sensitivity to the learning rate, and exploring a lot of design choices. If you want to see how we did this, follow me ....1/x (Work led by the amazing @CrichaelMawshaw)
Robert M. Gower @gowerrobert
@tonysilveti I don’t understand. If the input (gradient) is zero, what would be a good direction? Also, the inequality and equality constraint give the same solution when maximizing a linear function. Both solutions must be on the boundary (satisfy the equality constraint)
Tony S.F. @tonysilveti
@gowerrobert The way you define the LMO (following "old optimizer, new norm") is not correct for the algorithm you want. If you take the minimization over the equality constraint, then you get a bad direction when the input is zero. The correct way is the Frank-Wolfe (Scion) way with the inequality.
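The distinction in this thread can be sketched in a few lines. This is my own illustration (not code from either party): an LMO over the spectral-norm *ball* (inequality constraint), where the zero matrix is feasible, so a zero gradient can simply map to a zero step, whereas an equality constraint ||D|| = 1 would force an arbitrary direction.

```python
import numpy as np

def lmo_spectral(g, radius=1.0):
    """Linear minimization oracle over the spectral-norm ball {D : ||D||_2 <= radius}.

    Returns argmin over the ball of <g, D>, which is -radius * U V^T from the
    SVD of g. Because the constraint is an *inequality*, g = 0 admits D = 0 as
    a valid minimizer; an equality constraint ||D||_2 = 1 would not.
    """
    if not np.any(g):  # zero gradient: 0 is inside the feasible set
        return np.zeros_like(g)
    U, _, Vt = np.linalg.svd(g, full_matrices=False)
    return -radius * (U @ Vt)  # orthogonalized (Muon-style) direction
```

For any nonzero g the minimizer lies on the boundary, which is why the two formulations agree everywhere except at g = 0, matching both sides of the exchange above.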
Francesco Orabona @bremen79
@roydanroy @2prime_PKU @YouJiacheng Sure, but I guess here it was more about "this idea cannot be that new" and in fact, from a certain point of view, it is not. A different approximation of proximal updates were present even in Vowpal Wabbit (MSR software), sooner or later people will rediscover that too
Yiping Lu @2prime_PKU
@YouJiacheng This is Polyak's stepsize, proposed in 1963. The truncation view is already in textbooks, e.g. Nesterov (2004), Ch. 2; Bubeck (2015), Sec. 3.3.
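For readers who haven't seen it, the Polyak stepsize mentioned above fits in a few lines. A minimal sketch (my own, assuming a known lower bound f* on the objective):

```python
import numpy as np

def polyak_step(x, grad, f_x, f_star=0.0):
    """One gradient step with the Polyak (1963) stepsize.

    The stepsize (f(x) - f*) / ||grad||^2 is what you get by minimizing the
    linearization of f truncated at the known lower bound f* -- the
    "truncation view" referenced in the tweet.
    """
    g2 = np.dot(grad, grad)
    if g2 == 0.0:          # stationary point: nothing to do
        return x
    step = (f_x - f_star) / g2
    return x - step * grad
```

For example, on f(x) = x^2/2 with f* = 0, starting from x = 2 the step is (2 - 0)/4 = 0.5, so one update halves the iterate.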
Finn Busch @fnnBsch
@gowerrobert @CrichaelMawshaw I think so, see p.4 in the report ("Muon can directly reuse the learning rate and weight decay tuned for AdamW") and Appendix A. It seems that the RMS relationship is quite straightforward. Chances are the LR setup you found to work best is in the same ballpark?
Robert M. Gower @gowerrobert
@fnnBsch @CrichaelMawshaw We tuned the two lrs of the Adam and Muon layers separately for all methods. Does the Moonshot scaling really allow one shared lr for Adam and Muon? I didn't know that, and it would be great if it's true. Tuning both lrs made a big difference for us.
Robert M. Gower @gowerrobert
Our paper covers a lot of ground, including exploring different product norms, formalizing MuonAdam as steepest descent, introducing the combination of truncation + Muon, and a lot of experiments! Here are the details -> arxiv.org/pdf/2510.09827 and ...
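To make "truncation + Muon" concrete, here is a purely hypothetical sketch of what combining a Polyak-style truncated stepsize with a Muon-style orthogonalized direction could look like. The function name, the clipping rule, and the combination are my guesses for illustration only, not the algorithm from the paper linked above:

```python
import numpy as np

def truncated_muon_step(W, G, f_W, f_star, max_lr=0.1):
    """Hypothetical sketch: orthogonalized direction + truncated Polyak stepsize.

    The direction D = U V^T comes from the SVD of the gradient G (the
    Muon-style step); the stepsize is the Polyak ratio (f(W) - f*) / <G, D>,
    clipped at max_lr. Here <G, D> equals the nuclear norm of G.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    D = U @ Vt
    denom = s.sum()                    # <G, D> = sum of singular values of G
    if denom == 0.0:                   # zero gradient: no step
        return W
    lr = min(max_lr, (f_W - f_star) / denom)
    return W - lr * D
```

Treat this only as a reading aid for the thread; the paper itself is the authority on how the truncation and the Muon update are actually combined.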