mikail
@Gradientdinner

Research Scientist @nvidia 🌁 | PhD @MIT

San Francisco, CA · Joined January 2019
2.1K Following · 2.6K Followers · 2.1K posts
stochasm @stochasticchasm
have there always been this many optimizer releases in such a short period of time
Ethan @torchcompiled
I think this has been done a few times in the past, but either a pair of scalars or a pair of dim-vectors to scale both the residual-branch magnitude and the residual-stream magnitude feels maybe helpful? Right now the residual stream has to continuously grow in norm: the output of a residual block, if it wants a proper place in the residual stream, needs to make itself large enough to stay relevant. Even though it keeps growing, this isn’t a terrible thing given the final layer norm. I’d have to think there’s a way of regularizing the learned scales such that:
1. The residual stream doesn’t evaporate older information; even 0.95^n_layers at 36 layers reduces the magnitude of the first layer’s contribution down to 0.15, for better or worse.
2. You’d likely want to set it up to be variance preserving, something like clip(alpha,0,1)^0.5 and (1-clip(alpha,0,1))^0.5, probably with something of a straight-through estimator, none of that sigmoid junk (sketch below).
3. Possibly some additional regularization losses.
Which overall quickly feels like overhead that’s possibly hard to justify.
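A minimal PyTorch sketch of what point 2 could look like, assuming a per-channel gate; `GatedResidual`, the 0.9 init, and the clamp bounds are my own illustrative choices, not anything from the thread:

```python
import torch
import torch.nn as nn

class GatedResidual(nn.Module):
    """Hypothetical variance-preserving residual gate:
    out = sqrt(a) * stream + sqrt(1 - a) * branch, with a = clip(alpha, 0, 1).
    If stream and branch are uncorrelated with unit variance, the output
    keeps unit variance, so the stream's norm no longer has to grow."""

    def __init__(self, dim: int):
        super().__init__()
        # per-channel gate (the "dim-vector" variant); init near 1 so the
        # stream passes through almost unchanged at the start of training
        self.alpha = nn.Parameter(torch.full((dim,), 0.9))

    def forward(self, stream: torch.Tensor, branch: torch.Tensor) -> torch.Tensor:
        # straight-through clip: forward sees the hard clamp, backward
        # passes gradients to alpha unchanged (no sigmoid squashing)
        a = self.alpha + (self.alpha.clamp(1e-4, 1 - 1e-4) - self.alpha).detach()
        return a.sqrt() * stream + (1 - a).sqrt() * branch

# usage inside a block: x = gate(x, mlp(norm(x)))
```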
Core Automation @CoreAutoAI

Are residual connections a hack, or a provably optimal way to shape your loss landscape?

mikail retweeted
Thinking Machines @thinkymachines
With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.
mikail @Gradientdinner
@_arohan_ @torchcompiled How many such architectural choices have been made because everyone AdamW’ed everything by default
rohan anil @_arohan_
Cool work! Muon (Shampoo with b2=0.0) had the following pathology, and thus likely toasted some big-model performance: the polar factor can lead to updates of the form U = [[1, 0], [0, eps]], which can cause a death spiral for the second row’s updates, and neurons selectively die. The following update would be a bit better: [[0.5, 0], [0, 0.5]]. OG Shampoo has less of this problem but doesn’t fix it. This is why E-shampoo is superior in this respect, but you can do better. All roads lead to Adam/AdaGrad. Or do they …
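To make the death spiral concrete, here is a small numpy sketch (my own illustration, not from the thread) of the quintic Newton-Schulz iteration that public Muon implementations use; the coefficients are taken on faith from that code:

```python
import numpy as np

def newton_schulz_polar(G, steps=5):
    # quintic Newton-Schulz iteration that Muon-style optimizers use to
    # approximate the polar factor of the gradient; coefficients from
    # the public Muon implementation
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

# a gradient whose second row (second neuron) is nearly zero
G = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
print(np.round(newton_schulz_polar(G), 4))
# the bottom-right entry stays ~0: a tiny singular value grows by only
# ~3.44x per step, so the starved neuron keeps receiving ~no update and
# never recovers, which is the death spiral described above
```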
Tilde @tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.

Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.

What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

mikail @Gradientdinner
@_arohan_ Randomly re-initialize row/column of weight corresponding to dead neuron 🧠
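A hedged sketch of that idea for a Linear layer’s output neurons; the threshold and the function itself are hypothetical, not a reference implementation:

```python
import torch

@torch.no_grad()
def revive_dead_rows(linear: torch.nn.Linear, threshold: float = 1e-6) -> int:
    # output neurons whose incoming weight rows have collapsed to ~zero
    # norm get a fresh randomly initialized row (PyTorch's default
    # Linear init); returns how many rows were revived
    dead = linear.weight.norm(dim=1) < threshold
    if dead.any():
        fresh = torch.empty_like(linear.weight)
        torch.nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)
        linear.weight[dead] = fresh[dead]
        if linear.bias is not None:
            linear.bias[dead] = 0.0
    return int(dead.sum())
```

In practice you’d also want to reset the optimizer state (e.g. Adam moments) for the revived rows, or stale statistics will immediately distort the fresh weights.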
mikail retweeted
Yuchen Jin @Yuchenj_UW
No Neocloud ever imagined they’d be renting out H100s today at higher prices than 3 years ago. Even if you have money, frontier labs and Neolabs have already locked up most of the 2026 GPU supply. There is basically infinite demand for artificial intelligence.
mikail retweeted
Pierfrancesco Beneventano @PierBeneventano
Our new paper was accepted at ICML!
1) Momentum isn’t just “SGD but faster”. It affects sharpness (by orders of magnitude!)
2) The usual story says momentum lets you train in sharper regions. That’s true for large batches only! The opposite is true for minibatches!
mikail retweeted
Weiyang Liu @Besteuler
Orthogonal Finetuning (oft.wyliu.com; boft.wyliu.com) has a unique advantage in preventing catastrophic forgetting. Inspired by this property, we find that merging models within the orthogonal group can effectively reduce model conflicts and preserve both pretraining and downstream knowledge. This is our OrthoMerge framework.

The idea behind OrthoMerge is extremely simple. For OFT-tuned models, we first map the orthogonal adapters to the Lie algebra with the inverse Cayley transform and then perform merging there. This guarantees the merged model differs from the pretrained model only up to an orthogonal transformation.

Even better, OrthoMerge can also be applied to non-OFT-tuned models. By solving the orthogonal Procrustes problem, we obtain the projection of the adapter onto the orthogonal group. OrthoMerge is then applied there, and the residual component can be merged using conventional merging methods. This means OrthoMerge can be used together with existing model merging methods!

This is a great example of a simple yet effective idea. Great efforts by my PhD students Sihan Yang and Kexuan Shi. The project is already open-sourced, so feel free to give it a try!

Project: spherelab.ai/OrthoMerge/ Paper: arxiv.org/pdf/2602.05943 Code: github.com/Sphere-AI-Lab/…
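A rough numpy reconstruction of the two ingredients as described in the tweet (my own sketch; the real code is behind the github link above), assuming no adapter has an eigenvalue at -1, where the Cayley map is undefined:

```python
import numpy as np

def inv_cayley(Q):
    # inverse Cayley transform: orthogonal Q -> skew-symmetric A
    I = np.eye(Q.shape[0])
    return (I - Q) @ np.linalg.inv(I + Q)

def cayley(A):
    # Cayley transform: skew-symmetric A -> orthogonal Q
    I = np.eye(A.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

def ortho_merge(Qs, weights=None):
    # average the adapters in the Lie algebra; a weighted sum of
    # skew-symmetric matrices is skew-symmetric, so mapping back
    # through cayley() is guaranteed to land on the orthogonal group
    if weights is None:
        weights = [1.0 / len(Qs)] * len(Qs)
    A = sum(w * inv_cayley(Q) for w, Q in zip(weights, Qs))
    return cayley(A)

def nearest_orthogonal(M):
    # orthogonal Procrustes projection for non-OFT adapters: the
    # closest orthogonal matrix to M in Frobenius norm is U @ Vt
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt
```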
mikail retweeted
Jonas Hübotter @jonashubotter
Today and tomorrow we’ll be presenting self-distillation with orals at ICLR in Rio 🇧🇷
1. “Self-Distillation enables Continual Learning” at the lifelong agents workshop (Sun 11:30am)
2. “Reinforcement Learning via Self-Distillation” at the scaling post-training workshop (Mon 2:40pm)
3. “Test-Time Self-Distillation” at the test-time updates workshop (Mon 4:15pm)
mikail retweeted
George Grigorev @iamgrigorev
We just released our first models at @poolsideai – including Laguna XS.2 (open weights), which competes with Qwen3.6-35B. I worked across pretraining — happy to answer questions! Through principled ablations we now have a solid understanding of every component that went into training, and we’re now confident to scale. This year will be 🔥 for us! More coming very soon
dheevatsa @dheevatsa
Muon is having its moment — Kimi K2, GLM 5, and now DeepSeek V4! More broadly, it feels like the time for advanced optimizers is finally here — reiterating that they are an important component for efficient training systems at scale! Our recent work: performant Muon/SOAP-class optimizers in NVIDIA NeMo/Megatron-Core — layer-wise distributed optimizer, TP-aware Newton-Schulz, SYRK kernels. Muon ≥ AdamW on GB300-NVL72. developer.nvidia.com/blog/advancing…
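On the SYRK point: the Gram matrix X @ X.T inside Newton-Schulz is symmetric, so a symmetric rank-k update kernel only fills one triangle, roughly halving the FLOPs of a general matmul. A scipy sketch of the idea (my illustration, not NVIDIA’s kernel):

```python
import numpy as np
from scipy.linalg.blas import dsyrk

X = np.asfortranarray(np.random.randn(256, 1024))

A_gemm = X @ X.T          # general matmul: computes all n*n entries

A_upper = dsyrk(1.0, X)   # SYRK: computes only the upper triangle
A_syrk = np.triu(A_upper) + np.triu(A_upper, 1).T  # mirror to full matrix

assert np.allclose(A_gemm, A_syrk)
```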
rohan anil @_arohan_
@Gradientdinner @dheevatsa That’s basically because your preconditioner changes more often, as you’re making a lot more progress on the loss per step.