mikail
@Gradientdinner

Research Scientist @nvidia 🌁 | PhD @MIT

San Francisco, CA · Joined January 2019
2.1K Following · 2.6K Followers · 2.1K posts
stochasm @stochasticchasm
have there always been this many optimizer releases in such a short period of time
Ethan @torchcompiled
I think this has been done a few times in the past, but either a pair of scalars or a pair of dim-vectors to scale both the residual-branch magnitude and the residual-stream magnitude feels maybe helpful? Right now the residual stream has to continuously grow in norm: the output of a residual block, if it wants a proper place in the residual stream, needs to make itself large enough to stay relevant. Even though it keeps growing, this isn’t a terrible thing given the final layer norm. I’d have to think there’s a way of regularizing the learned scales such that:
1. The residual stream doesn’t evaporate older information; even 0.95^n_layers at 36 layers reduces the magnitude of the first layer’s contribution down to 0.15, for better or worse.
2. You’d likely want to set it up to be variance preserving, something like clip(alpha,0,1)^0.5 and (1-clip(alpha,0,1))^0.5, probably with something of a straight-through estimator, none of that sigmoid junk (sketch below).
3. Possibly some additional regularization losses.
Which overall quickly feels like overhead that’s possibly hard to justify.
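A minimal PyTorch sketch of what point 2 could look like, assuming a per-channel gate; `GatedResidual`, the 0.9 init, and the clamp bounds are my own illustrative choices, not anything from the thread:

```python
import torch
import torch.nn as nn

class GatedResidual(nn.Module):
    """Hypothetical variance-preserving residual gate:
    out = sqrt(a) * stream + sqrt(1 - a) * branch, with a = clip(alpha, 0, 1).
    If stream and branch are uncorrelated with unit variance, the output
    keeps unit variance, so the stream's norm no longer has to grow."""

    def __init__(self, dim: int):
        super().__init__()
        # per-channel gate (the "dim-vector" variant); init near 1 so the
        # stream passes through almost unchanged at the start of training
        self.alpha = nn.Parameter(torch.full((dim,), 0.9))

    def forward(self, stream: torch.Tensor, branch: torch.Tensor) -> torch.Tensor:
        # straight-through clip: forward sees the hard clamp, backward
        # passes gradients to alpha unchanged (no sigmoid squashing)
        a = self.alpha + (self.alpha.clamp(1e-4, 1 - 1e-4) - self.alpha).detach()
        return a.sqrt() * stream + (1 - a).sqrt() * branch

# usage inside a block: x = gate(x, mlp(norm(x)))
```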
Core Automation @CoreAutoAI

Are residual connections a hack, or a provably optimal way to shape your loss landscape?

mikail retweeted
Thinking Machines @thinkymachines
With the model's simultaneous speech capability, Horace has gotten a lot easier to work with recently.
mikail @Gradientdinner
@_arohan_ @torchcompiled How many such architectural choices have been made because everyone AdamW’ed everything by default
rohan anil @_arohan_
Cool work! Muon (Shampoo with b2=0.0) had the following pathology, and thus likely toasted some big-model performance: the polar factor can lead to updates of the form U = [[1, 0], [0, eps]], which can cause a death spiral for the second row’s updates, and neurons selectively die. The following update would be a bit better: [[0.5, 0], [0, 0.5]]. OG Shampoo has less of this problem but doesn’t fix it. This is why E-shampoo is superior in this respect, but you can do better. All roads lead to Adam/AdaGrad. Or do they …
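To make the death spiral concrete, here is a small numpy sketch (my own illustration, not from the thread) of the quintic Newton-Schulz iteration that public Muon implementations use; the coefficients are taken on faith from that code:

```python
import numpy as np

def newton_schulz_polar(G, steps=5):
    # quintic Newton-Schulz iteration that Muon-style optimizers use to
    # approximate the polar factor of the gradient; coefficients from
    # the public Muon implementation
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

# a gradient whose second row (second neuron) is nearly zero
G = np.array([[1.0, 0.0],
              [0.0, 1e-6]])
print(np.round(newton_schulz_polar(G), 4))
# the bottom-right entry stays ~0: a tiny singular value grows by only
# ~3.44x per step, so the starved neuron keeps receiving ~no update and
# never recovers, which is the death spiral described above
```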
Tilde @tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.

Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.

What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

mikail @Gradientdinner
@_arohan_ Randomly re-initialize row/column of weight corresponding to dead neuron 🧠
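A hedged sketch of that idea for a Linear layer’s output neurons; the threshold and the function itself are hypothetical, not a reference implementation:

```python
import torch

@torch.no_grad()
def revive_dead_rows(linear: torch.nn.Linear, threshold: float = 1e-6) -> int:
    # output neurons whose incoming weight rows have collapsed to ~zero
    # norm get a fresh randomly initialized row (PyTorch's default
    # Linear init); returns how many rows were revived
    dead = linear.weight.norm(dim=1) < threshold
    if dead.any():
        fresh = torch.empty_like(linear.weight)
        torch.nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)
        linear.weight[dead] = fresh[dead]
        if linear.bias is not None:
            linear.bias[dead] = 0.0
    return int(dead.sum())
```

In practice you’d also want to reset the optimizer state (e.g. Adam moments) for the revived rows, or stale statistics will immediately distort the fresh weights.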
mikail retweeted
Yuchen Jin @Yuchenj_UW
No Neocloud ever imagined they’d be renting out H100s today at higher prices than 3 years ago. Even if you have money, frontier labs and Neolabs have already locked up most of the 2026 GPU supply. There is basically infinite demand for artificial intelligence.
mikail retweeted
Pierfrancesco Beneventano @PierBeneventano
Our new paper was accepted at ICML!
1) Momentum isn’t just “SGD but faster”. It affects sharpness (by orders of magnitude!)
2) The usual story says momentum lets you train in sharper regions. That’s true for large batches only! The opposite is true for minibatches!
mikail retweeted
Weiyang Liu @Besteuler
Orthogonal Finetuning (oft.wyliu.com; boft.wyliu.com) has a unique advantage in preventing catastrophic forgetting. Inspired by this property, we find that merging models within the orthogonal group can effectively reduce model conflicts and preserve both pretraining and downstream knowledge. This is our OrthoMerge framework.

The idea behind OrthoMerge is extremely simple. For OFT-tuned models, we first map the orthogonal adapters to the Lie algebra with the inverse Cayley transform and then perform merging there. This guarantees the merged model differs from the pretrained model only up to an orthogonal transformation.

Even better, OrthoMerge can also be applied to non-OFT-tuned models. By solving the orthogonal Procrustes problem, we obtain the projection of the adapter onto the orthogonal group. OrthoMerge is then applied there, and the residual component can be merged using conventional merging methods. This means OrthoMerge can be used together with existing model merging methods!

This is a great example of a simple yet effective idea. Great efforts by my PhD students Sihan Yang and Kexuan Shi. The project is already open-sourced, so feel free to give it a try!

Project: spherelab.ai/OrthoMerge/ Paper: arxiv.org/pdf/2602.05943 Code: github.com/Sphere-AI-Lab/…
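A rough numpy reconstruction of the two ingredients as described in the tweet (my own sketch; the real code is behind the github link above), assuming no adapter has an eigenvalue at -1, where the Cayley map is undefined:

```python
import numpy as np

def inv_cayley(Q):
    # inverse Cayley transform: orthogonal Q -> skew-symmetric A
    I = np.eye(Q.shape[0])
    return (I - Q) @ np.linalg.inv(I + Q)

def cayley(A):
    # Cayley transform: skew-symmetric A -> orthogonal Q
    I = np.eye(A.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

def ortho_merge(Qs, weights=None):
    # average the adapters in the Lie algebra; a weighted sum of
    # skew-symmetric matrices is skew-symmetric, so mapping back
    # through cayley() is guaranteed to land on the orthogonal group
    if weights is None:
        weights = [1.0 / len(Qs)] * len(Qs)
    A = sum(w * inv_cayley(Q) for w, Q in zip(weights, Qs))
    return cayley(A)

def nearest_orthogonal(M):
    # orthogonal Procrustes projection for non-OFT adapters: the
    # closest orthogonal matrix to M in Frobenius norm is U @ Vt
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt
```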
mikail retweeted
Jonas Hübotter @jonashubotter
Today and tomorrow we’ll be presenting self-distillation with orals at ICLR in Rio 🇧🇷
1. “Self-Distillation enables Continual Learning” at the lifelong agents workshop (Sun 11:30am)
2. “Reinforcement Learning via Self-Distillation” at the scaling post-training workshop (Mon 2:40pm)
3. “Test-Time Self-Distillation” at the test-time updates workshop (Mon 4:15pm)
mikail retweeted
George Grigorev @iamgrigorev
We just released our first models at @poolsideai – including Laguna XS.2 (open weights), which competes with Qwen3.6-35B. I worked across pretraining — happy to answer questions! Through principled ablations we now have a solid understanding of every component that went into training, and we’re now confident to scale. This year will be 🔥 for us! More coming very soon
dheevatsa @dheevatsa
Muon is having its moment — Kimi K2, GLM 5, and now DeepSeek V4! More broadly, it feels like the time for advanced optimizers is finally here — reiterating that they are an important component for efficient training systems at scale! Our recent work: performant Muon/SOAP-class optimizers in NVIDIA NeMo/Megatron-Core — layer-wise distributed optimizer, TP-aware Newton-Schulz, SYRK kernels. Muon ≥ AdamW on GB300-NVL72. developer.nvidia.com/blog/advancing…
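On the SYRK point: the Gram matrix X @ X.T inside Newton-Schulz is symmetric, so a symmetric rank-k update kernel only fills one triangle, roughly halving the FLOPs of a general matmul. A scipy sketch of the idea (my illustration, not NVIDIA’s kernel):

```python
import numpy as np
from scipy.linalg.blas import dsyrk

X = np.asfortranarray(np.random.randn(256, 1024))

A_gemm = X @ X.T          # general matmul: computes all n*n entries

A_upper = dsyrk(1.0, X)   # SYRK: computes only the upper triangle
A_syrk = np.triu(A_upper) + np.triu(A_upper, 1).T  # mirror to full matrix

assert np.allclose(A_gemm, A_syrk)
```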
rohan anil @_arohan_
@Gradientdinner @dheevatsa That’s basically because your preconditioner changes more often, as you’re making a lot more progress on the loss per step.