Tilde

125 posts

@tilderesearch

We build foundational understanding of models to advance the frontier of intelligence.

Joined July 2024
10 Following · 4.6K Followers
Pinned Tweet
Tilde @tilderesearch
Introducing Aurora, a new optimizer for training frontier-scale models.

We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.

Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs.

By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.

What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
Tilde @tilderesearch

x.com/i/article/2052…
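The idea of "update energy per neuron" in the post above can be made concrete with a small diagnostic. The sketch below is not Aurora's algorithm, which this feed does not spell out. It assumes that a neuron corresponds to a row of the weight matrix, uses an exact SVD-based orthogonalization in place of the Newton-Schulz iteration that Muon uses to approximate it, and adds a hypothetical `equalize_rows` step purely to illustrate what a uniform per-neuron update looks like. All function names are made up for this example.

```python
import numpy as np


def orthogonalize(grad):
    """Exact orthogonal polar factor U @ Vt of grad.

    Muon approximates this factor with Newton-Schulz iterations;
    the SVD is used here only to keep the sketch short and exact.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def row_energy(update):
    """Squared L2 norm of each row, read here as update energy per neuron."""
    return np.sum(update ** 2, axis=1)


def equalize_rows(update, eps=1e-12):
    """Hypothetical illustration (not Aurora): give every row the same norm.

    Total energy is preserved, but the result is no longer exactly
    orthogonal, so this is only a picture of the goal, not a method.
    """
    energy = row_energy(update)
    scale = np.sqrt(energy.mean() / (energy + eps))
    return update * scale[:, None]


rng = np.random.default_rng(0)
# Toy tall layer: 64 neurons, 16 inputs. Rows are scaled so that some
# neurons receive much smaller gradients than others (an assumption).
grad = rng.standard_normal((64, 16)) * np.logspace(-2, 0, 64)[:, None]

update = orthogonalize(grad)
energy = row_energy(update)
balanced = equalize_rows(update)

print("max/min row energy after orthogonalization:", energy.max() / energy.min())
print("max/min row energy after equalization:",
      row_energy(balanced).max() / row_energy(balanced).min())
```

For a tall matrix the orthogonalized update has a fixed total energy (equal to the number of columns), so any neuron that gets a large share leaves less for the others. Tracking the spread of `row_energy` over training is one simple way to look for neurons that receive almost no update.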

Tilde retweeted
Alexander Doria @Dorialexander
Seems like I managed to independently confirm Aurora results on SYNTH (600M parameters). Very early run but promising lead and suggests reproducibility in a very different learning environment.
Alexander Doria @Dorialexander

Ok directly relevant for ongoing work (on memorization): avoiding a "huge percentage of neurons to effectively die early in training (…) so that many parameters no longer meaningfully contribute to network outputs". This optimizer is going to see some SYNTH data.

Tilde @tilderesearch
Huge thank you to the folks who made this possible, to name a few: @NVIDIAAI for the open-source dataset we used, @kellerjordan0 for Muon + NanoGPT Speedrun, and @jxbz for Muon theory.
Vlado Boza @bozavlado
@tilderesearch I think you can use ADMM to satisfy both constraints and get faster convergence.
Tilde @tilderesearch
👀 Aurora dropping tomorrow. 3175 steps → beating NanoGPT Track 3 SOTA by 50 steps. And it scales 🚀
Tilde @tilderesearch
@sun_hanchi Yes, it was a cool convergence. See post #7 where we acknowledge DSv4 :)
Tilde @tilderesearch
Distillation (especially on-policy) has become a pivotal component of the post-training stack.

☕ To dramatically accelerate distillation at scale, we open-source Nitrobrew, a communication-efficient, fused strategy for logit distillation.

It’s built for both on- and off-policy distillation with:
100x faster loss computation
50% peak memory savings
3x faster on-policy distillation
and more!

A 🧵 (1/8)
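For readers unfamiliar with logit distillation, the loss that the Nitrobrew post refers to is typically a divergence between the teacher's and the student's next-token distributions. The sketch below is a plain NumPy reference, not Nitrobrew's fused kernel or its API. It assumes a forward KL objective, KL(teacher || student), averaged over tokens, and the names `log_softmax`, `distill_kl`, and `chunk` are hypothetical. Processing tokens in chunks only illustrates why peak memory matters when the vocabulary is large.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def distill_kl(student_logits, teacher_logits, chunk=1024):
    """Mean over tokens of KL(teacher || student), computed chunk by chunk.

    Both inputs have shape (n_tokens, vocab_size). Only one chunk of
    log-probabilities is materialized at a time.
    """
    n_tokens = student_logits.shape[0]
    total = 0.0
    for start in range(0, n_tokens, chunk):
        s = log_softmax(student_logits[start:start + chunk])
        t = log_softmax(teacher_logits[start:start + chunk])
        total += float(np.sum(np.exp(t) * (t - s)))
    return total / n_tokens


rng = np.random.default_rng(0)
teacher = rng.standard_normal((50, 100))  # toy sizes: 50 tokens, vocabulary of 100
student = rng.standard_normal((50, 100))

print("KL(teacher || student):", distill_kl(student, teacher, chunk=16))
print("KL(teacher || teacher):", distill_kl(teacher, teacher, chunk=16))
```

The chunk size changes peak memory but not the result, which makes the unchunked call a convenient correctness check for any faster implementation.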