Tilde (@tilderesearch) - Twitter Profile

Pinned Tweet

Tilde@tilderesearch·8 May

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.
Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs.
By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.
What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

Tilde@tilderesearch

x.com/i/article/2052…
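The "update energy" idea in the post above can be made concrete with a small diagnostic. The sketch below is not Aurora and not Muon's actual implementation: it uses an exact SVD-based orthogonalization in place of Muon's Newton-Schulz iteration, and it assumes that rows of the weight matrix index neurons. The helper names `orthogonalize` and `row_energy` and the toy gradient are hypothetical. It only illustrates that an orthogonalized update fixes the total energy while leaving its distribution across rows free to be very uneven.

```python
import numpy as np

def orthogonalize(grad):
    """Exact polar factor U V^T via SVD.

    Muon approximates this step with Newton-Schulz iterations; the SVD is
    used here only to keep the sketch short and exact.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def row_energy(update):
    """Squared L2 norm of each row (per-neuron energy if rows index neurons)."""
    return (update ** 2).sum(axis=1)

rng = np.random.default_rng(0)
grad = rng.normal(size=(64, 16))   # toy gradient: 64 neurons, 16 inputs (assumed layout)
grad[:8] *= 1e-3                   # hypothetical: 8 neurons receive tiny gradients

update = orthogonalize(grad)
energy = row_energy(update)

# The columns of the update are orthonormal, so the total energy is fixed at 16,
# but nothing forces that energy to be spread evenly over the 64 rows.
print("total energy:", energy.sum())
print("max energy among the 8 weak rows:", energy[:8].max())
print("min energy among the other rows:", energy[8:].min())
```

In this toy setup the rows with tiny gradients also receive tiny updates after orthogonalization, which is one plausible way to picture the uneven per-neuron energy the post describes; how Aurora actually redistributes it is covered in the linked blog post, not here.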

Tilde retweeted

Dhruv π@dhruv31415·4d

New aurora record!

elie@eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more primeintellect.ai/auto-nanogpt

Tilde retweeted

Alexander Doria@Dorialexander·9 May

Seems like I managed to independently confirm Aurora results on SYNTH (600M parameters). Very early run but promising lead and suggests reproducibility in a very different learning environment.

Alexander Doria@Dorialexander

Ok directly relevant for ongoing work (on memorization): avoiding a "huge percentage of neurons to effectively die early in training (…) so that many parameters no longer meaningfully contribute to network outputs". This optimizer is going to see some SYNTH data.

Tilde@tilderesearch·8 May

Huge thank you to the folks who made this possible, to name a few: @NVIDIAAI for the open-source dataset we used, @kellerjordan0 for Muon + NanoGPT Speedrun, and @jxbz for Muon theory.

Tilde@tilderesearch·8 May

Read the entire blog post here: blog.tilderesearch.com/blog/aurora

Tilde@tilderesearch·8 May

@bozavlado 👀

Vlado Boza@bozavlado·8 May

@tilderesearch I think you can use ADMM to satisfy both constraints and get faster convergence.

Tilde@tilderesearch·8 May

x.com/i/article/2052…

Tilde@tilderesearch·7 May

PR: github.com/KellerJordan/m… Stay tuned for technical release tomorrow!

Tilde@tilderesearch·7 May

👀 Aurora dropping tomorrow. 3175 steps → beating NanoGPT Track 3 SOTA by 50 steps. And it scales 🚀

Tilde@tilderesearch·29 Apr

@sun_hanchi Yes, it was a cool convergence. See post #7 where we acknowledge DSv4 :)

Hanchi Sun @MLSys@sun_hanchi·29 Apr

I think DeepSeek V4 did that too

Tilde@tilderesearch

~3/8~ We introduce Nitrobrew to solve these issues. Nitrobrew stems from a very simple observation: the unembedding matrix is low-rank. It consists of two steps:
1. Sending hidden states as a lossless compression of logits
2. A lightweight, chunked online KL divergence implementation
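The two steps above can be sketched in NumPy under some loud assumptions. This is not Nitrobrew's code: the function name `chunked_kl`, the argument names, and the choice to chunk over the vocabulary with a streaming log-sum-exp are all hypothetical, and teacher and student are assumed to share one vocabulary. The point is only that logits equal `hidden @ W.T`, so shipping the much smaller hidden states loses nothing, and that KL(teacher || student) can be accumulated one vocabulary chunk at a time without ever materializing the full logit matrices.

```python
import numpy as np

def chunked_kl(h_t, W_t, h_s, W_s, chunk=1024):
    """Per-token KL(teacher || student), streamed over vocabulary chunks.

    h_t: (n, d_t) teacher hidden states, W_t: (V, d_t) teacher unembedding.
    h_s: (n, d_s) student hidden states, W_s: (V, d_s) student unembedding.
    Only an (n, chunk) slice of logits exists at any time.
    """
    n, V = h_t.shape[0], W_t.shape[0]
    m_t = np.full(n, -np.inf)   # running max of teacher logits
    z_t = np.zeros(n)           # running sum of exp(t - m_t)
    acc = np.zeros(n)           # running sum of exp(t - m_t) * (t - s)
    m_s = np.full(n, -np.inf)   # running max of student logits
    z_s = np.zeros(n)           # running sum of exp(s - m_s)

    for start in range(0, V, chunk):
        t = h_t @ W_t[start:start + chunk].T   # teacher logits for this chunk
        s = h_s @ W_s[start:start + chunk].T   # student logits for this chunk

        # Rescale the teacher accumulators whenever the running max grows.
        new_m_t = np.maximum(m_t, t.max(axis=1))
        scale = np.exp(m_t - new_m_t)
        e = np.exp(t - new_m_t[:, None])
        z_t = z_t * scale + e.sum(axis=1)
        acc = acc * scale + (e * (t - s)).sum(axis=1)
        m_t = new_m_t

        # Same streaming log-sum-exp for the student.
        new_m_s = np.maximum(m_s, s.max(axis=1))
        z_s = z_s * np.exp(m_s - new_m_s) + np.exp(s - new_m_s[:, None]).sum(axis=1)
        m_s = new_m_s

    lse_t = m_t + np.log(z_t)
    lse_s = m_s + np.log(z_s)
    # KL = E_p[t - s] - logsumexp(t) + logsumexp(s), with p = softmax(t)
    return acc / z_t - lse_t + lse_s
```

A caller would pass the received teacher hidden states together with a local copy of the teacher unembedding matrix; the chunk size trades peak memory against the number of matrix multiplications.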

Tilde@tilderesearch·29 Apr

@hellofromjames Thanks! Replied with a working link

James Peterson@hellofromjames·29 Apr

@tilderesearch Nice post! FYI I'm getting a 404 for your blog post link

Tilde@tilderesearch·28 Apr

Distillation (especially on-policy) has become a pivotal component of the post-training stack. ☕ To dramatically accelerate distillation at scale, we open-source Nitrobrew, a communication-efficient, fused strategy for logit distillation. It’s built for both on- and off-policy distillation with:
- 100x faster loss computation
- 50% peak memory savings
- 3x faster on-policy distillation
and more! A 🧵 (1/8)

Tilde@tilderesearch·29 Apr

Link above is broken, here is the link: blog.tilderesearch.com/blog/nitrobrew

Tilde@tilderesearch·28 Apr

~8/8~ Read the full post here: tilderesearch.com/blog/nitrobrew. We're hiring - if you like finding simple tricks hiding in plain sight in ML systems, come work with us: tilderesearch.com/join
