Alec Dewulf

4

27

4.2K

Alec Dewulf@AlecDewulf·4d

really nice framework and direction. I'm generally very enthusiastic about principled parameter-specific update rules. No more general optimizers

Tim Lau@timlautk

1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…

3

115

Alec Dewulf retweeted

Federico Cassano@ellev3n11·5d

we finally trained a model with muon. we now have a real research effort.

10

11

477

40.7K

Alec Dewulf retweeted

Michael Truell@mntruell·6d

Composer 2.5 is a significant step up from Composer 2. This is the very start of our work with SpaceXAI. Hope to have more improvements out soon.

Cursor@cursor_ai

Introducing Composer 2.5, our most powerful model yet. It's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we’re doubling the included usage of the model.

370

1K

4.8K

1.1M

Alec Dewulf@AlecDewulf·6d

@gitipahwa what IS it doing when I ask it how to cook rice?

0

1

41

gitika@cupertinohoops·6d

True AI literacy should mean giving people a mental model of how these systems work, why they matter, and how to use them responsibly. Not just signing your grandma up for ChatGPT.

Prime Intellect@PrimeIntellect

5

11

547

Alec Dewulf@AlecDewulf·15 May

very cool. sota with aurora+contra muon using very small damping

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

Prime Intellect@PrimeIntellect

0

21

1.7K

Alec Dewulf retweeted

elie@eliebakouch·15 May

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more primeintellect.ai/auto-nanogpt

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

36

81

799

113.4K

Alec Dewulf@AlecDewulf·14 May

@iamgrigorev I think there is a little bit of sauce required to scale aurora nicely (along the lines of muon lr scaling, ensuring the iteration converges, etc.). I'll write up something on this soon with some larger-scale results

1

189

George Grigorev@iamgrigorev·14 May

lol there’ve been so many optimizers this week I can’t test them all. (btw Aurora doesn’t work at large scale)

6

0

73

6.7K

Alec Dewulf@AlecDewulf·13 May

aurora's performance can be hurt a fair bit by using less precise iterations as it's unlikely to converge close to the row-uniform stiefel intersection in that case. but we are generally very interested in results that differ from ours. would be cool if you shared more training details (hps, batch size, model arch, MLP expansion, etc.) so we can take a look

Alexander Doria@Dorialexander

0

3

376

Tianyang Lin@tianylin·13 May

turns out the loss improvement diminishes at around step 10k (~330B tokens)…

Tianyang Lin@tianylin

I started a vibe run using the aurora optimizer. model is of param size 4B and is trained on 32k sequences. HPs are not carefully tuned (used the same lr & wd as muon baseline). i do observe some loss improvement over baseline early in training (see fig, -0.0093 at step 3k).

12

4

111

23.5K

Alec Dewulf retweeted

Alexander Doria@Dorialexander·9 May

Seems like I managed to independently confirm Aurora results on SYNTH (600M parameters). Very early run but promising lead and suggests reproducibility in a very different learning environment.

Ok directly relevant for ongoing work (on memorization): avoiding a "huge percentage of neurons to effectively die early in training (…) so that many parameters no longer meaningfully contribute to network outputs". This optimizer is going to see some SYNTH data.

11

34

385

42.8K

Alec Dewulf retweeted

Saket Tiwari@SaketTiwari14·8 May

Super cool work on optimizer-architecture codesign. It's always refreshing to see strong empirical results follow from a technical insight: row normalization can prevent dead neurons, unlike muon.
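
To make the row-normalization idea concrete, here is a minimal sketch. It is not Aurora's actual update rule, which this thread does not spell out. It assumes an update matrix with one row per neuron and rescales every row to the same L2 norm, so no neuron receives a vanishing share of the update. The name `row_normalize` and the `eps` guard are hypothetical, chosen for illustration only.

```python
import math

def row_normalize(update, eps=1e-12):
    """Rescale each row of a 2-D update (list of lists) to unit L2 norm.

    Assumption for illustration: one row per neuron. This is the generic
    row-normalization idea, NOT the Aurora optimizer itself.
    """
    normalized = []
    for row in update:
        norm = math.sqrt(sum(x * x for x in row))
        # eps avoids division by zero; an all-zero row stays all-zero.
        normalized.append([x / (norm + eps) for x in row])
    return normalized

# One neuron with a large update, one with a tiny update.
update = [[3.0, 4.0], [0.003, 0.004]]
balanced = row_normalize(update)
row_norms = [math.sqrt(sum(x * x for x in row)) for row in balanced]
print(row_norms)  # both row norms come out approximately equal to 1
```

After normalization the second row is no longer a thousand times smaller than the first, which is the intuition behind the claim that equalizing per-row update size helps keep neurons from dying.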

x.com/i/article/2052…

2

9

1.4K

Alec Dewulf@AlecDewulf·9 May

@atticuswzf @tilderesearch good point. The 1B runs for U-NorMuon didn't finish in time to include in the post but they all turned out to do a little worse than Aurora. We'll publish a revision with the updated results soon

0

217

Atticus Wang@atticuswzf·8 May

@tilderesearch I do not actually see a comparison between Aurora and U-NorMuon in terms of pretraining loss? How does it do compared to U-NorMuon?

0

2

1.4K

Tilde@tilderesearch·8 May

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

@vikhyatk @tilderesearch @rosinality github.com/KellerJordan/m…

x.com/i/article/2052…

41

177

1.6K

517.6K

Alec Dewulf@AlecDewulf·8 May

3

2

23

584

vik@vikhyatk·8 May

@tilderesearch @rosinality try nanogpt speedrun

2

0

31

2.4K

Alec Dewulf retweeted

Ali Behrouz@behrouz_ali·8 May

Great work!

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

2

37

6.6K

Alec Dewulf retweeted

Vinod Khosla@vkhosla·8 May

More proof from one of our companies that innovation continues unabated around LLMs...

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

7

13

187

38.9K

Alec Dewulf@AlecDewulf·8 May

no more dead neurons

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

6

342

Alec Dewulf@AlecDewulf·7 May

👀

👀 Aurora dropping tomorrow. 3175 steps → beating NanoGPT Track 3 SOTA by 50 steps. And it scales 🚀

2

180

Alec Dewulf@AlecDewulf·29 Apr

distillation go brrr

Distillation (especially on-policy) has become a pivotal component of the post-training stack. ☕ To dramatically accelerate distillation at scale, we open-source Nitrobrew, a communication-efficient, fused strategy for logit distillation. It’s built for both on- and off-policy distillation with: 100x faster loss computation, 50% peak memory savings, 3x faster on-policy distillation, and more! A 🧵 (1/8)
