Alec Dewulf

39 posts

@AlecDewulf

training language models @tilderesearch

Joined February 2019
119 Following 112 Followers
Alec Dewulf retweeted
Tilde
Tilde@tilderesearch·
A nice piece on architecture-optimizer codesign ft. Aurora and others 🚀
Tim Lau@timlautk

1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…

1 reply · 4 reposts · 27 likes · 4.2K views
Alec Dewulf
Alec Dewulf@AlecDewulf·
really nice framework and direction. I'm generally very enthusiastic about principled parameter-specific update rules. No more general optimizers
Tim Lau@timlautk

1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…

0 replies · 0 reposts · 3 likes · 115 views
Alec Dewulf retweeted
Federico Cassano
Federico Cassano@ellev3n11·
we finally trained a model with muon. we now have a real research effort.
10 replies · 11 reposts · 477 likes · 40.7K views
gitika
gitika@cupertinohoops·
True AI literacy should mean giving people a mental model of how these systems work, why they matter, and how to use them responsibly. Not just signing your grandma up for ChatGPT.
1 reply · 5 reposts · 11 likes · 547 views
Alec Dewulf retweeted
elie
elie@eliebakouch·
we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more primeintellect.ai/auto-nanogpt
Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

36 replies · 81 reposts · 799 likes · 113.4K views
Alec Dewulf
Alec Dewulf@AlecDewulf·
@iamgrigorev I think there is a little bit of sauce required to scale aurora nicely (along the lines of muon lr scaling, ensuring iteration converges, etc.). I'll write up something on this soon with some larger scale results
0 replies · 0 reposts · 1 like · 189 views
George Grigorev
George Grigorev@iamgrigorev·
lol there’s been so many optimizers this week I can’t test them all. (btw Aurora doesn’t work at large scale)
6 replies · 0 reposts · 73 likes · 6.7K views
Alec Dewulf
Alec Dewulf@AlecDewulf·
aurora's performance can be hurt a fair bit by using less precise iterations as it's unlikely to converge close to the row-uniform stiefel intersection in that case. but we are generally very interested in results that differ from ours. would be cool if you shared more training details (hps, batch size, model arch, MLP expansion, etc.) so we can take a look
1 reply · 0 reposts · 3 likes · 376 views
Alec Dewulf retweeted
Alexander Doria
Alexander Doria@Dorialexander·
Seems like I managed to independently confirm Aurora results on SYNTH (600M parameters). Very early run but promising lead and suggests reproducibility in a very different learning environment.
Alexander Doria@Dorialexander

Ok directly relevant for ongoing work (on memorization): avoiding a "huge percentage of neurons to effectively die early in training (…) so that many parameters no longer meaningfully contribute to network outputs". This optimizer is going to see some SYNTH data.

11 replies · 34 reposts · 385 likes · 42.8K views
Alec Dewulf retweeted
Saket Tiwari
Saket Tiwari@SaketTiwari14·
Super cool work on optimizer-architecture codesign. It's always refreshing to see strong empirical results follow from a technical insight: row normalization can prevent dead neurons, unlike muon.
Tilde@tilderesearch

x.com/i/article/2052…

0 replies · 2 reposts · 9 likes · 1.4K views
Alec Dewulf
Alec Dewulf@AlecDewulf·
@atticuswzf @tilderesearch good point. The 1B runs for U-NorMuon didn't finish in time to include in the post but they all turned out to do a little worse than Aurora. We'll publish a revision with the updated results soon
1 reply · 0 reposts · 0 likes · 217 views
Atticus Wang
Atticus Wang@atticuswzf·
@tilderesearch I do not actually see a comparison between Aurora and U-NorMuon in terms of pretraining loss? How does it do compared to U-NorMuon?
1 reply · 0 reposts · 2 likes · 1.4K views
Tilde
Tilde@tilderesearch·
Introducing Aurora, a new optimizer for training frontier-scale models.

We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.

Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs.

By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.

What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
Tilde@tilderesearch

x.com/i/article/2052…

41 replies · 177 reposts · 1.6K likes · 517.6K views
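
The announcement above describes redistributing update energy more uniformly across neurons, and a retweeted reply credits row normalization with preventing dead neurons. As a rough illustration of that general idea only, and not Aurora's actual update rule (which, per the replies above, relies on an iteration converging near a row-uniform Stiefel intersection), the sketch below rescales each row of a toy update matrix to the same L2 norm. The names `row_normalize` and `row_norms`, the `eps` guard, the example values, and the assumption that each row corresponds to one neuron are all hypothetical.

```python
import math


def row_norms(matrix):
    """Return the L2 norm of each row of a 2-D list of floats."""
    return [math.sqrt(sum(x * x for x in row)) for row in matrix]


def row_normalize(update, eps=1e-12):
    """Rescale every row of `update` to (approximately) unit L2 norm.

    Assumption: one row == one neuron. `eps` avoids division by zero
    for an all-zero row. This is plain row normalization, not Aurora.
    """
    normalized = []
    for row, norm in zip(update, row_norms(update)):
        normalized.append([x / (norm + eps) for x in row])
    return normalized


# Hypothetical update in which the second neuron receives almost no signal.
update = [
    [3.0, 4.0],      # row norm 5.0
    [0.003, 0.004],  # row norm 0.005
]

balanced = row_normalize(update)
print(row_norms(update))
print(row_norms(balanced))
```

In this toy case the second row starts with a norm 1000 times smaller than the first. After rescaling, both rows carry the same norm, so neither neuron's update is starved relative to the other.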