Tim Lau

424 posts

@timlautk

AI Researcher @DRWTrading; Past Postdoc @Penn @PennMedicine @Wharton @ChicagoBooth; PhD @NorthwesternU Statistics & Data Science; Opinions are my own

Palo Alto, CA · Joined January 2014
2.1K Following · 635 Followers
Pinned Tweet
Tim Lau
Tim Lau@timlautk·
1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…
English
3
22
102
17.9K
Jiaxuan Zou
Jiaxuan Zou@SmartPig_Joe·
@timlautk @weijie444 As analyzed in our work, the orthogonal constraint eliminates radial jitter and preserves weight norms, theoretically preventing dead neurons. We would greatly appreciate it if you could discuss and cite Nora as a concrete instance in your revisions!😀
English
1
0
1
45
Tim Lau
Tim Lau@timlautk·
We did discuss how our work relates to the work you mentioned, in Section A.3 of the appendix of the paper. The main distinction is that we do not assume that the layerwise loss function itself is rotationally invariant. We also do not advocate a single update rule for all matrix-valued parameters, but rather architecture-optimizer co-design, meaning that optimizer updates have to follow the parameters' symmetry groups. We believe that the assumptions made in our paper are minimal: matrix-valued parameters (as linear operators) possess natural symmetries by their very definitions (so this is not really an assumption at all). Other than that, we only assume the layerwise loss function is Lipschitz-differentiable. We also have four sets of pre-training experiments, showing consistent results across different model architectures and model sizes. We are not proposing a specific optimizer, but an end-to-end optimizer stack with optimizers that respect the symmetry groups of their corresponding parameters.
English
0
0
0
26
James MMatrix
James MMatrix@JamesWhate89993·
@tonysilveti Another work also claimed to derive new optimizers from symmetry: arxiv.org/abs/2602.09006 I am still a bit sceptical about these symmetry-based approaches. There was a lot of work in this direction a few years ago, but none of it worked
English
1
0
1
60
Tony S.F.
Tony S.F.@tonysilveti·
Really elegant idea to use symmetry as a guiding principle to derive several popular optimizers, including Muon. Looking forward to reading this more deeply!
Tim Lau@timlautk

1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…

English
1
3
21
2.9K
Tim Lau
Tim Lau@timlautk·
Aurora motivates our development in Section 3.4. In our experiments we did not use the version of Aurora in their blog (beta = 0.5; K = 2) but another instance of it with beta = 0 and K = 1. In our paper, we call it HybridPolarGradM (row-norm/right-spectral). Aurora itself also fits in the left-permutation-right-orthogonal (LPRO) equivariant optimizer class in our framework.
English
0
1
3
165
Tim Lau
Tim Lau@timlautk·
Finally, we can discuss what we have derived recently! Our new paper is at arxiv.org/abs/2605.18106, introduced in this post x.com/timlautk/statu….

Take-aways:

Most tall-skinny matrix parameters in LLMs (embeddings, LM heads, SwiGLU MLP gate/up/down projections) are left-permutation-right-orthogonal (LPRO) equivariant, not bi-orthogonally equivariant like standard linear and attention layers.

The right-spectral update G(GᵀG)⁻¹ᐟ² is also LPRO-equivariant and equals the orthogonal polar factor UVᵀ in exact form. However, polynomial iterations for computing either, such as Newton-Schulz or Polar Express, can diverge in the tall-skinny regime due to ill-conditioning. Rational iterations like QDWH and ZOLO-PD resolve this but require high precision and matrix decompositions (QR or Cholesky) that are not GPU-friendly, so they may not scale to large pre-training runs (see our earlier PolarGrad paper: arxiv.org/abs/2505.21799).

Fortunately, the LPRO-equivariant class also contains row-normalized momentum optimizers and hybrid row-norm/spectral optimizers. Row-normalizing the momentum first can also improve the conditioning of the right Gram matrix computed in the subsequent spectral step. The row-norm variant alone is more computationally efficient but typically gives slightly worse validation loss than the hybrid, producing a time-performance trade-off within the same equivariance class.

The figures below show this trade-off (Gemma 3 1B-style pre-training; AdamW for scalars/vectors, Muon for linear/attention matrices, different optimizers for the embedding and LM head): the hybrid optimizer reaches lower validation loss than the row-norm optimizer at the same token budget, but takes longer wall-clock time.
[Images: two figures attached to the post]
Tim Lau@timlautk

@_arohan_ The issue here is QDWH itself is slow on GPUs and probably won't scale to large pre-training runs. We are deriving something else which also seems to be able to explain why this optimizer specifically for gate/up proj would work better than Muon alone.

English
1
4
37
4.4K
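A minimal NumPy sketch of the identity stated in the post above, that the right-spectral update G(GᵀG)⁻¹ᐟ² coincides with the orthogonal polar factor UVᵀ when G has full column rank. The function names, the matrix shape, and the use of `eigh` on the small Gram matrix are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def polar_factor_svd(G):
    """Orthogonal polar factor U V^T from the thin SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def right_spectral_update(G):
    """G (G^T G)^{-1/2}, computed from the small n x n right Gram matrix."""
    gram = G.T @ G                        # n x n, cheap when G is tall-skinny
    evals, evecs = np.linalg.eigh(gram)   # symmetric eigendecomposition
    # Scale eigenvector columns by 1/sqrt(eigenvalue) to form gram^{-1/2}.
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    return G @ inv_sqrt

rng = np.random.default_rng(0)
G = rng.standard_normal((512, 16))        # assumed tall-skinny, well-conditioned

P = polar_factor_svd(G)
R = right_spectral_update(G)

max_gap = float(np.max(np.abs(P - R)))                    # difference between the two
orth_err = float(np.max(np.abs(R.T @ R - np.eye(16))))    # column orthonormality error
print(max_gap, orth_err)
```

For a well-conditioned Gaussian matrix both numbers should be near machine precision. Forming GᵀG squares the condition number, so this explicit route would likely be unreliable for ill-conditioned inputs, which is consistent with the conditioning concerns raised in the post.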
Tim Lau
Tim Lau@timlautk·
@tilderesearch Thanks for the repost! The material in Section 3.4 of our paper was indeed motivated by the Aurora blog post during the final preparation of the paper!
English
0
0
3
74
Tim Lau
Tim Lau@timlautk·
4/4 We pre-trained Qwen3-0.6B-, Gemma 3 1B-, OLMoE-1B-7B-, and gpt-oss-style models with this stack. Replacing AdamW on vocabulary-indexed matrices consistently improves validation loss. Symmetry-compatible router updates also reduce training-loss spikes in MoEs. The attached figure shows a downsized gpt-oss run where symmetry-compatible configurations outperform AdamW-heavy baselines. Thanks to @PrimeIntellect for the GPU compute!
Tim Lau tweet media
English
0
1
8
511
Tim Lau
Tim Lau@timlautk·
3/4 Once you write down the symmetry, the optimizer is almost forced on you. A small zoo of new updates falls out: • RowNormM: local, cheap, permutation-equivariant • RightPolarGradM / LeftPolarGradM: one-sided spectral via the smaller Gram matrix • HybridPolarGradM: row-norm composed with a spectral step And Muon drops out as the bi-orthogonal special case.
English
1
1
8
672
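A rough sketch of what "write down the symmetry" can mean numerically: an update map f is left-permutation-right-orthogonal equivariant if f(PGQ) = P f(G) Q for every row permutation P and orthogonal Q. The functions below are simplified stand-ins inspired by the names in the post (momentum and any scaling factors are omitted, and the exact definitions in the paper may differ), so treat them as assumptions.

```python
import numpy as np

def row_norm(M, eps=1e-12):
    """Divide each row by its Euclidean norm (hypothetical RowNorm-style map)."""
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)

def right_spectral(M):
    """M (M^T M)^{-1/2} via the smaller right Gram matrix."""
    evals, evecs = np.linalg.eigh(M.T @ M)
    return M @ ((evecs / np.sqrt(evals)) @ evecs.T)

def hybrid(M):
    """Row-normalize first, then apply the right-spectral step."""
    return right_spectral(row_norm(M))

rng = np.random.default_rng(0)
V, d = 64, 8                                   # assumed vocabulary and hidden sizes
M = rng.standard_normal((V, d))
perm = rng.permutation(V)                      # left action: permute rows
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # right action: orthogonal matrix

def equivariance_gap(f):
    """Max entrywise gap between f(P M Q) and P f(M) Q."""
    lhs = f(M[perm] @ Q)
    rhs = f(M)[perm] @ Q
    return float(np.max(np.abs(lhs - rhs)))

gaps = {
    "row_norm": equivariance_gap(row_norm),
    "right_spectral": equivariance_gap(right_spectral),
    "hybrid": equivariance_gap(hybrid),
}
print(gaps)
```

All three gaps should be at rounding-error level, since right-multiplying by an orthogonal matrix preserves row norms and conjugates the Gram matrix.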
Tim Lau
Tim Lau@timlautk·
Of course, in terms of wall-clock time, QDWH isn't as competitive as NS iteration (or Polar Express, etc.). It has matrix decompositions (QR or Cholesky) that aren't as GPU-friendly, and requires higher numerical precision, leading to a trade-off for different use cases.
English
0
0
0
82
Tim Lau
Tim Lau@timlautk·
We did try to understand how the choice of inexact polar oracle affects spectral gradient descent (for Muon, it's specifically NS iteration), both theoretically and empirically, in our paper, not just for DL/LLM pre-training but as a general matrix optimization algorithm. QDWH is particularly useful when the gradient/momentum is ill-conditioned. It works much better than NS iteration even when solving the strongly convex matrix quadratic regression problem. For certain (deterministic) problems, we also found that the nuclear norm scaling term is necessary for convergence (even without any learning rate decay).
Tony S.F.@tonysilveti

This new paper on Freon mentions using QDWH for orthogonalization - very nice - but it's a shame they don't cite PolarGrad. PolarGrad used QDWH first! Check out the PolarGrad paper and QDWH impl. below. x.com/timlautk/statu… github.com/timlautk/polar… arxiv.org/pdf/2505.21799

English
1
0
6
1.5K
Tim Lau
Tim Lau@timlautk·
@_arohan_ The issue here is QDWH itself is slow on GPUs and probably won't scale to large pre-training runs. We are deriving something else which also seems to be able to explain why this optimizer specifically for gate/up proj would work better than Muon alone.
English
0
0
0
4.5K
rohan anil
rohan anil@_arohan_·
@timlautk It would be cool if you can numerically simulate all 3 and show that this indeed will resolve the pathology.
English
1
0
1
103
rohan anil
rohan anil@_arohan_·
Cool work! Muon (Shampoo with b2=0.0) had the following pathology, and thus likely toasted some big model performance: the polar factor can lead to updates of the type U being [[1, 0] [0, eps]], which can lead to a death spiral for second-row updates, and neurons selectively die. The following update could be a bit better: [[0.5, 0] [0.0, 0.5]]. OG Shampoo has less of this problem but doesn’t fix it. This is why E-Shampoo is superior in this respect, but you can do better. All roads lead to Adam/AdaGrad. Or do they …
Tilde@tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

English
12
14
210
21.1K
Tim Lau
Tim Lau@timlautk·
@_arohan_ Using Cholesky or QR and then DWH will solve this issue
English
1
0
1
82
rohan anil
rohan anil@_arohan_·
@timlautk I wish Cholesky would just work then we wouldn’t have needed all this explosion of variants.
English
1
0
2
99
Tim Lau
Tim Lau@timlautk·
@_arohan_ It seems to me this is also a numerical linear algebra problem but we also want the algorithm to be GPU-friendly. Polynomial-based iterations are GPU-friendly but fundamentally will accumulate numerical errors for low-rank inits anyway
English
1
0
1
93
rohan anil
rohan anil@_arohan_·
@timlautk What the right optimization solution is in this case, one that can avoid the pathology they describe, is a very interesting question
English
1
0
1
86
Tim Lau
Tim Lau@timlautk·
@_arohan_ So you meant this U is a potentially low-rank gradient for eps close to 0? In this case I guess this is the issue of NS iteration for ill-conditioned initializations.
English
1
0
2
86
rohan anil
rohan anil@_arohan_·
@timlautk I meant having gradients that cause a privileged basis, sort of, and Muon reinforcing that causes the death spiral they are talking about
English
1
0
0
86
Tim Lau
Tim Lau@timlautk·
@tilderesearch NS iteration in Muon might fail for tall matrices due to ill-conditioned initializations (the momentum), so it could be the computational pathology of polynomial iterations for polar decomposition. We discuss this in Sections 3.6, 3.7 and A.3 of our paper: arxiv.org/abs/2505.21799
English
0
2
43
5.4K
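A small sketch of why a polynomial iteration can struggle on an ill-conditioned tall matrix: each quintic Newton-Schulz step applies the same scalar polynomial to every singular value, so a tiny singular value grows by at most roughly the factor a per step. The coefficients are the (3.4445, -4.7750, 2.0315) triple quoted elsewhere in this feed; the step count of 5, the matrix sizes, and the singular-value profile are assumptions chosen for illustration. This runs in float64, so it shows slow convergence rather than the low-precision failure modes discussed in the paper.

```python
import numpy as np

def newton_schulz(G, steps, coeffs=(3.4445, -4.7750, 2.0315)):
    """Quintic Newton-Schulz iteration X <- a X + b X A + c X A^2, with A = X^T X."""
    a, b, c = coeffs
    X = G / np.linalg.norm(G)              # Frobenius scaling: singular values <= 1
    for _ in range(steps):
        A = X.T @ X                        # small n x n Gram matrix
        X = a * X + X @ (b * A + c * (A @ A))
    return X

rng = np.random.default_rng(0)
m, n = 256, 8
U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # orthonormal columns
W, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal matrix
sigma = np.logspace(0, -8, n)                      # condition number 1e8 (illustrative)
G = (U * sigma) @ W.T                              # tall matrix with known spectrum

X = newton_schulz(G, steps=5)
out_sigma = np.linalg.svd(X, compute_uv=False)     # singular values after iteration
print(out_sigma)
```

The largest singular values end up near 1 while the smallest ones stay far below it, so the output is not close to the exact polar factor, whose singular values are all 1.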
Tim Lau
Tim Lau@timlautk·
@_arohan_ The (exact) polar factor itself is orthogonal, so the first example would not happen if you mean U is the polar factor of the gradient/momentum, unless eps is ±1. A real orthogonal diagonal matrix can only have diagonal entries equal to ±1.
English
1
0
1
130
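A quick numerical check of the point in this reply, assuming the matrix [[1, 0], [0, eps]] is read as the gradient/momentum rather than as the update itself: its exact polar factor is orthogonal, so eps does not survive in the update. The value of eps is an arbitrary illustrative choice.

```python
import numpy as np

def polar_factor(G):
    """Exact orthogonal polar factor U V^T from the SVD of a square matrix."""
    U, _, Vt = np.linalg.svd(G)
    return U @ Vt

eps = 1e-3                                  # illustrative small positive value
G = np.array([[1.0, 0.0],
              [0.0, eps]])

P = polar_factor(G)
is_orthogonal = bool(np.allclose(P.T @ P, np.eye(2)))     # polar factor is orthogonal
input_is_orthogonal = bool(np.allclose(G.T @ G, np.eye(2)))  # diag(1, eps) is not
print(P, is_orthogonal, input_is_orthogonal)
```

For positive eps the polar factor is the identity, so both rows receive updates of equal magnitude under an exact polar oracle. An inexact oracle may behave differently.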
Tim Lau
Tim Lau@timlautk·
@_arohan_ Do you mean in this example U is the polar factor of the gradient/momentum?
English
1
0
1
326
Tilde
Tilde@tilderesearch·
Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
Tilde@tilderesearch

x.com/i/article/2052…

English
41
176
1.5K
515.6K
Tim Lau
Tim Lau@timlautk·
Actually it is quite unclear why they made this choice of hybrid coefficients because Polar Express (openreview.net/forum?id=yRtgZ…; ICLR 2026) converges in 7 steps and it is only slightly better than the constant coefficients (a, b, c) = (1.875, -1.25, 0.375) in 10 steps. Of course, it is different when applied to computing orthogonal polar factors in lower precision.
Tim Lau tweet media
English
0
0
1
217
Zhuoran Yang
Zhuoran Yang@zhuoran_yang·
The purple curve, the implementation of Dpsk-v4 (first 8 steps with coefficients (3.4445, -4.7750, 2.0315) and last 2 steps with (2.0, -1.5, 0.5)), is an extremely accurate approximation of the indicator function! Fengzhuo's Muon paper shows that Muon's **spectral sign function** learns all directions at the same pace. And when features are heavy-tailed, Muon learns them better than Adam (GD).
Fengzhuo Zhang@FengzhuoZhang

The Newton–Schulz iteration coefficients optimized by DeepSeek-V4 are surprisingly strong: they effectively normalize all singular values to 1. This matches our previous intuition: a well-balanced spectrum may help strike a better balance across long-tail knowledge. Plot code: github.com/FengzhuoZhang/…

English
1
4
37
5.6K
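A pure-Python sketch comparing the two coefficient schedules mentioned in this thread, applied to scalars standing in for singular values in (0, 1]: the constant coefficients (1.875, -1.25, 0.375) for 10 steps, and the hybrid schedule of 8 steps with (3.4445, -4.7750, 2.0315) followed by 2 steps with (2.0, -1.5, 0.5). The helper names and the sample inputs are assumptions, and exact arithmetic in float64 says nothing about low-precision behaviour.

```python
def quintic(x, coeffs):
    """One Newton-Schulz-style step on a scalar: a*x + b*x^3 + c*x^5."""
    a, b, c = coeffs
    return a * x + b * x**3 + c * x**5

def run_schedule(x, schedule):
    """Apply each (coeffs, steps) stage of the schedule in order."""
    for coeffs, steps in schedule:
        for _ in range(steps):
            x = quintic(x, coeffs)
    return x

CONSTANT = [((1.875, -1.25, 0.375), 10)]
HYBRID = [((3.4445, -4.7750, 2.0315), 8), ((2.0, -1.5, 0.5), 2)]

inputs = [0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1.0]   # sample singular values (assumed)
constant_out = [run_schedule(x, CONSTANT) for x in inputs]
hybrid_out = [run_schedule(x, HYBRID) for x in inputs]

for x, c_out, h_out in zip(inputs, constant_out, hybrid_out):
    print(f"{x:5.2f} -> constant {c_out:.6f}, hybrid {h_out:.6f}")
```

On these inputs both schedules land close to 1. The first hybrid coefficient set alone does not settle at 1 (its value at x = 1 is about 0.70), which is presumably why the schedule finishes with two steps of a polynomial that has a fixed point at 1. Inputs much smaller than 0.01 would need more steps under either schedule.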