
Tim Lau
424 posts

@timlautk
AI Researcher @DRWTrading; Past Postdoc @Penn @PennMedicine @Wharton @ChicagoBooth; PhD @NorthwesternU Statistics & Data Science; Opinions are my own







1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update! 📝 arxiv.org/abs/2605.18106 💻 github.com/timlautk/equiv…




@_arohan_ The issue here is that QDWH itself is slow on GPUs and probably won't scale to large pre-training runs. We are deriving an alternative that also seems to explain why this optimizer, applied specifically to the gate/up projections, would work better than Muon alone.






This new paper on Freon mentions using QDWH for orthogonalization - very nice - but it's a shame they don't cite PolarGrad. PolarGrad used QDWH first! Check out the PolarGrad paper and QDWH impl. below. x.com/timlautk/statu… github.com/timlautk/polar… arxiv.org/pdf/2505.21799
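For readers unfamiliar with what "orthogonalization" means here: the target is the orthogonal polar factor of a gradient-like matrix. The sketch below computes that factor with a plain SVD in NumPy as a reference point. It is not QDWH and not the PolarGrad or Freon implementation; the function name `polar_factor_svd` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def polar_factor_svd(G):
    """Orthogonal polar factor of G via a thin SVD (reference only, NOT QDWH)."""
    # G = U diag(s) V^T  ->  polar factor Q = U V^T
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

# Hypothetical stand-in for a gradient matrix
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))

Q = polar_factor_svd(G)   # has orthonormal columns
H = Q.T @ G               # symmetric positive semidefinite factor, G = Q H
```

Iterative schemes such as QDWH aim at the same `Q` without forming a full SVD, which is why their speed on GPUs matters for large training runs.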


Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks.

Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity.

What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.








The Newton–Schulz iteration coefficients optimized by DeepSeek-V4 are surprisingly strong: they effectively normalize all singular values to 1. This matches our previous intuition: a well-balanced spectrum may help strike a better balance across long-tail knowledge. Plot code: github.com/FengzhuoZhang/…
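To make "normalize all singular values to 1" concrete, here is a minimal NumPy sketch of a Newton–Schulz iteration. It uses the classical cubic coefficients (1.5, -0.5), which are an assumption chosen for illustration; they are not the DeepSeek-V4 coefficients, which are not given in this post. The function name `newton_schulz`, the step count, and the random test matrix are likewise hypothetical.

```python
import numpy as np

def newton_schulz(G, steps=30):
    """Classical cubic Newton-Schulz iteration toward the polar factor of G.

    Coefficients (1.5, -0.5) are the textbook choice, used here only as an
    illustration; tuned variants use different polynomials.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # inside the convergence region of the cubic iteration.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3,
        # which has an attracting fixed point at s = 1.
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

# Hypothetical stand-in for a gradient matrix
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 5))

X = newton_schulz(G)
singular_values = np.linalg.svd(X, compute_uv=False)
print(singular_values)  # all entries should be close to 1
```

Tuned coefficient sets trade this slow-but-exact convergence for far fewer steps, so how flat the resulting spectrum is depends on the coefficients used.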
