noahamsel (@noahamsel) - Twitter Profile

noahamsel retweeted

Tri Dao@tri_dao·2d

It's my favorite kind of work: linear algebra insight + fast kernels. When playing w Muon a while ago, we were thinking why not speed it up by operating on the small square matrix X X^T instead of the large rectangular matrix X. Jack, Noah, and Berlin spent many months understanding eigenvalues/vectors of the intermediate matrices in Muon, and finally came up with a simple and elegant algo to make this work.

Jack Zhang@jcz42

We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
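
One way to see why iterating on the Gram matrix can be mathematically equivalent is the sketch below. It is not the authors' implementation: it assumes the classic cubic Newton-Schulz step X <- (1.5 I - 0.5 X X^T) X rather than whatever tuned coefficients Muon uses, runs in float64, and the function names `newton_schulz` and `gram_newton_schulz` are hypothetical. Each step multiplies X on the left by a polynomial P in G = X X^T, so the updated Gram matrix is P G P and the whole iteration can be carried out on n x n matrices, touching the rectangular X only at the start and end.

```python
import numpy as np

# Classic cubic Newton-Schulz coefficients. This is an assumption made for
# illustration; the thread does not state which coefficients are used.
A, B = 1.5, -0.5


def newton_schulz(X, steps):
    """Baseline: iterate directly on the rectangular n x m matrix X."""
    X = X / np.linalg.norm(X)  # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        G = X @ X.T              # n x n Gram matrix, costs an n x m matmul
        X = A * X + B * (G @ X)  # second rectangular matmul per step
    return X


def gram_newton_schulz(X, steps):
    """Same iteration, but every step in the loop works on n x n matrices."""
    X = X / np.linalg.norm(X)
    n = X.shape[0]
    I = np.eye(n)
    G = X @ X.T  # the only Gram computation on the rectangular input
    Q = I        # accumulated product of the per-step polynomials
    for _ in range(steps):
        P = A * I + B * G  # symmetric polynomial in G
        Q = P @ Q          # X_k = Q_k @ X_0
        G = P @ G @ P      # Gram matrix of the updated X, since P is symmetric
    return Q @ X           # single rectangular matmul at the end
```

For an n x m input with n much smaller than m, the loop in the second function never multiplies by an n x m matrix, which is where the FLOP savings described in the tweet would come from. This sketch does not address the low-precision stability issues the thread mentions, and it does not use symmetric GEMM kernels.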

noahamsel@noahamsel·2d

Announcing Gram Newton-Schulz, a new way to implement Muon that's 2x faster

Trick 1: rejigger Newton-Schulz to replace rectangular matmuls with square symmetric ones
Trick 2: collaborate with CUDA kings @jcz42 and @_berlinchen from @tri_dao's lab

Blog: dao-ailab.github.io/blog/2026/gram…

noahamsel@noahamsel·6 Jun

How can classical numerical analysis help train deep nets faster? Climb aboard the Polar Express to find out... arxiv.org/abs/2505.16932 joint with @davpersson @gowerrobert + Chris Musco

noahamsel retweeted

Robert M. Gower @ Neurips 2025@gowerrobert·4 Jun

Are you interested in the new Muon/Scion/Gluon method for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method *The PolarExpress* just for this! If you're interested, climb aboard 1/x
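
For reference, the polar factor mentioned above can be computed exactly from an SVD: if M = U S V^T, the polar factor is U V^T, and for a symmetric matrix this coincides with the matrix sign. The sketch below is only a reference computation, not the PolarExpress method and not what Muon runs in practice (the tweet says Muon approximates this quantity); the name `polar_factor` and the argument `M` are hypothetical.

```python
import numpy as np


def polar_factor(M):
    """Exact polar factor U @ V^T of M, computed from a thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt  # same shape as M, with all singular values set to 1
```

Such an exact reference can be useful for checking how close an iterative approximation gets on small test matrices.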
