Kwangjun Ahn

73 posts

Kwangjun Ahn

@KwangjunA

Researcher at NVIDIA // ex-Researcher at Microsoft, PhD from MIT EECS

Cambridge, MA Katılım Şubat 2020

317 Takip Edilen718 Takipçiler

Kwangjun Ahn@KwangjunA·5 Oca

@JohnCLangford Updates on the Dion codebase (github.com/microsoft/dion), please check them out! - Dion2 (arxiv.org/abs/2512.16928), which has much simpler math than Dion. - NorMuon (arxiv.org/abs/2510.05491) thanks to @li_zichong.

English

John Langford@JohnCLangford·23 Ara

@KwangjunA and Noah Amsel came up with Dion2 arxiv.org/pdf/2512.16928 , a radical computational improvement on Dion which builds on insights from Trion openreview.net/pdf/1d31ed870f… about how to effectively do a scalable orthonormal update in the Muon style.

English

1.9K

Kwangjun Ahn retweetledi

Seunghyun Seo@SeunghyunSEO7·13 Kas

sacled up to 12.7B dense, 5.5T tokens. - polynorm (optimized kernel) - grouped diff attn (their work) - parallel muonclip (adopt alltoall like mainhorse, essential, dion) - 80M batch it's still non-reasoning, also not moe though... keep pushing guys! arxiv.org/abs/2511.07464

Seunghyun Seo@SeunghyunSEO7

they also dropped fsdp2 optimized muon. though they don't use muon for 2.6b dense model, i think it's just beginning and they are preparing larger one. they pipeline muon's comm-comp with calc flops and the code is neat. not sure if it's existing method. huggingface.co/Motif-Technolo…

English

123

16.8K

Kwangjun Ahn@KwangjunA·9 Eki

@ionutmodo Thanks! Looks great, let me read through it

English

Ionut-Vlad Modoranu@ionutmodo·8 Eki

@KwangjunA This seems to be similar to our recently shared work, where we use Discrete Cosine Transform (DCT) to perform a cheap low-rank projection of the momentum buffer, followed by Newton-Schulz orthogonalization. Check out our paper here: arxiv.org/abs/2505.17967

English

Kwangjun Ahn@KwangjunA·29 Eyl

New improvement in Dion leads to a speedup that makes orthonormal updates (eg. Muon) more scalable for larger matrices. The trick: carefully using Newton-Schulz (on smaller matrices) as Dion's backend. Updates to our microsoft/dion codebase are coming soon---stay tuned!

English

2.4K

Kwangjun Ahn retweetledi

Microsoft Research@MSFTResearch·11 Eyl

Join us on Sept 24 at 8 AM PT for Microsoft Research Forum Season 2 – a virtual series highlighting purposeful research and its real-world impact, from fundamental exploration to advancing AI responsibly, scaling innovation through products and open source, and driving positive change for society. Register now: msft.it/6011scy27

English

5.7K

Kwangjun Ahn retweetledi

Andrej Karpathy@karpathy·3 Ağu

@jxbz love the repo! clean code, good practices but still not overly over-engineered, triton kernels, well documented, simple reference implementations alongside optimized code. nice

English

207

30.7K

Kwangjun Ahn retweetledi

elie@eliebakouch·3 Ağu

Lot, lot of alpha here

Jeremy Bernstein@jxbz

I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, triton kernels for Newton-Schulz, and lots of practical advice (1/2)

English

170

23.9K

Kwangjun Ahn@KwangjunA·3 Ağu

@eliebakouch @jxbz @JohnCLangford @GagMagakyan Thank you!! Glad to hear it

English

elie@eliebakouch·3 Ağu

@KwangjunA @jxbz @JohnCLangford @GagMagakyan It's super helpful + the repo have a lot of details!! thanks 🙏

English

105

Kwangjun Ahn retweetledi

Jeremy Bernstein@jxbz·3 Ağu

English

336

75K

Kwangjun Ahn@KwangjunA·3 Ağu

@jxbz @JohnCLangford @GagMagakyan Thank you for the kind words! Hope this proves useful to the community!

English

400

Kwangjun Ahn retweetledi

Jeremy Bernstein@jxbz·3 Ağu

Looks like extremely exciting and useful work by @KwangjunA, Byron Xu, Natalie Abreu, @JohnCLangford and @GagMagakyan github.com/microsoft/dion/ (2/2)

English

141

9.8K

Kwangjun Ahn retweetledi

Laker Newhouse@LakerNewhouse·21 Tem

[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.

English

342

35.7K

Kwangjun Ahn retweetledi

John Langford@JohnCLangford·20 Tem

Apparently Dion is now being worked on for Torch Titan: github.com/pytorch/torcht… :-)

Mikhail Parakhin@MParakhin

Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...

English

101

21.3K

Kwangjun Ahn@KwangjunA·19 Tem

@MParakhin Thanks for advertising Dion! :)

English

1.8K

Kwangjun Ahn retweetledi

Mikhail Parakhin@MParakhin·18 Tem

English

427

68.6K

Kwangjun Ahn@KwangjunA·16 Tem

@orvieto_antonio @micahgoldblum @teodorasrec @jonasgeiping Nice results! One question: wouldn’t large (global-)batch size be more practical for distributed training? Does that mean still SGD is not effective for large scale?

English

188

Antonio Orvieto@orvieto_antonio·16 Tem

As @micahgoldblum and coauthors, we also found that small batches make SGD effective in LM training. It's cool that our papers came out around the same time, and each has a different perspective! Below, our take on why this happens. Our awesome team: @teodorasrec @jonasgeiping Papers: arxiv.org/pdf/2506.12543 arxiv.org/pdf/2507.07101

English

147

8.7K

Kwangjun Ahn@KwangjunA·16 Tem

@seungwookh @jxbz Go Jeremy and Laker!!

English

273

Kwangjun Ahn retweetledi

Seungwook Han@seungwookh·16 Tem

But actually this is the og way of doing it and should stop by E-2103 to see @jxbz and Laker Newhouse whiteboard the whole paper.

Jeremy Bernstein@jxbz

Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies

English

8.3K

Kwangjun Ahn@KwangjunA·15 Tem

@konstmish @aaron_defazio Thanks Konstantin!

English

407

Kwangjun Ahn retweetledi

Konstantin Mishchenko@konstmish·15 Tem

Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.

English

142

13.4K

Keşfet

@JohnCLangford @li_zichong @ionutmodo @jxbz @eliebakouch @GagMagakyan @MParakhin @orvieto_antonio