Kwangjun Ahn
@KwangjunA
73 posts
Researcher at NVIDIA // ex-Researcher at Microsoft, PhD from MIT EECS
Cambridge, MA · Joined February 2020
317 Following · 718 Followers
Kwangjun Ahn retweeted
Seunghyun Seo @SeunghyunSEO7
Scaled up to 12.7B dense, 5.5T tokens:
- polynorm (optimized kernel)
- grouped diff attn (their work)
- parallel muonclip (adopts alltoall like mainhorse, essential, dion)
- 80M batch
It's still non-reasoning, and not MoE either... keep pushing, guys! arxiv.org/abs/2511.07464
Seunghyun Seo @SeunghyunSEO7
They also dropped an FSDP2-optimized Muon. Though they don't use Muon for the 2.6B dense model, I think this is just the beginning and they're preparing a larger one. They pipeline Muon's communication with its compute, and the code is neat. Not sure if it's an existing method. huggingface.co/Motif-Technolo…

Ionut-Vlad Modoranu @ionutmodo
@KwangjunA This seems to be similar to our recently shared work, where we use Discrete Cosine Transform (DCT) to perform a cheap low-rank projection of the momentum buffer, followed by Newton-Schulz orthogonalization. Check out our paper here: arxiv.org/abs/2505.17967
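For intuition, the pipeline the reply describes — a cheap DCT-based low-rank projection of the momentum buffer, before Newton-Schulz runs on the smaller matrix — might look roughly like the sketch below. This is a guess at the general shape, not the paper's implementation; the choice of the leading (low-frequency) DCT rows, the shapes, and the function names are all assumptions for illustration.

```python
import numpy as np

def dct_basis(n, k):
    # First k rows of the orthonormal DCT-II basis. Using the leading
    # (low-frequency) rows is an assumption; the paper may pick the
    # subspace differently.
    i = np.arange(n)
    freqs = np.arange(k)
    B = np.sqrt(2.0 / n) * np.cos(np.pi * (i[None, :] + 0.5) * freqs[:, None] / n)
    B[0] /= np.sqrt(2.0)  # DC row needs an extra 1/sqrt(2) for orthonormality
    return B  # shape (k, n); rows are orthonormal, so B @ B.T == I_k

def project_momentum(M, k):
    # Compress an (m x n) momentum buffer to rank k along its rows.
    # Newton-Schulz orthogonalization would then run on this (k x n)
    # matrix instead of the full buffer, which is the cheap part.
    B = dct_basis(M.shape[0], k)
    return B @ M  # shape (k, n)
```

The appeal of a fixed transform like the DCT is that the projection needs no SVD or power iteration: the basis is data-independent and can be applied as a single small matmul.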
Kwangjun Ahn @KwangjunA
New improvement in Dion leads to a speedup that makes orthonormal updates (e.g. Muon) more scalable for larger matrices. The trick: carefully using Newton-Schulz (on smaller matrices) as Dion's backend. Updates to our microsoft/dion codebase are coming soon. Stay tuned!
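The Newton-Schulz orthogonalization mentioned above can be sketched in a few lines of NumPy. This is a minimal sketch of the quintic iteration popularized by the Muon optimizer (the coefficients below are the ones from Muon's reference implementation); the exact variant Dion uses as its backend may differ.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration: pushes all singular values of G
    # toward 1 without computing an SVD. Coefficients as in Muon.
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize so every singular value is <= 1 (needed for convergence).
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the smaller Gram matrix X @ X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After a handful of iterations the singular values of the output cluster near 1, so the update direction is approximately orthonormal; since each step is just matmuls, it maps well onto GPUs, and running it only on smaller matrices is what the speedup above refers to.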
Kwangjun Ahn retweeted
Microsoft Research @MSFTResearch
Join us on Sept 24 at 8 AM PT for Microsoft Research Forum Season 2 – a virtual series highlighting purposeful research and its real-world impact, from fundamental exploration to advancing AI responsibly, scaling innovation through products and open source, and driving positive change for society. Register now: msft.it/6011scy27
Kwangjun Ahn retweeted
Andrej Karpathy @karpathy
@jxbz love the repo! Clean code, good practices but not over-engineered, Triton kernels, well documented, simple reference implementations alongside optimized code. Nice.
Kwangjun Ahn retweeted
Jeremy Bernstein @jxbz
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, Triton kernels for Newton-Schulz, and lots of practical advice (1/2)
Kwangjun Ahn retweeted
Laker Newhouse @LakerNewhouse
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
Kwangjun Ahn retweeted
Mikhail Parakhin @MParakhin
Since nobody asked :-), here is my list of papers not to be missed from ICML:
1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it).
2) MARS: Unleashing the Power of Variance Reduction for Training Large Models
3) ...
Kwangjun Ahn retweeted
Konstantin Mishchenko @konstmish
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
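The Schedule-Free recipe described above — evaluating the gradient at an interpolation between the base iterate and a running average, so no cosine/linear schedule is needed — can be sketched as below. This is a simplified sketch after Defazio et al.'s general scheme, not the library implementation; the 1/t averaging weight and the hyperparameter defaults are illustrative.

```python
import numpy as np

def schedule_free_sgd(grad_fn, w0, lr=0.1, beta=0.9, steps=500):
    # Schedule-Free SGD sketch: z is the base SGD iterate, x the running
    # average that is actually returned/evaluated, and the gradient is
    # taken at the interpolated point y between them.
    z = np.array(w0, dtype=float)
    x = z.copy()
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x  # interpolated evaluation point
        z = z - lr * grad_fn(y)          # plain SGD step on z, constant lr
        x = x + (z - x) / t              # running average of the z iterates
    return x
```

On a toy quadratic f(w) = ||w||^2 / 2 (so grad_fn is just the identity), the averaged iterate x drifts smoothly toward the minimum even though the learning rate never decays, which is the smoother-training-curve behavior the tweet refers to.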