Hao-Jun Michael Shi

39 posts

@hjmshi

Research Scientist, Meta Superintelligence Labs @AIatMeta | Previous: Ph.D. @NU_IEMS, B.S. @uclamath | Numerical Optimization, Deep Learning

Menlo Park, CA · Joined August 2016
202 Following · 144 Followers
Hao-Jun Michael Shi @hjmshi
9/10 By removing the two-loop behavior and decoupling the smoothing parameter from the number of local steps, GPA should provide a new foundation for re-thinking distributed cross-regional training algorithms.
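
For intuition, here is a minimal single-loop primal-averaging sketch (toy quadratic objective, hypothetical variable names; not the paper's exact GPA update). The fast iterate takes a base-optimizer step every iteration, and the smoothing coefficient is a free hyperparameter rather than something tied to a fixed count of local steps:

```python
import numpy as np

def grad(w):
    # toy quadratic objective f(w) = 0.5 * ||w||^2
    return w

z = np.ones(4)        # fast (inner) iterate
x = z.copy()          # smoothed (averaged) iterate
lr, beta = 0.1, 0.9   # step size and smoothing coefficient (free to tune)

for t in range(100):
    z -= lr * grad(z)              # base-optimizer step on the fast iterate
    x = beta * x + (1 - beta) * z  # primal averaging applied every step

print(x)  # the smoothed iterate approaches the minimizer at 0
```
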
Hao-Jun Michael Shi @hjmshi
1/10 Are DiLoCo and Schedule-Free actually related? A brief history and unusually late advertisement for our work: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (see arxiv.org/abs/2512.17131).
[tweet media]
Hao-Jun Michael Shi @hjmshi
@konstmish @runame_ @giffmana @JFPuget @AlexShtf None of this diminishes the work we are about to announce. 😜 As you mentioned, these are both important orthogonal directions. Anyway, apologies, I'm still figuring out how to use Twitter (no dollar signs 😅).
Hao-Jun Michael Shi @hjmshi
@konstmish @runame_ @giffmana @JFPuget @AlexShtf Our paper emphasizes that what is known as "variance adaptation" or "whitening" (i.e., what makes Adam > Signum) offers a similar benefit for Shampoo. Shampoo belongs to the same family of methods as Muon, but there is a conceptual incongruity in how we view these methods.
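
To make the "Adam > Signum" point concrete, here is a small numerical sketch of the element-wise decomposition (bias correction and epsilon omitted for clarity; illustrative names): Adam's update direction factors into Signum's sign and a variance-adaptation magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=5)        # first-moment (momentum) estimate
v = rng.uniform(0.1, 1.0, 5)  # second-moment estimate

adam_dir = m / np.sqrt(v)                     # Adam's raw update direction
signum_dir = np.sign(m)                       # Signum's direction
variance_adaptation = np.abs(m) / np.sqrt(v)  # element-wise scaling

# Adam = element-wise scaled Signum:
assert np.allclose(adam_dir, signum_dir * variance_adaptation)
```
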
dheevatsa @dheevatsa
Awesome to see the Distributed Shampoo optimizer top AlgoPerf! "28% faster training than baseline ... 19% faster than 2nd place." Kudos to the team's tenacity in persistently improving over many months, not only surpassing strong baselines but also making it practically viable!
MLCommons @MLCommons

@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI

Hao-Jun Michael Shi reposted
Runa Eschenhagen @runame_
1/14 Is Muon “better” than Shampoo? We argue that their relationship parallels Adam's relationship with Signum. Analogous to @lukas_balles and Hennig’s (2018) decomposition of Adam into element-wise scaled Signum, we can decompose Shampoo as left- and right-adapted Muon.
[tweet media]
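
A quick numerical check of the identity behind the analogy (single-gradient, square, full-rank case; real Shampoo accumulates statistics over many steps): one-shot Shampoo's (G G^T)^(-1/4) G (G^T G)^(-1/4) recovers the orthogonalized update U V^T that Muon targets, so the left and right preconditioners act as adaptation around Muon's direction.

```python
import numpy as np

rng = np.random.default_rng(0)
# build a well-conditioned gradient matrix G = U0 @ diag(s) @ V0.T
U0, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V0, _ = np.linalg.qr(rng.normal(size=(4, 4)))
G = U0 @ np.diag([3.0, 2.0, 1.5, 1.0]) @ V0.T

def inv_fourth_root(A):
    # A^(-1/4) for a symmetric positive-definite matrix
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** -0.25) @ Q.T

shampoo_dir = inv_fourth_root(G @ G.T) @ G @ inv_fourth_root(G.T @ G)

U, S, Vt = np.linalg.svd(G)
muon_dir = U @ Vt  # orthogonalized gradient: Muon's target direction

assert np.allclose(shampoo_dir, muon_dir)
```
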
Hao-Jun Michael Shi reposted
Vinay S Rao @vinaysrao
While at Meta, I worked on this optimizer wrapper (outer-step lookahead momentum) we're calling Snoo (arxiv.org/abs/2510.15830). You can use it with AdamW or Muon and see really strong scaling. Here's a plot where we ran it against (tuned) AdamW up to 1e23 training-FLOP scales. The "x"s in the plot are compute factors, i.e., the baseline needs "x" times more FLOPs to reach the same loss (instead of simply measuring in steps). We also established a medium-track world record on modded-nanogpt (github.com/KellerJordan/m…). With amazing co-authors (Dominik, Vishal, Michael).
[tweet media]
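
For flavor, a generic sketch of an outer-step lookahead-momentum wrapper (toy objective, hypothetical constants; see the paper for Snoo's actual update rule): the inner optimizer runs k steps on the fast weights, then the slow weights take a momentum step along the fast-minus-slow direction and the fast weights are reset.

```python
import numpy as np

def grad(w):
    # toy quadratic objective f(w) = 0.5 * ||w||^2
    return w

slow = np.ones(4)
fast = slow.copy()
velocity = np.zeros_like(slow)
lr, alpha, mu, k = 0.1, 0.5, 0.5, 5

for outer in range(20):
    for _ in range(k):        # k inner steps (stand-in for AdamW or Muon)
        fast -= lr * grad(fast)
    velocity = mu * velocity + (fast - slow)  # momentum on the outer "pseudo-gradient"
    slow = slow + alpha * velocity            # outer (slow-weight) step
    fast = slow.copy()                        # reset fast weights to slow

print(slow)  # approaches the minimizer at 0
```
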
Hao-Jun Michael Shi reposted
Runa Eschenhagen @runame_
1/9 In practice, the Shampoo optimizer crucially relies on several heuristics. In our NeurIPS 2025 spotlight paper, we investigate the role of learning rate grafting and infrequent preconditioner updates in Shampoo by decomposing its preconditioner. arxiv.org/abs/2506.03595
[tweet media]
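
For context, a minimal sketch of learning-rate grafting (illustrative, Frobenius norms per layer; see the paper for the exact variant studied): the grafted step takes its direction from Shampoo's preconditioned gradient and its step size from a reference method such as Adam.

```python
import numpy as np

def graft(shampoo_update, reference_update, eps=1e-12):
    # direction from Shampoo, magnitude from the reference method
    direction = shampoo_update / (np.linalg.norm(shampoo_update) + eps)
    return np.linalg.norm(reference_update) * direction

rng = np.random.default_rng(0)
shampoo_update = rng.normal(size=(4, 4))
adam_update = rng.normal(size=(4, 4))

grafted = graft(shampoo_update, adam_update)
# the grafted step points like Shampoo but is as long as Adam's step:
assert np.isclose(np.linalg.norm(grafted), np.linalg.norm(adam_update))
```
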
Hao-Jun Michael Shi @hjmshi
Rejoining Twitter (now X) to publicize some of our recent work (and to give @runame_ and others some hype)! This work comes specifically from the MSL Infrastructure Kernels and Optimizations and AI and Systems Co-Design teams @Meta.
Hao-Jun Michael Shi reposted
Charlotte Abrahamson @chabrahamson
Me and @hjmshi but he's the computational researcher 😂😂😂