Hao-Jun Michael Shi

39 posts

@hjmshi

Research Scientist, Meta Superintelligence Labs @AIatMeta | Previous: Ph.D. @NU_IEMS, B.S. @uclamath | Numerical Optimization, Deep Learning

Menlo Park, CA · Joined August 2016
202 Following · 144 Followers
Hao-Jun Michael Shi @hjmshi
9/10 By removing the two-loop behavior and decoupling the smoothing parameter from the number of local steps, GPA should provide a new foundation for re-thinking distributed cross-regional training algorithms.
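
For intuition, here is a minimal single-loop primal-averaging sketch (toy quadratic objective, hypothetical variable names; not the paper's exact GPA update). The fast iterate takes a base-optimizer step every iteration, and the smoothing coefficient is a free hyperparameter rather than something tied to a fixed count of local steps:

```python
import numpy as np

def grad(w):
    # toy quadratic objective f(w) = 0.5 * ||w||^2
    return w

z = np.ones(4)        # fast (inner) iterate
x = z.copy()          # smoothed (averaged) iterate
lr, beta = 0.1, 0.9   # step size and smoothing coefficient (free to tune)

for t in range(100):
    z -= lr * grad(z)              # base-optimizer step on the fast iterate
    x = beta * x + (1 - beta) * z  # primal averaging applied every step

print(x)  # the smoothed iterate approaches the minimizer at 0
```
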
Hao-Jun Michael Shi @hjmshi
1/10 Are DiLoCo and Schedule-Free actually related? A brief history and unusually late advertisement for our work: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (see arxiv.org/abs/2512.17131).
[tweet media]
Hao-Jun Michael Shi @hjmshi
@konstmish @runame_ @giffmana @JFPuget @AlexShtf None of this diminishes the work we are about to announce. 😜 As you mentioned, these are both important orthogonal directions. Anyway, apologies, I'm still figuring out how to use Twitter (no dollar signs 😅).
Hao-Jun Michael Shi @hjmshi
@konstmish @runame_ @giffmana @JFPuget @AlexShtf Our paper emphasizes that what is known as "variance adaptation" or "whitening" (i.e., what makes Adam > Signum) offers a similar benefit for Shampoo. Shampoo belongs to the same family of methods as Muon, but there is a conceptual incongruity in how we view these methods.
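
To make the "Adam > Signum" point concrete, here is a small numerical sketch of the element-wise decomposition (bias correction and epsilon omitted for clarity; illustrative names): Adam's update direction factors into Signum's sign and a variance-adaptation magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=5)        # first-moment (momentum) estimate
v = rng.uniform(0.1, 1.0, 5)  # second-moment estimate

adam_dir = m / np.sqrt(v)                     # Adam's raw update direction
signum_dir = np.sign(m)                       # Signum's direction
variance_adaptation = np.abs(m) / np.sqrt(v)  # element-wise scaling

# Adam = element-wise scaled Signum:
assert np.allclose(adam_dir, signum_dir * variance_adaptation)
```
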
dheevatsa @dheevatsa
Awesome to see the Distributed Shampoo optimizer top AlgoPerf! "28% faster training than baseline ... 19% faster than 2nd place." Kudos to the team's tenacity in persistently improving over many months, not only surpassing strong baselines but also making it practically viable!
MLCommons @MLCommons

@MLCommons #AlgoPerf results are in! 🏁 $50K prize competition yielded 28% faster neural net training with non-diagonal preconditioning beating Nesterov Adam. New SOTA for hyperparameter-free algorithms too! Full details in our blog. mlcommons.org/2024/08/mlc-al… #AIOptimization #AI

Hao-Jun Michael Shi reposted
Runa Eschenhagen @runame_
1/14 Is Muon “better” than Shampoo? We argue that their relationship parallels Adam's relationship with Signum. Analogous to @lukas_balles and Hennig’s (2018) decomposition of Adam into element-wise scaled Signum, we can decompose Shampoo as left- and right-adapted Muon.
[tweet media]
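
A quick numerical check of the identity behind the analogy (single-gradient, square, full-rank case; real Shampoo accumulates statistics over many steps): one-shot Shampoo's (G G^T)^(-1/4) G (G^T G)^(-1/4) recovers the orthogonalized update U V^T that Muon targets, so the left and right preconditioners act as adaptation around Muon's direction.

```python
import numpy as np

rng = np.random.default_rng(0)
# build a well-conditioned gradient matrix G = U0 @ diag(s) @ V0.T
U0, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V0, _ = np.linalg.qr(rng.normal(size=(4, 4)))
G = U0 @ np.diag([3.0, 2.0, 1.5, 1.0]) @ V0.T

def inv_fourth_root(A):
    # A^(-1/4) for a symmetric positive-definite matrix
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(w ** -0.25) @ Q.T

shampoo_dir = inv_fourth_root(G @ G.T) @ G @ inv_fourth_root(G.T @ G)

U, S, Vt = np.linalg.svd(G)
muon_dir = U @ Vt  # orthogonalized gradient: Muon's target direction

assert np.allclose(shampoo_dir, muon_dir)
```
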
Hao-Jun Michael Shi reposted
Vinay S Rao @vinaysrao
While at Meta, I worked on this optimizer wrapper (outer-step lookahead momentum) we're calling Snoo (arxiv.org/abs/2510.15830). You can use it with AdamW or Muon and see really strong scaling. Here's a plot where we ran it against (tuned) AdamW up to 1e23 training-FLOP scales. The "x"s in the plot are compute factors, i.e., the baseline needs "x" times more FLOPs to reach the same loss (instead of simply measuring in steps). We also established a medium-track world record on modded-nanogpt (github.com/KellerJordan/m…). With amazing co-authors (Dominik, Vishal, Michael).
[tweet media]
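
For flavor, a generic sketch of an outer-step lookahead-momentum wrapper (toy objective, hypothetical constants; see the paper for Snoo's actual update rule): the inner optimizer runs k steps on the fast weights, then the slow weights take a momentum step along the fast-minus-slow direction and the fast weights are reset.

```python
import numpy as np

def grad(w):
    # toy quadratic objective f(w) = 0.5 * ||w||^2
    return w

slow = np.ones(4)
fast = slow.copy()
velocity = np.zeros_like(slow)
lr, alpha, mu, k = 0.1, 0.5, 0.5, 5

for outer in range(20):
    for _ in range(k):        # k inner steps (stand-in for AdamW or Muon)
        fast -= lr * grad(fast)
    velocity = mu * velocity + (fast - slow)  # momentum on the outer "pseudo-gradient"
    slow = slow + alpha * velocity            # outer (slow-weight) step
    fast = slow.copy()                        # reset fast weights to slow

print(slow)  # approaches the minimizer at 0
```
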
Hao-Jun Michael Shi reposted
Runa Eschenhagen @runame_
1/9 In practice, the Shampoo optimizer crucially relies on several heuristics. In our NeurIPS 2025 spotlight paper, we investigate the role of learning rate grafting and infrequent preconditioner updates in Shampoo by decomposing its preconditioner. arxiv.org/abs/2506.03595
[tweet media]
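
For context, a minimal sketch of learning-rate grafting (illustrative, Frobenius norms per layer; see the paper for the exact variant studied): the grafted step takes its direction from Shampoo's preconditioned gradient and its step size from a reference method such as Adam.

```python
import numpy as np

def graft(shampoo_update, reference_update, eps=1e-12):
    # direction from Shampoo, magnitude from the reference method
    direction = shampoo_update / (np.linalg.norm(shampoo_update) + eps)
    return np.linalg.norm(reference_update) * direction

rng = np.random.default_rng(0)
shampoo_update = rng.normal(size=(4, 4))
adam_update = rng.normal(size=(4, 4))

grafted = graft(shampoo_update, adam_update)
# the grafted step points like Shampoo but is as long as Adam's step:
assert np.isclose(np.linalg.norm(grafted), np.linalg.norm(adam_update))
```
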
Hao-Jun Michael Shi @hjmshi
Rejoining Twitter (now X) to publicize some of our recent work (and to give @runame_ and others some hype)! This work comes specifically from the MSL Infrastructure Kernels and Optimizations and AI and Systems Co-Design teams @Meta.
Hao-Jun Michael Shi reposted
Charlotte Abrahamson @chabrahamson
Me and @hjmshi but he's the computational researcher 😂😂😂