Satoki Ishikawa

661 posts

Satoki Ishikawa

@SisForCollege

TokyoTech 25D Dept. of Computer Science | R.Yokota lab | DNN https://t.co/BEJmlAWZD8: https://t.co/3NoUYlliTa

가입일 Ağustos 2018

1.1K 팔로잉530 팔로워

Satoki Ishikawa@SisForCollege·23 Mar

@YukiTakeza おめでとうございます🎉

日本語

189

Yuki Takezawa@YukiTakeza·23 Mar

I have successfully completed my PhD today! Grateful for all the support along the way.

English

9.1K

Satoki Ishikawa 리트윗함

Taishi Nakamura🇧🇷ICLR2026@taishinakamura_·26 Oca

Our paper "Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks" has been accepted to ICLR 2026! 🎉 See you in Brazil! 🇧🇷

English

189

38K

Satoki Ishikawa 리트윗함

Andrew Gordon Wilson@andrewgwils·8 Kas

Bach is so timeless because he wasn't writing for people, he was writing for a higher power. Try writing your next paper for God. Imagine how many rubbish papers we wouldn't see anymore. Your audience sees your every thought and intention. There would be no ego, no pretense.

English

286

36.2K

Satoki Ishikawa@SisForCollege·6 Kas

I can accept that the max LR transfers well with μP. However, the optimal LR seems far more complex. It's influenced by many other factors, such as finding a rate that avoids "forgetting" or instability. Of course, alignment between vectors would be also important...

English

255

Satoki Ishikawa@SisForCollege·6 Kas

One thing I've been wondering about HP transfer in μP is what criterion they're using to define "transfer." For instance, TP4 seems to state that the max LR (the maximum LR that doesn't diverge) transfers. But then, TP5 claims that the optimal LR transfers. Which is correct?

Jason Lee@jasondeanlee

Proof by picture of why lr convergence is not useful unless it is fast relative to loss/predictions. Credit to nikhil Ghosh, Denny Wu, and Alberto for studying this and critical of the muP series of conclusions and overclaims.

English

1.1K

Satoki Ishikawa@SisForCollege·30 Eyl

@myai100 Thank you for pointing this out! The AdaGrad paper is a very important paper!!

English

inikishev@myai100·29 Eyl

@SisForCollege jmlr.org/papers/volume1… Perhaps Adagrad itself could be added because many methods (Shampoo, KFAC) approximate the full-matrix version

English

Satoki Ishikawa@SisForCollege·27 Eyl

I'm updating awesome-second-order optimization. If you find important / interesting papers not cited in this repository, please let me know. github.com/riverstone496/…

English

1.3K

Satoki Ishikawa@SisForCollege·27 Eyl

@tmoellenhoff Thank you for pointing that out! I completely forgot that important paper. I’ve updated the list

English

Thomas Möllenhoff@tmoellenhoff·27 Eyl

@SisForCollege Shameless plug of my own works :) arxiv.org/abs/1706.04638 (equivalent to LocoProp-S, but from 2018) arxiv.org/abs/2402.17641 (could be added to Bayesian section)

English

153

Satoki Ishikawa 리트윗함

Torsten Hoefler 🇨🇭@thoefler·25 Nis

Rio Yokota from Tokyo Tech talks about scaling laws for #HPC, #AI training, inference, and spending 💸. We're in the exponential scaling part of a logistic curve - when will we hit the bottom? Nice discussion and analogies between the fields 🤔.

English

1.5K

Satoki Ishikawa@SisForCollege·18 Eyl

ACT-X「次世代AIを築く数理・情報科学の革新」に採択されました．引き続き，ニューラルネットワークの最適化について，深い理解を得られるよう研究していきます😁

日本語

6.6K

Satoki Ishikawa@SisForCollege·10 Ağu

@cloneofsimo You’re in Tokyo! Nice! If you’re interested and have a moment, feel free to drop by our lab. No art here, sadly, so it might not be that exciting😭

English

999

Simo Ryu@cloneofsimo·10 Ağu

Highly recommend Manten-sushi (there are 3) if you visit tokyo Best omakase for the price.

English

4.8K

Satoki Ishikawa 리트윗함

Taishi Nakamura🇧🇷ICLR2026@taishinakamura_·18 Tem

I won’t make it to ICML this year, but our work will be presented at the 2nd AI for Math Workshop @ ICML 2025 (@ai4mathworkshop). Huge thanks to my co‑author @SisForCollege for presenting on my behalf. please drop by if you’re around!

English

5.7K

Satoki Ishikawa 리트윗함

Soumith Chintala@soumithchintala·16 Tem

considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by default. If anyone wants to take a crack at it... #issuecomment-3070108227" target="_blank" rel="nofollow noopener">github.com/pytorch/pytorc…

English

848

81.8K

Satoki Ishikawa@SisForCollege·14 Tem

@borisdayma @JesseFarebro @_arohan_ 2505.02222 doesn’t seem to include any theoretical derivations, so it’s hard to know whether it really implements muP. At the very least, Bernstein’s learning‐rate scaling uses the same parameterization as our muP, and since both are derived mathematically, they should be muP.

English

Boris Dayma 🖍️@borisdayma·14 Tem

@SisForCollege @JesseFarebro @_arohan_ Thanks for referencing Bernstein, will check better. I had looked at "Practical Efficiency of Muon for Pretraining" and it seemed that they applied equivalent rules as the original ones for Adam: arxiv.org/abs/2505.02222

English

116

Boris Dayma 🖍️@borisdayma·14 Tem

MUP has been on my mind forever! Now I came across this gem from @JesseFarebro : github.com/JesseFarebro/f… It automatically handles it on JAX/Flax 😍 Just need to see what to adjust for Muon / Shampoo / PSGD-kron (init params + LR scaling)

English

6.7K

Satoki Ishikawa@SisForCollege·14 Tem

@borisdayma @JesseFarebro @_arohan_ I agree that Muon = Shampoo and I think Muon needs the same kind of scaling as Shampoo. （I believe my muP scaling is the same as the scaling of Muon by Bernstein

English

149

Boris Dayma 🖍️@borisdayma·14 Tem

@SisForCollege @JesseFarebro Oh interesting so you do have coeffs for Shampoo… I thought that there was no need since I understood they are not needed for Muon and @_arohan_ keeps on saying that Muon = Shampoo

English

209

Satoki Ishikawa@SisForCollege·27 Haz

@evaninwords In my view, grokking occurs when certain conditions involving init scale, weight decay, and learning rate are met. In other words, grokking usually happens due to insufficient hyperparameter tuning. Therefore, it would be important to plot how large the grokking area is.

English

Evan Walters@evaninwords·27 Haz

@SisForCollege I wonder if various weight constraints would affect grokking then?

English

Evan Walters@evaninwords·26 Haz

Toy grokking problems can be pretty sensitive, but turns out PSGD is robust! 💪 For modular arithmetic, PSGD does better than AdamW across the board for batch size, model depth, dim, and num heads. These two plots are best runs for each. Sweeps below 👇

Essential AI@essential_ai

[1/5] We have a quick update to share, which contradicts our hypothesis regarding the abilities of Muon and Adam vis-a-vis Grokking.

English

4.9K

Satoki Ishikawa@SisForCollege·26 Haz

@evaninwords arxiv.org/abs/2210.01117 arxiv.org/abs/2310.06110

QME

Satoki Ishikawa@SisForCollege·26 Haz

@evaninwords I think grokking happens when the weight scale is too large for the task and the optimizer. The optimal weight scale depends on the optimizer, so I’d love to see the results of sweeping weight scale or weight decay. (And, I believe PSGD outperforms on that.)

English

Satoki Ishikawa@SisForCollege·25 Haz

I find a very interesting μP paper on the embedding LR. They propose new embedding LR scale when vocab size is much larger than width. arxiv.org/abs/2506.15025

English

9.8K

탐색

@YukiTakeza @myai100 @tmoellenhoff @cloneofsimo @ai4mathworkshop @borisdayma @JesseFarebro @_arohan_