Satoki Ishikawa

661 posts


@SisForCollege

TokyoTech 25D Dept. of Computer Science | R.Yokota lab | DNN https://t.co/BEJmlAWZD8: https://t.co/3NoUYlliTa

Joined August 2018
1.1K following · 530 followers
Yuki Takezawa @YukiTakeza
I have successfully completed my PhD today! Grateful for all the support along the way.
Satoki Ishikawa retweeted
Taishi Nakamura🇧🇷ICLR2026 @taishinakamura_
Our paper "Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks" has been accepted to ICLR 2026! 🎉 See you in Brazil! 🇧🇷
Satoki Ishikawa retweeted
Andrew Gordon Wilson @andrewgwils
Bach is so timeless because he wasn't writing for people, he was writing for a higher power. Try writing your next paper for God. Imagine how many rubbish papers we wouldn't see anymore. Your audience sees your every thought and intention. There would be no ego, no pretense.
Satoki Ishikawa @SisForCollege
I can accept that the max LR transfers well with μP. However, the optimal LR seems far more complex: it's influenced by many other factors, such as finding a rate that avoids "forgetting" or instability. Of course, alignment between vectors would also be important...
Satoki Ishikawa @SisForCollege
One thing I've been wondering about HP transfer in μP is what criterion they're using to define "transfer." For instance, TP4 seems to state that the max LR (the maximum LR that doesn't diverge) transfers. But then, TP5 claims that the optimal LR transfers. Which is correct?
Jason Lee @jasondeanlee

Proof by picture of why LR convergence is not useful unless it is fast relative to loss/predictions. Credit to Nikhil Ghosh, Denny Wu, and Alberto for studying this and being critical of the muP series of conclusions and overclaims.

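To make the two criteria in this thread concrete, a toy sweep along the following lines can compare the LR that minimizes final loss against the largest LR that does not diverge. Everything below is illustrative: a rough μP-style Adam grouping for a 2-layer MLP on random data, with hypothetical widths and LR grid, not an experiment from the thread.

```python
import torch
import torch.nn as nn

def final_loss(width, lr, steps=300, base_width=64):
    # Train a toy 2-layer MLP at the given width and LR; return the final
    # training loss, or inf if training diverged.
    torch.manual_seed(0)
    x, y = torch.randn(512, 32), torch.randn(512, 1)
    net = nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 1))
    # Rough μP-style Adam grouping: the input layer keeps the base LR,
    # while the output layer's LR shrinks like 1/width as the model widens.
    opt = torch.optim.Adam([
        {"params": net[0].parameters(), "lr": lr},
        {"params": net[2].parameters(), "lr": lr * base_width / width},
    ])
    loss = torch.tensor(float("inf"))
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    return loss.item() if torch.isfinite(loss) else float("inf")

# Two notions of "transfer": the argmin over LR (optimal LR) and the
# largest LR with finite loss (max stable LR) can shift differently
# as the width grows.
lrs = [10.0 ** e for e in range(-4, 1)]
for width in (64, 256, 1024):
    losses = {lr: final_loss(width, lr) for lr in lrs}
    finite = {lr: l for lr, l in losses.items() if l != float("inf")}
    opt_lr = min(finite, key=finite.get) if finite else None
    max_lr = max(finite) if finite else None
    print(width, "optimal LR:", opt_lr, "max stable LR:", max_lr)
```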
Satoki Ishikawa @SisForCollege
@myai100 Thank you for pointing this out! The AdaGrad paper is a very important paper!!
Satoki Ishikawa @SisForCollege
I'm updating awesome-second-order optimization. If you find important / interesting papers not cited in this repository, please let me know. github.com/riverstone496/…
Satoki Ishikawa @SisForCollege
@tmoellenhoff Thank you for pointing that out! I completely forgot that important paper. I’ve updated the list.
Satoki Ishikawa retweeted
Torsten Hoefler 🇨🇭 @thoefler
Rio Yokota from Tokyo Tech talks about scaling laws for #HPC, #AI training, inference, and spending 💸. We're in the exponential scaling part of a logistic curve - when will we hit the bottom? Nice discussion and analogies between the fields 🤔.
Satoki Ishikawa @SisForCollege
My proposal has been accepted to ACT-X 「次世代AIを築く数理・情報科学の革新」 (Innovations in mathematical and information sciences for building next-generation AI). I'll keep working toward a deeper understanding of neural network optimization 😁
Satoki Ishikawa @SisForCollege
@cloneofsimo You’re in Tokyo! Nice! If you’re interested and have a moment, feel free to drop by our lab. No art here, sadly, so it might not be that exciting 😭
Simo Ryu @cloneofsimo
Highly recommend Manten-sushi (there are 3) if you visit Tokyo. Best omakase for the price.
Satoki Ishikawa retweeted
Taishi Nakamura🇧🇷ICLR2026 @taishinakamura_
I won’t make it to ICML this year, but our work will be presented at the 2nd AI for Math Workshop @ ICML 2025 (@ai4mathworkshop). Huge thanks to my co‑author @SisForCollege for presenting on my behalf. Please drop by if you’re around!
Satoki Ishikawa retweeted
Soumith Chintala @soumithchintala
considering Muon is so popular and validated at scale, we've just decided to welcome a PR for it in PyTorch core by default. If anyone wants to take a crack at it... github.com/pytorch/pytorc… (#issuecomment-3070108227)
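For readers unfamiliar with the optimizer, here is a minimal sketch of the Muon update: SGD-momentum on weight matrices, with the momentum buffer approximately orthogonalized by a quintic Newton-Schulz iteration before the step. The coefficients follow Keller Jordan's reference implementation; this sketch is an illustration, not the PyTorch PR under discussion.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G with a quintic Newton-Schulz iteration
    # (coefficients from Keller Jordan's reference Muon implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)  # bring the spectral norm near/below 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T  # work with the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

class Muon(torch.optim.Optimizer):
    # Minimal Muon: momentum on 2D weights, orthogonalized before the step.
    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None or p.ndim != 2:
                    continue  # Muon is defined for weight matrices only
                buf = self.state[p].setdefault("momentum", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz(buf), alpha=-group["lr"])
```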
Satoki Ishikawa @SisForCollege
@borisdayma @JesseFarebro @_arohan_ 2505.02222 doesn’t seem to include any theoretical derivations, so it’s hard to know whether it really implements muP. At the very least, Bernstein’s learning‐rate scaling uses the same parameterization as our muP, and since both are derived mathematically, they should be muP.
Boris Dayma 🖍️ @borisdayma
MUP has been on my mind forever! Now I came across this gem from @JesseFarebro: github.com/JesseFarebro/f… It automatically handles it in JAX/Flax 😍 Just need to see what to adjust for Muon / Shampoo / PSGD-kron (init params + LR scaling)
Satoki Ishikawa @SisForCollege
@borisdayma @JesseFarebro @_arohan_ I agree that Muon = Shampoo, and I think Muon needs the same kind of scaling as Shampoo. (I believe my muP scaling is the same as Bernstein's scaling for Muon.)
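One common statement of the scaling at issue, paraphrased rather than quoted from either source: each matrix parameter's LR is scaled by sqrt(fan_out / fan_in), so the update's operator norm, and hence the typical change in the layer's features, stays width-invariant. A sketch under that assumption:

```python
import math

def spectral_lr(base_lr, fan_out, fan_in):
    # Spectral-style scaling (after Bernstein et al.): scale each matrix
    # LR by sqrt(fan_out / fan_in) so the update's RMS-to-RMS operator
    # norm, and thus the typical change in features, is width-invariant.
    return base_lr * math.sqrt(fan_out / fan_in)

# A square hidden matrix gets a width-independent factor, while an
# unembedding of shape (vocab, width) gets sqrt(vocab / width).
print(spectral_lr(0.02, 1024, 1024))   # hidden layer: factor 1
print(spectral_lr(0.02, 50257, 1024))  # output / unembedding layer
```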
Satoki Ishikawa @SisForCollege
@evaninwords In my view, grokking occurs when certain conditions involving init scale, weight decay, and learning rate are met. In other words, grokking usually happens due to insufficient hyperparameter tuning. Therefore, it would be important to plot how large the grokking area is.
Evan Walters @evaninwords
@SisForCollege I wonder if various weight constraints would affect grokking then?
Satoki Ishikawa @SisForCollege
@evaninwords I think grokking happens when the weight scale is too large for the task and the optimizer. The optimal weight scale depends on the optimizer, so I’d love to see the results of sweeping weight scale or weight decay. (And, I believe PSGD outperforms on that.)
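A toy version of such a sweep, assuming a modular-addition task, a 2-layer MLP, and AdamW (hypothetical settings throughout; the thread's PSGD suggestion is not reproduced here):

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

def mod_add_test_acc(init_scale, weight_decay, p=97, width=256, steps=3000):
    # Train a 2-layer MLP on modular addition and report test accuracy:
    # a toy probe of how the "grokking area" moves with init scale and WD.
    torch.manual_seed(0)
    pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
    x = F.one_hot(pairs, p).float().reshape(len(pairs), 2 * p)
    y = (pairs[:, 0] + pairs[:, 1]) % p
    perm = torch.randperm(len(pairs))
    tr, te = perm[: len(pairs) // 2], perm[len(pairs) // 2 :]
    model = nn.Sequential(nn.Linear(2 * p, width), nn.ReLU(), nn.Linear(width, p))
    with torch.no_grad():
        for w in model.parameters():
            w.mul_(init_scale)  # larger init scale tends to delay generalization
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x[tr]), y[tr]).backward()
        opt.step()
    with torch.no_grad():
        return (model(x[te]).argmax(-1) == y[te]).float().mean().item()

# Map the (init scale, weight decay) grid to see where delayed
# generalization shows up within the step budget.
for scale, wd in itertools.product([1.0, 3.0, 9.0], [0.0, 1e-2, 1e-1]):
    print(f"init x{scale}, wd={wd}: test acc {mod_add_test_acc(scale, wd):.2f}")
```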
Satoki Ishikawa @SisForCollege
I found a very interesting μP paper on the embedding LR. They propose a new embedding LR scaling for when the vocab size is much larger than the width. arxiv.org/abs/2506.15025
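For context, a sketch of the standard μP Adam grouping that such a paper would modify; the "embed" name filter and all settings are placeholders, and the paper's vocab-size-dependent rule itself is not reproduced here.

```python
import torch

def mup_adam_groups(model, base_lr, base_width, width):
    # Standard μP-style Adam grouping: hidden matrices get base_lr scaled
    # by 1/(width ratio); embeddings and other vector-like parameters keep
    # the base LR. A vocab-size-dependent rule would plug into the second
    # group's "lr" entry.
    hidden, vector_like = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:  # name filter is a placeholder
            hidden.append(p)
        else:
            vector_like.append(p)
    return [
        {"params": hidden, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]

# Usage (hypothetical model and widths):
# optimizer = torch.optim.Adam(mup_adam_groups(model, 3e-4, 256, 1024))
```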