Fabian Schaipp

508 posts


@FSchaipp

working on optimization for machine learning. currently postdoc @inria_paris.

Paris, France · Joined July 2020
746 Following · 1.2K Followers
Pinned Tweet
JFPuget 🇺🇦🇨🇦🇬🇱
Interestingly, people responding to this comment stress that Muon is faster than AdamW, even for CNNs. But they don't provide evidence. While I trust them that Muon is faster than AdamW, what I care about is the quality of the resulting model: is a Muon-trained CNN better than a CNN trained with AdamW?

In the CIFAR-10 speedrun, Muon is compared to SGD, not AdamW, and it yields a worse model: 94% accuracy vs. 96% with SGD. github.com/KellerJordan/c…

I welcome any evidence of a CNN trained with Muon being better than the same CNN trained with AdamW.
JFPuget 🇺🇦🇨🇦🇬🇱@JFPuget

@VukRosic99 Not sure Muon would help for CNNs.

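A minimal sketch of the requested experiment, assuming a Muon-style update built from the quintic Newton-Schulz iteration (coefficients as in Keller Jordan's write-up; the model, LRs, and harness comments are placeholders, not the speedrun's code): train the same CNN once with AdamW and once with orthogonalized momentum on the matrix-shaped weights, then compare final test accuracy.

```python
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration (coefficients from Keller Jordan's Muon write-up)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                      # work with the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_style_step(params, momentum, lr=0.02, beta=0.95):
    """Momentum plus orthogonalized update for matrix-like parameters
    (sketch; conv kernels are flattened to 2D before orthogonalization)."""
    for p, buf in zip(params, momentum):
        buf.mul_(beta).add_(p.grad)
        g2d = buf.reshape(buf.size(0), -1)
        p.add_(newton_schulz(g2d).reshape_as(p), alpha=-lr)

# Comparison harness (pseudocode):
#   model_a = make_cnn(); train with torch.optim.AdamW(model_a.parameters())
#   model_b = make_cnn(); matrix params via muon_style_step, the rest via AdamW
#   compare final test accuracy -- that, not wall-clock time, answers the question.
```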
Rishabh Anand@rishabh16_·
Interesting to see how optimiser research hasn't really hit AI4Bio (yet). Not a single recent paper I've seen uses second-order optimisers like Shampoo or KFAC, let alone fancier stuff like Muon. Guess Adam(W) is here to stay … for a while. We kinda knew BioML lags behind trad ML by 1-2 years, but I thought this'd have happened by now.
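For readers who haven't met the names: Shampoo preconditions a matrix gradient with Kronecker-factored second-moment statistics. A toy single-step sketch, assuming a 2D parameter (real implementations add root-inverse caching, grafting, and block partitioning):

```python
import torch

def shampoo_step(param, grad, state, lr=1e-3, eps=1e-6):
    """One toy Shampoo step: accumulate Kronecker factors L = sum(G G^T)
    and R = sum(G^T G), then precondition by L^{-1/4} G R^{-1/4}."""
    if "L" not in state:
        state["L"] = eps * torch.eye(grad.size(0), device=grad.device)
        state["R"] = eps * torch.eye(grad.size(1), device=grad.device)
    state["L"] += grad @ grad.T
    state["R"] += grad.T @ grad

    def inv_fourth_root(M):
        # inverse matrix fourth root via eigendecomposition
        vals, vecs = torch.linalg.eigh(M)
        return vecs @ torch.diag(vals.clamp_min(eps) ** -0.25) @ vecs.T

    update = inv_fourth_root(state["L"]) @ grad @ inv_fourth_root(state["R"])
    with torch.no_grad():
        param -= lr * update
```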
Axel Darmouni@ADarmouni·
I think that it's three things:
—> the method makes sense as an improvement, so they don't go further once they see the improvement
—> while around 1e-4 seems consistent, it would appear that varying the LR on small models provokes WILD discrepancies, which was indeed less explored
—> a hyperparameter sweep takes time, and given the variance that means lots of runs; if your training takes at least one day, it will definitely slow down publication
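The cost argument is concrete: with wild run-to-run variance, picking an LR needs several seeds per grid point, and every grid point is a full training run. A hedged sketch of the bookkeeping (train_and_eval is a hypothetical stand-in for one complete run returning a score):

```python
import statistics

def lr_sweep(train_and_eval, lrs=(1e-5, 3e-5, 1e-4, 3e-4, 1e-3), seeds=(0, 1, 2)):
    """Grid over learning rates with multiple seeds; report mean/stdev so
    noise on small models doesn't pick the winner by luck."""
    results = {}
    for lr in lrs:
        scores = [train_and_eval(lr=lr, seed=s) for s in seeds]
        results[lr] = (statistics.mean(scores), statistics.stdev(scores))
    best = max(results, key=lambda lr: results[lr][0])
    return best, results  # len(lrs) * len(seeds) full training runs in total
```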
Fabian Schaipp@FSchaipp·
"please don't call it matrix sign"
[attached image]
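Presumably the screenshot concerns Muon's orthogonalization step (an assumption; the image itself isn't recoverable). The point of the quip, in standard terms: for a rectangular gradient with reduced SVD $G = U\Sigma V^\top$, the operation is the polar factor, while the matrix sign function proper is a square-matrix object:

$$G \mapsto UV^\top = G\,(G^\top G)^{-1/2}, \qquad \operatorname{sign}(A) = A\,(A^2)^{-1/2},$$

and the two coincide only in special cases, e.g. symmetric nonsingular $A$.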
Fabian Schaipp@FSchaipp·
> 4.5k citations, but two typos in one of the central equations
[attached image]
Fabian Schaipp@FSchaipp·
@dayal_kalra EoS to me describes growing sharpness despite a constant LR, until stabilization kicks in. the 7B experiment can't confirm that, because the growing sharpness could be due solely to the cosine schedule. also, it seems there is no oscillatory phase in your train run, right?
Dayal Kalra@dayal_kalra·
@FSchaipp Thanks for the thoughtful comment! Here's my reasoning: As the LR decays, the EoS threshold increases. While critical sharpness could have stayed constant or decreased, it increases continuously. Curious about your intuition here.
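The quantity under discussion is the sharpness (top Hessian eigenvalue) relative to the stability threshold 2/LR. A minimal sketch, assuming PyTorch and placeholder model details, of tracking both under a cosine schedule; it makes the point above measurable, since under cosine decay the threshold 2/LR rises on its own:

```python
import math
import torch

def sharpness(loss, params, iters=20):
    """Estimate the top Hessian eigenvalue via power iteration on
    Hessian-vector products (autograd only, no explicit Hessian)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        Hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((u * h).sum() for u, h in zip(v, Hv)).item()
        v = [h.detach() for h in Hv]
    return eig

def cosine_lr(step, total_steps, lr_max):
    return 0.5 * lr_max * (1 + math.cos(math.pi * step / total_steps))

# Log sharpness(loss, params) against 2.0 / cosine_lr(step, ...) during training:
# rising sharpness that merely tracks the rising threshold is explained by the
# schedule alone; EoS in the constant-LR sense needs sharpness growth at a fixed
# LR, typically with an oscillatory phase once the threshold is hit.
```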
Fabian Schaipp@FSchaipp·
@xidulu WSD in abstract, doesn't cite any of the relevant papers 🫣
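For context, WSD is the warmup-stable-decay schedule discussed in the papers being alluded to. A minimal sketch with placeholder fractions (warmup length and cooldown shape vary across the literature):

```python
def wsd_lr(step, total_steps, lr_max, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-Stable-Decay (sketch): linear warmup, long constant plateau,
    linear cooldown at the end. Fractions are placeholders."""
    warmup_end = int(warmup_frac * total_steps)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup_end:
        return lr_max * (step + 1) / max(warmup_end, 1)
    if step < decay_start:
        return lr_max
    return lr_max * (total_steps - step) / max(total_steps - decay_start, 1)
```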
Rosinality@rosinality·
Scaling law vs. 𝜇P, and optimization dynamics for problems like module-wise LR.
[attached image]
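On module-wise LRs: under μP with Adam-type optimizers, matrix-like hidden weights typically get their LR scaled by base-width/width, while vector-like parameters keep the base LR. A rough sketch of the param-group plumbing (the actual μP prescription is finer-grained, e.g. embeddings are treated separately):

```python
import torch

def mup_param_groups(model, lr_base, width_base, width):
    """muP-flavored module-wise learning rates (rough sketch): scale the LR
    of matrix-like parameters by width_base/width; vector-like parameters
    (biases, norm scales) keep the base LR."""
    matrix_like = [p for p in model.parameters() if p.ndim >= 2]
    vector_like = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrix_like, "lr": lr_base * width_base / width},
        {"params": vector_like, "lr": lr_base},
    ]

# usage sketch: opt = torch.optim.AdamW(mup_param_groups(model, 3e-4, 256, 1024))
```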
Fabian Schaipp@FSchaipp·
those old Weltatlas maps are just so aesthetic
[three attached images]