Fabian Schaipp

517 posts

@FSchaipp

working on optimization for machine learning. currently postdoc @inria_paris.

Paris, France · Joined July 2020
755 Following · 1.3K Followers
Pinned Tweet
Max Zimmer @maxzimmerberlin
I successfully defended my PhD (Dr. rer. nat.) in Mathematics and Deep Learning with distinction! Thanks a lot everyone for coming, and in particular to my reviewers!
- @spokutta
- Prof. Dr. Gabriele Steidl
- Prof. Dr. Alexandre d'Aspremont
- Prof. Dr. Yao Xie
[image]
Fabian Schaipp @FSchaipp
NeurIPS 2025 Proceedings are finally online!
[image]
Fabian Schaipp @FSchaipp
@konstmish When I tested CWD, it looked much better than the baseline for a long time, then lost all its advantage during cooldown. 😢 This was for a relatively short run, though (D~20N).
Fabian Schaipp @FSchaipp
Nice result! (from arxiv.org/pdf/2604.13870) No anytime schedule can obtain the optimal rate for (S)GD. To my knowledge, WSD is the closest candidate we know of, as it removes the log factor in the rate for any cooldown length proportional to T.
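As a rough illustration of the kind of schedule mentioned in the tweet above, here is a minimal Python sketch of a WSD (warmup-stable-decay) learning-rate schedule whose cooldown length is a fixed fraction of the horizon T. The function name `wsd_lr`, the linear cooldown shape, and the default values are assumptions made for illustration; they are not taken from the tweet or the linked paper.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps=0, cooldown_frac=0.2):
    """Learning rate at `step` (0-indexed) for a warmup-stable-decay schedule.

    Assumptions (illustrative only): linear warmup, linear cooldown to zero,
    and a cooldown length of cooldown_frac * total_steps.
    """
    cooldown_steps = round(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # Linear warmup up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Stable phase: constant peak learning rate.
        return peak_lr
    # Linear cooldown toward zero over the last `cooldown_steps` steps.
    return peak_lr * (total_steps - step) / cooldown_steps


# Hypothetical run: T = 100 steps, cooldown over the last 20% of training.
schedule = [wsd_lr(t, total_steps=100, peak_lr=1.0) for t in range(100)]
```

In this sketch only the start of the cooldown depends on `total_steps`; the warmup and stable phases are the same for every horizon.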
[image]
Fabian Schaipp @FSchaipp
Going to Zurich for a couple of days. I will give a talk on recent optimization stuff @zurichnlp. Always happy to chat 🍫
[image]
Rustem @ruuustem_10
@YouJiacheng @FSchaipp @CevherLIONS If you work in the regime where TPP is fixed (T_1/T_0 = D_1/D_0), we should use eta_1 = eta_0 * (D_1/D_0)^{-0.713} * (T_1/T_0)^{0.307} = eta_0 * (T_1/T_0)^{-0.406}, so eta decreases. Also, 0.406 is quite close to 1/3. So, in some sense, we should write everything in terms of T.
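The arithmetic in the tweet above can be checked with a short Python sketch. The exponents -0.713 and 0.307 are the ones quoted in the tweet; the function name `scaled_lr`, the base learning rate, and the scale-up ratio are hypothetical values chosen only to illustrate the calculation.

```python
def scaled_lr(eta_0, d_ratio, t_ratio, d_exp=-0.713, t_exp=0.307):
    """eta_1 = eta_0 * (D_1/D_0)^d_exp * (T_1/T_0)^t_exp.

    The default exponents are the values quoted in the tweet.
    """
    return eta_0 * d_ratio ** d_exp * t_ratio ** t_exp


eta_0 = 1e-3   # hypothetical base learning rate
ratio = 4.0    # hypothetical scale-up with T_1/T_0 = D_1/D_0

# With both ratios equal, the two power laws collapse into a single exponent.
eta_1 = scaled_lr(eta_0, ratio, ratio)
combined_exp = -0.713 + 0.307
eta_1_direct = eta_0 * ratio ** combined_exp

print(f"combined exponent: {combined_exp:.3f}")
print(f"eta_1 = {eta_1:.3e} (smaller than eta_0 = {eta_0:.1e})")
```

Because the combined exponent is negative, any ratio above 1 gives a smaller learning rate, which matches the statement that eta decreases.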
Volkan Cevher @CevherLIONS
@YouJiacheng @FSchaipp Good point, but I think there are two separate issues here. In our theory, the quantity scaling like 1/K is the Scion/FW step-size β, not the AdamW peak LR used in StepLaw with the cosine decay (which decays along the path). So this is not a direct apples-to-apples comparison.
JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱
Interestingly, people responding to this comment stress that Muon is faster than AdamW, even for CNNs. But they don't provide evidence. While I trust them that Muon is faster than AdamW, what I care about is the quality of the resulting model. Is a Muon-trained CNN better than a CNN trained with AdamW? In the CIFAR-10 speedrun, Muon is compared to SGD, not AdamW. And it yields a worse model, with 94% accuracy vs. 96% with SGD. github.com/KellerJordan/c… I welcome any evidence of a CNN trained with Muon being better than the same CNN trained with AdamW.
JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱 @JFPuget

@VukRosic99 Not sure Muon would help for CNNs.

Rishabh Anand @rishabh16_
Interesting to see how optimiser research hasn’t really hit AI4Bio (yet). Not a single recent paper I’ve seen uses second-order optimisers like Shampoo or KFAC, let alone fancier stuff like Muon. Guess Adam(W) is here to stay … for a while. We kinda knew BioML lags behind trad ML by 1-2 years, but I thought this’d have happened by now.
Axel Darmouni @ADarmouni
I think that it’s three things:
-> the method makes sense as an improvement, so they don’t go further once they see the improvement
-> while around 1e-4 seems consistent, it would appear that varying the LR on small models provokes WILD discrepancies, which was indeed less explored
-> a hyperparameter sweep takes time and, considering the variance, it means lots of runs; if your training takes at least one day, it will definitely slow down publication
Fabian Schaipp @FSchaipp
"please don't call it matrix sign"
[image]
Fabian Schaipp @FSchaipp
> 4.5k citations, but two typos in one of the central equations
[image]
Fabian Schaipp @FSchaipp
@dayal_kalra EoS to me describes growing sharpness despite constant LR until stabilization kicks in. The 7B experiment can't confirm that, because the growing sharpness could be due only to the cosine schedule. Also, it seems there is no oscillatory phase in your training run, right?
Dayal Kalra @dayal_kalra
@FSchaipp Thanks for the thoughtful comment! Here's my reasoning: As the LR decays, the EoS threshold increases. While critical sharpness could have stayed constant or decreased, it increases continuously. Curious about your intuition here.
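To make the statement "as the LR decays, the EoS threshold increases" concrete, here is a minimal Python sketch. It assumes the classical stability threshold for plain gradient descent, sharpness = 2/eta; adaptive optimizers have a different threshold, so this is not necessarily the quantity measured in the run discussed in the thread. The function names, the peak learning rate, and the step count are hypothetical.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_lr=0.0):
    """Cosine decay from peak_lr at step 0 toward final_lr at total_steps."""
    progress = step / total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


def eos_threshold(lr):
    """Sharpness above which plain gradient descent with step size lr is
    unstable on a quadratic (assumption: GD threshold 2 / lr)."""
    return 2.0 / lr


# Hypothetical run: 100 steps with peak LR 0.1.
thresholds = [eos_threshold(cosine_lr(t, 100, 0.1)) for t in range(100)]
```

Under this assumption the threshold rises monotonically along a cosine schedule, so a sharpness curve that tracks the threshold would also rise even without any progressive sharpening at constant LR.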