Fabian Schaipp

517 posts

@FSchaipp

working on optimization for machine learning. currently postdoc @inria_paris.

Paris, France · Joined July 2020

755 Following · 1.3K Followers

Pinned Tweet

Fabian Schaipp@FSchaipp·5 Feb

Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 arxiv.org/abs/2501.18965

Aaron Defazio@aaron_defazio

The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness, a new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2

Fabian Schaipp@FSchaipp·4d

Polyak step size is back!

Aaron Defazio@aaron_defazio

🚨 New Paper 🚨 ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models A few modifications to Schedule-Free Learning make it completely LR tuning free, and allow it to greatly outperform schedules for long duration training! arxiv.org/abs/2605.19095…

Fabian Schaipp@FSchaipp·15 May

@maxzimmerberlin @spokutta Congratulations, Dr. Zimmer!

Max Zimmer@maxzimmerberlin·15 May

I successfully defended my PhD (Dr. rer. nat.) in Mathematics and Deep Learning with distinction! Thanks a lot everyone for coming, and in particular to my reviewers! - @spokutta - Prof. Dr. Gabriele Steidl - Prof. Dr. Alexandre d'Aspremont - Prof. Dr. Yao Xie

Fabian Schaipp@FSchaipp·12 May

NeurIPS 2025 Proceedings are finally online!

Fabian Schaipp@FSchaipp·4 May

@konstmish when I tested CWD, it looked much better than the baseline for a long time, then lost all its advantage during cooldown. 😢 Was for a relatively short run though (D~20N).

Konstantin Mishchenko@konstmish·4 May

Actually, I also never had much success testing cautious Adam gradient update or cautious weight decay. Is it a skill issue?

Naga/Abhi@NagaSaiAbhinay

Adding caution to the update step also doesn't help. makes it worse actually. does anyone actually use cautious update or cautious weight decay in practice?

Fabian Schaipp@FSchaipp·16 Apr

Nice result! (from arxiv.org/pdf/2604.13870) no anytime-schedule can obtain the optimal rate for (S)GD. to my knowledge, WSD is the closest candidate we know of, as it removes the log-factor in the rate for any cooldown length proportional to T.
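A minimal sketch of a WSD (warmup-stable-decay) schedule whose cooldown length is a fixed fraction of the horizon T, as in the statement above. The name `wsd_lr`, the linear warmup, the linear decay towards zero, and the default fractions are illustrative assumptions and are not taken from the linked paper.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, cooldown_frac=0.2):
    """Learning rate at `step` (0 <= step < total_steps) for a WSD schedule.

    Assumed shape: linear warmup, constant plateau, linear cooldown.
    The cooldown length is cooldown_frac * total_steps, i.e. proportional to T.
    """
    warmup_steps = int(warmup_frac * total_steps)
    cooldown_steps = int(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps

    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Cooldown: decay linearly over the last cooldown_steps steps.
    return peak_lr * (total_steps - step) / cooldown_steps


# Example: schedule for a horizon of T = 100 steps.
schedule = [wsd_lr(t, 100, peak_lr=1.0) for t in range(100)]
```

Because the warmup and plateau do not depend on where the cooldown starts, the same run can be extended and cooled down later, which is why WSD is often discussed next to anytime schedules.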

Fabian Schaipp@FSchaipp·31 Mar

Going to Zurich for a couple of days. I will give a talk on recent optimization stuff @zurichnlp. Always happy to chat 🍫

Fabian Schaipp@FSchaipp·24 Mar

@ruuustem_10 @YouJiacheng @CevherLIONS not sure i follow. the steplaw is not restricted to a fixed TPP.

Rustem@ruuustem_10·24 Mar

@YouJiacheng @FSchaipp @CevherLIONS If you work in the regime when TPP is fixed (T_1/T_0=D_1/D_0) we should use eta_1 = eta_0 * (D_1 / D_0)^{-0.713} (T_1 / T_0)^{0.307} = eta_0 * (T_1 / T_0)^{-0.406}, so eta decreases. Also, 0.406 is quite close to 1/3. So in some sense we should write everything in terms of T
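The exponent arithmetic in this reply can be checked with a short sketch. The exponents -0.713 and 0.307 are the ones quoted in the tweet; the function name `scaled_lr` and the example values are hypothetical and only illustrate the fixed-TPP case where both ratios are equal.

```python
import math

# Exponents as quoted in the tweet above.
D_EXPONENT = -0.713
T_EXPONENT = 0.307


def scaled_lr(eta0, d_ratio, t_ratio):
    """eta_1 = eta_0 * (D_1/D_0)^{-0.713} * (T_1/T_0)^{0.307}."""
    return eta0 * d_ratio**D_EXPONENT * t_ratio**T_EXPONENT


# With fixed TPP the two ratios coincide, so the exponents simply add up.
combined_exponent = D_EXPONENT + T_EXPONENT  # approximately -0.406

# Hypothetical example: base LR 1e-3, both ratios equal to 4.
eta1 = scaled_lr(1e-3, d_ratio=4.0, t_ratio=4.0)
assert math.isclose(eta1, 1e-3 * 4.0**combined_exponent)
assert eta1 < 1e-3  # the learning rate decreases as the ratio grows
```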

You Jiacheng@YouJiacheng·24 Mar

This theory *derives* BS ~ T^(2/3). This is impressively close to an empirical law (steplaw). 2/3 vs. 0.571

Volkan Cevher@CevherLIONS

From our analysis, the achievable error satisfies: ε ∼ max { BS / T, T^(−1/3), 1 / (T² BS)^(1/6) } These terms correspond to limiting terms due to 1) iterations 2) optimization efficiency 3) stochastic noise Balancing them gives the scaling law.
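One way to read the balancing step, working only from the three terms quoted above (this is a sketch of the arithmetic, not the derivation from the paper): equating the iteration term with the optimization term gives the batch size scaling, and the noise term is then no larger than the other two.

```latex
\begin{align*}
\varepsilon &\sim \max\left\{ \frac{BS}{T},\; T^{-1/3},\; (T^{2}\, BS)^{-1/6} \right\} \\
\frac{BS}{T} = T^{-1/3} &\;\Longrightarrow\; BS = T^{2/3} \\
(T^{2} \cdot T^{2/3})^{-1/6} &= T^{-4/9} \le T^{-1/3} \quad (T \ge 1)
\end{align*}
```

So with BS ~ T^(2/3) the maximum is of order T^(-1/3), which matches the exponent 2/3 mentioned in the reply.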

Fabian Schaipp@FSchaipp·24 Mar

@CevherLIONS @YouJiacheng thanks for clarifying! so the batch size scaling is also only applicable to Scion?

Volkan Cevher@CevherLIONS·24 Mar

@YouJiacheng @FSchaipp Good point but I think there are two separate issues here. In our theory, the quantity scaling like 1/K is the Scion/FW step-size β, not the AdamW peak LR used in StepLaw with the cosine decay (which decays along the path). So this is not a direct apples-to-apples comparison.

Fabian Schaipp@FSchaipp·3 Mar

@JFPuget yes! github.com/fabian-sp/sda

JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱@JFPuget·3 Mar

@FSchaipp Is your code available?

JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱@JFPuget·3 Mar

Interestingly, people responding to this comment stress that Muon is faster than AdamW, even for CNN. But they don't provide evidence. While I trust them that Muon is faster than AdamW, what I care about is the quality of the resulting model. Is a Muon trained CNN better than a CNN trained with AdamW? In the CIFAR-10 speedrun, Muon is compared to SGD, not AdamW. And it yields a worse model, with 94% accuracy vs 96% with SGD. github.com/KellerJordan/c… I welcome any evidence of a CNN trained with Muon being better than the same CNN trained with AdamW.

JFPuget 🇫🇷🇺🇦🇨🇦🇬🇱@JFPuget

@VukRosic99 Not sure Muon would help for CNNs.

Fabian Schaipp@FSchaipp·18 Feb

@rishabh16_ Muon for material foundation model: arxiv.org/abs/2508.16067 Not Bio, but closely related: arxiv.org/abs/2510.19376

Rishabh Anand@rishabh16_·18 Feb

Interesting to see how optimiser research hasn’t really hit AI4Bio (yet). Not a single recent paper I’ve seen uses second-order optimisers like Shampoo or KFAC, let alone fancier stuff like Muon. Guess Adam(W) is here to stay … for a while. We kinda knew BioML lags behind trad ML by 1-2 years but I thought this’d have happened by now.

Fabian Schaipp@FSchaipp·13 Feb

After LLMs and diffusion, Muon also shines on tabular foundation models! Also nice to see they used cautious weight decay 🥌

David Holzmüller@DHolzmueller

Super excited that TabICLv2 is out 🎉 🚀Beats RealTabPFN-2.5 with no tuning and purely synthetic pre-training data. 👉Introduces QASSMax for long-context generalization, early target embedding, repeated feature grouping, Muon, etc., and a much diversified synthetic data prior.

Fabian Schaipp@FSchaipp·9 Feb

@ADarmouni that's not how research works

Axel Darmouni@ADarmouni·9 Feb

I think that it’s three things: —> the method makes sense as an improvement so they don’t go further when they see the improvement —> while around 1e-4 seems consistent, it would appear that varying the LR on small models provokes WILD discrepancies, which was indeed less explored —> a hyperparameter sweep takes time and, considering the variance, it means lots of runs; if your training takes at least one day it will definitely slow down publication

Fabian Schaipp@FSchaipp·8 Feb

not to offend anyone, but how tf do these papers get through review when not even the LR of the baseline is properly tuned?

Zain@ZainHasan6

Learning rate matters more than your LoRA variant. In this study they sweep LR hard across LoRA variants (DoRA, Init[AB], PiSSA, MiLoRA) and find: > If you tune LR properly, they all converge to approx the same peak perf. > Rank still matters and can flip which variant looks best depending on dataset. > Optimal learning rate is a function of how steeply curved the loss is: >> more curved → smaller steps (lower LR) >> less curved → larger steps (higher LR)
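The link between curvature and a usable learning rate can be illustrated on a toy problem. The sketch below runs gradient descent on the one-dimensional quadratic f(x) = (c/2) x^2, where the iteration contracts only if lr < 2/c. The function name `gd_on_quadratic` and all values are illustrative assumptions; this is not the setup of the quoted study.

```python
def gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    The gradient is curvature * x, so each step multiplies x by
    (1 - lr * curvature). Returns |x| after `steps` iterations.
    """
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x
    return abs(x)


# High curvature (c = 10): lr = 0.15 is below 2/c = 0.2 and converges,
# while lr = 0.25 is above it and diverges.
assert gd_on_quadratic(10.0, 0.15) < 1e-10
assert gd_on_quadratic(10.0, 0.25) > 1e6

# Low curvature (c = 1): the same lr = 0.25 is well below 2/c = 2 and converges.
assert gd_on_quadratic(1.0, 0.25) < 1e-5
```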

Fabian Schaipp@FSchaipp·5 Feb

"please don't call it matrix sign"

Fabian Schaipp@FSchaipp·30 Jan

@staghado W_q^T

Said Taghadouini@staghado·30 Jan

@FSchaipp x_m and bold R?

Fabian Schaipp@FSchaipp·30 Jan

> 4.5k citations, but two typos in one of the central equations

Fabian Schaipp@FSchaipp·28 Jan

@snowclipsed not a MoE, but:

snow@snowclipsed·28 Jan

Can you believe that? That's one of the smoothest pretrain curves I've seen, on an incredibly sparse MoE!

Arcee.ai@arcee_ai

To keep training stable at high sparsity, we increased dense layers, applied momentum-based expert load balancing, and used z-loss to control logit scale during training. The loss curve stayed smooth throughout the run.

Fabian Schaipp@FSchaipp·26 Jan

@dayal_kalra EoS to me describes growing sharpness despite constant LR until stabilization kicks in. the 7B experiment can't confirm that because the growing sharpness could be only due to cosine. also it seems there is no oscillatory phase in your train run, right?

Dayal Kalra@dayal_kalra·26 Jan

@FSchaipp Thanks for the thoughtful comment! Here's my reasoning: As the LR decays, the EoS threshold increases. While critical sharpness could have stayed constant or decreased, it increases continuously. Curious about your intuition here.

Fabian Schaipp@FSchaipp·26 Jan

This looks like a useful tool for optimization research arxiv.org/pdf/2601.16979
