Atli Kosson

74 posts

@AtliKosson

PhD student at @EPFL 🇨🇭 working on improved understanding of deep neural networks and their optimization.

Lausanne, Switzerland · Joined July 2022
515 Following · 460 Followers
Pinned Tweet
Atli Kosson @AtliKosson
Why does AdamW outperform Adam with L2 regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, LR warmup, and weight decay in general! 🧵 for arxiv.org/abs/2305.17212 1/10
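A minimal sketch of the distinction the thread studies (names and hyperparameters are illustrative): Adam + L2 folds the decay term into the gradient before the moment estimates, while AdamW applies it directly to the weights; the relative update size ‖ΔW‖/‖W‖ is used here as a simple proxy for the angular update.

```python
import torch

def adam_style_step(w, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                    eps=1e-8, wd=0.01, decoupled=True):
    """One Adam-style step on weight tensor w (t is the 1-based step count).
    decoupled=False: Adam + L2, decay passes through the moment estimates.
    decoupled=True: AdamW, decay bypasses the moment estimates."""
    if not decoupled:
        grad = grad + wd * w                      # L2 term enters m and v
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)               # bias correction
    v_hat = v / (1 - betas[1] ** t)
    delta = lr * m_hat / (v_hat.sqrt() + eps)
    if decoupled:
        delta = delta + lr * wd * w               # decoupled decay (AdamW)
    w_new = w - delta
    angular = (w_new - w).norm() / w.norm()       # relative update as proxy
    return w_new, angular
```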
Atli Kosson @AtliKosson
@francoisfleuret Our hypothesis from arxiv.org/abs/2410.23922 is that LR warmup prevents large early updates resulting from our optimizers not normalizing the update in the right way. This effect gets worse at large batch sizes and wide networks for simple optimizers like AdamW.
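For concreteness, a minimal sketch of the kind of schedule being discussed (names and defaults are illustrative): linear warmup caps early update sizes while the optimizer's statistics are still poorly calibrated.

```python
def warmup_lr(step, base_lr=3e-4, warmup_steps=1000):
    """Linear LR warmup: limits early update sizes while Adam's moment
    estimates are still poorly calibrated (illustrative defaults)."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```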
François Fleuret @francoisfleuret
Warm-up's role is mostly to (A) pick up a good direction in an initially awful loss landscape or (B) calibrate Adam rescaling without hurting the parameters?
Atli Kosson @AtliKosson
We do similar training on the sphere in arxiv.org/abs/2305.17212 and arxiv.org/abs/2410.23922. For AdamW we found that we could often roughly predict the learning rate on the sphere by matching the predicted update size of the baseline in equilibrium; the same probably applies to Muon (especially for longer training runs where weight decay dominates).
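A minimal sketch of "training on the sphere" as described here, assuming a helper that recorded each matrix's norm at initialization (all names illustrative): after every optimizer step, each weight matrix is rescaled back to its initial norm, so ‖ΔW‖/‖W‖ is controlled directly rather than implicitly through weight decay.

```python
import torch

@torch.no_grad()
def renormalize_weights(model, init_norms):
    """Project weights back onto the sphere after an optimizer step.
    init_norms is assumed to map parameter names to norms recorded
    at initialization."""
    for name, p in model.named_parameters():
        if p.dim() >= 2:  # weight matrices only, skip biases and gains
            p.mul_(init_norms[name] / p.norm())
```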
Kaiyue Wen @wen_kaiyue
(1/n) Introducing Hyperball — an optimizer wrapper that keeps weight & update norm constant and lets you control the effective (angular) step size directly. Result: sustained speedups across scales + strong hyperparameter transfer.
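A rough sketch of the idea as stated in the tweet, not Hyperball's actual implementation: normalize the update so every step has exact relative (angular) size eta, then rescale the weight back to its previous norm.

```python
import torch

@torch.no_grad()
def angular_step(w, update, eta):
    """Take a step of exact relative size eta while keeping ||w|| fixed
    (illustrative sketch of a norm-constrained update)."""
    target = w.norm()
    step = (eta * target / update.norm()) * update  # ||step|| = eta * ||w||
    w.sub_(step)
    w.mul_(target / w.norm())  # back onto the sphere of radius ||w||
    return w
```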
Atli Kosson @AtliKosson
@xidulu Yeah, accumulating directly into the momentum buffer wouldn't be sufficient if you need the gradient for the second moment or Nesterov. The layerwise trick always works I think, but it requires alternative gradient clipping strategies and is a bit of a pain to implement in distributed settings.
Xidulu @xidulu
@AtliKosson
> reduce this by essentially not storing the gradient.
Oh, this is a nice trick! And I believe this is a Muon-specific trick, since the gradient is only used to update the momentum in Muon?
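A sketch of the memory trick being discussed, under the assumption stated above that the gradient is only needed for the momentum update; this is not Muon's actual code, and it requires PyTorch >= 2.1 for the post-accumulate-grad hook.

```python
import torch

def fold_grads_into_momentum(model, beta=0.95):
    """Fold each gradient into its momentum buffer as soon as it is
    accumulated, then free it, so full-model gradients are never stored
    (illustrative sketch; requires PyTorch >= 2.1)."""
    buffers = {p: torch.zeros_like(p) for p in model.parameters()}

    def hook(param):
        buffers[param].mul_(beta).add_(param.grad)  # momentum accumulation
        param.grad = None                           # free the gradient now

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(hook)
    return buffers
```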
Xidulu @xidulu
Muon reduces optimizer memory overhead from 3 × model_size to approximately 2 × model_size. Can we further reduce it to 1 to 1.5 × model_size?
Atli Kosson @AtliKosson
No, you are right: in this case we fully replace weight decay by constraining ‖W‖ and controlling the size of ‖ΔW‖/‖W‖ directly. The findings are the same in the sense that ‖ΔW‖/‖W‖ should decrease with width in early training (as muP predicts) but not later, due to alignment changes. In our earlier work arxiv.org/abs/2305.17212 we argue that the practical benefits of WD are almost entirely due to its effect on ‖ΔW‖/‖W‖, which can be achieved explicitly by optimizer modifications instead (eliminating some of the complex dynamics from WD).
Lucas Beyer (bl16) @giffmana
@AtliKosson @QuanquanGu But when you constrain the norm like this, then WD has no effect? So you mean you tried this norm constraint instead of WD? Then what does "the findings were the same" mean? Or I think I'm confused?
Atli Kosson @AtliKosson
Thanks! This does indeed make an argument for independent WD. Our work is mainly about understanding why it helps; we don't claim to be the first to say it is needed.

The core argument here seems to be that "if λ → 0 with n, then weight decay has no effect in the limit" (for IWD). It's not clear to me whether this would be undesirable in itself. With µP the norm of the gradient update goes to zero as the width goes to infinity, so why couldn't the weight-decay norm change also go to zero? If that is not the case, then WD dominates the update and the network cannot maintain the initial weight norms at all. They would always shrink, which seems likely to interfere with µP's initialization strategy.

In any case we take a very different approach in the manuscript, where we relate both variants of weight decay to the core goal of µP in terms of controlling the rate of feature learning (but throughout training, not only at initialization).
Atli Kosson @AtliKosson
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
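A toy sketch of the interaction the thread describes (all names illustrative): µP shrinks the hidden-layer LR with width, so the gradient-based part of the update shrinks for wider models, while independent weight decay applies at full strength regardless of width.

```python
def mup_hidden_step(w, update, base_lr, width, base_width, lam):
    """One sketched step on a hidden weight matrix: the LR is scaled down
    with width (muP), while the independent WD term lam * w is not."""
    lr = base_lr * base_width / width  # muP: hidden-layer LR ~ 1/fan_in
    return w - lr * update - lam * w
```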
Atli Kosson @AtliKosson
@lorenzo_noci I think standard weight decay should be modified like you describe. This is the approach we actually employ in the paper, i.e. standard weight decay where we scale the WD inversely with the LR. This is equivalent to IWD when scaling with muP (but not when tuning the LR or WD).
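In code form, a minimal sketch of the equivalence stated here (illustrative names): coupled decay with the WD coefficient scaled inversely with the LR reduces exactly to independent WD.

```python
def weight_decay_term(w, lr, lam, mode="independent"):
    """Per-step decay term under three conventions: coupled decay with
    wd = lam / lr gives lr * (lam / lr) * w == lam * w, i.e. IWD."""
    if mode == "coupled":           # standard AdamW: decay scales with lr
        return lr * lam * w
    if mode == "inverse_scaled":    # coupled, but with wd set to lam / lr
        return lr * (lam / lr) * w  # == lam * w
    return lam * w                  # independent weight decay (IWD)
```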
Lorenzo Noci @lorenzo_noci
@AtliKosson Nice work! Could you clarify why standard weight decay should not be modified to follow muP intuition? In our work, we provided a prescription for weight decay scaling rules based on this (arxiv.org/pdf/2505.01618).
Atli Kosson @AtliKosson
@SeunghyunSEO7 Hmm, interesting. Yeah, I think if Adafactor is weight-proportional, most of the insights about weight decay stop holding (in my view WD is about making non-proportional optimizers proportional). Controlling the norms can still matter, but that also depends on normalization etc.
Seunghyun Seo @SeunghyunSEO7
@AtliKosson Oh sorry, they used Adafactor, which is equivalent to AdamW with a parameter-scaled LR, so it is more complicated. In addition, Google frameworks like TensorFlow perform independent WD when using AdamW, unlike PyTorch.
Seunghyun Seo @SeunghyunSEO7
There have been some experimental and theoretical results on the importance of independent WD when using muP, like arxiv.org/abs/2405.13698, and here's another interesting work! (And here's concurrent work studying WD and muP from another perspective: arxiv.org/pdf/2510.15262.)
> Atli Kosson @AtliKosson
> Surprisingly, independent WD works because it overrides µP's scaling! µP makes the updates proportionally smaller for wider models, but independent WD eventually makes them equally large across widths. This turns out to be exactly what's needed for stable feature learning! 🧵5/8
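In symbols, a back-of-the-envelope sketch of this claim (simplifying assumptions throughout: η_n is the µP-scaled hidden-layer LR, u_t the optimizer direction, λ the independent WD coefficient):

```latex
\[
W_{t+1} = (1-\lambda)\,W_t - \eta_n u_t, \qquad \eta_n \propto \tfrac{1}{n},
\]
\[
\frac{\lVert \eta_n u_t \rVert}{\lVert W_t \rVert}
  \xrightarrow[\,n \to \infty\,]{} 0,
\qquad
\frac{\lVert \lambda W_t \rVert}{\lVert W_t \rVert} = \lambda
  \quad \text{(width-independent)}.
\]
```

So the gradient part of the relative update shrinks with width while the IWD part does not, which is how IWD eventually equalizes relative update sizes across widths.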
Atli Kosson @AtliKosson
Interesting! One thing we tried but didn't include in the manuscript was constraining the weight norms to the initialization values (scaling each weight matrix to have the init norm after every step). That way the norms are well-behaved throughout. The findings were essentially the same: the muP update scaling should only be applied as a warmup, not throughout training, or the LR won't transfer.
Quanquan Gu @QuanquanGu
I see it differently. The learning rate and weight decay jointly determine the norm of the weight matrices when the optimization converges. If weight decay is not scaled correctly with μP, then the model is effectively outside the μP regime, and the benefits of μP can be diminished or even negated.
Atli Kosson @AtliKosson
So the hypothesis is that the peak LR doesn't control the rate of feature learning anymore, and it can transfer for that reason. This lack of control could contribute to the typical loss of performance without WD. All the findings about broken alignment assumptions etc. seem to hold without WD.
Atli Kosson @AtliKosson
@cloneofsimo It's not necessarily the opposite; we discuss this a bit in the manuscript. For short runs WD doesn't really matter, but it also turns out that without WD all LRs kind of work the same in the long run. The size of the relative feature updates loses its dependence on the peak LR.