Atli Kosson

74 posts

@AtliKosson

PhD student at @EPFL 🇨🇭 working on improved understanding of deep neural networks and their optimization.

Lausanne, Switzerland · Joined July 2022
515 Following · 460 Followers
Pinned Tweet
Atli Kosson @AtliKosson
Why does AdamW outperform Adam with L2 regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, LR warmup, and weight decay in general! 🧵 for arxiv.org/abs/2305.17212 1/10
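A minimal sketch of the distinction the thread studies (names and hyperparameters are illustrative): Adam + L2 folds the decay term into the gradient before the moment estimates, while AdamW applies it directly to the weights; the relative update size ‖ΔW‖/‖W‖ is used here as a simple proxy for the angular update.

```python
import torch

def adam_style_step(w, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                    eps=1e-8, wd=0.01, decoupled=True):
    """One Adam-style step on weight tensor w (t is the 1-based step count).
    decoupled=False: Adam + L2, decay passes through the moment estimates.
    decoupled=True: AdamW, decay bypasses the moment estimates."""
    if not decoupled:
        grad = grad + wd * w                      # L2 term enters m and v
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)               # bias correction
    v_hat = v / (1 - betas[1] ** t)
    delta = lr * m_hat / (v_hat.sqrt() + eps)
    if decoupled:
        delta = delta + lr * wd * w               # decoupled decay (AdamW)
    w_new = w - delta
    angular = (w_new - w).norm() / w.norm()       # relative update as proxy
    return w_new, angular
```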
Atli Kosson @AtliKosson
@francoisfleuret Our hypothesis from arxiv.org/abs/2410.23922 is that LR warmup prevents large early updates resulting from our optimizers not normalizing the update in the right way. This effect gets worse at large batch sizes and wide networks for simple optimizers like AdamW.
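For concreteness, a minimal sketch of the kind of schedule being discussed (names and defaults are illustrative): linear warmup caps early update sizes while the optimizer's statistics are still poorly calibrated.

```python
def warmup_lr(step, base_lr=3e-4, warmup_steps=1000):
    """Linear LR warmup: limits early update sizes while Adam's moment
    estimates are still poorly calibrated (illustrative defaults)."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```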
François Fleuret @francoisfleuret
Warm-up's role is mostly to (A) pick up a good direction in an initially awful loss landscape or (B) calibrate Adam rescaling without hurting the parameters?
Atli Kosson @AtliKosson
We do similar training on the sphere in arxiv.org/abs/2305.17212 and arxiv.org/abs/2410.23922. For AdamW we found that we could often roughly predict the learning rate on the sphere by matching the predicted update size of the baseline in equilibrium; the same probably applies to Muon (especially for longer training runs where weight decay dominates).
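A minimal sketch of "training on the sphere" as described here, assuming a helper that recorded each matrix's norm at initialization (all names illustrative): after every optimizer step, each weight matrix is rescaled back to its initial norm, so ‖ΔW‖/‖W‖ is controlled directly rather than implicitly through weight decay.

```python
import torch

@torch.no_grad()
def renormalize_weights(model, init_norms):
    """Project weights back onto the sphere after an optimizer step.
    init_norms is assumed to map parameter names to norms recorded
    at initialization."""
    for name, p in model.named_parameters():
        if p.dim() >= 2:  # weight matrices only, skip biases and gains
            p.mul_(init_norms[name] / p.norm())
```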
Kaiyue Wen @wen_kaiyue
(1/n) Introducing Hyperball — an optimizer wrapper that keeps weight & update norm constant and lets you control the effective (angular) step size directly. Result: sustained speedups across scales + strong hyperparameter transfer.
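A rough sketch of the idea as stated in the tweet, not Hyperball's actual implementation: normalize the update so every step has exact relative (angular) size eta, then rescale the weight back to its previous norm.

```python
import torch

@torch.no_grad()
def angular_step(w, update, eta):
    """Take a step of exact relative size eta while keeping ||w|| fixed
    (illustrative sketch of a norm-constrained update)."""
    target = w.norm()
    step = (eta * target / update.norm()) * update  # ||step|| = eta * ||w||
    w.sub_(step)
    w.mul_(target / w.norm())  # back onto the sphere of radius ||w||
    return w
```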
Atli Kosson @AtliKosson
@xidulu Yeah, accumulating directly into the momentum buffer wouldn't be sufficient if you need the gradient for the second moment or Nesterov. The layerwise trick always works I think, but it requires alternative gradient clipping strategies and is a bit of a pain to implement in distributed settings.
Xidulu @xidulu
@AtliKosson
> reduce this by essentially not storing the gradient.
Oh, this is a nice trick! And I believe this is a Muon-specific trick, since the gradient is only used to update the momentum in Muon?
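A sketch of the memory trick being discussed, under the assumption stated above that the gradient is only needed for the momentum update; this is not Muon's actual code, and it requires PyTorch >= 2.1 for the post-accumulate-grad hook.

```python
import torch

def fold_grads_into_momentum(model, beta=0.95):
    """Fold each gradient into its momentum buffer as soon as it is
    accumulated, then free it, so full-model gradients are never stored
    (illustrative sketch; requires PyTorch >= 2.1)."""
    buffers = {p: torch.zeros_like(p) for p in model.parameters()}

    def hook(param):
        buffers[param].mul_(beta).add_(param.grad)  # momentum accumulation
        param.grad = None                           # free the gradient now

    for p in model.parameters():
        p.register_post_accumulate_grad_hook(hook)
    return buffers
```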
Xidulu @xidulu
Muon reduces optimizer memory overhead from 3 × model_size to approximately 2 × model_size. Can we further reduce it to 1 to 1.5 × model_size?
Atli Kosson @AtliKosson
No, you are right: in this case we fully replace weight decay by constraining ‖W‖ and controlling the size of ‖ΔW‖/‖W‖ directly. The findings are the same in the sense that ‖ΔW‖/‖W‖ should decrease with width in early training (as muP predicts) but not later, due to alignment changes. In our earlier work arxiv.org/abs/2305.17212 we argue that the practical benefits of WD are almost entirely due to its effect on ‖ΔW‖/‖W‖, which can be achieved explicitly by optimizer modifications instead (eliminating some of the complex dynamics from WD).
Lucas Beyer (bl16) @giffmana
@AtliKosson @QuanquanGu But when you constrain the norm like this, then WD has no effect? So you mean you tried this norm constraint instead of WD? Then what does "the findings were the same" mean? Or I think I'm confused?
Atli Kosson @AtliKosson
Thanks! This does indeed make an argument for independent WD. Our work is mainly about understanding why it helps; we don't claim to be the first to say it is needed.

The core argument here seems to be that "if λ → 0 with n, then weight decay has no effect in the limit" (for IWD). It's not clear to me whether this would be undesirable in itself. With µP the norm of the gradient update goes to zero as the width goes to infinity, so why couldn't the weight-decay norm change also go to zero? If that is not the case, then WD dominates the update and the network cannot maintain the initial weight norms at all. They would always shrink, which seems likely to interfere with µP's initialization strategy.

In any case we take a very different approach in the manuscript, where we relate both variants of weight decay to the core goal of µP in terms of controlling the rate of feature learning (but throughout training, not only at initialization).
Atli Kosson @AtliKosson
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
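A toy sketch of the interaction the thread describes (all names illustrative): µP shrinks the hidden-layer LR with width, so the gradient-based part of the update shrinks for wider models, while independent weight decay applies at full strength regardless of width.

```python
def mup_hidden_step(w, update, base_lr, width, base_width, lam):
    """One sketched step on a hidden weight matrix: the LR is scaled down
    with width (muP), while the independent WD term lam * w is not."""
    lr = base_lr * base_width / width  # muP: hidden-layer LR ~ 1/fan_in
    return w - lr * update - lam * w
```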
Atli Kosson @AtliKosson
@lorenzo_noci I think standard weight decay should be modified like you describe. This is the approach we actually employ in the paper, i.e. standard weight decay where we scale the WD inversely with the LR. This is equivalent to IWD when scaling with muP (but not when tuning the LR or WD).
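In code form, a minimal sketch of the equivalence stated here (illustrative names): coupled decay with the WD coefficient scaled inversely with the LR reduces exactly to independent WD.

```python
def weight_decay_term(w, lr, lam, mode="independent"):
    """Per-step decay term under three conventions: coupled decay with
    wd = lam / lr gives lr * (lam / lr) * w == lam * w, i.e. IWD."""
    if mode == "coupled":           # standard AdamW: decay scales with lr
        return lr * lam * w
    if mode == "inverse_scaled":    # coupled, but with wd set to lam / lr
        return lr * (lam / lr) * w  # == lam * w
    return lam * w                  # independent weight decay (IWD)
```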
Lorenzo Noci @lorenzo_noci
@AtliKosson Nice work! Could you clarify why standard weight decay should not be modified to follow muP intuition? In our work, we provided a prescription for weight decay scaling rules based on this (arxiv.org/pdf/2505.01618).
Atli Kosson @AtliKosson
@SeunghyunSEO7 Hmm, interesting. Yeah, I think if Adafactor is weight-proportional, most of the insights about weight decay stop holding (in my view WD is about making non-proportional optimizers proportional). Controlling the norms can still matter, but that also depends on normalization etc.
Seunghyun Seo @SeunghyunSEO7
@AtliKosson Oh sorry, they used Adafactor, which is equivalent to AdamW with a parameter-scaled LR, so it is more complicated. In addition, Google frameworks like TensorFlow perform independent WD when using AdamW, unlike PyTorch.
Seunghyun Seo @SeunghyunSEO7
There have been some experimental and theoretical results on the importance of independent WD when using muP, like arxiv.org/abs/2405.13698, and here's another interesting work! (And here's concurrent work studying WD and muP from another perspective: arxiv.org/pdf/2510.15262.)
> Atli Kosson @AtliKosson
> Surprisingly, independent WD works because it overrides µP's scaling! µP makes the updates proportionally smaller for wider models, but independent WD eventually makes them equally large across widths. This turns out to be exactly what's needed for stable feature learning! 🧵5/8
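In symbols, a back-of-the-envelope sketch of this claim (simplifying assumptions throughout: η_n is the µP-scaled hidden-layer LR, u_t the optimizer direction, λ the independent WD coefficient):

```latex
\[
W_{t+1} = (1-\lambda)\,W_t - \eta_n u_t, \qquad \eta_n \propto \tfrac{1}{n},
\]
\[
\frac{\lVert \eta_n u_t \rVert}{\lVert W_t \rVert}
  \xrightarrow[\,n \to \infty\,]{} 0,
\qquad
\frac{\lVert \lambda W_t \rVert}{\lVert W_t \rVert} = \lambda
  \quad \text{(width-independent)}.
\]
```

So the gradient part of the relative update shrinks with width while the IWD part does not, which is how IWD eventually equalizes relative update sizes across widths.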
Atli Kosson @AtliKosson
Interesting! One thing we tried but didn't include in the manuscript was constraining the weight norms to the initialization values (scaling each weight matrix to have the init norm after every step). That way the norms are well-behaved throughout. The findings were essentially the same: the muP update scaling should only be applied as a warmup, not throughout training, or the LR won't transfer.
Quanquan Gu @QuanquanGu
I see it differently. The learning rate and weight decay jointly determine the norm of the weight matrices when the optimization converges. If weight decay is not scaled correctly with μP, then the model is effectively outside the μP regime, and the benefits of μP can be diminished or even negated.
Atli Kosson @AtliKosson
So the hypothesis is that the peak LR doesn't control the rate of feature learning anymore, and it can transfer for that reason. This lack of control could contribute to the typical loss of performance without WD. All the findings about broken alignment assumptions etc. seem to hold without WD.
Atli Kosson @AtliKosson
@cloneofsimo It's not necessarily the opposite; we discuss this a bit in the manuscript. For short runs WD doesn't really matter, but it also turns out that without WD all LRs kind of work the same in the long run. The size of the relative feature updates loses its dependence on the peak LR.