Thomas Pethick

153 posts

@tmpethick

Joined July 2011
91 Following, 262 Followers
Pinned Tweet
Thomas Pethick@tmpethick·
1/ We introduce SODA: a simple optimizer wrapper that improves a base optimizer, adds no hyperparameters, and removes the need to tune weight decay. The wrapper provides consistent improvement. Most notably, SODA(Muon) beats Muon even when Muon gets a tuned weight decay sweep.
Thomas Pethick@tmpethick·
@tonysilveti Thank you for sharing! You beat me to it 😂 I'm really excited about this direction - will share some more thoughts tomorrow
Tony S.F.@tonysilveti·
SODA shows that Muon, Lion, NAdam, and others are all special cases of Optimistic Dual Averaging! A simple wrapper around any base optimizer can replace weight decay tuning with a principled 1/(k+2) schedule; no new HPs. Practical results seem promising! arxiv.org/abs/2605.11172
Thomas Pethick@tmpethick·
7/ Thanks to @CevherLIONS for supporting it and getting Roman Macháček on board, and @WanyunXie for scaling up the experiments and debugging runs together – it’s always a joy
Thomas Pethick@tmpethick·
@EIFY @tonysilveti One reason I find z0 appearing interesting is for finetuning where weight decay is otherwise typically not used (I added a comment on this in the conclusion) - but that’s a story in itself
Thomas Pethick@tmpethick·
@EIFY @tonysilveti We haven’t ablated this beyond 1x Chinchilla with a 124M model (the main setting where I tested things before we extrapolated across horizon and model size)
Thomas Pethick@tmpethick·
@CV_novel_plume With that said I think the ODA perspective is interesting in itself and there's more to extract
Thomas Pethick@tmpethick·
@CV_novel_plume Yes, this is exactly the SODA wrapper! I wanted to extract something concrete from the perspective and I was at the time trying to understand weight decay - surprisingly the first thing I tried (just using params from theory) worked without any retuning of lr etc of the base opt
Yuxin Fang@CV_novel_plume·
Great work! I think the core idea is simple: replace tuned weight decay toward zero with an annealed pullback toward initialization.
- Baseline WD shrinks weights toward 0 with a tuned hyperparameter (usually coupled w/ lr schedule).
- SODA instead recenters weights toward x_0 at initialization with a fixed 1/(k+2) schedule.
Tony S.F.@tonysilveti

SODA shows that Muon, Lion, NAdam, and others are all special cases of Optimistic Dual Averaging! A simple wrapper around any base optimizer can replace weight decay tuning with a principled 1/(k+2) schedule; no new HPs. Practical results seem promising! arxiv.org/abs/2605.11172
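
The contrast drawn in this exchange (shrink toward 0 with a tuned coefficient versus recenter toward x_0 with a fixed 1/(k+2) weight) can be sketched in a few lines. This is only an illustration of the idea as summarized in the thread, not the algorithm from the paper: the function names, the convex-combination form of the pullback, and applying it after the base update are all assumptions.

```python
import numpy as np


def pullback_weight(k: int) -> float:
    """Fixed schedule quoted in the thread: 1/(k+2) for step k = 0, 1, 2, ..."""
    return 1.0 / (k + 2)


def recentered_step(x, x0, base_update, k):
    """Apply a base-optimizer update, then pull the result toward x0.

    ASSUMPTION: the convex combination below is one simple way to realize
    an "annealed pullback toward initialization"; the paper may differ.
    """
    lam = pullback_weight(k)
    x_new = x + base_update  # hypothetical base update, e.g. -lr * direction
    return (1.0 - lam) * x_new + lam * x0


def weight_decay_step(x, base_update, lr, wd):
    """Baseline for comparison: decoupled weight decay shrinks toward 0."""
    return (1.0 - lr * wd) * x + base_update


# With a zero base update, the offset from x0 shrinks by (k+1)/(k+2) per step,
# so after K steps it has been scaled by 1/(K+1).
x0 = np.array([1.0, -2.0])
offset = np.array([3.0, 3.0])
x = x0 + offset
for k in range(4):
    x = recentered_step(x, x0, np.zeros_like(x), k)
print(x - x0)
```

Note that the baseline needs a `wd` value to be chosen, while the recentered step takes only the step index, which is the "no new HPs" point made in the thread.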

Thomas Pethick@tmpethick·
@_arohan_ Yes, non-constant is interesting to explore. The GPA paper by @aaron_defazio has a very nice perspective on DiLoCo, also through schedule-free, which makes the delta/diff from SODA easier to understand (I've compared in the related work section)
Thomas Pethick@tmpethick·
@wen_kaiyue @tonysilveti Your comment about optimism is interesting. I mainly focused on extracting a schedule for weight decay in this work, but there is an interesting question on how to schedule optimism and weight decay in tandem to better exploit smoothness
Thomas Pethick@tmpethick·
@wen_kaiyue @tonysilveti It's actually not quite batch-size independent - if you squint, the convergence theorem suggests a constant followed by 1/√k (the changepoint will depend on the noise), but we haven't investigated this much empirically yet
Thomas Pethick@tmpethick·
@tonysilveti Yeah, except even for MLP blocks RMSNorm(W2σ(W1)) it's OK for both matrices as long as the activation function is positively homogeneous (e.g., true for ReLU and ReLU^2)
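
A quick numerical check of the homogeneity argument. Reading "OK for both matrices" as "the block output does not depend on the scale of either matrix" is an assumption here, as are the input `x`, the shapes, and the RMSNorm without gain or epsilon. If σ(c·z) = c^p·σ(z) for c > 0, rescaling W1 or W2 only rescales the pre-norm vector by a positive constant, which RMSNorm removes.

```python
import numpy as np


def rmsnorm(v):
    """RMSNorm without learned gain or epsilon (simplifying assumption)."""
    return v / np.sqrt(np.mean(v ** 2))


def relu(z):
    return np.maximum(z, 0.0)


def relu_squared(z):
    return np.maximum(z, 0.0) ** 2


def mlp_block(W1, W2, x, act):
    """RMSNorm(W2 act(W1 x)) for a single input vector x."""
    return rmsnorm(W2 @ act(W1 @ x))


rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 8))
W2 = rng.standard_normal((8, 16))
x = rng.standard_normal(8)

# Rescale both matrices by positive constants and compare block outputs.
for act in (relu, relu_squared, np.tanh):
    reference = mlp_block(W1, W2, x, act)
    rescaled = mlp_block(3.0 * W1, 0.5 * W2, x, act)
    print(act.__name__, np.allclose(reference, rescaled))
```

ReLU (degree 1) and ReLU^2 (degree 2) leave the output unchanged; tanh is included as a non-homogeneous contrast, for which the invariance is not expected to hold.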
Thomas Pethick@tmpethick·
It is common to exclude the unembedding layer from weight decay, but why? Essentially, applying weight decay to the last layer constrains the model's logits and consequently prevents high-confidence outputs. 👇 x.com/tmpethick/stat…
Thomas Pethick@tmpethick

@scottjmaddox @SmartPig_Joe @b_rich_now @wen_kaiyue For unembedding, the logits become constrained when using weight decay, so the model cannot be highly confident. E.g., with RMSNorm before the last layer (=> input bounded) and weight decay (=> weights bounded through the FW perspective), the inner product is bounded (for RowNorm exactly [-1,1])
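
The boundedness argument can be checked numerically. In this sketch the scaling is an assumption chosen so that the bound comes out as [-1,1]: the RMSNorm output has norm sqrt(d), and each unembedding row is normalized to norm 1/sqrt(d). By Cauchy-Schwarz every logit then lies in [-1,1], and the largest achievable softmax probability is 1 / (1 + (V-1)·e^(-2)) for a vocabulary of size V.

```python
import numpy as np


def rmsnorm(v):
    """RMSNorm without learned gain or epsilon (simplifying assumption)."""
    return v / np.sqrt(np.mean(v ** 2))


def max_confidence(vocab_size, bound=1.0):
    """Largest softmax probability when every logit lies in [-bound, bound].

    Attained with one logit at +bound and all others at -bound.
    """
    return 1.0 / (1.0 + (vocab_size - 1) * np.exp(-2.0 * bound))


rng = np.random.default_rng(0)
d, vocab = 64, 1000

h = rmsnorm(rng.standard_normal(d))  # ||h||_2 = sqrt(d)
W = rng.standard_normal((vocab, d))
# ASSUMED scaling: each row has norm 1/sqrt(d), so |w_i . h| <= 1.
W /= np.linalg.norm(W, axis=1, keepdims=True) * np.sqrt(d)

logits = W @ h
probs = np.exp(logits) / np.exp(logits).sum()

print("max |logit|:", np.abs(logits).max())
print("max prob:", probs.max())
print("upper bound on max prob:", max_confidence(vocab))
```

For a vocabulary of 1000 the bound is below 1%, which illustrates why a model constrained this way cannot produce a confident prediction without an extra scale on the logits.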
