Thomas Pethick

Thomas Pethick@tmpethick·5d

@tonysilveti I got a bit delayed but here we are: x.com/tmpethick/stat…

Thomas Pethick@tmpethick

1/ We introduce SODA: a simple optimizer wrapper that improves a base optimizer, adds no hyperparameters, and removes the need to tune weight decay. The wrapper provides consistent improvement. Most notably, SODA(Muon) beats Muon even when Muon gets a tuned weight decay sweep.

Thomas Pethick@tmpethick·13 May

@tonysilveti Thank you for sharing! You beat me to it 😂 I'm really excited about this direction - will share some more thoughts tomorrow

Tony S.F.@tonysilveti·13 May

SODA shows that Muon, Lion, NAdam, and others are all special cases of Optimistic Dual Averaging! A simple wrapper around any base optimizer can replace weight decay tuning with a principled 1/(k+2) schedule; no new HPs. Practical results seem promising! arxiv.org/abs/2605.11172

Thomas Pethick@tmpethick·5d

7/ Thanks to @CevherLIONS for supporting it and getting Roman Macháček on board, and @WanyunXie for scaling up the experiments and debugging runs together – it’s always a joy

Thomas Pethick@tmpethick·5d

6/ There are a lot of interesting questions one can ask from this perspective — please check out the paper! Paper: arxiv.org/pdf/2605.11172 Code: github.com/tmpethick/soda…

Thomas Pethick@tmpethick·6d

@EIFY @tonysilveti One reason I find z0 appearing interesting is for finetuning where weight decay is otherwise typically not used (I added a comment on this in the conclusion) - but that’s a story in itself

Thomas Pethick@tmpethick·6d

@EIFY @tonysilveti We haven’t ablated this beyond 1× Chinchilla with a 124M model (the main setting where I tested things before we extrapolated across horizon and model size)

Thomas Pethick@tmpethick·6d

@CV_novel_plume With that said I think the ODA perspective is interesting in itself and there's more to extract

Thomas Pethick@tmpethick·6d

@CV_novel_plume Yes, this is exactly the SODA wrapper! I wanted to extract something concrete from the perspective and I was at the time trying to understand weight decay - surprisingly the first thing I tried (just using params from theory) worked without any retuning of lr etc of the base opt

Yuxin Fang@CV_novel_plume·13 May

Great work! I think the core idea is simple: replace tuned weight decay toward zero with an annealed pullback toward initialization.
- Baseline WD shrinks weights toward 0 with a tuned hyperparameter (usually coupled w/ lr schedule).
- SODA instead recenters weights toward x_0 at initialization with a fixed 1/(k+2) schedule.
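The contrast between the two kinds of regularization can be sketched in a few lines. This is only one reading of the description above, not the algorithm from the paper: the convex-combination form of the pullback, the function names, and the example numbers are all assumptions made for illustration, and the base optimizer step is left out entirely.

```python
def pullback_coeff(k: int) -> float:
    """Fixed, untuned coefficient 1/(k+2) at step k = 0, 1, 2, ..."""
    return 1.0 / (k + 2)


def decay_toward_zero(x, lr, wd):
    """Decoupled weight decay: shrink every weight toward 0 by a factor (1 - lr * wd)."""
    return [(1.0 - lr * wd) * xi for xi in x]


def pull_toward_init(x, x0, k):
    """Move x toward the initialization x0 (assumed convex-combination form)."""
    lam = pullback_coeff(k)
    return [(1.0 - lam) * xi + lam * x0i for xi, x0i in zip(x, x0)]


x0 = [0.5, -1.0]  # hypothetical weights at initialization
x = [2.0, 1.0]    # hypothetical weights after some training

# The pull toward x0 is strong early on and fades as k grows.
for k in (0, 8, 98):
    print(k, pullback_coeff(k), pull_toward_init(x, x0, k))

# Baseline for comparison: needs a tuned wd and always targets 0.
print(decay_toward_zero(x, lr=0.1, wd=0.5))
```

The point of the comparison is that `decay_toward_zero` needs a `wd` value chosen by the user, while `pull_toward_init` has nothing to tune: its strength is fixed by the step index alone.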

Thomas Pethick@tmpethick·6d

@_arohan_ Yes, non-constant is interesting to explore. The GPA paper by @aaron_defazio has a very nice perspective on DiLoCo, also through schedule-free, which makes the delta/diff from SODA easier to understand (I've compared in the related work section)

rohan anil@_arohan_·13 May

This is very nice, DiLoCo will likely find it useful for outer steps.

Thomas Pethick@tmpethick·13 May

@wen_kaiyue @tonysilveti Your comment about optimism is interesting. I mainly focused on extracting a schedule for weight decay in this work, but there is an interesting question on how to schedule optimism and weight decay in tandem to better exploit smoothness

Thomas Pethick@tmpethick·13 May

@wen_kaiyue @tonysilveti It's actually not quite batch size independent - if you squint the convergence theorem suggests a constant then 1/√k (the changepoint will depend on noise), but we didn't investigate this much empirically yet

Thomas Pethick@tmpethick·13 May

@tonysilveti Yeah, except even for MLP blocks RMSNorm(W2σ(W1)) it's ok for both matrices as long as the activation function is positively homogeneous (e.g., true for ReLU and ReLU^2)
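One way to see why the scale of either matrix does not matter here: ReLU satisfies σ(c·z) = c·σ(z) for c > 0, so rescaling W1 or W2 by a positive constant only rescales the pre-norm output, and RMSNorm divides that scale back out. A minimal numerical check, assuming an RMSNorm with no learnable gain and no epsilon, and reading the block as RMSNorm(W2 σ(W1 x)) for some input x (the input x, the helper names, and the matrices are illustrative assumptions, not taken from the note):

```python
import math


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, vi) for vi in v]


def rms_norm(v):
    """RMSNorm without gain or epsilon (assumption for this sketch)."""
    rms = math.sqrt(sum(vi * vi for vi in v) / len(v))
    return [vi / rms for vi in v]


def scale(W, c):
    return [[c * w for w in row] for row in W]


def block(W1, W2, x):
    """RMSNorm(W2 relu(W1 x))."""
    return rms_norm(matvec(W2, relu(matvec(W1, x))))


W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[1.0, 2.0], [-1.0, 0.5]]
x = [1.0, 1.0]

base = block(W1, W2, x)
rescaled = block(scale(W1, 3.0), scale(W2, 0.25), x)

# The two outputs agree up to floating-point error.
print(base)
print(rescaled)
```

The same argument would fail for an activation that is not positively homogeneous, since the rescaling of W1 could then no longer be pulled outside σ.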

Why is Frobenius weight normalization ok when combined with non-Euclidean steepest descent methods? A short note: pethick.dk/posts/2026-05-…

Tony S.F.@tonysilveti·13 May

Makes sense! Also gives a suggestion of which parameters it is okay to regularize as in MuonH and which you really ought to allow to grow in an unconstrained fashion. If they are not followed by some kind of layer/RMSnorm, then maybe don't do the Frobenius normalization?

Thomas Pethick@tmpethick·13 May

I wrote up a quick note here expanding a bit on it: pethick.dk/posts/2026-05-…

Thomas Pethick@tmpethick·13 May

It is common to exclude the unembedding layer from using weight decay, but why? Essentially, applying weight decay to the last layer will constrain the logits of the model and consequently prevent high-confidence outputs. 👇 x.com/tmpethick/stat…

Thomas Pethick@tmpethick

@scottjmaddox @SmartPig_Joe @b_rich_now @wen_kaiyue For unembedding, the logits become constrained when using weight decay, so the model cannot be highly confident. E.g., with RMSNorm before the last layer (=> input bounded) and weight decay (=> weights bounded through the FW perspective), the inner product is bounded (for RowNorm exactly [-1,1])
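To make the effect of bounded logits concrete: if every logit lies in [-B, B], the largest softmax probability over V classes is reached when one logit is B and all others are -B, which gives 1 / (1 + (V - 1) e^{-2B}). The sketch below evaluates that cap. B = 1 matches the [-1, 1] range mentioned above, while the vocabulary sizes and function names are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_confidence(bound, vocab_size):
    """Largest softmax probability when all logits lie in [-bound, bound]."""
    return 1.0 / (1.0 + (vocab_size - 1) * math.exp(-2.0 * bound))


# Hypothetical vocabulary sizes; the cap shrinks quickly as V grows.
for vocab_size in (2, 1000, 50000):
    print(vocab_size, max_confidence(1.0, vocab_size))

# Cross-check against an explicit worst case: one logit at +1, the rest at -1.
print(softmax([1.0] + [-1.0] * 9)[0], max_confidence(1.0, 10))
```

With a large vocabulary the cap is far below 1, which is one way to read the statement that the model cannot be highly confident when the unembedding weights are constrained.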
