Thomas Pethick


1/ We introduce SODA: a simple optimizer wrapper that improves a base optimizer, adds no hyperparameters, and removes the need to tune weight decay. The wrapper provides consistent improvement. Most notably, SODA(Muon) beats Muon even when Muon gets a tuned weight decay sweep.













SODA shows that Muon, Lion, NAdam, and others are all special cases of Optimistic Dual Averaging! A simple wrapper around any base optimizer can replace weight decay tuning with a principled 1/(k+2) schedule; no new HPs. Practical results seem promising! arxiv.org/abs/2605.11172







Why is Frobenius weight normalization ok when combined with non-Euclidean steepest descent methods? A short note: pethick.dk/posts/2026-05-…



@scottjmaddox @SmartPig_Joe @b_rich_now @wen_kaiyue For the unembedding, the logits become constrained when using weight decay, so the model cannot be highly confident. E.g., with RMSNorm before the last layer (=> input bounded) and weight decay (=> weights bounded, via the FW perspective), the inner product is bounded (for RowNorm, the range is exactly [-1,1]).
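A rough numerical sketch of the bound, not taken from the paper: by Cauchy-Schwarz, |<w, x>| <= ||w|| * ||x||. Assuming RMSNorm without a learned gain (so ||x|| <= sqrt(d)) and assuming each unembedding row is rescaled to norm 1/sqrt(d), every logit lands in [-1, 1]. The helper names and the 1/sqrt(d) radius are illustrative assumptions; the actual RowNorm scaling convention may differ.

```python
import math
import random

def rms_norm(x, eps=1e-8):
    """Scale x so its root-mean-square is (just under) 1; no learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]

def row_normalize(W, radius):
    """Rescale every row of W to Euclidean norm `radius` (assumed convention)."""
    out = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([radius * v / norm for v in row])
    return out

def logits(W, x):
    """Plain matrix-vector product: one inner product per vocabulary row."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

random.seed(0)
d, vocab = 64, 1000  # hypothetical sizes, chosen only for the demo

# Input to the last layer: arbitrary scale before RMSNorm, norm <= sqrt(d) after.
x = rms_norm([random.gauss(0, 5) for _ in range(d)])

# Unembedding rows with norm 1/sqrt(d), so ||w|| * ||x|| <= 1.
W = row_normalize(
    [[random.gauss(0, 1) for _ in range(d)] for _ in range(vocab)],
    radius=1 / math.sqrt(d),
)

max_abs_logit = max(abs(z) for z in logits(W, x))

# The bound is tight: a row perfectly aligned with x reaches (almost exactly) 1.
aligned_row = [v / d for v in x]  # norm = ||x|| / d, roughly 1/sqrt(d)
aligned = logits([aligned_row], x)[0]

print(max_abs_logit <= 1.0, round(aligned, 6))
```

With logits confined to [-1, 1], the largest softmax probability is capped no matter how training proceeds, which is the "cannot be highly confident" effect described above.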

