Thomas Pethick
@tmpethick
124 posts
Joined July 2011
85 Following · 207 Followers

Thomas Pethick @tmpethick
codex feature request: being able to continue a chatgpt conversation in codex / import a chatgpt conversation into a codex session would be extremely convenient @thsottiaux (even simply an "export to codex" button in chatgpt)

Thomas Pethick @tmpethick
I close with a bunch of open questions/directions I find interesting, both very technical and high level, which might appeal to some; and if not, you can take a look at the graphics from Hundertwasser/Escher that I managed to get permission to include! n/n

Thomas Pethick @tmpethick
To explain the hyperplane approach I build up gradually: first with a very minimal proof for the proximal point method on nonmonotone problems to show why it does not suffice, then by introducing a relaxed version and showing how it still suffers with inexact proximal operators. 4/n
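
A minimal sketch of the contrast this thread builds on, on a standard bilinear toy problem (my choice, not from the thesis): the implicit proximal-point step contracts on a rotation operator where the explicit gradient step spirals out. The relaxed and inexact variants from the thread are omitted.

```python
import numpy as np

# Monotone toy operator F(x) = A x with A a rotation generator:
# explicit steps spiral outward, the implicit (proximal) step contracts.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
lam = 0.5                      # step size
I = np.eye(2)

x_prox = np.array([1.0, 1.0])  # proximal point iterate
x_fwd = np.array([1.0, 1.0])   # explicit (forward) iterate

for _ in range(100):
    # Proximal point: x+ solves x+ + lam*F(x+) = x, i.e. (I + lam*A) x+ = x.
    x_prox = np.linalg.solve(I + lam * A, x_prox)
    # Forward step: x+ = x - lam*F(x).
    x_fwd = x_fwd - lam * (A @ x_fwd)

print(np.linalg.norm(x_prox))  # ~0: contracts
print(np.linalg.norm(x_fwd))   # huge: diverges
```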

Thomas Pethick @tmpethick
My thesis is now accessible online! I've tried to make it the introduction to non-Euclidean methods and monotone operators that I wish I had when starting out. 1/n infoscience.epfl.ch/entities/publi…

Thomas Pethick @tmpethick
@DimitrisPapail @sytelus instead of benchmark granularity, maybe do matrix completion on a (data example, method) binary matrix and then greedy forward selection (this does not allow coming up with new examples but would simply mix from existing benchmarks)
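
A toy sketch of the suggestion, with made-up data and my own selection objective (matching the full benchmark's per-method scores); the matrix-completion step for missing entries is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.integers(0, 2, size=(200, 10))   # S[i, j] = 1 if method j solves example i

full_scores = S.mean(axis=0)             # per-method accuracy on the full benchmark
chosen = []
for _ in range(20):                      # greedily build a 20-example mini-benchmark
    best, best_err = None, np.inf
    for i in range(S.shape[0]):
        if i in chosen:
            continue
        sub_scores = S[chosen + [i]].mean(axis=0)
        err = np.abs(sub_scores - full_scores).max()  # worst-case deviation
        if err < best_err:
            best, best_err = i, err
    chosen.append(best)

print(chosen[:5], round(best_err, 3))
```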

Sarthak Mangla @msarthak29
How far can balancing an update matrix’s norms take you? @Abel_grg and I explored this by replacing Muon’s Newton-Schulz step with alternating row/col normalization. In our experiments, that alone is competitive with AdamW, despite no orthogonalization or second EMA tracking 🧵
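
A rough sketch of what the alternating row/column normalization could look like (my reconstruction from the tweet, not their code); a few Sinkhorn-style passes balance the update's row and column RMS norms:

```python
import numpy as np

def row_col_normalize(G, iters=5, eps=1e-8):
    """Alternately rescale rows then columns of the update toward unit RMS norm:
    balancing the matrix's norms instead of orthogonalizing it (no Newton-Schulz,
    no second EMA)."""
    X = G.copy()
    for _ in range(iters):
        X /= np.sqrt((X**2).mean(axis=1, keepdims=True)) + eps  # row pass
        X /= np.sqrt((X**2).mean(axis=0, keepdims=True)) + eps  # column pass
    return X

G = np.random.default_rng(0).normal(size=(4, 6))
U = row_col_normalize(G)
print((U**2).mean(axis=1))  # row mean-squares roughly balanced near 1
```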

Thomas Pethick @tmpethick
@typedfemale This also aligns well with what we observe in the work on Scion, where both RowNorm and Sign empirically work well for the last layer (both provide control of the linf norm of the output) arxiv.org/pdf/2502.07529
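
How I read the linf claim, as a sketch (the exact definitions in the Scion paper may differ): both updates bound how much any single logit can move.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4))       # gradient of a small unembedding (8 classes)
x = rng.normal(size=4)            # an input to the layer

# RowNorm: each class row becomes unit l2-norm, so any logit changes by
# at most ||x||_2 -> control of ||logits||_inf after an update.
rownorm = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-8)
# Sign: entries in {-1, +1}, so any logit changes by at most ||x||_1.
sign = np.sign(G)

print(np.abs(rownorm @ x).max() <= np.linalg.norm(x) + 1e-9)  # True
print(np.abs(sign @ x).max() <= np.abs(x).sum() + 1e-9)       # True
```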

Thomas Pethick @tmpethick
@typedfemale I think there is some evidence it has to do with the loss and classes being heavy-tailed. linf is a favorable geometry under logsumexp, and after all CE(softmax) = -linear + logsumexp. The heavy-tailed property was used specifically for CE(softmax) in arxiv.org/pdf/2512.00763
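
A quick numerical check of the identity in the tweet, CE(softmax(z), y) = -z_y + logsumexp(z):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10)             # logits
y = 3                               # target class

lse = np.log(np.exp(z).sum())       # logsumexp(z)
ce = -np.log(np.exp(z[y]) / np.exp(z).sum())  # CE of softmax(z) against class y
print(np.isclose(ce, -z[y] + lse))  # True: CE(softmax) = -linear + logsumexp
```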

typedfemale @typedfemale
it doesn't make sense to me why people don't apply muon to the unembed - it's not obvious to me why you shouldn't treat it as a normal linear layer... L1 norm does not seem like the right pick to me

Thomas Pethick @tmpethick
@AlexShtf @konstmish @kellerjordan0 also originally speedran Muon on cifar10 (CNN). The main thing to keep in mind is that the batch size needs to be sufficiently large across all of the settings (for batch size 1, Spectral = SGD).
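
A numerical check of the batch-size-1 remark (my construction): a single example's weight gradient in a linear layer is the rank-1 outer product g x^T, and orthogonalizing a rank-1 matrix only rescales it, so the spectral update points along the SGD direction.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)     # gradient w.r.t. the layer output (one example)
x = rng.normal(size=3)     # layer input (one example)
G = np.outer(g, x)         # batch-size-1 weight gradient: rank 1

# Spectral/orthogonalized update: U V^T from the SVD of G.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
r = s > 1e-10              # keep only the nonzero singular direction
spectral = U[:, r] @ Vt[r]

# For a rank-1 matrix this is just G rescaled: the same direction as SGD.
print(np.allclose(spectral, G / (np.linalg.norm(g) * np.linalg.norm(x))))
```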

Thomas Pethick @tmpethick
@AlexShtf @konstmish I would say that spectral-based methods work for a wide range of architectures (not just transformers). The stochastic spectral descent work was originally for Restricted Boltzmann Machines and graphical models, and the preconditioned version was tested on CNNs and MLPs

Tony S.F. @tonysilveti
@jasondeanlee so if K is 3 you would get, for singular values of [6,5,4,3,2,1], [1,5/6,4/6,0,0,0]
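
A sketch matching Tony's numbers: keep the top-K singular values and rescale by the largest (the helper name is mine, and I haven't checked the paper's exact normalization):

```python
import numpy as np

def topk_spectral_update(G, K):
    # Keep the top-K singular values, rescaled by the largest; zero the rest.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s_new = np.where(np.arange(len(s)) < K, s / s[0], 0.0)
    return U @ np.diag(s_new) @ Vt

G = np.diag([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])   # singular values [6,5,4,3,2,1]
up = topk_spectral_update(G, K=3)
print(np.linalg.svd(up, compute_uv=False))     # [1, 5/6, 4/6, 0, 0, 0]
```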

Tony S.F. @tonysilveti
A paper from 2023 about training neural networks using an update similar to Muon (keep the top-K singular values). It was eventually rejected (bad beat imo). It's a shame they didn't do more tests on K; maybe they would have seen it works best with just the spectral ball.

Jonas Hübotter @jonashubotter
Training LLMs with verifiable rewards uses a 1-bit signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback, and then turns it into dense supervision! (1/n)

サメQCU @sameQCU
@tmpethick owo is there a recording of the oral for this anywhere? i'm very interested in the spectral methods after poking around at gluon

Thomas Pethick @tmpethick
For anyone interested in understanding orthogonalization/spectral-based methods, here are the slides from our #neurips25 oral, which I tried to make more broadly about the topic.
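
For context, a minimal sketch of the orthogonalization step such methods use, via plain Newton-Schulz (Muon uses tuned polynomial coefficients; this is the textbook cubic):

```python
import numpy as np

def newton_schulz_orth(G, steps=15):
    # Scale so all singular values are <= 1, then iterate the cubic
    # X <- 1.5*X - 0.5*X X^T X, which pushes every singular value toward 1.
    X = G / (np.linalg.norm(G) + 1e-8)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.default_rng(0).normal(size=(4, 6))
print(np.linalg.svd(newton_schulz_orth(G), compute_uv=False).round(2))  # ~[1 1 1 1]
```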

Thomas Pethick @tmpethick
@damekdavis It also seems to possibly explain the benefit of Block-Periodic Orthogonalization by @akhaledv2 et al. Maybe try simply running n steps of GD/Adam and then Muon, alternately? Might also be interesting for @KwangjunA, who made me aware of the method today
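
A toy sketch of the alternation suggested in the tweet, with stand-in optimizers and a toy loss (the step functions, block length, and schedule are all my guesses, not the Block-Periodic Orthogonalization algorithm itself):

```python
import numpy as np

def adam_like_step(W, G):            # stand-in for a GD/Adam step
    return W - 0.01 * G

def muon_like_step(W, G):            # stand-in for a Muon/orthogonalized step
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - 0.01 * (U @ Vt)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
n = 5                                # block length: n steps of one, then the other
for step in range(40):
    G = W                            # gradient of the toy loss 0.5*||W||_F^2
    update = adam_like_step if (step // n) % 2 == 0 else muon_like_step
    W = update(W, G)
print(np.linalg.norm(W))             # shrinks under the alternating schedule
```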

Damek @damekdavis
New paper studies when spectral gradient methods (e.g., Muon) help in deep learning: 1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices have low stable rank. 2. We then explain why spectral methods can perform well despite this. Long thread
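
Stable rank here presumably means ||A||_F^2 / ||A||_2^2; a quick sketch of the quantity and why low values signal the ill-conditioning the tweet describes:

```python
import numpy as np

def stable_rank(A):
    # ||A||_F^2 / ||A||_2^2: a robust, scale-invariant rank proxy.
    s = np.linalg.svd(A, compute_uv=False)
    return (s**2).sum() / s[0]**2

rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=100), rng.normal(size=100))  # rank 1
low += 0.01 * rng.normal(size=(100, 100))                   # plus slight noise
print(stable_rank(low))                                     # close to 1
print(stable_rank(rng.normal(size=(100, 100))))             # much larger
```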