Thomas Pethick
@tmpethick
124 posts
Joined July 2011
85 Following · 207 Followers

Thomas Pethick @tmpethick
codex feature request: being able to continue a chatgpt conversation in codex / import a chatgpt conversation into a codex session would be extremely convenient @thsottiaux (even simply an "export to codex" button in chatgpt)

Thomas Pethick @tmpethick
I close with a bunch of open questions/directions I find interesting, both very technical and high level, which might appeal to some; and if not, you can take a look at the graphics from Hundertwasser/Escher that I managed to get permission to include! n/n

Thomas Pethick @tmpethick
To explain the hyperplane approach I build up gradually: first with a very minimal proof for the proximal point method on nonmonotone problems to show why it does not suffice, then by introducing a relaxed version and showing how it still suffers with inexact proximal operators. 4/n
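
A minimal sketch of the contrast this thread builds on, on a standard bilinear toy problem (my choice, not from the thesis): the implicit proximal-point step contracts on a rotation operator where the explicit gradient step spirals out. The relaxed and inexact variants from the thread are omitted.

```python
import numpy as np

# Monotone toy operator F(x) = A x with A a rotation generator:
# explicit steps spiral outward, the implicit (proximal) step contracts.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
lam = 0.5                      # step size
I = np.eye(2)

x_prox = np.array([1.0, 1.0])  # proximal point iterate
x_fwd = np.array([1.0, 1.0])   # explicit (forward) iterate

for _ in range(100):
    # Proximal point: x+ solves x+ + lam*F(x+) = x, i.e. (I + lam*A) x+ = x.
    x_prox = np.linalg.solve(I + lam * A, x_prox)
    # Forward step: x+ = x - lam*F(x).
    x_fwd = x_fwd - lam * (A @ x_fwd)

print(np.linalg.norm(x_prox))  # ~0: contracts
print(np.linalg.norm(x_fwd))   # huge: diverges
```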

Thomas Pethick @tmpethick
My thesis is now accessible online! I've tried to make it the introduction to non-Euclidean methods and monotone operators that I wish I had when starting out. 1/n infoscience.epfl.ch/entities/publi…

Thomas Pethick @tmpethick
@DimitrisPapail @sytelus instead of benchmark granularity, maybe do matrix completion on a (data example, method) binary matrix and then greedy forward selection (this does not allow coming up with new examples but would simply mix from existing benchmarks)
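
A toy sketch of the suggestion, with made-up data and my own selection objective (matching the full benchmark's per-method scores); the matrix-completion step for missing entries is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.integers(0, 2, size=(200, 10))   # S[i, j] = 1 if method j solves example i

full_scores = S.mean(axis=0)             # per-method accuracy on the full benchmark
chosen = []
for _ in range(20):                      # greedily build a 20-example mini-benchmark
    best, best_err = None, np.inf
    for i in range(S.shape[0]):
        if i in chosen:
            continue
        sub_scores = S[chosen + [i]].mean(axis=0)
        err = np.abs(sub_scores - full_scores).max()  # worst-case deviation
        if err < best_err:
            best, best_err = i, err
    chosen.append(best)

print(chosen[:5], round(best_err, 3))
```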

Sarthak Mangla @msarthak29
How far can balancing an update matrix’s norms take you? @Abel_grg and I explored this by replacing Muon’s Newton-Schulz step with alternating row/col normalization. In our experiments, that alone is competitive with AdamW, despite no orthogonalization or second EMA tracking 🧵
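
A rough sketch of what the alternating row/column normalization could look like (my reconstruction from the tweet, not their code); a few Sinkhorn-style passes balance the update's row and column RMS norms:

```python
import numpy as np

def row_col_normalize(G, iters=5, eps=1e-8):
    """Alternately rescale rows then columns of the update toward unit RMS norm:
    balancing the matrix's norms instead of orthogonalizing it (no Newton-Schulz,
    no second EMA)."""
    X = G.copy()
    for _ in range(iters):
        X /= np.sqrt((X**2).mean(axis=1, keepdims=True)) + eps  # row pass
        X /= np.sqrt((X**2).mean(axis=0, keepdims=True)) + eps  # column pass
    return X

G = np.random.default_rng(0).normal(size=(4, 6))
U = row_col_normalize(G)
print((U**2).mean(axis=1))  # row mean-squares roughly balanced near 1
```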

Thomas Pethick @tmpethick
@typedfemale This also aligns well with what we observe in the work on Scion, where both RowNorm and Sign empirically work well for the last layer (both provide control of the linf norm of the output) arxiv.org/pdf/2502.07529
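
How I read the linf claim, as a sketch (the exact definitions in the Scion paper may differ): both updates bound how much any single logit can move.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4))       # gradient of a small unembedding (8 classes)
x = rng.normal(size=4)            # an input to the layer

# RowNorm: each class row becomes unit l2-norm, so any logit changes by
# at most ||x||_2 -> control of ||logits||_inf after an update.
rownorm = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-8)
# Sign: entries in {-1, +1}, so any logit changes by at most ||x||_1.
sign = np.sign(G)

print(np.abs(rownorm @ x).max() <= np.linalg.norm(x) + 1e-9)  # True
print(np.abs(sign @ x).max() <= np.abs(x).sum() + 1e-9)       # True
```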

Thomas Pethick @tmpethick
@typedfemale I think there is some evidence it has to do with the loss and classes being heavy-tailed. linf is a favorable geometry under logsumexp, and after all CE(softmax) = -linear + logsumexp. The heavy-tailed property was used specifically for CE(softmax) in arxiv.org/pdf/2512.00763
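
A quick numerical check of the identity in the tweet, CE(softmax(z), y) = -z_y + logsumexp(z):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10)             # logits
y = 3                               # target class

lse = np.log(np.exp(z).sum())       # logsumexp(z)
ce = -np.log(np.exp(z[y]) / np.exp(z).sum())  # CE of softmax(z) against class y
print(np.isclose(ce, -z[y] + lse))  # True: CE(softmax) = -linear + logsumexp
```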

typedfemale @typedfemale
it doesn't make sense to me why people don't apply muon to the unembed - it's not obvious to me why you shouldn't treat it as a normal linear layer... L1 norm does not seem like the right pick to me

Thomas Pethick @tmpethick
@AlexShtf @konstmish @kellerjordan0 also originally speedran Muon on cifar10 (CNN). The main thing to keep in mind is that the batch size needs to be sufficiently large across all of the settings (for batch size 1, Spectral = SGD).
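
A numerical check of the batch-size-1 remark (my construction): a single example's weight gradient in a linear layer is the rank-1 outer product g x^T, and orthogonalizing a rank-1 matrix only rescales it, so the spectral update points along the SGD direction.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)     # gradient w.r.t. the layer output (one example)
x = rng.normal(size=3)     # layer input (one example)
G = np.outer(g, x)         # batch-size-1 weight gradient: rank 1

# Spectral/orthogonalized update: U V^T from the SVD of G.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
r = s > 1e-10              # keep only the nonzero singular direction
spectral = U[:, r] @ Vt[r]

# For a rank-1 matrix this is just G rescaled: the same direction as SGD.
print(np.allclose(spectral, G / (np.linalg.norm(g) * np.linalg.norm(x))))
```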

Thomas Pethick @tmpethick
@AlexShtf @konstmish I would say that spectral-based methods work for a wide range of architectures (not just transformers). The stochastic spectral descent work was originally for Restricted Boltzmann Machines and graphical models, and the preconditioned version was tested on CNNs and MLPs

Tony S.F. @tonysilveti
@jasondeanlee so if K is 3 you would get, for singular values of [6,5,4,3,2,1], [1,5/6,4/6,0,0,0]
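
A sketch matching Tony's numbers: keep the top-K singular values and rescale by the largest (the helper name is mine, and I haven't checked the paper's exact normalization):

```python
import numpy as np

def topk_spectral_update(G, K):
    # Keep the top-K singular values, rescaled by the largest; zero the rest.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s_new = np.where(np.arange(len(s)) < K, s / s[0], 0.0)
    return U @ np.diag(s_new) @ Vt

G = np.diag([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])   # singular values [6,5,4,3,2,1]
up = topk_spectral_update(G, K=3)
print(np.linalg.svd(up, compute_uv=False))     # [1, 5/6, 4/6, 0, 0, 0]
```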

Tony S.F. @tonysilveti
A paper from 2023 about training neural networks using an update similar to Muon (keep the top-K singular values). It was eventually rejected (bad beat imo). It's a shame they didn't do more tests on K; maybe they would have seen it works best with just the spectral ball.

Jonas Hübotter @jonashubotter
Training LLMs with verifiable rewards uses a 1-bit signal per generated response. This hides why the model failed. Today, we introduce a simple algorithm that enables the model to learn from any rich feedback, and then turns it into dense supervision! (1/n)

サメQCU @sameQCU
@tmpethick owo is there a recording of the oral for this anywhere? i'm very interested in the spectral methods after poking around at gluon

Thomas Pethick @tmpethick
For anyone interested in understanding orthogonalization/spectral-based methods, here are the slides from our #neurips25 oral, which I tried to make more broadly about the topic.
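
For context, a minimal sketch of the orthogonalization step such methods use, via plain Newton-Schulz (Muon uses tuned polynomial coefficients; this is the textbook cubic):

```python
import numpy as np

def newton_schulz_orth(G, steps=15):
    # Scale so all singular values are <= 1, then iterate the cubic
    # X <- 1.5*X - 0.5*X X^T X, which pushes every singular value toward 1.
    X = G / (np.linalg.norm(G) + 1e-8)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.random.default_rng(0).normal(size=(4, 6))
print(np.linalg.svd(newton_schulz_orth(G), compute_uv=False).round(2))  # ~[1 1 1 1]
```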

Thomas Pethick @tmpethick
@damekdavis It also seems to possibly explain the benefit of Block-Periodic Orthogonalization by @akhaledv2 et al. Maybe try simply running n steps of GD/Adam and then Muon, alternately? Might also be interesting for @KwangjunA, who made me aware of the method today
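
A toy sketch of the alternation suggested in the tweet, with stand-in optimizers and a toy loss (the step functions, block length, and schedule are all my guesses, not the Block-Periodic Orthogonalization algorithm itself):

```python
import numpy as np

def adam_like_step(W, G):            # stand-in for a GD/Adam step
    return W - 0.01 * G

def muon_like_step(W, G):            # stand-in for a Muon/orthogonalized step
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - 0.01 * (U @ Vt)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
n = 5                                # block length: n steps of one, then the other
for step in range(40):
    G = W                            # gradient of the toy loss 0.5*||W||_F^2
    update = adam_like_step if (step // n) % 2 == 0 else muon_like_step
    W = update(W, G)
print(np.linalg.norm(W))             # shrinks under the alternating schedule
```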

Damek @damekdavis
New paper studies when spectral gradient methods (e.g., Muon) help in deep learning: 1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices have low stable rank. 2. We then explain why spectral methods can perform well despite this. Long thread
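
Stable rank here presumably means ||A||_F^2 / ||A||_2^2; a quick sketch of the quantity and why low values signal the ill-conditioning the tweet describes:

```python
import numpy as np

def stable_rank(A):
    # ||A||_F^2 / ||A||_2^2: a robust, scale-invariant rank proxy.
    s = np.linalg.svd(A, compute_uv=False)
    return (s**2).sum() / s[0]**2

rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=100), rng.normal(size=100))  # rank 1
low += 0.01 * rng.normal(size=(100, 100))                   # plus slight noise
print(stable_rank(low))                                     # close to 1
print(stable_rank(rng.normal(size=(100, 100))))             # much larger
```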