Nilin

49 posts

Nilin

@nilinabra

RL enjoyer | MIT PhD | prev Simons Institute | quant

Berkeley, CA · Joined March 2025
255 Following · 108 Followers
Nilin @nilinabra
@TianyuPang327 Very cool, and similar for sure. Added citation and happy to talk more
Tianyu Pang @TianyuPang327
Hi Nilin, Thanks for sharing these nice results on Contra-Muon and Power-Muon. They are really interesting. I also wanted to share our recent ACL paper, HTMuon: arxiv.org/abs/2603.10067. Motivated by Heavy-Tailed Self-Regularization theory, we also consider a generalized update of the form U\Sigma^p V^T which is similar to Power-Muon. We further prove that, under the LMO framework, HTMuon corresponds to steepest descent under a Schatten-q norm constraint, which generalizes Muon as steepest descent under the Schatten-∞ norm constraint. Recently, we have also been exploring faster numerical methods for computing the HTMuon update matrix for arbitrary p. I would be very happy to discuss this with you and exchange ideas!
Nilin @nilinabra
Contra-Muon: Converge faster by going in the *opposite* direction of SGD. Replace Muon = NS(g) with NS(g) - (c/2)·g/||g||_op, where g is the momentum gradient estimate and ||·||_op is the operator norm. github.com/nilin/contra-m…
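The update above can be sketched in a few lines of NumPy, replacing the Newton-Schulz iteration NS(g) with an exact SVD orthogonalization (the fixed point NS approximates). This is an illustrative sketch, not the repo's actual code; the function names and the default c are assumptions:

```python
import numpy as np

def orthogonalize(g):
    """Exact orthogonalization U V^T -- the fixed point that Muon's
    Newton-Schulz iteration NS(g) approximates."""
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def contra_muon_update(g, c=0.5):
    """Contra-Muon direction: NS(g) - (c/2) * g / ||g||_op,
    where ||g||_op is the largest singular value of g."""
    op_norm = np.linalg.norm(g, ord=2)  # ord=2 on a matrix = top singular value
    return orthogonalize(g) - (c / 2) * g / op_norm
```

On the spectrum this maps each singular value sigma of g to 1 - (c/2)·sigma/sigma_max, so the top direction is suppressed the most, matching the "suppress top SVs" reading later in the thread.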
Nilin @nilinabra
@YouJiacheng I should fix the wording but here I intended "small relative to the top value" whereas I think you're talking about the mid/small values in a quantile sense
You Jiacheng @YouJiacheng
I think it's more like suppressing top SVs (or boosting mid SVs; I think bottom SVs are kinda noisy). This principle seems to be validated by at least 3 different optimizers: 1. Spectra (2602.11185; set TopK to RMS of rest SVs), 2. Contra-Muon, 3. Freon (2605.1118)
Nilin @nilinabra

Contra-Muon: Converge faster by going *opposite* of SGD. Replace Muon=NS(g) with NS(g) - c/2*g/||g||_op where g is the momentum gradient estimate and ||_op is the operator norm. github.com/nilin/contra-m…

Nilin @nilinabra
Another challenge was that conversations often don't have many pauses long enough to interject, and the device doesn't know how long a pause will last. So the earpiece would often talk over someone, which was distracting for the user
Nilin @nilinabra
The biggest challenge was calibrating when to speak. It would come up with many useful hints, but it was hard to avoid a constant stream of comments that would overwhelm the user. Our solution was a reranker that scored the recently proposed hints: the device speaks only when the latest proposed hint ranks best.
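The gating rule described above (speak only when the newest hint outranks the recent ones) can be sketched as a tiny stateful function. `make_gate` and the window size are illustrative assumptions, not the actual device code:

```python
from collections import deque

def make_gate(window=5):
    """Return a gating function that remembers the scores of the last
    `window` proposed hints and fires only when the newest one ranks best."""
    recent = deque(maxlen=window)
    def propose(score):
        recent.append(score)
        # Speak only if the latest proposed hint is the top-ranked one
        return score >= max(recent)
    return propose
```

The sliding window means an old high-scoring hint eventually ages out, so the device isn't silenced forever by one early strong suggestion.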
Nilin @nilinabra
More generally, the device would be useful whenever there's an information asymmetry and you need to make a decision. Another potential use case was to help understand social cues, for people with ASD.
Nilin @nilinabra
One thing I like about hyperball optimization is that it doesn't mess with the existing weights the way weight decay does. But the manifold constraint is a little strict for my liking; I prefer to just shrink the outward component of the gradient update. A similar existing method is AdamP, which sets this component to 0.
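A minimal NumPy sketch of the idea, assuming weights and updates are flattened vectors; `shrink_radial` and the `shrink` factor are illustrative names, not an existing API. With shrink=0 it reduces to the AdamP-style projection mentioned in the tweet:

```python
import numpy as np

def shrink_radial(w, update, shrink=0.5):
    """Shrink only the component of `update` along the weight direction w.
    shrink=1 leaves the update unchanged; shrink=0 removes the radial
    (outward/inward) component entirely, as AdamP does."""
    w_dir = w / np.linalg.norm(w)
    radial = np.dot(update, w_dir) * w_dir  # component along w
    tangential = update - radial            # component orthogonal to w
    return tangential + shrink * radial
```

Unlike a hard manifold constraint, this keeps the weight norm only loosely controlled: radial motion is damped rather than forbidden.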
Nilin @nilinabra
@kellerjordan0 I'm experimenting with different functions of the singular values. One interpretation of Contra-Muon is that the top directions are easy and will be learned anyway, so we underweight them in favor of learning more novel directions.
Nilin @nilinabra
@kellerjordan0 I also tried power functions of the spectrum. Slightly negative powers worked terribly, so we want to boost intermediate singular values, not the bottom ones
Nilin retweeted

Keller Jordan @kellerjordan0
Modded-NanoGPT optimization result #11: @nilinabra has achieved a new record of 3225 steps (-25) via a novel technique dubbed Contra-Muon, in which top SVD components are somewhat suppressed. This result builds on #9.
Nilin @nilinabra
We can also take negative powers of the singular values to boost smaller singular modes even further
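The family of spectrum functions discussed in this thread (Muon's orthogonalization at p=0, positive and negative powers of the singular values) can be sketched with one SVD-based helper; `spectral_power_update` and the eps floor are assumptions for illustration, not the actual experiment code:

```python
import numpy as np

def spectral_power_update(g, p=0.0, eps=1e-8):
    """Apply sigma -> sigma**p to the singular values of g.
    p=0 gives Muon-style orthogonalization U V^T; p<0 boosts small
    singular modes (eps floors the spectrum for numerical safety)."""
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    return u @ np.diag(np.maximum(s, eps) ** p) @ vt
```

p=1 recovers the raw gradient, p=0 recovers Muon, and the thread's observation is that mildly negative p (over-boosting the noisy bottom modes) hurts, which is why boosting intermediate modes is preferred.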
Nilin @nilinabra
Here 0 < c ≤ 1. When c = 1, leading singular modes contribute an equal amount to the loss delta.
Nilin retweeted

Keller Jordan @kellerjordan0
New modded-NanoGPT optimization benchmark result: @nilinabra and Ali Naeimi have found a hyperparameter improvement for the Muon baseline, increasing weight decay from 0.0125 to 0.025. The baseline now runs in 3375 steps (-125).