Nilin

49 posts

Nilin

@nilinabra

RL enjoyer | MIT PhD | prev Simons Institute | quant

Berkeley, CA · Joined March 2025
255 Following · 108 Followers
Nilin @nilinabra
@TianyuPang327 Very cool, and similar for sure. Added citation and happy to talk more
Tianyu Pang @TianyuPang327
Hi Nilin, Thanks for sharing these nice results on Contra-Muon and Power-Muon. They are really interesting. I also wanted to share our recent ACL paper, HTMuon: arxiv.org/abs/2603.10067. Motivated by Heavy-Tailed Self-Regularization theory, we also consider a generalized update of the form U\Sigma^p V^T which is similar to Power-Muon. We further prove that, under the LMO framework, HTMuon corresponds to steepest descent under a Schatten-q norm constraint, which generalizes Muon as steepest descent under the Schatten-∞ norm constraint. Recently, we have also been exploring faster numerical methods for computing the HTMuon update matrix for arbitrary p. I would be very happy to discuss this with you and exchange ideas!
Nilin @nilinabra
Contra-Muon: Converge faster by going in the *opposite* direction of SGD. Replace Muon = NS(g) with NS(g) - (c/2)·g/||g||_op, where g is the momentum gradient estimate and ||·||_op is the operator norm. github.com/nilin/contra-m…
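The update above can be sketched in a few lines of NumPy, replacing the Newton-Schulz iteration NS(g) with an exact SVD orthogonalization (the fixed point NS approximates). This is an illustrative sketch, not the repo's actual code; the function names and the default c are assumptions:

```python
import numpy as np

def orthogonalize(g):
    """Exact orthogonalization U V^T -- the fixed point that Muon's
    Newton-Schulz iteration NS(g) approximates."""
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def contra_muon_update(g, c=0.5):
    """Contra-Muon direction: NS(g) - (c/2) * g / ||g||_op,
    where ||g||_op is the largest singular value of g."""
    op_norm = np.linalg.norm(g, ord=2)  # ord=2 on a matrix = top singular value
    return orthogonalize(g) - (c / 2) * g / op_norm
```

On the spectrum this maps each singular value sigma of g to 1 - (c/2)·sigma/sigma_max, so the top direction is suppressed the most, matching the "suppress top SVs" reading later in the thread.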
Nilin @nilinabra
@YouJiacheng I should fix the wording but here I intended "small relative to the top value" whereas I think you're talking about the mid/small values in a quantile sense
You Jiacheng @YouJiacheng
I think it's more like suppressing top SVs (or boosting mid SVs; I think bottom SVs are kinda noisy). This principle seems to be validated by at least 3 different optimizers: 1. Spectra (2602.11185; set TopK to RMS of rest SVs), 2. Contra-Muon, 3. Freon (2605.1118)
Nilin @nilinabra

Contra-Muon: Converge faster by going *opposite* of SGD. Replace Muon=NS(g) with NS(g) - c/2*g/||g||_op where g is the momentum gradient estimate and ||_op is the operator norm. github.com/nilin/contra-m…

Nilin @nilinabra
Another challenge was that conversations often don't have many pauses long enough to interject, and the device doesn't know how long a pause will last. So the earpiece would often talk over someone, which was distracting for the user
Nilin @nilinabra
The biggest challenge was calibrating when to speak. It would come up with many useful hints, but it was hard to avoid a constant stream of comments that would overwhelm the user. Our solution was a reranker that scored the recently proposed hints: the device speaks only when the latest proposed hint ranks best.
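The gating rule described above (speak only when the newest hint outranks the recent ones) can be sketched as a tiny stateful function. `make_gate` and the window size are illustrative assumptions, not the actual device code:

```python
from collections import deque

def make_gate(window=5):
    """Return a gating function that remembers the scores of the last
    `window` proposed hints and fires only when the newest one ranks best."""
    recent = deque(maxlen=window)
    def propose(score):
        recent.append(score)
        # Speak only if the latest proposed hint is the top-ranked one
        return score >= max(recent)
    return propose
```

The sliding window means an old high-scoring hint eventually ages out, so the device isn't silenced forever by one early strong suggestion.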
Nilin @nilinabra
More generally, the device would be useful whenever there's an information asymmetry and you need to make a decision. Another potential use case was to help understand social cues, for people with ASD.
Nilin @nilinabra
One thing I like about hyperball optimization is that it doesn't mess with the existing weights the way weight decay does. But the manifold constraint is a little strict for my liking; I prefer to just shrink the outward component of the gradient update. A similar existing method is AdamP, which sets this component to 0.
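A minimal NumPy sketch of the idea, assuming weights and updates are flattened vectors; `shrink_radial` and the `shrink` factor are illustrative names, not an existing API. With shrink=0 it reduces to the AdamP-style projection mentioned in the tweet:

```python
import numpy as np

def shrink_radial(w, update, shrink=0.5):
    """Shrink only the component of `update` along the weight direction w.
    shrink=1 leaves the update unchanged; shrink=0 removes the radial
    (outward/inward) component entirely, as AdamP does."""
    w_dir = w / np.linalg.norm(w)
    radial = np.dot(update, w_dir) * w_dir  # component along w
    tangential = update - radial            # component orthogonal to w
    return tangential + shrink * radial
```

Unlike a hard manifold constraint, this keeps the weight norm only loosely controlled: radial motion is damped rather than forbidden.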
Nilin @nilinabra
@kellerjordan0 I'm experimenting with different functions of the singular values. One interpretation of Contra-Muon is that the top directions are easy and will be learned anyway, so we underweight them in favor of learning more novel directions.
Nilin @nilinabra
@kellerjordan0 I also tried power functions of the spectrum. Slightly negative powers worked terribly, so we want to boost intermediate singular values, not the bottom ones
Nilin retweeted

Keller Jordan @kellerjordan0
Modded-NanoGPT optimization result #11: @nilinabra has achieved a new record of 3225 steps (-25) via a novel technique dubbed Contra-Muon, in which top SVD components are somewhat suppressed. This result builds on #9.
Nilin @nilinabra
We can also take negative powers of the singular values to boost smaller singular modes even further
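The family of spectrum functions discussed in this thread (Muon's orthogonalization at p=0, positive and negative powers of the singular values) can be sketched with one SVD-based helper; `spectral_power_update` and the eps floor are assumptions for illustration, not the actual experiment code:

```python
import numpy as np

def spectral_power_update(g, p=0.0, eps=1e-8):
    """Apply sigma -> sigma**p to the singular values of g.
    p=0 gives Muon-style orthogonalization U V^T; p<0 boosts small
    singular modes (eps floors the spectrum for numerical safety)."""
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    return u @ np.diag(np.maximum(s, eps) ** p) @ vt
```

p=1 recovers the raw gradient, p=0 recovers Muon, and the thread's observation is that mildly negative p (over-boosting the noisy bottom modes) hurts, which is why boosting intermediate modes is preferred.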
Nilin @nilinabra
Here 0 < c ≤ 1. When c = 1, leading singular modes contribute an equal amount to the loss delta.
Nilin retweeted

Keller Jordan @kellerjordan0
New modded-NanoGPT optimization benchmark result: @nilinabra and Ali Naeimi have found a hyperparameter improvement for the Muon baseline, increasing weight decay from 0.0125 to 0.025. The baseline now runs in 3375 steps (-125).