Nikita Breskanu (@breskanu) - Twitter Profile

Nikita Breskanu retweeted

Prime Intellect@PrimeIntellect·15 May

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

Nikita Breskanu@breskanu·3 May

It’s quite interesting that hyperball optimizers significantly underperform the regular ones at first and only overtake them late in training. Maybe it’s better to switch to hyperball only closer to the end?

Keller Jordan@kellerjordan0

New modded-NanoGPT optimization benchmark result: @wen_kaiyue has improved upon both the Muon and AdamW baselines, by replacing their weight decay with hyperball optimization. The new record is 3325 steps.

Nikita Breskanu@breskanu·3 May

There is also a “sequel” from 2016, “Information Geometry and Its Applications”, which covers more modern topics such as the natural gradient. Some of the results are interesting, but the main problem with applying the theory in practice is the singular Fisher matrix.

Nikita Breskanu@breskanu·3 May

I recently read an excellent book by S. Amari, “Methods of Information Geometry”. The book considers a statistical manifold and studies the so-called alpha-connections on it, which lead to many known quantities in statistics. It is also a great introduction to differential geometry!

Nikita Breskanu@breskanu·30 Apr

@xidulu @CV_novel_plume Technically that’s still one backward pass; you just don’t aggregate the gradients and use the full per-sample information instead. So I think it should be allowed.
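
The point about un-aggregated gradients can be illustrated on a toy least-squares model. This is only a sketch under assumptions: the benchmark itself trains a GPT in PyTorch, and the model, the data and the function name `per_sample_grads` are hypothetical. The per-sample gradients fall out of the same forward and backward quantities, and the usual batch gradient is just their mean.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Gradient of 0.5 * (x_i @ w - y_i)**2 for every sample i, shape (batch, dim)."""
    residual = X @ w - y            # one forward pass over the batch
    return residual[:, None] * X    # one backward pass, left un-aggregated

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))         # hypothetical toy batch
y = rng.normal(size=8)
w = rng.normal(size=3)

g_i = per_sample_grads(w, X, y)     # one gradient row per sample
g_mean = g_i.mean(axis=0)           # what a standard optimizer step sees
g_var = g_i.var(axis=0)             # extra information, e.g. per-coordinate variance
```

Whether the benchmark rules count this as an extra backward pass is exactly what the thread is debating; the sketch only shows that no second pass over the data is needed in this toy case.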

Xidulu@xidulu·30 Apr

@CV_novel_plume An obvious hack is to use per-sample gradients to get richer information. Not sure if that counts as multiple backward passes.

Yuxin Fang@CV_novel_plume·30 Apr

One caveat is that the rules do not allow changing the batch size, and they also seem to disallow extra forward/backward passes per step. That makes more precise Hessian or curvature-estimation methods hard to use, since they typically require extra probes, HVPs, or finite-difference-style evaluations. So unlimited wall-clock time helps for extra optimizer-side computation, but not for collecting richer curvature information.

Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark

Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.

To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: Just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.

The rules are simple: The optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (this is the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers which would need kernel work or larger scale to become wallclock-efficient.

Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast.

Beyond just improving the step count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.

This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.

To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.

Nikita Breskanu@breskanu·29 Apr

I expect SOAP to take the lead soon.

Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark

Nikita Breskanu@breskanu·22 Apr

@xidulu On convolutional networks I think it’s not better.

Xidulu@xidulu·22 Apr

Is Muon actually (provably) better than AdamW, or is it that "Muon gives you better loss under a fixed + finite tuning budget"?

Nikita Breskanu@breskanu·21 Apr

Also visualized all of the approximations on a small CNN on MNIST.
True Fisher | KFAC | Shampoo
Empirical Fisher | EKFAC | SOAP

Nikita Breskanu@breskanu·21 Apr

It is interesting that Adam in the KFAC eigenbasis performed better than Adam in the Shampoo eigenbasis (default SOAP). This suggests that the KFAC-style approximation may be better than the Shampoo one.

Nikita Breskanu@breskanu·21 Apr

fullfix.github.io/notes/2026/04/… A blog post on Fisher-based optimizers in DL, where I cover KFAC, EKFAC, Shampoo and SOAP and their connection to the Fisher Information Matrix. I also compare all of these optimizers against an AdamW baseline on shakespeare-char.

Nikita Breskanu@breskanu·18 Apr

fullfix.github.io/notes/2026/04/… Wrote a blog post on the main statistical properties of the Fisher Information Matrix. In the last section, I also briefly discuss overparameterization and why it leads to Fisher singularity.

Nikita Breskanu@breskanu·6 Apr

ChatGPT really needs to incorporate this. When a conversation gets long, the website starts lagging, probably because there is too much text.

Cheng Lou@_chenglou

My dear front-end developers (and anyone who’s interested in the future of interfaces): I have crawled through depths of hell to bring you, for the foreseeable years, one of the more important foundational pieces of UI engineering (if not in implementation then certainly at least in concept): Fast, accurate and comprehensive userland text measurement algorithm in pure TypeScript, usable for laying out entire web pages without CSS, bypassing DOM measurements and reflow

Nikita Breskanu@breskanu·4 Apr

Feels similar to pseudo-labeling from classical ML.

Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass @1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd

Nikita Breskanu@breskanu·3 Apr

Better to normalize by ||A||_F * ||B||_F

Nikita Breskanu@breskanu·3 Apr

Found out that the Frobenius norm of the commutator, ||AB - BA||_F, is a good measure of how close the eigenvectors of two symmetric matrices A and B are. At least it’s good when there are no repeated eigenvalues, which is typically true in practice.
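
A minimal numerical sketch of this measure, including the normalization by ||A||_F * ||B||_F suggested in the follow-up post. The test matrices and the names `commutator_score` and `random_orthogonal` are assumptions made for illustration. Symmetric matrices that share an eigenbasis commute, so the score is zero up to rounding; an unrelated eigenbasis gives a clearly positive score.

```python
import numpy as np

def commutator_score(A, B):
    """||AB - BA||_F / (||A||_F * ||B||_F); zero when A and B commute."""
    comm = A @ B - B @ A
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm
    return np.linalg.norm(comm) / (np.linalg.norm(A) * np.linalg.norm(B))

def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

rng = np.random.default_rng(0)
n = 5
Q = random_orthogonal(n, rng)
A = Q @ np.diag(np.arange(1.0, n + 1)) @ Q.T               # distinct eigenvalues
B_same = Q @ np.diag(rng.uniform(1, 2, size=n)) @ Q.T      # same eigenvectors as A
Q2 = random_orthogonal(n, rng)
B_other = Q2 @ np.diag(rng.uniform(1, 2, size=n)) @ Q2.T   # unrelated eigenvectors

score_same = commutator_score(A, B_same)     # zero up to rounding
score_other = commutator_score(A, B_other)   # clearly positive
```

The normalization makes the score invariant to rescaling either matrix, and by submultiplicativity of the Frobenius norm it can never exceed 2.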

Nikita Breskanu@breskanu·2 Apr

@YouJiacheng @Ji_Ha_Kim Thanks, I understand now. It's the maximum possible largest singular value given those 2 constraints.

You Jiacheng@YouJiacheng·2 Apr

@breskanu @Ji_Ha_Kim sure it's better. ||X||_F is a bound given 1 constraint: sum(σ^2)=||X||_F^2. this bound is a bound given 2 constraints: sum(σ^2)=||X||_F^2 AND sum(σ^4)=||X.T@X||_F^2 more constraints => better.
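
The thread does not spell out the bound itself, so the closed form below is an assumption rather than the formula under discussion: it is the largest value of σ_max² compatible with the two constraints sum(σ^2) = s and sum(σ^4) = q for n singular values, obtained by applying Cauchy-Schwarz to the remaining n - 1 values. The function name `two_moment_bound` is hypothetical. The sketch checks numerically that its square root lands between the spectral and Frobenius norms.

```python
import numpy as np

def two_moment_bound(X):
    """Upper bound on sigma_max(X)^2 from sum(sigma^2) and sum(sigma^4) only.

    Assumed closed form (not taken from the thread):
        sigma_max^2 <= s/n + sqrt((n - 1)/n * (q - s^2/n))
    with n singular values, s = ||X||_F^2 and q = ||X^T X||_F^2.
    """
    n = min(X.shape)
    G = X.T @ X
    s = np.trace(G)                    # sum of sigma^2
    q = np.sum(G * G)                  # sum of sigma^4
    slack = max(q - s * s / n, 0.0)    # non-negative in exact arithmetic
    return s / n + np.sqrt((n - 1) / n * slack)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
spectral = np.linalg.norm(X, 2)        # largest singular value
frobenius = np.linalg.norm(X, "fro")
bound = np.sqrt(two_moment_bound(X))   # expected: spectral <= bound <= frobenius
```

Both moments need only one matrix product and two reductions, so a bound of this kind is much cheaper than computing the spectral norm exactly.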

Ji-Ha@Ji_Ha_Kim·2 Apr

With a new polished and optimized implementation, 2 rational iterations achieve 70% speedup on my GPU over 5 Polar Express polynomial iterations in TF32 while attaining better quality on almost all cases!

Ji-Ha@Ji_Ha_Kim

Very cool! I worked on this recently, and I actually used an identical approach early on. But I believe there is a significantly better approach - a **single** minimax rational iteration can beat 5 polynomial steps!

Nikita Breskanu@breskanu·2 Apr

@Ji_Ha_Kim @YouJiacheng GPT says it's also always upper bounded by the Frobenius norm of X (I guess here G = X^TX). Then it's cool that this measure sits between the two norms: ||X||_2 <= sqrt(M(X)) <= ||X||_F, so it's better than Frobenius.

Ji-Ha@Ji_Ha_Kim·2 Apr

@YouJiacheng

Nikita Breskanu@breskanu·2 Apr

@runame_ Hm, that explains it. Then it’s alright. Probably the power 1/2 corresponds to an F^{-1} approximation and 1/4 to F^{-1/2}, so the more natural-gradient-like update is better.

Runa Eschenhagen@runame_·2 Apr

@breskanu We use grafting from Adam here, so the update’s scale is determined by Adam’s update scale.

Nikita Breskanu@breskanu·2 Apr

It's quite surprising to me that Shampoo with power 1/2 works and doesn't break, especially with such a large learning rate. For something like (GG^T)^{-1/2}M(G^TG)^{-1/2} the update should scale like 1/G, so I would expect it to be very unstable unless a very small lr is used.

Runa Eschenhagen@runame_

1/14 Is Muon “better” than Shampoo? We argue that their relationship parallels Adam's relationship with Signum. Analogous to @lukas_balles and Hennig’s (2018) decomposition of Adam into element-wise scaled Signum, we can decompose Shampoo as left- and right-adapted Muon.
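
The scale argument in the post above can be checked numerically. This is a simplified sketch under stated assumptions: a single step with M = G, no accumulation, damping or grafting, and a square G with hand-picked singular values so that both factors are invertible. The helper names `inv_power` and `shampoo_like_update` are hypothetical. With G = UΣV^T, power 1/2 gives UΣ^{-1}V^T, which shrinks as G grows, while power 1/4 gives UV^T, which is scale-free.

```python
import numpy as np

def inv_power(S, p):
    """S^{-p} for a symmetric positive definite S via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** (-p)) @ vecs.T

def shampoo_like_update(G, M, p):
    """(G G^T)^{-p} M (G^T G)^{-p}, without accumulation, damping or grafting."""
    return inv_power(G @ G.T, p) @ M @ inv_power(G.T @ G, p)

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
sigma = np.array([3.0, 2.0, 1.0, 0.5])        # assumed singular values of G
G = U @ np.diag(sigma) @ V.T

half = shampoo_like_update(G, G, 0.5)                  # U diag(1/sigma) V^T
half_scaled = shampoo_like_update(10 * G, 10 * G, 0.5) # ten times smaller
quarter = shampoo_like_update(G, G, 0.25)              # U V^T

sv_half = np.linalg.svd(half, compute_uv=False)
sv_quarter = np.linalg.svd(quarter, compute_uv=False)
```

In this stripped-down setting the 1/G scaling of the power-1/2 update is real, which is consistent with the reply in the thread that the update's scale is set by grafting from Adam rather than by the preconditioner itself.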

Nikita Breskanu@breskanu·1 Apr

@thomasmassena Thx, will check

Thomas Massena@thomasmassena·1 Apr

@breskanu You can check out the Turbo-Muon and Chebyshev Accelerated NS paper for this.

Nikita Breskanu@breskanu·31 Mar

Standard Muon takes X0 = G / ||G||_F. It feels like normalizing by the spectral norm ||G||_2 may be better than Frobenius: it keeps the singular values in the range [0, 1] needed for convergence, but spreads them more widely across it.
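
A small sketch comparing the two normalizations on a random matrix. The Gaussian matrix is an assumption standing in for a real gradient, and the variable names are illustrative. Both choices keep every singular value in [0, 1]; the spectral one maps the largest to exactly 1 and scales all the others up by the same factor ||G||_F / ||G||_2.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))                 # stand-in for a gradient matrix

sv = np.linalg.svd(G, compute_uv=False)       # singular values, descending
fro = np.linalg.norm(G, "fro")
spec = np.linalg.norm(G, 2)                   # largest singular value

sv_fro = sv / fro                             # spectrum of X0 = G / ||G||_F
sv_spec = sv / spec                           # spectrum of X0 = G / ||G||_2

# Factor by which spectral normalization lifts every singular value
# relative to Frobenius normalization; always at least 1.
ratio = fro / spec
```

One practical caveat is that the exact spectral norm is more expensive to obtain than the Frobenius norm, so an implementation would likely need an estimate or a cheap upper bound.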
