Nikita Breskanu

68 posts

@breskanu

First year master's degree student at Constructor University, Bremen. Interested in DL research.

Bremen · Joined January 2026
137 Following · 14 Followers
Nikita Breskanu retweeted
Prime Intellect
Prime Intellect@PrimeIntellect·
Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.
Nikita Breskanu
Nikita Breskanu@breskanu·
There is also a “sequel” written in 2016, “Information Geometry and Its Applications”, which covers more modern topics such as the natural gradient. Although some of the results are interesting, the main problem with applying the theory in practice is the singular Fisher matrix.
Nikita Breskanu
Nikita Breskanu@breskanu·
I recently read an excellent book by S. Amari: “Methods of Information Geometry”. Basically, the book considers a statistical manifold and studies the so-called alpha-connections on it, which lead to a lot of known quantities in statistics. It is also a great introduction to differential geometry!
Nikita Breskanu
Nikita Breskanu@breskanu·
@xidulu @CV_novel_plume Technically that’s still one backward pass; you just don’t aggregate the gradients, so you use the full per-sample information. So I think it should be allowed.
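To make the per-sample point concrete, here is a minimal NumPy sketch. It assumes a toy linear least-squares model (hypothetical, not the benchmark's code): the per-example gradients are what a backward pass produces before averaging, and the usual batch gradient is simply their mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3                      # toy batch size and parameter count (assumed values)
X = rng.normal(size=(n, d))      # inputs
y = rng.normal(size=n)           # targets
w = rng.normal(size=d)           # parameters of the toy model f(x) = w @ x

residual = X @ w - y             # forward pass, shape (n,)

# Gradient of 0.5 * residual_i**2 with respect to w is residual_i * x_i.
per_sample_grads = residual[:, None] * X        # shape (n, d), one row per example

# The usual aggregated gradient of the mean loss.
batch_grad = X.T @ residual / n                 # shape (d,)

# Information the mean discards, e.g. the per-coordinate gradient variance.
grad_variance = per_sample_grads.var(axis=0)
```

Averaging `per_sample_grads` over the batch axis recovers `batch_grad`, so keeping the rows separate gives extra statistics without another forward or backward pass in this toy setting.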
Xidulu
Xidulu@xidulu·
@CV_novel_plume An obvious hack is to utilize per-sample gradient to get richer information. Not sure if that counts as multiple backward
Yuxin Fang
Yuxin Fang@CV_novel_plume·
One caveat is that the rules do not allow changing the batch size, and they also seem to disallow extra forward/backward passes per step. That makes more precise Hessian or curvature-estimation methods hard to use, since they typically require extra probes, HVPs, or finite-difference-style evaluations. So unlimited wall-clock time helps for extra optimizer-side computation, but not for collecting richer curvature information.
Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark

Hundreds of neural network optimizers have been proposed in the literature, recently including dozens citing Muon: MARS, SWAN, REG, ADANA, Newton-Muon, TrasMuon, AdaMuon, HTMuon, COSMOS, Conda, ASGO, SAGE, and Magma, to name a few. The majority of this innovation is happening in the public research community. But the community currently lacks a widely accepted, easily accessible way to compare and make sense of the deluge of methods. As a result, promising new ideas get buried, and spurious results go unchallenged.

To help address these issues, I'm releasing a new optimization benchmark. It's designed for maximum simplicity and speed: Just a single file containing ~350 lines of plain PyTorch, which can complete a baseline LM training within 20 minutes of booting up a fresh 8xH100 machine. It also works with {1,2,4}xH100 or A100. These attributes make the new benchmark more accessible than any prior work.

The rules are simple: The optimization algorithm can be changed arbitrarily, with the goal being to minimize the number of training steps needed to reach 3.28 val loss on FineWeb (this is the same target loss as in the main speedrun). Modifying the architecture or dataloader, on the other hand, is not allowed. Wallclock time is unlimited, in order to give a fair chance to optimizers which would need kernel work or larger scale to become wallclock-efficient.

Like the main NanoGPT speedrun, submissions are open, and new results will be publicly broadcast. Beyond just improving the step count record, another goal of the benchmark is to collaboratively produce well-tuned baselines for as many optimizers as possible. For example, any improvement to the benchmark's best hyperparameters for AdamW would be considered a worthwhile new result.

This benchmark is not intended to be the final measure of optimizer quality across all domains. Convenient shared experimental infrastructure which covers the full space of possibilities -- across varying batch size, tokens per parameter, model scale, epoch count, and architecture -- is desirable, but far beyond the current status quo. This benchmark is only meant to be one step towards that goal.

To start the benchmark off, I've spent ~20 runs tuning baselines for Muon and AdamW. From time to time over the next few weeks, I'll add another optimizer from the literature, with my best effort at finding good hyperparameters. Researchers interested in neural network optimization are invited to join in by picking an optimizer and giving it a try on the benchmark. All optimizers are welcome, and even runs that don't necessarily have the best hyperparameters are desirable additions to the repo, because each new run adds to the collective knowledge.

Nikita Breskanu
Nikita Breskanu@breskanu·
I expect SOAP to be in the lead soon.
Keller Jordan@kellerjordan0

Modded-NanoGPT Optimization Benchmark (the same announcement, quoted in full above)

Xidulu
Xidulu@xidulu·
Is Muon actually (provably) better than AdamW, or is it that "Muon gives you better loss under a fixed + finite tuning budget"?
Nikita Breskanu
Nikita Breskanu@breskanu·
I also visualized all of the approximations for a small CNN on MNIST. Panels: True Fisher | KFAC | Shampoo; Empirical Fisher | EKFAC | SOAP.
Nikita Breskanu
Nikita Breskanu@breskanu·
It is interesting that Adam in the KFAC eigenbasis performed better than Adam in the Shampoo eigenbasis (default SOAP). This suggests that the KFAC-style approximation may be better than the Shampoo one.
Nikita Breskanu
Nikita Breskanu@breskanu·
fullfix.github.io/notes/2026/04/… A blog post on Fisher-based optimizers in DL, where I cover KFAC, EKFAC, Shampoo and SOAP and their connection with the Fisher Information Matrix. I also compare all of these optimizers with an AdamW baseline on shakespeare-char.
Nikita Breskanu
Nikita Breskanu@breskanu·
fullfix.github.io/notes/2026/04/… Wrote a blog post on the main statistical properties of the Fisher Information Matrix. In the last section, I also briefly discuss overparameterization and why it leads to Fisher singularity.
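One simple way to see the singularity, sketched here under assumptions that are not taken from the blog post: a toy regression model y ~ N(a*b*x, 1) in which only the product a*b affects the output. Every score vector is then a multiple of (b, a), so the 2x2 Fisher matrix has rank 1.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.5, -0.7                            # two parameters, but only a*b matters
x = rng.normal(size=100_000)
y = a * b * x + rng.normal(size=x.shape)    # y sampled from the model itself (true Fisher)

# Score: gradient of log N(y | a*b*x, 1) with respect to (a, b).
residual = y - a * b * x
scores = np.stack([residual * x * b, residual * x * a], axis=1)   # shape (N, 2)

fisher = scores.T @ scores / len(x)         # Monte Carlo estimate of E[score score^T]
eigvals = np.linalg.eigvalsh(fisher)        # eigenvalues in ascending order

# Moving along (a, -b) keeps a*b fixed to first order, so it is a null direction.
null_direction = np.array([a, -b])
```

The smallest entry of `eigvals` is zero up to round-off, and `fisher @ null_direction` vanishes, so `fisher` cannot be inverted without damping or a pseudo-inverse.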
Nikita Breskanu
Nikita Breskanu@breskanu·
Feels similar to pseudo-labeling from classical ML.
Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% → 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way — suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick — it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd

Nikita Breskanu
Nikita Breskanu@breskanu·
Better to normalize by ||A||_F * ||B||_F
Nikita Breskanu
Nikita Breskanu@breskanu·
I found out that the Frobenius norm of the commutator, ||AB - BA||_F, is a good measure of how close the eigenvectors of two symmetric matrices A and B are. At least it works well when there are no repeated eigenvalues, which is typically the case in practice.
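A small NumPy sketch of this measure, including the normalization by ||A||_F * ||B||_F. The helper name `commutator_distance` and the test matrices are made up for illustration: two symmetric matrices sharing an eigenbasis commute, while matrices with unrelated eigenbases do not.

```python
import numpy as np

def commutator_distance(A, B):
    """Frobenius norm of AB - BA, normalized by ||A||_F * ||B||_F."""
    comm = A @ B - B @ A
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(comm) / (np.linalg.norm(A) * np.linalg.norm(B))

rng = np.random.default_rng(0)
n = 6
eigs = np.arange(1.0, n + 1)                       # distinct eigenvalues 1..n

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))       # random orthogonal eigenbasis
A = Q @ np.diag(eigs) @ Q.T
B_same = Q @ np.diag(eigs[::-1]) @ Q.T             # same eigenvectors, other eigenvalues

Q2, _ = np.linalg.qr(rng.normal(size=(n, n)))      # unrelated eigenbasis
B_other = Q2 @ np.diag(eigs) @ Q2.T

d_same = commutator_distance(A, B_same)            # zero up to round-off
d_other = commutator_distance(A, B_other)          # clearly positive
```

The normalization makes the value invariant to rescaling either matrix, which is convenient when comparing curvature approximations of very different magnitude.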
You Jiacheng
You Jiacheng@YouJiacheng·
@breskanu @Ji_Ha_Kim sure it's better. ||X||_F is a bound given 1 constraint: sum(σ^2)=||X||_F^2. this bound is a bound given 2 constraints: sum(σ^2)=||X||_F^2 AND sum(σ^4)=||X.T@X||_F^2 more constraints => better.
Nikita Breskanu
Nikita Breskanu@breskanu·
@Ji_Ha_Kim @YouJiacheng GPT says it's also always upper bounded by the Frobenius norm of X (I guess here G = X^TX). Then it's cool that this measure basically sits in between: ||X||_2 <= sqrt(M(X)) <= ||X||_F, so it's better than Frobenius.
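A quick numerical check of the sandwich. The definition of M is not spelled out in the thread, so this sketch assumes M(X) = ||X^T X||_F, which is consistent with the sum(σ^4) constraint mentioned in the reply; in terms of singular values the chain is max σ <= (sum σ^4)^(1/4) <= (sum σ^2)^(1/2).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 7))

sigma = np.linalg.svd(X, compute_uv=False)      # singular values of X

spectral = sigma.max()                          # ||X||_2
frobenius = np.sqrt((sigma ** 2).sum())         # ||X||_F

# Assumed definition: M(X) = ||X^T X||_F = sqrt(sum(sigma^4)).
M = np.linalg.norm(X.T @ X)
middle = np.sqrt(M)                             # (sum sigma^4)^(1/4)
```

Computing `M` needs only one matrix product and a Frobenius norm, with no SVD; the SVD appears here only to verify the bounds.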
Nikita Breskanu
Nikita Breskanu@breskanu·
@runame_ Hm, that explains it. Then it’s alright. Probably 1/2 corresponds to an F^{-1} approximation and 1/4 to F^{-1/2}, so a more natural-gradient-like update is better.
Runa Eschenhagen
Runa Eschenhagen@runame_·
@breskanu We use grafting from Adam here, so the update’s scale is determined by Adam’s update scale.
Nikita Breskanu
Nikita Breskanu@breskanu·
It's quite surprising to me that Shampoo with power 1/2 works and doesn't break, especially with such a large learning rate. For something like (GG^T)^{-1/2} M (G^TG)^{-1/2}, the update should scale as 1/||G||, so I would expect it to be very unstable unless a very small lr is used.
Runa Eschenhagen@runame_

1/14 Is Muon “better” than Shampoo? We argue that their relationship parallels Adam's relationship with Signum. Analogous to @lukas_balles and Hennig’s (2018) decomposition of Adam into element-wise scaled Signum, we can decompose Shampoo as left- and right-adapted Muon.
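The scale argument about the Shampoo exponent can be checked numerically. This is a simplified sketch, not the setup from the quoted thread: it uses the current gradient G in place of the momentum M and of the accumulated statistics, and it omits epsilon damping and grafting. The helper names are hypothetical.

```python
import numpy as np

def inv_power(S, p):
    """S^{-p} for a symmetric positive-definite matrix S via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w ** (-p)) @ V.T

def shampoo_like_update(G, p):
    """(G G^T)^{-p} G (G^T G)^{-p}, with G standing in for the momentum M."""
    return inv_power(G @ G.T, p) @ G @ inv_power(G.T @ G, p)

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 4))        # square, so both Gram matrices are full rank
c = 10.0                           # factor by which the gradient is rescaled

def norm_ratio(p):
    """How much the update norm changes when G is multiplied by c."""
    return np.linalg.norm(shampoo_like_update(c * G, p)) / np.linalg.norm(shampoo_like_update(G, p))

ratio_half = norm_ratio(0.5)       # update shrinks like 1/c
ratio_quarter = norm_ratio(0.25)   # update norm is unchanged
```

With power 1/2 the update norm is inversely proportional to the gradient scale, while power 1/4 makes it scale-invariant. When the step size is grafted from Adam, as described in the reply, this raw scale no longer determines the size of the update.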

Thomas Massena
Thomas Massena@thomasmassena·
@breskanu You can check out the Turbo-Muon and Chebyshev Accelerated NS paper for this.
Nikita Breskanu
Nikita Breskanu@breskanu·
Standard Muon takes X0 = G / ||G||_F. It feels like normalizing by the spectral norm ||G||_2 may be better than Frobenius: it keeps the singular values in the range [0, 1] needed for convergence, but spreads them more widely across it.
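The effect of the two normalizations on the singular values can be compared directly. This sketch assumes a random Gaussian matrix as a stand-in for the gradient and uses a full SVD purely for inspection; it does not run the Newton-Schulz iteration itself.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(16, 32))                   # stand-in for a gradient matrix (assumed)

sigma = np.linalg.svd(G, compute_uv=False)      # singular values, in descending order

sv_frobenius = sigma / np.linalg.norm(G)        # singular values of G / ||G||_F
sv_spectral = sigma / sigma[0]                  # singular values of G / ||G||_2

print("Frobenius-normalized range:", sv_frobenius.min(), sv_frobenius.max())
print("Spectral-normalized range: ", sv_spectral.min(), sv_spectral.max())
```

Under Frobenius normalization the squared singular values sum to 1, so they are all pushed toward zero as the rank grows; under spectral normalization the largest one is exactly 1. The trade-off is that ||G||_2 is not available in closed form and has to be estimated, for example by power iteration.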