
Minghao Yan (@Minghao__Yan):
Being good at next word prediction is the opposite of what we want for creativity, for scientific breakthroughs.




We even deployed PACEvolve on the Modded NanoGPT challenge. Despite the benchmark being heavily optimized by the community, PACEvolve discovered further gains in data loading and network initialization, and tuned better hyperparameters.


I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement.

In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
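The autonomous loop described above can be sketched in a few lines. This is a hedged illustration, not the actual repo code: `run_training` stands in for the 5-minute training run (here a toy quadratic in the learning rate), and `propose` stands in for the LLM agent editing the training script; both names and the `git commit` placement are assumptions.

```python
import random

def run_training(params):
    # Stub for one complete training run; the real loop would execute the
    # training script and parse its final validation loss. Hypothetically,
    # lr=0.5 is optimal for this toy objective.
    return (params["lr"] - 0.5) ** 2 + 1.0

def propose(params):
    # Stand-in for the agent proposing an edit; here we just jitter
    # one hyperparameter instead of rewriting code.
    new = dict(params)
    new["lr"] = max(1e-4, new["lr"] + random.uniform(-0.1, 0.1))
    return new

def autoresearch_loop(steps=50, seed=0):
    random.seed(seed)
    params = {"lr": 0.1}
    best_loss = run_training(params)
    history = [best_loss]
    for _ in range(steps):
        candidate = propose(params)
        loss = run_training(candidate)
        if loss < best_loss:
            # Keep the change only if validation loss improved; in the
            # real setup the agent would `git commit` the edited training
            # script to its feature branch at this point.
            params, best_loss = candidate, loss
        history.append(best_loss)
    return params, best_loss, history
```

Each entry in `history` is the best validation loss so far, so plotting it gives the kind of monotone progress curve the tweet's image shows, one dot per run.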



nanochat now trains a GPT-2 capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in, but the biggest difference was a switch of the dataset from FineWeb-edu to NVIDIA ClimbMix (nice work NVIDIA!). I had tried Olmo, FineWeb, and DCLM, which all led to regressions; ClimbMix worked really well out of the box (to the point that I am slightly suspicious about goodharting, though reading the paper it seems ~ok).

In other news, after trying a few approaches for how to set things up, I now have AI agents iterating on nanochat automatically, so I'll just leave this running for a while, go relax a bit and enjoy the feeling of post-agi :). Visualized here as an example: 110 changes made over the last ~12 hours, bringing the validation loss so far from 0.862415 down to 0.858039 for a d12 model, at no cost to wall clock time. The agent works on a feature branch, tries out ideas, merges them when they work, and iterates.

Amusingly, over the last ~2 weeks I almost feel like I've iterated more on the "meta-setup", where I optimize and tune the agent flows, than on the nanochat repo directly.

🚀 Thrilled to introduce PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution. We show how to push LLM self-evolution beyond short, unstable improvements and into consistent, long-horizon gains. 🧵👇

