Minghao Yan
@Minghao__Yan
36 posts

interning @Google PhDing @WisconsinCS | prev @RiceCompSci, @AWS, @ThirdAILab (acq. by @ServiceNow)

Joined February 2019
1.2K Following · 243 Followers
Minghao Yan @Minghao__Yan:
@karpathy We worked on building an evolutionary agent for the NanoGPT benchmark back in October and shared our findings in the paper: arxiv.org/abs/2601.10657. We similarly observed that the agent is really good at tuning hyperparameters and designing context / lr / decay schedules!
Andrej Karpathy @karpathy:
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:

- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
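The loop described above (propose a change, run a fixed-budget experiment, keep what lowers validation loss, plan from the logged results) can be sketched in miniature. Everything here is a stand-in of my own naming: `evaluate` is a toy proxy loss rather than a real training run, and the mutation rule is plain random search rather than an LLM agent.

```python
import random

def evaluate(config):
    # Stand-in for a fixed-budget training run reporting validation loss.
    # (A real loop would launch training here; this is a toy proxy.)
    return (config["lr"] - 0.02) ** 2 + (config["wd"] - 0.1) ** 2

def autoresearch(iters=200, seed=0):
    rng = random.Random(seed)
    best = {"lr": 0.001, "wd": 0.05}
    best_loss = evaluate(best)
    history = [best_loss]                    # one entry per "experiment"
    for _ in range(iters):
        # Propose a mutation of the current best settings.
        cand = {k: v * rng.uniform(0.5, 2.0) for k, v in best.items()}
        loss = evaluate(cand)
        if loss < best_loss:                 # keep only real improvements
            best, best_loss = cand, loss
        history.append(best_loss)
    return best, best_loss, history
```

By construction the recorded `history` is non-increasing; the real agent's extra value is in reasoning about *why* a change helped before proposing the next one, which random search cannot do.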
Scott Condron @_ScottCondron:
While program evolution is cool because of @karpathy, here's a mini lit review because I was inspired to learn more about the space.

Program evolution: agents iterate on programs. The agent proposes a code change, evaluates it, logs the result + code, and samples from what worked as it iterates. No gradient descent, but still runs, evals, and artifacts, so it's similar to other optimization loops (e.g. model training).

AlphaEvolve from DeepMind was a recent paper on this. If you want to see implementation details, ShinkaEvolve from @SakanaAILabs is an open version of AlphaEvolve with the primary aim of making the optimization process more sample-efficient using better solution-sampling strategies and novelty rejection. It also has a cool UI with visualizations of how the agent sampled from its prior solutions and merged them: github.com/SakanaAI/Shink…

There's a great talk on @AutomlSeminar's YouTube channel by @RobertTLange that explains it: youtube.com/watch?v=dAOIer…

These fit into the broader area of LLM + search/eval loops:
- FunSearch (DeepMind) is worth mentioning because it was an early example of using LLMs as part of evolutionary optimization strategies; in this case it was searching for mathematical functions.
- The AI Scientist paper shows initial experimentation with fully automated science using LLMs.
- There's also the Darwin Gödel Machine, where the agent self-improves itself / its harness.

Going back further, this is all a modern remix of evolutionary computation: 1960s-70s evolutionary computation (EP/ES/GA), then 90s genetic programming, now with LLMs acting like an intelligent mutation/recombination operator inside an evaluate/select loop.
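The propose / evaluate / log / sample-from-what-worked loop described above is the classic evolutionary-computation skeleton. A minimal sketch under toy assumptions: the "program" is just a bit string, `fitness` counts its ones, and a random bit flip stands in for the LLM mutation operator that AlphaEvolve/ShinkaEvolve-style systems use.

```python
import random

def fitness(program):
    # Toy evaluation: score a bit-string "program" by its number of ones.
    # Real systems run the candidate program and benchmark the result.
    return sum(program)

def evolve(pop_size=16, length=12, gens=30, n_elite=4, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best_per_gen = []                         # the "log" of the run
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)   # evaluate
        best_per_gen.append(fitness(pop[0]))
        elites = pop[:n_elite]                # sample from what worked
        children = []
        while len(elites) + len(children) < pop_size:
            child = rng.choice(elites)[:]
            child[rng.randrange(length)] ^= 1  # mutate (the LLM's role in AlphaEvolve)
            children.append(child)
        pop = elites + children
    return max(pop, key=fitness), best_per_gen
```

With elitism the logged best fitness never decreases; the LLM's value in the real systems is making mutations far smarter than a random bit flip.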
Quoted post from Andrej Karpathy @karpathy:

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)

Minghao Yan @Minghao__Yan:
This work was done during my internship at Google and would not have been possible without my mentors and collaborators across Google and DeepMind. Kudos to everyone involved! Paper: arxiv.org/pdf/2601.10657 Code drop coming soon, stay tuned!
Minghao Yan @Minghao__Yan:
🚀 Thrilled to introduce PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution. We show how to push LLM self-evolution beyond short, unstable improvements and into consistent, long-horizon gains. 🧵👇
Andrej Karpathy @karpathy:
(I still have the bigger cousin running on prod nanochat, working on a bigger model on 8xH100, which looks like this now. I'll just leave this running for a while...)
[image]
Andrej Karpathy @karpathy:
(Repeat of the "autoresearch" repo announcement quoted in full above.)
Minghao Yan reposted
Henry Shevlin @dioscuri:
All I want for Christmas is a new Matt Lakeman blogpost
Minghao Yan @Minghao__Yan:
For more details, check out our paper and open-sourced draft models here, or catch me in person at the conference!
Minghao Yan @Minghao__Yan:
We showed that even for LLaMA 3.1 models, which were directly distilled from the target model, our lightweight method can still improve throughput by 52% with less than 10 minutes of distillation.
Minghao Yan @Minghao__Yan:
I will be presenting our work, Decoding Speculative Decoding, at @naacl tomorrow. We identified draft model depth as the performance bottleneck in speculative decoding and demonstrated a low correlation between a draft's language modeling performance and its token acceptance rate.
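For readers unfamiliar with the setup: in speculative decoding a small draft model cheaply proposes a block of tokens and the large target model verifies them, so throughput hinges on how many draft tokens get accepted. A toy sketch of a greedy-verification variant (this is not the paper's method nor the original rejection-sampling scheme; `target` and `draft` are stand-in deterministic next-token functions of my own making):

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    # target/draft: functions mapping a token list to the next token.
    out = list(prompt)
    acceptance = []                  # fraction of draft tokens kept per step
    while len(out) < len(prompt) + max_new:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        ctx, proposal = out[:], []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model verifies: keep the longest agreeing prefix...
        n_ok = 0
        for t in proposal:
            if target(out + proposal[:n_ok]) == t:
                n_ok += 1
            else:
                break
        out += proposal[:n_ok]
        # 3) ...then emit the target's own token at the first mismatch
        #    (or one bonus token if the whole block was accepted).
        out.append(target(out))
        acceptance.append(n_ok / k)
    return out[:len(prompt) + max_new], acceptance
```

The trade-off the tweet is about shows up directly here: a deeper draft raises acceptance but costs more per proposed token, and a draft that models language well overall can still disagree with the target often enough to have a poor acceptance rate.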