
Wow, this tweet went very viral! I wanted to share a possibly slightly improved version of the tweet in an "idea file". The idea of the idea file is that in this era of LLM agents, there is less of a point/need to share the specific code/app; you just share the idea, and then the other person's agent customizes & builds it for their specific needs. So here's the idea in a gist format: gist.github.com/karpathy/442a6… You can give this to your agent and it can build you your own LLM wiki, guide you on how to use it, etc. It's intentionally kept a little bit abstract/vague because there are so many directions to take this in. And ofc, people can adjust the idea or contribute their own in the Discussion, which is cool.


How come the NanoGPT speedrun challenge is not fully AI-automated research by now?


Perhaps a 🌶️ take but I think the criticisms of @GoogleDeepMind's release are missing the point, and the real problem is that AI labs and safety orgs need to adapt to a world where intelligence is a function of inference compute. When Google says that Deep Think poses no new risks beyond Gemini 3 Pro, they probably mean that Deep Think is a scaffold of Gemini 3 Pro that anyone externally could have constructed on their own anyway. In other words, the capabilities of Deep Think have always been available to anyone willing to pay for Deep Think amounts of inference, simply by scaffolding a bunch of Gemini 3 Pro queries together. Deep Think just makes that more convenient for the casual user.

The corollary of this is that capabilities far beyond Gemini 3 Deep Think are already available to anyone willing to scaffold a system together that uses even more inference compute. As a trivial example, you could run 10 Deep Think queries and just do consensus over them. That would be 10x the cost but would have higher performance on many benchmarks.

Most Preparedness Frameworks were developed in ~2023, before the era of effective test-time scaling. But today, there is a massive difference on the hardest evals between something like GPT-5.2 Low and GPT-5.2 Extra High. Scaffolds are also much more effective. So if you want to evaluate whether Gemini 3 can, for example, help make a bio weapon, the answer may depend on how much inference compute you give it.

In my opinion, the proper solution is to account for inference compute when measuring model capabilities. E.g., if one were to spend $1,000 on inference with a really good scaffold, what performance could be expected on a benchmark? ARC-AGI has already adopted this mindset but few other benchmarks have.

Of course, serious entities like state actors could spend well beyond $1,000. Accurate benchmark evaluations can require dozens of queries on hundreds of problems. So, if we want to measure a model's capability when using $1 million of inference, we might need to spend billions of dollars for each model release! But in the same way that pretraining scaling laws can predict the capabilities of larger pretrained models, performance also scales somewhat cleanly with additional inference compute.

In my opinion, it should become standard practice for all system cards to show plots of benchmark performance as a function of inference compute, and safety thresholds should be based on a projection of what performance would look like at $1 million+ of inference compute. If that were the norm, then indeed releasing Deep Think probably would not result in a meaningful safety change compared to Gemini 3 Pro, other than making good scaffolds more easily available to casual users.
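The "run 10 queries and do consensus" scaffold mentioned above can be sketched in a few lines. This is a minimal illustration, not any lab's actual method: `query` here is a hypothetical stand-in for a call to whatever LLM API you use, and the consensus rule is a simple majority vote over exact-match answers.

```python
import collections
from typing import Callable

def consensus_answer(query: Callable[[str], str], prompt: str, n: int = 10) -> str:
    """Ask the same question n times and return the most common answer.

    `query` is a placeholder for an LLM API call (hypothetical, not a
    real client method). Majority voting over exact-match answers is
    the simplest possible consensus rule; real scaffolds often use
    semantic matching or a judge model instead.
    """
    answers = [query(prompt) for _ in range(n)]
    counts = collections.Counter(answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer
```

Each call costs one query's worth of inference, so this is roughly n times the cost of a single query, which is exactly the 10x-cost-for-higher-accuracy trade the tweet describes.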
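The proposed projection (measure performance at affordable compute levels, then extrapolate to $1M+) can be sketched as a simple curve fit. Everything here is illustrative: the log-linear form score = a + b·ln(cost) is one common assumption for test-time scaling curves, not an established law, and the data points would come from the benchmark runs a lab can actually afford.

```python
import math

def fit_loglinear(points):
    """Least-squares fit of score = a + b * ln(cost).

    `points` is a list of (cost_in_dollars, benchmark_score) pairs,
    e.g. measured at $10, $100, and $1,000 of inference. The log-linear
    form is an assumption; real scaling curves may saturate.
    """
    xs = [math.log(cost) for cost, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def projected_score(points, target_cost):
    """Extrapolate the fitted curve to a compute budget you cannot
    afford to measure directly (e.g. $1,000,000 of inference)."""
    a, b = fit_loglinear(points)
    return a + b * math.log(target_cost)
```

A safety threshold could then be stated against `projected_score(measured_points, 1_000_000)` rather than against the score at whatever budget was actually run, which is the change to system cards the tweet argues for.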


.@Ahmad_Al_Dahle is joining as Airbnb's new CTO. I’m often asked about our AI strategy. We believe pairing great design with frontier technology will help us improve the way people experience travel. Excited to build!



1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).
