
Raimo Tuisku
621 posts
@raimo_t
I layer success over chaos. Programming since age 9 built a good mental model for everything, including humor. Founder @tryjido, prev @LeapMotion, @Twitter, @X

Introducing Code Review, a new feature for Claude Code. When a PR opens, Claude dispatches a team of agents to hunt for bugs.


Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was an already fairly well manually-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training by hand. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of my daily work for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of experimental results and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat.

Among the bigger findings:
- It noticed an oversight that my parameterless QK-norm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
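The QK-norm finding above has a simple mechanistic reading: if queries and keys are normalized to unit length, the attention logits are cosine similarities bounded in [-1, 1], so the softmax stays nearly uniform unless a scale is applied. A minimal NumPy sketch of that effect (function names and the scale value are illustrative, not nanochat's actual code):

```python
import numpy as np

def qk_norm_attention(q, k, v, scale=1.0):
    """Single-head attention with parameterless QK-norm plus a logit
    scale (a hypothetical sketch, not nanochat's implementation)."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    # With scale=1 the logits are cosine similarities in [-1, 1],
    # so the softmax stays nearly uniform ("too diffuse").
    logits = scale * (qn @ kn.T)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

def mean_entropy(w):
    # average softmax entropy per query; lower = sharper attention
    return float(-(w * np.log(w)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
_, w_diffuse = qk_norm_attention(q, k, v, scale=1.0)   # no multiplier
_, w_sharp = qk_norm_attention(q, k, v, scale=12.0)    # with multiplier
```

Attaching a multiplier (learned or tuned) lets the logits leave the [-1, 1] band, which is exactly the "sharpening" the agent found.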
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy metric, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.



🦞⌚ ClawWatch 1.0.0 is live today: the first true smartwatch-native AI agent.

Built for real wrist use:
- tap once to start talking
- continue naturally in multi-turn conversation
- stop when you stop
- get full Opus 4.6 intelligence + optional live web search

ClawWatch 1.0.0 is battery-efficient by design while staying calm and easy to use: minimal UI, lightweight state indicators, and one adaptive interaction flow so your watch helps without distracting. Release: github.com/ThinkOffApp/Cl…
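The "tap once / talk / stop" flow reads like a small three-state loop. A hypothetical sketch of that interaction model (class and method names are mine, not the app's API):

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    RESPONDING = auto()

class WristAgent:
    """Hypothetical three-state flow: tap to talk, respond when the
    user stops, loop for multi-turn, idle out when the session ends."""
    def __init__(self):
        self.state = State.IDLE

    def tap(self):
        # a single tap starts a session from idle, or ends one
        self.state = State.LISTENING if self.state is State.IDLE else State.IDLE

    def speech_ended(self):
        # the user stopped talking: hand the turn to the agent
        if self.state is State.LISTENING:
            self.state = State.RESPONDING

    def reply_done(self, follow_up=True):
        # multi-turn: listen again, or go back to sleep
        if self.state is State.RESPONDING:
            self.state = State.LISTENING if follow_up else State.IDLE

agent = WristAgent()
agent.tap()            # IDLE -> LISTENING
agent.speech_ended()   # LISTENING -> RESPONDING
agent.reply_done()     # RESPONDING -> LISTENING (multi-turn)
```

Keeping the flow to one loop like this is what makes a "minimal UI with lightweight state indicators" possible: each state maps to one indicator.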


GPT-5.4: it's basically perfect (it took it around 24 minutes). Yeah, Minecraft is pretty much solved; I have to find a new test now.


Dario at MS TMT Conference today:

On defense / DOW: "We really believe in defending America." Anthropic has been working with the national security community for 2 years. "We are the most lean forward."

On AI acceleration: "We do not see hitting a wall. This year will have a radical acceleration that surprises everyone." Exponentials catch people off guard. "We are at the precipice of something incredible. We need to manage it the right way."

On where markets are wrong: "It's already big and it will get 1 million times bigger." The underestimation of exponential growth is the key thing people need to understand.

On revenue scale: Anthropic was at a ~$100M run rate 2 years ago. Now at a $19B run rate.

On culture (Dario says he spends 40% of his time on it): "Anyone who is CEO of a growing firm needs to realize they are chief culture officer. My job is to make sure everyone is on the same page and believes in what we are doing. That's the most important thing." He does a vision quest with the whole company every couple of weeks. "I want them to hear it directly from me. If I tell the CTO, who tells the VP Eng, who tells the manager — that's too long of a game of telephone." "Politics and infighting are a cancer to companies as they grow."

On talent retention vs Meta: "We lost 2 people to Meta. They lost several dozen. Normalized by size, they lost 10-20x more people vs us." Attributes this to a unified culture generating "super linear returns — by working together vs working against each other."

On code as the breakout use case: Code has "exceeded our high expectations." Why? Devs adopt fast, code is verifiable, and gains compound — you build software to build software. "Didn't realize it would go so fast even at traditional enterprises." Frustration is around regulated industries where legal/compliance slow things down. "That's how fast everything could be going if not for non-AI barriers."
On Anthropic's own AI usage: Top internal use cases: 1) writing code, 2) the process around writing code (SWE), 3) managing servers and controlling clusters. "If we were paying ourselves for our usage, we'd be one of our largest customers."

On Claude Code: "You can supervise an army of 100 Claudes. It's closely analogous to a management skill." The people who are best at it keep the big picture in their head. There is a higher return to finding people who can handle more complex tasks.

On platform vs apps: "We are primarily a platform, but there are places where we have expertise to make something directly useful." Claude Code emerged as a tool they built for themselves — thousands of internal users before shipping it externally. "Code is a prelude for what we will see in everything else."

On societal implications: "Human history — lots of muddling through. We found ourselves in this comedy of errors and figured it out eventually. It's happening so fast that we need to do better than that this time." The market will deliver positive benefits — "I see that as priced in." What's not priced in: the choices we make around externalities. Jobs, national security, ensuring the benefits reach everyone.

On chips & compute: Anthropic uses multiple chip suppliers. "We find that actually using different chips is useful to us. Chips aren't just a speed number — we gain benefits from heterogeneity." Also standard business logic of having more than one supplier.
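A quick back-of-envelope on the run-rate figures quoted above (~$100M two years ago, ~$19B now) makes the "exponential growth" point concrete:

```python
# Quoted figures: ~$100M run rate two years ago, ~$19B run rate today.
start, now, years = 100e6, 19e9, 2
total_multiple = now / start                     # 190x over two years
annual_multiple = total_multiple ** (1 / years)  # ~13.8x per year, if growth were smooth
```

That implied ~13.8x/year multiple is the kind of compounding Dario argues people systematically underestimate.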
