Owen Colegrove

2.1K posts

@ocolegro

Physicist | Quant | Founder

San Francisco · Joined January 2021
661 Following · 5.2K Followers

Pinned Tweet

Owen Colegrove @ocolegro ·
Today SciPhi is open-sourcing Triplex, a SOTA LLM for knowledge graph construction. Triplex is so small that it can be used with SciPhi's R2R to build knowledge graphs directly from your laptop. Triplex outperforms few-shot prompted gpt-4o at 1/60th the inference cost.
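The pipeline implied here (Triplex feeding R2R) reduces to: extract (subject, predicate, object) triples from each document, then merge them into one graph. A toy Python sketch; `extract_triples` is a canned stand-in, not the actual Triplex prompt or output schema:

```python
import json

def extract_triples(text):
    """Stand-in for a Triplex call: a real pipeline would send `text`
    to the model and parse the triples it returns as JSON."""
    # canned response in the shape (subject, predicate, object)
    response = json.dumps([
        ["SciPhi", "released", "Triplex"],
        ["Triplex", "is_a", "language model"],
        ["Triplex", "used_for", "knowledge graph construction"],
    ])
    return [tuple(t) for t in json.loads(response)]

def build_graph(docs):
    """Accumulate triples from many documents into one deduplicated edge set."""
    graph = set()
    for doc in docs:
        graph.update(extract_triples(doc))
    return graph

graph = build_graph(["SciPhi is open-sourcing Triplex ..."])
```

Because the graph is just a set of edges, re-running extraction over overlapping documents deduplicates for free.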
24 replies · 59 reposts · 507 likes · 42.8K views

Andrej Karpathy @karpathy ·
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I've done daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things:

- It noticed an oversight that my parameterless QK-norm didn't have a scale multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that the AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, have them collaborate to tune smaller models, promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy metric, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
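The QK-norm point generalizes: once queries and keys are unit-normalized, their dot products are bounded in [-1, 1], so without a scale multiplier the softmax over attention scores stays close to uniform ("too diffuse"). A toy illustration, not nanochat code; the 10.0 scale is an arbitrary stand-in for a learned multiplier:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def entropy(ps):
    """Entropy of an attention distribution; lower = sharper."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# With unit-normalized q and k, raw scores q·k lie in [-1, 1].
scores = [0.9, 0.1, -0.3, 0.5]

diffuse = softmax(scores)                        # parameterless QK-norm
sharp = softmax([10.0 * s for s in scores])      # with a scale multiplier
```

The scaled distribution concentrates mass on the highest-scoring key, which is exactly the "sharpening" the agent's multipliers recover.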
[image attached]

970 replies · 2.1K reposts · 19.4K likes · 3.5M views

Andrej Karpathy @karpathy ·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
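The commit-on-improvement loop described here fits in a few lines. In this sketch `train_and_eval` is a toy stand-in for the 5-minute training run, and the proposal rule is a naive random perturbation rather than an LLM agent editing the training script:

```python
import random

def train_and_eval(config):
    """Toy stand-in for one full training run; returns validation loss.
    (A quadratic bowl: loss is minimized at lr = 0.02.)"""
    return (config["lr"] - 0.02) ** 2

def autoresearch(steps=50, seed=0):
    random.seed(seed)
    best = {"lr": 0.1}
    best_loss = train_and_eval(best)
    history = []  # stands in for the commit log on the feature branch
    for _ in range(steps):
        # propose a perturbation of the current best settings
        candidate = {"lr": best["lr"] * random.uniform(0.5, 1.5)}
        loss = train_and_eval(candidate)
        if loss < best_loss:  # keep ("commit") only strict improvements
            best, best_loss = candidate, loss
            history.append((candidate, loss))
    return best, best_loss, history

best, best_loss, history = autoresearch()
```

The real system replaces the random proposal with an agent that reads the history of results and plans the next experiment, which is the part that makes it research rather than random search.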
[image attached]

1K replies · 3.6K reposts · 28.2K likes · 10.9M views

Owen Colegrove @ocolegro ·
@karpathy Have you tried making some basic CLI tooling that can add more structure to their work? E.g. by providing a more rigid workflow with automatic experiment recording? This enabled us to get net-positive work out of configurations like this.
0 replies · 0 reposts · 1 like · 265 views

Andrej Karpathy @karpathy ·
I had the same thought, so I've been playing with it in nanochat. E.g. here's 8 agents (4 Claude, 4 Codex), with 1 GPU each, running nanochat experiments (trying to delete the logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)

I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). The research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, you can see their individual work, and you can "take over" if needed, i.e. no -p.

But ok, the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at the highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines and ablate things properly, and they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but it also trains for a lot longer; I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea, but they don't creatively generate them.

But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, processes, etc. that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of many tasks (almost like an eval). Then, given an arbitrary task: how quickly does your research org generate progress on it?
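The branch-per-scientist layout described above can be reproduced with plain git. A minimal sketch (paths and branch names are made up; the tmux launch is left as a comment since it is interactive):

```shell
#!/bin/sh
# Toy sketch of the setup: one research-program repo, one worktree +
# feature branch per agent so they can edit code in isolation.
set -e

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "org@example.com"   # placeholder identity
git config user.name "Research Org"
git commit -q --allow-empty -m "research program root"

for i in 1 2 3 4; do
  # each "scientist" gets its own branch checked out in its own directory
  git worktree add -q "$work/agent-$i" -b "agent-$i-experiments"
  # an interactive session per agent could then be launched, e.g.:
  # tmux new-window -n "agent-$i" "cd $work/agent-$i && claude"
done

git worktree list
```

Worktrees share one object store, so promoting an agent's result is just merging its feature branch back into the research-program branch.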
Thomas Wolf@Thom_Wolf

How come the NanoGPT speedrun challenge is not fully AI-automated research by now?

563 replies · 806 reposts · 8.7K likes · 1.6M views

Owen Colegrove @ocolegro ·
@antonosika pretty sick! I remember seeing the demo ~6 months ago out in SF, what progress!!
0 replies · 0 reposts · 6 likes · 908 views

Anton Osika – eu/acc @antonosika ·
Lovable just reached $40M ARR in 5 months. But more importantly: We’ve now helped 1M+ people build their idea. This is why non-technical people use lovable: //1
73 replies · 41 reposts · 1.3K likes · 196K views

Owen Colegrove @ocolegro ·
Gemini and Claude are both getting so good that I can't easily determine which I prefer. It's interesting and helpful to have them both review each other's answers and determine the winner.
[images attached]

1 reply · 2 reposts · 12 likes · 1.1K views

Owen Colegrove reposted
Harj Taggar @harjtaggar ·
Autists had a great run, the AI future belongs to ADHD
379 replies · 758 reposts · 9.6K likes · 938.2K views

Owen Colegrove @ocolegro ·
IMO, the best evaluation of current LLM capabilities is seeing the quality of the applications a high-agency developer can build overnight. Check out this super clean app with novel AI, built in just 20 hours by YC partner @t_blom: recipeninja.ai
3 replies · 0 reposts · 10 likes · 1.1K views

Owen Colegrove @ocolegro ·
All it took was ~20 min to implement an MCP server for R2R. It's really juicing up my Claude experience. What else can I do to make this tool useful for others (besides publishing and documenting)?
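An MCP server is, at its core, a JSON-RPC dispatcher that exposes tools over stdio. A toy sketch of just the `tools/call` dispatch; `r2r_search` is a hypothetical stand-in rather than R2R's actual API, and a real server would use an MCP SDK that handles initialization and transport:

```python
import json

# Hypothetical tool: a real server would call into R2R's search here.
def r2r_search(query):
    return {"results": [f"document matching {query!r}"]}

TOOLS = {"r2r_search": r2r_search}

def handle(request_line):
    """Dispatch one JSON-RPC request of the kind an MCP client sends."""
    req = json.loads(request_line)
    if req["method"] == "tools/call":
        tool = TOOLS[req["params"]["name"]]
        result = tool(**req["params"]["arguments"])
        return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
    return json.dumps({"jsonrpc": "2.0", "id": req["id"],
                       "error": {"code": -32601, "message": "method not found"}})

reply = handle(json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "r2r_search", "arguments": {"query": "vector search"}},
}))
```

Because tools are just named functions with JSON arguments, wrapping an existing API like R2R's really is a short exercise; the protocol plumbing is the boilerplate an SDK absorbs.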
0 replies · 0 reposts · 5 likes · 870 views

Owen Colegrove @ocolegro ·
Nothing about my day to day feels quite as good as making devs happy
[image attached]

0 replies · 1 repost · 7 likes · 592 views

Rishabh Srivastava @rishdotblog ·
Got it! If it's helpful, I've found the citations API excellent with <50k input tokens. R2R by @ocolegro is also super promising (and open-source) and works with any LLM provider, though I haven't had a chance to try it out yet: github.com/SciPhi-AI/R2R. Their knowledge graph generation and citation-backed answers (with source attribution) seem particularly promising.
2 replies · 0 reposts · 3 likes · 51 views

Eugene Yan @eugeneyan ·
I'm at @NVIDIA GTC! Say hi if you want to discuss:
• Summarization, translation, Q&A on very long docs
• Scaling AI-powered experiences
• How LLMs evolve RecSys & Search
• Something I wrote @ eugeneyan.com
Black jacket, blue jeans, Jensen Huang-autographed badge
[image attached]

3 replies · 3 reposts · 19 likes · 2.9K views

Owen Colegrove @ocolegro ·
@rishdotblog It's not as good as o1-pro, but generally it has enough juice and is 10x faster. I'm fully sold that a proper agentic system constructed around it must be beastly.
0 replies · 0 reposts · 4 likes · 107 views

Owen Colegrove @ocolegro ·
@rishdotblog Awesome, will try. I was also using o1-pro, but I've switched over to manually interacting with 3.7 + extended thinking.
1 reply · 0 reposts · 2 likes · 156 views

Rishabh Srivastava @rishdotblog ·
Tried going back to Cursor and Windsurf, and they felt unusable because I've grown so used to Claude Code's excellence. o1-pro is the only thing that comes close, but it's so much slower and has no API access 🫠
15 replies · 2 reposts · 129 likes · 16.8K views

Owen Colegrove @ocolegro ·
@kimmonismus What museum was this in? I saw something similar recently at the National Museum of Emerging Science and Innovation (Miraikan).
0 replies · 0 reposts · 0 likes · 300 views

Chubby♨️ @kimmonismus ·
CNN (Convolutional Neural Network) visualization. That's the stuff I'd love to see.
19 replies · 72 reposts · 781 likes · 128.6K views

Owen Colegrove @ocolegro ·
We renamed R2R to "Reason to Retrieve" after witnessing the recent takeoff in reasoning capabilities. This is a demo of the basic R2R RAG agent; we also have a more extensive `deep research` mode. We're excited to see what you build with the API. R2R remains OSS and self-hostable!
4 replies · 3 reposts · 14 likes · 1.1K views