Owen Colegrove

2.1K posts

@ocolegro

Physicist | Quant | Founder

San Francisco · Joined January 2021
661 Following · 5.1K Followers
Pinned Tweet
Owen Colegrove
Owen Colegrove@ocolegro·
Today SciPhi is open-sourcing Triplex, a SOTA LLM for knowledge graph construction. Triplex is so small that it can be used with SciPhi's R2R to build knowledge graphs directly from your laptop. Triplex outperforms few-shot prompted gpt-4o at 1/60th the inference cost.
English
24
58
503
42.9K
Andrej Karpathy
Andrej Karpathy@karpathy·
Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.
This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…
All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.
And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
Andrej Karpathy tweet media
English
965
2.1K
19.5K
3.6M
Andrej Karpathy
Andrej Karpathy@karpathy·
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor…
Part code, part sci-fi, and a pinch of psychosis :)
Andrej Karpathy tweet media
English
1.1K
3.6K
28.4K
11.1M
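The loop described in the tweet above (propose a change, run a fixed-budget experiment, keep the change only if validation loss improves) can be sketched roughly as follows. This is a toy illustration, not code from the autoresearch repo: `toy_val_loss`, `propose`, and `autoresearch` are hypothetical names, the "training run" is a made-up quadratic, and random perturbation stands in for the agent's edits to the training script.

```python
import random
from dataclasses import dataclass


@dataclass
class Trial:
    config: dict
    val_loss: float


def toy_val_loss(config):
    # Hypothetical stand-in for one fixed-budget training run.
    # A real setup would train the model and return its validation loss.
    return (config["lr"] - 0.03) ** 2 + (config["weight_decay"] - 0.1) ** 2


def propose(config, rng):
    # Assumption: the "agent" is replaced by a random rescaling of one setting.
    new = dict(config)
    key = rng.choice(sorted(new))
    new[key] = new[key] * rng.uniform(0.5, 1.5)
    return new


def autoresearch(initial, evaluate, steps, seed=0):
    rng = random.Random(seed)
    best = Trial(initial, evaluate(initial))
    history = [best]  # plays the role of the accumulated commits
    for _ in range(steps):
        candidate = propose(best.config, rng)
        trial = Trial(candidate, evaluate(candidate))
        if trial.val_loss < best.val_loss:
            # Keep ("commit") only changes that lower validation loss.
            best = trial
            history.append(trial)
    return best, history


best, history = autoresearch({"lr": 0.1, "weight_decay": 0.5}, toy_val_loss, steps=200)
print(len(history) - 1, "accepted changes; best val_loss:", best.val_loss)
```

Because every run gets the same budget, accepted changes are directly comparable, which is the property the 5-minute runs in the tweet provide.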
Owen Colegrove
Owen Colegrove@ocolegro·
@karpathy Have you tried making some basic CLI tooling that can add more structure to their work? E.g. by providing a more rigid workflow with automatic experimental recording? This enabled us to get net positive work out of configurations like this.
English
0
0
1
304
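As a rough idea of what "automatic experimental recording" could look like, here is a minimal sketch of an append-only experiment log that a CLI wrapper might write to after each run. It is not SciPhi's tooling, which the reply does not describe: the JSONL layout and the names `record_experiment`, `load_experiments`, and `best_experiment` are all assumptions made for illustration.

```python
import json
import time
from pathlib import Path


def record_experiment(log_path, name, config, metrics, baseline=None):
    """Append one run to a JSONL log so agents cannot skip bookkeeping."""
    entry = {
        "name": name,
        "timestamp": time.time(),
        "config": config,
        "metrics": metrics,
        "baseline": baseline,  # name of the run this one is compared against
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def load_experiments(log_path):
    """Read every recorded run; a missing log means no runs yet."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_experiment(log_path, metric="val_loss"):
    """Return the run with the lowest value of `metric`, or None."""
    runs = [r for r in load_experiments(log_path) if metric in r["metrics"]]
    return min(runs, key=lambda r: r["metrics"][metric]) if runs else None
```

Requiring every run to name its baseline is one way to push agents toward the explicit comparisons that a looser workflow lets them skip.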
Andrej Karpathy
Andrej Karpathy@karpathy·
I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)
I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p.
But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run a bit nonsensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them.
But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code".
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?
Thomas Wolf@Thom_Wolf

How come the NanoGPT speedrun challenge is not fully AI automated research by now?

English
562
804
8.7K
1.6M
Owen Colegrove
Owen Colegrove@ocolegro·
@antonosika pretty sick! I remember seeing the demo ~6 months ago out in SF, what progress!!
English
0
0
6
910
Anton Osika
Anton Osika@antonosika·
Lovable just reached $40M ARR in 5 months. But more importantly: We’ve now helped 1M+ people build their idea. This is why non-technical people use lovable: //1
English
73
41
1.3K
196.1K
Owen Colegrove retweeted
Harj Taggar
Harj Taggar@harjtaggar·
Autists had a great run, the AI future belongs to ADHD
English
375
738
9.5K
938.9K
Rishabh Srivastava
Rishabh Srivastava@rishdotblog·
Got it! If it's helpful, I've found the citations API excellent with <50k input tokens.
R2R by @ocolegro is also super promising (and open-source) + works with any LLM provider, though I haven't had a chance to try them out yet. github.com/SciPhi-AI/R2R
Their knowledge graph generation + citation-backed answers (with source attribution) seem particularly promising.
English
2
0
3
52
Eugene Yan
Eugene Yan@eugeneyan·
I'm at @NVIDIA GTC! Say hi if you want to discuss:
• Summarization, translation, Q&A on very long docs
• Scaling AI-powered experiences
• How LLMs evolve RecSys & Search
• Something I wrote @ eugeneyan.com
Black jacket, blue jeans, Jensen Huang-autographed badge
Eugene Yan tweet media
English
3
3
19
2.9K
Owen Colegrove
Owen Colegrove@ocolegro·
@rishdotblog It's not as good as o1-pro, but generally it has enough juice and is 10x faster. I'm fully sold that a proper agentic system constructed around it must be beastly.
English
0
0
4
117
Owen Colegrove
Owen Colegrove@ocolegro·
@rishdotblog Awesome, will try - I was also using o1-pro but I've switched over to manually interacting with 3.7 + extended thinking.
English
1
0
2
157
Rishabh Srivastava
Rishabh Srivastava@rishdotblog·
Tried going back to Cursor and Windsurf, and they felt unusable because I've grown so used to Claude Code's excellence. o1-pro is the only thing that comes close. But so much slower and no API access 🫠
English
15
2
128
16.8K
Owen Colegrove
Owen Colegrove@ocolegro·
@kimmonismus What museum was this in? I saw something similar recently at The National Museum of Emerging Science and Innovation (Miraikan).
English
0
0
0
300
Chubby♨️
Chubby♨️@kimmonismus·
CNN (Convolutional Neural Network) visualization. That's the stuff I'd love to see.
English
18
72
775
128.6K
Owen Colegrove
Owen Colegrove@ocolegro·
@philipkung Nothing but constant problems when we've tried to work w/ devs on supabase.
English
0
0
3
340
Philip Kung
Philip Kung@philipkung·
Supabase is down again for the third time in the past month during peak business hours - except this time they did not put up a status banner whatsoever to give a heads up to their customers. We will be churning very soon.
English
5
0
27
4.8K
John Schulman
John Schulman@johnschulman2·
Excited to build a new AI research lab with some of my favorite former colleagues and some great new ones. Looking forward to sharing more in the coming weeks.
Thinking Machines@thinkymachines

Today, we are excited to announce Thinking Machines Lab (thinkingmachines.ai), an artificial intelligence research and product company. We are scientists, engineers, and builders behind some of the most widely used AI products and libraries, including ChatGPT, Character.ai, PyTorch, and Mistral. Our mission is to make artificial intelligence work for you by building a future where everyone has access to the knowledge and tools to make AI serve their unique needs. We are committed to open science through publications and code releases, while focusing on human-AI collaboration that serves diverse domains. Our approach embraces co-design of research and products to enable learning from real-world deployment and rapid iteration. This work requires three core foundations: state-of-the-art model intelligence, high-quality infrastructure, and advanced multimodal capabilities. We are committed to building models at the frontier of capabilities to deliver on this promise. If you’re interested in joining our team, consider applying here: 6wajk07p.paperform.co

English
41
47
1.2K
113.3K
Cody
Cody@breenemachine·
@ocolegro very cool - using Cursor?
English
1
0
0
30
Owen Colegrove
Owen Colegrove@ocolegro·
@breenemachine My job is primarily to see which parts of the original task it struggled to grok and to pick out errors in the code - a lot less mentally taxing than writing it from scratch
Owen Colegrove tweet media
English
0
0
0
139
Owen Colegrove
Owen Colegrove@ocolegro·
It gives full outputs in a reasonably well-structured way that you can just copy/paste into your IDE. Here is a rewrite of our RAG streaming API that o1-pro just cooked up through a few cycles of iteration; it is a very robust improvement upon what we had: github.com/SciPhi-AI/R2R/…
English
2
0
0
83
Owen Colegrove
Owen Colegrove@ocolegro·
@JoshPurtell Do you mean 'when will AI be as effective'? If so, then probably longer, as you say, since the training set is significantly smaller.
English
1
0
0
79
Josh
Josh@JoshPurtell·
@ocolegro Question I ask myself re: deprecating human coding - when will I be as effective at backend using a language I don't know like Rust (should be strictly easier to vibe code with, given typing) as I am using Python? Honestly I don't anticipate that happening in 3 months.
English
1
0
1
510
Owen Colegrove
Owen Colegrove@ocolegro·
I usually hand-pick the relevant context for a medium-difficulty coding problem and dump that with a well-written problem statement into o1-pro (which is significantly better than o1). There is usually some iteration, and eventually the code looks good and we move on to test writing. If those tests pass, then typically the code is good to go.
English
1
0
1
52
Cody
Cody@breenemachine·
@ocolegro Is o1 good at file-length output? By async do you mean it’s doing agent-like work?
English
1
0
0
53