Owen Colegrove

2.1K posts

@ocolegro

Physicist | Quant | Founder

San Francisco · Joined January 2021

661 Following · 5.2K Followers

Pinned Tweet

Owen Colegrove@ocolegro·19 Jul

Today SciPhi is open-sourcing Triplex, a SOTA LLM for knowledge graph construction. Triplex is so small that it can be used with SciPhi's R2R to build knowledge graphs directly from your laptop. Triplex outperforms few-shot prompted gpt-4o at 1/60th the inference cost.

507

42.8K

Owen Colegrove@ocolegro·10 Mar

@karpathy sssh stop telling everyone about this

543

Andrej Karpathy@karpathy·10 Mar

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:

- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
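Editor's sketch: the "promote the most promising ideas to increasingly larger scales" step could look roughly like the filter below. This is a successive-halving-style illustration, not the method used in the tweet. The idea names, the `TOY_GAINS` table, and the `toy_evaluate` loss function are all hypothetical stand-ins for real training runs.

```python
# Hypothetical per-idea loss reductions; in practice these come from real runs.
TOY_GAINS = {"qk_scale": 0.03, "ve_regularization": 0.02, "beta_tweak": 0.01, "bad_idea": -0.02}

def toy_evaluate(idea, depth):
    """Stand-in for training a model of the given depth with one idea applied."""
    baseline = 3.0 / depth ** 0.5          # toy baseline: deeper model, lower loss
    return baseline - TOY_GAINS[idea]      # lower is better

def promote(ideas, evaluate, scales, keep_fraction=0.5):
    """Evaluate ideas at each scale in order, keeping only the best fraction."""
    survivors = list(ideas)
    for scale in scales:
        ranked = sorted(survivors, key=lambda idea: evaluate(idea, scale))
        keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:keep]
    return survivors

# Screen four ideas at depth 12, then re-test the survivors at depth 24.
finalists = promote(TOY_GAINS, toy_evaluate, scales=[12, 24])
```

The cheap scale screens many ideas and the expensive scale only sees the survivors, which is where the compute saving would come from.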

970

2.1K

19.4K

3.5M

Owen Colegrove@ocolegro·9 Mar

@garybasin @karpathy it's going to print if done right

248

Gary Basin@garybasin·9 Mar

@ocolegro @karpathy How’d it do in prod?

282

Andrej Karpathy@karpathy·7 Mar

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
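Editor's sketch: the loop described above (propose a change, run a fixed-budget experiment, keep the change only if validation loss drops) can be illustrated with a toy accept-if-better search. Nothing here is taken from the autoresearch repo: `run_experiment` and `propose` are hypothetical stand-ins for a 5-minute training run and for the agent editing the training script, and the list `history` only plays the role of the accumulated git commits.

```python
import random

def run_experiment(config, seed):
    """Hypothetical stand-in for one fixed-budget training run; returns val loss."""
    rng = random.Random(seed)
    # Toy loss surface with its minimum at lr=0.03, plus a little run-to-run noise.
    return (config["lr"] - 0.03) ** 2 + rng.uniform(0, 1e-6)

def propose(config, rng):
    """Hypothetical stand-in for the agent modifying the training script."""
    candidate = dict(config)
    candidate["lr"] = max(1e-4, config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))
    return candidate

def autoresearch(initial, steps, seed=0):
    rng = random.Random(seed)
    best, best_loss = initial, run_experiment(initial, seed)
    history = [(initial, best_loss)]        # accepted settings, like commits on a branch
    for step in range(1, steps + 1):
        candidate = propose(best, rng)
        loss = run_experiment(candidate, seed + step)
        if loss < best_loss:                # keep the change only if it helps
            best, best_loss = candidate, loss
            history.append((candidate, loss))
    return best, best_loss, history

best, best_loss, history = autoresearch({"lr": 0.3}, steps=50)
```

A real agent chooses its next change by reading earlier results instead of sampling at random; the sketch only shows the accept-or-discard bookkeeping.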

3.6K

28.2K

10.9M

Owen Colegrove@ocolegro·1 Mar

@karpathy Have you tried making some basic CLI tooling that can add more structure to their work? E.g. by providing a more rigid workflow with automatic experiment recording? This enabled us to get net positive work out of configurations like this.
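Editor's sketch: one minimal form such CLI tooling could take is a command that appends every experiment result to a JSON Lines log, so agents cannot skip the recording step. This is an assumption about what "automatic experiment recording" might mean, not a description of SciPhi's tooling. The flag names and the `experiments.jsonl` default are hypothetical.

```python
import argparse
import json
import time
from pathlib import Path

def record_experiment(log_path, name, hypothesis, val_loss):
    """Append one experiment record as a JSON line and return it."""
    entry = {
        "time": time.time(),
        "name": name,
        "hypothesis": hypothesis,   # forces the agent to state what it expected
        "val_loss": val_loss,
    }
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
    return entry

def main(argv=None):
    """Parse command-line flags (hypothetical names) and record one result."""
    parser = argparse.ArgumentParser(description="Record one experiment result.")
    parser.add_argument("--log", default="experiments.jsonl")
    parser.add_argument("--name", required=True)
    parser.add_argument("--hypothesis", required=True)
    parser.add_argument("--val-loss", type=float, required=True)
    args = parser.parse_args(argv)
    return record_experiment(args.log, args.name, args.hypothesis, args.val_loss)
```

An agent would call `main([...])` or the equivalent shell command after every run. Because the log is append-only, a reviewer can later replay the full sequence of hypotheses and outcomes.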

265

Andrej Karpathy@karpathy·28 Feb

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)

I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p.

But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run a bit nonsensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them.

But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?
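Editor's sketch: the spurious "bigger hidden size wins" result can be caught mechanically by refusing to compare runs whose training compute differs. The check below estimates training FLOPs with the common rule of thumb of roughly 6 × parameters × tokens, which is an assumption added here and not something stated in the tweet. The `Run` fields, the 5% tolerance, and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Run:
    name: str
    params: int        # parameter count of the model
    tokens: int        # training tokens processed
    val_loss: float

    @property
    def train_flops(self) -> float:
        # Rule-of-thumb estimate: about 6 FLOPs per parameter per token.
        return 6.0 * self.params * self.tokens

def is_real_improvement(baseline, candidate, tolerance=0.05):
    """Count a lower loss only if both runs used roughly the same compute."""
    ratio = candidate.train_flops / baseline.train_flops
    compute_matched = abs(ratio - 1.0) <= tolerance
    return compute_matched and candidate.val_loss < baseline.val_loss

baseline = Run("baseline", params=100_000_000, tokens=1_000_000_000, val_loss=3.30)
wider = Run("wider", params=200_000_000, tokens=1_000_000_000, val_loss=3.20)
wider_matched = Run("wider, half the tokens", params=200_000_000, tokens=500_000_000, val_loss=3.28)

spurious = is_real_improvement(baseline, wider)          # 2x the compute: rejected
genuine = is_real_improvement(baseline, wider_matched)   # same compute, lower loss
```

Matching on wall-clock time instead of estimated FLOPs would work the same way; the point is only that the comparison is gated on a fixed budget.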

Thomas Wolf@Thom_Wolf

How come the NanoGPT speedrun challenge is not fully AI automated research by now?

563

806

8.7K

1.6M

Owen Colegrove@ocolegro·17 Apr

@antonosika pretty sick! I remember seeing the demo ~6 months ago out in SF, what progress!!

908

Anton Osika – eu/acc@antonosika·17 Apr

Lovable just reached $40M ARR in 5 months. But more importantly: We’ve now helped 1M+ people build their idea. This is why non-technical people use lovable: //1

1.3K

196K

Owen Colegrove@ocolegro·27 Mar

Gemini / Claude are both getting so good that I can't easily determine which I prefer. It's interesting / helpful to have them both review each other's answers and determine the winner.

1.1K

Owen Colegrove retweeted

Harj Taggar@harjtaggar·24 Mar

Autists had a great run, the AI future belongs to ADHD

379

758

9.6K

938.2K

Owen Colegrove@ocolegro·21 Mar

IMO, the best evaluation of current LLM capabilities is seeing the quality of application a high-agency developer can build overnight. Check out this super clean app with novel AI, built in just 20 hours by YC partner @t_blom: recipeninja.ai

1.1K

Owen Colegrove@ocolegro·20 Mar

All it took was ~20 min to implement an MCP server for R2R. It's really juicing up my Claude experience. What else can I do to make this tool useful for others (besides publishing and documenting)?

870

Owen Colegrove@ocolegro·20 Mar

Nothing about my day to day feels quite as good as making devs happy

592

Owen Colegrove@ocolegro·20 Mar

@rishdotblog @AkashTandon @eugeneyan @nvidia ty @rishdotblog !

Rishabh Srivastava@rishdotblog·19 Mar

Got it! If it's helpful, I've found the citations API excellent with <50k input tokens. R2R by @ocolegro is also super promising (and open-source) + works with any LLM provider, though I haven't had a chance to try them out yet: github.com/SciPhi-AI/R2R. Their knowledge graph generation + citation-backed answers (with source attribution) seem particularly promising.

Eugene Yan@eugeneyan·19 Mar

I'm at @NVIDIA GTC! Say hi if you want to discuss:

• Summarization, translation, Q&A on very long docs
• Scaling AI-powered experiences
• How LLMs evolve RecSys & Search
• Something I wrote @ eugeneyan.com

Black jacket, blue jeans, Jensen Huang-autographed badge

2.9K

Owen Colegrove@ocolegro·18 Mar

@rishdotblog It's not as good as o1-pro, but generally it has enough juice and is 10x faster. I'm fully sold that a proper agentic system constructed around it must be beastly.

107

Owen Colegrove@ocolegro·18 Mar

@rishdotblog Awesome, will try - I was also using o1-pro but I've switched over to manually interacting with 3.7 + extended thinking.

156

Rishabh Srivastava@rishdotblog·18 Mar

Tried going back to Cursor and Windsurf, and they felt unusable because I've grown so used to Claude Code's excellence. o1-pro is the only thing that comes close. But so much slower and no API access 🫠

129

16.8K

Owen Colegrove@ocolegro·18 Mar

@kimmonismus What museum was this in? I saw something similar recently at the National Museum of Emerging Science and Innovation (Miraikan).

300

Chubby♨️@kimmonismus·17 Mar

CNN (Convolutional Neural Network) Visualization. That's the stuff I'd love to see

781

128.6K

Y Combinator@ycombinator·17 Mar

R2R from sciphi.ai is an open-source agentic retrieval system that transforms RAG with multi-step reasoning across your data and the web. ycombinator.com/launches/N4p-r…

113

23.2K

Owen Colegrove@ocolegro·18 Mar

@jedwhite @ycombinator Thanks Jed, your enthusiasm is killer!!

Jed White 💥♻️@jedwhite·17 Mar

@ycombinator Huge congrats @ocolegro - this looks awesome!!

277

Owen Colegrove@ocolegro·17 Mar

We renamed R2R "Reason to Retrieve" after witnessing the recent takeoff in reasoning capabilities. This is a demo of the basic R2R RAG agent; we also have a more extensive `deep research` mode. We're excited to see what you build w/ the API - R2R remains OSS + self-hostable!

1.1K
