Owen Colegrove

2.1K posts

@ocolegro

Physicist | Quant | Founder

San Francisco · Joined January 2021

662 Following · 5.1K Followers

Pinned Tweet

Owen Colegrove@ocolegro·19 Jul

Today SciPhi is open-sourcing Triplex, a SOTA LLM for knowledge graph construction. Triplex is so small that it can be used with SciPhi's R2R to build knowledge graphs directly from your laptop. Triplex outperforms few-shot prompted gpt-4o at 1/60th the inference cost.

Owen Colegrove retweeted

Dalton Caldwell@daltonc·6 Jul

Applications for the latest Standard Capital Series A funding cycle are now open! Apply by July 21, hear back by July 31.

Owen Colegrove@ocolegro·2 Jun

@breenemachine ty

Cody@breenemachine·2 Jun

Actually useful research benchmarks for LLMs - awesome work from @ocolegro and others

Owen Colegrove@ocolegro

We have benchmarks for coding, math, and general reasoning, but few test how well a model can autonomously research in a realistic setting. Quant research is a perfect stress test: noisy, statistical, code-heavy, and easy to fool yourself. We recently put this to the test.

Owen Colegrove@ocolegro·2 Jun

Still, from this work we can learn two things: (1) LLMs are already very good at doing quantitative research, and (2) quantitative research combines reasoning across various research domains that are representative of some of the highest forms of human thinking, in a complex and noisy environment, which ultimately makes it a great arena for testing LLM ability.

Owen Colegrove@ocolegro·2 Jun

This is still a backtest, not evidence of live-trading profitability. Also, the strategy signal used is somewhat fictitious in that it implicitly incorporates cross-exchange arbitrage opportunities that are hard/impossible to realize.

Owen Colegrove@ocolegro·2 Jun

Owen Colegrove@ocolegro·10 Mar

@karpathy sssh stop telling everyone about this

Andrej Karpathy@karpathy·10 Mar

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.
This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…
All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
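
As a quick check on the "~11% improvement" figure quoted in the post above, the relative reduction from 2.02 hours to 1.80 hours can be computed directly. This is only an illustrative sketch; the helper name `relative_improvement` is hypothetical and does not come from nanochat or autoresearch.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric (e.g. hours to a target)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # "Time to GPT-2" values quoted in the post: 2.02 hours -> 1.80 hours.
    print(f"{relative_improvement(2.02, 1.80):.1%}")  # prints 10.9%
```

The result, about 10.9%, is consistent with the rounded "~11%" in the post.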

Owen Colegrove@ocolegro·9 Mar

@garybasin @karpathy it's going to print if done right

Gary Basin@garybasin·9 Mar

@ocolegro @karpathy How’d it do in prod?

Andrej Karpathy@karpathy·7 Mar

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.
github.com/karpathy/autor…
Part code, part sci-fi, and a pinch of psychosis :)
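
The loop described in the post above (propose a change, run a fixed-budget experiment, keep the change only if validation loss drops) can be sketched as a greedy accept-if-better search. This is a toy illustration under stated assumptions, not the autoresearch repo's code: `autoresearch_loop`, `toy_propose`, and `toy_evaluate` are hypothetical names, the "agent" is a random perturbation, and the "training run" is a cheap quadratic standing in for a 5-minute run.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def autoresearch_loop(
    baseline: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    steps: int,
    seed: int = 0,
) -> Tuple[Config, float, List[Tuple[Config, float]]]:
    """Greedy loop: keep a proposed config only if it lowers the loss.

    Each accepted config plays the role of a commit on the feature branch.
    """
    rng = random.Random(seed)
    best, best_loss = dict(baseline), evaluate(baseline)
    accepted = [(dict(best), best_loss)]
    for _ in range(steps):
        candidate = propose(best, rng)
        loss = evaluate(candidate)  # stand-in for one fixed-budget training run
        if loss < best_loss:
            best, best_loss = candidate, loss
            accepted.append((dict(best), best_loss))
    return best, best_loss, accepted


def toy_propose(cfg: Config, rng: random.Random) -> Config:
    """Hypothetical 'agent': rescale one hyperparameter at random."""
    new = dict(cfg)
    key = rng.choice(sorted(new))
    new[key] *= rng.uniform(0.5, 1.5)
    return new


def toy_evaluate(cfg: Config) -> float:
    """Hypothetical 'validation loss' with a known optimum."""
    return (cfg["lr"] - 0.02) ** 2 + (cfg["weight_decay"] - 0.1) ** 2


if __name__ == "__main__":
    start = {"lr": 0.1, "weight_decay": 0.5}
    best, loss, history = autoresearch_loop(start, toy_propose, toy_evaluate, steps=200)
    print(f"accepted {len(history) - 1} changes, final loss {loss:.6f}")
```

In a real setup the proposal step would be an LLM agent editing a training script, and the accept step would be a git commit; the greedy structure is the only part this sketch is meant to show.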

Owen Colegrove@ocolegro·1 Mar

@karpathy Have you tried making some basic CLI tooling that can add more structure to their work? E.g. by providing a more rigid workflow with automatic experimental recording? This enabled us to get net positive work out of configurations like this.
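
One way to picture the "automatic experimental recording" mentioned in the reply above is an append-only log that every run must write to. The sketch below is an assumption for illustration only, not the tooling the author describes: the function names `record_experiment` and `load_experiments` and the JSON Lines layout are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, List


def record_experiment(
    log_path: str, name: str, config: Dict[str, Any], metrics: Dict[str, float]
) -> Dict[str, Any]:
    """Append one experiment record as a JSON line and return it."""
    entry = {
        "name": name,
        "timestamp": time.time(),
        "config": config,
        "metrics": metrics,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def load_experiments(log_path: str) -> List[Dict[str, Any]]:
    """Read back all recorded experiments in the order they were written."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because records are only ever appended, an agent cannot silently overwrite an earlier result, and a reviewer can replay the full sequence of runs.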

Andrej Karpathy@karpathy·28 Feb

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)
I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p.
But ok, the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer; it's not clear why I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea but they don't creatively generate them.
But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code".
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then, given an arbitrary task, how quickly does your research org generate progress on it?
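
The hidden-size example in the post above is a comparison that ignores compute: the wider model gets a lower loss but also trains longer. A minimal guard, sketched below under the assumption that each run reports its wall-clock training time, accepts a change only when it wins at roughly equal budget. The `Run` type, `is_fair_improvement`, and the 5% tolerance are hypothetical choices for illustration, not anything taken from nanochat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    val_loss: float
    train_seconds: float


def is_fair_improvement(
    baseline: Run, candidate: Run, budget_tolerance: float = 0.05
) -> bool:
    """True only if the candidate has lower loss AND stayed within the compute budget."""
    budget = baseline.train_seconds * (1.0 + budget_tolerance)
    within_budget = candidate.train_seconds <= budget
    return within_budget and candidate.val_loss < baseline.val_loss


if __name__ == "__main__":
    base = Run("baseline", val_loss=3.20, train_seconds=300.0)
    wider = Run("wider-hidden", val_loss=3.10, train_seconds=600.0)
    tuned = Run("tuned-betas", val_loss=3.15, train_seconds=301.0)
    print(is_fair_improvement(base, wider))  # False: lower loss, but double the runtime
    print(is_fair_improvement(base, tuned))  # True: lower loss at matched runtime
```

The same check could use FLOPs instead of seconds; the point is only that the budget is part of the acceptance rule.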

Thomas Wolf@Thom_Wolf

How come the NanoGPT speedrun challenge is not fully AI automated research by now?

Owen Colegrove@ocolegro·17 Apr

@antonosika pretty sick! I remember seeing the demo ~6 months ago out in SF, what progress!!

Anton Osika@antonosika·17 Apr

Lovable just reached $40M ARR in 5 months. But more importantly: We’ve now helped 1M+ people build their idea. This is why non-technical people use lovable: //1

Owen Colegrove retweeted

Harj Taggar@harjtaggar·24 Mar

Autists had a great run, the AI future belongs to ADHD

Owen Colegrove@ocolegro·20 Mar

@rishdotblog @AkashTandon @eugeneyan @nvidia ty @rishdotblog !

Rishabh Srivastava@rishdotblog·19 Mar

Got it! If it's helpful, I've found the citations API excellent with <50k input tokens.
R2R by @ocolegro is also super promising (and open-source) and works with any LLM provider, though I haven't had a chance to try them out yet.
github.com/SciPhi-AI/R2R
Their knowledge graph generation + citation-backed answers (with source attribution) seem particularly promising.

Eugene Yan@eugeneyan·19 Mar

I'm at @NVIDIA GTC! Say hi if you want to discuss:
• Summarization, translation, Q&A on very long docs
• Scaling AI-powered experiences
• How LLMs evolve RecSys & Search
• Something I wrote @ eugeneyan.com
Black jacket, blue jeans, Jensen Huang-autographed badge

Owen Colegrove@ocolegro·18 Mar

@rishdotblog It's not as good as o1-pro, but generally it has enough juice and is 10x faster. I'm fully sold that a proper agentic system constructed around it must be beastly.

Owen Colegrove@ocolegro·18 Mar

@rishdotblog Awesome, will try - I was also using o1-pro but I've switched over to manually interacting with 3.7 + extended thinking.

Rishabh Srivastava@rishdotblog·18 Mar

Tried going back to Cursor and Windsurf, and they felt unusable because I've grown so used to Claude Code's excellence. o1-pro is the only thing that comes close, but it's so much slower and has no API access 🫠

Owen Colegrove@ocolegro·18 Mar

@kimmonismus What museum was this in? I saw something similar recently at The National Museum of Emerging Science and Innovation (Miraikan).

Chubby♨️@kimmonismus·17 Mar

CNN (Convolutional Neural Network) visualization. That's the stuff I'd love to see.
