Andy Zhou

709 posts

@zhouandy_

Co-Founder @IntologyAI

San Francisco, USA · Joined August 2016
546 Following · 580 Followers
Andrej Karpathy@karpathy·
I had the same thought, so I've been playing with it in nanochat. E.g. here are 8 agents (4 Claude, 4 Codex), with 1 GPU each, running nanochat experiments (trying to delete the logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :)

I tried a few setups: 8 independent solo researchers, 1 chief scientist handing work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees provide isolation, simple files handle comms, and I skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). The research org runs in tmux window grids of interactive sessions (like Teams), so it's pretty to look at, I can see their individual work, and I can "take over" if needed, i.e. no -p.

But the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at the highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines or ablate things properly, and they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a spurious result: a bigger network will have a lower validation loss in the infinite-data regime, but it also trains for a lot longer. I had to come in to point that out.) They are very good at implementing any given well-scoped and well-described idea, but they don't creatively generate them.

But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, and processes that make it up. E.g. a daily standup in the morning is now part of the "org code".

And optimizing nanochat pretraining is just one of many tasks (almost like an eval). Then, given an arbitrary task, how quickly does your research org generate progress on it?
Thomas Wolf@Thom_Wolf

How come the NanoGPT speedrun challenge is not fully AI automated research by now?
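The branch-per-program, worktree-per-agent layout Karpathy describes can be sketched in a few git commands. This is my own minimal illustration (repo and branch names invented, tmux/agent launch omitted), not his actual setup:

```shell
# Hypothetical sketch: one git branch per research program, one worktree
# per agent, so each scientist gets an isolated checkout on its own
# feature branch forked from the program branch.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"

# the shared research program branch
git branch research/delete-logit-softcap

for i in 1 2 3 4; do
  # each agent forks the program branch into its own feature branch,
  # checked out in its own directory (filesystem isolation, shared history)
  git worktree add -q -b "agent-$i" "$repo/agent-$i" research/delete-logit-softcap
done

git worktree list   # one line per isolated agent checkout
```

From here, each agent directory would be opened as one pane of a tmux grid with an interactive coding-agent session inside it, and merging a feature branch back into `research/delete-logit-softcap` plays the role of "landing" an experiment.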

Andy Zhou retweeted
CLS@ChengleiSi·
Can LLMs automate frontier LLM research, like pre-training and post-training? In our new paper, LLMs found post-training methods that beat GRPO (69.4% vs 48.0%), and pre-training recipes faster than nanoGPT (19.7 minutes vs 35.9 minutes). 1/
Andy Zhou retweeted
Justin Cho@HJCH0·
I've joined @IntologyAI! I'm excited to push the boundaries of AI-accelerated scientific discovery with an incredibly driven and talented team. Looking forward to diving deep into research on AI-driven automation and creativity!
Andy Zhou@zhouandy_·
Hi, we've confirmed the stream synchronization issue in the Llama FFW kernel - the timing wasn't properly measuring the actual computation. The 20x speedup we reported was incorrect. Our kernels were developed using Robust-KBench & KernelBench’s test configurations (documented in our blog). We've moved to BackendBench for more robust validation in kernel optimization.
miru@miru_why·
@niklassheth @ronusedh @IntologyAI their 'superhuman' ai cleverly assigned all the work to non-default streams, which means the correctness test (which waits on all streams) passes, while the profiling timer (which only waits on the default stream) is tricked into reporting a huge speedup
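The failure mode miru describes can be reproduced without a GPU. Below is a minimal pure-Python analogy (my own illustration, not Intology's code): a background thread stands in for a non-default CUDA stream, the flawed timer "syncs only the default stream" before reading the clock, while the correctness check joins all streams and therefore passes.

```python
import threading
import time

def launch_on_side_stream(work, out):
    """Launch work asynchronously, like a kernel on a non-default stream:
    the call returns immediately while the work proceeds in the background."""
    t = threading.Thread(target=lambda: out.append(work()))
    t.start()
    return t

def kernel():
    time.sleep(0.2)          # stands in for ~200 ms of real compute
    return 42

def flawed_benchmark():
    out = []
    t0 = time.perf_counter()
    handle = launch_on_side_stream(kernel, out)
    # BUG: nothing waits on the side stream here, so the timer captures
    # only the (near-instant) launch, not the actual computation.
    elapsed = time.perf_counter() - t0
    handle.join()            # the correctness check *does* wait on all streams
    return elapsed, out[0]   # tiny elapsed time + correct answer: "huge speedup"

def honest_benchmark():
    out = []
    t0 = time.perf_counter()
    handle = launch_on_side_stream(kernel, out)
    handle.join()            # sync *all* streams before stopping the clock
    elapsed = time.perf_counter() - t0
    return elapsed, out[0]
```

`flawed_benchmark` reports fractions of a millisecond while `honest_benchmark` reports roughly the full 0.2 s; in CUDA terms the fix is to synchronize the whole device (all streams), not just the default stream, before reading the timer.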
Intology@IntologyAI·
Introducing Locus: the first AI system to outperform human experts at AI R&D.

Locus conducts research autonomously over multiple days and achieves superhuman results on RE-Bench given the same resources as humans, as well as SOTA performance on GPU kernel & ML engineering tasks.

RE-Bench is a collection of frontier AI research tasks that typically take human experts (e.g., top ML PhDs and frontier-lab researchers) several days. By scaling experimentation to far longer time horizons than previous systems, Locus represents a step change in AI scientist capabilities. 🧵
Andy Zhou@zhouandy_·
Hi Mark! We used Robust-KBench for our kernel-generation evaluation. Please refer to the paper arxiv.org/abs/2509.14279 and the benchmark repository github.com/SakanaAI/robus…, which contains details on the environment setup. We used the standard Robust-KBench setting exactly, with no additional modifications; it pins the GPU type, PyTorch version, input shapes, and timing code. We discuss much more in our blog! We are super excited about using Locus for more kernel problems, so happy to chat.
Andy Zhou retweeted
CLS@ChengleiSi·
@zhouandy_ Congrats, Andy! The results look impressive!
Andy Zhou@zhouandy_·
Super excited about our progress! We've been building out our latest AI scientist system and wanted to share some early results. We were surprised that Locus was not only SOTA on RE-Bench but even surpassed human experts! In the coming months, we'll be releasing novel discoveries made by Locus. Very proud of our team! We firmly believe AI systems will transform the process of conducting science - if our mission resonates with you, consider joining us: us@intology.ai
Intology@IntologyAI

Introducing Locus: the first AI system to outperform human experts at AI R&D …

Andy Zhou retweeted
Ron Arel@ronusedh·
Feel the rain on your skin No one else can feel it for you Only you can let it in No one else, no one else Can speak the words on your lips
Andy Zhou retweeted
Intology@IntologyAI·
Excited to announce the #AI4Science community, in collaboration w/ @askalphaxiv. As part of our speaker series, we are hosting @jeffclune this Friday. Join 4,000+ others passionate about AI-accelerated discovery. (event link & invite link) 🧵👇
Elizabeth Holmes@ElizabethHolmes·
If you are building a business that has the ability to change the world comment below with a short pitch. @ your favorite founder. I'll share my feedback and hopefully this will help give some exposure for young companies who are trying to do good. I'll share thoughts in the next 24 hours.