Henry Lu

495 posts

@HenryL_AI

Research Lead @Amazon | Self-improving AI | ex-@CarnegieMellon, @MSFTResearch | DM for collab, massive compute available | Building https://t.co/VoJrRsndf3

San Francisco, CA · Joined June 2016
120 Following · 548 Followers
Pinned Tweet
Henry Lu
Henry Lu@HenryL_AI·
🚀 Big update: @gepa_ai has now been officially integrated into A-Evolve (by a community member)! We added GEPA as a new pluggable evolution algorithm inside A-Evolve. This makes it even easier for any agent to leverage GEPA's capabilities with zero extra setup: just plug it in and let the agent self-evolve. It also makes it easy to compare GEPA with other self-evolution algorithms, including Meta-Harness and A-Evolve's default. (Full integration details + results in the reply below 👇) #AgenticAI #AEvolve #SelfImprovingAgents #GEPA
Henry Lu@HenryL_AI

Launch Post🧬 A-Evolve: The PyTorch Moment for Self-evolving AI Today we at @amazon launch the universal infrastructure that turns any agent into a self-improving SOTA agent — zero human intervention. You give it a base agent → it returns a continuously evolving Top-10 agent. 3 lines of code. 0 hours of manual harness engineering: 🟢 MCP-Atlas → 79.4% (#1) +3.4pp 🔵 SWE-bench Verified → 76.8% (~#5) +2.6pp 🟣 Terminal-Bench 2.0 → 76.5% (~#7) +13.0pp 🟡 SkillsBench → 34.9% (#2) +15.2pp Thanks @binghe2727 @YisiSang @sammyershi @linminhua16 for the contribution! #AgenticAI #AEvolve #SelfImprovingAgents

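The pinned launch post claims a 3-line integration but does not show A-Evolve's actual API, so the sketch below is only a toy illustration of the loop shape such a framework implies: score a configuration, mutate it, keep improvements. Every name in it (`evolve`, `score_fn`, `mutate_fn`) is hypothetical and is not the real a-evolve interface.

```python
import random

def evolve(agent_config, score_fn, mutate_fn, generations=20, seed=0):
    """Toy self-evolution loop: propose a mutation each generation and
    keep it only if it scores higher (greedy hill climbing)."""
    rng = random.Random(seed)
    best, best_score = agent_config, score_fn(agent_config)
    for _ in range(generations):
        candidate = mutate_fn(best, rng)
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy "agent": a single temperature knob whose benchmark score peaks
# at 0.3. A real run would score full benchmark trajectories instead.
score = lambda cfg: 1.0 - abs(cfg["temperature"] - 0.3)

def mutate(cfg, rng):
    t = cfg["temperature"] + rng.uniform(-0.1, 0.1)
    return {"temperature": min(1.0, max(0.0, t))}

best_cfg, best_score = evolve({"temperature": 0.9}, score, mutate)
```

In a real harness-evolution setting the "config" would be the agent's scaffolding (prompts, tools, retry policy) and scoring would mean running a benchmark, which is why the launch post's headline numbers take compute rather than code.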
Henry Lu
Henry Lu@HenryL_AI·
.@demishassabis: lack of continual learning is the blocker for agents completing full tasks. One data point on why this can be unlocked now, from our 𝐚-𝐞𝐯𝐨𝐥𝐯𝐞 runs: self-correction went through a phase transition. Pre-Sonnet-4.5: often can't find its own errors. Sonnet-4.5: finds them, but often can't fix them. Opus-4.6: usually fixes them correctly. Continual learning is a second-order capability, gated by base model strength.
Y Combinator@ycombinator

Demis Hassabis (@demishassabis) has had one of the most extraordinary careers in tech. He started as a chess prodigy and video game designer at 17 before getting a PhD in neuroscience and going on to found DeepMind. His lab cracked Go, solved protein structure prediction with AlphaFold, and then gave it away free to every scientist on earth. That work won him the 2024 Nobel Prize in Chemistry. Today he leads @GoogleDeepMind, pushing toward the same goal he set as a teenager: AGI. On this special live episode of How to Build the Future, he sat down with YC's @garrytan to talk about what still needs to happen to get us to AGI, his advice for founders on how to stay ahead of the curve, and what the next big scientific breakthroughs might be. 01:48 — What’s Missing Before We Get To AGI? 03:36 — Why Memory Is Still Unsolved 06:14 — How AlphaGo Shaped Gemini 08:06 — Why Smaller Models Are Getting So Powerful 10:46 — The 1000x Engineer 12:40 — Continual Learning and the Future of Agents 13:32 — Why AI Still Fails at Basic Reasoning 15:33 — Are Agents Overhyped or Just Getting Started? 18:31 — Can AI Become Truly Creative? 20:26 — Open Models, Gemma, and Local AI 22:26 — Why Gemini Was Built Multimodal 24:08 — What Happens When Inference Gets Cheap? 25:24 — From AlphaFold to the Virtual Cells 28:24 — AI as the Ultimate Tool for Science 30:43 — Advice for Founders 33:30 — The AlphaFold Breakthrough Pattern 35:20 — Can AI Make Real Scientific Discoveries? 37:59 — What to Build Before AGI Arrives

Henry Lu
Henry Lu@HenryL_AI·
There's a missing piece. 𝐋𝐋𝐌𝐬 𝐚𝐥𝐫𝐞𝐚𝐝𝐲 𝐚𝐫𝐞 𝐫𝐢𝐜𝐡 𝐫𝐞𝐰𝐚𝐫𝐝 𝐟𝐮𝐧𝐜𝐭𝐢𝐨𝐧𝐬 for many domains — we use them as judges, critics, self-correctors. Self-evolving agents work specifically because the model evaluates its own trajectories and writes code to patch its own scaffolding. The "simple cross-entropy" problem is a base-training problem; it doesn't bind at the agent loop layer.
Dwarkesh Patel@dwarkesh_sp

There's a quadrillion-dollar question at the heart of AI: why are humans so much more sample-efficient than LLMs? There are three possible answers: 1. Architecture and hyperparameters (aka transformer vs whatever ‘algo’ cortical columns are implementing) 2. Learning rule (backprop vs whatever the brain is doing) 3. Reward function. @AdamMarblestone believes the answer is the reward function. ML likes to use pretty simple loss functions, like cross-entropy. These are easy to work with, but they might be too simple for sample-efficient learning. Adam thinks that, in humans, the large number of highly specialised cells in the ‘lizard brain’ might actually be encoding information for sophisticated loss functions, used for ‘training’ the more sophisticated areas like the cortex and amygdala. Consider: the human genome is barely 3 gigabytes (compare that to the TBs of parameters that encode frontier LLM weights). So how can it include all the information necessary to build highly intelligent learners? Well, if the key to sample-efficient learning resides in the loss function, even very complicated loss functions can still be expressed in a couple hundred lines of Python code.

Henry Lu
Henry Lu@HenryL_AI·
So @AdamMarblestone 's framing maps onto LLMs cleanly, but with a twist: the rich reward function isn't a separate evolved circuit. It's the same model evaluating itself. Sample efficiency at the agent level comes from this self-critique loop, not from making each forward pass more efficient.
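The self-critique loop this thread describes (the same model acting as generator and reward function) can be sketched as generate, self-score, revise. The sketch below is illustrative only: `stub_model` is a stand-in so it runs without an API call, and the prompt tags and scoring scheme are invented.

```python
def self_improve(task, call_model, max_rounds=5, threshold=0.9):
    """Draft an answer, have the *same* model score it, and revise until
    the self-assigned score clears the threshold (or rounds run out)."""
    answer = call_model(f"SOLVE|{task}")
    for _ in range(max_rounds):
        critique = call_model(f"CRITIQUE|{answer}")
        score = float(critique.split(",", 1)[0])
        if score >= threshold:
            break
        answer = call_model(f"REVISE|{answer}")
    return answer

# Stand-in "model" so the sketch runs without an API key: the critic
# scores by length (a crude proxy for thoroughness), and revision
# appends detail. A real deployment would make an LLM call here.
def stub_model(prompt):
    kind, _, body = prompt.partition("|")
    if kind == "CRITIQUE":
        return f"{min(1.0, len(body) / 40):.2f},add more detail"
    if kind == "REVISE":
        return body + " plus more detail"
    return "a short first draft"

final = self_improve("explain GEPA", stub_model)
```

The point of the shape: the "reward function" is just another forward pass through the same model, which is the twist the tweet above makes on Marblestone's framing.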
Henry Lu
Henry Lu@HenryL_AI·
𝐖𝐡𝐞𝐫𝐞 𝐢𝐭 𝐛𝐫𝐞𝐚𝐤𝐬: tasks where the model has no inherent taste. We tried this on Texas Hold'em — the model can't intuit hand probabilities, so self-critique went nowhere. The fix was letting it write an external equity solver. Then the agent loop closed again, just tool-augmented.
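The tweet does not show the equity solver the agent wrote, but the shape of such an external tool is Monte Carlo rollouts over random runouts. As a deliberately narrow stand-in (not the agent's actual code), here is a sampler for one question: holding two suited hole cards, how often does the five-card board complete a flush? The exact answer is about 6.4%.

```python
import random
from itertools import product

def flush_by_river_prob(trials=100_000, seed=42):
    """Monte Carlo: holding two suited hole cards, how often does the
    5-card board bring at least 3 more cards of our suit (a flush)?"""
    rng = random.Random(seed)
    deck = [(rank, suit) for rank, suit in product(range(13), range(4))]
    hole = [(12, 0), (11, 0)]                        # ace-king suited, suit 0
    remaining = [c for c in deck if c not in hole]   # 50 cards left
    hits = 0
    for _ in range(trials):
        board = rng.sample(remaining, 5)
        if sum(1 for _, suit in board if suit == 0) >= 3:
            hits += 1
    return hits / trials

prob = flush_by_river_prob()
```

This is exactly the kind of computation a model cannot intuit but can trivially delegate once it is allowed to write and call a tool.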
Henry Lu
Henry Lu@HenryL_AI·
@rronak_ Definitely agree that recursive self-improvement is the next frontier. A-Evolve has already expanded to 10+ envs and aced those benchmarks, and will have a major update next month. Stay tuned.
Ronak Malde
Ronak Malde@rronak_·
My takeaways from ICLR 2026 1. Recursive self-improvement / continual learning is the next frontier of research. Several great papers on self-distillation, automatic agent-harness optimization, learning from non-verifiable reward, and self-play are early signs of success 2. Multimodal models and world models are attaining emergent reasoning capabilities, opening a new door to spatial understanding that was previously locked 3. Lots of concerns that the research community is currently too focused on benchmaxxing rather than improving the research process, and a call to action to address this, like Percy Liang’s fully open-source training community. 4. Rio is possibly even better than San Diego 🇧🇷🏄
Ryan Lopopolo
Ryan Lopopolo@_lopopolo·
A neat thing we’ve been experimenting with: Codex workout sessions. Getting Codex to close the loop and validate its work is critical for higher complexity changes. To do that we want skills for high level workflows: “log in”, “upload file attachments and start a chat”, “grant this group access to a Workplace Agent”. To do this reliably, we’ve been getting Codex to iterate on its own skills by planting “flags” CTF-style in the UI and ralphing Codex using automations in the app, making commits to iteratively refine the skills after self reflection on each attempt. Capturing the flag is the win condition and from there codex optimized for reliability, wall clock time, and keeping up to the changing codebase. Put in the reps with your agents!
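The flag-planting loop above can be reduced to a tiny harness: hide a one-time token behind the workflow, let the agent attempt the skill, and treat capturing the token as the win condition. Everything below is a toy under that assumption (the real setup drives a browser UI and commits skill revisions); all names are made up.

```python
import uuid

def plant_flag(ui_state):
    """Hide a one-time token in the UI state; capturing it is the
    agent's objective win condition for the workflow under test."""
    flag = uuid.uuid4().hex
    ui_state["hidden_flag"] = flag
    return flag

def attempt(skill, ui_state):
    """Run one skill attempt; success means the skill surfaced the flag.
    `skill` is any callable that navigates the (here: dict-shaped) UI."""
    return skill(ui_state) == ui_state["hidden_flag"]

# A toy skill that completes the workflow and reads the flag. A failing
# skill would return None, triggering another reflect-and-refine pass
# before the next commit.
def demo_skill(ui_state):
    return ui_state.get("hidden_flag")

state = {}
plant_flag(state)
captured = attempt(demo_skill, state)
```

The design choice worth copying is the objective pass/fail signal: with a flag, "did the skill work" never depends on the agent grading its own screenshot.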
Henry Lu
Henry Lu@HenryL_AI·
@dwarkesh_sp Very interested in Question 1; I lead a Self-Improving AI team. We're building A-Evolve: a lightweight, pluggable self-evolution framework that lets any agent continuously iterate and improve itself via evolutionary algorithms. With just 3 lines of code, it pushes base agents to SOTA-level performance on benchmarks like SWE-bench and Terminal-Bench. This directly touches the "how intelligence scales beyond pure compute" question. Would be happy to expand this into a short post and to discuss further. #SelfImprovingAI #AEvolve
Dwarkesh Patel
Dwarkesh Patel@dwarkesh_sp·
$20k blog prize to answer some big questions about AI. The not-so-secret point of this whole contest is so that I can hire a research collaborator to think through questions like this hand in hand with me. dwarkesh.com/p/blog-prize
MEGA Code
MEGA Code@megacode_ai·
@HenryL_AI @gepa_ai Amazing Henry! We're currently building an infrastructure to make optimization itself cumulative, compositional, and self evolving. The public testable version will be out next week. Would love for you to check it out and give us your thoughts!
Yu Su
Yu Su@ysu_nlp·
Introducing @NeoCognition, the agent lab for specialized intelligence. Everyone needs experts, but human expertise does not scale. Backed by $40M seed funding, we build self-learning agents that specialize across domains to make expertise abundant.
Henry Lu
Henry Lu@HenryL_AI·
Thank you to the GEPA team, and especially to GEPA team member @rohitsandadi, who contributed this integration! GitHub: github.com/A-EVO-Lab/a-ev… PyPI: pip install a-evolve We will keep adding the latest and hottest evolutionary algorithms to A-Evolve, making them extremely easy to access and plug into any agent. Who’s going to try GEPA + A-Evolve next? 👀
Henry Lu
Henry Lu@HenryL_AI·
@yifan_zhang_ Couldn't agree more. We have run experiments on frontier models at very large scale, and we can be almost sure it will work. It already works at the harness level; pushing it to the model level just takes time.
Yifan Zhang
Yifan Zhang@yifan_zhang_·
Recursive self-improvement via coding agents is the top priority for all frontier labs.
Henry Lu
Henry Lu@HenryL_AI·
Yes, it’s a great benchmark for measuring true novelty on ambiguous tasks. Current LLMs are trained on “completing the next task”, but the ability to identify the next task comes from interacting with the environment and learning through experience. We ran A-Evolve with a multi-agent framework and achieved 12.3% on ARC-AGI-3.
François Chollet
François Chollet@fchollet·
Any smart human giving it real effort should score >90% on ARC-AGI-3
Henry Lu
Henry Lu@HenryL_AI·
Wait, do you mean the comparison to other self-evolving harness algorithms? Yes, it’s available in the GitHub repo and in our launch post. We also just integrated GEPA, so people can try that as well. I highly recommend sharing this so people can easily compare different algorithms on different benchmarks. We're also looking for collaborations.
Yoonho Lee
Yoonho Lee@yoonholeee·
@HenryL_AI @chelseabfinn Cool work, glad to see that it works! Have you tried any head-to-head comparisons on the same eval, given that Meta-Harness is a drop-in replacement in your framework?
Yoonho Lee
Yoonho Lee@yoonholeee·
We just released code for Meta-Harness! github.com/stanford-iris-… Aside from replicating paper experiments, the repo is designed to help users implement good Meta-Harnesses in completely new domains! Just point your agent at ONBOARDING.md and have a conversation
Yoonho Lee@yoonholeee

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end

Henry Lu
Henry Lu@HenryL_AI·
@saranormous This is exactly what we have observed on long-horizon coding, skill-usage, and tool-usage tasks. Check x.com/henryl_ai/stat…
sarah guo
sarah guo@saranormous·
“These results suggest long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem”
Henry Lu
Henry Lu@HenryL_AI·
@yoonholeee @chelseabfinn Ran it on MCP-Atlas, a dataset the Meta-Harness team had never evaluated on before. Result: a significant performance uplift from 69.0% → 73.5%, and 2.8× faster.
Henry Lu
Henry Lu@HenryL_AI·
Quick experiment with the brand-new @claudeai Opus 4.7: we ran Opus 4.7 head-to-head against Opus 4.6 + A-Evolve (self-evolved harness). Result? Even the latest model upgrade is still heavily limited by the harness. When the agent is allowed to evolve its own harness, the performance gap narrows dramatically. System card attached. #AgenticAI #AEvolve #ClaudeOpus47
Claude@claudeai

Introducing Claude Opus 4.7, our most capable Opus model yet. It handles long-running tasks with more rigor, follows instructions more precisely, and verifies its own outputs before reporting back. You can hand off your hardest work with less supervision.

Henry Lu
Henry Lu@HenryL_AI·
@deedydas Opus 4.7 is impressive, but when you let the agent self-evolve its own harness, Opus 4.6 + A-Evolve can beat Opus 4.7. Check our quick experiments and the system card below.
Deedy
Deedy@deedydas·
Opus 4.7 benchmarks colored by ranking. – Strong coding (SWE-Bench) bump – Strong Computer use bump – Strong visual reasoning (CharXiv) bump – Weak Terminal Bench bump – BrowseComp regression Slots in between 4.6 and Mythos. [Chart generated by 4.7]
Henry Lu
Henry Lu@HenryL_AI·
Key takeaway: Opus 4.7 is impressive, but the biggest leap still comes from letting the agent self-evolve its own harness. A-Evolve + Opus 4.6 was able to close much of the gap (or even outperform on certain tasks) by continuously mutating skills, verification loops, and self-correction mechanisms. This suggests harness evolution can be more impactful than the next model jump. Who else thinks the real unlock is agents that can improve their own harness at machine speed? 👀
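"Continuously mutating skills, verification loops, and self-correction mechanisms" is, at its core, population-based search over harness configurations. A minimal sketch of that idea under stated assumptions: the knobs, fitness function, and mutation operator below are invented for illustration and are not A-Evolve's internals.

```python
import random

def evolve_harness(seed_harness, fitness, mutate, pop_size=8, keep=2,
                   generations=15, rng=None):
    """Population-based harness evolution: score every variant, keep the
    top `keep` (elitism), and refill with mutations of the survivors."""
    rng = rng or random.Random(0)
    population = [seed_harness] + [mutate(seed_harness, rng)
                                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        survivors = sorted(population, key=fitness, reverse=True)[:keep]
        population = survivors + [mutate(rng.choice(survivors), rng)
                                  for _ in range(pop_size - keep)]
    return max(population, key=fitness)

# Toy "harness": two knobs, with a benchmark score that rewards
# verification passes but penalizes straying from 3 retries.
def fitness(h):
    return h["verify_passes"] * 2 - abs(h["retries"] - 3)

def mutate(h, rng):
    h = dict(h)
    knob = rng.choice(["retries", "verify_passes"])
    h[knob] = max(0, min(5, h[knob] + rng.choice([-1, 1])))
    return h

seed = {"retries": 0, "verify_passes": 0}
best = evolve_harness(seed, fitness, mutate)
```

Elitism (keeping the top survivors unmutated) is what lets the best harness only improve between generations, which matters when each fitness evaluation is an expensive benchmark run.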