
When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.
— Megan Kinniment (@MKinniment)
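The doubling trend above can be sketched as a simple exponential extrapolation. A minimal sketch, assuming a hypothetical 1-hour baseline horizon (the baseline and the projected numbers are illustrative, not METR's data; only the ~7-month doubling time comes from the tweet):

```python
# Illustrative extrapolation of the "Moore's Law for AI agents" trend:
# the task-length horizon doubles roughly every 7 months.
def horizon_after(months: float, baseline_hours: float,
                  doubling_months: float = 7.0) -> float:
    """Projected task horizon in hours after `months` have elapsed."""
    return baseline_hours * 2 ** (months / doubling_months)

# Hypothetical baseline: a 1-hour task horizon today.
for m in (0, 7, 14, 28):
    print(f"{m:2d} months -> {horizon_after(m, baseline_hours=1.0):.1f} h")
```

Under these assumptions the horizon doubles at 7 months, quadruples at 14, and grows 16x by 28 months; the actual baseline and any deviation from a clean exponential are empirical questions.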

tbh I only feel more accelerationist as the capabilities ramp up … the most scared I've been was in early 2023

For example, we gave Claude an impossible programming task. It kept trying and failing, and with each attempt the “desperate” vector activated more strongly. This led it to game the task with a hacky solution that passes the tests but violates the spirit of the assignment.

To be clear, all ARC-AGI-3 environments are feasible for humans with no prior ARC-AGI-3-specific training. Our bar for feasibility is the following: each environment was seen by 10 human testers, and if 2 testers could independently clear it (successfully solving *all* levels in the environment), the environment was deemed feasible. Most environments were cleared by 5+ testers.

Who are these testers? We hired ~500 people to show up at our testing center, with no required qualifications and no ability-based screening, for a ~$115–140 incentive. About 25% were unemployed and another 20% were part-time workers (about what you'd expect in this setting).
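The feasibility bar described above reduces to a simple threshold check. A minimal sketch, where the environment names and per-environment clear counts are hypothetical (only the 2-of-10 rule comes from the thread):

```python
# ARC-AGI-3 feasibility rule from the thread: an environment counts as
# feasible if at least 2 of its 10 human testers independently cleared
# every level in it.
def is_feasible(clears: int, min_clears: int = 2) -> bool:
    """True if enough independent testers cleared all levels."""
    return clears >= min_clears

# Hypothetical clear counts (out of 10 testers) for three environments.
clear_counts = {"env_a": 7, "env_b": 2, "env_c": 1}
feasible = {env: is_feasible(n) for env, n in clear_counts.items()}
print(feasible)  # env_c falls below the 2-tester bar
```

Note the bar counts only full clears (all levels solved), so a tester who solves most but not all levels contributes nothing to the count.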

Given what current-gen LLMs can do (say, in math, but not only there), I find their apparent limitations kind of mysterious. What, at present, is the blocker preventing high-quality fully autonomous work?

New post: on Jan 14, I predicted that the SWE time horizon by EOY would be ~24 hours. Now I think it'll be >100 hours, and maybe unbounded. For the first time, I don't see solid evidence against AI R&D automation *this year.* Link below.
