
"Agents' Last Exam" With how AI keeps acing benchmarks, yet that hasn't turned into real economic value, the authors of this paper say the benchmarks are the problem, since none measure sustained work on real professional jobs. So they built ALE: 1,000+ long, real tasks across 55 fields, taken from work 250+ experts actually did, and graded automatically instead of by humans. Today's best agents barely pass with under 3% on the hardest tier. They also find the model matters ~3x more than the harness, and most failures come from lacking domain knowledge, not bad execution.



















