A Failure-Focused Evaluation of Frontier Models
Benchmark scores tell you which model is "best on average", but not where each model fails.
We reproduced a set of difficult evaluations on seven frontier models to investigate two signals: consistent failures and task-specific advantages.
Our findings:
→ 85.2% average failure rate on Humanity’s Last Exam across all seven models evaluated.
→ 46.2% of Humanity’s Last Exam questions were failed by all seven models under these evaluation conditions.
→ Nearly 80% of engineering problems, including structural analysis, thermodynamics, and control systems, remained unsolved by all models.
Let’s dig deeper (1/8)
