Gavin J.
@RivraDev
Engineer → Micro SaaS founder · 19 yrs Silicon Valley big tech · I build small systems that create leverage. Automation • Systems • Micro SaaS


A Stanford paper shows that AI agents get better when you optimize the harness around the model, not just the model itself. On TerminalBench-2 with Claude Haiku 4.5, the optimized harness scored 37.6%, ahead of Claude Code at 27.5%. The real story is not that a model got smarter, but that its surrounding machinery got less fragile.

In this paper, the harness includes prompts, tool definitions, context management, and the logic that decides when a task is done. Meta-Harness turns that wrapper into something a coding agent can search: it reads prior source code, execution traces, and scores, then proposes precise fixes.

Many agent failures are hidden in raw logs, where timeouts, bad tool calls, or premature stopping leave a forensic trail. By giving the optimizer a filesystem full of those traces, rather than a short summary or a single score, the method can diagnose causes instead of guessing. That scale difference matters: the authors argue earlier optimization methods expose only tiny slices of history, while Meta-Harness can inspect vastly more context from previous runs.

The gains show up across tasks: 48.6% versus 40.9% on text classification, 38.8% versus 34.1% on retrieval-augmented math, and the best reported Haiku 4.5 result on TerminalBench-2.

But the strongest claim is narrower than the hype: this does not retrain the model or prove open-ended self-improvement. It shows something more practical and, in some ways, more unsettling. A large share of what we call model performance is really interface performance, and better scaffolding can look a lot like better intelligence.

yoonholee.com/meta-harness/
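To make the loop concrete, here is a minimal toy sketch of "optimize the wrapper, not the model." Everything in it is an assumption for illustration: the `Harness` fields, the failure strings, and the scoring are invented stand-ins, not the paper's interface. The real Meta-Harness has a coding agent read and rewrite harness source code against actual benchmark traces; this stub just reacts to two failure signatures.

```python
# Hypothetical sketch of a harness-optimization loop.
# All names and failure signatures here are illustrative, not from the paper.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Harness:
    system_prompt: str   # instructions wrapped around the model
    max_steps: int       # step budget before the agent is forced to stop
    tool_timeout_s: int  # per-tool-call timeout

def evaluate(harness: Harness) -> tuple[float, list[str]]:
    """Stand-in for running a benchmark: returns a score plus execution
    traces. This toy scorer just penalizes a small step budget and a
    short tool timeout."""
    traces: list[str] = []
    score = 0.0
    if harness.max_steps < 20:
        traces.append(f"FAIL: premature stop at step {harness.max_steps}")
    else:
        score += 0.2
    if harness.tool_timeout_s < 30:
        traces.append("FAIL: tool call timed out")
    else:
        score += 0.2
    return score, traces

def propose_fix(harness: Harness, traces: list[str]) -> Harness:
    """Stand-in for the optimizer diagnosing traces and patching the
    harness. It widens whichever budget the traces blame."""
    for t in traces:
        if "premature stop" in t:
            return replace(harness, max_steps=harness.max_steps * 2)
        if "timed out" in t:
            return replace(harness, tool_timeout_s=harness.tool_timeout_s * 2)
    return harness  # no diagnosed failure: leave the harness alone

def optimize(harness: Harness, rounds: int = 5) -> Harness:
    """Greedy search: keep a candidate patch only if its score improves."""
    best, best_score = harness, evaluate(harness)[0]
    for _ in range(rounds):
        _, traces = evaluate(best)
        candidate = propose_fix(best, traces)
        cand_score, _ = evaluate(candidate)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
    return best

tuned = optimize(Harness("You are a careful terminal agent.", max_steps=10, tool_timeout_s=15))
```

The point of the sketch is the feedback source: `propose_fix` sees the raw traces, not just the scalar score, which is what lets it repair the actual cause of a failure instead of mutating blindly.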