Utkarsh Bali
10 posts

Utkarsh Bali
@ubali07
shipping agent testing infra in sf incoming ai automation @recurly, prev @Qual_Gent (yc x25)
San Francisco, CA เข้าร่วม Mayıs 2025
178 กำลังติดตาม6 ผู้ติดตาม

@ubali07 @RobinNewhouse @cline Agent testing becomes serious when it is tied to real workflows, not toy prompts. The valuable evals are usually boring: regression cases, tool failures, state drift, and whether the agent recovers without hiding the error.
English

Last week, I spent 90 minutes with @RobinNewhouse, Senior SWE Applied AI at @cline, discussing agent testing. He's the person building evals infrastructure for one of the most-used open source coding agents in the world.
Here's a few things from the talk that stayed with me:
English

@BoazWith @RobinNewhouse @cline That’s right. He mentioned that some agents literally cache answers in binaries or pull the solutions from web mid-eval. Technically passing the tests, but useless in prod.
That’s why he stressed so much on agent’s trajectory analysis to see ‘how’ instead of only pass/fail.
English

@ubali07 @RobinNewhouse @cline What was the weirdest eval failure mode he mentioned? My worry is always tests passing while the agent learned the wrong shortcut.
English

That's why I'm building @checkpointdata with the best people I know @gupta_ayush2006 @agaur02
Because every agent deserves a CI/CD pipeline.
usecheckpoint.dev
Thanks for the real talk @RobinNewhouse @cline 🙏
More soon.
#BuildInPublic #LLMOps
English