Utkarsh Bali

@ubali07

shipping agent testing infra in sf · incoming ai automation @recurly · prev @Qual_Gent (yc x25)

San Francisco, CA · Joined May 2025

178 Following · 6 Followers

Utkarsh Bali@ubali07·53m

@Gsdata5566 @RobinNewhouse @cline Exactly. The boring stuff IS the eval!

AI Professor 蓝V互关@Gsdata5566·7h

@ubali07 @RobinNewhouse @cline Agent testing becomes serious when it is tied to real workflows, not toy prompts. The valuable evals are usually boring: regression cases, tool failures, state drift, and whether the agent recovers without hiding the error.

Utkarsh Bali@ubali07·9h

Last week, I spent 90 minutes with @RobinNewhouse, Senior SWE Applied AI at @cline, discussing agent testing. He's the person building eval infrastructure for one of the most-used open source coding agents in the world. Here are a few things from the talk that stayed with me:

Utkarsh Bali@ubali07·57m

@BoazWith @RobinNewhouse @cline That’s right. He mentioned that some agents literally cache answers in binaries or pull the solutions from the web mid-eval. Technically passing the tests, but useless in prod. That’s why he put so much stress on analyzing the agent’s trajectory: to see ‘how’ it got there instead of only pass/fail.

Boaz Hwang@BoazWith·8h

@ubali07 @RobinNewhouse @cline What was the weirdest eval failure mode he mentioned? My worry is always tests passing while the agent learned the wrong shortcut.

Utkarsh Bali@ubali07·9h

That's why I'm building @checkpointdata with the best people I know, @gupta_ayush2006 and @agaur02. Because every agent deserves a CI/CD pipeline. usecheckpoint.dev Thanks for the real talk, @RobinNewhouse @cline 🙏 More soon. #BuildInPublic #LLMOps

Utkarsh Bali@ubali07·9h

The future won't be "chatting" with the agent. As models continue to improve, people are going to "one-shot" everything. "Make a PR for this feature" -> DONE. That's the future. So, agent testing infra will be the foundation everything else is built on.
