Utkarsh Bali

10 posts

Utkarsh Bali banner
Utkarsh Bali

Utkarsh Bali

@ubali07

shipping agent testing infra in sf incoming ai automation @recurly, prev @Qual_Gent (yc x25)

San Francisco, CA เข้าร่วม Mayıs 2025
178 กำลังติดตาม6 ผู้ติดตาม
AI Professor 蓝V互关
@ubali07 @RobinNewhouse @cline Agent testing becomes serious when it is tied to real workflows, not toy prompts. The valuable evals are usually boring: regression cases, tool failures, state drift, and whether the agent recovers without hiding the error.
English
1
0
0
21
Utkarsh Bali
Utkarsh Bali@ubali07·
Last week, I spent 90 minutes with @RobinNewhouse, Senior SWE Applied AI at @cline, discussing agent testing. He's the person building evals infrastructure for one of the most-used open source coding agents in the world. Here's a few things from the talk that stayed with me:
English
3
0
1
28
Utkarsh Bali
Utkarsh Bali@ubali07·
@BoazWith @RobinNewhouse @cline That’s right. He mentioned that some agents literally cache answers in binaries or pull the solutions from web mid-eval. Technically passing the tests, but useless in prod. That’s why he stressed so much on agent’s trajectory analysis to see ‘how’ instead of only pass/fail.
English
1
0
0
6
Boaz Hwang
Boaz Hwang@BoazWith·
@ubali07 @RobinNewhouse @cline What was the weirdest eval failure mode he mentioned? My worry is always tests passing while the agent learned the wrong shortcut.
English
1
0
0
16
Utkarsh Bali
Utkarsh Bali@ubali07·
The future won't be "chatting" with the agent. As models continue to improve, people are going to "one-shot" everything. "Make a PR for this feature" -> DONE. That's the future. So, agent testing infra will be the foundation everything else is built on.
English
1
0
0
12