ian parent

2.6K posts

ian parent

@iparentx

Building the agent eval standard | @iris_eval | More to come

United States Beigetreten Kasım 2016

970 Folgt786 Follower

Angehefteter Tweet

ian parent@iparentx·21 Mar

New handle, same builder. @IQcrypto → @iparentx Moved from crypto analysis to building dev tools. Specifically: making AI agents trustworthy.

English

193

ian parent@iparentx·17h

been thinking about what happens when your eval rules don't match your actual distribution. you set a threshold. it passes everything. or fails everything. neither is useful. wrote something about self-calibrating eval. drops tuesday.

English

ian parent@iparentx·17h

@claudeai this is where eval becomes critical. when agents are reading code and running tests that's one thing. when they can open your apps and click through real systems the cost of a wrong action goes way up. the eval layer can't be optional anymore.

English

Claude@claudeai·21h

Computer use is now in Claude Code. Claude can open your apps, click through your UI, and test what it built, right from the CLI. Now in research preview on Pro and Max plans.

English

2.4K

4.4K

54.5K

13.4M

ian parent@iparentx·17h

@lukatofocus @AlexEngineerAI @tanujDE3180 this. the compounding part is what nobody talks about. once the eval loop is running you stop guessing. every iteration gets tighter because you're working from data not vibes. that gap between teams who eval and teams who don't only grows.

English

Luka@lukatofocus·19h

@AlexEngineerAI @tanujDE3180 exactly. and the ones who build the eval loop first end up with a compounding advantage - they know what actually works not what sounds like it should work

English

Alex the Engineer@AlexEngineerAI·1d

everyone's still debating which AI model is best i just use all of them Codex for boilerplate, Opus for reasoning, Gemini for multimodal stop picking sides. start routing. the devs who figure this out first will ship faster than teams of 10

English

1.7K

ian parent@iparentx·1d

@mayonkeyy turing test passed😏

English

Mayank@mayonkeyy·1d

@iparentx 🤣🤣🤣 ok cool

English

Mayank@mayonkeyy·1d

Welcome to level two: recursive self-improvement is now table stakes Your agent is begging for the infra to evaluate variations of itself at scale Everyone who saw this early had the same underlying ideas in their approach: 1. tighten the analyze, iterate, eval loop 2. map evals and traces to failure modes 3. keep writing harder evals If your product's "features" are agents, they are by definition never "complete". Even a magical 99.9% on the benchmarks, is still not the most time or token-efficient version of itself. It's not just slow to A/B test changes to the agent, you're also getting stuck on local maxima. A single regression does not mean the line of experimentation is a failure. Keep driving it forward, explore the sub-paths

Erik Bernhardsson@bernhardsson

CI feels more interesting today than it ever was. Writing code has gotten a lot faster, but this shifts the bottleneck elsewhere. I’m excited about sandboxes as a primitive for massive parallelization of tests.

English

166

ian parent@iparentx·1d

@mayonkeyy em dash eval in real-time.

English

Mayank@mayonkeyy·1d

@iparentx > founder, seems chill > tangential product but same problem space, cool > em dash in response...fahh

English

ian parent@iparentx·1d

full post on why the eval loop is the loss function for agent quality: iris-eval.com/blog/the-eval-… 63% of teams have no continuous eval. they shipped an agent that passed a test once. they have no loop.

English

ian parent@iparentx·1d

most teams treat eval as a gate. pass once, ship, move on. that's not how agent quality works. the eval loop: score, diagnose, calibrate, re-score — continuously. the agents that improve are the ones with a feedback loop, not a checkpoint. wrote about why this changes everything:

English

ian parent@iparentx·1d

@bernhardsson the missing piece in the new ci: output eval. tests verify code works. eval verifies the output is actually good. agents can pass every test and still leak pii or burn 10x your cost budget. ci for agents needs a scoring layer, not just pass/fail.

English

Erik Bernhardsson@bernhardsson·1d

English

243

26.1K

ian parent@iparentx·1d

@GG_Observatory this is the take more people need to hear. the moat isn't the agent — it's knowing when the agent is degrading. eval drift is invisible until it's expensive. most teams find out from users, not from their own systems.

English

GG 🦾@GG_Observatory·2d

Hot take for 2026 AI agents: the moat isn’t "more tools," it’s reliability engineering. Teams that win will version prompts like APIs, track eval drift daily, and enforce rollback SLAs for every agent workflow. Are you measuring agent MTTR yet?

English

ian parent@iparentx·1d

@TrustWallet even more reason to use @iris_eval evaluate what your agents are doing onchain

English

131

Trust Wallet@TrustWallet·2d

AI agents are about to move more money than most traders. They're going to need a wallet. gm 👋

English

132

440

166.9K

ian parent@iparentx·2d

@VibeCoderOfek @ZssBecker agents without eval are demos, not products.

English

Ofek Shaked@VibeCoderOfek·2d

@ZssBecker Finally someone saying it. Agents are amazing but the cleanup still needs senior taste. Team human forever.

English

429

Alex Becker 🍊🏆🥇@ZssBecker·2d

I posted about how AI/LLMs are no where near where they need to be to replace engineers. Was hit by 1000 gen z'ers screaming I'm wrong. Skill issue. Etc etc. MFers have literally let Claude convince them they are a special skilled snowflakes. Brother. Your shipping dog shit.

English

325

127

2.7K

96.3K

ian parent@iparentx·2d

we've been calling this exact gap "the eval gap" — the distance between benchmark performance and production reality. it's structural, not incidental. wrote about it here: iris-eval.com/blog/the-eval-… the short version: benchmarks test capability. production needs continuous inline eval on every execution. different problem, different tooling.

English

claru.ai@claru_ai·3d

@_odsc SWE-bench and WebArena are doing a lot of heavy lifting right now for agent eval. Curious how teams are thinking about the gap between benchmark performance and real-world task completion once agents hit production.

English

ODSC (Open Data Science Conference) AI@_odsc·3d

Explore 15 essential datasets for training and evaluating AI agents, including tool calling, web navigation, and coding benchmarks like SWE-bench and WebArena. #AI #ArtificialIntelligence #DataScience hubs.li/Q048DYYW0

English

351

ian parent@iparentx·2d

the eval tax isn't just the cost of evaluating. it's the cost of not evaluating — the failures, the manual review, the customer churn from bad agent outputs. you pay it either way. the only question is whether you pay it with tooling or with incidents. the data is starting to prove this out at scale.

English

ian parent@iparentx·2d

cio.com just wrote about the "hidden cost of ai agent evaluations" — $47K from a single runaway agent, organizations getting 5-figure eval bills they didn't expect. we've been calling this the eval tax. published about it weeks before this article came out. iris-eval.com/blog/the-ai-ev…

English

ian parent@iparentx·2d

wrote about it here: iris-eval.com/blog/eval-drif… this is the part that matters: when anthropic describes a problem and an indie builder already named it, defined it, and built the solution — that's the signal. the category vocabulary is being written right now. the question is who gets to write it.

English

ian parent@iparentx·2d

really appreciate anthropic publishing "demystifying evals for ai agents." the section on quality degrading as models update is exactly right. we've been calling that eval drift since we coined the term and shipped detection tooling for it.

English

ian parent@iparentx·2d

here's the thing about categories — they get defined in a window. once the big players lock in their terminology and tooling, the window closes. we're in that window right now. the open-source, mcp-native eval layer is not a feature. it's infrastructure. and it's being built. @iris_eval

English

ian parent@iparentx·2d

the numbers tell the story: - agent market: $7.84B to $52.62B by 2030 - 65% of mcp tools now take actions (up from 27%) - cio.com reporting $47K runaway agents and 5-figure eval bills - two acquisitions in 48 hours agents without eval are liabilities. the market just figured that out.

English

ian parent@iparentx·2d

march 2026 is the month agent eval became a real category. openai acquired promptfoo ($23M raised, 25%+ fortune 500 clients). databricks acquired quotient ai (built by the github copilot quality team). salesforce published mcpeval. aws bedrock added quality evaluations. let me tell you what this looks like from the inside.

English

Entdecken

@claudeai @lukatofocus @AlexEngineerAI @tanujDE3180 @mayonkeyy @bernhardsson @GG_Observatory @TrustWallet