ian parent

2.7K posts

ian parent

@iparentx

Building the agent eval standard | @iris_eval | More to come

United States Katılım Kasım 2016

970 Takip Edilen786 Takipçiler

Sabitlenmiş Tweet

ian parent@iparentx·21 Mar

New handle, same builder. @IQcrypto → @iparentx Moved from crypto analysis to building dev tools. Specifically: making AI agents trustworthy.

English

205

ian parent@iparentx·6h

the org chart is changing

English

ian parent@iparentx·6h

@audiencon always be shipping

English

Audiencon⚡️@audiencon·23h

The best founders ship before they feel ready.

English

ian parent@iparentx·7h

@SaidAitmbarek @firstadarsh agents need eval

English

Saïd Aitmbarek@SaidAitmbarek·15h

@firstadarsh totally, agents do not need UI

English

Saïd Aitmbarek@SaidAitmbarek·16h

The future of agents is transactional.

English

278

ian parent@iparentx·7h

@boringmarketer @Eric_M_Stevens when the moat builds the moat

English

The Boring Marketer@boringmarketer·13h

@Eric_M_Stevens as someone building those, it's not as easy as it sounds :)

English

740

The Boring Marketer@boringmarketer·17h

Distribution is your moat. Can’t stop thinking about this.

English

122

1.1K

51.9K

ian parent retweetledi

Daily AI Wire News@DailyAIWireNews·22 Mar

The AI Eval Tax: The Hidden Cost of Unevaluated Agent Outputs (Source: Iris-Eval) The 'AI Eval Tax' represents the compounding costs of unevaluated AI agent outputs, including financial losses, engineering time, liability exposure, and trust erosion. #AI #evaluation #hallucination #liability #trust 🤔 How can organizations effectively measure and mitigate the 'AI Eval Tax'? s.dailyaiwire.news/5QreNn

English

ian parent@iparentx·16h

the harness is the moat. but there's a layer most teams skip even when the harness is solid. you can orchestrate perfectly and still ship bad outputs if nobody is scoring what comes out. the eval layer sits between the harness and production. it's exactly what i'm building with iris.

English

106

Yuchen Jin@Yuchenj_UW·18h

Beyond raw model capability, the real gap in coding tools is the harness. Now that 500k+ lines of Claude Code are out there, every model lab and AI coding startup, including open-source AI labs, will study it and close that gap fast. SF already has Claude Code source walkthrough meetups lol.

English

743

82.3K

ian parent@iparentx·17h

@DailyAIWireNews The eval gap in practice. Most agent teams assume their outputs are correct because nothing visibly broke. But "no errors" and "correct output" are very different things. Scoring every output for quality, safety, and cost inline is the missing layer.

English

Daily AI Wire News@DailyAIWireNews·20h

Agentic AI Systems Lack Correctness Guarantees, Posing High-Stakes Risks (Source: Johndcook) Agentic AI systems lack guaranteed correctness, posing risks for critical applications. #AICorrectness #AIEthics #AIGovernance #AgenticAI #ReliableAI 🤔 What level of AI reliability is 'good enough' for critical human-centric applications, and who defines it? s.dailyaiwire.news/tAr79f

English

ian parent@iparentx·17h

wrote up the full pattern. why thresholds decay. what self-calibrating eval looks like in practice. and why the eval advisor is where this is all heading. iris-eval.com/blog/self-cali…

English

ian parent@iparentx·17h

self-calibrating eval. the system monitors its own scoring distributions. detects when thresholds drift. recommends adjustments. a human always approves. eval rules that evaluate themselves.

English

ian parent@iparentx·17h

static eval thresholds have an expiration date. you set a cost threshold at $0.50. three months later it's flagging half your traffic. nothing changed in your code. the environment shifted under you.

English

ian parent@iparentx·1d

been thinking about what happens when your eval rules don't match your actual distribution. you set a threshold. it passes everything. or fails everything. neither is useful. wrote something about self-calibrating eval. drops tuesday.

English

ian parent@iparentx·1d

@claudeai this is where eval becomes critical. when agents are reading code and running tests that's one thing. when they can open your apps and click through real systems the cost of a wrong action goes way up. the eval layer can't be optional anymore.

English

Claude@claudeai·1d

Computer use is now in Claude Code. Claude can open your apps, click through your UI, and test what it built, right from the CLI. Now in research preview on Pro and Max plans.

English

2.5K

4.8K

58.4K

15.2M

ian parent@iparentx·1d

@lukatofocus @AlexEngineerAI @tanujDE3180 this. the compounding part is what nobody talks about. once the eval loop is running you stop guessing. every iteration gets tighter because you're working from data not vibes. that gap between teams who eval and teams who don't only grows.

English

Luka@lukatofocus·1d

@AlexEngineerAI @tanujDE3180 exactly. and the ones who build the eval loop first end up with a compounding advantage - they know what actually works not what sounds like it should work

English

Alex the Engineer@AlexEngineerAI·2d

everyone's still debating which AI model is best i just use all of them Codex for boilerplate, Opus for reasoning, Gemini for multimodal stop picking sides. start routing. the devs who figure this out first will ship faster than teams of 10

English

1.7K

ian parent@iparentx·2d

@mayonkeyy turing test passed😏

English

Mayank@mayonkeyy·2d

@iparentx 🤣🤣🤣 ok cool

English

Mayank@mayonkeyy·2d

Welcome to level two: recursive self-improvement is now table stakes Your agent is begging for the infra to evaluate variations of itself at scale Everyone who saw this early had the same underlying ideas in their approach: 1. tighten the analyze, iterate, eval loop 2. map evals and traces to failure modes 3. keep writing harder evals If your product's "features" are agents, they are by definition never "complete". Even a magical 99.9% on the benchmarks, is still not the most time or token-efficient version of itself. It's not just slow to A/B test changes to the agent, you're also getting stuck on local maxima. A single regression does not mean the line of experimentation is a failure. Keep driving it forward, explore the sub-paths

Erik Bernhardsson@bernhardsson

CI feels more interesting today than it ever was. Writing code has gotten a lot faster, but this shifts the bottleneck elsewhere. I’m excited about sandboxes as a primitive for massive parallelization of tests.

English

167

ian parent@iparentx·2d

@mayonkeyy em dash eval in real-time.

English

Mayank@mayonkeyy·2d

@iparentx > founder, seems chill > tangential product but same problem space, cool > em dash in response...fahh

English

ian parent@iparentx·2d

full post on why the eval loop is the loss function for agent quality: iris-eval.com/blog/the-eval-… 63% of teams have no continuous eval. they shipped an agent that passed a test once. they have no loop.

English

ian parent@iparentx·2d

most teams treat eval as a gate. pass once, ship, move on. that's not how agent quality works. the eval loop: score, diagnose, calibrate, re-score — continuously. the agents that improve are the ones with a feedback loop, not a checkpoint. wrote about why this changes everything:

English

ian parent@iparentx·2d

@bernhardsson the missing piece in the new ci: output eval. tests verify code works. eval verifies the output is actually good. agents can pass every test and still leak pii or burn 10x your cost budget. ci for agents needs a scoring layer, not just pass/fail.

English

Erik Bernhardsson@bernhardsson·2d

English

244

26.3K

Keşfet

@audiencon @SaidAitmbarek @firstadarsh @boringmarketer @Eric_M_Stevens @DailyAIWireNews @claudeai @elonmusk