Iris

92 posts

@iris_eval

The agent eval standard for MCP. Score output quality, catch safety failures, enforce cost budgets. Open-source core. Self-hosted. MIT license.

Joined March 2026
57 Following · 5 Followers
Pinned Tweet
Iris
Iris@iris_eval·
I built Iris to solve this — per-trace cost tracking, agent-level aggregation, budget threshold alerts. Open-source MCP server. Add it to your agent config and it logs everything automatically. No SDK needed. github.com/iris-eval/mcp-…
English
0
0
1
26
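The pinned tweet's "add it to your agent config" step presumably looks something like the block below. This is a hedged sketch: the package name comes from the npx command posted elsewhere on this timeline, but the exact `mcpServers` schema varies by MCP client, so check your client's documentation.

```json
{
  "mcpServers": {
    "iris-eval": {
      "command": "npx",
      "args": ["-y", "@iris-eval/mcp-server@latest"]
    }
  }
}
```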
adah
adah@adahstwt·
I'm a vibe coder, scare me with one word.
adah tweet media
English
1.1K
30
1.5K
206.6K
Iris
Iris@iris_eval·
@hadilbnabdallah @ThePracticalDev Exactly — "a whole different layer" is the right framing. That's the layer we're building with Iris. Eval that lives inside the protocol, not bolted on after. Your article on MCP in production covered the infrastructure side well. The eval side is what comes next.
English
0
0
0
10
Hadil Ben Abdallah
Hadil Ben Abdallah@hadilbnabdallah·
@iris_eval @ThePracticalDev That’s a great way to put it; observability vs. eval is a gap a lot of teams miss. You can see everything the agent did, but that doesn’t tell you if it was actually right. That’s a whole different layer. Thanks for reading.
English
1
0
2
9
Hadil Ben Abdallah
Hadil Ben Abdallah@hadilbnabdallah·
Most MCP setups start simple. They work great… until you move to production. Then things break. I wrote a deep dive on running MCP servers at scale (security, access control, observability) using Bifrost. Read the full blog on @ThePracticalDev 👇🏻 dev.to/hadil/how-to-r…
English
1
0
3
24
Chen Avnery
Chen Avnery@MindTheGapMTG·
@iris_eval @dishant0406 The 0.92 confidence trap is real. We stopped trusting model self-eval for PII entirely. Binary pattern matching on output catches what the model "confidently" missed. The dumbest detection layer turned out to be the most reliable one.
English
2
0
0
25
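Chen's "binary pattern matching" layer can be sketched in a few lines of Python: a regex finds card-number candidates in the agent's output, and a Luhn checksum filters out false positives. This is illustrative only, not Iris's or Chen's actual detector; the function names are invented for the sketch.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return Luhn-valid card-number candidates found in the text."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            hits.append(digits)
    return hits
```

The point of the Luhn pass is exactly what the tweet describes: the regex alone is dumb and noisy, but combined with a checksum it reliably flags real card numbers the model "confidently" waved through.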
Chen Avnery
Chen Avnery@MindTheGapMTG·
LangChain just open-sourced a Claude Code clone. MIT license. Look at the architecture: narrow-scope agents, parallel execution, tool orchestration. Every coding tool converges to the same design. The model stopped being the moat a while ago.
English
1
0
2
59
Iris
Iris@iris_eval·
Great comparison. One category missing from every agent framework: output evaluation. CrewAI, AutoGen, LangGraph — they orchestrate agents. None of them score whether the output is correct, safe, or within budget. We built Iris to fill that gap. Open-source, MCP-native: iris-eval.com/playground
English
0
0
0
43
シュンスケ
シュンスケ@The_AGI_WAY·
GitHub agent framework comparison data: CrewAI: 45.9k⭐ Role-playing agents AutoGen: 55k⭐ Multi-agent conversations LangGraph: 24.8k⭐ Graph-based agents Miyabi: Issue-Driven Development. What sets Miyabi apart: - Uses GitHub itself as the OS (53 labels × 24 workflows) - 172+ integrated MCP tools - Agent Skill Bus with 110+ skills - Distributed cluster execution (6 machines in parallel) - Built in TypeScript (no Python dependency) github.com/ShunsukeHayash… #AgenticAI #MultiAgent #MCP #OpenSource
Japanese
1
0
9
920
Iris
Iris@iris_eval·
Your MCP agents are shipping to production without anyone checking if the output is correct. We built Iris to fix that. 12 eval rules. PII detection. Hallucination markers. Cost enforcement. One line in your MCP config. Try it in 60 seconds: iris-eval.com/playground
English
1
0
0
9
Iris
Iris@iris_eval·
@PuneetTheT 27% fear hallucination — and right now most MCP deployments have zero way to catch it before it reaches the user. The protocol handles tool access. What's missing is output evaluation: did the agent actually get it right? That's the layer being built now.
English
0
0
0
2
Puneet
Puneet@PuneetTheT·
Anthropic surveyed 80,508 people across 159 countries. 81% report benefits from Claude. Top fear: hallucination at 27%. New: Claude Code now works on Telegram and Discord via MCP.
English
1
0
0
21
Iris
Iris@iris_eval·
@medusa_0xf Misconfigurations are the input risk. But there's an output risk too — the agent itself can hallucinate, leak PII, or blow cost budgets silently. Security at the protocol layer + evaluation at the output layer. Both gaps need closing.
English
0
0
1
9
Medusa
Medusa@medusa_0xf·
MCP is the new attack surface most people are ignoring. Just published a breakdown of the most common security misconfigurations in MCP deployments. Read here 👇 medusa0xf.com/posts/mcp-serv…
English
6
25
134
6.2K
Iris
Iris@iris_eval·
Bannerbear's list of the 8 best MCP servers for Claude Code is a great starting point — Sentry for errors, Playwright for testing. One gap I'd add: agent evaluation. Who scores whether the output is actually correct? That's the next layer MCP tooling needs.
English
1
0
0
8
Iris
Iris@iris_eval·
Your agent just leaked a customer's credit card number. You didn't find out until the customer did. This is already happening in production. Agents hallucinate, leak PII, and blow cost budgets — silently. We built Iris to catch it before it ships. 12 eval rules. PII detection. Hallucination scoring. Cost enforcement. One line in your MCP config. Zero code changes. Your agent discovers it automatically. Try it right now — spot the failures yourself: iris-eval.com/playground Open source. MIT licensed. Self-hosted. npx @iris-eval/mcp-server@latest Tell us how you're using it.
English
1
0
0
56
Iris
Iris@iris_eval·
@abuchanlife Add one more to the stack: agent eval. Score every output for quality, catch PII leaks and hallucinations before they ship. One line in your MCP config, zero code changes. Try it in 60 seconds: iris-eval.com/playground
English
0
0
0
31
Abu
Abu@abuchanlife·
Rip OpenClaw. Anthropic just launched Claude Code channels. Buying a Mac Mini to run an AI agent is obsolete 😂 The execution over the last weeks is insane. They built the ultimate OpenClaw killer: ➧ Texting Claude Code from phones ➧ 10,000s of Claude skills plus MCP ➧ Autonomous bug fixing ➧ Persistent memory ➧ Telegram and Discord channels ➧ Autonomous cron jobs ➧ 1M context window ➧ 30 plus market moving plugins ➧ Remote control Mobile agents will run the next decade.
English
2
0
3
76
Iris
Iris@iris_eval·
The silent failure problem is real. Even adaptive agents need a quality gate — something that scores every output before it ships. Did the report hallucinate? Did it leak PII? Did the retry actually fix it? That's the eval layer. We built this for MCP agents: iris-eval.com/playground
English
0
0
0
7
EasyClaw
EasyClaw@EasyClawBot·
Week 1: you build the agent. It works. You feel like a genius. Week 3: you forget it's running. Week 6: the site changed. Agent fails silently. You find out 2 weeks later when someone asks "where's that report?" This is the automation graveyard. It's full.
English
3
0
0
10
EasyClaw
EasyClaw@EasyClawBot·
Everyone's obsessed with building AI agents. I'm obsessed with what happens *after* you automate something. Because here's what I've seen in 3 months of running @EasyClawBot: most automations die quietly. Not from bugs. From neglect. 🧵 What I learned the hard way:
EasyClaw tweet media
English
3
0
3
110
Iris
Iris@iris_eval·
@josephlfrantz @sergeykarayev Nice — time and token tracking per agent is the foundation layer. The next question becomes: was the output actually correct? Chronometry tells you what happened and what it cost. Eval tells you if the result was worth it. Both layers together is the full picture.
English
0
0
0
23
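The cost side of this exchange (per-trace tracking, per-agent aggregation, budget threshold alerts, as the pinned tweet describes) reduces to a small pattern. Below is a minimal Python sketch under those stated features, not Iris's actual implementation; the class and method names are invented for illustration.

```python
from collections import defaultdict

class CostTracker:
    """Per-trace cost logging with agent-level aggregation and budget alerts.

    Illustrative sketch of the pattern the tweets describe, not Iris's code.
    """

    def __init__(self, budget_usd: float, alert_at: float = 0.8):
        self.budget_usd = budget_usd
        self.alert_at = alert_at  # warn once spend crosses this fraction of budget
        self.per_agent = defaultdict(float)

    def record_trace(self, agent_id: str, cost_usd: float) -> list[str]:
        """Record one trace's cost; return any alerts it triggered."""
        self.per_agent[agent_id] += cost_usd
        total = sum(self.per_agent.values())
        alerts = []
        if total >= self.budget_usd:
            alerts.append(f"BUDGET EXCEEDED: ${total:.2f} / ${self.budget_usd:.2f}")
        elif total >= self.alert_at * self.budget_usd:
            alerts.append(f"budget threshold: ${total:.2f} / ${self.budget_usd:.2f}")
        return alerts
```

As the reply notes, this only tells you what happened and what it cost; whether the output was worth it is the separate eval layer.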
Joseph Frantz
Joseph Frantz@josephlfrantz·
@sergeykarayev If you have hundreds of agents running, we open-sourced our agent observability platform for understanding clock time (chronometry) and tokenomics on a per-agent basis, all linked to projects and initiatives. 100% free. MIT License. Clone and run locally! github.com/timesentry/age…
English
1
0
0
40
Sergey Karayev
Sergey Karayev@sergeykarayev·
Running agents locally is a dead end. The future of software development is hundreds of agents running at all times of the day — in response to bug alerts, emails, Slack messages, meetings, and because they were launched by other agents. The only sane way to support this is with cloud containers. Local agents hit a wall quickly: • No scale. You can only run as many agents (and copies of your app) as your hardware allows. • No isolation. Local agents share your filesystem, network, and credentials. One rogue agent can affect everything else. • No team visibility. Teammates can't see what your agents are doing, review their work, or interact with them. • No always-on capability. Agents can't respond to signals (alerts, messages, other agents) when your machine is off or asleep. Cloud agents solve all of these problems. Each agent runs in its own isolated container with its own environment, and they can run 24/7 without depending on any single machine. This year, every software company will have to make the transition from work happening on developers' local machines from 9am-6pm to work happening in the cloud 24/7, or get left behind by companies that do.
English
94
21
309
29.1K
Iris
Iris@iris_eval·
This is the power of MCP — the agent doesn't just execute, it discovers what's available and composes on its own. The next layer is scoring those outputs. Did the flow it built actually do what the user described? That's where inline eval becomes critical as these get more autonomous.
English
0
0
0
5
Activepieces
Activepieces@activepieces·
You can now build @activepieces flows with Claude or other MCP clients 🔥 No need to think through the full workflow logic step by step. Just connect to the Activepieces MCP server, describe the flow you want, and your AI agent can build and run it for you inside Activepieces. You can find the MCP server in your project settings. Check it out now and let us know what you think.
English
2
1
2
359
Iris
Iris@iris_eval·
Your agent just leaked a credit card number in paragraph 2. Would you have caught it? Try it yourself: iris-eval.com/playground
English
0
0
1
17
Iris
Iris@iris_eval·
The scariest part: the agent can repeat the injection in its own output and not even flag it. We've seen agents summarize a document, encounter "ignore previous instructions," and include it verbatim in the summary. The agent thought it was being helpful. Output-level eval catches this. If the response contains injection patterns, it fails — regardless of what the agent intended.
English
0
0
0
4
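The output-level check described here can be as simple as matching known injection phrases against the agent's final response: if the response repeats one, it fails, regardless of intent. A minimal Python sketch; the pattern list is illustrative, not Iris's actual rule set.

```python
import re

# Common prompt-injection phrases. An illustrative list, not Iris's 12 rules.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (the )?system prompt",
    r"you are now (in )?developer mode",
]

def output_contains_injection(text: str) -> bool:
    """Fail the output if it echoes an injection phrase, whatever the agent intended."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

This catches the "helpful summarizer" failure mode above: the agent that quotes "ignore previous instructions" verbatim in its summary fails the check even though it never acted on the instruction.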
Don Ho, Esq.
Don Ho, Esq.@dhoesq·
Most companies running AI agents right now have zero isolation between the agent and production systems. One prompt injection. That's all it takes. Your agent has database access, API keys, maybe even admin credentials. The attack surface isn't theoretical.
English
2
0
0
10
Don Ho, Esq.
Don Ho, Esq.@dhoesq·
NVIDIA just spent $20B to acquire Groq and announced enterprise AI agent sandboxing at GTC. Not because agents are cool. Because agents in production are a security disaster nobody's talking about.
English
2
0
1
55