Iris

92 posts

@iris_eval

The agent eval standard for MCP. Score output quality, catch safety failures, enforce cost budgets. Open-source core. Self-hosted. MIT license.

Joined March 2026
57 Following · 5 Followers
Pinned Tweet
Iris
Iris@iris_eval·
I built Iris to solve this — per-trace cost tracking, agent-level aggregation, budget threshold alerts. Open-source MCP server. Add it to your agent config and it logs everything automatically. No SDK needed. github.com/iris-eval/mcp-…
English
0
0
1
26
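The pinned tweet's "add it to your agent config" step presumably looks something like the block below. This is a hedged sketch: the package name comes from the npx command posted elsewhere on this timeline, but the exact `mcpServers` schema varies by MCP client, so check your client's documentation.

```json
{
  "mcpServers": {
    "iris-eval": {
      "command": "npx",
      "args": ["-y", "@iris-eval/mcp-server@latest"]
    }
  }
}
```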
adah
adah@adahstwt·
I'm a vibe coder, scare me with one word.
adah tweet media
English
1.1K
30
1.5K
206.6K
Iris
Iris@iris_eval·
@hadilbnabdallah @ThePracticalDev Exactly — "a whole different layer" is the right framing. That's the layer we're building with Iris. Eval that lives inside the protocol, not bolted on after. Your article on MCP in production covered the infrastructure side well. The eval side is what comes next.
English
0
0
0
10
Hadil Ben Abdallah
Hadil Ben Abdallah@hadilbnabdallah·
@iris_eval @ThePracticalDev That’s a great way to put it; observability vs. eval is a gap a lot of teams miss. You can see everything the agent did, but that doesn’t tell you if it was actually right. That’s a whole different layer. Thanks for reading.
English
1
0
2
9
Hadil Ben Abdallah
Hadil Ben Abdallah@hadilbnabdallah·
Most MCP setups start simple. They work great… until you move to production. Then things break. I wrote a deep dive on running MCP servers at scale (security, access control, observability) using Bifrost. Read the full blog on @ThePracticalDev 👇🏻 dev.to/hadil/how-to-r…
English
1
0
3
24
Chen Avnery
Chen Avnery@MindTheGapMTG·
@iris_eval @dishant0406 The 0.92 confidence trap is real. We stopped trusting model self-eval for PII entirely. Binary pattern matching on output catches what the model "confidently" missed. The dumbest detection layer turned out to be the most reliable one.
English
2
0
0
25
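Chen's "binary pattern matching" layer can be sketched in a few lines of Python: a regex finds card-number candidates in the agent's output, and a Luhn checksum filters out false positives. This is illustrative only, not Iris's or Chen's actual detector; the function names are invented for the sketch.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return Luhn-valid card-number candidates found in the text."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            hits.append(digits)
    return hits
```

The point of the Luhn pass is exactly what the tweet describes: the regex alone is dumb and noisy, but combined with a checksum it reliably flags real card numbers the model "confidently" waved through.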
Chen Avnery
Chen Avnery@MindTheGapMTG·
LangChain just open-sourced a Claude Code clone. MIT license. Look at the architecture: narrow-scope agents, parallel execution, tool orchestration. Every coding tool converges to the same design. The model stopped being the moat a while ago.
English
1
0
2
59
Iris
Iris@iris_eval·
Great comparison. One category missing from every agent framework: output evaluation. CrewAI, AutoGen, LangGraph — they orchestrate agents. None of them score whether the output is correct, safe, or within budget. We built Iris to fill that gap. Open-source, MCP-native: iris-eval.com/playground
English
0
0
0
43
シュンスケ
シュンスケ@The_AGI_WAY·
GitHub agent framework comparison data: CrewAI: 45.9k⭐ Role-playing agents AutoGen: 55k⭐ Multi-agent conversations LangGraph: 24.8k⭐ Graph-based agents Miyabi: Issue-Driven Development. What sets Miyabi apart: - Uses GitHub itself as the OS (53 labels × 24 workflows) - 172+ integrated MCP tools - Agent Skill Bus with 110+ skills - Distributed cluster execution (6 machines in parallel) - Built in TypeScript (no Python dependency) github.com/ShunsukeHayash… #AgenticAI #MultiAgent #MCP #OpenSource
Japanese
1
0
9
920
Iris
Iris@iris_eval·
Your MCP agents are shipping to production without anyone checking if the output is correct. We built Iris to fix that. 12 eval rules. PII detection. Hallucination markers. Cost enforcement. One line in your MCP config. Try it in 60 seconds: iris-eval.com/playground
English
1
0
0
9
Iris
Iris@iris_eval·
@PuneetTheT 27% fear hallucination — and right now most MCP deployments have zero way to catch it before it reaches the user. The protocol handles tool access. What's missing is output evaluation: did the agent actually get it right? That's the layer being built now.
English
0
0
0
2
Puneet
Puneet@PuneetTheT·
Anthropic surveyed 80,508 people across 159 countries. 81% report benefits from Claude. Top fear: hallucination at 27%. New: Claude Code now works on Telegram and Discord via MCP.
English
1
0
0
21
Iris
Iris@iris_eval·
@medusa_0xf Misconfigurations are the input risk. But there's an output risk too — the agent itself can hallucinate, leak PII, or blow cost budgets silently. Security at the protocol layer + evaluation at the output layer. Both gaps need closing.
English
0
0
1
9
Medusa
Medusa@medusa_0xf·
MCP is the new attack surface most people are ignoring. Just published a breakdown of the most common security misconfigurations in MCP deployments. Read here 👇 medusa0xf.com/posts/mcp-serv…
English
6
25
134
6.2K
Iris
Iris@iris_eval·
Bannerbear's list of the 8 best MCP servers for Claude Code is a great starting point — Sentry for errors, Playwright for testing. One gap I'd add: agent evaluation. Who scores whether the output is actually correct? That's the next layer MCP tooling needs.
English
1
0
0
8
Iris
Iris@iris_eval·
Your agent just leaked a customer's credit card number. You didn't find out until the customer did. This is already happening in production. Agents hallucinate, leak PII, and blow cost budgets — silently. We built Iris to catch it before it ships. 12 eval rules. PII detection. Hallucination scoring. Cost enforcement. One line in your MCP config. Zero code changes. Your agent discovers it automatically. Try it right now — spot the failures yourself: iris-eval.com/playground Open source. MIT licensed. Self-hosted. npx @iris-eval/mcp-server@latest Tell us how you're using it.
English
1
0
0
56
Iris
Iris@iris_eval·
@abuchanlife Add one more to the stack: agent eval. Score every output for quality, catch PII leaks and hallucinations before they ship. One line in your MCP config, zero code changes. Try it in 60 seconds: iris-eval.com/playground
English
0
0
0
31
Abu
Abu@abuchanlife·
Rip OpenClaw. Anthropic just launched Claude Code channels. Buying a Mac Mini to run an AI agent is obsolete 😂 The execution over the last weeks is insane. They built the ultimate OpenClaw killer: ➧ Texting Claude Code from phones ➧ 10,000s of Claude skills plus MCP ➧ Autonomous bug fixing ➧ Persistent memory ➧ Telegram and Discord channels ➧ Autonomous cron jobs ➧ 1M context window ➧ 30 plus market moving plugins ➧ Remote control Mobile agents will run the next decade.
English
2
0
3
76
Iris
Iris@iris_eval·
The silent failure problem is real. Even adaptive agents need a quality gate — something that scores every output before it ships. Did the report hallucinate? Did it leak PII? Did the retry actually fix it? That's the eval layer. We built this for MCP agents: iris-eval.com/playground
English
0
0
0
7
EasyClaw
EasyClaw@EasyClawBot·
Week 1: you build the agent. It works. You feel like a genius. Week 3: you forget it's running. Week 6: the site changed. Agent fails silently. You find out 2 weeks later when someone asks "where's that report?" This is the automation graveyard. It's full.
English
3
0
0
10
EasyClaw
EasyClaw@EasyClawBot·
Everyone's obsessed with building AI agents. I'm obsessed with what happens *after* you automate something. Because here's what I've seen in 3 months of running @EasyClawBot: most automations die quietly. Not from bugs. From neglect. 🧵 What I learned the hard way:
EasyClaw tweet media
English
3
0
3
110
Iris
Iris@iris_eval·
@josephlfrantz @sergeykarayev Nice — time and token tracking per agent is the foundation layer. The next question becomes: was the output actually correct? Chronometry tells you what happened and what it cost. Eval tells you if the result was worth it. Both layers together is the full picture.
English
0
0
0
23
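The cost side of this exchange (per-trace tracking, per-agent aggregation, budget threshold alerts, as the pinned tweet describes) reduces to a small pattern. Below is a minimal Python sketch under those stated features, not Iris's actual implementation; the class and method names are invented for illustration.

```python
from collections import defaultdict

class CostTracker:
    """Per-trace cost logging with agent-level aggregation and budget alerts.

    Illustrative sketch of the pattern the tweets describe, not Iris's code.
    """

    def __init__(self, budget_usd: float, alert_at: float = 0.8):
        self.budget_usd = budget_usd
        self.alert_at = alert_at  # warn once spend crosses this fraction of budget
        self.per_agent = defaultdict(float)

    def record_trace(self, agent_id: str, cost_usd: float) -> list[str]:
        """Record one trace's cost; return any alerts it triggered."""
        self.per_agent[agent_id] += cost_usd
        total = sum(self.per_agent.values())
        alerts = []
        if total >= self.budget_usd:
            alerts.append(f"BUDGET EXCEEDED: ${total:.2f} / ${self.budget_usd:.2f}")
        elif total >= self.alert_at * self.budget_usd:
            alerts.append(f"budget threshold: ${total:.2f} / ${self.budget_usd:.2f}")
        return alerts
```

As the reply notes, this only tells you what happened and what it cost; whether the output was worth it is the separate eval layer.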
Joseph Frantz
Joseph Frantz@josephlfrantz·
@sergeykarayev If you have hundreds of agents running, we open-sourced our agent observability platform for understanding clock time (chronometry) and tokenomics on a per-agent basis, all linked to projects and initiatives. 100% free. MIT License. Clone and run locally! github.com/timesentry/age…
English
1
0
0
40
Sergey Karayev
Sergey Karayev@sergeykarayev·
Running agents locally is a dead end. The future of software development is hundreds of agents running at all times of the day — in response to bug alerts, emails, Slack messages, meetings, and because they were launched by other agents. The only sane way to support this is with cloud containers. Local agents hit a wall quickly: • No scale. You can only run as many agents (and copies of your app) as your hardware allows. • No isolation. Local agents share your filesystem, network, and credentials. One rogue agent can affect everything else. • No team visibility. Teammates can't see what your agents are doing, review their work, or interact with them. • No always-on capability. Agents can't respond to signals (alerts, messages, other agents) when your machine is off or asleep. Cloud agents solve all of these problems. Each agent runs in its own isolated container with its own environment, and they can run 24/7 without depending on any single machine. This year, every software company will have to make the transition from work happening on developers' local machines from 9am-6pm to work happening in the cloud 24/7, or get left behind by companies that do.
English
94
21
309
29.1K
Iris
Iris@iris_eval·
This is the power of MCP — the agent doesn't just execute, it discovers what's available and composes on its own. The next layer is scoring those outputs. Did the flow it built actually do what the user described? That's where inline eval becomes critical as these get more autonomous.
English
0
0
0
5
Activepieces
Activepieces@activepieces·
You can now build @activepieces flows with Claude or other MCP clients 🔥 No need to think through the full workflow logic step by step. Just connect to the Activepieces MCP server, describe the flow you want, and your AI agent can build and run it for you inside Activepieces. You can find the MCP server in your project settings. Check it out now and let us know what you think.
English
2
1
2
359
Iris
Iris@iris_eval·
Your agent just leaked a credit card number in paragraph 2. Would you have caught it? Try it yourself: iris-eval.com/playground
English
0
0
1
17
Iris
Iris@iris_eval·
The scariest part: the agent can repeat the injection in its own output and not even flag it. We've seen agents summarize a document, encounter "ignore previous instructions," and include it verbatim in the summary. The agent thought it was being helpful. Output-level eval catches this. If the response contains injection patterns, it fails — regardless of what the agent intended.
English
0
0
0
4
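The output-level check described here can be as simple as matching known injection phrases against the agent's final response: if the response repeats one, it fails, regardless of intent. A minimal Python sketch; the pattern list is illustrative, not Iris's actual rule set.

```python
import re

# Common prompt-injection phrases. An illustrative list, not Iris's 12 rules.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (the )?system prompt",
    r"you are now (in )?developer mode",
]

def output_contains_injection(text: str) -> bool:
    """Fail the output if it echoes an injection phrase, whatever the agent intended."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

This catches the "helpful summarizer" failure mode above: the agent that quotes "ignore previous instructions" verbatim in its summary fails the check even though it never acted on the instruction.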
Don Ho, Esq.
Don Ho, Esq.@dhoesq·
Most companies running AI agents right now have zero isolation between the agent and production systems. One prompt injection. That's all it takes. Your agent has database access, API keys, maybe even admin credentials. The attack surface isn't theoretical.
English
2
0
0
10
Don Ho, Esq.
Don Ho, Esq.@dhoesq·
NVIDIA just spent $20B to acquire Groq and announced enterprise AI agent sandboxing at GTC. Not because agents are cool. Because agents in production are a security disaster nobody's talking about.
English
2
0
1
55