Arize AI

1.4K posts


@arizeai

The AI engineering platform for teams shipping reliable AI agents and LLM applications. Also home to @ArizePhoenix.

San Francisco, CA · Joined January 2020
126 Following · 4.4K Followers
Arize AI@arizeai·
Twitter said MCP was great six months ago, then it said skills killed MCP. We ran 500 trials to see who was right.

One model (Claude Opus 4.6), 25 GitHub tasks across four difficulty tiers, four arms: GitHub's official MCP server, two community gh skills (one verbose, one opinionated), and bare Claude with shell access only.

Findings: correctness barely moved, 0.826 to 0.845 across all four arms. Everyone gets there eventually. But on the hardest analysis tasks, MCP cost 6x more than skills and took 5x longer. The GitHub MCP is a thin REST wrapper, and when a task doesn't map to one endpoint, it fans out into a dozen calls that each return verbose JSON. The cost compounds.

Tool fidelity for MCP on tier 4 collapsed to 0.33. The agent was told to use only MCP tools. It ignored that and shelled out to bash to parse the JSON it had written to disk, because the API surface couldn't compose.

A short, opinionated skill (341 lines) beat a long, encyclopedic one (2,187 lines). And the punchline we didn't expect: bare Claude with no skill and no MCP scored slightly higher on correctness than either skill. For a famous CLI like gh, the training data is doing most of the work.

But MCP isn't dead. OAuth, enterprise access control, remote or proprietary tools the model has never seen, consumer-facing agents where the user can't be expected to install a CLI and paste an API token: that's MCP territory, and CLIs can't compete there.

The right answer isn't MCP vs CLI. It's MCP plus CLI. Claude Code uses both. Yours probably should too.

Full writeup, all the data, and the open-source eval harness: arize.com/blog/mcp-vs-cl…
Arize AI@arizeai·
At Google Cloud NEXT, our CEO @jason_lopatecki spoke with Google product leader Rami Shalom about why shared standards are critical when developing agents. Get the recap here (you may just learn a few things): arize.com/blog/agent-tel…
Arize AI@arizeai·
The practical value of standardized agent telemetry:
• Instrument once.
• Route traces anywhere.
• Debug step by step.
• Run evals on production behavior.
• Improve prompts, retrieval, and tools from real trajectories.

That's the agent feedback loop.
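The "instrument once, route anywhere" loop above can be sketched in a few stdlib-only lines. This is an illustrative toy, not the Arize or OpenInference API — every class and name here is invented:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent trajectory: a tool call, model call, retrieval, etc."""
    name: str
    attributes: dict
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

class Tracer:
    """Instrument once; route every span to any number of exporters."""
    def __init__(self, exporters):
        self.exporters = list(exporters)

    def span(self, name, **attributes):
        s = Span(name, attributes)
        for export in self.exporters:  # route traces anywhere
            export(s)
        return s

# Two "backends" stand in for a console exporter and a trace store.
console, store = [], []
tracer = Tracer([console.append, store.append])
tracer.span("retrieval", query="refund policy", hits=3)
tracer.span("llm_call", model="some-model", tokens=812)
```

Evals and prompt or retrieval improvements then run off the stored spans, which is what closes the feedback loop.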
Arize AI@arizeai·
How do you understand what an agent actually did? The answer starts with portable traces. A production agent might rewrite a request, retrieve context, call tools, invoke models, hand work to another agent, and return one simple-looking answer. The answer is the output, but the trace is the decision path.
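One way to picture "the trace is the decision path": a trace is a tree of steps, and flattening it recovers the ordered decisions behind one simple-looking answer. A minimal sketch with invented step names:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str    # e.g. "rewrite", "retrieval", "tool", "handoff", "llm"
    detail: str
    children: list = field(default_factory=list)

def decision_path(step, depth=0):
    """Flatten a trace tree into the ordered decision path behind one answer."""
    lines = [f"{'  ' * depth}{step.kind}: {step.detail}"]
    for child in step.children:
        lines.extend(decision_path(child, depth + 1))
    return lines

# A toy production trace: one request fans out into several decisions.
trace = Step("request", "user asks about a refund", children=[
    Step("rewrite", "expand query with account context"),
    Step("retrieval", "3 policy chunks from the vector store"),
    Step("tool", "orders.lookup(order_id=123)"),
    Step("handoff", "billing subagent", children=[
        Step("llm", "draft the refund explanation"),
    ]),
])
```

The final answer is one string; `decision_path(trace)` is the six-step record of how the agent got there.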
Arize AI@arizeai·
The agent harness you wrote last year was implicitly tuned for a model that doesn't quite exist anymore. Models shift while we're not looking. Relying on vibes means customers find out before you do. @rachelnabors shares the data and a forkable repo to test your own loop: x.com/rachelnabors/s…
Arize AI@arizeai·
One AI Question with @jimbobbennett: What's your 🌶️ take on AI?

Our DevEx Engineer's take: Start with the mindset that AI sucks, so you're forced to build the evals and observability to make it great. Don't trust it. Test it. #AI #Programming #SoftwareDevelopment
Arize AI@arizeai·
If a prompt change can alter tool use, routing, or output without touching your code, it isn’t just text. It’s runtime behavior. That’s when prompts need their own lifecycle: versioning, rollout, rollback, and observability. This is the decision gate we use:
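A minimal sketch of what a prompt lifecycle (versioning, rollout, rollback) could look like. This in-memory registry is invented for illustration, not Arize's implementation:

```python
class PromptRegistry:
    """Versioned prompts with rollout and rollback, independent of code deploys."""

    def __init__(self):
        self.versions = {}  # name -> list of prompt texts (index 0 is v1)
        self.live = {}      # name -> index of the version currently serving

    def publish(self, name, text):
        self.versions.setdefault(name, []).append(text)
        return len(self.versions[name])  # returns the new version number

    def rollout(self, name, version):
        self.live[name] = version - 1

    def rollback(self, name):
        self.live[name] = max(0, self.live[name] - 1)

    def get(self, name):
        return self.versions[name][self.live[name]]

reg = PromptRegistry()
reg.rollout("router", reg.publish("router", "v1: route by keyword"))
reg.rollout("router", reg.publish("router", "v2: route by intent"))
reg.rollback("router")  # v2 changes tool use in prod, so revert without a deploy
```

Observability hangs off `get`: log which version served each request, and regressions become attributable to a specific prompt change.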
Arize AI@arizeai·
Agents today are running longer sessions, making more decisions, and touching more systems. That makes knowing whether they're doing the right thing critical. Thanks again @furrier and team for having us. ✨
Arize AI@arizeai·
We ran 500 evals to test the "MCP is dead, long live the CLI" claim and presented the results at AI Engineer: Miami. The answer is more interesting than a Twitter fight!

Correctness was tied (~82%). But on the hardest analytical tasks, MCP cost 6× more and ran 5× longer than CLI-via-skills. Sometimes MCP was able to one-shot things and beat the CLI, but more often the MCP agent needed to use the CLI itself to complete a task.

Plot twist: a test with no skills and no MCP actually did better than MCP and some skills.

The real conclusion: MCP vs. CLI is the wrong question. CLI for local, popular, composable, dev-only tools. MCP for remote, OAuth, proprietary, consumer-facing ones. Real agents use both.

Check out the full talk here: youtu.be/CfITzVcUkZA
Arize AI@arizeai·
Agent traces aren't telemetry. They aren't debugging exhaust. They're the first compounding data loop enterprise software has ever had — and you should make sure you own them. Read the full blog post: arize.com/blog/using-con…
Arize AI@arizeai·
The TLDR from @aparnadhinak? Bigger context windows help. But reliable agents need a harness that decides what stays close, what gets compressed, what gets evicted, and what can be retrieved later. Read more: arize.com/blog/context-m…
Arize AI@arizeai·
Across Pi, OpenClaw, Claude Code, Letta, and Arize's Alyx, the same techniques keep showing up:
• Cap large file reads
• Use offset and limit pagination
• Budget tool results
• Compact older history into summaries
• Isolate subagents from parent sessions
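The techniques in that list are easy to prototype. A hedged stdlib sketch follows; the helper names and default budgets are invented, not taken from any of those harnesses:

```python
def read_capped(text, offset=0, limit=2000):
    """Cap large file reads; paginate with offset/limit instead of dumping it all."""
    chunk = text[offset:offset + limit]
    if offset + limit < len(text):
        chunk += f"\n[truncated: continue with offset={offset + limit}]"
    return chunk

def budget_tool_result(result, budget=500):
    """Keep a tool result inside a character budget, noting what was dropped."""
    if len(result) <= budget:
        return result
    return result[:budget] + f"... [{len(result) - budget} chars elided]"

def compact_history(turns, keep_last=4):
    """Compact older turns into one summary line; keep recent turns verbatim."""
    if len(turns) <= keep_last:
        return list(turns)
    summary = f"[summary of {len(turns) - keep_last} earlier turns]"
    return [summary] + list(turns[-keep_last:])
```

Subagent isolation is the same idea at a coarser grain: the child runs with its own context, and only a budgeted result crosses back into the parent session.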
Arize AI@arizeai·
Long-running agents don't just need bigger context windows. They need better context management. But context always fills up with more than the task: file reads, tool outputs, stale turns, subagent responses, memory summaries, and repeated previews.
Arize AI@arizeai·
GPT 5.5 and 5.5 Pro are now live in the @OpenAI API and available in the Arize AX prompt playground! Find out how frontier intelligence improves your agents in seconds!