Arize AI

1.4K posts


@arizeai

The AI engineering platform for teams shipping reliable AI agents and LLM applications. Also home to @ArizePhoenix.

San Francisco, CA · Joined January 2020
126 Following · 4.4K Followers
Arize AI@arizeai·
Twitter said MCP was great six months ago, then it said skills killed MCP. We ran 500 trials to see who was right.

One model (Claude Opus 4.6), 25 GitHub tasks across four difficulty tiers, four arms: GitHub's official MCP server, two community gh skills (one verbose, one opinionated), and bare Claude with shell access only.

Findings: correctness barely moved, 0.826 to 0.845 across all four arms. Everyone gets there eventually. But on the hardest analysis tasks, MCP cost 6x more than skills and took 5x longer. The GitHub MCP is a thin REST wrapper, and when a task doesn't map to one endpoint, it fans out into a dozen calls that each return verbose JSON. The cost compounds.

Tool fidelity for MCP on tier 4 collapsed to 0.33. The agent was told to use only MCP tools. It ignored that and shelled out to bash to parse the JSON it had written to disk, because the API surface couldn't compose.

A short, opinionated skill (341 lines) beat a long, encyclopedic one (2,187 lines). And the punchline we didn't expect: bare Claude with no skill and no MCP scored slightly higher on correctness than either skill. For a famous CLI like gh, the training data is doing most of the work.

But MCP isn't dead. OAuth, enterprise access control, remote or proprietary tools the model has never seen, consumer-facing agents where the user can't be expected to install a CLI and paste an API token: that's MCP territory, and CLIs can't compete there.

The right answer isn't MCP vs CLI. It's MCP plus CLI. Claude Code uses both. Yours probably should too.

Full writeup, all the data, and the open-source eval harness: arize.com/blog/mcp-vs-cl…
Arize AI@arizeai·
At Google Cloud NEXT, our CEO @jason_lopatecki spoke with Google product leader Rami Shalom about why shared standards are critical when developing agents. Get the recap here (you may just learn a few things): arize.com/blog/agent-tel…
Arize AI@arizeai·
The practical value of standardized agent telemetry:
• Instrument once.
• Route traces anywhere.
• Debug step by step.
• Run evals on production behavior.
• Improve prompts, retrieval, and tools from real trajectories.

That's the agent feedback loop.
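The "instrument once, route anywhere" loop above can be sketched in a few stdlib-only lines. This is an illustrative toy, not the Arize or OpenInference API — every class and name here is invented:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent trajectory: a tool call, model call, retrieval, etc."""
    name: str
    attributes: dict
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

class Tracer:
    """Instrument once; route every span to any number of exporters."""
    def __init__(self, exporters):
        self.exporters = list(exporters)

    def span(self, name, **attributes):
        s = Span(name, attributes)
        for export in self.exporters:  # route traces anywhere
            export(s)
        return s

# Two "backends" stand in for a console exporter and a trace store.
console, store = [], []
tracer = Tracer([console.append, store.append])
tracer.span("retrieval", query="refund policy", hits=3)
tracer.span("llm_call", model="some-model", tokens=812)
```

Evals and prompt or retrieval improvements then run off the stored spans, which is what closes the feedback loop.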
Arize AI@arizeai·
How do you understand what an agent actually did? The answer starts with portable traces. A production agent might rewrite a request, retrieve context, call tools, invoke models, hand work to another agent, and return one simple-looking answer. The answer is the output, but the trace is the decision path.
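One way to picture "the trace is the decision path": a trace is a tree of steps, and flattening it recovers the ordered decisions behind one simple-looking answer. A minimal sketch with invented step names:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str    # e.g. "rewrite", "retrieval", "tool", "handoff", "llm"
    detail: str
    children: list = field(default_factory=list)

def decision_path(step, depth=0):
    """Flatten a trace tree into the ordered decision path behind one answer."""
    lines = [f"{'  ' * depth}{step.kind}: {step.detail}"]
    for child in step.children:
        lines.extend(decision_path(child, depth + 1))
    return lines

# A toy production trace: one request fans out into several decisions.
trace = Step("request", "user asks about a refund", children=[
    Step("rewrite", "expand query with account context"),
    Step("retrieval", "3 policy chunks from the vector store"),
    Step("tool", "orders.lookup(order_id=123)"),
    Step("handoff", "billing subagent", children=[
        Step("llm", "draft the refund explanation"),
    ]),
])
```

The final answer is one string; `decision_path(trace)` is the six-step record of how the agent got there.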
Arize AI@arizeai·
The agent harness you wrote last year was implicitly tuned for a model that doesn't quite exist anymore. Models shift while we're not looking. Relying on vibes means customers find out before you do. @rachelnabors shares the data and a forkable repo to test your own loop: x.com/rachelnabors/s…
Arize AI@arizeai·
One AI Question with @jimbobbennett: What's your 🌶️ take on AI?

Our DevEx Engineer's take: Start with the mindset that AI sucks, so you're forced to build the evals and observability to make it great. Don't trust it. Test it. #AI #Programming #SoftwareDevelopment
Arize AI@arizeai·
If a prompt change can alter tool use, routing, or output without touching your code, it isn’t just text. It’s runtime behavior. That’s when prompts need their own lifecycle: versioning, rollout, rollback, and observability. This is the decision gate we use:
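A minimal sketch of what a prompt lifecycle (versioning, rollout, rollback) could look like. This in-memory registry is invented for illustration, not Arize's implementation:

```python
class PromptRegistry:
    """Versioned prompts with rollout and rollback, independent of code deploys."""

    def __init__(self):
        self.versions = {}  # name -> list of prompt texts (index 0 is v1)
        self.live = {}      # name -> index of the version currently serving

    def publish(self, name, text):
        self.versions.setdefault(name, []).append(text)
        return len(self.versions[name])  # returns the new version number

    def rollout(self, name, version):
        self.live[name] = version - 1

    def rollback(self, name):
        self.live[name] = max(0, self.live[name] - 1)

    def get(self, name):
        return self.versions[name][self.live[name]]

reg = PromptRegistry()
reg.rollout("router", reg.publish("router", "v1: route by keyword"))
reg.rollout("router", reg.publish("router", "v2: route by intent"))
reg.rollback("router")  # v2 changes tool use in prod, so revert without a deploy
```

Observability hangs off `get`: log which version served each request, and regressions become attributable to a specific prompt change.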
Arize AI@arizeai·
Agents today are running longer sessions, making more decisions, and touching more systems. That makes knowing whether they're doing the right thing critical. Thanks again @furrier and team for having us. ✨
Arize AI@arizeai·
We ran 500 evals to test the "MCP is dead, long live the CLI" claim and presented the results at AI Engineer: Miami. The answer is more interesting than a Twitter fight!

Correctness was tied (~82%). But on the hardest analytical tasks, MCP cost 6× more and ran 5× longer than CLI-via-skills. Sometimes MCP was able to one-shot things and beat the CLI, but more often the MCP agent needed to use the CLI itself to complete a task.

Plot twist: a test with no skills and no MCP actually did better than MCP and some skills.

The real conclusion: MCP vs. CLI is the wrong question. CLI for local, popular, composable, dev-only tools. MCP for remote, OAuth, proprietary, consumer-facing ones. Real agents use both.

Check out the full talk here: youtu.be/CfITzVcUkZA
Arize AI@arizeai·
Agent traces aren't telemetry. They aren't debugging exhaust. They're the first compounding data loop enterprise software has ever had — and you should make sure you own them. Read the full blog post: arize.com/blog/using-con…
Arize AI@arizeai·
The TLDR from @aparnadhinak? Bigger context windows help. But reliable agents need a harness that decides what stays close, what gets compressed, what gets evicted, and what can be retrieved later. Read more: arize.com/blog/context-m…
Arize AI@arizeai·
Across Pi, OpenClaw, Claude Code, Letta, and Arize's Alyx, the same techniques keep showing up:
• Cap large file reads
• Use offset and limit pagination
• Budget tool results
• Compact older history into summaries
• Isolate subagents from parent sessions
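The techniques in that list are easy to prototype. A hedged stdlib sketch follows; the helper names and default budgets are invented, not taken from any of those harnesses:

```python
def read_capped(text, offset=0, limit=2000):
    """Cap large file reads; paginate with offset/limit instead of dumping it all."""
    chunk = text[offset:offset + limit]
    if offset + limit < len(text):
        chunk += f"\n[truncated: continue with offset={offset + limit}]"
    return chunk

def budget_tool_result(result, budget=500):
    """Keep a tool result inside a character budget, noting what was dropped."""
    if len(result) <= budget:
        return result
    return result[:budget] + f"... [{len(result) - budget} chars elided]"

def compact_history(turns, keep_last=4):
    """Compact older turns into one summary line; keep recent turns verbatim."""
    if len(turns) <= keep_last:
        return list(turns)
    summary = f"[summary of {len(turns) - keep_last} earlier turns]"
    return [summary] + list(turns[-keep_last:])
```

Subagent isolation is the same idea at a coarser grain: the child runs with its own context, and only a budgeted result crosses back into the parent session.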
Arize AI@arizeai·
Long-running agents don't just need bigger context windows. They need better context management. But context always fills up with more than the task: file reads, tool outputs, stale turns, subagent responses, memory summaries, and repeated previews.
Arize AI@arizeai·
GPT 5.5 and 5.5 Pro are now live in the @OpenAI API and available in the Arize AX prompt playground! Find out how frontier intelligence improves your agents in seconds!