Arize AI

1.7K posts

@arizeai

The AI engineering platform for teams shipping reliable AI agents and LLM applications. Also home to @ArizePhoenix.

San Francisco, CA · Joined January 2020

157 Following · 4.8K Followers

Arize AI@arizeai·1h

Your AI agent generated 40 PRs. Great. But how many were merged? How much rework did they create? And what did each successful PR actually cost? Tokens, prompts, and output volume measure motion. But measuring AI productivity requires connecting traces to outcomes. The missing layer is a join between the trace and a validated outcome: a PR that merged and held, a ticket that stayed resolved, or a customer task that completed. Learn more in our new guide on measuring AI productivity with traces, evals, correlation IDs, and cost per validated outcome: arize.com/blog/how-to-me…
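A minimal sketch of the trace-to-outcome join described above, assuming each trace and each PR outcome carry a shared correlation ID. The `Trace` and `Outcome` types, their field names, and the choice to count all spend (including failed attempts) in the numerator are illustrative assumptions, not taken from the guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    correlation_id: str  # hypothetical ID shared by the trace and the PR
    cost_usd: float      # model spend recorded on this trace


@dataclass
class Outcome:
    correlation_id: str
    merged: bool
    reverted: bool       # True when the PR merged but did not hold


def cost_per_validated_outcome(traces, outcomes) -> Optional[float]:
    """Total spend across all traces divided by the number of validated outcomes."""
    # An outcome counts as validated only if the PR merged and was not reverted.
    validated = {o.correlation_id for o in outcomes if o.merged and not o.reverted}
    if not validated:
        return None  # no validated outcomes, so the ratio is undefined
    # Assumption: rework and failed attempts still count as cost.
    total_cost = sum(t.cost_usd for t in traces)
    return total_cost / len(validated)


traces = [Trace("pr-1", 2.0), Trace("pr-2", 3.0), Trace("pr-3", 1.0), Trace("pr-1", 2.0)]
outcomes = [
    Outcome("pr-1", merged=True, reverted=False),
    Outcome("pr-2", merged=True, reverted=True),
    Outcome("pr-3", merged=False, reverted=False),
]
result = cost_per_validated_outcome(traces, outcomes)
```

Here three PRs were attempted but only one merged and held, so all of the spend is attributed to that single validated outcome.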

Arize AI@arizeai·5h

@usaiinstitute Worth adding @ArizeAI to that comparison. OpenTelemetry-native tracing, built-in evals that run on live traffic, and no vendor lock-in on the instrumentation layer. Different architecture than both. OSS option with @ArizePhoenix.

United States Artificial Intelligence Institute@usaiinstitute·3d

Building AI agents? Before you choose an LLM observability platform, compare: ✅ Langfuse ✅ LangSmith See which fits your stack, workflow, and deployment needs. tinyurl.com/3r4t47ek #AI #LLM #GenerativeAI #AIAgents #LangChain #LangSmith #USAII

Arize AI reposted

Mikyo@mikeldking·1d

Over the past 6 months we've maniacally prepped our repos to be coding-agent friendly. Here are some things that worked.
1. Make CI blazing fast. Use every Rust-, Zig-, or Go-ported tool that lets agents verify their work. This means uv, oxfmt, TypeScript 7. Move integration tests to post-merge hooks.
2. Trigger coding agents automatically based on triage labels. A coding agent should set up a proof of concept or repro steps automatically so an engineer can pick up the issue seamlessly.
3. Set up crons for things devs hate doing. Set up agents to fill SDK gaps, tune skills, and fill in critical regression checks.
4. Give agents ways to prove their work. Add screenshotting skills, agent-browser, and cloud storage for storing assets.
5. Make it possible to hermetically deploy your app, preferably multiple at a time. The more easily a coding agent can deploy the app locally, the faster it can work.
6. Give the agents realistic production "simulation" data. Agents work much better against data that looks like how your users use the product.

Arize AI@arizeai·1d

@seldo read the whole thing so you don't have to. Here's the abbreviated version 👇 arize.com/blog/how-do-yo…

Arize AI@arizeai·1d

For years the big labs went quiet on how they actually train frontier models. Trade secrets. Microsoft just broke the silence with a 109-page report on MAI-Thinking-1, a Sonnet 4.6-class model.

Arize AI@arizeai·3d

Before you call something a loop, you should name what iterates and what closes it. Learn more in our latest write-up from @aparnadhinak + @seldo: arize.com/blog/what-is-a…

Arize AI@arizeai·3d

Execution loops are the loop most people picture when they say "agent." But there's more to this space than just that.
- Execution: steps in one run
- Task: fresh runs against a spec
- Product: agents across repo/backlog
- System: improve prompts/evals/harnesses

Arize AI@arizeai·3d

There's a lot of talk about loops recently. But the term “loop” currently describes at least four different architectures: execution, task, product, and system (plus the human oversight loop governing them).

Arize AI@arizeai·4d

@calcsam from @mastra broke down why "we tested before launch" isn't enough at #ArizeObserve. arize.com/blog/3-product…

Arize AI@arizeai·4d

Your eval suite is incomplete right now. Guaranteed. You wrote it before a single real user touched the agent, so it can't cover the questions they'll actually ask.

Arize AI@arizeai·4d

@lordofblocks Thanks for the shout out @lordofblocks 🚀

David J.@lordofblocks·5d

I spent years building games. What grips me now is bigger: the plumbing behind AI. Every app you touch runs on models it doesn't own. Your request leaves the screen, bounces to one of a dozen providers, and returns before you've blinked. A whole economy grew up in that half-second, and almost none of it is the names you know. So I went down the rabbit hole for months and mapped 30 of the companies doing the real work:
- 𝗧𝗵𝗲 𝗹𝗮𝗯𝘀 (the models everything else routes to): @OpenAI @AnthropicAI @GoogleDeepMind @MistralAI @deepseek_ai
- 𝗚𝗮𝘁𝗲𝘄𝗮𝘆𝘀 & 𝗿𝗼𝘂𝘁𝗲𝗿𝘀 (one key in front of them all): @OpenRouter @PortkeyAI @LiteLLM @RequestyAI @notdiamond_ai
- 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 & 𝘀𝗽𝗲𝗲𝗱 (the teams serving tokens fastest): @GroqInc @FireworksAI_HQ @baseten @modal @DeepInfra
- 𝗢𝗯𝘀𝗲𝗿𝘃𝗮𝗯𝗶𝗹𝗶𝘁𝘆 (knowing what every call cost and did): @helicone_ai @langfuse @braintrust @arizeai @honeyhiveai
- 𝗔𝗴𝗲𝗻𝘁 𝗳𝗿𝗮𝗺𝗲𝘄𝗼𝗿𝗸𝘀 (wiring models into real products): @LangChain @llama_index @crewAIInc @mastra @DSPyOSS
- 𝗙𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴 & 𝗰𝘂𝘀𝘁𝗼𝗺 𝗺𝗼𝗱𝗲𝗹𝘀 (making a model your own): @predibase @OpenPipeAI @UnslothAI @LaminiAI @togethercompute
You'll know a handful of these. The rest are the quiet layer doing the actual work, and that is the whole point. Mapping it made one thing obvious. The models have become a commodity, and the real value moved to the layer sitting between you and all of them. The gateway row is where that fight is happening. When I went looking for the best of them, I kept landing on one called ModelStream, which folds every provider into a single key and a single bill, and takes crypto on top of that. I've been running everything through it since. Try it out for free here: modelstream.ai/?aff_coupon=QS… P.S. Half the fun is the ones I left off. Who did I miss?

Arize AI reposted

Aparna Dhinakaran@aparnadhinak·5d

Glam shot! 📸 @arizeai

Arize AI@arizeai·4d

GPT-5.6 support just went live in Arize AX. 🚀 Now available:
🌞 gpt-5.6-sol
🌍 gpt-5.6-terra
🌙 gpt-5.6-luna
Compare all three side-by-side in the Prompt Playground, plug them into LLM-as-a-judge evals, and watch them in production, all in one place. Try it 👇 app.arize.com

Arize AI@arizeai·5d

@ivanburazin from @daytonaio dishes on why you need to trace before you migrate at #ArizeObserve. Read the full write-up here: arize.com/blog/trace-bef…

Arize AI@arizeai·5d

Before you rip out Kubernetes for something faster, do one thing: trace what you already have. Half the time the bottleneck isn't your runtime. It's model latency, tool selection, or a retry loop hiding in plain sight.
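As a rough illustration of tracing before migrating, the sketch below totals span durations by kind to show where a request's time actually goes. The span dictionaries and their `kind` and `duration_ms` fields are hypothetical stand-ins for whatever your tracing backend exports, and the numbers are made up for the example.

```python
from collections import defaultdict


def time_by_kind(spans):
    """Sum span durations (in milliseconds) per span kind."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["kind"]] += span["duration_ms"]
    return dict(totals)


# Hypothetical spans from a single agent request.
spans = [
    {"kind": "llm", "duration_ms": 1800.0},
    {"kind": "tool", "duration_ms": 300.0},
    {"kind": "llm", "duration_ms": 1700.0},  # a retry of the first model call
    {"kind": "runtime", "duration_ms": 200.0},
]

totals = time_by_kind(spans)
slowest = max(totals, key=totals.get)  # the kind that dominates wall time
```

In this made-up request the model calls, including one retry, account for most of the time, so swapping the runtime would barely move end-to-end latency.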

Arize AI reposted

Mikyo@mikeldking·5d

I found @grinich's talk at Observe fascinating because he articulated so well what's fundamentally different about authentication and authorization in the age of agents. If you are interested in agent-first experiences, I can't think of a more dialed-in tech leader.

Arize AI@arizeai

An agent was told: “make the tests pass.” It deleted the tests. That story is funny on its face. But it's also the exact reason agent engineering is getting harder. In this Rise of the AI Engineer conversation, @WorkOS founder @grinich makes the case that the next layer of AI engineering is identity, permissions, evals, observability, and memory around agents. Full conversation below.

Arize AI@arizeai·6d

More of a reader? Here's a writeup of what @grinich at @WorkOS thinks is most critical as we move forward. arize.com/blog/ai-engine…

Arize AI@arizeai·6d

Arize AI reposted

Aparna Dhinakaran@aparnadhinak·Jul 7

x.com/i/article/2074…

Arize AI@arizeai·Jul 7

Pro tip: not every check should break the build. Hard invariants belong in CI. Quality signals like helpfulness, latency, and groundedness should be recorded, trended, and inspected with traces. That gives you a practical first eval without turning normal model variance into constant CI noise. If you’ve been putting off evals because the starting point felt too abstract, this is the guide for you.
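One way to picture that split, as a minimal sketch: hard invariants are plain assertions that fail the test, while quality signals are appended to a log for trending. The `check_response` helper, the response fields, and the in-memory `metrics_log` are illustrative assumptions; they are not the Phoenix API or the guide's own code.

```python
def check_response(response: dict, metrics_log: list) -> None:
    """Gate on hard invariants; record quality signals without gating."""
    answer = response.get("answer")
    # Hard invariant: deterministic, so a failure should break the build.
    assert isinstance(answer, str) and answer.strip(), "answer must be a non-empty string"
    # Quality signals: they vary run to run, so record them for trending
    # instead of asserting on a threshold.
    metrics_log.append({
        "latency_ms": response["latency_ms"],
        "groundedness": response["groundedness"],
    })


metrics_log = []
check_response(
    {"answer": "Paris", "latency_ms": 420, "groundedness": 0.9},
    metrics_log,
)
```

Inside a pytest, Vitest, or Jest test, only the assertion would affect the pass/fail result; the recorded signals would be shipped to whatever store you use to inspect trends.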

Arize AI@arizeai·Jul 7

Our answer? Write your first eval like a test. In a practical writeup, Arize's Head of Open Source @mikeldking walks through exactly how to run LLM evals directly inside pytest, Vitest, or Jest with Phoenix. Here's what he covers:
- how evals differ from ordinary tests
- what a single eval is made of
- when to hard-assert behavior in CI
- when to track quality as a trend instead
- how Phoenix maps test suites to datasets and runs to experiments
- a full working example in Python and TypeScript
arize.com/blog/evals-in-…

Arize AI@arizeai·Jul 7

Most teams hear the same advice: “add evals.” But when you’re staring at a real LLM app, that advice gets vague fast. Should your first eval be an integration test? A golden dataset? A CI gate? A dashboard metric? An LLM judge?
