Arthur

1.1K posts

@itsArthurAI

The AI Performance Company. Arthur helps teams discover, govern, and innovate AI systems that perform and scale reliably.

New York, USA · Joined January 2019

588 Following · 2.1K Followers

Pinned Tweet

Arthur @itsArthurAI · Jan 7

☁️ Arthur is now available in @googlecloud! Many of our customers are building on Google Cloud and leveraging the latest Gemini and agent frameworks, so we partnered with Google to make Arthur available directly within your GCP environment. This means data never leaves your GCP environment, procurement is seamless through the Marketplace, and deployment fits naturally into your existing workflows and stack.

With the explosion of agents, teams lose visibility into which agents are running and lack insight into failures. As enterprises race to adopt Agentic AI, a comprehensive agentic governance approach is crucial to preventing chaos, security nightmares, and business continuity issues. That’s why we launched Arthur’s Agent Discovery & Governance (ADG) Platform on Google Cloud.

With Arthur on Google Cloud, you can:
🔍 Automate Discovery: Instantly find and catalog agents company-wide
📈 Unify Monitoring: Monitor and govern internally-developed and third-party agentic solutions
🛡️ Centralize Policy Management: Enforce acceptable use and security policies for all agent interactions
🔄 Continuously Evaluate: Monitor performance aligned specifically to agent tasks

Read full announcement → arthur.ai/blog/arthur-la…

Arthur @itsArthurAI · Jul 6

Learn more: arthur.ai/blog/smaller-m…

Arthur @itsArthurAI · Jul 6

We're excited to announce our partnership with ScaleDown, bringing task-specific small models and rigorous evals together for teams building production agents.

Cost, speed, and quality usually feel like a pick-two problem. ScaleDown changes that math by building small, purpose-built language models for the workhorse steps inside an agent: compression, summarization, extraction, and classification. These high-volume operations quietly drive most of your cost and latency, and they almost never need a frontier model. ScaleDown's numbers: comparable or better accuracy, 10 to 15x cheaper, and 2 to 20x faster.

So why isn't every team already swapping? Because a model change is a lot easier to commit to when you can show it works. That's the gap Arthur closes. Our eval platform gives you an objective before-and-after on the metrics that matter, so a ScaleDown swap becomes evidence instead of a leap of faith.

With Arthur and ScaleDown, you can:
→ Compare across many models at once, not just one-versus-one
→ Evaluate real production workflows, not demo environments
→ Build evals anchored to real failure modes
→ Generate synthetic datasets, so you can measure from a single sample

ScaleDown makes your agent leaner and faster while Arthur makes the improvement measurable and trustworthy. Together we close the loop between shipping an efficient model and proving it holds up in production. Read more (link in comments).

Arthur @itsArthurAI · Jun 17

Learn more: arthur.ai/blog/guardrail…

Arthur @itsArthurAI · Jun 17

Ask a room of practitioners to define guardrails, evals, and policies, and the answers start to blur. The confusing part: the same underlying check can wear all three hats depending on how you use it.

Take prompt injection detection. Run it in the request path and it's a guardrail, blocking the attack in real time. Run it offline across yesterday's traffic and it's an eval, telling you how often you're being probed. What changes is the job you've assigned the check, not the check itself.

In our latest blog post, we break down the three modes and when to reach for each:
→ Policies set the standard - the rules and intent everyone signs up to
→ Guardrails hold the line - real-time enforcement that blocks bad inputs and outputs
→ Evals keep you honest - measurement that tells you whether behavior matches intent, and why it failed

Read it here (link in comments).

#AIAgents #LLMOps #Guardrails #AIReliability #ProductionAI
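The "same check, different job" point can be made concrete with a minimal sketch. Everything here is hypothetical and for illustration only: the keyword heuristic stands in for a real prompt injection detector, and the function names are not taken from any Arthur API.

```python
# Hypothetical phrases standing in for a real prompt-injection detector.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the system prompt",
)


def looks_like_injection(text: str) -> bool:
    """The shared check: True if the text contains a suspicious phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


class BlockedRequest(Exception):
    """Raised when the guardrail rejects a request."""


def guardrail(prompt: str) -> str:
    """Guardrail mode: run the check in the request path and block on a hit."""
    if looks_like_injection(prompt):
        raise BlockedRequest("possible prompt injection")
    return prompt


def eval_injection_rate(logged_prompts: list[str]) -> float:
    """Eval mode: run the same check offline and report how often it fires."""
    if not logged_prompts:
        return 0.0
    hits = sum(looks_like_injection(p) for p in logged_prompts)
    return hits / len(logged_prompts)
```

Both `guardrail` and `eval_injection_rate` call the same `looks_like_injection` function; only the surrounding job differs (block now versus measure later).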

Arthur retweeted

TrueFoundry @truefoundry · Jun 15

@truefoundry now integrates with @itsArthurAI. Teams building LLM applications and agents can now run Arthur's Engine validation directly on the TrueFoundry AI Gateway, catching prompt injection, toxicity, and policy violations on both prompts and completions before they reach users. Every request and response is checked inline, alongside the unified rate limiting, cost tracking, access controls, and observability TrueFoundry already provides across all AI providers. One gateway. Validated and governed. Thanks @itsArthurAI team for the collaboration!

Arthur @itsArthurAI · Jun 16

Read more arthur.ai/blog/ai-sre-de…

Arthur @itsArthurAI · Jun 16

We used Claude as an SRE this month at Arthur and it turned out to be the best and worst SRE 🚨 Our Head of FDE, Ian McGraw, wrote up an interesting case study about a PostgreSQL I/O spike where Claude helped him chase the problem down but pointed at the wrong root cause along the way. If you're curious about why you should never give your agents access to your production environments, this is worth a read. Link to the blog in the comments. #AIAgents #Postgres #SRE

Arthur @itsArthurAI · Jun 15

We all want to move fast when building AI apps, but guardrails shouldn't come at the cost of speed. Arthur AI now integrates with the @truefoundry AI Gateway.

By bringing the Arthur Engine to the Gateway as a custom guardrail, you validate AI inputs and outputs in real time across all your models, from one place, with one consistent policy:
→ Prompt injection: catching inputs crafted to hijack the model off its instructions
→ Toxicity: flagged on the way in and the way out, so harmful content never reaches your model or your users
→ Hallucination: validated against grounding context you supply, before a response is trusted as an answer

The payoff is in how it deploys. Instead of wiring Arthur into every app by hand, you attach it once at the Gateway and any model behind it is covered. Pin it to a model to protect every caller, or opt in per request with a single header.

Read the full integration guide in the TrueFoundry docs: truefoundry.com/docs/ai-gatewa…

#LLMSecurity #Guardrails #AIGovernance

Arthur @itsArthurAI · Jun 10

Uber burned through its entire 2026 AI coding budget in four months. Microsoft started canceling Claude Code licenses. Meta quietly took down its "tokenmaxxing leaderboard." Tokenmaxxing at companies has just met its first big bill.

A single agent can consume up to 1,000x more tokens than a one-shot query. Multiply that across thousands of agents and you get spend no one budgeted for.

The good news: tokenmaxxing is governable. With the right agent governance controls, you can keep the upside without the surprise invoice:
→ Track token spend per agent, team, and workflow in real time
→ Set budgets and rate limits before costs run away
→ Gain visibility into AI spend so you can optimize ROI

Read more in our blog: arthur.ai/blog/govern-ai…
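As a rough illustration of the first two controls (tracking spend per agent and enforcing a budget before costs run away), here is a minimal in-memory sketch. The `TokenBudget` class, its method names, and the agent names are hypothetical; a real deployment would persist the counters and likely track teams and workflows as well.

```python
from collections import defaultdict
from typing import Optional


class TokenBudget:
    """Tracks token spend per agent and rejects usage that would exceed a cap."""

    def __init__(self, limits: dict[str, int]):
        # Agent name -> maximum tokens allowed. Agents without an entry are uncapped.
        self.limits = limits
        self.spent: dict[str, int] = defaultdict(int)

    def record(self, agent: str, tokens: int) -> bool:
        """Record usage. Returns False, and records nothing, if the cap would be exceeded."""
        limit = self.limits.get(agent)
        if limit is not None and self.spent[agent] + tokens > limit:
            return False
        self.spent[agent] += tokens
        return True

    def remaining(self, agent: str) -> Optional[int]:
        """Tokens left for an agent, or None if the agent has no cap."""
        limit = self.limits.get(agent)
        return None if limit is None else limit - self.spent[agent]
```

Checking the budget before the model call, as `record` does here, is what turns visibility into an enforceable limit; logging spend after the fact only explains the invoice.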

Arthur @itsArthurAI · Jun 3

Prompts define agent behavior. But when prompts live inside application code, only engineers can change them, and every prompt update requires a full deploy cycle.

In Part 2 of our Best Practices for Building Agents series, we cover how externalizing prompts changes the way teams build and iterate on agents. With a proper prompt management layer:
→ External storage so prompts iterate independently of the agent codebase
→ Version control with environment tags to promote changes from staging to production safely
→ Regression testing against real production traces before any prompt ships
→ Rollback support so a bad change is a one-click fix

Our FDE team sees the same pattern repeatedly: teams that externalize prompts ship faster and break things less often.

Read Part 2 here: arthur.ai/blog/best-prac…
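The versioning, environment-tag, and rollback ideas above can be sketched with a small in-memory registry. This is an assumption-laden illustration, not Arthur's prompt management API: `PromptRegistry`, its methods, and the prompt name are all hypothetical, and a real store would live outside the agent process.

```python
class PromptRegistry:
    """In-memory stand-in for an external prompt store (hypothetical)."""

    def __init__(self) -> None:
        # Prompt name -> list of prompt texts; version N is at index N - 1.
        self.versions: dict[str, list[str]] = {}
        # (prompt name, environment) -> history of version numbers the tag pointed at.
        self.tags: dict[tuple[str, str], list[int]] = {}

    def publish(self, name: str, text: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.setdefault(name, []).append(text)
        return len(self.versions[name])

    def promote(self, name: str, env: str, version: int) -> None:
        """Point an environment tag (e.g. staging, production) at a version."""
        if not 1 <= version <= len(self.versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self.tags.setdefault((name, env), []).append(version)

    def rollback(self, name: str, env: str) -> int:
        """Move the tag back to the version it pointed at previously."""
        history = self.tags.get((name, env), [])
        if len(history) < 2:
            raise ValueError("nothing to roll back to")
        history.pop()
        return history[-1]

    def get(self, name: str, env: str) -> str:
        """Fetch the prompt text currently tagged for an environment."""
        version = self.tags[(name, env)][-1]
        return self.versions[name][version - 1]
```

Because the agent only ever calls `get(name, env)`, promoting or rolling back a prompt changes behavior without a code deploy.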

Arthur @itsArthurAI · Jun 1

Read more arthur.ai/blog/best-prac…

Arthur @itsArthurAI · Jun 1

Not all OpenTelemetry traces are created equal. ⚡

If you're instrumenting an AI agent, the semantic convention you choose shapes how much you'll actually be able to debug later. When we built the Arthur Engine, we evaluated both options and went with OpenInference over the OTEL-community GenAI conventions.

Here's why it matters for production agents:
→ Richer LLM span detail: full prompts, completions, token counts, cost, and model parameters
→ First-class retrieval and re-ranking spans, which RAG-heavy agents live and die by
→ Clear span typing: LLM, TOOL, AGENT, CHAIN, and RETRIEVER are all distinct, not lumped together
→ Explicit message, document, and tool-call types you can actually query against

The OTEL GenAI conventions are improving, but compare two traces side by side and the expressiveness gap is obvious (Images: OTEL GenAI semantic on left and OpenInference semantic on right). This is the kind of decision that feels minor on day one and compounds for months.

Part 1 of our Best Practices for Building Agents series goes deeper on what to trace, which frameworks ship with strong OTEL support out of the box, and how our FDE team approaches observability with enterprise customers. Read it here (link in comments) 👇
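To show why distinct span types matter for querying, here is a small sketch using plain Python objects instead of a tracing SDK. The span kinds are the ones named in the post; the `Span` class, the `token_count` attribute key, and the helper functions are hypothetical and are not the OpenInference or OpenTelemetry attribute names.

```python
from dataclasses import dataclass, field

# Span kinds named in the post above.
SPAN_KINDS = {"LLM", "TOOL", "AGENT", "CHAIN", "RETRIEVER"}


@dataclass
class Span:
    """Simplified stand-in for a trace span with an explicit kind."""
    name: str
    kind: str
    attributes: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.kind not in SPAN_KINDS:
            raise ValueError(f"unknown span kind: {self.kind}")


def spans_of_kind(trace: list[Span], kind: str) -> list[Span]:
    """Select only the spans of one kind, e.g. every retrieval step."""
    return [span for span in trace if span.kind == kind]


def total_llm_tokens(trace: list[Span]) -> int:
    """Sum a hypothetical token_count attribute across LLM spans only."""
    return sum(s.attributes.get("token_count", 0) for s in spans_of_kind(trace, "LLM"))
```

When every step carries an explicit kind, questions such as "how many tokens did the LLM calls in this trace use?" become a simple filter. With lumped-together spans, the same question requires guessing from span names.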

Arthur @itsArthurAI · May 20

Most teams building agents skip the foundations, and it shows up later as breakage, risk, and stalled rollouts.

Our Forward Deployed Engineering (FDE) team just wrapped a six-part series on what it actually takes to ship a reliable agent, and the order matters more than people think:
1️⃣ Observability & tracing: you can't manage what you can't see.
2️⃣ Prompt management: once you can see behavior, you need a safe way to change it.
3️⃣ Continuous evals: automated evals on live traffic, powered by the traces from step 1.
4️⃣ Experiments & supervised evals: validate changes against a fixed dataset before they ship.
5️⃣ Guardrails: intercept bad inputs and outputs in real time.
6️⃣ Discovery & governance: make the agent discoverable, auditable, and owned.

Read more 👇 arthur.ai/blog/checklist…

Arthur @itsArthurAI · May 12

We've deployed production agents across dozens of enterprises. The same six gaps show up every time.

Over the past few months, our Forward Deployed Engineering team published a six-part series distilling what it actually takes to get an AI agent production-ready. Here's your checklist:
✅ Observability & Tracing: Instrument every LLM call, tool invocation, and RAG retrieval. You can't fix what you can't see.
✅ Prompt Management: Store prompts externally, version them, and test changes before promoting. Hardcoded prompts break at scale.
✅ Continuous Evaluations: Run unsupervised evals on live traffic to catch failures before your users do.
✅ Experiments & Supervised Evals: Validate prompt, RAG, and agent changes against a fixed dataset before they ship.
✅ Guardrails: Intercept bad inputs before they reach the model and bad outputs before they reach the user.
✅ Discovery & Governance: Make your agent discoverable, auditable, and owned so it can clear enterprise review.

Full recap + links to all six parts 👇 arthur.ai/blog/checklist…

Arthur @itsArthurAI · May 5
