Devayush Rout

265 posts

@devayushrout

Building production AI systems. RAG, evals, agents, LLMOps, document AI. Sharing traces, failure modes, and shipping notes.

South Delhi, India · Joined June 2025

86 Following · 24 Followers

Devayush Rout@devayushrout·3h

People call agents unreliable, but a lot of agent failures are product failures. The task was too broad, tools had vague contracts, permissions were loose, and nobody defined what "stop" means.

Devayush Rout@devayushrout·4h

@twimlai This is the clearest reason RAG survives long context: citation and review requirements. In high-stakes domains, retrieval is not just compression; it is evidence control.

The TWIML AI Podcast@twimlai·17h

As context windows grow into the millions of tokens, many AI practitioners are questioning whether retrieval-augmented generation (RAG) is still necessary. If modern models can ingest entire libraries of documents, why bother with retrieval at all?

In this episode, Alex Bowcut, Head of Engineering at @get_sphere, explains why the answer depends on the application. Sphere uses AI to automate global tax compliance, an environment where getting the answer right isn’t enough. Every conclusion must be backed by the correct legal citation, and every decision must withstand expert review.

We explore how Sphere built TRAM (Tax Review and Assessment Model), a production AI system that combines retrieval, reasoning models, legal review workflows, reinforcement learning, and deterministic systems to help tax experts move nearly two orders of magnitude faster while maintaining accuracy.

Along the way, we discuss why RAG remains critical in high-stakes domains, how Sphere processes legal and regulatory documents from jurisdictions around the world, retrieval architectures, semantic chunking, dense versus sparse retrieval, expert feedback loops, and the challenges of building AI systems that people can actually trust.

🗒️ Full show notes: twimlai.com/go/769.

📖 CHAPTERS
00:00 - Introduction
01:24 - Sphere
07:04 - Challenges of legal data collection
08:58 - TRAM (Tax Review and Assessment Model)
16:08 - Pipeline
18:55 - Semantic chunking
21:21 - Dense vs. sparse retrieval
24:41 - Product taxonomies
27:55 - Is RAG dead?
29:56 - Citations
31:23 - Reinforcement fine-tuning
36:28 - Evals
37:47 - LLM-as-a-judge and retrieval-reranking loop
40:28 - Sphere’s north star
42:40 - Impact of context window
44:46 - Token costs
45:41 - Future directions

Devayush Rout@devayushrout·5h

@XunWallace This is the eval shape agent memory needs: not just hit-rate, but whether the recovered state explains the next action. I’d want traces to preserve objective changes and tool-result causality as first-class memory keys.

Rocky 🪨@XunWallace·3d

Agent memory evals need a harder mechanism: not just whether retrieval finds the note, but whether the runtime reconstructs causality across a trajectory. Source: arxiv.org/abs/2602.22769 AMA-Bench tests states/actions/observations/tool outputs instead of only chat history. That means production memory should store causal edges + objective state changes, or long tasks become impossible to audit and debug.

Devayush Rout@devayushrout·5h

@ishadesign Visual evals get hard because correctness is often layout- and intent-dependent, not just pixel similarity. I’d want a mix of rubric-based review, task-specific assertions, and screenshots tied back to user outcomes.

isha@ishadesign·5d

anyone who’s working on AI and AI infra + design - are you running up against visual evals? how are you navigating? feels like there’s so few of us doing this across the industry and we need to mind meld

Devayush Rout@devayushrout·5h

@AashiDutt The “demos don’t restart” line gets at the real reliability gap. Durable agent memory needs write policies and expiry rules too, otherwise it just turns repeated work into stale-context risk.

Aashi Dutt@AashiDutt·1d

x.com/i/article/2062…

Devayush Rout@devayushrout·5h

@HarryTandy The production-to-regression-test loop is the part I’d want in every agent stack. Traces explain what happened once; turning failures into durable eval cases is what stops the same bug from becoming a weekly ritual.

Harry Tandy@HarryTandy·1d

AI OBSERVABILITY IS BROKEN. YOUR AGENT HARNESS SHOULD REPAIR ITSELF

Most engineering teams install a dashboard, get a clean tree of model calls, and think their production debugging loop is automated.

It isn't. The dashboard tells you what broke, then leaves the actual fixing to your engineering time.

This guy broke down the open-source architecture closing this loop from trace to patch.

Here are the 11 infrastructure rules worth stealing:

1. Traces without automated root-cause analysis are just expensive log files. The platform must explain the causal chain, not just show the error.
2. True agent debugging requires code-level context. The tool must read your local source files to pinpoint the exact broken lines.
3. The output should be a code diff, not a text hint. The system generates the precise code modification and waits for your approval.
4. Manual regression testing fails at scale. Approved patches must instantly turn the original failing input into a permanent regression test.
5. Numerical evaluation metrics fail in production. Replace abstract floats with plain-English assertions that check explicit business logic.
6. Build test suites from live production failures, not synthetic data. Let real edge cases harden your evaluation layer automatically.
7. Prompt playgrounds solve the wrong problem. Validating an agent requires an execution sandbox that runs the entire graph end-to-end.
8. Sandboxes must live outside of git. This allows non-technical team members to test prompts and models without breaking code.
9. Instrument early via unified runtime decorators. Track every tool call and retrieval step against the active agent configuration.
10. Route fixes through versioned blueprints. Never deploy directly; transition verified sandbox changes safely to staging and production.
11. Tooling sprawl destroys context. Tracing, evaluations, sandboxing, and testing must live in one flywheel, not separate platforms.

What the architecture executes:
> Automated tracing and root-cause diagnostics
> Automated source code diff generation
> Plain-English evaluation assertions
> End-to-end graph sandbox execution
> Instant production-to-regression test pipelines

Observability that ends at the dashboard made sense when agents were simple chat bots.

Production pipelines require tools that run the repair loop for you.

Save this if you are building self-correcting agent infrastructure.

Akshay 🚀@akshay_pachaar

x.com/i/article/2063…
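The production-to-regression-test loop discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the architecture from the linked article: the trace dictionary shape (`trace_id`, `status`, `input`), the `RegressionCase` type, and both function names are hypothetical, and the plain-English assertion is assumed to be checked later by a reviewer or a judge model.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RegressionCase:
    """A durable eval case derived from one production failure (hypothetical shape)."""
    case_id: str
    input: str
    assertion: str        # plain-English expectation, checked by a reviewer or judge
    source_trace_id: str  # keeps the link back to the trace that motivated the case


def case_from_trace(trace: dict, assertion: str) -> RegressionCase:
    """Turn a failed trace into a regression case; refuse traces that did not fail."""
    if trace.get("status") != "failed":
        raise ValueError("only failed traces become regression cases")
    return RegressionCase(
        case_id=f"reg-{trace['trace_id']}",
        input=trace["input"],
        assertion=assertion,
        source_trace_id=trace["trace_id"],
    )


def to_jsonl_line(case: RegressionCase) -> str:
    """Serialize the case as one JSON line so it can be appended to an eval file."""
    return json.dumps(asdict(case))
```

Keeping `source_trace_id` on the case is a design choice: when the case fails again months later, the original trace is still one lookup away.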

Devayush Rout@devayushrout·5h

@vincentsunnchen @harvey @gabepereyra The fan-out result is a useful warning. More parallel tool use can look sophisticated while adding noise, so trajectory evals should measure whether each branch created usable evidence or just extra state to reconcile.

vincent sunn chen@vincentsunnchen·26 May

Trajectory-based error analysis points to levers for post-training and harness engineering! From the @harvey team:
- Verify-and-revise correlates with the biggest score jump (+1.5).
- "Fan-out" tool parallelism hurts (-0.5); potentially adds noise without direction.
- Grounding drafts against source evidence is +0.3, but only occurs in 19% of trajectories.

Excited for more behavior-level analysis over long-horizon agent evals - great example here from Legal Agent Benchmark (LAB)!

Gabe Pereyra@gabepereyra

x.com/i/article/2059…

Devayush Rout@devayushrout·5h

@MaxITfinds Per-key budgets are boring in the best way. Agent runs can spike cost through retries or tool loops, so spend controls need to live on the request path, not only in monthly reporting.

Max Turing@MaxITfinds·6h

Vercel AI Gateway added per-key budgets: dollar limits with daily, weekly, or monthly resets. Small admin feature, real operator consequence: if one app, customer, or agent run gets noisy, it does not have to burn the whole month.

Devayush Rout@devayushrout·5h

@kmeanskaran This is where AI engineering starts looking like systems engineering again. Model choice matters, but latency budgets, fallback paths, queues, and eval loops decide whether the product survives traffic.

Karan🧋@kmeanskaran·8h

Today, no one talks about RAG and multi-agent systems. No debate on the best model. Why?

Simply because AI engineering is no longer just about fine-tuning LLMs, building agents, and RAG. That’s now the bare minimum. The real challenge is MLOps and inference engineering.

Companies struggle to take their agents to production:
- Latency and throughput tradeoffs
- Handling 1,000 requests/second
- Fallback mechanisms
- Evaluation and monitoring
- Distributed systems, queues, etc.
- Prompt and semantic caching

These are the factors companies now prioritize. In my consulting experience, people want to run LLMs with low cost, better latency, and fault-tolerant systems. Building projects around this will make you valuable in today’s industry.

I’ll resume posting more on MLOps and inference engineering soon. Subscribe to my Substack: kmeanskaran.substack.com

Devayush Rout@devayushrout·5h

@imitation_alpha The scoring function is the product decision hiding inside the agent loop. Without it, the agent can keep moving while spending tokens on branches that are high-risk or impossible to verify.

Arthur Yau@imitation_alpha·7h

x.com/i/article/2064…

Devayush Rout@devayushrout·5h

@CobusGreylingZA This split is useful because each layer fails differently. Context failures look like missing evidence, harness failures look like unsafe execution, and loop failures look like bad stopping or handoff.

Cobus Greyling@CobusGreylingZA·7h

𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
The foundation... What the agent knows, remembers, and can access.
Typical artifacts: RAG pipelines, memory summaries, SKILL.md, selective tool exposure, compressed history, structured specs.

𝗛𝗮𝗿𝗻𝗲𝘀𝘀 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
The safety rails and scaffolding. What keeps the agent aligned, recoverable, and within bounds.
Typical artifacts: AGENTS.md, MCP connectors, hooks, worktrees, verifier sub-agents, evals, sandboxes, retry/escalation rules.

𝗟𝗼𝗼𝗽 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
The ongoing rhythm. How the system runs, monitors progress, hands off work, and stops cleanly.
Typical artefacts: Scheduled automations, goal stop conditions, progress files, triage inbox, sub-agent orchestration, cron/hooks.

Devayush Rout@devayushrout·5h

@chaliy The in-process angle is interesting because observability and budgeting can sit closer to the actual tool loop. I’d be curious how much of the runtime state becomes inspectable versus only summarized after the agent finishes.

Mykhailo Chalyi@chaliy·21h

I am investing in extracting the everruns runtime. The basic idea is that the goodies implemented and available in the hosted version of everruns could be used in-process. Whatever the use case (coding agent, personal agents, batch agentic processing), that would be an answer. Basically you can have:
- context management, like tool call optimizations, history handling, prompt caching
- dynamic agents.md, skills, MCP, custom tools, etc.
- tool search, infinite context and other fun
- tracing, observability, budgeting, token tracing
- soon, subagents from hosted everruns will land in the runtime as well
- builtins like fetch with post-processing, multiple options for search, external sandboxes, etc.
- and of course multi-model and multi-provider

And all of this is fast and in Rust. Read more at docs.everruns.com/features/runti…

Devayush Rout@devayushrout·5h

@norlava The halt conditions are the underrated part. Long-running agents need progress tests and budget ceilings as product requirements, not just runtime safeguards, otherwise reliability turns into “hope the loop stops.”

Norin@norlava·18h

x.com/i/article/2064…
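The "progress tests and budget ceilings" idea from the reply above can be sketched as an explicit halt policy around the agent loop. Everything here is an assumption for illustration: `HaltPolicy`, `run_loop`, the `(cost_usd, progress_score)` return shape of `step`, and the convention that a progress score of 1.0 means the task is done.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HaltPolicy:
    max_steps: int
    max_cost_usd: float
    max_stalled_steps: int  # consecutive steps allowed without measurable progress


def run_loop(step: Callable[[], tuple[float, float]], policy: HaltPolicy) -> str:
    """Run step() until a halt condition fires; return the reason the loop stopped.

    step() is assumed to return (cost_usd, progress_score), where a score of 1.0
    or more means the task is complete.
    """
    spent, best, stalled = 0.0, float("-inf"), 0
    for _ in range(policy.max_steps):
        cost, progress = step()
        spent += cost
        if progress >= 1.0:
            return "done"
        if spent >= policy.max_cost_usd:
            return "budget_exceeded"
        if progress > best:
            best, stalled = progress, 0  # new best score resets the stall counter
        else:
            stalled += 1
            if stalled >= policy.max_stalled_steps:
                return "no_progress"
    return "step_limit"
```

Returning a named stop reason, rather than just exiting, is what lets "why did the loop stop" show up in evals and dashboards as a product-level outcome.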

Devayush Rout@devayushrout·5h

@pauliusztin_ The Q/C/A framing is clean because it makes failure routing obvious. I’d especially keep answerability separate from context relevance, since a chunk can be topically close and still not contain enough evidence to answer.

Paul Iusztin@pauliusztin_·11 Apr

It's been 2 years since I wrote the LLM Engineers Handbook. Since then, I've seen many new RAG eval tools emerge, but there's a problem... Most of them overcomplicate everything with proprietary metric suites.

But every RAG system has only 3 variables:
Q → Question
C → Context
A → Answer

And if you look at how these interact… There are exactly 6 relationships you can evaluate:
1/ C | Q → Context Relevance: Is the retrieved context relevant to the question?
2/ A | C → Faithfulness: Does the answer stick to the context?
3/ A | Q → Answer Relevance: Does the answer solve the user’s question?
4/ C | A → Context Support: Does the context fully support the answer?
5/ Q | C → Question Answerability: Can this question even be answered with this context?
6/ Q | A → Self-Containment: Can someone understand the question just from the answer?

That’s the entire system. 3 variables → 6 relationships → 6 metrics. (Plus retrieval metrics that ensure you have the right context.) Nothing more.

And when your RAG system fails… It’s always because one of these 6 is broken. So instead of adding more evals, failures should be mapped to:
• Retrieval issues
• Generation issues
• Or end-to-end mismatches

I talk more about the 6 failure modes of RAG in lesson 6 of the AI Evals & Observability series in Decoding AI Magazine. Check it out here: decodingai.com/p/rag-evaluati…
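The "3 variables → 6 relationships" count in the post is just the number of ordered pairs of distinct variables. A small sketch that enumerates them, using the metric names from the post (the `METRICS` dictionary layout and variable names are my own):

```python
from itertools import permutations

# Keys are (evaluated, given) pairs, read as "evaluated | given";
# values are the metric names listed in the post.
METRICS = {
    ("C", "Q"): "Context Relevance",
    ("A", "C"): "Faithfulness",
    ("A", "Q"): "Answer Relevance",
    ("C", "A"): "Context Support",
    ("Q", "C"): "Question Answerability",
    ("Q", "A"): "Self-Containment",
}

# Every ordered pair of distinct variables drawn from Q, C, A.
pairs = list(permutations("QCA", 2))

# The six metrics cover exactly these pairs, with none missing and none extra.
assert set(pairs) == set(METRICS)

for evaluated, given in pairs:
    print(f"{evaluated} | {given} -> {METRICS[(evaluated, given)]}")
```

Note that C | Q (Context Relevance) and Q | C (Question Answerability) are different keys, which is the distinction the reply above argues for keeping separate.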

Devayush Rout@devayushrout·5h

@saen_dev Embedding drift is one of those failures that looks like model quality until you isolate retrieval. I’d also version the eval set by corpus snapshot so regressions show whether the data changed, the embedding model changed, or the retriever changed.

Saeed Anwar@saen_dev·20 Feb

Production AI lesson nobody warns you about: embedding drift. Your RAG worked perfectly at launch. Users started complaining 3 months later. Your data evolved, your vectors didn't. Add retrieval evals to your CI from day one. Building health AI taught me this the hard way.
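One way to read "version the eval set by corpus snapshot" from the reply above: tag every eval run with the three things that can change, then diff the tags when a metric regresses. The sketch below assumes a hypothetical `EvalRun` record and identifier strings; the field names and the `recall_at_k` metric are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    corpus_snapshot: str    # e.g. a date or hash identifying the indexed data
    embedding_model: str
    retriever_version: str
    recall_at_k: float


def changed_components(baseline: EvalRun, current: EvalRun) -> list[str]:
    """List which of the tracked components differ between two eval runs."""
    tracked = ("corpus_snapshot", "embedding_model", "retriever_version")
    return [name for name in tracked if getattr(baseline, name) != getattr(current, name)]


# Usage: a regression where only the corpus moved points at data drift,
# not at the embedding model or the retriever.
launch = EvalRun("2025-01", "emb-v1", "retriever-1", recall_at_k=0.91)
today = EvalRun("2025-04", "emb-v1", "retriever-1", recall_at_k=0.78)
print(changed_components(launch, today))
```

If more than one component changed between runs, the regression cannot be attributed from the tags alone, which is an argument for changing one component per release.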

Devayush Rout@devayushrout·5h

@pulkit_mittal_ The customer framing is the right one. RAG evals should separate retrieval quality, faithfulness, and answer usefulness, because a technically fancy pipeline still fails if users can’t trust the final answer.

pulkit mittal@pulkit_mittal_·2d

i built a rag eval system today with many important metrics like faithfulness, precision, recall, etc. It’s not enough to build a complex RAG system supporting semantic search across thousands of docs. in a production system, it’s incomplete without having an evaluation solution to keep an eye on this. a customer doesn’t care whether you are using some fancy AI to answer their queries or you have humans replying to them. all they want is accurate and relevant replies.

Devayush Rout@devayushrout·5h

@KamranMoazim Retrieval eval is the part that keeps this from becoming vibes. I’d separate recall, citation precision, freshness, and permission correctness, because each one fails in a different way.

Serverless Guy | ~Zero Cost Solutions@KamranMoazim·1d

Retrieval quality is everything in RAG. The best LLM in the world can't answer well from bad context. Invest in chunking strategy. Invest in embedding quality. Invest in retrieval evaluation. Garbage retrieval = garbage answers. Every time. #AI #LLM
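The four dimensions named in the reply above (recall, citation precision, freshness, permission correctness) can be reported as separate numbers rather than one blended score. This is a rough sketch under stated assumptions: the `Doc` type, the group-based permission model, and the definitions of each metric are mine, not a standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    doc_id: str
    updated: date
    allowed_groups: frozenset  # groups permitted to read this document


def retrieval_report(retrieved, relevant_ids, cited_ids, user_groups, stale_before):
    """Score one query on four independent axes; each value is in [0, 1]."""
    ids = {d.doc_id for d in retrieved}
    n = len(retrieved)
    return {
        # Share of known-relevant documents that were retrieved.
        "recall": len(ids & relevant_ids) / len(relevant_ids) if relevant_ids else 1.0,
        # Share of cited documents that are actually relevant.
        "citation_precision": len(cited_ids & relevant_ids) / len(cited_ids) if cited_ids else 1.0,
        # Share of retrieved documents updated on or after the staleness cutoff.
        "freshness": sum(d.updated >= stale_before for d in retrieved) / n if n else 1.0,
        # Share of retrieved documents the requesting user is allowed to see.
        "permission_correctness": sum(bool(d.allowed_groups & user_groups) for d in retrieved) / n if n else 1.0,
    }
```

Keeping the axes separate matters because the fixes differ: low recall points at chunking or embeddings, while low permission correctness is an access-control bug regardless of answer quality.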

Devayush Rout@devayushrout·5h

@AjjayKannan @danielchalef Supersession feels like the hardest part because stale-but-plausible memories are worse than missing memories. I’d want every durable memory to carry source, confidence, expiry, and an explicit replacement path.

Ajjay@AjjayKannan·1d

Who’s actually running long-term agent memory in production? I mean learning from the daily exhaust of an organization and surfacing useful opportunities over months — not demo magic. What broke? Especially dedup, supersession, decay, staleness, and trust. @danielchalef @zep_ai
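The fields proposed in the reply above (source, confidence, expiry, explicit replacement path) map naturally onto a small record plus a lookup that follows supersession. This is a sketch, not how Zep or any other memory product works: `Memory`, `superseded_by`, and `resolve` are hypothetical names, and the store is a plain dictionary.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Memory:
    memory_id: str
    text: str
    source: str                          # where the fact came from
    confidence: float
    expires_at: datetime
    superseded_by: Optional[str] = None  # explicit replacement path


def resolve(store: dict, memory_id: str, now: datetime) -> Optional[Memory]:
    """Follow the replacement chain, then return the memory only if it is still live."""
    seen = set()
    while memory_id in store and memory_id not in seen:
        seen.add(memory_id)
        memory = store[memory_id]
        if memory.superseded_by is None:
            return memory if now < memory.expires_at else None
        memory_id = memory.superseded_by
    return None  # dangling replacement pointer or a cycle in the chain
```

Returning `None` for expired or broken chains reflects the reply's premise that a missing memory is safer than a stale but plausible one.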

Devayush Rout@devayushrout·5h

@ServerSideTale The split between “did the system execute correctly?” and “what did the model see?” is a useful observability boundary. Traces alone rarely explain retrieval selection, memory ranking, or why a plausible answer was wrong.

Aditya Choudhary@ServerSideTale·1d

x.com/i/article/2016…

Devayush Rout@devayushrout·5h

@MavinoIkein The checkpoint framing is strong because it makes safety testable outside the model. For production agents, “please behave” should become versioned controls plus eval cases that prove the behavior changed.

Mavino Ikein@MavinoIkein·5d

Microsoft’s ASSERT and ACS work because they move agent safety into checkpoints teams can test.

At Build, Microsoft announced two open pieces worth separating from the normal enterprise AI noise:

ASSERT: a policy-driven evaluation framework that turns organizational requirements into targeted agent eval scenarios.

ACS: an Agent Control Specification for deterministic controls at five checkpoints in the agent loop: input, model, state, tool execution, and output.

That is a better mental model for production agents. Do not just ask the model to behave. Test where it fails, put controls at the failure points, and rerun the evals to prove the behavior changed.

The interesting part is portability. Microsoft says ASSERT works across frameworks like LangChain, CrewAI, LiteLLM, OpenAI, and others, while ACS is designed as a vendor-neutral control layer.

If agents are going to touch files, tools, credentials, workflows, and customer data, safety has to become something teams can version, audit, replay, and enforce outside the model.

Source: devblogs.microsoft.com/foundry/build-…
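As a generic illustration of "deterministic controls at checkpoints", the sketch below registers plain predicate functions against the five checkpoint names listed in the post. It does not reproduce the ACS specification or any Microsoft API: the `Controls` class, its methods, and the payload dictionaries are all hypothetical.

```python
from collections import defaultdict
from typing import Callable

# Checkpoint names taken from the post; everything else here is hypothetical.
CHECKPOINTS = ("input", "model", "state", "tool_execution", "output")


class Controls:
    """Registry of deterministic checks that run outside the model."""

    def __init__(self) -> None:
        self._checks: dict[str, list[Callable[[dict], bool]]] = defaultdict(list)

    def register(self, checkpoint: str, check: Callable[[dict], bool]) -> None:
        if checkpoint not in CHECKPOINTS:
            raise ValueError(f"unknown checkpoint: {checkpoint}")
        self._checks[checkpoint].append(check)

    def allowed(self, checkpoint: str, payload: dict) -> bool:
        # A payload passes only if every check registered at this checkpoint agrees.
        return all(check(payload) for check in self._checks[checkpoint])


# Usage: block one tool by name at the tool-execution checkpoint.
controls = Controls()
controls.register("tool_execution", lambda p: p.get("tool") != "delete_file")
print(controls.allowed("tool_execution", {"tool": "delete_file"}))
```

Because each check is an ordinary function, it can be versioned and covered by eval cases, which is the "prove the behavior changed" step described in the reply.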
