Devayush Rout
265 posts

@devayushrout
Building production AI systems. RAG, evals, agents, LLMOps, document AI. Sharing traces, failure modes, and shipping notes.
South Delhi, India · Joined June 2025
86 Following · 24 Followers

@twimlai This is the clearest reason RAG survives long context: citation and review requirements. In high-stakes domains, retrieval is not just compression; it is evidence control.

As context windows grow into the millions of tokens, many AI practitioners are questioning whether retrieval-augmented generation (RAG) is still necessary. If modern models can ingest entire libraries of documents, why bother with retrieval at all?
In this episode, Alex Bowcut, Head of Engineering at @get_sphere, explains why the answer depends on the application. Sphere uses AI to automate global tax compliance—an environment where getting the answer right isn’t enough. Every conclusion must be backed by the correct legal citation, and every decision must withstand expert review.
We explore how Sphere built TRAM (Tax Review and Assessment Model), a production AI system that combines retrieval, reasoning models, legal review workflows, reinforcement learning, and deterministic systems to help tax experts move nearly two orders of magnitude faster while maintaining accuracy.
Along the way, we discuss why RAG remains critical in high-stakes domains, how Sphere processes legal and regulatory documents from jurisdictions around the world, retrieval architectures, semantic chunking, dense versus sparse retrieval, expert feedback loops, and the challenges of building AI systems that people can actually trust.
🗒️ Full show notes: twimlai.com/go/769.
📖 CHAPTERS
===============================
00:00 - Introduction
01:24 - Sphere
07:04 - Challenges of legal data collection
08:58 - TRAM (Tax Review and Assessment Model)
16:08 - Pipeline
18:55 - Semantic chunking
21:21 - Dense vs. sparse retrieval
24:41 - Product taxonomies
27:55 - Is RAG dead?
29:56 - Citations
31:23 - Reinforcement fine-tuning
36:28 - Evals
37:47 - LLM-as-a-judge and retrieval-reranking loop
40:28 - Sphere’s north star
42:40 - Impact of context window
44:46 - Token costs
45:41 - Future directions

@XunWallace This is the eval shape agent memory needs: not just hit-rate, but whether the recovered state explains the next action. I’d want traces to preserve objective changes and tool-result causality as first-class memory keys.

Agent memory evals need a harder mechanism: not just whether retrieval finds the note, but whether the runtime reconstructs causality across a trajectory.
Source: arxiv.org/abs/2602.22769
AMA-Bench tests states/actions/observations/tool outputs instead of only chat history. That means production memory should store causal edges + objective state changes, or long tasks become impossible to audit and debug.
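A minimal sketch of what storing causal edges and objective state changes could look like. All names here (`Step`, `TraceMemory`, `explain`, `objective_changes`) are hypothetical and are not taken from AMA-Bench or any library.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One trajectory step: the action taken and what came back."""
    step_id: int
    action: str
    observation: str
    objective: str  # objective in force at this step
    caused_by: list[int] = field(default_factory=list)  # earlier steps this one depends on


class TraceMemory:
    """Hypothetical store that keeps causal edges alongside the steps."""

    def __init__(self) -> None:
        self.steps: dict[int, Step] = {}

    def add(self, step: Step) -> None:
        self.steps[step.step_id] = step

    def explain(self, step_id: int) -> list[int]:
        """Return every upstream step id that led to step_id, ascending."""
        seen: set[int] = set()
        stack = list(self.steps[step_id].caused_by)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.steps[current].caused_by)
        return sorted(seen)

    def objective_changes(self) -> list[tuple[int, str]]:
        """List (step_id, objective) at each point where the objective changed."""
        changes = []
        previous = None
        for step_id in sorted(self.steps):
            objective = self.steps[step_id].objective
            if objective != previous:
                changes.append((step_id, objective))
                previous = objective
        return changes


memory = TraceMemory()
memory.add(Step(1, "search_docs('refund policy')", "3 results", "answer refund question"))
memory.add(Step(2, "open_doc(2)", "policy text", "answer refund question", caused_by=[1]))
memory.add(Step(3, "lookup_order(881)", "order not found", "verify order exists", caused_by=[2]))

print(memory.explain(3))
print(memory.objective_changes())
```

The point of the design is that an audit question like "why did step 3 happen?" becomes a graph walk instead of a similarity search over notes.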

@ishadesign Visual evals get hard because correctness is often layout- and intent-dependent, not just pixel similarity. I’d want a mix of rubric-based review, task-specific assertions, and screenshots tied back to user outcomes.

@AashiDutt The “demos don’t restart” line gets at the real reliability gap. Durable agent memory needs write policies and expiry rules too, otherwise it just turns repeated work into stale-context risk.

@HarryTandy The production-to-regression-test loop is the part I’d want in every agent stack. Traces explain what happened once; turning failures into durable eval cases is what stops the same bug from becoming a weekly ritual.

AI OBSERVABILITY IS BROKEN. YOUR AGENT HARNESS SHOULD REPAIR ITSELF
Most engineering teams install a dashboard, get a clean tree of model calls, and think their production debugging loop is automated
It isn't. The dashboard tells you what broke, then leaves the actual fixing to your engineering time.
This guy broke down the open-source architecture closing this loop from trace to patch
Here are the 11 infrastructure rules worth stealing:
1. Traces without automated root-cause analysis are just expensive log files. The platform must explain the causal chain, not just show the error
2. True agent debugging requires code-level context. The tool must read your local source files to pinpoint the exact broken lines
3. The output should be a code diff, not a text hint. The system generates the precise code modification and waits for your approval
4. Manual regression testing fails at scale. Approved patches must instantly turn the original failing input into a permanent regression test
5. Numerical evaluation metrics fail in production. Replace abstract floats with plain-English assertions that check explicit business logic
6. Build test suites from live production failures, not synthetic data. Let real edge cases harden your evaluation layer automatically
7. Prompt playgrounds solve the wrong problem. Validating an agent requires an execution sandbox that runs the entire graph end-to-end
8. Sandboxes must live outside of git. This allows non-technical team members to test prompts and models without breaking code
9. Instrument early via unified runtime decorators. Track every tool call and retrieval step against the active agent configuration
10. Route fixes through versioned blueprints. Never deploy directly; transition verified sandbox changes safely to staging and production
11. Tooling sprawl destroys context. Tracing, evaluations, sandboxing, and testing must live in one flywheel, not separate platforms
What the architecture executes:
> Automated tracing and root-cause diagnostics
> Automated source code diff generation
> Plain-English evaluation assertions
> End-to-end graph sandbox execution
> Instant production-to-regression test pipelines
Observability that ends at the dashboard made sense when agents were simple chat bots
Production pipelines require tools that run the repair loop for you
Save this if you are building self-correcting agent infrastructure
Akshay 🚀@akshay_pachaar
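Rules 4 and 6 above (failing production inputs become permanent regression tests) could be sketched roughly like this. The trace shape, the field names, and `RegressionCase` are assumptions for illustration, not the API of the tool the post describes.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RegressionCase:
    trace_id: str    # production trace the case came from
    input_text: str  # the original failing input, replayed verbatim
    assertion: str   # plain-English check, e.g. for a reviewer or an LLM judge


def case_from_trace(trace: dict, assertion: str) -> RegressionCase:
    """Turn a failed production trace into a replayable eval case."""
    return RegressionCase(trace["trace_id"], trace["input"], assertion)


def add_case(suite: list[dict], case: RegressionCase) -> list[dict]:
    """Append the case unless this trace is already covered."""
    if any(existing["trace_id"] == case.trace_id for existing in suite):
        return suite
    return suite + [asdict(case)]


# Hypothetical failed trace pulled from production logs.
failed_trace = {"trace_id": "t-123", "input": "Refund order 881", "status": "error"}
case = case_from_trace(failed_trace, "The reply must not promise a refund for an unknown order.")

suite = add_case([], case)
suite = add_case(suite, case)  # second call is a no-op, the trace is already covered
```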

@vincentsunnchen @harvey @gabepereyra The fan-out result is a useful warning. More parallel tool use can look sophisticated while adding noise, so trajectory evals should measure whether each branch created usable evidence or just extra state to reconcile.

Trajectory-based error analysis points to levers for post-training and harness engineering!
From the @harvey team:
- Verify-and-revise correlates with the biggest score jump (+1.5).
- "Fan-out" tool parallelism hurts (-0.5); potentially adds noise without direction
- Grounding drafts against source evidence is +0.3, but only occurs in 19% of trajectories
Excited for more behavior-level analysis over long-horizon agent evals - great example here from Legal Agent Benchmark (LAB)!

Gabe Pereyra@gabepereyra

@MaxITfinds Per-key budgets are boring in the best way. Agent runs can spike cost through retries or tool loops, so spend controls need to live on the request path, not only in monthly reporting.
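A minimal sketch of a spend control that sits on the request path, assuming each call can be given a cost estimate before it is sent. `KeyBudget`, `BudgetExceeded`, and the dollar figures are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push a key past its spend cap."""


class KeyBudget:
    """Tracks spend per API key and rejects requests that would exceed the cap."""

    def __init__(self, limits_usd: dict[str, float]) -> None:
        self.limits = limits_usd
        self.spent = {key: 0.0 for key in limits_usd}

    def charge(self, key: str, estimated_cost_usd: float) -> float:
        """Reserve spend before the model call; return the remaining budget."""
        if key not in self.limits:
            raise KeyError(f"unknown key: {key}")
        if self.spent[key] + estimated_cost_usd > self.limits[key]:
            raise BudgetExceeded(f"{key} would exceed {self.limits[key]:.2f} USD")
        self.spent[key] += estimated_cost_usd
        return self.limits[key] - self.spent[key]


budget = KeyBudget({"team-a": 1.00})
budget.charge("team-a", 0.40)  # first attempt
budget.charge("team-a", 0.40)  # retry
try:
    budget.charge("team-a", 0.40)  # a third attempt would pass the cap
    blocked = False
except BudgetExceeded:
    blocked = True
```

Because the check runs before the call, a retry storm or tool loop is stopped at the cap rather than discovered in the monthly report.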

@kmeanskaran This is where AI engineering starts looking like systems engineering again. Model choice matters, but latency budgets, fallback paths, queues, and eval loops decide whether the product survives traffic.

Today, no one talks about RAG and multi-agent systems. No debate on the best model. Why?
Simply because AI engineering is no longer just about fine-tuning LLMs, building agents, and RAG. That’s now the bare minimum.
The real challenge is MLOps and inference engineering. Companies struggle to take their agents to production:
- Latency and throughput tradeoffs
- Handling 1,000 requests/second
- Fallback mechanisms
- Evaluation and monitoring
- Distributed systems, queues, etc.
- Prompt and semantic caching
These are the factors companies now prioritize. In my consulting experience, people want to run LLMs with low cost, better latency, and fault-tolerant systems.
Building projects around this will make you valuable in today’s industry.
I’ll resume posting more on MLOps and inference engineering soon.
Subscribe to my Substack: kmeanskaran.substack.com

@imitation_alpha The scoring function is the product decision hiding inside the agent loop. Without it, the agent can keep moving while spending tokens on branches that are high-risk or impossible to verify.

@CobusGreylingZA This split is useful because each layer fails differently. Context failures look like missing evidence, harness failures look like unsafe execution, and loop failures look like bad stopping or handoff.

𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
The foundation...
What the agent knows, remembers, and can access.
Typical artifacts: RAG pipelines, memory summaries, SKILL.md, selective tool exposure, compressed history, structured specs.
𝗛𝗮𝗿𝗻𝗲𝘀𝘀 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
The safety rails and scaffolding.
What keeps the agent aligned, recoverable, and within bounds.
Typical artifacts: AGENTS.md, MCP connectors, hooks, worktrees, verifier sub-agents, evals, sandboxes, retry/escalation rules.
𝗟𝗼𝗼𝗽 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
The ongoing rhythm.
How the system runs, monitors progress, hands off work, and stops cleanly.
Typical artifacts: Scheduled automations, goal stop conditions, progress files, triage inbox, sub-agent orchestration, cron/hooks.


@chaliy The in-process angle is interesting because observability and budgeting can sit closer to the actual tool loop. I’d be curious how much of the runtime state becomes inspectable versus only summarized after the agent finishes.

I am investing in extracting the everruns runtime. The basic idea is that the goodies implemented and available in the hosted version of everruns could be used in-process. Whatever the use case (coding agents, personal agents, batch agentic processing), that would be the answer.
Basically you can have:
- context management, like tool call optimizations, history handling, prompt caching
- dynamic agents.md, skills, mcp, custom tools, etc.
- tool search, infinite context and other fun
- tracing, observability, budgeting, token tracing
- soon subagents from hosted everruns will land in the runtime as well
- builtins like fetch with post-processing, multiple options for search, external sandboxes, etc.
- and of course multi-model and multi-provider
And all of this is fast and in Rust.
Read more at docs.everruns.com/features/runti…

@norlava The halt conditions are the underrated part. Long-running agents need progress tests and budget ceilings as product requirements, not just runtime safeguards, otherwise reliability turns into “hope the loop stops.”
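A minimal sketch of halt conditions as explicit loop requirements, assuming each agent step can report tokens used and a numeric progress score. `run_with_halts`, `StepResult`, and `stuck_agent` are hypothetical names.

```python
from typing import Callable, Tuple

StepResult = Tuple[int, float, bool]  # (tokens_used, progress_score, done)


def run_with_halts(
    step: Callable[[int], StepResult],
    token_ceiling: int,
    max_stalled_steps: int,
) -> str:
    """Run step(i) until it finishes, hits the token ceiling, or stops progressing."""
    tokens_spent = 0
    best_progress = float("-inf")
    stalled = 0
    i = 0
    while True:
        tokens_used, progress, done = step(i)
        tokens_spent += tokens_used
        if done:
            return "done"
        if tokens_spent >= token_ceiling:
            return "halted: budget ceiling"
        if progress > best_progress:
            best_progress = progress
            stalled = 0
        else:
            # Progress test: no improvement over the best score seen so far.
            stalled += 1
            if stalled >= max_stalled_steps:
                return "halted: no progress"
        i += 1


def stuck_agent(i: int) -> StepResult:
    # Hypothetical agent that makes progress once and then loops forever.
    return (100, min(i, 1), False)


outcome = run_with_halts(stuck_agent, token_ceiling=10_000, max_stalled_steps=3)
print(outcome)
```

Both limits are arguments rather than constants, so they can be set per product requirement and covered by tests.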

@pauliusztin_ The Q/C/A framing is clean because it makes failure routing obvious. I’d especially keep answerability separate from context relevance, since a chunk can be topically close and still not contain enough evidence to answer.

It's been 2 years since I wrote the LLM Engineer's Handbook.
Since then, I've seen many new RAG eval tools emerge, but there's a problem...
Most of them overcomplicate everything with proprietary metric suites.
But every RAG system has only 3 variables:
Q → Question
C → Context
A → Answer
And if you look at how these interact…
There are exactly 6 relationships you can evaluate:
1/ C | Q → Context Relevance
Is the retrieved context relevant to the question?
2/ A | C → Faithfulness
Does the answer stick to the context?
3/ A | Q → Answer Relevance
Does the answer solve the user’s question?
4/ C | A → Context Support
Does the context fully support the answer?
5/ Q | C → Question Answerability
Can this question even be answered with this context?
6/ Q | A → Self-Containment
Can someone understand the question just from the answer?
That’s the entire system.
3 variables → 6 relationships → 6 metrics.
(Plus retrieval metrics that ensure you have the right context)
Nothing more.
And when your RAG system fails…
It’s always because one of these 6 is broken.
So instead of adding more evals, failures should be mapped to:
• Retrieval issues
• Generation issues
• Or end-to-end mismatches
I talk more about the 6 failure modes of RAG in lesson 6 of the AI Evals & Observability series in Decoding AI Magazine.
Check it out here: decodingai.com/p/rag-evaluati…
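The "3 variables → 6 relationships" count is the number of ordered pairs of distinct variables (3 × 2 = 6). A small sketch that enumerates them and attaches the metric names from the post; the `NAMES` mapping and the (target, given) reading of `X | Y` are assumptions about how to encode the list.

```python
from itertools import permutations

# Metric names as listed in the post, keyed by (target, given).
NAMES = {
    ("C", "Q"): "Context Relevance",
    ("A", "C"): "Faithfulness",
    ("A", "Q"): "Answer Relevance",
    ("C", "A"): "Context Support",
    ("Q", "C"): "Question Answerability",
    ("Q", "A"): "Self-Containment",
}

# Every ordered pair of distinct variables drawn from Q, C, A.
pairs = list(permutations(["Q", "C", "A"], 2))

for target, given in pairs:
    print(f"{target} | {given} -> {NAMES[(target, given)]}")
```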


@saen_dev Embedding drift is one of those failures that looks like model quality until you isolate retrieval. I’d also version the eval set by corpus snapshot so regressions show whether the data changed, the embedding model changed, or the retriever changed.

@pulkit_mittal_ The customer framing is the right one. RAG evals should separate retrieval quality, faithfulness, and answer usefulness, because a technically fancy pipeline still fails if users can’t trust the final answer.

i built a RAG eval system today with many important metrics like faithfulness, precision, recall, etc.
It’s not enough to build a complex RAG system supporting semantic search across thousands of docs.
in a production system, it’s incomplete without having an evaluation solution to keep an eye on this.
a customer doesn’t care whether you are using some fancy AI to answer their queries or you have humans replying to them. all they want is accurate and relevant replies.

@KamranMoazim Retrieval eval is the part that keeps this from becoming vibes. I’d separate recall, citation precision, freshness, and permission correctness, because each one fails in a different way.
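A minimal sketch of reporting those four checks separately instead of as one score. The metric definitions, the `Doc` fields, and the sample data are assumptions chosen for illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    updated: date
    allowed_groups: set[str]


def retrieval_report(
    retrieved: list[Doc],
    relevant_ids: set[str],
    cited_ids: set[str],
    user_groups: set[str],
    today: date,
    max_age_days: int,
) -> dict[str, float]:
    """Score one query on four independent axes, each in [0, 1]."""
    retrieved_ids = {d.doc_id for d in retrieved}
    n = len(retrieved)
    fresh = [d for d in retrieved if (today - d.updated).days <= max_age_days]
    permitted = [d for d in retrieved if d.allowed_groups & user_groups]
    return {
        # Share of known-relevant docs that were retrieved.
        "recall": len(retrieved_ids & relevant_ids) / len(relevant_ids) if relevant_ids else 1.0,
        # Share of cited docs that are actually relevant.
        "citation_precision": len(cited_ids & relevant_ids) / len(cited_ids) if cited_ids else 1.0,
        # Share of retrieved docs updated within the allowed age.
        "freshness": len(fresh) / n if n else 1.0,
        # Share of retrieved docs the user is allowed to see.
        "permission_correctness": len(permitted) / n if n else 1.0,
    }


docs = [
    Doc("a", date(2025, 6, 1), {"finance"}),
    Doc("b", date(2023, 1, 1), {"finance"}),
    Doc("c", date(2025, 5, 20), {"legal"}),
]
report = retrieval_report(
    retrieved=docs,
    relevant_ids={"a", "d"},
    cited_ids={"a", "b"},
    user_groups={"finance"},
    today=date(2025, 6, 30),
    max_age_days=365,
)
print(report)
```

In this sample each axis fails for a different document, which is the reason to keep them apart: a single blended score would hide which fix is needed.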

@AjjayKannan @danielchalef Supersession feels like the hardest part because stale-but-plausible memories are worse than missing memories. I’d want every durable memory to carry source, confidence, expiry, and an explicit replacement path.

Who’s actually running long-term agent memory in production?
I mean learning from the daily exhaust of an organization and surfacing useful opportunities over months — not demo magic.
What broke? Especially dedup, supersession, decay, staleness, and trust.
@danielchalef @zep_ai
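A minimal sketch of a durable memory record that carries source, confidence, expiry, and an explicit replacement path, as the reply above suggests. `MemoryRecord`, `resolve`, and the sample records are hypothetical and do not describe any particular memory product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MemoryRecord:
    key: str
    value: str
    source: str                          # where the fact came from
    confidence: float                    # 0.0 to 1.0, set at write time
    expires_at: datetime                 # after this, the record is not served
    superseded_by: Optional[str] = None  # id of the record that replaces this one


def resolve(
    records: dict[str, MemoryRecord], record_id: str, now: datetime
) -> Optional[MemoryRecord]:
    """Follow the replacement chain; return None if the live record has expired."""
    seen: set[str] = set()
    while record_id in records and record_id not in seen:
        seen.add(record_id)  # guards against accidental replacement cycles
        record = records[record_id]
        if record.superseded_by is None:
            return record if record.expires_at > now else None
        record_id = record.superseded_by
    return None


now = datetime(2025, 6, 1)
records = {
    "m1": MemoryRecord("payment_terms", "net-30", "email 2025-03-02", 0.7,
                       now + timedelta(days=90), superseded_by="m2"),
    "m2": MemoryRecord("payment_terms", "net-45", "contract v2", 0.9,
                       now + timedelta(days=30)),
}

current = resolve(records, "m1", now)
print(current.value if current else None)
```

A stale-but-plausible value such as "net-30" is never returned here, because reads always go through the replacement chain and the expiry check.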

@ServerSideTale The split between “did the system execute correctly?” and “what did the model see?” is a useful observability boundary. Traces alone rarely explain retrieval selection, memory ranking, or why a plausible answer was wrong.

@MavinoIkein The checkpoint framing is strong because it makes safety testable outside the model. For production agents, “please behave” should become versioned controls plus eval cases that prove the behavior changed.

Microsoft’s ASSERT and ACS work because they move agent safety into checkpoints teams can test.
At Build, Microsoft announced two open pieces worth separating from the normal enterprise AI noise:
ASSERT: a policy-driven evaluation framework that turns organizational requirements into targeted agent eval scenarios.
ACS: an Agent Control Specification for deterministic controls at five checkpoints in the agent loop: input, model, state, tool execution, and output.
That is a better mental model for production agents.
Do not just ask the model to behave.
Test where it fails, put controls at the failure points, and rerun the evals to prove the behavior changed.
The interesting part is portability. Microsoft says ASSERT works across frameworks like LangChain, CrewAI, LiteLLM, OpenAI, and others, while ACS is designed as a vendor-neutral control layer.
If agents are going to touch files, tools, credentials, workflows, and customer data, safety has to become something teams can version, audit, replay, and enforce outside the model.
Source: devblogs.microsoft.com/foundry/build-…
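A rough sketch of the "controls at checkpoints, then rerun the evals" idea. Only the five checkpoint names come from the post; the `Checkpoint` enum, the `controls` registry, `enforce`, and the tool allowlist are hypothetical and are not the ACS or ASSERT schema.

```python
from enum import Enum
from typing import Callable


class Checkpoint(Enum):
    INPUT = "input"
    MODEL = "model"
    STATE = "state"
    TOOL_EXECUTION = "tool_execution"
    OUTPUT = "output"


Control = Callable[[dict], bool]  # returns True when the payload is allowed

# Deterministic controls registered per checkpoint, outside the model.
controls: dict[Checkpoint, list[Control]] = {cp: [] for cp in Checkpoint}


def enforce(checkpoint: Checkpoint, payload: dict) -> bool:
    """Allow the payload only if every control at this checkpoint passes."""
    return all(control(payload) for control in controls[checkpoint])


# Hypothetical control: only allow tools on an explicit allowlist.
ALLOWED_TOOLS = {"search", "read_file"}
controls[Checkpoint.TOOL_EXECUTION].append(lambda p: p.get("tool") in ALLOWED_TOOLS)

# Eval cases double as evidence that the control changed behavior.
eval_cases = [
    ({"tool": "read_file"}, True),
    ({"tool": "delete_file"}, False),
]
results = [
    enforce(Checkpoint.TOOL_EXECUTION, payload) == expected
    for payload, expected in eval_cases
]
print(results)
```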