Adaline

543 posts

@tryadaline

Iterate, evaluate, deploy, and monitor LLMs.

Playground · Joined January 2024
2 Following · 827 Followers
Adaline @tryadaline
Observability vs. monitoring for agentic AI, and what the distinction actually means in production: go.adaline.ai/tjYy12r
Adaline @tryadaline
Monitoring tells you that an agent failed; observability tells you which step in the sequence caused it. For a single LLM call, the distinction barely matters. For a multi-step agent with tool calls, branching logic, and intermediate states, it is the difference between a two-hour fix and a two-week investigation.

The four things production agent observability actually requires:
• Input traces per step: what the agent received at each stage, not just the final prompt. Without them, you can only see the end state.
• Tool call logs: which tool was called, with what parameters, and what it returned. This is the layer where silent failures hide.
• Intermediate decision points: where the agent chose one path over another, and on what signal.
• Eval attachment: evaluations linked to specific execution traces, so you can see what the eval found on the exact run that failed.

Build this before the first failure you cannot reproduce. Once that failure happens, the trace is already gone.
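A minimal sketch of what one trace record covering those four requirements could look like. The class names, field names, and sample run (`Trace`, `Step`, `ToolCall`, `get_order`) are hypothetical, not taken from any particular tracing library.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    tool: str
    params: dict[str, Any]
    result: Any  # None is stored as-is so a silent failure stays visible


@dataclass
class Step:
    name: str
    received: str                          # input trace: what the agent saw at this stage
    tool_calls: list[ToolCall] = field(default_factory=list)
    decision: Optional[str] = None         # which path the agent chose
    decision_signal: Optional[str] = None  # the signal it chose on


@dataclass
class Trace:
    run_id: str
    steps: list[Step] = field(default_factory=list)
    evals: dict[str, Any] = field(default_factory=dict)  # eval results attached to this exact run

    def first_suspect_step(self) -> Optional[Step]:
        """Return the first step containing a tool call that returned None."""
        for step in self.steps:
            if any(call.result is None for call in step.tool_calls):
                return step
        return None


# Hypothetical run: the lookup tool silently returns None.
trace = Trace(run_id="run-001")
trace.steps.append(Step(name="plan", received="user: refund order 42",
                        decision="lookup_order", decision_signal="order id present"))
trace.steps.append(Step(name="lookup", received="order_id=42",
                        tool_calls=[ToolCall("get_order", {"order_id": 42}, None)]))
trace.evals["correctness"] = {"score": 0.0, "note": "answered without order data"}

suspect = trace.first_suspect_step()
```

With the eval stored on the same record as the steps, the failing score and the step that caused it can be read from one object.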
Adaline @tryadaline
Full breakdown on LLM-as-a-judge bias and how to build evaluation pipelines that account for it: go.adaline.ai/NTrH9AN
Adaline @tryadaline
LLM judges rate longer responses higher, and not because length correlates with quality in your task domain. It is because length correlates with quality in the training data. Human annotators rate more complete-looking answers higher. More words read as more effort, and models trained on that signal learn the proxy rather than the underlying quality criterion.

Here is what this does to your system over time: your production model learns that verbose answers score better, because they consistently do. The feedback loop runs quietly until someone checks whether longer answers are actually correct more often, and finds that they are not.

Write length-independence into your eval rubric. Tell the judge that brevity is acceptable when brevity is correct. Calibrate against examples where the short answer is correct, because this bias does not correct itself.
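One way to run that calibration, sketched under assumptions: a small set of pairs where the short answer is known to be correct, and a judge exposed as a plain `(question, answer) -> score` function. The rubric wording, the calibration pairs, and both toy judges are illustrative stand-ins, not a real judge model.

```python
from typing import Callable

# Hypothetical rubric text stating length-independence explicitly.
RUBRIC = (
    "Score correctness only. Do not reward length. "
    "A brief answer is acceptable when it is correct."
)

# Hypothetical calibration set: in every pair the SHORT answer is the correct one.
CALIBRATION = [
    {"question": "What is 2 + 2?",
     "short": "4",
     "long": "After carefully working through the arithmetic, the result comes out to 5."},
    {"question": "Capital of France?",
     "short": "Paris",
     "long": "France has had several important cities over the centuries, and its capital is Lyon."},
]

Judge = Callable[[str, str], float]


def length_bias_rate(judge: Judge, pairs: list[dict]) -> float:
    """Fraction of pairs where the judge scores the longer, wrong answer higher."""
    fooled = sum(
        1 for p in pairs
        if judge(p["question"], p["long"]) > judge(p["question"], p["short"])
    )
    return fooled / len(pairs)


def verbose_judge(question: str, answer: str) -> float:
    """Toy biased judge: scores by word count alone."""
    return float(len(answer.split()))


CORRECT = {p["question"]: p["short"] for p in CALIBRATION}


def reference_judge(question: str, answer: str) -> float:
    """Toy length-blind judge: 1.0 if the known correct answer appears, else 0.0."""
    return 1.0 if CORRECT[question] in answer else 0.0


biased_rate = length_bias_rate(verbose_judge, CALIBRATION)
blind_rate = length_bias_rate(reference_judge, CALIBRATION)
```

A rate well above zero on a set like this suggests the judge is rewarding length rather than correctness.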
Adaline @tryadaline
How to evaluate coding agents in production, with the four metrics that matter and the five failure modes to design tests around: go.adaline.ai/t3PXOv5
Adaline @tryadaline
SWE-bench gives coding agents a known codebase, a clear problem statement, and a test suite that validates the fix. That is not what production looks like.

METR found in March 2026 that automated grader scores averaged 24 percentage points higher than what maintainers actually accepted. The benchmark was measuring something different from production readiness.

Four things production coding agent evals need that SWE-bench does not test:
• Multi-file reasoning: Production tasks require reasoning about files that the agent was not explicitly given.
• Tool failure handling: Real tools return malformed responses, and the eval should verify the agent handles them cleanly.
• Partial context tolerance: Real requirements are often ambiguous, which benchmark tasks never replicate.
• Regression detection: The eval should verify the agent has not touched code outside the task scope.

Benchmark scores tell you which models to eliminate. Production evals tell you which ones to ship.
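The regression detection check can be sketched as a scope test over the files an agent changed. This assumes the task declares its allowed paths as glob patterns. The file list and patterns below are made up for illustration.

```python
from fnmatch import fnmatchcase


def out_of_scope_changes(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return every changed path that matches none of the allowed glob patterns."""
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical task: fix a billing bug. Only billing code and its tests are in scope.
changed = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "src/auth/session.py",  # the agent also edited unrelated code
]
allowed = ["src/billing/*", "tests/billing/*"]

violations = out_of_scope_changes(changed, allowed)
```

In practice the changed-file list would come from the diff the agent produced. An eval can fail the run whenever `violations` is non-empty.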
Adaline @tryadaline
MCP standardizes how agents discover and connect to tools. It does not standardize what happens when those connections break.

Three things MCP does not handle:
• Retry safety: Whether a failed call is safe to retry depends on whether the operation has side effects, and MCP does not guarantee that information (its tool annotations are optional hints).
• Silent failures: When a tool returns null instead of an error, MCP does not surface that signal to the agent.
• Observability: Traces that reconstruct what happened across a multi-step sequence are not part of the MCP spec.

Engineers who mistake integration ease for production reliability will encounter this during the first real production failure. The protocol handles the connection, but everything that happens after is still your engineering problem.
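A rough sketch of handling the first two gaps on your own side of the connection. It assumes you keep a side-effect flag in your own tool registry and treat a null result as an explicit error. `ToolSpec`, `invoke`, and `ToolReturnedNull` are hypothetical names, not part of MCP or any SDK.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolReturnedNull(Exception):
    """Raised when a tool returns None instead of a result or an error."""


@dataclass
class ToolSpec:
    name: str
    call: Callable[..., Any]
    has_side_effects: bool  # maintained by you; assumed not trustworthy from the protocol


def invoke(spec: ToolSpec, max_attempts: int = 3, **params: Any) -> Any:
    """Retry only side-effect-free tools, and surface null results as errors."""
    attempts = 1 if spec.has_side_effects else max_attempts
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(attempts):
        try:
            result = spec.call(**params)
        except Exception as exc:  # transport or tool error: maybe retry
            last_error = exc
            continue
        if result is None:
            raise ToolReturnedNull(f"{spec.name} returned null")
        return result
    raise last_error


calls = {"read": 0, "write": 0}


def flaky_read(key: str) -> dict:
    calls["read"] += 1
    if calls["read"] < 3:
        raise TimeoutError("timed out")
    return {"key": key, "value": 1}


def flaky_write(key: str) -> dict:
    calls["write"] += 1
    raise TimeoutError("timed out")


read_result = invoke(ToolSpec("read_record", flaky_read, has_side_effects=False), key="a")
try:
    invoke(ToolSpec("charge_card", flaky_write, has_side_effects=True), key="a")
except TimeoutError:
    pass  # not retried, because a second charge could double-bill
```

The read is retried until it succeeds. The write is attempted once and its failure is passed up to the caller.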
Adaline @tryadaline
Using the same model family to generate and judge your outputs isn’t evaluation. It’s self-grading.

Three biases that don’t show up in aggregate agreement scores but consistently show up in practice:
1. Position bias: Give the model two responses, and it favors whichever appears first, regardless of quality.
2. Verbosity bias: Longer outputs score higher, not more accurate ones.
3. Self-enhancement bias: When a model judges its own outputs against a competitor, it rates itself higher even when the outputs are identical.

The research case for LLM-as-a-judge is solid. The case for running it without calibration is not.

Which of these has burned your eval pipeline?
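A common guard against position bias is to judge each pair in both orders and keep only verdicts that survive the swap. The sketch below assumes a pairwise judge that returns "first" or "second". Both judges here are toy functions standing in for a real model call.

```python
from typing import Callable

PairJudge = Callable[[str, str], str]  # returns "first" or "second"


def order_consistent_verdict(judge: PairJudge, a: str, b: str) -> str:
    """Judge (a, b) and (b, a); return "a", "b", or "inconsistent"."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    if winner_forward == winner_backward:
        return winner_forward
    return "inconsistent"


def first_slot_judge(x: str, y: str) -> str:
    """Toy position-biased judge: always prefers whatever is shown first."""
    return "first"


def content_judge(x: str, y: str) -> str:
    """Toy content-based judge: prefers the response containing "42"."""
    return "first" if "42" in x else "second"


a, b = "The answer is 42", "The answer is 41"
biased_verdict = order_consistent_verdict(first_slot_judge, a, b)
stable_verdict = order_consistent_verdict(content_judge, a, b)
```

The share of "inconsistent" verdicts over a sample gives a rough measure of how much position is driving the judge.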
Adaline @tryadaline
Your agent passed every internal test. In production, it completed fewer than 1 in 3 tasks correctly. Not because the model was weak, but because production is a completely different test.

Here are the five failure modes that arrive at the same time:
• Context rot: Quality degrades across turns without any error being thrown.
• Tool execution unreliability: The model produces confident responses when tools return null or time out.
• Evaluation blindness: You find out quality changed when users complain, not when a metric catches it.
• Unsafe retry behavior: Retry logic re-runs stateful workflows and creates side effects.
• Memory drift: Agents behave inconsistently across sessions with the same user.

None of these arrives one at a time. Full reading guide in the first reply.
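For the unsafe retry case, one widely used mitigation is an idempotency key, so a re-run of a stateful step returns the earlier result instead of repeating the side effect. This is a toy in-memory sketch. The `Ledger` class and the key format are hypothetical.

```python
class Ledger:
    """Toy stateful workflow: charges are deduplicated by idempotency key."""

    def __init__(self) -> None:
        self.charges: list[dict] = []
        self._seen: dict[str, dict] = {}

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A retried call with the same key returns the original receipt
        # and does not create a second charge.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        receipt = {"id": len(self.charges) + 1, "amount": amount}
        self.charges.append(receipt)
        self._seen[idempotency_key] = receipt
        return receipt


ledger = Ledger()
first = ledger.charge("task-17:step-3", 25)
retry = ledger.charge("task-17:step-3", 25)  # e.g. the agent's retry logic re-runs the step
```

Building the key from the task and step, rather than from the attempt, is what makes the retry harmless.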
Adaline @tryadaline
Your agent called the wrong tool. To resolve the issue, you rewrote the system prompt, reran the test, and watched it call the wrong tool again.

The system prompt was never the problem. The model selects its tool before writing a single word, and that decision is based entirely on the tool’s description. Most production tool definitions are one sentence long. One sentence is not enough to avoid selection errors in production.

Here are the four failure modes that cause most of these problems:
• Ambiguous overlap: Two tools with similar descriptions cause the model to pick inconsistently between them.
• Missing constraints: Without a rule for when not to use a tool, any matching request becomes a trigger.
• Misleading parameter names: Parameter names carry their own selection signal, separate from the description text.
• Unnecessary calls: Agents invoke tools on queries they could answer on their own.

A wrong call leads to a wrong answer about 62% of the time. The full breakdown is in the first reply.
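The first two failure modes can be caught with a crude lint over tool descriptions before they reach a model. This sketch uses word-level Jaccard similarity as a stand-in for "similar descriptions" and a simple phrase check for missing constraints. The tool names, descriptions, and the 0.5 threshold are all made-up assumptions.

```python
import re
from itertools import combinations

# Hypothetical tool definitions.
TOOLS = {
    "search_orders": "Search customer orders by id or email.",
    "find_orders": "Find customer orders by email or id.",
    "issue_refund": "Issue a refund for a paid order. Do not use to look up orders.",
}


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def overlapping_pairs(tools: dict[str, str], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Pairs of tools whose descriptions share most of their words."""
    return [
        (x, y) for x, y in combinations(sorted(tools), 2)
        if jaccard(tools[x], tools[y]) >= threshold
    ]


def missing_constraints(tools: dict[str, str]) -> list[str]:
    """Tools whose description never says when NOT to use them (phrase heuristic)."""
    return sorted(name for name, desc in tools.items() if "do not use" not in desc.lower())


ambiguous = overlapping_pairs(TOOLS)
unconstrained = missing_constraints(TOOLS)
```

A lint like this only flags candidates for a rewrite. It does not measure how the model actually selects between tools.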
Adaline @tryadaline
Pattern set:
→ Strict schemas.
→ Predictable errors.
→ Prompt-schema alignment.
→ Testing durability across versions.

Subscribe: go.adaline.ai/z5oLAVq
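As an illustration of the first two patterns, here is a small hand-rolled validator that rejects unknown or mistyped arguments and reports every problem in one fixed error shape. The schema, field names, and error codes are hypothetical. A real system would more likely use a schema library.

```python
# Hypothetical argument schema for one tool: every field is required, no extras allowed.
SCHEMA = {"order_id": int, "reason": str}


def validate_args(args: dict, schema: dict) -> list[dict]:
    """Return a list of structured errors; an empty list means the args are valid."""
    errors = []
    for name in schema:
        if name not in args:
            errors.append({"field": name, "code": "missing"})
    for name, value in args.items():
        if name not in schema:
            errors.append({"field": name, "code": "unexpected"})
            continue
        expected = schema[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_bool_as_int = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or is_bool_as_int:
            errors.append({"field": name, "code": "wrong_type", "expected": expected.__name__})
    return errors


ok = validate_args({"order_id": 42, "reason": "damaged"}, SCHEMA)
bad = validate_args({"order_id": "42", "note": "x"}, SCHEMA)
```

Because every failure has the same keys, the caller (or the agent) can branch on `code` instead of parsing free-text messages.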
Adaline @tryadaline
Tool calls fail when schemas are ambiguous. Not when models are weak.