
When you deploy an LLM-as-a-Judge, you’re shipping a classifier into production.
Each new version is a hypothesis about how the model interprets the world.
It’s data science, just expressed in natural language.
Here’s what that looked like for a recent client project where we trained an evaluator to detect a specific agent error type (labeled Category 1 failures) before release.
Dataset
Dev: 104 labeled traces (46 failures, 58 clean)
Eval: 95 labeled traces (34 failures, 61 clean)
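Scoring a judge against splits like these is just confusion-matrix math. A minimal sketch (function and argument names are hypothetical, not from the project):

```python
def score(verdicts, labels):
    """verdicts/labels: parallel lists of bools (True = Category 1 failure).

    Compares the judge's verdicts to the human labels and returns the
    classifier metrics discussed below."""
    tp = sum(v and l for v, l in zip(verdicts, labels))       # caught failures
    fp = sum(v and not l for v, l in zip(verdicts, labels))   # false alarms
    fn = sum(not v and l for v, l in zip(verdicts, labels))   # missed failures
    tn = sum(not v and not l for v, l in zip(verdicts, labels))
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }
```

Run it once on dev and once on eval; the two dicts are what each judge version gets graded on.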
What We Saw
v1 established a clear baseline.
v2 drove recall higher but overfit to the dev set, collapsing generalization.
v3 made surgical adjustments that clarified “when not to trigger,” improving specificity and stability.
v10 is when we started to see a step change in eval-set performance, a sign the judge was beginning to generalize.
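The v2-style failure is catchable mechanically: compare each version's dev and eval numbers and flag a large gap. A sketch, assuming metric dicts like `{"recall": 0.9, ...}` per split (the 0.10 threshold is an illustrative assumption, not a rule from the project):

```python
def generalization_gap(dev_metrics, eval_metrics, key="recall"):
    """Positive gap = the judge does better on dev than on held-out eval."""
    return dev_metrics[key] - eval_metrics[key]

def looks_overfit(dev_metrics, eval_metrics, max_gap=0.10):
    # Flag versions (like v2 above) whose dev recall doesn't transfer.
    return generalization_gap(dev_metrics, eval_metrics) > max_gap
```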
Why It Matters
I find that teams often fall into the trap of assuming the LLM judge works without verifying it against hard data. That's a big mistake: run the numbers and see for yourself. Even with careful preparation, the model still failed to correctly classify more than 80 percent of the actual labeled errors.
A few percent of overfit recall here, a small generalization gap there, and suddenly your CI isn’t filtering what you think it is.
Treat LLM judges like classifiers: versioned, measured, and tuned against held-out data.
That’s how you keep agents honest in production.
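In CI, "keeping agents honest" reduces to a promotion gate on held-out metrics. A minimal sketch (thresholds are illustrative assumptions, not the client's actual bar):

```python
def ci_gate(eval_metrics, min_recall=0.80, min_specificity=0.80):
    """Block promotion of a new judge version unless it clears the
    held-out eval thresholds on both catching failures (recall) and
    not crying wolf (specificity)."""
    return (eval_metrics["recall"] >= min_recall
            and eval_metrics["specificity"] >= min_specificity)
```

If `ci_gate` returns False, the new judge version never ships, no matter how good its dev-set numbers looked.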
@HamelHusain @sh_reya
