At some point every team running agents in production hits the same wall. You want to know if the output is good, "good" is not something a regex can check, and you cannot read ten thousand outputs a day by hand. The only thing that scales to grade nuance is another model. So you reach for LLM-as-judge, and you are right to. You are also pointing a measurement instrument at your system that can drift, flatter, and contradict itself if you are not careful about how you wire it.
This is the practical guide to wiring it carefully. For the failure it is built to catch, see quality drift, the bug you only find in your support inbox.
What the judge is actually for
Be precise about the job. You are not trying to assign a true, defensible grade to every individual output. That is the thing LLM judges are weakest at, and chasing it leads to endless arguments about whether run 4,012 deserved a 3 or a 4.
The job is to watch a distribution. You want to know, continuously, whether the quality of your agent's output is holding steady, improving, or sliding. That is a question about the rolling shape of many scores, and it is robust to exactly the per-call noise that makes a single judgment shaky. A judge that is wrong by half a point on any given run, but wrong in a consistent direction, still tells you cleanly when the whole distribution shifts down after a deploy.
Hold onto that framing, because it changes every decision that follows. You are building a trend instrument, not a court.
Checks first, judges second
Before any model call, run the cheap layer. A deterministic check is anything code can verify for free: a JSON schema match, a required-field assertion, a value-in-range test, a length bound. Checks run on every single output, in milliseconds, at zero cost, and they catch structural failures cleanly.
The judge runs on top of that, on a sample, for the things checks cannot see: is the answer actually correct, is the tone right, did it address what was asked. Checks tell you the output is well-formed. The judge tells you the well-formed output is good. Run checks at 100% and judges at a sample rate of 5 to 10%, and you catch structural breakage for free while paying only cents to catch qualitative drift.
The four biases, and what to do about each
LLM judges have well-documented failure modes. None of them are disqualifying; all of them have mitigations. Ignoring them is what turns a quality score into a number that feels rigorous and means nothing.
Position bias. In a pairwise "which is better, A or B" setup, judges tend to favour whichever answer appears first. Mitigation: randomise the order, and for high-stakes comparisons run both orderings and average. If A wins in both positions, A actually won.
Verbosity bias. Judges reward length. A longer, more padded answer often scores higher than a tighter, better one, because more text reads as more thorough. Mitigation: say so in the rubric ("do not reward length; a complete answer in two sentences scores the same as a complete answer in two paragraphs"), and watch for a creeping correlation between output length and score, which is the tell.
Self-preference. A model rates outputs from its own family more generously. Mitigation: judge with a different model family than the one generating the output. If your agent runs on one provider, judge with another.
Leniency drift. Judges cluster toward the top of the scale, turning a 1-to-5 rubric into a de facto 3-to-5 rubric. Mitigation: anchor every score with a concrete example of what it looks like, and include genuinely bad reference outputs so the bottom of the scale is real to the judge.
Anchored scales: the single highest-leverage fix
If you do one thing from this post, do this. Do not ask for "a score from 1 to 5 for quality". That instruction means something slightly different to every model version, and your quality metric becomes a measurement of the judge's mood.
Instead, write what each score is, with examples drawn from real outputs:
routing_accuracy:
1: "Routed to a clearly wrong queue (different department)."
3: "Routed to an adjacent but suboptimal queue."
5: "Routed to the exact correct queue per the queue charter."
Now the judge has fixed reference points. When the judge model changes under you, the anchors hold the scale in place, because a "1" is still pinned to a concrete description rather than the model's free-floating sense of badness. Anchored scales are the difference between a number that drifts with the weather and a number you can put on a chart and trust.
Calibration: trust, then verify the judge itself
The uncomfortable question with any judge is "who judges the judge". The answer is a calibration set: 30 to 50 outputs you have scored by hand, stored alongside the rubric.
Whenever you change the judge model or the rubric, replay the calibration set and look at the delta. If this rubric on this model scores the set at an average of 3.4, and the previous setup scored the same set at 3.5, you have a small, known shift you can account for. If the delta is large, you have changed your instrument, not discovered that your agent got worse, and you catch it before it pollutes the live chart.
Calibration is the step almost everyone skips, and it is exactly the step that separates "we have quality scores" from "we trust our quality scores". A judge you never calibrate is a thermometer you never check against boiling water; it produces confident numbers that may be measuring nothing.
Pin the model, version the rubric
Two operational rules keep the whole thing honest.
Pin the judge model to an explicit version, not a floating alias. If your judge tracks a "latest" pointer, the provider can change your measurement instrument overnight without telling you, and your quality chart will move for reasons that have nothing to do with your agent. This is the exact mechanism behind a class of "our scores dropped and nothing changed" incidents.
Version the rubric, and never silently re-score old runs against a new version. A run is scored against the rubric that was live when it ran, and that score is immutable. When you change the rubric, the dashboard marks the chart where the version flipped, so a step in the trend line is attributable to "we changed how we measure" rather than confused with "the agent changed". Conflating those two is the cardinal sin of quality measurement, and rubric versioning is what prevents it.
How AgentPing runs the judge
In AgentPing the judge runs in the control plane, but the model call goes out against a provider key you supply, so judge spend lands on your invoice and you choose exactly which model judges your agents. Checks run on every run, judges run on the sample rate you set per agent, scores land on the run record next to the cost and the timing, and the dashboard plots the rolling distribution with deploy markers. Calibration sets and rubric versions are first-class objects, because the parts of this that are easy to skip are the parts that decide whether the number is real. See Verify features for the implementation.
The judge is a powerful instrument and a fragile one. Wired carelessly it produces confident noise; wired with anchors, calibration, and a pinned model it produces a trend you can run a team on. If you are shipping prompt changes with no scoring on live traffic, you are flying on vibes. Get started and put a judge on your most customer-facing agent first.