LLM-as-judge in production: how it works, and where it goes wrong

Using a model to score your model is the only way to grade nuance at scale. It is also a measurement instrument that can drift, flatter, and contradict itself. Here is how to run it so the scores mean something.

Matt King

June 11, 2026 10 min read

At some point every team running agents in production hits the same wall. You want to know if the output is good, "good" is not something a regex can check, and you cannot read ten thousand outputs a day by hand. The only thing that scales to grade nuance is another model. So you reach for LLM-as-judge, and you are right to. You are also pointing a measurement instrument at your system that can drift, flatter, and contradict itself if you are not careful about how you wire it.

This is the practical guide to wiring it carefully. For the failure it is built to catch, see quality drift, the bug you only find in your support inbox.

What the judge is actually for

Be precise about the job. You are not trying to assign a true, defensible grade to every individual output. That is the thing LLM judges are weakest at, and chasing it leads to endless arguments about whether run 4,012 deserved a 3 or a 4.

The job is to watch a distribution. You want to know, continuously, whether the quality of your agent's output is holding steady, improving, or sliding. That is a question about the rolling shape of many scores, and it is robust to exactly the per-call noise that makes a single judgment shaky. A judge that is wrong by half a point on any given run, but wrong in a consistent direction, still tells you cleanly when the whole distribution shifts down after a deploy.
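To see why a consistently biased judge still works as a trend instrument, here is a minimal Python sketch. The scores and the half-point bias are invented for illustration, and `distribution_shift` is a hypothetical helper rather than anything from a library.

```python
from statistics import mean


def distribution_shift(before: list[float], after: list[float]) -> float:
    """Mean score of the later window minus mean score of the earlier one."""
    if not before or not after:
        raise ValueError("both windows need at least one score")
    return mean(after) - mean(before)


# Illustrative "true" quality before and after a deploy (made-up numbers).
true_before = [4.0, 3.5, 4.5, 4.0, 3.0, 4.0]
true_after = [3.0, 2.5, 3.5, 3.0, 2.0, 3.0]

# A judge that reads half a point high on every run: wrong, but consistently so.
bias = 0.5
judged_before = [score + bias for score in true_before]
judged_after = [score + bias for score in true_after]

true_shift = distribution_shift(true_before, true_after)
judged_shift = distribution_shift(judged_before, judged_after)
```

Both shifts come out the same, because a constant offset cancels when two windows are scored by the same judge. A real judge's bias will not be perfectly constant, so treat this as the idealised case.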

Hold onto that framing, because it changes every decision that follows. You are building a trend instrument, not a court.

Checks first, judges second

Before any model call, run the cheap layer. A deterministic check is anything code can verify without a model call: a JSON schema match, a required-field assertion, a value-in-range test, a length bound. Checks run on every single output, in milliseconds, at effectively zero cost, and they catch structural failures cleanly.
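A deterministic check layer can be as small as the sketch below. The field names (`queue`, `confidence`, `summary`) and the 500-character bound are assumptions made up for the example; substitute whatever your agent's output contract actually says.

```python
import json

# Hypothetical output contract, for illustration only.
REQUIRED_FIELDS = ("queue", "confidence", "summary")
MAX_SUMMARY_CHARS = 500


def run_checks(raw_output: str) -> list[str]:
    """Return a list of failure messages; an empty list means all checks passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []

    # Required-field assertion.
    for field in REQUIRED_FIELDS:
        if field not in data:
            failures.append(f"missing field: {field}")

    # Value-in-range test (bool is excluded because it is a subclass of int).
    if "confidence" in data:
        value = data["confidence"]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not (is_number and 0.0 <= value <= 1.0):
            failures.append("confidence must be a number in [0, 1]")

    # Length bound.
    if "summary" in data:
        text = data["summary"]
        if not (isinstance(text, str) and 1 <= len(text) <= MAX_SUMMARY_CHARS):
            failures.append(f"summary must be 1 to {MAX_SUMMARY_CHARS} characters")

    return failures
```

Returning every failure rather than stopping at the first makes it easier to see which structural problem dominates when something breaks.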

The judge runs on top of that, on a sample, for the things checks cannot see: is the answer actually correct, is the tone right, did it address what was asked. Checks tell you the output is well-formed. The judge tells you the well-formed output is good. Run checks at 100% and judges at a sample rate of 5 to 10%, and you catch structural breakage for free while paying only cents to catch qualitative drift.
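One way to pick the sample is to hash the run id, so the same run always gets the same decision and a retry does not change whether it is judged. This is a sketch under that assumption; `should_judge` and the run id format are hypothetical.

```python
import hashlib


def should_judge(run_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of runs for judging."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Example: a 10% sample rate over ten thousand hypothetical run ids.
run_ids = [f"run-{i}" for i in range(10_000)]
judged = [run_id for run_id in run_ids if should_judge(run_id, 0.10)]
```

Random sampling works too; the hash-based version is simply easier to reproduce when you want to know why a particular run was or was not judged.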

The four biases, and what to do about each

LLM judges have well-documented failure modes. None of them are disqualifying; all of them have mitigations. Ignoring them is what turns a quality score into a number that feels rigorous and means nothing.

Position bias. In a pairwise "which is better, A or B" setup, judges tend to favour whichever answer appears first. Mitigation: randomise the order, and for high-stakes comparisons run both orderings and average the results. If A wins in both positions, A actually won; if the winner flips with the order, the judge was responding to position, not quality.
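Running both orderings can be sketched as follows. The judge is passed in as a plain function that returns `"first"` or `"second"`, which is an assumption about how your judge call is wrapped. Treating a split verdict as a tie is one reasonable reading of "average", not the only one.

```python
from typing import Callable

# Assumed shape of a pairwise judge: given (first_shown, second_shown),
# it returns "first" or "second" for the answer it prefers.
PairwiseJudge = Callable[[str, str], str]


def debiased_winner(judge: PairwiseJudge, answer_a: str, answer_b: str) -> str:
    """Run both orderings; return "A", "B", or "tie" if the verdict flips."""
    forward = judge(answer_a, answer_b)   # A shown first
    reverse = judge(answer_b, answer_a)   # B shown first

    a_wins_forward = forward == "first"
    a_wins_reverse = reverse == "second"

    if a_wins_forward and a_wins_reverse:
        return "A"
    if not a_wins_forward and not a_wins_reverse:
        return "B"
    return "tie"  # the winner depended on position, not content


# Stub judges for illustration: one fully position-biased, one content-based.
def always_first(first: str, second: str) -> str:
    return "first"


def prefers_longer(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"
```

A fully position-biased judge such as `always_first` can only ever produce a tie here, which is the point: its preference never survives the swap.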

Verbosity bias. Judges reward length. A longer, more padded answer often scores higher than a tighter, better one, because more text reads as more thorough. Mitigation: say so in the rubric ("do not reward length; a complete answer in two sentences scores the same as a complete answer in two paragraphs"), and watch for a creeping correlation between output length and score, which is the tell.
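The length-versus-score tell can be checked with a plain Pearson correlation over recent judged runs. A minimal sketch, with made-up sample data; what counts as a worrying value is left to you.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Illustrative data: output length in characters and the judge's score.
lengths = [120, 340, 560, 800, 1500]
scores = [3, 3, 4, 4, 5]
length_score_correlation = pearson(lengths, scores)
```

Some correlation is legitimate, since complete answers are sometimes longer. The signal to watch is the correlation creeping upward over time while the rubric stays fixed.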

Self-preference. A model rates outputs from its own family more generously. Mitigation: judge with a different model family than the one generating the output. If your agent runs on one provider, judge with another.

Leniency drift. Judges cluster toward the top of the scale, turning a 1-to-5 rubric into a de facto 3-to-5 rubric. Mitigation: anchor every score with a concrete example of what it looks like, and include genuinely bad reference outputs so the bottom of the scale is real to the judge.
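Leniency is easy to spot once you look at how much of the scale the judge actually uses. The sketch below assumes integer scores on a 1-to-5 scale; the helper names and sample scores are hypothetical.

```python
from collections import Counter


def score_histogram(scores: list[int], scale: tuple[int, ...] = (1, 2, 3, 4, 5)) -> dict[int, float]:
    """Share of judged runs at each point on the scale."""
    if not scores:
        raise ValueError("no scores to summarise")
    counts = Counter(scores)
    return {point: counts.get(point, 0) / len(scores) for point in scale}


def low_end_share(scores: list[int], cutoff: int = 2) -> float:
    """Share of scores at or below `cutoff`."""
    if not scores:
        raise ValueError("no scores to summarise")
    return sum(1 for s in scores if s <= cutoff) / len(scores)


# Illustrative scores from a judge that never goes below 3.
recent_scores = [5, 4, 5, 4, 3, 5, 4, 4]
histogram = score_histogram(recent_scores)
bottom_share = low_end_share(recent_scores)
```

A bottom share of zero is only a problem if you know bad outputs exist, which is why known-bad reference outputs are worth including: they should land at the low end, and if they do not, the scale has collapsed.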

Anchored scales: the single highest-leverage fix

If you do one thing from this post, do this. Do not ask for "a score from 1 to 5 for quality". That instruction means something slightly different to every model version, and your quality metric becomes a measurement of the judge's mood.

Instead, write what each score is, with examples drawn from real outputs:

routing_accuracy:
  1: "Routed to a clearly wrong queue (different department)."
  3: "Routed to an adjacent but suboptimal queue."
  5: "Routed to the exact correct queue per the queue charter."

Now the judge has fixed reference points. When the judge model changes under you, the anchors hold the scale in place, because a "1" is still pinned to a concrete description rather than the model's free-floating sense of badness. Anchored scales are the difference between a number that drifts with the weather and a number you can put on a chart and trust.
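Turning an anchored rubric into a judge prompt might look like the sketch below. The rubric text is the routing example from this post; the prompt wording, the rule for scores 2 and 4, and `build_judge_prompt` itself are assumptions for illustration.

```python
RUBRIC = {
    "name": "routing_accuracy",
    "anchors": {
        1: "Routed to a clearly wrong queue (different department).",
        3: "Routed to an adjacent but suboptimal queue.",
        5: "Routed to the exact correct queue per the queue charter.",
    },
}


def build_judge_prompt(rubric: dict, agent_input: str, agent_output: str) -> str:
    """Render the rubric anchors and the run under review into one prompt."""
    anchor_lines = [
        f"  {score}: {text}" for score, text in sorted(rubric["anchors"].items())
    ]
    parts = [
        f"Score the output on '{rubric['name']}' using this anchored scale:",
        *anchor_lines,
        # Assumed convention: unanchored points are for in-between cases.
        "Use 2 or 4 only when the output falls between two anchors.",
        "Reply with the integer score on the first line, then a one-sentence rationale.",
        "",
        "INPUT:",
        agent_input,
        "",
        "OUTPUT:",
        agent_output,
    ]
    return "\n".join(parts)


prompt = build_judge_prompt(RUBRIC, "My invoice is wrong", '{"queue": "billing"}')
```

Keeping the anchors in data rather than hard-coded in a prompt string means the same rubric object can be stored and replayed unchanged.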

Calibration: trust, then verify the judge itself

The uncomfortable question with any judge is "who judges the judge". The answer is a calibration set: 30 to 50 outputs you have scored by hand, stored alongside the rubric.

Whenever you change the judge model or the rubric, replay the calibration set and look at the delta. If this rubric on this model scores the set at an average of 3.4, and the previous setup scored the same set at 3.5, you have a small, known shift you can account for. If the delta is large, you have changed your instrument, not discovered that your agent got worse, and you catch it before it pollutes the live chart.
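The replay step reduces to scoring the same fixed set with the old and new setups and comparing averages. In this sketch the judges are stub functions backed by lookup tables, and the 0.2 threshold is an arbitrary placeholder; both are assumptions, as is the four-item set (a real one would hold the 30 to 50 outputs described above).

```python
from statistics import mean
from typing import Callable

Judge = Callable[[str], float]

# Hypothetical calibration set: outputs scored by hand.
CALIBRATION_SET = [
    {"output": "ticket-a", "human_score": 4},
    {"output": "ticket-b", "human_score": 3},
    {"output": "ticket-c", "human_score": 5},
    {"output": "ticket-d", "human_score": 2},
]


def replay_mean(calibration_set: list[dict], judge: Judge) -> float:
    """Average score the judge gives the calibration set."""
    return mean(judge(item["output"]) for item in calibration_set)


def mean_abs_error(calibration_set: list[dict], judge: Judge) -> float:
    """Average distance between the judge's score and the hand score."""
    return mean(abs(judge(item["output"]) - item["human_score"]) for item in calibration_set)


# Stub judges standing in for "previous setup" and "new setup".
OLD_SCORES = {"ticket-a": 4, "ticket-b": 3, "ticket-c": 5, "ticket-d": 2}
NEW_SCORES = {"ticket-a": 4, "ticket-b": 3, "ticket-c": 4, "ticket-d": 2}


def old_judge(output: str) -> float:
    return OLD_SCORES[output]


def new_judge(output: str) -> float:
    return NEW_SCORES[output]


delta = replay_mean(CALIBRATION_SET, new_judge) - replay_mean(CALIBRATION_SET, old_judge)
MAX_ACCEPTABLE_DELTA = 0.2  # placeholder threshold, pick your own
instrument_changed = abs(delta) > MAX_ACCEPTABLE_DELTA
```

The delta tells you how far the instrument moved between setups; the error against hand scores tells you whether the new setup is closer to or further from your own judgment.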

Calibration is the step almost everyone skips, and it is exactly the step that separates "we have quality scores" from "we trust our quality scores". A judge you never calibrate is a thermometer you never check against boiling water; it produces confident numbers that may be measuring nothing.

Pin the model, version the rubric

Two operational rules keep the whole thing honest.

Pin the judge model to an explicit version, not a floating alias. If your judge tracks a "latest" pointer, the provider can change your measurement instrument overnight without telling you, and your quality chart will move for reasons that have nothing to do with your agent. This is the exact mechanism behind a class of "our scores dropped and nothing changed" incidents.
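A config-time guard can catch a floating alias before it reaches production. The sketch assumes pinned identifiers end in a date stamp and that aliases contain the word "latest". Naming schemes differ between providers, so treat the pattern as a placeholder, and note that the model names below are invented.

```python
import re

# Assumption: a pinned identifier ends in YYYY-MM-DD or YYYYMMDD.
_DATE_SUFFIX = re.compile(r"(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Heuristic: True if the identifier looks like an explicit dated version."""
    lowered = model_id.lower()
    if "latest" in lowered:
        return False
    return bool(_DATE_SUFFIX.search(lowered))


def require_pinned(model_id: str) -> str:
    """Raise at config load time rather than discovering drift on the chart."""
    if not is_pinned(model_id):
        raise ValueError(f"judge model {model_id!r} is not pinned to a version")
    return model_id


# Invented model names, for illustration only.
judge_model = require_pinned("example-judge-2026-01-15")
```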

Version the rubric, and never silently re-score old runs against a new version. A run is scored against the rubric that was live when it ran, and that score is immutable. When you change the rubric, the dashboard marks the chart where the version flipped, so a step in the trend line is attributable to "we changed how we measure" rather than confused with "the agent changed". Conflating those two is the cardinal sin of quality measurement, and rubric versioning is what prevents it.
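In code, the two halves of that rule are an immutable score record that carries its rubric version, and a way to find where the version flips so the chart can be marked. A minimal sketch; `ScoreRecord` and `version_markers` are hypothetical names, not AgentPing's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored score cannot be reassigned later
class ScoreRecord:
    run_id: str
    rubric_version: int
    score: float


def version_markers(records: list[ScoreRecord]) -> list[str]:
    """Run ids where the rubric version differs from the previous run.

    Assumes `records` is already in chronological order.
    """
    markers = []
    for previous, current in zip(records, records[1:]):
        if current.rubric_version != previous.rubric_version:
            markers.append(current.run_id)
    return markers


records = [
    ScoreRecord("run-1", rubric_version=1, score=4.0),
    ScoreRecord("run-2", rubric_version=1, score=3.5),
    ScoreRecord("run-3", rubric_version=2, score=3.0),
    ScoreRecord("run-4", rubric_version=2, score=3.0),
]
markers = version_markers(records)
```

Here the drop from 3.5 to 3.0 lines up with the marker at `run-3`, so it reads as a measurement change until shown otherwise.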

How AgentPing runs the judge

In AgentPing the judge runs in the control plane, but the model call goes out against a provider key you supply, so judge spend lands on your invoice and you choose exactly which model judges your agents. Checks run on every run, judges run on the sample rate you set per agent, scores land on the run record next to the cost and the timing, and the dashboard plots the rolling distribution with deploy markers. Calibration sets and rubric versions are first-class objects, because the parts of this that are easy to skip are the parts that decide whether the number is real. See Verify features for the implementation.

The judge is a powerful instrument and a fragile one. Wired carelessly it produces confident noise; wired with anchors, calibration, and a pinned model it produces a trend you can run a team on. If you are shipping prompt changes with no scoring on live traffic, you are flying on vibes. Get started and put a judge on your most customer-facing agent first.

What is LLM-as-judge?

LLM-as-judge is the practice of using one language model to score the output of another against a rubric. The judge reads the agent's input and output, applies a scoring rubric you wrote, and returns a number plus a rationale. It exists because the qualities you care about most in agent output (accuracy, tone, completeness, helpfulness) cannot be checked with a regex or a schema, and grading them by hand does not scale past a few dozen examples a week.

Is LLM-as-judge reliable enough for production?

It is reliable for detecting changes in a score distribution over time, which is what production monitoring actually needs. It is less reliable as an absolute grade of a single output. The trick is to use it for the job it is good at: you are watching whether the rolling distribution of scores moves, not whether run 4,012 truly deserved a 3 rather than a 4. Aggregate trends are robust to the per-call noise that makes individual judgments shaky.

What are the known biases in LLM judges?

The well-documented ones are position bias (preferring whichever answer is shown first in a pairwise comparison), verbosity bias (scoring longer answers higher regardless of quality), self-preference (a model rating its own family of outputs more generously), and leniency drift (clustering scores toward the top of the scale). Each has a mitigation: randomise position, tell the judge in the rubric not to reward length and watch for a correlation between length and score, judge with a different model family than the one being judged, and anchor your scale with concrete examples of what each score looks like.

Should the judge be a stronger model than the agent?

Usually yes. A judge needs to catch errors the agent did not catch in itself, so giving the judge more capability than the agent buys you genuine signal. A common setup runs a Sonnet-tier judge over a Haiku-tier agent, although that pairing stays within one model family, so if self-preference is a concern, pick the stronger judge from a different provider. Using the same model to judge itself invites self-preference bias and means the judge is blind to exactly the failure modes the agent is prone to, since they share them.

How do I stop my judge scores from drifting when the model updates?

Pin the judge model version explicitly rather than tracking a floating alias, anchor your rubric with concrete per-score examples so the judge has fixed reference points, and keep a calibration set of hand-scored outputs you replay whenever you change the judge model or the rubric. The replay produces a delta; if the average score on a fixed set moves, you changed the instrument, not the thing being measured, and you can correct before turning the new version on.