A team we spoke with last month shipped a small prompt change on a Tuesday. The system prompt for their support-triage agent had been a long block of guidance, and the change was a clarity edit: a paragraph rearranged, two examples removed, one sentence reworded. The CI eval suite, which ran on a fixed set of 40 test cases, still passed. The PR merged. The change went out.
Two weeks later, the support team noticed they had been getting more "you misunderstood me" replies than usual. They thought it was customer-volume seasonality and ignored it for another four days. When the trend kept going, an engineer pulled a random sample of 100 triage decisions from the last seven days, scored them against the team's quality rubric by hand, and found the rubric score had dropped 22% from the baseline they had measured a month earlier.
The bisect took half a day. The model itself had silently been upgraded by the provider on the Tuesday before, and the new minor version interpreted one paragraph in the system prompt slightly differently. The team's prompt change happened to land on the same day. They had been chasing the prompt change for half the morning before someone thought to check the model version.
Two weeks of degraded triage shipped to customers. No alert fired because no alert had been configured for the thing that was actually breaking, which was the output quality itself. See LLM output quality drift for the diagnostic side of this story.
Why exceptions and latency miss this
Exceptions catch crashes. Latency catches slowdowns. Neither catches an agent that runs fine, returns plausible JSON, and produces subtly worse content. The output is well-formed; the JSON schema validates; the response code is 200. From every measurement an APM tool has access to, the agent is healthy.
The thing that changed is not in any of those signals. It is in the content of the output. A triage decision that used to route 85% of "billing" tickets to the right queue is now routing 70% to the right queue and 15% to a vaguely adjacent queue. From the outside, the agent is doing the same thing. From the customer's side, it is materially worse.
To catch this you need a signal that reads the output, not just the metadata around it. That is what scoring is for.
Two-layer scoring: checks and judges
Production scoring has two layers and they cost two different things.
Checks are deterministic. A check is anything a small piece of code can verify without an LLM: a JSON schema match, a regex for required fields, a presence-of-key assertion, a value-in-range test for numeric outputs, a length bound on free text. Checks run in milliseconds, cost nothing, and return a hard pass or fail per run. You run them on every single run. They catch structural failures cleanly. They do not catch nuance.
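As a minimal sketch of the check layer, the function below validates one triage output. The field names (`queue`, `priority`, `summary`), the queue list, the `P1` to `P4` priority format, and the 500-character bound are all assumptions made for illustration, not the team's actual schema.

```python
import json
import re

# Hypothetical values for illustration; substitute your own schema.
ALLOWED_QUEUES = {"billing", "technical", "general-support"}
PRIORITY_PATTERN = re.compile(r"P[1-4]")
MAX_SUMMARY_CHARS = 500


def run_checks(raw_output: str) -> list[str]:
    """Return the names of failed checks; an empty list means a pass."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(data, dict):
        return ["is_object"]

    failures = []
    # Presence-of-key assertions.
    for key in ("queue", "priority", "summary"):
        if key not in data:
            failures.append(f"has_{key}")
    # Value-in-set test for the routing decision.
    if "queue" in data and data["queue"] not in ALLOWED_QUEUES:
        failures.append("queue_allowed")
    # Regex for the priority format.
    if "priority" in data and not PRIORITY_PATTERN.fullmatch(str(data["priority"])):
        failures.append("priority_format")
    # Length bound on free text.
    summary = data.get("summary")
    if summary is not None:
        ok = isinstance(summary, str) and 0 < len(summary) <= MAX_SUMMARY_CHARS
        if not ok:
            failures.append("summary_length")
    return failures
```

Returning the list of failed check names, rather than a single boolean, keeps the hard pass or fail (an empty list) while telling you which structural rule broke.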
Judges are an LLM call. A separate model reads the agent's output against a rubric and returns a score and a rationale. Judges catch the nuance checks cannot: tone, accuracy, completeness, helpfulness. They cost real money (typically $0.01 to $0.05 per call) and take seconds to run. You do not run them on every run; you sample.
The two layers are complementary. Checks tell you the agent is producing valid-shaped output. Judges tell you the valid-shaped output is good. Most teams run checks at 100% and judges at 5% to 10%. The structural failures get caught for free; the qualitative drift gets caught cheaply.
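One way to pick which runs go to the judge is to hash the run ID, so the decision is repeatable and needs no shared state. This is a sketch under assumptions: `run_id` is a hypothetical unique string per run, and the 10% default simply mirrors the range quoted above.

```python
import hashlib


def should_judge(run_id: str, sample_rate: float = 0.10) -> bool:
    """Decide deterministically whether a run is sent to the judge."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Runs whose bucket falls below the threshold are sampled.
    return bucket < sample_rate * 2**64
```

Because the decision depends only on the run ID, replaying the same traffic selects the same runs, which makes a before-and-after comparison of judge scores easier to reason about.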
Writing a rubric that does not drift with the model
The risk with LLM-as-judge is that the judge itself drifts. If your rubric is loose, the judge will interpret it slightly differently as models change, and your "quality score" becomes a measurement of the judge's mood rather than the agent's output.
Two things help. First, anchored scales. Do not write "score 1 to 5 for quality". Write what a 1 looks like, what a 3 looks like, and what a 5 looks like, with concrete examples drawn from real outputs. With actual reference points to work from, the judge's scores are far less likely to shift when the model changes.
Second, calibration sets. Pick 50 outputs you have hand-scored, store them with the rubric, and replay them against the judge whenever you change the rubric or the judge model. The replay produces a delta: this rubric on this model gives an average score of 3.4 on the calibration set; the previous rubric on the previous model gave 3.5. If the delta is large, you have changed the measurement, not the thing being measured, and you can adjust before turning the new version on.
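The replay can be sketched as a comparison of two judge configurations over the same hand-scored outputs. The function name, the 0.2 threshold for a "large" delta, and the added error measure against the hand scores are assumptions for illustration; choose a threshold that matches how much movement you would act on.

```python
from statistics import mean


def calibration_report(hand_scores, old_judge_scores, new_judge_scores, max_delta=0.2):
    """Compare two judge configurations on the same calibration outputs."""
    if not (len(hand_scores) == len(old_judge_scores) == len(new_judge_scores)):
        raise ValueError("all score lists must cover the same calibration outputs")
    old_mean = mean(old_judge_scores)
    new_mean = mean(new_judge_scores)
    delta = new_mean - old_mean
    # Mean absolute error against the human scores shows whether the new
    # configuration agrees with the hand labels, not just with the old judge.
    new_mae = mean(abs(j - h) for j, h in zip(new_judge_scores, hand_scores))
    return {
        "old_mean": old_mean,
        "new_mean": new_mean,
        "delta": delta,
        "new_mae_vs_hand": new_mae,
        # True means the measurement itself moved more than max_delta.
        "measurement_changed": abs(delta) > max_delta,
    }
```

If `measurement_changed` comes back true, the rubric or judge model has shifted the scale, and the new configuration should be adjusted before it replaces the old one on live traffic.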
Calibration is the bit most teams skip. It is also the bit that separates "we score outputs" from "we trust our scores".
Sample rate: 10% catches drift cheaply
The arithmetic on judge sample rate is straightforward. A judge call costs around $0.02 against a typical model on a typical rubric. An agent doing 10,000 runs a day at 10% sample rate is 1,000 judge calls, which is $20 a day or $600 a month. That is the per-agent cost of catching quality drift in production.
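The same arithmetic as a small function, so you can plug in your own volume and price. The 30-day month is an assumption, and the $0.02 per call is the rough figure used above rather than a quoted price.

```python
def judge_cost(runs_per_day, sample_rate, cost_per_call, days_per_month=30):
    """Return (judge calls per day, cost per day, cost per month)."""
    calls_per_day = runs_per_day * sample_rate
    cost_per_day = calls_per_day * cost_per_call
    return calls_per_day, cost_per_day, cost_per_day * days_per_month


calls, daily, monthly = judge_cost(10_000, 0.10, 0.02)
print(f"{calls:.0f} calls/day, ${daily:.2f}/day, ${monthly:.2f}/month")
```

Cost scales linearly with the sample rate: at the same volume and price, a 25% sample comes to $1,500 a month and a 2% sample to $120.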
The reason 10% works is that drift detection runs on the distribution, not on individual scores. You are not trying to grade every output. You are trying to notice when the rolling mean of judge scores moves, or when the spread of judge scores widens. A 10% sample over a 24-hour window is 1,000 scored points on the chart. That is more than enough resolution to spot a 0.2-point drop in the rolling mean within hours.
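A back-of-envelope way to see why this resolution is enough is the standard error of the mean. The sketch below assumes judge scores are roughly independent with a standard deviation of 1.0 point, and treats a drop as detectable once it exceeds three standard errors. All three are simplifying assumptions, and the calculation ignores noise in the baseline itself.

```python
import math


def standard_error(score_std, n_scores):
    """Standard error of the mean of n independent scores."""
    return score_std / math.sqrt(n_scores)


def hours_to_detect(drop, score_std, scored_per_hour, z=3.0):
    """Hours of sampled traffic until `drop` exceeds z standard errors."""
    # drop >= z * std / sqrt(n)  implies  n >= (z * std / drop) ** 2
    needed_scores = (z * score_std / drop) ** 2
    return needed_scores / scored_per_hour


print(standard_error(1.0, 1000))
print(hours_to_detect(0.2, 1.0, 1000 / 24))
```

Under those assumptions, 1,000 scores give a standard error of about 0.03 points, and a 0.2-point drop clears three standard errors after roughly 5.4 hours of sampled traffic. A noisier judge or a lower sample rate stretches that window.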
Teams that need higher confidence on a specific agent (a customer-facing one, say) bump the sample to 25% and pay the linear extra cost. Teams running internal agents with lower stakes drop to 2% and accept the lower resolution. The right number is whatever produces a score chart you trust enough to act on.
Drift detection on the score distribution
A score distribution is more informative than a score mean. A drift in the mean is the obvious case: the agent used to score 4.1 on average, this week it is scoring 3.7. A drift in the spread is the subtle case: the mean is still 4.1, but the distribution has gone from "mostly 4s with a few 3s and 5s" to "mostly 4s with a long tail of 1s". The mean does not move; something has clearly broken on a subset of inputs.
AgentPing watches both. The default drift detector compares the rolling 7-day score distribution against the trailing 28-day baseline using a KS-style test. When the distributions diverge past a threshold you set, a drift alert fires with a link to the change in the chart. The alert tells you which way the distribution moved, when the divergence started, and the agent versions in play during both windows. See Verify features for the full implementation.
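To make "KS-style" concrete, here is a pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical CDFs. It is an illustration of the general technique, not AgentPing's implementation, and the 0.1 threshold is a placeholder assumption.

```python
from bisect import bisect_right


def ks_statistic(recent, baseline):
    """Largest gap between the empirical CDFs of two score samples."""
    if not recent or not baseline:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(recent), sorted(baseline)
    gap = 0.0
    # Evaluate both CDFs at every observed score value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drifted(recent, baseline, threshold=0.1):
    """Flag drift when the CDF gap exceeds a threshold you choose."""
    return ks_statistic(recent, baseline) > threshold


# Same mean (4.0), different shape: the recent window has a tail of 1s.
baseline = [3] * 10 + [4] * 80 + [5] * 10
recent = [1] * 10 + [4] * 60 + [5] * 30
print(ks_statistic(recent, baseline), drifted(recent, baseline))
```

The two example samples have identical means, so a mean-only alert would stay quiet, while the CDF gap of 0.2 crosses the placeholder threshold.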
The deploy markers on the chart are the other half of this. Every run carries an agent version (typically a git SHA or a release tag from your code), and the chart shows a vertical line where the version changed. When drift hits, the line tells you whether it correlates with a recent deploy. In the story above, the team would have seen the drift start the day the model was silently upgraded, with no corresponding deploy marker on their side, which is the signal that the change is external.
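The correlation step can be sketched as a lookup: given when the drift started and a list of deploys, return the versions that shipped shortly before. The function name, the 24-hour window, and the example SHAs are hypothetical.

```python
from datetime import datetime, timedelta


def deploys_before_drift(drift_start, deploys, window=timedelta(hours=24)):
    """Return versions deployed within `window` before the drift began.

    `deploys` is a list of (deployed_at, version) tuples.
    """
    return [
        version
        for deployed_at, version in deploys
        if timedelta(0) <= drift_start - deployed_at <= window
    ]


# Hypothetical deploy history for illustration.
deploys = [
    (datetime(2024, 5, 1, 9, 0), "a1b2c3d"),
    (datetime(2024, 5, 14, 15, 0), "e4f5a6b"),
]
print(deploys_before_drift(datetime(2024, 5, 15, 8, 0), deploys))
print(deploys_before_drift(datetime(2024, 5, 8, 8, 0), deploys))
```

An empty result means nothing of yours shipped in the window, which points the investigation at external causes such as a provider-side model change.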
A worked example: support triage rubric
A "support triage quality" rubric for the agent in the story above might look like this:
agent: support-triage
version: 3
scale: 1-5
criteria:
  routing_accuracy:
    weight: 0.5
    1: "Routed to a clearly wrong queue (different department)."
    3: "Routed to an adjacent but suboptimal queue."
    5: "Routed to the exact correct queue per the queue charter."
  priority_assignment:
    weight: 0.3
    1: "Priority off by two levels (P3 ticket marked P1)."
    3: "Priority off by one level."
    5: "Priority correct per the SLA matrix."
  customer_summary:
    weight: 0.2
    1: "Summary misses the customer's actual question."
    3: "Summary captures the question but loses key context."
    5: "Summary captures question and context faithfully."
calibration_set: triage_calibration_v3.jsonl
The judge sees the agent's output, the original ticket, and this rubric. It returns a weighted score and a one-line rationale per criterion. The score lands on the run, the rationale lands in the event timeline, and the dashboard plots the rolling distribution.
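The weighted score can be computed from the per-criterion scores and the rubric weights. A minimal sketch, assuming the judge returns one integer score per criterion keyed by the rubric's criterion names:

```python
# Weights copied from the rubric above.
WEIGHTS = {
    "routing_accuracy": 0.5,
    "priority_assignment": 0.3,
    "customer_summary": 0.2,
}


def weighted_score(criterion_scores, weights=WEIGHTS):
    """Combine per-criterion judge scores into one weighted score."""
    if set(criterion_scores) != set(weights):
        raise ValueError("scores must cover exactly the rubric criteria")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1")
    return sum(weights[name] * score for name, score in criterion_scores.items())


# Adjacent queue, correct priority, faithful summary.
print(weighted_score({
    "routing_accuracy": 3,
    "priority_assignment": 5,
    "customer_summary": 5,
}))
```

With routing weighted at 0.5, a single adjacent-queue routing pulls an otherwise perfect run down to about 4.0, which is why routing problems show up quickly in the rolling distribution.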
When drift hit the team in the story, the chart would have shown the routing_accuracy criterion sliding first, two days before the others, with example rationales like "routed billing ticket to general-support" attached to the low-scoring runs. The on-call engineer opens the agent's page, sees the version marker for the prompt change, sees that the drop started a day earlier with no marker of its own, and within an hour has identified the provider model upgrade as the candidate. Two weeks becomes one afternoon.
If you ship prompt changes to production and have no scoring harness running on live traffic, drift is already happening; you are just not measuring it yet. Get started and wire up scoring on your most customer-facing agent first.