Quality drift: the bug you only find in your support inbox

Quality slid 22% over two weeks. Nothing crashed. The model upgrade had changed how the agent read its system prompt. Continuous scoring closes the gap between "we changed something" and "we know it got worse".

Matt King

April 28, 2026 12 min read

A team we spoke with last month shipped a small prompt change on a Tuesday. The system prompt for their support-triage agent had been a long block of guidance and the change was a clarity edit; a paragraph rearranged, two examples removed, one sentence reworded. The CI eval suite, which ran on a fixed set of 40 test cases, still passed. The PR merged. The change went out.

Two weeks later, the support team noticed they had been getting more "you misunderstood me" replies than usual. They thought it was customer-volume seasonality and ignored it for another four days. When the trend kept going, an engineer pulled a random sample of 100 triage decisions from the last seven days, scored them against the team's quality rubric by hand, and found the rubric score had dropped 22% from the baseline they had measured a month earlier.

The bisect took half a day. The model itself had silently been upgraded by the provider on the Tuesday before, and the new minor version interpreted one paragraph in the system prompt slightly differently. The team's prompt change happened to land on the same day. They had been chasing the prompt change for half the morning before someone thought to check the model version.

Two weeks of degraded triage shipped to customers. No alert fired because no alert had been configured for the thing that was actually breaking, which was the output quality itself. See LLM output quality drift for the diagnostic side of this story.

Why exceptions and latency miss this

Exceptions catch crashes. Latency catches slowdowns. Neither catches an agent that runs fine, returns plausible JSON, and produces subtly worse content. The output is well-formed; the JSON schema validates; the response code is 200. From every measurement an APM tool has access to, the agent is healthy.

The thing that changed is not in any of those signals. It is in the content of the output. A triage decision that used to route 85% of "billing" tickets to the right queue is now routing 70% to the right queue and 15% to a vaguely-adjacent queue. From the outside, the agent is doing the same thing. From the customer experience, it is materially worse.

To catch this you need a signal that reads the output, not just the metadata around it. That is what scoring is for.

Two-layer scoring: checks and judges

Production scoring has two layers and they cost two different things.

Checks are deterministic. A check is anything a small piece of code can verify without an LLM: a JSON schema match, a regex for required fields, a presence-of-key assertion, a value-in-range test for numeric outputs, a length bound on free text. Checks run in milliseconds, cost nothing, and return a hard pass or fail per run. You run them on every single run. They catch structural failures cleanly. They do not catch nuance.

Judges are an LLM call. A separate model reads the agent's output against a rubric and returns a score and a rationale. Judges catch the nuance checks cannot: tone, accuracy, completeness, helpfulness. They cost real money (typically $0.01 to $0.05 per call) and take seconds to run. You do not run them on every run; you sample.

The two layers are complementary. Checks tell you the agent is producing valid-shaped output. Judges tell you the valid-shaped output is good. Most teams run checks at 100% and judges at 5% to 10%. The structural failures get caught for free; the qualitative drift gets caught cheaply.

Writing a rubric that does not drift with the model

The risk with LLM-as-judge is that the judge itself drifts. If your rubric is loose, the judge will interpret it slightly differently as models change, and your "quality score" becomes a measurement of the judge's mood rather than the agent's output.

Two things help. First, anchored scales. Do not write "score 1 to 5 for quality". Write what a 1 looks like, what a 3 looks like, and what a 5 looks like, with concrete examples drawn from real outputs. The judge has actual reference points and the score stops drifting when the model changes.

Second, calibration sets. Pick 50 outputs you have hand-scored, store them with the rubric, and replay them against the judge whenever you change the rubric or the judge model. The replay produces a delta: this rubric on this model gives an average score of 3.4 on the calibration set; the previous rubric on the previous model gave 3.5. If the delta is large, you have changed the measurement, not the thing being measured, and you can adjust before turning the new version on.

Calibration is the bit most teams skip. It is also the bit that separates "we score outputs" from "we trust our scores".

Sample rate: 10% catches drift cheaply

The arithmetic on judge sample rate is straightforward. A judge call costs around $0.02 against a typical model on a typical rubric. An agent doing 10,000 runs a day at 10% sample rate is 1,000 judge calls, which is $20 a day or $600 a month. That is the per-agent cost of catching quality drift in production.

The reason 10% works is that drift detection runs on the distribution, not on individual scores. You are not trying to grade every output. You are trying to notice when the rolling mean of judge scores moves, or when the spread of judge scores widens. A 10% sample over a 24-hour window is 1,000 scored points on the chart. That is more than enough resolution to spot a 0.2-point drop in the rolling mean within hours.

Teams that need higher confidence on a specific agent (a customer-facing one, say) bump the sample to 25% and pay the linear extra cost. Teams running internal agents with lower stakes drop to 2% and accept the lower resolution. The right number is whatever produces a score chart you trust enough to act on.

Drift detection on the score distribution

A score distribution is more informative than a score mean. A drift in the mean is the obvious case: the agent used to score 4.1 on average, this week it is scoring 3.7. A drift in the spread is the subtle case: the mean is still 4.1, but the distribution has gone from "mostly 4s with a few 3s and 5s" to "mostly 4s with a long tail of 1s". The mean does not move; something has clearly broken on a subset of inputs.

AgentPing watches both. The default drift detector compares the rolling 7-day score distribution against the trailing 28-day baseline using a KS-style test. When the distributions diverge past a threshold you set, a drift alert fires with a link to the change in the chart. The alert tells you which way the distribution moved, when the divergence started, and the agent versions in play during both windows. See Verify features for the full implementation.

The deploy markers on the chart are the other half of this. Every run carries an agent version (typically a git SHA or a release tag from your code), and the chart shows a vertical line where the version changed. When drift hits, the line tells you whether it correlates with a recent deploy. In the story above, the team would have seen the drift start the day the model was silently upgraded, with no corresponding deploy marker on their side, which is the signal that the change is external.

A worked example: support triage rubric

A "support triage quality" rubric for the agent in the story above might look like this:

agent: support-triage
version: 3
scale: 1-5
criteria:
  routing_accuracy:
    weight: 0.5
    1: "Routed to a clearly wrong queue (different department)."
    3: "Routed to an adjacent but suboptimal queue."
    5: "Routed to the exact correct queue per the queue charter."
  priority_assignment:
    weight: 0.3
    1: "Priority off by two levels (P3 ticket marked P1)."
    3: "Priority off by one level."
    5: "Priority correct per the SLA matrix."
  customer_summary:
    weight: 0.2
    1: "Summary misses the customer's actual question."
    3: "Summary captures the question but loses key context."
    5: "Summary captures question and context faithfully."
calibration_set: triage_calibration_v3.jsonl

The judge sees the agent's output, the original ticket, and this rubric. It returns a weighted score and a one-line rationale per criterion. The score lands on the run, the rationale lands in the event timeline, and the dashboard plots the rolling distribution.

When the story-team's drift hit, the chart would have shown the routing_accuracy criterion sliding first, two days before the others, with example rationales like "routed billing ticket to general-support" attached to the low-scoring runs. The on-call engineer opens the agent's page, sees the version-marker for the prompt change, sees no corresponding marker but a drop that started a day earlier, and within an hour has identified the provider model upgrade as the candidate. Two weeks becomes one afternoon.

If you ship prompt changes to production and have no scoring harness running on the live traffic, drift is already happening; you just do not measure it yet. Get started and wire up scoring on your most customer-facing agent first.

What is the difference between checks and judges?

A check is deterministic: a JSON schema match, a regex, a presence-of-field assertion, a value-in-range test. It runs in milliseconds, costs nothing, and gives a hard pass or fail. A judge is an LLM call: a separate model reads the agent's output against a rubric and returns a score and a rationale. Judges catch nuance that checks cannot, but they cost real money and take seconds. Most teams run checks on every run and judges on a sample, so structural failures are caught for free and qualitative drift is caught cheaply.

How much does LLM-as-judge cost?

A typical judge call against a strong model is between $0.01 and $0.05 per run, depending on the rubric length and the agent's output size. At a 10% sample rate on an agent doing 10,000 runs a day, that is roughly $10 to $50 a day to score the agent. Most teams find that sampling 5% to 10% is enough to detect drift in the score distribution within hours, which is the only number that matters. You are not trying to grade every output; you are trying to notice when the distribution moves.

Can I bring my own provider key for judge calls?

Yes. The judge runs in the AgentPing control plane, but the model call goes out against a provider key you supply on the team settings page. That means judge spend lands on your OpenAI or Anthropic invoice, not ours, and you can pick exactly which model judges your agents. Teams typically run a stronger model as judge than the agent itself uses; a Sonnet-tier model judging an Haiku-tier agent catches subtle errors the agent will not flag in its own output.

How does drift detection work?

Every scored run produces a number on the rubric's scale (typically 1 to 5). The dashboard plots the rolling mean and the rolling distribution over a window you pick, usually 7 or 14 days. Drift detection runs on the distribution, not just the mean: a shift in the mean is the obvious case, but a widening spread (more 1s and more 5s, same mean) is also a drift signal and the system catches it. You configure a threshold per agent, and a drift alert fires when the current window crosses it relative to the baseline window.

Will a rubric change re-score old runs?

No, and that is deliberate. Rubrics are versioned. A run is scored against the rubric version that was active when it ran, and the score is immutable. If you change the rubric, runs from that point forward are scored against the new version, and the dashboard shows a vertical marker on the chart where the rubric version flipped. Re-scoring old runs against a new rubric is available as an explicit backfill action, but it never happens silently, because conflating "the agent got better" with "we changed how we measure" is exactly the bug you are trying to avoid.