000 verify · symptom

Your LLM output is getting worse. The dashboard says green.

A prompt change last Monday, a model auto-upgrade, a system prompt edit. Something shifted, and the answers your agent returns have been quietly degrading since. Production tells you everything is healthy. Support is filling up with tickets that sound similar. By the time someone bisects, the change is two weeks old.

001 how quality drift hides

The output looks fine. Until you compare it to last week.

LLM output is non-deterministic, so any single response could be a one-off. A small quality drop on a single run is indistinguishable from normal variance. Drift only shows up in the aggregate: a moving average dipping below threshold, a series of low-scoring runs sharing the same failure mode, a score distribution that has subtly shifted toward lower values.

Where the signal is and isn't

  • Latency dashboards. Useless for quality. A degraded answer arrives in the same 800ms as a good one.
  • Error rate. Useless for quality. The agent didn't throw; it confidently returned the wrong thing.
  • Offline evals. Run once at deploy time, never again. Production traffic has a different distribution from your test set, and that's where drift lives.
  • Customer complaints. The most reliable signal, and the last to arrive. By the time tickets come in, the regression has been shipping output to everyone for days.

002 what catches it

Score every run, watch the distribution, page on drift.

Two layers of scoring catch drift between them. Deterministic checks run on every output at zero LLM cost. LLM-as-judge runs on a sampled subset and covers the qualitative dimensions (tone, correctness, completeness) that deterministic checks can't reach.

A real rubric

support-triage · rubric_v3.yaml judge config
agent: support-triage
rubric: |
  Score this support ticket triage on three dimensions:
  1. Correct department (1-5)
  2. Severity matches issue urgency (1-5)
  3. Customer tone preserved (1-5)
  Output JSON: {"department": int, "severity": int,
                "tone": int, "reasoning": string}
judge_model: claude-haiku-4-5
sample_rate: 0.10            # 10% of runs, plus all failures
pass_threshold: 4.0          # average across the three dimensions
calibration_anchors:
  good:
    - run_id: run_018f3a2b9c1d7e8fa4b9c2d7e8f1a3b6
  bad:
    - run_id: run_018f3a2b9c1d7e8fa4b9c2d7e8f1a3d4

The rubric is plain English. The judge sees your anchor outputs in-context, so scores stay comparable across versions. When you edit and save, new runs score against v4, historical scores stay tagged to v3, and drift detection runs against a version-aware baseline.
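
To make the pass threshold concrete, here is a minimal sketch of how one judge response could be turned into a pass/fail result. It assumes the judge returns exactly the JSON shape named in the rubric and that pass_threshold applies to the mean of the three dimensions, as the config comment says. The function name score_judge_output and the result fields are hypothetical, not part of any documented API.

```python
import json
from statistics import mean

DIMENSIONS = ("department", "severity", "tone")  # keys from the rubric's output JSON
PASS_THRESHOLD = 4.0  # pass_threshold from the config above


def score_judge_output(raw: str, rubric_version: str) -> dict:
    """Parse one judge response and decide pass/fail on the mean score.

    Hypothetical helper: the name and the returned fields are assumptions.
    """
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        value = data[dim]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    average = mean(scores.values())
    return {
        "rubric_version": rubric_version,  # keeps the score tagged to its rubric
        "scores": scores,
        "average": average,
        "passed": average >= PASS_THRESHOLD,
        "reasoning": data.get("reasoning", ""),
    }
```

Validating the 1-5 range before averaging keeps a malformed judge response from silently skewing the distribution that drift detection reads.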

The three things to instrument

  • Deterministic checks on every output. JSON schema, regex, length bounds, required fields, tool-call assertions, numeric range. Six primitives at zero LLM cost that catch shape regressions before they reach a customer.
  • LLM-as-judge on a sampled subset. Plain-English rubric, calibration anchors, hard per-team spend cap (default £50/month). Catches qualitative drift that deterministic checks can't see.
  • Drift detection on the score distribution. z-score against the trailing 14-day baseline. A single 2/5 is noise; the mean dropping from 4.2 to 3.8 over a week is signal. The drop is flagged on the dashboard with the low-scoring runs and the judge's reasoning.
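
The drift check in the last bullet can be sketched as a z-score of the current mean against a trailing baseline. This is a minimal illustration, not the product's implementation: it assumes the baseline is a list of 14 daily mean scores, and the function names and the alert threshold of -2.0 are hypothetical.

```python
from statistics import mean, stdev


def drift_z_score(baseline_daily_means: list[float], current_mean: float) -> float:
    """Z-score of the current mean against the trailing baseline.

    Assumes one mean score per day in the baseline (for example, 14 values).
    """
    if len(baseline_daily_means) < 2:
        raise ValueError("need at least two baseline points to estimate spread")
    mu = mean(baseline_daily_means)
    sigma = stdev(baseline_daily_means)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is infinitely surprising.
        if current_mean == mu:
            return 0.0
        return float("-inf") if current_mean < mu else float("inf")
    return (current_mean - mu) / sigma


def is_drifting(baseline_daily_means: list[float], current_mean: float,
                z_threshold: float = -2.0) -> bool:
    """Flag only downward drift; the threshold value is a hypothetical default."""
    return drift_z_score(baseline_daily_means, current_mean) <= z_threshold


# Fourteen daily means hovering around 4.2, as in the example above.
baseline = [4.2, 4.3, 4.1, 4.2, 4.25, 4.15, 4.2,
            4.3, 4.1, 4.2, 4.15, 4.25, 4.2, 4.2]
print(is_drifting(baseline, 3.8))   # a weekly mean of 3.8 sits far below the baseline
print(is_drifting(baseline, 4.15))  # an ordinary wobble stays inside the threshold
```

Because the comparison is on means rather than individual scores, a single 2/5 barely moves the result, which is the property the bullet describes.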

003 frequently asked

My evals pass in CI. Why do I need scoring in production?
Offline evals run on a fixed test set. Production scoring runs on real user input distribution, which drifts over time. A prompt change can pass every CI test and still produce worse output on the long tail of inputs you didn't test. Production scoring is the only place that distribution lives.

Doesn't LLM-as-judge cost a lot?
Only if you score every run. The sensible default is to sample (10% of normal runs plus 100% of failures), bias toward high-cost and failed runs, and cap the per-team budget hard (default £50/month). The judge is also a cheap model: Claude Haiku 4.5 by default. Cost is bounded; the signal is not.
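
As a rough sketch of that sampling policy: judge every failed run, sample 10% of the rest, and stop once the monthly cap would be exceeded. The function name, the cost-estimate parameter, and the way spend is tracked are assumptions for illustration. The bias toward high-cost runs is not modeled here.

```python
import random

SAMPLE_RATE = 0.10       # 10% of normal runs
MONTHLY_CAP_GBP = 50.0   # hard per-team cap (the stated default)


def should_judge(run_failed: bool, spent_gbp: float, est_cost_gbp: float,
                 rng: random.Random) -> bool:
    """Decide whether to send one run to the judge model.

    Hypothetical helper: spent_gbp is the team's judge spend so far this
    month and est_cost_gbp is an estimate for this one judge call.
    """
    # The cap is hard: once the next call would exceed it, nothing is judged.
    if spent_gbp + est_cost_gbp > MONTHLY_CAP_GBP:
        return False
    # Failed runs are always judged while budget remains.
    if run_failed:
        return True
    # Everything else is sampled at the configured rate.
    return rng.random() < SAMPLE_RATE
```

Passing the random generator in as an argument keeps the decision reproducible in tests, for example with random.Random(0).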

How do you keep judge scores comparable as the rubric evolves?
Calibration anchors. You tag a handful of good and bad runs by ID, and the judge prompt includes them in-context for every call. The anchors hold the scale stable so that "a 4 today" means roughly what "a 4 last week" meant, even when you edit the rubric. Each rubric edit creates a new version and scores are tagged with the version that generated them.

What's the difference between deterministic checks and LLM-as-judge?
Deterministic checks (JSON schema validation, regex, length bounds, required fields, tool-call assertions, numeric range) run on every output at zero LLM cost. They answer pass/fail questions about shape. LLM-as-judge answers quality questions about content (tone, correctness, completeness) on a sampled subset. Most teams run both layers.
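
A minimal sketch of the deterministic layer, using only the standard library, might look like the following. It covers four of the six primitives (required fields, regex, numeric range, length bounds) on a triage output. The field names, the allowed department values, and the length limits are hypothetical, chosen only to make the example concrete.

```python
import json
import re

# Hypothetical constraints for a triage output.
REQUIRED_FIELDS = ("department", "severity", "summary")
ALLOWED_DEPARTMENT = re.compile(r"billing|technical|account")
SUMMARY_LENGTH = (10, 500)


def check_output(raw: str) -> list[str]:
    """Return the names of failed checks; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(data, dict):
        return ["json_object"]

    failures = []
    # Required fields: a missing or null value counts as absent.
    for field in REQUIRED_FIELDS:
        if data.get(field) is None:
            failures.append(f"required:{field}")

    # Regex: department must be one of the allowed values.
    department = data.get("department")
    if department is not None and not (
        isinstance(department, str) and ALLOWED_DEPARTMENT.fullmatch(department)
    ):
        failures.append("regex:department")

    # Numeric range: severity must be an integer from 1 to 5.
    severity = data.get("severity")
    if severity is not None and not (
        isinstance(severity, int) and not isinstance(severity, bool) and 1 <= severity <= 5
    ):
        failures.append("range:severity")

    # Length bounds: summary must be a string within the limits.
    summary = data.get("summary")
    if summary is not None and not (
        isinstance(summary, str) and SUMMARY_LENGTH[0] <= len(summary) <= SUMMARY_LENGTH[1]
    ):
        failures.append("length:summary")
    return failures
```

Returning the list of failed check names, rather than a single boolean, makes it possible to group low-scoring runs by shared failure mode.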

How does drift detection avoid flapping on single bad runs?
Drift runs on the distribution, not single scores. A single 2/5 is noise. The mean moving from 4.2 to 3.8 over a week is signal. Statistical change detection (z-score against the trailing 14-day baseline) flags it on the dashboard, with the low-scoring runs and the judge's reasoning, so you see the root cause, not just the symptom.

004 read next

How AgentPing implements quality scoring.

Verify is the quality-scoring side of AgentPing. The features page walks through the rubric format, calibration anchors, sample rate dial, and drift detection thresholds. The docs go one level deeper.

  • Verify features
  • Verify docs
  • What is AI agent observability?