A prompt change last Monday, a model auto-upgrade, a system prompt edit. Something shifted, and the answers your agent returns have been quietly degrading since. Production tells you everything is healthy. Support is filling up with tickets that sound similar. By the time someone bisects, the change is two weeks old.
LLM output is non-deterministic, so any single response could be a one-off. A small quality drop on a single run is indistinguishable from normal variance. Drift only shows up in the aggregate: a moving average dipping below threshold, a series of low-scoring runs sharing the same failure mode, a distribution that has subtly shifted left.
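To make "drift only shows up in the aggregate" concrete, here is a minimal sketch of a rolling-average check. The function name, the window of 5 runs, and the 4.0 threshold are illustrative assumptions, not AgentPing's actual detector.

```python
from collections import deque


def rolling_mean_alerts(scores, window=5, threshold=4.0):
    """Return the indices where the mean of the last `window` scores is below `threshold`.

    Hypothetical helper for illustration; window and threshold are assumed values.
    """
    recent = deque(maxlen=window)
    alerts = []
    for i, score in enumerate(scores):
        recent.append(score)
        # Only evaluate once the window is full, so early runs cannot trigger an alert.
        if len(recent) == window and sum(recent) / window < threshold:
            alerts.append(i)
    return alerts


# One bad run (3.0) early on, then a slow slide in the later runs.
history = [4.6, 4.4, 3.0, 4.5, 4.7, 4.1, 3.8, 3.6, 3.7, 3.5]
print(rolling_mean_alerts(history))
```

In this sketch the isolated 3.0 never pulls the window below the threshold, while the sustained slide at the end does. That is the distinction between variance and drift.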
Two layers of scoring catch drift between them. Deterministic checks execute on every run at zero LLM cost. An LLM-as-judge then scores a sampled subset on the qualitative dimensions (tone, correctness, completeness) that deterministic checks can't reach.
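As a rough illustration of the deterministic layer, the sketch below validates a triage agent's output without calling any model. The output shape, the department names, and the check names are all hypothetical; substitute whatever structure your agent actually emits.

```python
import json

# Hypothetical department list, assumed for this example only.
ALLOWED_DEPARTMENTS = ("billing", "technical", "account")


def deterministic_checks(raw_output: str) -> list[str]:
    """Return the names of failed checks; an empty list means every check passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["valid_json"]
    if not isinstance(data, dict):
        return ["is_object"]

    failures = []
    if data.get("department") not in ALLOWED_DEPARTMENTS:
        failures.append("known_department")
    severity = data.get("severity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_int = isinstance(severity, int) and not isinstance(severity, bool)
    if not is_int or not 1 <= severity <= 5:
        failures.append("severity_in_range")
    return failures
```

Checks like these are cheap enough to run on every execution, which leaves the sampled judge budget for the dimensions they cannot see.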
agent: support-triage
rubric: |
  Score this support ticket triage on three dimensions:
  1. Correct department (1-5)
  2. Severity matches issue urgency (1-5)
  3. Customer tone preserved (1-5)
  Output JSON: {"department": int, "severity": int, "tone": int, "reasoning": string}
judge_model: claude-haiku-4-5
sample_rate: 0.10        # 10% of runs, plus all failures
pass_threshold: 4.0      # average across the three dimensions
calibration_anchors:
  good:
    - run_id: run_018f3a2b9c1d7e8fa4b9c2d7e8f1a3b6
  bad:
    - run_id: run_018f3a2b9c1d7e8fa4b9c2d7e8f1a3d4
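The two numeric settings in that config can be read as follows. This is a sketch under stated assumptions: the function names are invented, a score exactly at the threshold is assumed to pass, and "failures" is assumed to mean runs that already failed an earlier check.

```python
import json
import random

PASS_THRESHOLD = 4.0  # from the config: average across the three dimensions
SAMPLE_RATE = 0.10    # from the config: 10% of runs, plus all failures
DIMENSIONS = ("department", "severity", "tone")


def judge_verdict(judge_json: str) -> tuple[float, bool]:
    """Average the three rubric dimensions and compare against the pass threshold."""
    scores = json.loads(judge_json)
    average = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    # Assumption: a score exactly equal to the threshold counts as a pass.
    return average, average >= PASS_THRESHOLD


def should_judge(run_failed: bool, rng: random.Random) -> bool:
    """Send every failed run to the judge, plus a random sample of the rest."""
    return run_failed or rng.random() < SAMPLE_RATE


average, passed = judge_verdict(
    '{"department": 5, "severity": 4, "tone": 3, "reasoning": "tone flattened"}'
)
print(average, passed)
```

Passing an explicit `random.Random` instance keeps the sampling decision reproducible in tests.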
The rubric is plain English. The judge sees your anchor outputs in-context, so scores stay comparable across versions. Rubrics are versioned: edit and save, and if that takes you from v3 to v4, say, new runs score against v4, historical scores stay tagged to v3, and drift detection runs against the version-aware baseline.
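One way to picture a version-aware baseline: compare recent scores only against earlier scores produced under the same rubric version. The sketch below is an assumption about the idea, not AgentPing's implementation; the record shape, `recent_n`, and `max_drop` are hypothetical.

```python
from statistics import mean


def drifted(records, current_version, recent_n=3, max_drop=0.3):
    """Compare the latest scores with earlier scores from the same rubric version.

    `records` is a chronological list of (rubric_version, score) pairs.
    Returns None when there is not yet enough data under the current version.
    """
    same_version = [score for version, score in records if version == current_version]
    if len(same_version) <= recent_n:
        return None
    baseline = mean(same_version[:-recent_n])
    recent = mean(same_version[-recent_n:])
    return (baseline - recent) > max_drop


records = [
    ("v3", 3.0), ("v3", 3.2),               # scored under the old rubric, ignored for v4
    ("v4", 4.5), ("v4", 4.6), ("v4", 4.4),  # v4 baseline
    ("v4", 3.9), ("v4", 3.8), ("v4", 4.0),  # most recent v4 runs
]
print(drifted(records, "v4"))
```

Mixing the v3 scores into the baseline would hide the drop, because the older rubric scored on a different scale.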
Verify is the quality-scoring side of AgentPing. The features page walks through the rubric format, calibration anchors, sample rate dial, and drift detection thresholds. The docs go one level deeper.