Verify
Is quality holding? An agent that runs cleanly can still produce degraded output. Verify adds a structured quality signal to every run.
Two layers
| Layer | What | Tier |
|---|---|---|
| Deterministic checks | Server-side rules: JSON schema, regex, length, required fields, tool calls | All tiers, no LLM cost |
| LLM-as-judge rubrics | A low-cost model scores sampled runs against a rubric | Team and up |
Each layer is opt-in per agent: run checks alone, judges alone, or both. Scores reported by your own code via `run.score(name, value)` feed the same distribution.
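To make the deterministic layer concrete, here is a minimal sketch of the kind of rules it covers (valid JSON, required fields, a regex, a length limit). It is illustrative only: the function name `run_checks` and its parameters are hypothetical, and Verify's real checks run server-side with their own configuration.

```python
import json
import re


def run_checks(output: str, required_fields: list[str],
               max_length: int, pattern: str) -> dict[str, bool]:
    """Return a pass/fail result per check for one run's raw output.

    Hypothetical helper for illustration; not part of the Verify API.
    """
    results: dict[str, bool] = {}

    # Length check: the raw output must not exceed the limit.
    results["length"] = len(output) <= max_length

    # Schema-style check: the output must parse as a JSON object.
    try:
        parsed = json.loads(output)
        results["valid_json"] = isinstance(parsed, dict)
    except json.JSONDecodeError:
        parsed = None
        results["valid_json"] = False

    # Required fields are only meaningful when the JSON parsed as an object.
    results["required_fields"] = results["valid_json"] and all(
        field in parsed for field in required_fields
    )

    # Regex check against the raw text.
    results["pattern"] = re.search(pattern, output) is not None
    return results


good = run_checks('{"answer": "ok", "source": "doc-1"}',
                  ["answer", "source"], 200, r'"source":\s*"doc-\d+"')
bad = run_checks("not json", ["answer", "source"], 200, r'"source":\s*"doc-\d+"')
```

Because every rule is a pure function of the output, checks like these cost nothing in LLM spend and can run on every run rather than a sample.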
Drift detection
The signal isn't any single score; it's the distribution shifting. Verify maintains rolling 24h, 7d, and 30d averages per agent. Statistical change detection compares recent scores against the historical mean (last 14 days) and fires an alert when the distribution shifts beyond the threshold. Alerts go out on the same routes Pulse uses.
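The text above does not name the statistical test Verify uses, so the following is only one plausible sketch: a z-test of the recent mean against a 14-day baseline. The function name `drift_alert`, the `z_threshold` default, and the sample scores are all assumptions made for illustration.

```python
from statistics import mean, stdev


def drift_alert(baseline: list[float], recent: list[float],
                z_threshold: float = 3.0) -> bool:
    """Return True when the recent mean is more than z_threshold
    standard errors away from the baseline mean.

    Assumed technique (z-test on the mean); the actual detector may differ.
    """
    if len(baseline) < 2 or not recent:
        return False  # not enough data to judge
    base_mean = mean(baseline)
    base_sd = stdev(baseline)
    if base_sd == 0:
        # A perfectly flat baseline: any change at all counts as a shift.
        return mean(recent) != base_mean
    standard_error = base_sd / len(recent) ** 0.5
    z = (mean(recent) - base_mean) / standard_error
    return abs(z) > z_threshold


# Fourteen hypothetical daily average scores as the baseline.
baseline = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.78] * 2
steady = drift_alert(baseline, [0.80, 0.81, 0.79])
dropped = drift_alert(baseline, [0.60, 0.62, 0.58])
```

Comparing window means rather than individual scores is what keeps a single bad run from firing an alert while a sustained drop still does.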
Low-scoring runs are clustered by the embedding of the judge's reasoning, so the dashboard tells you not just that quality dropped but, for example, that 22% of last week's failures share a root cause.
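As a rough illustration of how grouping by embedding can surface a shared root cause, the sketch below clusters vectors greedily by cosine similarity and reports the share held by the largest group. The clustering method, the 0.9 threshold, and the toy two-dimensional vectors are assumptions; the source does not say which algorithm or embedding model Verify uses.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(embeddings: list[list[float]], threshold: float = 0.9) -> list[list[int]]:
    """Greedy clustering: each vector joins the first cluster whose seed
    (first member) is at least `threshold` similar, else starts a new one."""
    clusters: list[list[int]] = []
    for i, vec in enumerate(embeddings):
        for members in clusters:
            if cosine(embeddings[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def largest_cluster_share(embeddings: list[list[float]], threshold: float = 0.9) -> float:
    """Fraction of items that fall into the biggest cluster."""
    if not embeddings:
        return 0.0
    return max(len(c) for c in cluster(embeddings, threshold)) / len(embeddings)


# Toy stand-ins for embeddings of judge reasoning on five low-scoring runs.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.98, 0.1], [0.05, 0.99]]
groups = cluster(vectors)
share = largest_cluster_share(vectors)
```

Here three of the five runs land in one group, which is the kind of figure a statement like "N% of failures share a root cause" would be built from.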
Cost transparency
Every team has a hard spend cap (default £50/month, configurable). At 80% of the cap an email goes out; at 100% the judge pauses for the rest of the month. Sample rate is the dial: 10% by default, configurable from 1% to 100%. Failed runs are always sampled, regardless of the sample rate.
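The cap and sampling rules can be sketched as two small functions. The names `cap_status` and `should_judge` are hypothetical. The code also assumes that the 100% pause takes precedence over the "failed runs are always sampled" rule, since the cap is described as hard; the source does not state that ordering explicitly.

```python
import random
from typing import Callable


def cap_status(spend: float, cap: float = 50.0) -> str:
    """Map month-to-date judge spend to a cap state.

    'warned' corresponds to the 80% email, 'paused' to the 100% stop.
    """
    ratio = spend / cap
    if ratio >= 1.0:
        return "paused"
    if ratio >= 0.8:
        return "warned"
    return "ok"


def should_judge(run_failed: bool, spend: float, cap: float = 50.0,
                 sample_rate: float = 0.10,
                 rng: Callable[[], float] = random.random) -> bool:
    """Decide whether a run is sent to the LLM judge."""
    if cap_status(spend, cap) == "paused":
        return False  # assumption: the hard cap wins, even for failed runs
    if run_failed:
        return True  # failed runs bypass the sample rate
    return rng() < sample_rate  # successful runs are sampled at the dial


# A fixed rng makes the sampling decision reproducible for the example.
sampled = should_judge(False, spend=10.0, rng=lambda: 0.05)
skipped = should_judge(False, spend=10.0, rng=lambda: 0.50)
```

Raising `sample_rate` buys tighter score estimates at proportionally higher judge spend, which is why it works as the single cost dial.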