Verify

Is quality holding? An agent that runs cleanly can still produce degraded output. Verify adds a structured quality signal to every run.

Two layers

Layer	What	Tier
Deterministic checks	Server-side rules: JSON schema, regex, length, required fields, tool calls	All tiers, no LLM cost
LLM-as-judge rubrics	Cheap model scores output against a rubric on sampled runs	Team and up
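
To make the deterministic layer concrete, here is a minimal sketch of the kinds of rules listed in the table (length, JSON validity, required fields, regex). It is an illustration only, not Verify's implementation: the function name run_checks, its parameters, and the sample outputs are all hypothetical.

```python
import json
import re


def run_checks(output, required_fields, max_length, pattern):
    """Return a dict of check name -> bool for one run's raw output.

    Hypothetical helper: mirrors the rule types in the table above,
    not the product's actual check engine.
    """
    results = {"length": len(output) <= max_length}

    # JSON check: the output must parse and be an object.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        parsed = None
    results["valid_json"] = isinstance(parsed, dict)

    # Required fields only make sense if the JSON parsed to an object.
    results["required_fields"] = isinstance(parsed, dict) and all(
        field in parsed for field in required_fields
    )

    # Regex check runs against the raw text.
    results["regex"] = re.search(pattern, output) is not None
    return results


SOURCE_PATTERN = r'"source":\s*"doc-\d+"'  # illustrative rule
good = run_checks('{"answer": "42", "source": "doc-7"}',
                  ["answer", "source"], 200, SOURCE_PATTERN)
bad = run_checks("sorry, I cannot help",
                 ["answer", "source"], 200, SOURCE_PATTERN)
```

Because every rule is a plain function of the output, checks like these need no model call, which is consistent with the "no LLM cost" note in the table.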

Each layer is opt-in per agent. Run checks alone, judges alone, or both. Self-reported scores from your code (run.score(name, value)) feed into the same distribution.

Drift detection

The signal isn't any single score; it's the distribution shifting. Verify maintains rolling 24h, 7d, and 30d averages per agent. Statistical change detection compares current scores against the historical mean (last 14 days) and fires an alert when the distribution shifts beyond the threshold. Alerts go out on the same routes Pulse uses.
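
One way to picture change detection against a historical mean is a z-score test: compare the recent window's mean with the 14-day baseline, scaled by the baseline's spread. The sketch below assumes that approach. The source does not say which statistical test Verify uses, and the function name drift_alert, the threshold of 3.0, and the sample scores are hypothetical.

```python
from statistics import mean, stdev


def drift_alert(history, recent, z_threshold=3.0):
    """Return (z, fired) for a recent window against a baseline.

    history: baseline scores, e.g. one daily mean per day for 14 days.
    recent:  scores in the current window.
    Assumption: a simple z-score test, not Verify's actual method.
    """
    base_mean = mean(history)
    # Floor the spread so a perfectly flat baseline cannot divide by zero.
    base_sd = max(stdev(history), 1e-9)
    z = (mean(recent) - base_mean) / base_sd
    return z, abs(z) > z_threshold


# Illustrative data: 14 daily means hovering around 0.81.
baseline = [0.82, 0.80, 0.81, 0.83, 0.79, 0.80, 0.82,
            0.81, 0.80, 0.82, 0.81, 0.79, 0.83, 0.80]

steady_z, steady_fired = drift_alert(baseline, [0.81, 0.80, 0.82])
drop_z, drop_fired = drift_alert(baseline, [0.70, 0.68, 0.72])
```

With this data the steady window stays well inside the threshold, while the dropped window produces a large negative z and fires.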

Low-scoring runs are clustered by the embedding of the judge's reasoning, so the dashboard tells you not just that quality dropped but that, for example, 22% of last week's failures share a root cause.
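
The idea of grouping failures by embedding can be sketched with a simple greedy pass over cosine similarity. This is an assumption for illustration: the source does not name the clustering algorithm or the embedding model, and the vectors, the 0.9 similarity cutoff, and the function names here are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_cluster(vectors, min_similarity=0.9):
    """Put each vector in the first cluster whose seed is similar enough.

    Returns a list of clusters, each a list of indices into `vectors`;
    the first index in a cluster is its seed.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= min_similarity:
                members.append(i)
                break
        else:
            # No existing cluster matched, so this vector seeds a new one.
            clusters.append([i])
    return clusters


# Toy 3-dimensional "embeddings" of judge reasoning for five failed runs.
embeddings = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.01],
    [0.00, 0.20, 0.95],
    [0.91, 0.09, 0.00],
    [0.05, 0.90, 0.10],
]
clusters = greedy_cluster(embeddings)
largest = max(clusters, key=len)
share = len(largest) / len(embeddings)  # fraction of failures in the top cluster
```

The share of failures in the largest cluster is the kind of figure the dashboard sentence above describes.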

Cost transparency

A hard spend cap sits on every team (default £50/month, configurable). At 80% of the cap an email goes out; at 100% the judge pauses for the rest of the month. Sample rate is the dial: 10% by default, configurable from 1% to 100%. Failed runs are always sampled, regardless of the sample rate.
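
The cap and sampling rules above can be sketched as two small functions. The names cap_status and should_judge are hypothetical, amounts are in pounds, and one point is an assumption rather than something the source states: once the cap is reached, this sketch treats the pause as taking precedence even for failed runs.

```python
import random


def cap_status(spent, cap=50.0):
    """Classify month-to-date judge spend against the team cap."""
    ratio = spent / cap
    if ratio >= 1.0:
        return "paused"  # judge stops for the rest of the month
    if ratio >= 0.8:
        return "warned"  # warning email threshold
    return "ok"


def should_judge(run_failed, spent, cap=50.0, sample_rate=0.10,
                 rng=random.random):
    """Decide whether one run goes to the LLM judge.

    Assumption: a paused judge skips every run, including failed ones.
    """
    if cap_status(spent, cap) == "paused":
        return False
    if run_failed:
        return True  # failed runs bypass the sample rate
    return rng() < sample_rate


# Fixed "random" draws make the behaviour easy to see.
high_draw = lambda: 0.99
low_draw = lambda: 0.05

failed_run = should_judge(True, spent=10.0, rng=high_draw)
skipped_run = should_judge(False, spent=10.0, rng=high_draw)
sampled_run = should_judge(False, spent=10.0, rng=low_draw)
capped_run = should_judge(True, spent=50.0, rng=low_draw)
```

Passing the random source as a parameter keeps the sampling decision testable without changing its behaviour in normal use.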