Rubrics

A rubric is a prompt plus a scoring scale, run as an LLM-as-judge against a sampled subset of finished runs. Use rubrics for qualities you can't express as a check: tone, helpfulness, faithfulness, calibration. Available on the Team plan and above.

Rubric format

Rubrics are written in YAML in the dashboard:

agent: support-triage
rubric: |
  Score this support ticket triage on three dimensions:
  1. Correct department (1-5)
  2. Severity matches issue urgency (1-5)
  3. Customer tone preserved (1-5)
  Output JSON: {"department": int, "severity": int, "tone": int, "reasoning": string}
judge_model: claude-haiku-4-5
sample_rate: 0.1
pass_threshold: 4.0
calibration_anchors:
  good:
    - run_id: run_eu_018f3a2b9c1d7e8fa4b9c2d7e8f1a3b6
  bad:
    - run_id: run_eu_018f3a2b9c1d7e8fa4b9c2d7e8f1a3b7

Always elicit a reasoning field. The dashboard stores up to 2 KB of judge reasoning alongside every score; it's what tells you why quality moved.
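As a rough illustration of consuming judge output, the sketch below parses the JSON shape requested in the example rubric, checks that each score is an integer from 1 to 5, and trims reasoning to 2 KB. The function name parse_judge_output is hypothetical, and averaging the three dimensions against pass_threshold is an assumption made for the example; this page does not say how the dashboard aggregates dimensions.

```python
import json

DIMENSIONS = ("department", "severity", "tone")
MAX_REASONING_BYTES = 2048  # 2 KB, the stored reasoning limit


def parse_judge_output(raw: str, pass_threshold: float = 4.0) -> dict:
    """Validate judge JSON and compute a mean score (assumed aggregation)."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    # Trim to at most 2 KB of UTF-8, dropping any partial trailing character.
    reasoning = str(data.get("reasoning", "")).encode("utf-8")[:MAX_REASONING_BYTES]
    mean = sum(scores.values()) / len(scores)
    return {
        "scores": scores,
        "mean": mean,
        "passed": mean >= pass_threshold,
        "reasoning": reasoning.decode("utf-8", errors="ignore"),
    }
```

Malformed JSON and out-of-range scores both raise ValueError (json.JSONDecodeError is a subclass), so a caller can treat either as a judge error rather than a low score.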

Sampling

Default 10%, configurable 1-100%
Stratified: biased toward failed and high-cost runs
Failed runs are always sampled, regardless of dial position
Per-agent override to 100% for critical agents
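The sampling rules above can be sketched as a single decision function. This is a minimal illustration, not the dashboard's implementation: should_sample is a hypothetical name, and the extra weighting toward high-cost runs is left out because this page does not specify how it is applied.

```python
import random


def should_sample(failed: bool, sample_rate: float, rng: random.Random) -> bool:
    """Decide whether a finished run is sent to the judge."""
    if not 0.01 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.01 and 1.0")
    if failed:
        return True  # failed runs bypass the dial entirely
    # rng.random() is in [0, 1), so a rate of 1.0 samples every run.
    return rng.random() < sample_rate
```

Passing the random generator in explicitly keeps the decision reproducible in tests.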

The dial is the cost knob: judge spend scales with the sample rate. Even at 10%, a high-volume agent produces enough scored runs for plenty of statistical power.
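A back-of-the-envelope check of the statistical-power claim: the 95% margin of error on a mean score shrinks with the square root of the number of scored runs. The sketch assumes simple random sampling and an illustrative score standard deviation of 1.0 on a 1-5 scale; stratified sampling would need per-stratum weighting, which is not modeled here. The daily volume is hypothetical.

```python
import math


def margin_of_error(n_scored: int, score_sd: float = 1.0, z: float = 1.96) -> float:
    """Approximate 95% margin of error on a mean score from n_scored samples."""
    if n_scored <= 0:
        raise ValueError("n_scored must be positive")
    return z * score_sd / math.sqrt(n_scored)


# Hypothetical agent with 10,000 finished runs per day, sampled at 10%.
daily_runs = 10_000
n = round(daily_runs * 0.1)
print(f"{n} scored runs -> +/-{margin_of_error(n):.3f} on a 1-5 scale")
```

Under these assumptions, 1,000 scored runs pin the mean to within roughly 0.06 points; a low-volume agent with 100 runs per day would get only 10 scored runs and a margin near 0.6, which is the case for raising the dial.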

Versioning

Every edit creates a new version (rubv_<region>_...). Scores are tagged with the version they were generated against, so a rubric change doesn't retroactively re-score history. Re-scoring history against a new version is available on demand; it is billed, because it spends judge tokens again.

Caching

Identical outputs scored against the same rubric version reuse the previous score. Agents that produce deterministic output for the same input pay for the judge once.
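One way to picture the cache is a lookup keyed by rubric version and a hash of the output. The sketch below is an in-memory stand-in under that assumption; ScoreCache, the SHA-256 key, and the version strings used with it are illustrative, not the platform's actual storage scheme.

```python
import hashlib


class ScoreCache:
    """In-memory stand-in for a score cache keyed by (rubric version, output hash)."""

    def __init__(self):
        self._scores = {}
        self.judge_calls = 0

    def score(self, rubric_version: str, output: str, judge) -> float:
        digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
        key = (rubric_version, digest)
        if key not in self._scores:
            self.judge_calls += 1  # only cache misses spend judge tokens
            self._scores[key] = judge(output)
        return self._scores[key]
```

Because the rubric version is part of the key, editing a rubric naturally invalidates earlier cache entries: the same output is judged once per version.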

Bias mitigation

Pairwise comparisons: the order in which the two outputs are presented to the judge is randomized across runs.
Absolute scoring: calibration anchors (10-20 "good" and "bad" examples) ground the judge.
Optional multi-judge ensemble: run each sample through two judge models, average the scores, and flag disagreements.
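The ensemble step reduces to an average plus a disagreement flag. In this minimal sketch, ensemble_score is a hypothetical name and the one-point max_gap threshold is an assumption; this page does not state what counts as a disagreement.

```python
def ensemble_score(score_a: float, score_b: float, max_gap: float = 1.0) -> dict:
    """Average two judge scores and flag the pair when they differ by more than max_gap."""
    return {
        "score": (score_a + score_b) / 2,
        "disagreement": abs(score_a - score_b) > max_gap,
    }
```

Flagged pairs are the ones worth reading by hand, since a large gap usually means the rubric wording is ambiguous for that case.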

Cost caps

Hard cap per team (default £50/month, configurable). An email is sent at 80% of the cap, and judging stops at 100%. The cap resets on the first of each month (UTC). The dashboard shows month-to-date spend and a forecast.
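The two thresholds can be expressed as a small function of month-to-date spend. This is only a sketch: cap_action and its return labels are hypothetical, chosen to make the 80% and 100% boundaries explicit.

```python
def cap_action(spend_gbp: float, cap_gbp: float = 50.0) -> str:
    """Return what happens at the current month-to-date judge spend."""
    if cap_gbp <= 0:
        raise ValueError("cap_gbp must be positive")
    used = spend_gbp / cap_gbp
    if used >= 1.0:
        return "stop"   # hard stop at 100% of the cap
    if used >= 0.8:
        return "email"  # warning email at 80% of the cap
    return "ok"
```

With the default £50 cap, the warning fires at £40 of spend.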

Judge failures

When the judge errors (provider timeout, malformed JSON, rate limit), the run is marked judge_status: error, distinct from unscored. Judge calls are retried with exponential backoff up to three times. Judge errors are surfaced as a separate metric and don't pollute the score distribution.
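The retry behaviour can be sketched as follows. Several details are assumptions for illustration: JudgeError and score_with_retries are hypothetical names, "three times" is read as three retries after the initial attempt, the one-second base delay is invented, and the "ok" status label is a placeholder (only error and unscored are named on this page).

```python
import time


class JudgeError(Exception):
    """Hypothetical error for provider timeouts, malformed JSON, or rate limits."""


def score_with_retries(judge, run_output, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call the judge, retrying with exponential backoff; report errors separately."""
    for attempt in range(max_retries + 1):
        try:
            return {"judge_status": "ok", "score": judge(run_output)}
        except JudgeError:
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    # Retries exhausted: record an error rather than a score.
    return {"judge_status": "error", "score": None}
```

Returning a status with a null score, instead of a zero, is what keeps judge failures out of the score distribution.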