Rubrics
A rubric is a prompt plus a scoring scale, run as an LLM-as-judge against a sampled subset of finished runs. Use rubrics for qualities you can't express as a check: tone, helpfulness, faithfulness, calibration. Available on Team and above.
Rubric format
Written in YAML in the dashboard:
agent: support-triage
rubric: |
  Score this support ticket triage on three dimensions:
  1. Correct department (1-5)
  2. Severity matches issue urgency (1-5)
  3. Customer tone preserved (1-5)
  Output JSON: {"department": int, "severity": int, "tone": int, "reasoning": string}
judge_model: claude-haiku-4-5
sample_rate: 0.1
pass_threshold: 4.0
calibration_anchors:
  good:
    - run_id: run_eu_018f3a2b9c1d7e8fa4b9c2d7e8f1a3b6
  bad:
    - run_id: run_eu_018f3a2b9c1d7e8fa4b9c2d7e8f1a3b7
Always elicit a reasoning field. The dashboard stores up to 2KB of judge reasoning alongside every score; it's what tells you why quality moved.
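To make the output contract concrete, here is a minimal sketch of how a client might validate a judge reply shaped like the rubric above. The function name `parse_judge_output` is hypothetical, and applying `pass_threshold` to the mean of the three dimensions is an assumption; this page does not say how the threshold is aggregated.

```python
import json

REASONING_LIMIT_BYTES = 2048  # the 2KB reasoning limit described above
DIMENSIONS = ("department", "severity", "tone")


def parse_judge_output(raw: str, pass_threshold: float = 4.0) -> dict:
    """Validate a judge reply and derive a pass/fail verdict (illustrative only)."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
        scores[name] = value

    # Truncate on a byte budget; errors="ignore" drops a split multi-byte character.
    reasoning = str(data.get("reasoning", ""))
    reasoning = reasoning.encode("utf-8")[:REASONING_LIMIT_BYTES].decode("utf-8", errors="ignore")

    # ASSUMPTION: the threshold is compared against the mean of the dimensions.
    mean_score = sum(scores.values()) / len(scores)
    return {
        "scores": scores,
        "mean": mean_score,
        "passed": mean_score >= pass_threshold,
        "reasoning": reasoning,
    }
```

Rejecting out-of-range or missing scores up front keeps malformed replies out of the score distribution instead of silently averaging them in.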
Sampling
- Default 10%, configurable 1-100%
- Stratified: biased toward failed and high-cost runs
- Failed runs are always sampled, regardless of dial position
- Per-agent override to 100% for critical agents
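The sampling rules in the list above can be sketched as a per-run probability. This is an illustration, not the product's sampler: the function names and the `HIGH_COST_BOOST` weighting are hypothetical, since the page says only that sampling is biased toward high-cost runs without giving the weights.

```python
import random

HIGH_COST_BOOST = 2.0  # ASSUMPTION: hypothetical weight for high-cost runs


def sampling_probability(failed: bool, high_cost: bool, sample_rate: float,
                         force_all: bool = False) -> float:
    """Probability that a finished run is sent to the judge."""
    if failed or force_all:
        # Failed runs and agents overridden to 100% are always sampled.
        return 1.0
    rate = sample_rate * (HIGH_COST_BOOST if high_cost else 1.0)
    return min(rate, 1.0)


def should_sample(failed: bool, high_cost: bool, sample_rate: float,
                  rng: random.Random, force_all: bool = False) -> bool:
    """Draw once against the run's sampling probability."""
    # rng.random() is in [0.0, 1.0), so a probability of 1.0 always samples.
    return rng.random() < sampling_probability(failed, high_cost, sample_rate, force_all)
```

Passing the random generator in explicitly makes the decision reproducible in tests.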
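The sample rate is the cost knob: judge spend scales with it. On a high-volume agent, 10% still yields enough scored runs for tight estimates; 10,000 runs a month gives about 1,000 scores.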
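As a rough check on how much a given sample rate buys, the sketch below computes the 95% margin of error on a pass rate using the normal approximation. It assumes simple random sampling, which the stratified sampler described above is not, so treat the numbers as an approximation. The function name is hypothetical.

```python
import math


def pass_rate_margin(scored_runs: int, pass_rate: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / scored_runs)


SAMPLE_PERCENT = 10  # the default sample rate

for total_runs in (1_000, 10_000, 100_000):
    scored = total_runs * SAMPLE_PERCENT // 100
    margin_points = round(pass_rate_margin(scored) * 100, 1)
    print(f"{total_runs} runs -> {scored} scored -> +/- {margin_points} points")
```

With the worst-case pass rate of 0.5, this prints margins of 9.8, 3.1, and 1.0 percentage points, so the 10% default is tight at high volume and loose for agents with only a few hundred runs.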
Versioning
Every edit creates a new version (rubv_<region>_...). Scores are tagged with the version they were generated against, so a rubric change doesn't retroactively re-score history. Re-scoring history against a new rubric version is available on demand; it is billed, because the judge tokens are spent again.
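Because scores carry the version they were generated against, trend analysis should group by version rather than mix them. A minimal sketch, assuming a hypothetical `Score` record and placeholder version labels (real identifiers follow the rubv_ format above):

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    run_id: str
    rubric_version: str  # placeholder labels here, not real version IDs
    value: float


def mean_by_version(scores: list[Score]) -> dict[str, float]:
    """Average scores separately for each rubric version."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for score in scores:
        grouped[score.rubric_version].append(score.value)
    return {version: mean(values) for version, values in grouped.items()}


history = [
    Score("run-1", "version-A", 4.5),
    Score("run-2", "version-A", 3.5),
    Score("run-3", "version-B", 3.0),
]
print(mean_by_version(history))
```

A drop between versions may reflect a stricter rubric rather than a worse agent, which is why the two averages are kept apart.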
Caching
Identical outputs scored against the same rubric version reuse the previous score. Agents that produce deterministic output for the same input pay for the judge once.
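One way to picture the cache is a lookup keyed on the rubric version plus a hash of the output. The sketch below is an assumption about the mechanism, not a description of the actual implementation; `ScoreCache` and its methods are hypothetical names.

```python
import hashlib
from typing import Callable


class ScoreCache:
    """In-memory cache keyed on (rubric version, SHA-256 of the output)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], float] = {}

    @staticmethod
    def key(output: str, rubric_version: str) -> tuple[str, str]:
        digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
        return (rubric_version, digest)

    def get_or_score(self, output: str, rubric_version: str,
                     judge: Callable[[str], float]) -> float:
        """Return the cached score, calling the judge only on a cache miss."""
        cache_key = self.key(output, rubric_version)
        if cache_key not in self._store:
            self._store[cache_key] = judge(output)
        return self._store[cache_key]
```

Including the rubric version in the key means a new version always misses the cache, which matches the rule that scores belong to the version they were generated against.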
Bias mitigation
- Pairwise comparisons: the order in which the two outputs are shown to the judge is randomized across runs, to counter position bias.
- Absolute scoring: calibration anchors (10-20 "good" and "bad" examples) ground the judge.
- Optional multi-judge ensemble: run each sample through two judge models, average the scores, and flag disagreements.
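Two of the mitigations above are easy to sketch. The code below is illustrative: the function names, the "first"/"second" verdict convention, and the `max_gap` disagreement tolerance are all assumptions, since the page does not define how a disagreement is detected.

```python
import random
from typing import Callable


def pairwise_compare(a: str, b: str, judge: Callable[[str, str], str],
                     rng: random.Random) -> str:
    """Show the two outputs in random order and map the verdict back to 'a' or 'b'.

    `judge(first, second)` is assumed to return "first" or "second".
    """
    swapped = rng.random() < 0.5
    first, second = (b, a) if swapped else (a, b)
    first_won = judge(first, second) == "first"
    if swapped:
        return "b" if first_won else "a"
    return "a" if first_won else "b"


def ensemble_score(score_1: float, score_2: float, max_gap: float = 1.0) -> dict:
    """Average two judges' scores and flag gaps larger than max_gap."""
    return {
        "score": (score_1 + score_2) / 2,
        "disagreement": abs(score_1 - score_2) > max_gap,  # ASSUMPTION: hypothetical tolerance
    }
```

With randomized order, a judge that always prefers whichever output it sees first splits its wins between the two candidates instead of systematically favoring one.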
Cost caps
Hard cap per team (default £50/month, configurable). An email is sent at 80% of the cap; judging stops at 100%. The cap resets on the first of each month (UTC). The dashboard shows month-to-date spend and a forecast.
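The thresholds can be expressed as a small state check. This sketch works in integer pence to avoid floating-point rounding. The function names and status labels are hypothetical, and the linear run-rate forecast is an assumption: the page does not say how the dashboard forecast is computed.

```python
import calendar
from datetime import datetime, timezone

DEFAULT_CAP_PENCE = 5_000  # the £50/month default


def cap_status(spent_pence: int, cap_pence: int = DEFAULT_CAP_PENCE) -> str:
    """Return 'stopped' at 100% of the cap, 'warned' at 80%, otherwise 'ok'."""
    if spent_pence >= cap_pence:
        return "stopped"
    if spent_pence * 100 >= cap_pence * 80:
        return "warned"
    return "ok"


def linear_forecast_pence(spent_pence: int, now: datetime) -> int:
    """ASSUMPTION: extrapolate month-to-date spend at a constant rate (UTC month)."""
    now = now.astimezone(timezone.utc)
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    elapsed = (now - month_start).total_seconds()
    if elapsed <= 0:
        return spent_pence
    return round(spent_pence * days_in_month * 86_400 / elapsed)
```

For example, £10 spent by the midpoint of a 30-day month forecasts to £20 under this model, which would stay below both thresholds.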
Judge failures
When the judge errors (provider timeout, malformed JSON, rate limit), the call is retried with exponential backoff up to three times. If it still fails, the run is marked judge_status: error, distinct from unscored. Judge errors are surfaced as a separate metric and don't pollute the score distribution.
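The retry behavior can be sketched as follows. This is an illustration under stated assumptions: `JudgeError`, `score_with_retries`, the one-second base delay, and the "ok" status value are hypothetical, and "up to three times" is read here as one initial attempt plus three retries.

```python
import time
from typing import Callable


class JudgeError(Exception):
    """Hypothetical error type for timeouts, malformed JSON, and rate limits."""


def score_with_retries(call_judge: Callable[[], float], max_retries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call the judge, retrying with exponential backoff before recording an error."""
    attempt = 0
    while True:
        try:
            return {"judge_status": "ok", "score": call_judge()}
        except JudgeError as exc:
            if attempt == max_retries:
                # Recorded as an error, not as a low score, so it stays out of the distribution.
                return {"judge_status": "error", "score": None, "error": str(exc)}
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
            attempt += 1
```

Returning a distinct status with no score, rather than a zero, is what keeps judge failures from dragging down the reported quality numbers.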