Verify catches quality drift before your users do.

A clean run can still be a bad one.

A run can finish without error and still be worse than yesterday's. Verify grades every run against standards you write in plain English, and catches the slow decline after a prompt or model change while it is happening, not when the tickets arrive.

summariser · judge score ↓ 4.2 to 3.8
  • cites a source pass
  • answers the question pass
  • stays on policy fail

checks

Define what good looks like, once.

A rubric, a JSON schema, a handful of rules. These checks run on every single run at zero marginal cost, because no model is involved, so the obvious failures never reach a customer.

  • Schema and heuristic checks on every run, no model call needed
  • Write a rubric in plain English; no labelled training set
  • Catch the malformed output before it ships, not after
summariser · run #48,102 · checks
  • valid JSON shape pass
  • cites a source pass
  • answers the question asked pass
  • stays within content policy fail
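
As a rough illustration of what a deterministic check can look like, here is a minimal sketch in Python. The function names, the required keys, and the URL heuristic are all assumptions made for this example; they are not Verify's actual API or rule set. The point is only that checks like these need no model call.

```python
import json
import re

# Hypothetical schema: the keys a summariser output is assumed to contain.
REQUIRED_KEYS = {"summary", "sources"}

def check_valid_json_shape(raw: str) -> bool:
    """Pass if the output parses as a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)

def check_cites_source(raw: str) -> bool:
    """Heuristic: pass if the output contains at least one http(s) URL."""
    return re.search(r"https?://\S+", raw) is not None

def run_checks(raw: str) -> dict[str, bool]:
    """Run every deterministic check and return a name -> pass/fail map."""
    return {
        "valid JSON shape": check_valid_json_shape(raw),
        "cites a source": check_cites_source(raw),
    }

good = '{"summary": "Rates rose.", "sources": ["https://example.com/report"]}'
bad = "Rates rose, probably."
print(run_checks(good))
print(run_checks(bad))
```

Both checks are pure functions of the output text, which is why they can run on every run without adding model spend.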

llm-as-judge

Score the runs a check can't catch.

For the judgement calls, a separate model scores each run against your rubric. It grades against the standard you wrote, not its own opinion. You set the sample rate (score every run, or one in fifty), so you decide exactly how much quality assurance costs.

  • A judge model grades the rest against the rubric you wrote, in plain English
  • The judge's score and reasoning are recorded on every judged run
  • You set the sample rate (judge every run, or one in fifty), so you control the spend
judge score · last 7 days · avg 4.1
  • 1 · 2 · 3 · 4 · 5 score band
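
One way a sample rate could be applied is sketched below. It assumes each run has a string ID and hashes that ID to decide whether the run goes to the judge. `should_judge` and the `run-N` IDs are hypothetical names for this example, and the hashing approach is one possible design, not a description of how Verify samples.

```python
import hashlib

def should_judge(run_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether a run is sent to the judge model.

    Hashing the run ID (instead of drawing a random number) means the same
    run always gets the same decision, so re-processing is repeatable.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto the interval [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

# Judge roughly one run in fifty.
run_ids = [f"run-{i}" for i in range(10_000)]
judged = sum(should_judge(r, 1 / 50) for r in run_ids)
print(f"judged {judged} of {len(run_ids)} runs")
```

A rate of 1.0 judges every run and a rate of 0.0 judges none, so in this sketch judge spend scales roughly linearly with the rate you pick.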

drift detection

See the drop before your users do.

A prompt change on Monday quietly drops your average score. Support tickets arrive Wednesday. Verify watches the live distribution and flags the slide on day one, not after the churn.

  • Drift detection on today's distribution, not last month's batch
  • Surfaces a degraded prompt or model change as it happens
  • Average score and pass rate tracked over time, per agent and rubric
summariser · avg score ↓ 4.2 to 3.8
quality drift on summariser since prompt v12 deploy · -9% in 3 days · flagged on dashboard
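
The idea of comparing today's scores against a baseline can be sketched in a few lines. This example uses a deliberately simple stand-in: it compares the mean of today's judge scores with the mean of a baseline window and flags a relative drop beyond a threshold. The function name, the 5% threshold, and the sample scores are assumptions for illustration; a production detector would likely compare whole distributions and account for sample size.

```python
from statistics import mean

def detect_drift(baseline: list[float], today: list[float],
                 max_relative_drop: float = 0.05) -> tuple[bool, float]:
    """Return (flagged, relative_change) for today's scores vs. the baseline.

    relative_change is negative when today's average is lower.
    """
    if not baseline or not today:
        raise ValueError("both score lists must be non-empty")
    base_avg = mean(baseline)
    change = (mean(today) - base_avg) / base_avg
    return change <= -max_relative_drop, change

# Illustrative scores only: the baseline averages about 4.2, today about 3.8.
baseline_scores = [4.2, 4.0, 4.4, 4.2, 4.2]
today_scores = [3.8, 3.6, 4.0, 3.8, 3.8]

flagged, change = detect_drift(baseline_scores, today_scores)
print(flagged, f"{change:+.1%}")
```

With these made-up numbers the drop is a little over 9%, which exceeds the assumed 5% threshold, so the run of scores is flagged.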

Built so quality can't slip quietly.

The difference between finding out from your dashboard and finding out from an angry customer.

rubrics

Plain-English rubrics

Describe what good looks like in a sentence; no eval framework, no labelled dataset.

checks

Schema and heuristics

Deterministic checks run on every run at zero marginal cost, catching the obvious breaks.

control

You set the sample rate

Score every run or one in fifty. Quality assurance costs exactly what you choose.

drift

Drift detection

A downward shift in the live judge-score distribution surfaces on your dashboard before the tickets do.

trends

Score history

Average score and pass rate tracked over time, so a slow regression is visible, not a surprise.

coverage

Score every run

Quality measured on live production traffic, not a stale offline batch from last sprint.

The decay you would never catch by hand.

Set the bar once

A rubric or schema you write in plain English becomes the bar every run is held to.

  • cites a source pass
  • on policy fail

Score every run

Checks on all of them, judge on a sample you control, all on the live stream.

See drift before users do

The distribution moves the day a prompt regresses, and you hear about it then.

Questions, answered.

What is the difference between checks and judge?
Deterministic checks (schema, regex, heuristics) run on every run at zero marginal cost, because no model is involved. LLM-as-judge scores against a rubric you write in plain English, at a sample rate you choose.
Why would I trust an AI to grade an AI?
Two reasons. The deterministic checks are objective and do the floor-level catching with no model involved. The judge model only handles the subjective calls, and it grades against the rubric you wrote, so it measures your standard, not its own opinion. It is more consistent than a human spot-checking by hand, and it reads every sampled run instead of the handful a person would.
Do I need labelled data to score quality?
No. You define what good looks like once, as a rubric, a JSON schema, or a judge prompt, and every run is scored against it. No training set required.
How do I control what quality scoring costs?
With the sample rate. Deterministic checks are always free, because they make no model call. Only the LLM-as-judge calls cost money, and you decide how many runs get judged: every one, or one in fifty.
How does drift detection work?
Verify compares the score distribution on today's live traffic against the trailing baseline. When average quality slips after a prompt or model change, it surfaces on your dashboard before it surfaces in your support inbox.
Can I see quality trends over time?
Yes. Average judge score and check pass rate are tracked over time, per agent and rubric, so a slow regression shows up on the dashboard well before it shows up in your support queue.
get started

Point Verify at your agents. See it in minutes.

Verify catches quality drift before your users do. Two lines of code, or one curl. Live in minutes, free while we are in private beta.

Free to start. No card. The SDK never blocks your agents.