Verify knows your agents worked, not just that they ran.

Know your agents worked, not just that they ran.

Health checks run on every run, free, instantly: did it complete, did it error, was it unusually slow or expensive. Evaluations go further: an AI review of whether the run achieved its goal, out of the box, with no scorers to write and no eval engineering.


summariser · judge score ↓ 4.2 to 3.8

cites a source pass
answers the question pass
stays on policy fail

health checks

Every run checked, on every plan.

Did it complete, did it error, was it unusually slow or expensive, did a tool spin in a loop, did it produce output. Health checks run automatically on every run, free, with no limits, including on the Free plan.

Completion, errors, duration and cost anomalies, retry loops, empty output
Instant: results land with the run, no model call involved
Unlimited on every plan; this is the floor nothing slips under
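
The checks listed above are deterministic, so they can be pictured as plain functions over a run record. The following is a minimal sketch under stated assumptions, not AgentPing's implementation: the `Run` fields, the z-score threshold, and the "same tool five times in a row" rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class Run:
    # Hypothetical run record; field names are assumptions.
    completed: bool
    error: Optional[str]
    duration_s: float
    cost_usd: float
    tool_calls: list[str]
    output: str


def is_unusually_high(value: float, history: list[float], z_limit: float) -> bool:
    """One-sided check: flag values far above the historical mean."""
    if len(history) < 2:
        return False  # too little history to judge
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return value > mu  # every past run was identical
    return (value - mu) / sigma > z_limit


def has_retry_loop(calls: list[str], loop_limit: int) -> bool:
    """Flag the same tool being called loop_limit or more times in a row."""
    streak, previous = 0, None
    for call in calls:
        streak = streak + 1 if call == previous else 1
        if streak >= loop_limit:
            return True
        previous = call
    return False


def health_check(run: Run, past_durations: list[float], past_costs: list[float],
                 z_limit: float = 3.0, loop_limit: int = 5) -> dict[str, bool]:
    """Return pass (True) or fail (False) per check, with no model call."""
    return {
        "completed": run.completed,
        "error_free": run.error is None,
        "duration_normal": not is_unusually_high(run.duration_s, past_durations, z_limit),
        "cost_normal": not is_unusually_high(run.cost_usd, past_costs, z_limit),
        "no_retry_loop": not has_retry_loop(run.tool_calls, loop_limit),
        "has_output": bool(run.output.strip()),
    }
```

A run that finishes cleanly but calls the same tool six times in a row would pass every check here except `no_retry_loop`.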

daily-digest · run #48,102 · health

completed pass
error free pass
duration vs usual 98/100
retry loop fail

evaluations

Out of the box. Zero configuration.

An evaluation is an AI review of whether the run achieved its goal. No scorers to write, no labelled data, no eval engineering: give your runs a goal and AgentPing reviews the outcome, with the reasoning shown on every evaluated run.

Goal achieved or not, a quality score, named issues, and one-sentence reasoning
One field in the SDK (or one OpenTelemetry attribute) and it works from the first run
Disagree with one tap; your feedback sharpens future evaluations
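
As a rough picture of what an evaluated run carries, the four outputs named above fit in a small record. This is a sketch only: the class name, the field names, and the integer 1 to 5 quality scale (inferred from the "quality 4/5" display) are assumptions, not a documented schema.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    # Hypothetical shape of one evaluation result.
    goal_achieved: bool
    quality: int                                      # assumed 1-5 scale
    issues: list[str] = field(default_factory=list)   # named issues, if any
    reasoning: str = ""                               # one-sentence explanation

    def __post_init__(self) -> None:
        if not 1 <= self.quality <= 5:
            raise ValueError("quality must be between 1 and 5")

    def summary(self) -> str:
        """Render the result in a compact, dashboard-like line."""
        status = "yes" if self.goal_achieved else "no"
        issues = ", ".join(self.issues) if self.issues else "none"
        return f"goal achieved {status} | quality {self.quality}/5 | issues {issues}"
```

Keeping the reasoning next to the verdict is what makes a one-tap disagreement meaningful: the reviewer can see what the judgement was based on.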

evaluation · daily-digest · quality 4/5

goal achieved yes
issues none
"Digest sent and the steps confirm delivery to #support."

your allowance

A number you choose, never a bill you discover.

Each paid plan includes a monthly evaluation allowance: 1,000 on Starter, 6,000 on Team, 20,000 on Business. It is spread evenly across the month and concentrated on new and unusual runs, so the ten-thousandth identical success doesn't spend it. Never metered, never an overage.

Smart sampling: routine healthy runs don't use up your allowance
No single agent can take more than half the pool
Busy month? A $25 top-up adds 5,000 evaluations instantly
Per-team switch: staging and client teams evaluate only when you say so
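
One way to read these rules together is as a single gate applied before each evaluation. The sketch below assumes linear day-by-day pacing and treats the half-pool cap as a hard limit; the function name, parameters, and pacing formula are illustrative assumptions, and the real scheduler may work differently.

```python
def may_evaluate(*, team_enabled: bool, used_total: int, used_by_agent: int,
                 allowance: int, day_of_month: int, days_in_month: int,
                 agent_share_cap: float = 0.5) -> bool:
    """Decide whether one more evaluation fits the monthly budget."""
    if not team_enabled:
        return False  # per-team switch is off
    if used_total >= allowance:
        return False  # pool exhausted: evaluations pause
    if used_by_agent + 1 > allowance * agent_share_cap:
        return False  # no single agent takes more than half the pool
    # Even pacing: by the end of day d, at most d/days_in_month of the pool is spent.
    paced_budget = allowance * day_of_month / days_in_month
    return used_total + 1 <= paced_budget


# Team plan (6,000) on 15 June with 2,847 used: still within the paced budget.
on_pace = may_evaluate(team_enabled=True, used_total=2847, used_by_agent=400,
                       allowance=6000, day_of_month=15, days_in_month=30)

# A top-up simply enlarges the pool for the current month.
topped_up_allowance = 6000 + 5000
```

Under this model a top-up raises the paced budget for the remaining days as well as the total.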

evaluations used · June · 2,847 / 6,000

on pace · spread evenly · 41% of runs covered by routine baseline

drift detection

See the drop before your users do.

A prompt change on Monday quietly drops your quality trend. Support tickets arrive Wednesday. Verify watches the live distribution and flags the slide on day one, not after the churn.

Quality trend per agent with coverage, on live production traffic
A routine agent that starts failing is evaluated again immediately
Drift surfaces on the dashboard before it surfaces in your inbox
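
To make "flags the slide" concrete, a simple drift rule compares the most recent scores against the window just before them. This is a minimal sketch, assuming a relative-drop threshold and fixed window sizes; the function name, the windows, and the 15% limit are hypothetical, and it is not presented as Verify's actual detector.

```python
from statistics import mean


def detect_drift(scores: list[float], baseline_n: int = 20, recent_n: int = 5,
                 max_drop: float = 0.15) -> bool:
    """True when the recent mean falls more than max_drop (relative) below baseline."""
    if len(scores) < baseline_n + recent_n:
        return False  # not enough history to compare
    baseline = mean(scores[-(baseline_n + recent_n):-recent_n])
    recent = mean(scores[-recent_n:])
    if baseline <= 0:
        return False  # avoid dividing by a zero or negative baseline
    return (baseline - recent) / baseline > max_drop


# A trend that holds at 84 and then drops to 58 after a deploy is flagged.
history = [84.0] * 20 + [58.0] * 5
drifted = detect_drift(history)
```

Comparing against a trailing baseline rather than a fixed target lets each agent be judged against its own normal.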

daily-digest · quality trend ↓ 84 to 58

quality drift on daily-digest since prompt v12 deploy · flagged on dashboard

Built so quality can't slip quietly.

The difference between finding out from your dashboard and finding out from an angry customer.

health

Unlimited health checks

Completion, errors, anomalies, retry loops, empty output. Every run, every plan, no limits.

zero config

Evaluations out of the box

No scorers to write, no eval engineering. Give runs a goal and the review works from run one.

allowance

Never a surprise bill

A monthly allowance by plan, $25 top-ups when you choose. Nothing metered, ever.

sampling

Smart sampling

Your allowance concentrates on new and unusual runs, not the ten-thousandth identical success.

control

Per-team switch

Evaluations are opt-in per team, so staging never quietly spends the allowance.

trends

Quality trend + coverage

Per-agent score history with coverage, measured on live production traffic.

The decay you would never catch by hand.

Give runs a goal

One field in the SDK or one OTel attribute; that's the whole setup.

goal "send the daily digest"

Every run answered

Health on all of them, evaluation on the runs that matter, automatically.

goal achieved yes
quality 4/5

See drift before users do

The trend moves the day a prompt regresses, and you hear about it then.

Questions, answered.

What is the difference between health checks and evaluations?

Health checks are automatic and unlimited on every plan: completion, errors, duration and cost anomalies, retry loops, empty output. Evaluations are an AI review of whether the run achieved its goal; each paid plan includes a monthly allowance.

What do I have to configure?

Nothing. Give your runs a goal (one field in the SDK, or one attribute on OpenTelemetry traces) and evaluations work from the first run. No scorers to write, no labelled data, no eval pipeline.
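
For a sense of scale, "one field" might look like the sketch below. Everything here is a placeholder: the payload keys and the `agent.goal` attribute name are invented for illustration and are not taken from the AgentPing SDK or any OpenTelemetry convention, so check the product documentation for the real names.

```python
import json


def build_run_payload(agent: str, goal: str, output: str) -> str:
    """Serialize a run report; the key names are hypothetical."""
    return json.dumps({"agent": agent, "goal": goal, "output": output})


def goal_attributes(goal: str) -> dict[str, str]:
    """Span attributes carrying the same goal; the key is an assumed name."""
    return {"agent.goal": goal}


payload = build_run_payload("daily-digest", "send the daily digest", "Digest sent.")
attributes = goal_attributes("send the daily digest")
```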

How is my evaluation allowance spent?

Evenly across the month, concentrated on new and unusual runs. Routine healthy runs are sampled lightly so the ten-thousandth identical success does not use up your allowance, and no single agent can take more than half of it.
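
The sampling idea can be sketched as: always evaluate runs that look new or unhealthy, and evaluate only a small fraction of routine successes. In this illustration the run "fingerprint" (for example, agent name plus the sequence of tools called), the ten-run novelty window, and the 2% routine rate are all assumptions, not published behaviour.

```python
import random
from collections import Counter
from typing import Optional


class NoveltySampler:
    """Evaluate new or failing run shapes always; sample routine successes lightly."""

    def __init__(self, routine_after: int = 10, routine_rate: float = 0.02,
                 seed: Optional[int] = None) -> None:
        self.routine_after = routine_after  # runs before a shape counts as routine
        self.routine_rate = routine_rate    # fraction of routine runs still evaluated
        self.seen: Counter = Counter()
        self.rng = random.Random(seed)

    def should_evaluate(self, fingerprint: str, healthy: bool) -> bool:
        self.seen[fingerprint] += 1
        if not healthy:
            return True  # a failing health check makes the run unusual
        if self.seen[fingerprint] <= self.routine_after:
            return True  # this run shape is still new
        return self.rng.random() < self.routine_rate
```

With these defaults, the first ten identical successes are all evaluated and later ones cost roughly one evaluation in fifty, while any failure is evaluated straight away.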

What happens when the allowance runs out?

Health checks continue, unlimited. Evaluations pause until next month, a $25 top-up (5,000 evaluations, current month only), or an upgrade. There is never a metered bill.

Can I keep evaluations off for some teams?

Yes. Each team has an "Evaluate runs" switch, and newly created teams start with it off, so a staging team or a client experiment never consumes your allowance unless you choose.

Why would I trust an AI to grade an AI?

The objective floor is covered by deterministic health checks with no model involved. The AI review handles the judgement call (whether the goal was achieved) and shows its reasoning on every evaluated run so you can check it. You can disagree with one tap, and that feedback sharpens future evaluations.

Point Verify at your agents. See it in minutes.

Verify knows your agents worked, not just that they ran. Two lines of code, or one curl. Live in minutes, free to start, 14-day trial on paid plans.


Free to start. No card. The SDK never blocks your agents.
