000 category

What is AI agent observability?

AI agent observability is the practice of watching production AI agents the way you'd watch any other production system: knowing what each one costs, whether it's still running on schedule, and whether the output it produces is still good. It overlaps with general LLM observability, but the unit of analysis is the agent, not the API call.

001 why agents need their own observability

A web service crashes loudly. An agent rarely does.

Traditional APM tools (Datadog, New Relic, Sentry) track requests, latency, exceptions, CPU. Those failure modes are loud. An AI agent fails in three quieter ways, and none of them light up an APM dashboard. The agent quietly stops firing. The token bill creeps. The output gradually gets worse. By the time someone notices, the gap is weeks old.

The three failure modes

  • Silent breakage. A scheduled agent stops firing after a deploy. The cron job was removed during a refactor; no exception was thrown. The team finds out eleven days later when a customer complains the dashboard is stale.
  • Cost without attribution. The provider bill jumps from £2,400 to £8,900 in a month. The team has one API key shared across every agent. Three engineers spend a day bisecting which agent ate the spike.
  • Quality drift. A prompt change ships on Monday. By Wednesday the agent's average rubric score has dropped from 4.2 to 3.8. Support tickets arrive on Friday. The rollback fixes the future. Nothing fixes the past.

002 the three signals to watch

Cost, monitoring, and quality. One event per run.

For every agent you put in production, three signals matter. They share a common primitive: one telemetry record per agent run, capturing inputs, outputs, tokens, cost, status, and timing. From that one record, the three views fall out naturally.

Cost attribution

  • Tokens broken down by agent, customer, and feature. Tag customer_id and feature at run start so rollups are retroactive.
  • Server-side pricing from a rate card per (provider, model). The SDK never sends a cost number, so a client-side pricing bug cannot misstate your bill.
  • Cache-aware accounting. Anthropic prompt-cache reads and OpenAI cached_tokens are priced separately from fresh input.
  • Cost-per-successful-run as the headline metric, because cost over all runs hides loops and failures.
  • Anomaly detection on a 14-day baseline so a doubled spend pages you before the invoice lands.
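The accounting rules above can be sketched in a few functions. This is a minimal illustration, not AgentPing's implementation: the rate card layout, the placeholder prices, the `Call` fields, and the 2x threshold are all assumptions, and `input_tokens` is assumed to exclude cached tokens.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical rate card: price per million tokens, keyed by (provider, model).
# The numbers are placeholders for illustration, not real provider prices.
RATE_CARD = {
    ("example-provider", "example-model"): {
        "input": 3.00,
        "cached_input": 0.30,
        "output": 15.00,
    },
}


@dataclass
class Call:
    provider: str
    model: str
    input_tokens: int   # assumed to exclude cached tokens
    cached_tokens: int
    output_tokens: int


def call_cost(call: Call) -> float:
    """Price one LLM call server-side, with cached input priced separately."""
    rates = RATE_CARD[(call.provider, call.model)]
    total = (
        call.input_tokens * rates["input"]
        + call.cached_tokens * rates["cached_input"]
        + call.output_tokens * rates["output"]
    )
    return total / 1_000_000


def cost_per_successful_run(runs: list) -> float:
    """Spend across all runs divided by the number of successful runs."""
    total = sum(r["cost"] for r in runs)
    successes = sum(1 for r in runs if r["status"] == "success")
    return total / successes if successes else float("inf")


def is_spend_anomaly(daily_spend: list, today: float, factor: float = 2.0) -> bool:
    """Flag today's spend when it exceeds factor times the trailing 14-day mean."""
    baseline = daily_spend[-14:]
    return bool(baseline) and today > factor * mean(baseline)
```

Dividing total spend by successful runs means failed and looping runs push the headline number up instead of hiding inside an average.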

Live monitoring + schedule freshness

  • Every finished run lands in the dashboard within a second, with status, latency, cost, and error signature.
  • Scheduled agents have a cron expression and a tolerance window. If no run arrives by the end of that window, the on-call route is paged.
  • p95 latency rolled up across 24h / 7d / 30d. A sudden move usually means a model swap or a longer prompt.
  • Every run carries a status (success, fail, timeout), so a bad deploy is obvious at a glance.
  • Run-level traces. Tool calls, LLM calls, parent and child runs, and the actual output. Step through any run.
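The schedule-freshness and latency checks above might look roughly like this. It is a sketch under stated assumptions: a fixed interval stands in for a parsed cron expression, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from statistics import quantiles


def is_run_missed(last_run_at: datetime, interval: timedelta,
                  tolerance: timedelta, now: datetime) -> bool:
    """True when the next expected run is overdue by more than the tolerance.

    A fixed interval is assumed here in place of a real cron expression.
    """
    return now > last_run_at + interval + tolerance


def p95(latencies_ms: list) -> float:
    """95th-percentile latency; expects at least two samples."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(latencies_ms, n=100)[94]
```

For an hourly agent with a five-minute tolerance, a last run at 00:00 counts as missed once the clock passes 01:05.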

Quality scoring with drift detection

  • Deterministic checks at zero LLM cost: JSON schema, regex, length bounds, required fields, tool-call assertions.
  • LLM-as-judge runs a rubric written in plain English against a sampled subset of runs. Failed runs are always sampled.
  • Calibration anchors. Tag a few good and bad runs by ID so the judge prompt sees in-context examples, and scores stay comparable as the rubric evolves.
  • Rubric versioning. Every edit creates a new version; scores are tagged with the version that generated them.
  • Drift detection on the distribution, not single scores. A single 2/5 is noise; the mean dropping from 4.2 to 3.8 over a week is signal.
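A minimal sketch of the two cheapest pieces above, deterministic checks and mean-based drift detection. The check names, the length bound, and the 0.3 threshold are illustrative assumptions, not values from any product.

```python
import json
import re
from statistics import mean


def deterministic_checks(output: str, required_fields, max_len: int = 2000,
                         pattern: str = None) -> list:
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    if len(output) > max_len:
        failures.append("length")
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return failures + ["json"]
    if not isinstance(data, dict):
        return failures + ["json_object"]
    if any(name not in data for name in required_fields):
        failures.append("required_fields")
    if pattern and not re.search(pattern, output):
        failures.append("regex")
    return failures


def mean_drift(baseline_scores, recent_scores, threshold: float = 0.3) -> bool:
    """True when the recent mean sits below the baseline mean by more than threshold."""
    return mean(baseline_scores) - mean(recent_scores) > threshold
```

Comparing window means rather than individual scores is what keeps a single 2/5 from paging anyone, while a sustained slide from 4.2 to 3.8 still trips the check.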

003 vs llm monitoring

The unit of analysis is the agent, not the API call.

LLM monitoring tools focus on the call: tokens in, tokens out, latency, prompt content, provider error. Useful, but one level too low when the thing you ship to production is an agent that wraps several LLM calls, a few tool calls, and some control flow. AI agent observability rolls those up into a single run record and lets you ask agent-level questions: did this agent run, what did it cost, was its output still good.
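To make the roll-up concrete, here is a small sketch that collapses call-level records into one summary per run. The record shape (`run_id`, `input_tokens`, `output_tokens`) is assumed for illustration.

```python
from collections import defaultdict


def roll_up_calls(call_records: list) -> dict:
    """Collapse per-call LLM records into one summary per run_id."""
    runs = defaultdict(
        lambda: {"llm_calls": 0, "input_tokens": 0, "output_tokens": 0}
    )
    for call in call_records:
        summary = runs[call["run_id"]]
        summary["llm_calls"] += 1
        summary["input_tokens"] += call["input_tokens"]
        summary["output_tokens"] += call["output_tokens"]
    return dict(runs)
```

Agent-level questions (did it run, what did it cost) are then asked of the per-run summaries, not of individual calls.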

004 minimum viable implementation

One event per run, captured at the edge. Everything else derives.

The smallest thing that works is a single event per agent run, captured by an SDK that never blocks the agent. From that event, the three views (Spend for cost, Pulse for monitoring, Verify for quality) are derived. Most teams build a v0 of this themselves before deciding what to buy.

What the SDK looks like

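example · python sdk 2 lines
import os

import agentping

# Initialise once at process start; the key is read from the environment.
agentping.init(api_key=os.environ['AGENTPING_API_KEY'])

# `agent` and `ticket` stand for your own agent object and its input.
with agentping.run('support-triage', customer_id='acme-corp') as run:
    response = agent.handle(ticket)
    run.set_output(response)  # attach the final output to the run record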

The run ID is generated client-side before any network call. Telemetry sends on a background thread with a 2-second hard timeout. If the service is unreachable, the agent runs as if no SDK is installed.
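For teams building a v0, the non-blocking behaviour described above could be sketched as follows. This is not the AgentPing SDK: the `Telemetry` class and its `send` callable are hypothetical, `uuid4` stands in for UUIDv7 (which older Python standard libraries lack), and a real `send` would make the network call with a 2-second timeout.

```python
import queue
import threading
import time
import uuid
from contextlib import contextmanager


class Telemetry:
    """Hypothetical non-blocking sender; `send` is any callable taking one event."""

    def __init__(self, send, max_pending: int = 1000):
        self._send = send
        self._queue = queue.Queue(maxsize=max_pending)
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Runs on a background thread so network trouble never stalls the agent.
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # telemetry failures are swallowed, never raised to the agent
            finally:
                self._queue.task_done()

    def emit(self, event: dict):
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            pass  # drop the event rather than block the agent

    def flush(self):
        """Wait for queued events to be handled; for shutdown or tests only."""
        self._queue.join()

    @contextmanager
    def run(self, agent_id: str, **tags):
        # The run ID exists before any network call is attempted.
        event = {"run_id": str(uuid.uuid4()), "agent_id": agent_id, **tags,
                 "started_at": time.time(), "status": "success"}
        try:
            yield event
        except Exception:
            event["status"] = "fail"
            raise
        finally:
            event["finished_at"] = time.time()
            self.emit(event)
```

The bounded queue is a deliberate choice in this sketch: when the backend is down for a long time, events are dropped instead of accumulating in the agent's memory.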

What's in the run event

  • run_id (client-generated UUIDv7, available before any network call)
  • agent_id, team_id, optional customer_id and feature tags
  • started_at / finished_at (client time) and received_at (server time)
  • status: success / fail / timeout
  • Per-provider call records: provider, model, input_tokens, output_tokens, cached_tokens
  • Tool calls, child run IDs (for multi-agent traces), final output
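One way to model the fields listed above is a pair of dataclasses. The field names follow the list; the types, defaults, and validation are assumptions, and `received_at` is left out because the server sets it on ingest.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

VALID_STATUSES = {"success", "fail", "timeout"}


@dataclass
class ProviderCall:
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0


@dataclass
class RunEvent:
    run_id: str          # generated client-side before any network call
    agent_id: str
    team_id: str
    started_at: float    # client clock, seconds since the epoch (assumed unit)
    finished_at: float
    status: str
    customer_id: Optional[str] = None
    feature: Optional[str] = None
    calls: list = field(default_factory=list)          # ProviderCall records
    tool_calls: list = field(default_factory=list)
    child_run_ids: list = field(default_factory=list)  # for multi-agent traces
    output: Optional[str] = None

    def __post_init__(self):
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
```

`asdict(event)` yields a plain dictionary, nested call records included, that can be serialised to JSON for sending.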

005 frequently asked
Is AI agent observability just LLM monitoring under a new name?
No. LLM monitoring is about the API call: tokens, latency, prompt content. AI agent observability is one level up. The agent is the entity that matters and the run is the unit of analysis. An agent may make zero or fifty LLM calls during a single run, and the question is whether the agent did its job, not whether each call was fast.
Do I need a tool, or can I build this myself?
You can build it. The primitive is one event per agent run capturing inputs, outputs, tokens, cost, and timing. Most teams find the build cost (ingest pipeline, rate card per provider and model, alert routing, dashboard, SDKs in every language they use) exceeds the cost of buying. The non-blocking SDK is harder to write correctly than it looks.
How is this different from APM like Datadog or Sentry?
APM watches exceptions, latency, CPU, request counts. An AI agent rarely throws an exception when it is broken; it returns a wrong answer with the same confidence as a right one. APM stays green. Agent observability watches the agent itself: did it run, did it cost what we expected, and was the output still good.
Does telemetry slow down my agent?
A well-written SDK never blocks the agent. The run ID is generated client-side before any network call, telemetry sends on a background thread, and the network call times out hard at 2 seconds. If the observability service is unreachable, the agent runs as if no SDK is installed.
What providers does cost attribution support?
Any provider you put on the rate card. Defaults ship for every Anthropic and OpenAI model. Accounting is cache-aware: Anthropic prompt-cache reads and OpenAI cached_tokens are priced separately from fresh input. Custom or self-hosted models work the same way once you add a rate card row.
How do you score quality without a labelled dataset?
Two layers. Deterministic checks (JSON schema, regex, length bounds, required fields, tool-call assertions) run on every output at zero LLM cost. LLM-as-judge runs your rubric, written in plain English, against a sampled subset of runs with calibration anchors so scores stay comparable as the rubric evolves.
006 read next

How AgentPing implements each of the three.

The features page walks through Spend, Pulse, and Verify with the specifics: rate card, alert routes, rubric format, drift detection thresholds. The docs go one level deeper into the API and SDK contract.
