AI agent observability is the practice of watching production AI agents the way you'd watch any other production system: knowing what each one costs, whether it's still running on schedule, and whether the output it produces is still good. It overlaps with general LLM observability but the unit of analysis is the agent, not the API call.
Traditional APM tools (Datadog, New Relic, Sentry) track requests, latency, exceptions, CPU. Those failure modes are loud. An AI agent fails in three quieter ways, and none of them light up an APM dashboard. The agent quietly stops firing. The token bill creeps. The output gradually gets worse. By the time someone notices, the gap is weeks old.
For every agent you put in production, three signals matter. They share a common primitive: one telemetry record per agent run, capturing inputs, outputs, tokens, cost, status, and timing. From that one record, the three views fall out naturally.
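A minimal sketch of that shared record and the three views derived from it. The field names echo the ones used on this page, but the `RunRecord` shape, the helpers `spend_by_agent`, `is_stale`, and `pass_rate`, and the caller-supplied `check` callback are all hypothetical illustrations, not part of any SDK.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass(frozen=True)
class RunRecord:
    """One telemetry record per agent run (illustrative shape, not a real schema)."""
    agent_id: str
    started_at: datetime
    finished_at: datetime
    status: str            # "success", "fail", or "timeout"
    input_tokens: int
    output_tokens: int
    cost_usd: float
    output: str


def spend_by_agent(runs: list[RunRecord]) -> dict[str, float]:
    """Spend view: total cost per agent."""
    totals: dict[str, float] = defaultdict(float)
    for run in runs:
        totals[run.agent_id] += run.cost_usd
    return dict(totals)


def is_stale(runs: list[RunRecord], agent_id: str,
             now: datetime, max_gap: timedelta) -> bool:
    """Pulse view: True if the agent has no run finishing within max_gap of now."""
    finished = [r.finished_at for r in runs if r.agent_id == agent_id]
    return not finished or now - max(finished) > max_gap


def pass_rate(runs: list[RunRecord], check: Callable[[str], bool]) -> float:
    """Verify view: share of successful runs whose output passes a caller-supplied check."""
    outputs = [r.output for r in runs if r.status == "success"]
    if not outputs:
        return 0.0
    return sum(1 for out in outputs if check(out)) / len(outputs)
```

In practice the `check` callback would be whatever quality rubric a team already trusts; the point of the sketch is only that all three views read from the same list of records.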
Tag each run with customer_id and feature at run start so rollups are retroactive.
cached_tokens are priced separately from fresh input.
LLM monitoring tools focus on the call: tokens in, tokens out, latency, prompt content, provider error. That is useful, but one level too low when the thing you ship to production is an agent that wraps several LLM calls, a few tool calls, and some control flow. AI agent observability rolls those up into a single run record and lets you ask agent-level questions: did this agent run, what did it cost, and was its output still good?
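The rollup from call level to run level can be sketched as follows. This is an illustration under stated assumptions: `LLMCall`, `RATE_CARD`, `call_cost`, and `rollup` are hypothetical names, the prices are placeholders rather than any provider's real rates, and `input_tokens` is assumed to count only fresh (non-cached) input, which varies by provider.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder prices in USD per million tokens. NOT real provider pricing.
RATE_CARD = {
    "example-model": {"input": 2.00, "cached": 0.50, "output": 8.00},
}


@dataclass(frozen=True)
class LLMCall:
    model: str
    input_tokens: int     # fresh input only (assumption)
    cached_tokens: int    # cached input, priced at its own rate
    output_tokens: int


def call_cost(call: LLMCall) -> float:
    """Price one call, charging cached tokens separately from fresh input."""
    rates = RATE_CARD[call.model]
    return (
        call.input_tokens * rates["input"]
        + call.cached_tokens * rates["cached"]
        + call.output_tokens * rates["output"]
    ) / 1_000_000


def rollup(agent_id: str, customer_id: Optional[str], calls: list[LLMCall]) -> dict:
    """Collapse the LLM calls made during one agent run into a single run-level record."""
    return {
        "agent_id": agent_id,
        "customer_id": customer_id,   # tagged at run start, so later rollups can group by it
        "llm_calls": len(calls),
        "input_tokens": sum(c.input_tokens for c in calls),
        "cached_tokens": sum(c.cached_tokens for c in calls),
        "output_tokens": sum(c.output_tokens for c in calls),
        "cost_usd": sum(call_cost(c) for c in calls),
    }
```

Because `customer_id` travels with every run record, per-customer cost is a group-by over stored records rather than a change to the agent.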
The smallest thing that works is a single event per agent run, captured by an SDK that never blocks the agent. From that event, the three views (Spend, Pulse, Verify) are derived. Most teams build a v0 of this themselves before deciding what to buy.
    import os

    import agentping

    agentping.init(api_key=os.environ['AGENTPING_API_KEY'])

    # `agent` and `ticket` come from your own application code.
    with agentping.run('support-triage', customer_id='acme-corp') as run:
        response = agent.handle(ticket)
        run.set_output(response)
The run ID is generated client-side before any network call. Telemetry sends on a background thread with a 2-second hard timeout. If the service is unreachable, the agent runs as if no SDK is installed.
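One way to get that behaviour is a client-generated ID plus a queue drained by a daemon thread. The sketch below is not the SDK's actual implementation: `new_run_id`, `BackgroundSender`, and the `transport` callback are hypothetical names. It assumes the transport (for example, an HTTP POST) enforces the timeout it is handed, and it builds the ID by hand following the UUIDv7 bit layout in RFC 9562.

```python
import os
import queue
import threading
import time
import uuid
from typing import Callable


def new_run_id() -> str:
    """Build a UUIDv7-style ID locally: 48-bit millisecond timestamp, then random bits."""
    ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")            # 80 random bits
    value = ((ts_ms & ((1 << 48) - 1)) << 80) | rand
    value = (value & ~(0xF << 76)) | (0x7 << 76)             # version field = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)             # RFC variant (binary 10)
    return str(uuid.UUID(int=value))


class BackgroundSender:
    """Queue events and deliver them on a daemon thread so the agent never waits."""

    def __init__(self, transport: Callable[[dict, float], None],
                 timeout_s: float = 2.0) -> None:
        self._transport = transport       # assumed to honour the timeout it receives
        self._timeout_s = timeout_s
        self._queue: queue.Queue = queue.Queue(maxsize=1000)
        threading.Thread(target=self._drain, daemon=True).start()

    def send(self, event: dict) -> None:
        """Enqueue without blocking; drop the event if the queue is full."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            pass                          # losing telemetry beats slowing the agent

    def flush(self) -> None:
        """Wait until every queued event has been attempted (e.g. at process exit)."""
        self._queue.join()

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._transport(event, self._timeout_s)
            except Exception:
                pass                      # service unreachable: swallow and move on
            finally:
                self._queue.task_done()
```

The design choice worth noting is that every failure path (full queue, transport error) ends in a silent drop, so the worst case for the agent is missing telemetry, never a raised exception or a stalled run.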
run_id (client-generated UUIDv7, available before any network call)
agent_id, team_id, optional customer_id and feature tags
started_at / finished_at (client time) and received_at (server time)
status: success / fail / timeout
provider, model, input_tokens, output_tokens, cached_tokens
The features page walks through Spend, Pulse, and Verify with the specifics: rate card, alert routes, rubric format, drift detection thresholds. The docs go one level deeper into the API and SDK contract.