Traditional APM watches infrastructure. It is good at it. Sentry catches exceptions, Datadog measures latency and saturation, your application logs hold a searchable record of what happened on the box. For a web service that either responds or fails, that stack is roughly complete.
AI agents fail differently. A web service crashes loud: 500s, paging, customer screams. An agent rarely does. It keeps returning something. Usually plausible. Sometimes wrong. Always billable. The failure modes that matter for agents do not produce the signals APM was built to catch, and the tools you already have on the platform say everything is fine while the agent is degrading underneath them.
This post is the four-axis comparison most teams have to work out for themselves a few months into running agents in production. For the broader definitional read, see "What is AI agent observability?".
The four-axis comparison
Below is what each layer of the existing stack catches, and what it misses, when the thing under observation is an LLM-driven agent.
| Tool | What it catches | What it misses for agents |
|---|---|---|
| Sentry | Exceptions, unhandled errors, stack traces from inside your code. | The agent did not throw; it confidently returned the wrong answer. |
| Datadog | Latency, CPU, memory, request count, queue depth, infrastructure health. | None of that tells you the answer changed, or that the cost per run tripled. |
| OpenAI / provider usage | Tokens by API key, dollar spend by day, at the account level. | You cannot trace cost back to a specific agent, run, customer, or feature. The account-level number aggregates everyone using the key. |
| Your application logs | A wall of text, searchable at 03:00 when you know something is wrong. | Not browsable at planning time. No structure across runs. Latency between "something happened" and "we know what" is human-scale. |
Each tool is doing its job. None of them are doing the agent's job. The question "is this agent healthy" is not answerable from any of them, individually or together, because the data they collect is about the platform, not the agent.
What AI agent observability actually has to measure
Teams that build this from scratch tend to converge on the same three signals.
Cost attribution per agent, customer, and feature. Not "what did the team spend on tokens", but "which agent spent it, on which customer, for which feature". The provider invoice is the absolute number; the attribution is what tells you which work caused it. Without attribution, every spend incident is a forensic exercise on raw provider logs; with it, the dashboard answers the question in seconds.
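As a rough illustration of that roll-up, here is a minimal Python sketch. It assumes each run record already carries `agent_id`, `customer_id`, `feature`, and `cost`; the field names for customer and feature, the sample values, and the `attribute_cost` helper are all hypothetical, not part of any particular SDK.

```python
from collections import defaultdict

# Hypothetical run records. The customer_id and feature tags are assumed to be
# attached by the caller at the time the run is recorded.
runs = [
    {"agent_id": "ticket-classifier", "customer_id": "acme", "feature": "triage", "cost": 0.012},
    {"agent_id": "ticket-classifier", "customer_id": "globex", "feature": "triage", "cost": 0.010},
    {"agent_id": "lead-enricher", "customer_id": "acme", "feature": "crm-sync", "cost": 0.250},
    {"agent_id": "lead-enricher", "customer_id": "acme", "feature": "crm-sync", "cost": 0.300},
]


def attribute_cost(runs, *dimensions):
    """Sum run cost grouped by the given dimensions, largest spender first."""
    totals = defaultdict(float)
    for run in runs:
        key = tuple(run[d] for d in dimensions)
        totals[key] += run["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_agent = attribute_cost(runs, "agent_id")
by_customer_feature = attribute_cost(runs, "customer_id", "feature")

for key, total in by_agent:
    print(key, round(total, 4))
```

The same function answers "which agent", "which customer", and "which feature" by changing the dimensions, which is the point of tagging the run rather than the API key.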
Live monitoring with schedule freshness. Not just "did this request return 200", but "did this scheduled job actually fire in its expected window". Scheduled agents are common and the absence of work is itself a signal. A nightly summariser that went silent on the 14th is the worst case here; cheap to instrument, expensive to miss.
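A freshness check can be sketched in a few lines of Python. This assumes you know the expected cadence of each agent and the time its last run finished; the `is_stale` helper, the daily cadence, and the 30-minute grace period are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_finished_at, expected_every, grace, now):
    """Return True when no run has finished within the expected window plus grace."""
    if last_finished_at is None:
        return True  # the agent has never reported a run
    return now - last_finished_at > expected_every + grace


now = datetime(2024, 5, 20, 3, 0, tzinfo=timezone.utc)
daily = timedelta(days=1)
grace = timedelta(minutes=30)

# A nightly job whose last run finished six days ago is stale.
silent_since_14th = datetime(2024, 5, 14, 2, 0, tzinfo=timezone.utc)
silent_job_stale = is_stale(silent_since_14th, daily, grace, now)

# A job that finished 23 hours ago is still inside its window.
recent_job_stale = is_stale(now - timedelta(hours=23), daily, grace, now)

print(silent_job_stale, recent_job_stale)
```

The check runs on a timer, independent of the agent, because a job that never fires cannot report its own absence.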
Quality scoring with drift detection. Not just "the response was valid JSON", but "the response was a good answer". Drift is a leading indicator; by the time a customer complains, the trend line has been falling for a week. Continuous scoring on a rubric you defined once, with alerts on the distribution, closes the gap between "we changed something" and "we know it got worse".
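One simple way to sketch a drift alert in Python is a mean-shift check: compare the mean of a recent window of scores against a baseline window, measured in baseline standard deviations. This is a deliberately crude stand-in for a proper distribution test, and the `drift_detected` helper, the threshold, and the sample scores are all assumptions for illustration.

```python
from statistics import mean, stdev


def drift_detected(baseline, recent, max_drop_in_stdevs=2.0):
    """Flag drift when the recent mean falls well below the baseline mean.

    Both windows are lists of rubric scores; baseline needs at least two values.
    """
    base_mean = mean(baseline)
    base_sd = stdev(baseline)
    if base_sd == 0:
        # No spread in the baseline: any drop at all counts as drift.
        return mean(recent) < base_mean
    return (base_mean - mean(recent)) / base_sd > max_drop_in_stdevs


baseline_scores = [0.82, 0.85, 0.80, 0.84, 0.83, 0.81, 0.86]
recent_scores = [0.70, 0.68, 0.72, 0.66, 0.71]

degraded = drift_detected(baseline_scores, recent_scores)
unchanged = drift_detected(baseline_scores, baseline_scores)

print(degraded, unchanged)
```

Running the check on every new window of scored runs is what turns a monthly sample review into a trend line you can alert on.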
Those three signals are what the three pillars in AgentPing are named after (Spend, Pulse, Verify), but the naming is not the interesting part. The interesting part is that none of them are visible from any tool in the existing APM stack, and all three are catchable from a single per-run event.
Why logs and metrics are not enough
The natural unit of observation for an agent is the run. Not a log line, not a counter, not a span; the run.
A run is the operation the agent was asked to perform: classify a ticket, enrich a lead, summarise a document, answer a question. It has a start, an end, an outcome, a cost, and an event timeline inside it (the prompts, the tool calls, the model responses). Everything you want to know about the agent rolls up to the run.
Logs are too granular. A run produces dozens of log lines; you cannot reason about the run from the lines without re-aggregating them, and you cannot do that aggregation reliably across services. Metrics are too coarse. A counter of "agent calls" or a histogram of "agent latency" loses the per-run identity entirely; you cannot drill from "throughput dropped" to "which runs were affected".
The run is the operational atom. Build the system around it. Every event you emit attaches to a run id. The dashboard is a list of runs you can filter, group, and drill into. The cost roll-ups, the schedule checks, the quality scores all live on the run record. From there everything else falls out.
The minimum viable telemetry
The smallest amount of data that gives you the three pillars is six fields per run.
- `run_id`: a UUIDv7-derived identifier, client-generated, idempotent.
- `agent_id`: a stable identifier for the agent.
- `started_at`: when the agent began work, in client time.
- `finished_at`: when it stopped, in client time.
- `status`: one of `ok`, `fail`, `partial`.
- `cost` (or `model` + `input_tokens` + `output_tokens` for the rate card to compute it).
From those six fields you get per-agent cost attribution (group by agent, sum cost), schedule freshness (compare `finished_at` against the expected cadence), and per-agent run timelines. Six fields. One SDK call wrapping the agent.
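To make the six-field record concrete, here is a minimal Python sketch. The `Run` class, the `RATE_CARD` table, and its prices are hypothetical and do not describe the AgentPing SDK or any provider's real pricing. `uuid.uuid4()` stands in for a UUIDv7 generator, which older Python versions do not ship.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical rate card: USD per 1M tokens as (input, output). Not real prices.
RATE_CARD = {"example-model": (1.00, 4.00)}
VALID_STATUSES = {"ok", "fail", "partial"}


@dataclass
class Run:
    run_id: str
    agent_id: str
    started_at: datetime
    finished_at: datetime
    status: str
    cost: Optional[float] = None  # either set directly...
    model: Optional[str] = None   # ...or derived from model + token counts
    input_tokens: int = 0
    output_tokens: int = 0

    def __post_init__(self):
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def resolved_cost(self) -> float:
        """Return the explicit cost, or compute it from the rate card."""
        if self.cost is not None:
            return self.cost
        in_rate, out_rate = RATE_CARD[self.model]
        return (self.input_tokens * in_rate + self.output_tokens * out_rate) / 1_000_000


started = datetime(2024, 5, 20, 2, 0, tzinfo=timezone.utc)
run = Run(
    run_id=str(uuid.uuid4()),  # stand-in for a client-generated UUIDv7
    agent_id="nightly-summariser",
    started_at=started,
    finished_at=started + timedelta(seconds=42),
    status="ok",
    model="example-model",
    input_tokens=2000,
    output_tokens=500,
)

# Keying the store by run_id makes a retried send overwrite, not duplicate.
store = {}
store[run.run_id] = run
store[run.run_id] = run

print(run.resolved_cost(), len(store))
```

Generating `run_id` on the client is what makes the retry safe: the same run sent twice lands on the same key.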
Adding inputs and outputs to the run unlocks the next layer: quality scoring, drill-down on a bad run, audit log for compliance. Adding an event timeline inside the run unlocks the next: which tool call took the time, which model call produced the bad output. Each layer adds capability; none of them are mandatory to get the first three pillars working.
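As a sketch of what the event-timeline layer buys you, the Python below finds the slowest step inside a run. The event shape (`kind`, `name`, `duration_ms`) and the `slowest_step` helper are assumptions for illustration; the only fixed idea is that every event carries the `run_id` it belongs to.

```python
# Hypothetical events, each attached to the run that produced it.
events = [
    {"run_id": "r1", "kind": "model_call", "name": "plan", "duration_ms": 820},
    {"run_id": "r1", "kind": "tool_call", "name": "crm_lookup", "duration_ms": 4300},
    {"run_id": "r1", "kind": "model_call", "name": "answer", "duration_ms": 1100},
    {"run_id": "r2", "kind": "model_call", "name": "answer", "duration_ms": 900},
]


def slowest_step(events, run_id):
    """Return the longest-running event in a run, or None if the run has no events."""
    steps = [e for e in events if e["run_id"] == run_id]
    return max(steps, key=lambda e: e["duration_ms"]) if steps else None


bottleneck = slowest_step(events, "r1")
print(bottleneck["kind"], bottleneck["name"], bottleneck["duration_ms"])
```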
Start with six fields. Add the rest when you need them.
How adding AgentPing changes the on-call experience
The pager scenarios change in three concrete ways.
Before: customer emails support to say their nightly digest has not arrived. Support escalates to engineering. Engineering checks the cron. Cron is broken. Time-to-detect: 11 days. After: the schedule checker pages on-call 30 minutes after the missed window. The page carries the agent id, the last successful run, and a link to amend or replay. Time-to-detect: minutes.
Before: the monthly provider invoice triples. Finance asks the CTO why. The CTO spends three days joining provider logs against deploy records to find the agent that caused it. Time-to-detect: 33 days. After: the spend baseline alert fires the morning after a new agent spikes its cost. The page shows the per-run cost spike with example inputs. Time-to-detect: hours.
Before: support tickets tick up over two weeks. A monthly sample review finds quality dropped 22%. The team rolls back a prompt change. They cannot tell which customers got the degraded output. Time-to-detect: 17 days. After: the drift alert fires when the rolling score distribution diverges. The dashboard shows the score chart, the deploy marker, and example low-scoring runs. Time-to-detect: a day or two.
In all three cases, the existing APM stack stayed green throughout. The signal was never going to come from there. It was always going to need a new layer of measurement, built around the run as the operational atom, with cost, schedule, and quality as first-class properties.
That is what AI agent observability is. Not a new APM, not a Sentry replacement, not a logging tool. A different unit of measurement, sitting next to the platform stack you already have, watching the thing the platform stack was never designed to watch.
If you are running agents on top of Sentry plus Datadog plus a provider dashboard and a wall of logs, you already have the platform layer. What you do not have is the agent layer. Get started and wire up the three pillars on your most expensive agent first.