Beyond Sentry and Datadog: what observability for AI agents actually means

Traditional APM watches infrastructure. AI agents fail differently. A web service crashes loud; an agent keeps returning something. Usually plausible. Sometimes wrong. Always billable.

Matt King

April 21, 2026 · 10 min read

Traditional APM watches infrastructure. It is good at it. Sentry catches exceptions, Datadog measures latency and saturation, your application logs hold a searchable record of what happened on the box. For a web service that either responds or fails, that stack is roughly complete.

AI agents fail differently. A web service crashes loud: 500s, paging, customer screams. An agent rarely does. It keeps returning something. Usually plausible. Sometimes wrong. Always billable. The failure modes that matter for agents do not produce the signals APM was built to catch, and the tools you already have on the platform say everything is fine while the agent is degrading underneath them.

This post is the four-axis comparison most teams have to work out for themselves a few months into running agents in production. For the broader definition, see "What is AI agent observability?"

The four-axis comparison

Below is what each layer of the existing stack catches, and what it misses, when the thing under observation is an LLM-driven agent.

Tool	What it catches	What it misses for agents
Sentry	Exceptions, unhandled errors, stack traces from inside your code.	The agent did not throw; it confidently returned the wrong answer.
Datadog	Latency, CPU, memory, request count, queue depth, infrastructure health.	None of that tells you the answer changed, or that the cost per run tripled.
OpenAI / provider usage	Tokens by API key, dollar spend by day, at the account level.	You cannot trace cost back to a specific agent, run, customer, or feature. The account-level number aggregates everyone using the key.
Your application logs	A wall of text, searchable at 03:00 when you know something is wrong.	Not browsable at planning time. No structure across runs. Latency between "something happened" and "we know what" is human-scale.

Each tool is doing its job. None of them are doing the agent's job. The question "is this agent healthy" is not answerable from any of them, individually or together, because the data they collect is about the platform, not the agent.

What AI agent observability actually has to measure

The first time a team builds this from scratch they tend to converge on the same three signals.

Cost attribution per agent, customer, and feature. Not "what did the team spend on tokens", but "which agent spent it, on which customer, for which feature". The provider invoice is the absolute number; the attribution is what tells you which work caused it. Without attribution, every spend incident is a forensic exercise on raw provider logs; with it, the dashboard answers the question in seconds.
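As a rough illustration of what attribution means in practice, here is a minimal sketch that sums per-run cost along whichever dimensions you ask for. The record shape (agent_id, customer, feature, cost), the sample values, and the attribute_cost helper are all assumptions made for this example, not a fixed schema or an SDK function.

```python
from collections import defaultdict

# Hypothetical run records; field names and values are illustrative only.
runs = [
    {"agent_id": "lead-enrichment", "customer": "acme", "feature": "crm-sync", "cost": 0.42},
    {"agent_id": "lead-enrichment", "customer": "globex", "feature": "crm-sync", "cost": 0.10},
    {"agent_id": "summariser", "customer": "acme", "feature": "digest", "cost": 0.05},
    {"agent_id": "summariser", "customer": "acme", "feature": "digest", "cost": 0.07},
]


def attribute_cost(runs, *dimensions):
    """Sum run cost grouped by the given dimensions, e.g. agent_id and customer."""
    totals = defaultdict(float)
    for run in runs:
        key = tuple(run[d] for d in dimensions)
        totals[key] += run["cost"]
    return dict(totals)


by_agent = attribute_cost(runs, "agent_id")
by_agent_customer = attribute_cost(runs, "agent_id", "customer")
by_feature = attribute_cost(runs, "feature")
```

The same records answer "which agent", "which customer", and "which feature" without touching provider logs, which is the point of attaching those dimensions to the run at write time.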

Live monitoring with schedule freshness. Not just "did this request return 200", but "did this scheduled job actually fire in its expected window". Scheduled agents are common and the absence of work is itself a signal. A nightly summariser that went silent on the 14th is the worst case here; cheap to instrument, expensive to miss.
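A freshness check can be very small. The sketch below assumes you store the last finished_at per agent and know the expected cadence; the is_stale name, the 30-minute grace period, and the sample timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_finished_at, expected_every, grace, now):
    """Return True when no run has finished within the expected cadence plus grace."""
    if last_finished_at is None:
        # An agent that has never reported is treated as stale.
        return True
    return now - last_finished_at > expected_every + grace


now = datetime(2026, 4, 21, 9, 0, tzinfo=timezone.utc)
nightly = timedelta(days=1)
grace = timedelta(minutes=30)

# Ran last night: inside the window.
ran_last_night = is_stale(datetime(2026, 4, 21, 2, 0, tzinfo=timezone.utc), nightly, grace, now)

# Went silent on the 14th: well outside the window.
silent_since_14th = is_stale(datetime(2026, 4, 14, 2, 0, tzinfo=timezone.utc), nightly, grace, now)
```

The check runs on the monitoring side, not inside the agent, because an agent that never fires cannot report its own absence.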

Quality scoring with drift detection. Not just "the response was valid JSON", but "the response was a good answer". Drift is a leading indicator; by the time a customer complains, the trend line has been falling for a week. Continuous scoring on a rubric you defined once, with alerts on the distribution, closes the gap between "we changed something" and "we know it got worse".
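One simple way to alert on a falling score is to compare a recent window against a baseline window. The sketch below uses a plain z-style test on the mean, which is a deliberately crude stand-in for a proper distribution comparison; the score_drifted name, the threshold of 2.0, and the sample scores are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def score_drifted(baseline, recent, z_threshold=2.0):
    """Flag drift when the recent mean sits more than z_threshold
    standard errors below the baseline mean."""
    base_mean = mean(baseline)
    base_std = pstdev(baseline)
    drop = base_mean - mean(recent)
    if base_std == 0:
        # A perfectly flat baseline: any drop at all counts as drift.
        return drop > 0
    standard_error = base_std / sqrt(len(recent))
    return drop / standard_error > z_threshold


# Hypothetical rubric scores in the range 0..1.
baseline_scores = [0.82, 0.85, 0.80, 0.84, 0.83, 0.81, 0.86, 0.84]
steady_scores = [0.83, 0.84, 0.82, 0.85]
degraded_scores = [0.70, 0.68, 0.72, 0.66]

steady_alert = score_drifted(baseline_scores, steady_scores)
degraded_alert = score_drifted(baseline_scores, degraded_scores)
```

This only fires on a drop, since an improving score is rarely worth a page. A production check would likely want a larger window and a test that looks at the whole distribution rather than just the mean.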

Those three signals are what the three pillars in AgentPing are named after (Spend, Pulse, Verify), but the naming is not the interesting part. The interesting part is that none of them are visible from any tool in the existing APM stack, and all three are catchable from a single per-run event.

Why logs and metrics are not enough

The natural unit of observation for an agent is the run. Not a log line, not a counter, not a span; the run.

A run is the operation the agent was asked to perform: classify a ticket, enrich a lead, summarise a document, answer a question. It has a start, an end, an outcome, a cost, and an event timeline inside it (the prompts, the tool calls, the model responses). Everything you want to know about the agent rolls up to the run.

Logs are too granular. A run produces dozens of log lines; you cannot reason about the run from the lines without re-aggregating them, and you cannot do that aggregation reliably across services. Metrics are too coarse. A counter of "agent calls" or a histogram of "agent latency" loses the per-run identity entirely; you cannot drill from "throughput dropped" to "which runs were affected".

The run is the operational atom. Build the system around it. Every event you emit attaches to a run id. The dashboard is a list of runs you can filter, group, and drill into. The cost roll-ups, the schedule checks, the quality scores all live on the run record. From there everything else falls out.
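To make the "everything attaches to a run id" idea concrete, here is a minimal sketch of a run record with an event timeline inside it. The Run and Event classes and their method names are hypothetical, not an SDK interface, and uuid4 stands in for the UUIDv7-style identifier only to keep the example to widely available standard-library calls.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Event:
    run_id: str   # every event carries the id of the run it belongs to
    kind: str     # e.g. "prompt", "tool_call", "model_response"
    at: datetime
    detail: str = ""


@dataclass
class Run:
    agent_id: str
    # Assumption: uuid4 as a stand-in for a client-generated, time-ordered id.
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    started_at: datetime = field(default_factory=_now)
    finished_at: Optional[datetime] = None
    status: Optional[str] = None
    cost: float = 0.0
    events: List[Event] = field(default_factory=list)

    def emit(self, kind: str, detail: str = "") -> None:
        """Append an event to this run's timeline."""
        self.events.append(Event(self.run_id, kind, _now(), detail))

    def finish(self, status: str, cost: float) -> None:
        """Close the run with its outcome and cost."""
        self.finished_at = _now()
        self.status = status
        self.cost = cost


run = Run(agent_id="ticket-classifier")
run.emit("prompt", "classify ticket")
run.emit("tool_call", "lookup_customer")
run.emit("model_response", "label=billing")
run.finish(status="ok", cost=0.0031)
```

Because cost, status, and the timeline all live on one object, a dashboard row, a cost roll-up, and a drill-down view can all be derived from the same record.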

The minimum viable telemetry

The smallest amount of data that gets the first two pillars working, Spend and Pulse, is six fields per run.

run_id: a UUIDv7-derived identifier, generated on the client so that a retried submission of the same run is idempotent.
agent_id: a stable identifier for the agent.
started_at: when the agent began work, in client time.
finished_at: when it stopped, in client time.
status: one of ok, fail, partial.
cost (or model + input_tokens + output_tokens for the rate card to compute it).

From those six fields you get per-agent cost attribution (group by agent, sum cost), schedule freshness (compare finished_at against the expected cadence), and per-agent run timelines. Six fields. One SDK call wrapping the agent.
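The six-field record, including the "model plus token counts" route to cost, might look something like the sketch below. The rate card prices are made up, "example-model" is not a real model name, and run_cost and build_run_record are hypothetical helpers written for this example.

```python
from datetime import datetime, timezone

# Assumption: placeholder prices in USD per million tokens, not real provider rates.
RATE_CARD = {"example-model": {"input": 1.00, "output": 3.00}}
VALID_STATUSES = {"ok", "fail", "partial"}


def run_cost(model, input_tokens, output_tokens, rate_card=RATE_CARD):
    """Compute cost in USD from token counts and a per-million-token rate card."""
    rates = rate_card[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


def build_run_record(run_id, agent_id, started_at, finished_at, status, cost):
    """Assemble the six-field run record, rejecting unknown statuses."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    return {
        "run_id": run_id,
        "agent_id": agent_id,
        "started_at": started_at.isoformat(),
        "finished_at": finished_at.isoformat(),
        "status": status,
        "cost": cost,
    }


record = build_run_record(
    run_id="run-0001",  # placeholder; a real id would be client-generated
    agent_id="nightly-summariser",
    started_at=datetime(2026, 4, 21, 2, 0, 0, tzinfo=timezone.utc),
    finished_at=datetime(2026, 4, 21, 2, 0, 42, tzinfo=timezone.utc),
    status="ok",
    cost=run_cost("example-model", input_tokens=2000, output_tokens=500),
)
```

Validating the status at the edge keeps the three-value vocabulary (ok, fail, partial) from drifting, which matters once alerts start grouping on it.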

Adding inputs and outputs to the run unlocks the next layer: quality scoring, drill-down on a bad run, and an audit log for compliance. Adding an event timeline inside the run unlocks the one after that: which tool call took the time, which model call produced the bad output. Each layer adds capability; none of them is mandatory to get cost attribution and schedule freshness working.

Start with six fields. Add the rest when you need them.

How adding AgentPing changes the on-call experience

The pager scenarios change in three concrete ways.

Before: customer emails support to say their nightly digest has not arrived. Support escalates to engineering. Engineering checks the cron. Cron is broken. Time-to-detect: 11 days. After: the schedule checker pages on-call 30 minutes after the missed window. The page carries the agent id, the last successful run, and a link to amend or replay. Time-to-detect: minutes.

Before: the monthly provider invoice triples. Finance asks the CTO why. The CTO spends three days joining provider logs against deploy records to find the agent that caused it. Time-to-detect: 33 days. After: the spend baseline alert fires the morning after a new agent spikes its cost. The page shows the per-run cost spike with example inputs. Time-to-detect: hours.

Before: support tickets tick up over two weeks. A monthly sample review finds quality dropped 22%. The team rolls back a prompt change. They cannot tell which customers got the degraded output. Time-to-detect: 17 days. After: the drift alert fires when the rolling score distribution diverges. The dashboard shows the score chart, the deploy marker, and example low-scoring runs. Time-to-detect: a day or two.

In all three cases, the existing APM stack stayed green throughout. The signal was never going to come from there. It was always going to need a new layer of measurement, built around the run as the operational atom, with cost, schedule, and quality as first-class properties.

That is what AI agent observability is. Not a new APM, not a Sentry replacement, not a logging tool. A different unit of measurement, sitting next to the platform stack you already have, watching the thing the platform stack was never designed to watch.

If you are running agents on top of Sentry plus Datadog plus a provider dashboard and a wall of logs, you already have the platform layer. What you do not have is the agent layer. Get started and wire up the three pillars on your most expensive agent first.

Why won't Sentry catch a bad LLM response?

Sentry watches for exceptions and unhandled errors. A bad LLM response is neither; it is a successfully returned string that happens to be wrong. The HTTP call to the provider returns 200, the JSON parses, the agent function returns cleanly, no stack trace is generated. From Sentry's perspective the request is healthy. The thing that has broken is the meaning of the output, and meaning is not a signal Sentry was built to read.

Can I use AgentPing alongside my existing APM?

Yes, and we recommend it. AgentPing covers what APM cannot: cost per agent, schedule freshness, output quality. Your existing APM still covers infrastructure: CPU, memory, latency, exceptions, throughput. The two are complementary because they answer different questions. A typical setup keeps Sentry or Datadog for the platform layer and wires AgentPing in for the agent layer; the same run id can be propagated as a trace id so you can pivot between the two views on a single incident.

Does AgentPing replace Datadog for AI workloads?

No. Datadog measures infrastructure: hosts, containers, queues, request latency. If your agent is slow because the underlying pod is starved of memory, Datadog is the tool that shows you. AgentPing measures the agent itself: which runs cost how much, which scheduled jobs missed their window, which outputs scored badly on your rubric. The question "is the infrastructure healthy" stays with Datadog. The question "is the agent doing its job" sits with AgentPing.

What is the minimum data I need to send to get value?

A run id, an agent id, a start time, an end time, a status, and a cost (or the model and token counts so we compute the cost). That is six fields and one SDK call. From those six fields you get cost attribution per agent, schedule freshness, and a baseline you can alert on. Adding inputs, outputs, and an event timeline gives you quality scoring and drill-down, but you do not need them to cover the cost and schedule blind spots.

How does AgentPing handle cost attribution when an agent calls a tool that calls another agent?

Every run can declare a parent_run_id, which the SDK propagates automatically when one agent invokes another. The dashboard shows runs in a tree: a "lead-enrichment" run can have child runs for "company-research", "person-research", and "summariser", each with their own cost. The aggregate cost rolls up to the parent so finance sees a single number per top-level invocation, and the engineering view drills into where the spend went. Attribution is correct regardless of how many tools or sub-agents a run touches.
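The roll-up described above can be sketched as a small tree walk over parent_run_id. The run ids, costs, and the rolled_up_cost helper are hypothetical and only illustrate the idea; this is not how the product necessarily computes it.

```python
from collections import defaultdict

# Hypothetical runs: one top-level invocation with three child runs.
runs = [
    {"run_id": "r1", "parent_run_id": None, "agent_id": "lead-enrichment", "cost": 0.02},
    {"run_id": "r2", "parent_run_id": "r1", "agent_id": "company-research", "cost": 0.10},
    {"run_id": "r3", "parent_run_id": "r1", "agent_id": "person-research", "cost": 0.08},
    {"run_id": "r4", "parent_run_id": "r1", "agent_id": "summariser", "cost": 0.03},
]


def rolled_up_cost(runs):
    """Return each run's own cost plus the cost of all of its descendants."""
    own = {}
    children = defaultdict(list)
    for run in runs:
        own[run["run_id"]] = run["cost"]
        if run["parent_run_id"] is not None:
            children[run["parent_run_id"]].append(run["run_id"])

    def total(run_id):
        # Recursive sum; assumes parent links form a tree with no cycles.
        return own[run_id] + sum(total(child) for child in children[run_id])

    return {run_id: total(run_id) for run_id in own}


totals = rolled_up_cost(runs)
top_level_total = totals["r1"]
```

Finance reads the top-level total; engineering reads the per-child entries in the same dictionary to see where the spend went.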