Three blind spots when running AI agents in production

Silent breakage, mystery bills, and quality drift. The three failure modes that teams shipping AI agents keep walking into, and the smallest amount of instrumentation that catches each one on day one.

The three production incidents we hear about most often when we talk to teams shipping AI agents are not the ones the LLM vendor docs warn you about. They are not jailbreaks, prompt injections, or hallucinations. They are operational. They are mundane. They are the kind of failure that a few well-placed metrics would have caught on day one, but that a customer caught first instead.

This post walks through the three patterns, with real numbers from real teams, and shows what changes when each agent run is treated as a first-class operational event rather than a black box.


Blind spot 01: the silent failure

A team we spoke with ran a support-triage agent that classified incoming tickets and routed them to the right queue. It had run cleanly for months. Then one Tuesday, an upstream API that the agent depended on for customer metadata started returning empty strings instead of customer ids. The agent did what LLMs do; it produced plausible-looking JSON output anyway, with the customer field set to "unknown".

The output was syntactically valid. The downstream queue accepted it. The dashboard said everything was fine.

The agent had been broken for eleven days when a customer wrote in to ask why nobody had replied to their ticket.

Eleven days of garbage routing. Roughly four thousand tickets pushed into the wrong queue. The team had health checks on the API, error rates on the queue, even a dashboard for the agent's run count. None of them fired because none of them measured what mattered, which was whether the agent was actually doing its job.

What catches this

You need a heartbeat for every scheduled agent, plus a definition of "expected" beyond just "ran without throwing". The cheapest version is a record per agent run that includes the inputs and outputs, indexed by agent id, with an expected cadence. If a scheduled agent skips its window, you get paged. If a sampled output stops matching the rubric you defined for "good", you get paged. The two together cover both modes of silent failure: the agent that stopped running and the agent that kept running but stopped working.

The bar is low. You do not need an elaborate eval suite to catch this kind of failure. You need a tool that knows the agent should have run at 09:00 and a tool that knows what a sensible classification result looks like. For a deeper read, see silent AI agent failure.
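To make the two checks concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a reference implementation: the `Schedule` type, the grace period, the queue names, and the `customer` and `queue` field names are all hypothetical, chosen to mirror the support-triage story above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Schedule:
    agent_id: str
    interval: timedelta  # expected cadence, e.g. one run every 24 hours
    grace: timedelta     # slack allowed before a miss counts as an incident


def missed_window(schedule: Schedule, last_run: Optional[datetime], now: datetime) -> bool:
    """True when no run has been recorded within interval + grace."""
    if last_run is None:
        return True
    return now - last_run > schedule.interval + schedule.grace


# Hypothetical queue names for the triage example.
ALLOWED_QUEUES = {"billing", "technical", "account"}


def looks_sensible(output: dict) -> bool:
    """Rubric for a triage result: a real customer id and a known queue."""
    customer = str(output.get("customer", "")).strip()
    return customer not in ("", "unknown") and output.get("queue") in ALLOWED_QUEUES
```

The first function catches the agent that stopped running. The second catches the agent that kept running but started emitting "unknown" customers, which is the case no API health check or queue error rate would have flagged.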


Blind spot 02: the mystery bill

A different team, a Series A SaaS company. Their token bill in March was £2,400. In April it was £8,900. They knew the absolute number because the invoice from the provider was right there in the inbox, but they did not know which agent had caused the jump, because every agent on the team shared the same API key.

They spent three days reconstructing the cost breakdown from raw provider logs, joining timestamps to deployment records to git commits to figure out that a new "research-agent" they had shipped in late March was retrying on every soft failure, and the retries were carrying the full conversation history each time. One bad agent had run up roughly two-thirds of the month's bill on its own.

That story is unusually clean. More often, the bill creeps. A prompt change here, a longer context window there, a new feature that ships an extra tool call per turn. By the time anyone notices, you cannot point at the cause, only the cumulative effect.

What catches this

Per-agent cost attribution, computed at the time of the run, not the time of the bill. Every event carries the model, the input tokens, the output tokens, and a price resolved against a rate card. The dashboard shows cost-per-run by agent, day-over-day. A spend baseline is computed for each agent. If an agent's spend doubles overnight, you find out the morning of, not the first of next month.
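A rough sketch of what run-time pricing and a baseline check could look like. The rate card below is invented for illustration (the model name and prices are not real provider rates), and using the median of recent days as the baseline is one reasonable choice among several, not a prescribed method.

```python
from statistics import median

# Hypothetical rate card: price per million tokens, in pounds.
# These numbers are placeholders, not real provider prices.
RATE_CARD = {
    "example-model": {"input": 2.00, "output": 8.00},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price a single run at the moment it happens."""
    rates = RATE_CARD[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


def spend_doubled(recent_daily_spend: list[float], today: float, factor: float = 2.0) -> bool:
    """Compare today's spend for one agent with the median of its recent days."""
    if not recent_daily_spend:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(recent_daily_spend)
    return baseline > 0 and today >= factor * baseline
```

Because the price is resolved per run and stored with the run, the per-agent breakdown already exists when the invoice arrives, rather than having to be reconstructed from provider logs.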

Cost-per-successful-run is the metric finance actually wants to budget around. It is also the metric that catches loops, because a loop on a single agent inflates its cost-per-run by an order of magnitude in a few hours, well before the month-end bill arrives. See why your token bill keeps growing for the full pattern.
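As a sketch, cost-per-successful-run can be computed directly from the run events. The event fields used here (`agent_id`, `cost`, `ok`) are assumed names for illustration. The point of the calculation is that failed runs and retries still add to the numerator while only successes count in the denominator, which is why a retry loop shows up so quickly.

```python
from collections import defaultdict


def cost_per_successful_run(runs: list[dict]) -> dict[str, float]:
    """Total spend per agent divided by its number of successful runs.

    Each run is a dict with 'agent_id', 'cost', and 'ok' (hypothetical names).
    An agent with spend but no successes gets float('inf').
    """
    cost: dict[str, float] = defaultdict(float)
    successes: dict[str, int] = defaultdict(int)
    for run in runs:
        cost[run["agent_id"]] += run["cost"]
        if run["ok"]:
            successes[run["agent_id"]] += 1
    return {
        agent: (cost[agent] / successes[agent]) if successes[agent] else float("inf")
        for agent in cost
    }
```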


Blind spot 03: the quality drift

The third pattern is the hardest to instrument because it does not look like a failure. Everything still runs. The bill is normal. The dashboard is green. The output is just worse than it used to be.

A team shipped what looked like a small prompt tweak to a content-generation agent. A few words rearranged, a system prompt updated for clarity. They had no scoring harness in production, only in CI on a small fixed set of test cases that all still passed.

Two weeks later, they ran their monthly sample review and found that output quality had dropped 22% by their internal rubric. They rolled back the prompt change. They cannot fully reconstruct what content went out in those two weeks. They cannot tell which customers got the degraded version. The rollback fixed the future. It did nothing about the past.

What catches this

Score every run, in production, against a rubric you defined once. The rubric can be a JSON schema for shape, a set of pass/fail checks for must-have content, an LLM-as-judge call for nuance, or any combination. Whatever it is, the score gets written to the same event as the run itself, so you can plot it over time, by agent, by version, and watch the trend line move when you ship.
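Here is one way such a rubric might be combined into a single score, as a sketch only. The required fields and the must-have phrase are hypothetical, the shape check is a hand-rolled stand-in for a proper JSON schema validator, and `judge` is a placeholder for whatever LLM-as-judge call you use.

```python
from typing import Callable, Optional

REQUIRED_FIELDS = {"title": str, "body": str}  # hypothetical output shape
MUST_HAVE = ["unsubscribe"]                    # hypothetical must-have phrase


def shape_ok(output: dict) -> bool:
    """Every required field is present and has the expected type."""
    return all(isinstance(output.get(key), kind) for key, kind in REQUIRED_FIELDS.items())


def content_ok(output: dict) -> bool:
    """Every must-have phrase appears in the body, case-insensitively."""
    body = str(output.get("body", "")).lower()
    return all(phrase in body for phrase in MUST_HAVE)


def score_run(output: dict, judge: Optional[Callable[[dict], float]] = None) -> float:
    """Return a score in [0, 1]; a malformed output scores 0 outright."""
    if not shape_ok(output):
        return 0.0
    checks = [1.0 if content_ok(output) else 0.0]
    if judge is not None:
        # Clamp the judge's answer so a misbehaving judge cannot skew the score.
        checks.append(max(0.0, min(1.0, judge(output))))
    return sum(checks) / len(checks)
```

Averaging the checks is a simplification. Weighting them, or treating a must-have miss as a hard failure, may suit some agents better.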

Drift is a leading indicator. By the time a customer complains, the trend line has been falling for a week. By the time you catch it in a quarterly review, you have shipped a month of degraded output to production. Continuous scoring closes the gap between "we changed something" and "we know it got worse". See LLM output quality drift for the full pattern.
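Once the score lives on the run event, attributing a drop to a deploy is a small aggregation. The sketch below assumes each event carries a `version` and a `score` field (names chosen for illustration) and uses an arbitrary tolerance of 0.05, which you would tune to your own rubric's noise.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(events: list[dict]) -> dict[str, float]:
    """Average rubric score for each agent version seen in the events."""
    by_version: dict[str, list[float]] = defaultdict(list)
    for event in events:
        by_version[event["version"]].append(event["score"])
    return {version: mean(scores) for version, scores in by_version.items()}


def regressed(events: list[dict], old: str, new: str, tolerance: float = 0.05) -> bool:
    """True when the new version's mean score falls below the old one by more than tolerance."""
    means = mean_score_by_version(events)
    return means[new] < means[old] - tolerance
```

Run on a rolling window after each deploy, a check like this moves the discovery from a periodic sample review to the first hours after the change.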


How AgentPing addresses each one

The reason these three blind spots get talked about together is that they share a common shape. They all need the same primitive, which is a record per agent run that captures inputs, outputs, tokens, cost, and timing, tagged with the agent id and a run id. Build that primitive once and all three problems become tractable.
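For teams building the primitive themselves, a minimal sketch of the record and a wrapper that emits it might look like the following. This is not AgentPing's SDK. `RunEvent`, `record_run`, and the `sink` callback are hypothetical names, and token counts and cost are left at their defaults because where they come from depends on your provider's response format.

```python
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Optional


@dataclass
class RunEvent:
    agent_id: str
    inputs: Any
    outputs: Any = None
    input_tokens: int = 0          # fill in from the provider response
    output_tokens: int = 0
    cost: float = 0.0              # priced against a rate card at run time
    started_at: float = 0.0        # Unix timestamp
    duration_s: float = 0.0
    error: Optional[str] = None
    score: Optional[float] = None  # rubric score, if one is defined
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def record_run(agent_id: str, agent_fn: Callable[[Any], Any], inputs: Any,
               sink: Callable[[dict], None]) -> Any:
    """Run the agent and emit exactly one event, whether it succeeds or raises."""
    event = RunEvent(agent_id=agent_id, inputs=inputs, started_at=time.time())
    t0 = time.perf_counter()
    try:
        event.outputs = agent_fn(inputs)
        return event.outputs
    except Exception as exc:
        event.error = repr(exc)
        raise
    finally:
        event.duration_s = time.perf_counter() - t0
        sink(asdict(event))
```

In this sketch `sink` can be anything that accepts a dict: a list's `append` in tests, or a function that writes to a queue or database in production. Heartbeats, spend baselines, and scores can then all be derived from the same stream of events.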

That is what AgentPing is. The SDK call is one line wrapping your existing agent. The event flows to our edge, gets a cost computed against a rate card, gets a heartbeat checked against the agent's schedule, and gets scored against any rubric you have defined. The dashboard surfaces each pillar separately so the on-call engineer, the finance owner, and the prompt engineer each have a page that answers their question without scrolling past everyone else's noise.

  • Pulse covers silent failure. Every scheduled agent has an expected cadence. When it misses, you get paged with the last successful run, the inputs the next run would have taken, and a link to amend or replay.
  • Spend covers the mystery bill. Every run has a cost attached, every agent has a spend baseline, and any agent that doubles its baseline overnight triggers an alert before the bill lands.
  • Verify covers quality drift. Every run is scored against the rubric you defined for that agent, on a continuous basis. The dashboard shows the score over time, broken down by version of the agent, so you can attribute a drop to the deploy that caused it.

If you are running agents in production today, you do not necessarily need AgentPing. You do need the underlying primitive. Build it yourself, buy it from us, do whatever fits your infrastructure. What you do not want is to keep running blind. The three stories above are not edge cases. They are the typical first-year experience of any team that ships agents without instrumentation, and the cost of catching them late is consistently higher than the cost of catching them on day one.


AgentPing is built for exactly this. If silent failure, mystery bills, or quality drift are problems you have hit, or expect to hit soon, get started and we will get you set up.

Frequently asked questions
Why do AI agents fail silently more often than traditional services?
A typical web service either returns a response or throws an error your monitoring picks up. An AI agent often does neither. It runs, it produces output, the output is technically valid JSON, and only a human reading the result later realises it has been degenerating for days. There is no exception, no 500, no failed health check. The bytes flow, the queue drains, the dashboards stay green. What looks like uptime is actually a slow regression that only surfaces when a customer complains.
What is cost attribution and why does it matter for agents?
Cost attribution is the practice of tagging every token spent with the agent, customer, and feature that caused it. Without it, your provider bill arrives as a single line item and you have no way to answer "which agent caused last month's spike" or "what does it cost us to serve this customer". With it, every run carries its own cost ticket, you can compare cost-per-successful-run across agents, and you can set per-agent budgets that page you the moment one breaks its own baseline.
How is monitoring AI agents different from monitoring web services?
Three differences matter. First, the output is non-deterministic, so traditional assertion-based testing breaks down and you need rubric-based scoring instead. Second, the cost-per-request is variable and high, so a small loop bug can cost thousands of pounds before it trips a normal alert. Third, scheduled agents are common, so absence of work is itself a signal you need to monitor: an agent that was supposed to run every minute and has done nothing for an hour is broken.
Can I instrument all three blind spots with one tool?
Yes, that is the design AgentPing is built around. The SDK emits a single event per agent run that carries the inputs, outputs, tokens, cost, and timing. From that one event we derive heartbeat schedules (Pulse), per-agent cost baselines (Spend), and quality scores (Verify). You do not need to wire up three separate vendors and try to correlate them; the same run id ties everything together.
What is the minimum I need to do today to catch these failures?
Add one SDK call wrapping each agent run, send it to a service that retains the events for at least 30 days, and set up three things on top of those events: a heartbeat per scheduled agent, a daily spend baseline per agent, and a scoring rubric for the outputs that matter. That is roughly an hour of work and covers all three blind spots. You can do it with AgentPing or build it yourself; the important thing is that you do it before the first incident, not after.