Anatomy of an £8,900 token bill

A composite incident pieced together from three real teams. One agent, one retry loop, one ballooning context window, and a bill nobody saw coming until the first of the month. What the daily view would have shown, and on which day.

This is a composite. The numbers come from three different teams who have hit variants of the same problem. None of them caught it in time; all of them wanted to.

The setup. Series A SaaS, six agents in production, single shared OpenAI account, no per-agent attribution. Monthly token bill had been £2,400 for the previous two months, predictable enough that nobody was watching it daily. Finance signed off on the March invoice and moved on. The April invoice was £8,900.

Below is what actually happened, day by day, and what each daily view should have shown if the team had been looking.


The incident, reconstructed

A new agent called research-agent had shipped on March 28th, four days before the start of April. Its job was to enrich an incoming lead with public-web research, then produce a summary. The team had tested it in CI against a fixed set of leads; everything passed.

In production, around 8% of leads failed the agent's first call to a downstream tool. The agent had been written with retry-on-failure baked into its loop. The retry passed the full conversation history back to the model, which is a normal-looking thing to do because the model needs the context to know what it was trying to accomplish. What the team did not realise was that the retry policy had no cap; it would retry until either the call succeeded or the underlying context window filled.

For most failing leads, the retry succeeded on the second or third attempt and burned a modest amount of extra tokens. For the 1% of leads that were structurally unrecoverable, the agent would retry six, ten, twenty times. Each retry added roughly 4,000 tokens of context. By the time the loop bottomed out at the model's hard limit, the single agent run had consumed 80,000 tokens of input for what should have been a 6,000-token operation.

A handful of those a day was tolerable. Three or four hundred a day, which is what research-agent was doing at the volume the new lead pipeline was sending it, was not.

Daily spend, in retrospect

If the team had been writing one cost-tagged event per agent run from day one, the daily totals would have looked roughly like this:

Date       | content-writer | research-agent | support-triage | email-cls | other  | TOTAL
-----------|----------------|----------------|----------------|-----------|--------|--------
Mar 28     |    £72         |    £4          |    £18         |    £6     |   £8   |  £108
Mar 29     |    £74         |   £61          |    £19         |    £6     |   £9   |  £169
Mar 30     |    £71         |  £142          |    £17         |    £6     |   £8   |  £244
Mar 31     |    £73         |  £198          |    £18         |    £6     |   £9   |  £304
Apr 01     |    £75         |  £234          |    £18         |    £6     |   £9   |  £342
Apr 02     |    £74         |  £261          |    £17         |    £6     |   £8   |  £366
Apr 03     |    £72         |  £278          |    £18         |    £6     |   £9   |  £383

The signal is in the second column. research-agent goes from £4 a day to £142 a day in 48 hours, then keeps climbing. By April 3rd it is producing more spend than content-writer, which is a customer-facing agent that the team had always understood to be their biggest cost line.

That is the moment a baseline-based alert should have fired. The agent's first 24 hours of production traffic produced £4 of spend; an hour later it was producing £6 per hour, which annualises to a £52,000-a-year run rate, on an agent that was supposed to be a £2,000-a-year fixed-cost. The fact-of-the-jump is the alert; the absolute number is the size of it.

What the team saw instead

Nothing. They had no per-agent attribution. The shared API key meant the provider dashboard showed one rising line for the whole account, and that line had been rising for the previous two months as the company grew, so a 2x rise over the course of two weeks looked unremarkable. Nobody on the team was looking at provider spend daily, and even if they had been, they would not have been able to break it down by agent.

The bug was caught on May 1st, when the April invoice landed in finance's inbox and finance asked the CTO why the token line had nearly quadrupled. The CTO spent three days reconstructing the cost breakdown from raw provider logs, joining timestamps against deploy records, before identifying research-agent as the cause.

Time-to-detect: 33 days. Direct cost of the bug: roughly £6,500 of avoidable spend, plus three engineer-days of forensic work. The agent itself was fixed in 40 minutes once they had the data.


What spend attribution would have shown, on day one

Instrumenting cost attribution at the agent level changes the shape of this incident in three concrete ways.

  • Day one. research-agent ships on March 28th. Its first 50 runs cost roughly £4 total. The dashboard records that as its baseline.
  • Day two. Spend has multiplied by 15x in 24 hours. A baseline-anomaly alert fires. The on-call engineer opens the agent's page, sees the per-run token counts spiking on certain inputs, and within an hour has localised the bug to the retry loop.
  • Day three onwards. The bug is fixed before it has had time to compound. Direct cost of the bug: about £65, all incurred on the first day before the alert fired.

The mechanism is not exotic. The agent emits one event per run, the event carries model, input_tokens, output_tokens, and an agent_id. The price for that model is resolved against a rate card (we keep one for the major providers; you can override or add your own). The result is a per-run cost, attributable to the agent that produced it, with one row per run.

From that table everything else falls out:

  • Per-agent spend by day, week, and month, with a rolling baseline.
  • Cost-per-successful-run, which is the metric you want for budget conversations.
  • Per-customer and per-feature cost, if your agent tags those into the event (the SDK takes arbitrary metadata).
  • Top-spenders leaderboard, so finance can ask "what's our most expensive agent" and get an answer in seconds.

None of this requires you to change the way the agent itself works. The cost data lives in your observability layer, computed against the same events you already need for everything else (heartbeats, quality scoring, audit log).


The rate card matters

One detail that often gets missed when teams build this themselves. The provider's billed price is the source of truth, but it is also the price you find out about last. If you want to see cost in real time, you need a rate card that you maintain alongside the events, and you need to update it when the provider changes prices. A model that was £15/M tokens last quarter might be £10/M this quarter; if your rate card is wrong, your dashboard is confidently wrong.

AgentPing ships with a maintained rate card for the major providers. You can override any line in it for your account if you have negotiated custom pricing, or add new models the moment a provider ships them. The point is that the rate card is a first-class object, version-controlled, auditable, and applied at the event level, not after the fact in a spreadsheet. See the Spend features for the full implementation.


The takeaway

The £8,900 bill is not the interesting number. The interesting number is the gap between day two and day thirty-three, because everything that happened in that gap was avoidable.

Per-agent cost attribution is not a nice-to-have observability feature. It is the difference between catching a bad deploy on the morning of and finding out about it on the first of next month. If you are running agents on a shared provider key, with no per-run cost data, you are running on the same setup the team in this story had. The bug is not in your code yet, but the absence of attribution is the bug already.

If you would rather not write this yourself, that is exactly what AgentPing is for. Get started and wire up per-agent cost attribution on your most expensive agent first.

FAQ frequently asked
How can a single agent run up most of a monthly bill?
Two compounding factors. First, retry-on-failure with no cap, which multiplies the request count for any class of input the agent is failing on. Second, conversation-history accumulation, where each retry carries everything that came before it as context, so the input token count grows quadratically with the retry count. Combine the two and a single agent that fails ten percent of the time can easily produce ten times more spend than its peers, all on the same provider account, all on the same shared key.
Why does provider billing arrive too late to catch this kind of bug?
Provider invoices land monthly. By the time you see the number, the bug has been live for between one and thirty days. Even providers that offer near-real-time usage dashboards typically aggregate at the account level, not the agent level, so you can see total spend rising without being able to point at which of your agents caused it. The gap between "spend changed" and "we know why" is where the damage compounds.
What is a spend baseline and how is it different from a budget?
A budget is a fixed monthly ceiling. A baseline is a rolling per-agent expectation, learned from the agent's own recent history. A budget tells you that you are spending too much overall; a baseline tells you that one specific agent is suddenly spending twice as much as it did yesterday, which is the signal you actually want during an incident. The two are complementary; you want both, but a baseline catches anomalies a budget never will.
Can I attribute cost without changing my agent code?
You need at minimum a tag on the request that identifies the agent. With AgentPing the SDK call generates a run id and emits the cost data; with a homegrown setup you would log model, prompt and completion tokens, and an agent id to a table somewhere and join against a rate card. Either way, the attribution lives in your code or in your wrapper, not in the provider invoice. The provider does not know your agents exist; you have to tell your tooling which run belongs to which.