This is a composite. The numbers come from three different teams who have hit variants of the same problem. None of them caught it in time; all of them wanted to.
The setup. Series A SaaS, six agents in production, single shared OpenAI account, no per-agent attribution. Monthly token bill had been £2,400 for the previous two months, predictable enough that nobody was watching it daily. Finance signed off on the March invoice and moved on. The April invoice was £8,900.
Below is what actually happened, day by day, and what each daily view should have shown if the team had been looking.
The incident, reconstructed
A new agent called research-agent had shipped on March 28th, four days before the start of April. Its job was to enrich an incoming lead with public-web research, then produce a summary. The team had tested it in CI against a fixed set of leads; everything passed.
In production, around 8% of leads failed the agent's first call to a downstream tool. The agent had been written with retry-on-failure baked into its loop. The retry passed the full conversation history back to the model, which is a normal-looking thing to do because the model needs the context to know what it was trying to accomplish. What the team did not realise was that the retry policy had no cap; it would retry until either the call succeeded or the underlying context window filled.
For most failing leads, the retry succeeded on the second or third attempt and burned a modest amount of extra tokens. For the 1% of leads that were structurally unrecoverable, the agent would retry six, ten, twenty times. Each retry added roughly 4,000 tokens of context. By the time the loop bottomed out at the model's hard limit, the single agent run had consumed 80,000 tokens of input for what should have been a 6,000-token operation.
A handful of those a day was tolerable. Three or four hundred a day, which is what research-agent was doing at the volume the new lead pipeline was sending it, was not.
Daily spend, in retrospect
If the team had been writing one cost-tagged event per agent run from day one, the daily totals would have looked roughly like this:
Date | content-writer | research-agent | support-triage | email-cls | other | TOTAL
-----------|----------------|----------------|----------------|-----------|--------|--------
Mar 28 | £72 | £4 | £18 | £6 | £8 | £108
Mar 29 | £74 | £61 | £19 | £6 | £9 | £169
Mar 30 | £71 | £142 | £17 | £6 | £8 | £244
Mar 31 | £73 | £198 | £18 | £6 | £9 | £304
Apr 01 | £75 | £234 | £18 | £6 | £9 | £342
Apr 02 | £74 | £261 | £17 | £6 | £8 | £366
Apr 03 | £72 | £278 | £18 | £6 | £9 | £383
The signal is in the second column. research-agent goes from £4 a day to £142 a day in 48 hours, then keeps climbing. By April 3rd it is producing more spend than content-writer, which is a customer-facing agent that the team had always understood to be their biggest cost line.
That is the moment a baseline-based alert should have fired. The agent's first 24 hours of production traffic produced £4 of spend; an hour later it was producing £6 per hour, which annualises to a £52,000-a-year run rate, on an agent that was supposed to be a £2,000-a-year fixed-cost. The fact-of-the-jump is the alert; the absolute number is the size of it.
What the team saw instead
Nothing. They had no per-agent attribution. The shared API key meant the provider dashboard showed one rising line for the whole account, and that line had been rising for the previous two months as the company grew, so a 2x rise over the course of two weeks looked unremarkable. Nobody on the team was looking at provider spend daily, and even if they had been, they would not have been able to break it down by agent.
The bug was caught on May 1st, when the April invoice landed in finance's inbox and finance asked the CTO why the token line had nearly quadrupled. The CTO spent three days reconstructing the cost breakdown from raw provider logs, joining timestamps against deploy records, before identifying research-agent as the cause.
Time-to-detect: 33 days. Direct cost of the bug: roughly £6,500 of avoidable spend, plus three engineer-days of forensic work. The agent itself was fixed in 40 minutes once they had the data.
What spend attribution would have shown, on day one
Instrumenting cost attribution at the agent level changes the shape of this incident in three concrete ways.
- Day one.
research-agentships on March 28th. Its first 50 runs cost roughly £4 total. The dashboard records that as its baseline. - Day two. Spend has multiplied by 15x in 24 hours. A baseline-anomaly alert fires. The on-call engineer opens the agent's page, sees the per-run token counts spiking on certain inputs, and within an hour has localised the bug to the retry loop.
- Day three onwards. The bug is fixed before it has had time to compound. Direct cost of the bug: about £65, all incurred on the first day before the alert fired.
The mechanism is not exotic. The agent emits one event per run, the event carries model, input_tokens, output_tokens, and an agent_id. The price for that model is resolved against a rate card (we keep one for the major providers; you can override or add your own). The result is a per-run cost, attributable to the agent that produced it, with one row per run.
From that table everything else falls out:
- Per-agent spend by day, week, and month, with a rolling baseline.
- Cost-per-successful-run, which is the metric you want for budget conversations.
- Per-customer and per-feature cost, if your agent tags those into the event (the SDK takes arbitrary metadata).
- Top-spenders leaderboard, so finance can ask "what's our most expensive agent" and get an answer in seconds.
None of this requires you to change the way the agent itself works. The cost data lives in your observability layer, computed against the same events you already need for everything else (heartbeats, quality scoring, audit log).
The rate card matters
One detail that often gets missed when teams build this themselves. The provider's billed price is the source of truth, but it is also the price you find out about last. If you want to see cost in real time, you need a rate card that you maintain alongside the events, and you need to update it when the provider changes prices. A model that was £15/M tokens last quarter might be £10/M this quarter; if your rate card is wrong, your dashboard is confidently wrong.
AgentPing ships with a maintained rate card for the major providers. You can override any line in it for your account if you have negotiated custom pricing, or add new models the moment a provider ships them. The point is that the rate card is a first-class object, version-controlled, auditable, and applied at the event level, not after the fact in a spreadsheet. See the Spend features for the full implementation.
The takeaway
The £8,900 bill is not the interesting number. The interesting number is the gap between day two and day thirty-three, because everything that happened in that gap was avoidable.
Per-agent cost attribution is not a nice-to-have observability feature. It is the difference between catching a bad deploy on the morning of and finding out about it on the first of next month. If you are running agents on a shared provider key, with no per-run cost data, you are running on the same setup the team in this story had. The bug is not in your code yet, but the absence of attribution is the bug already.
If you would rather not write this yourself, that is exactly what AgentPing is for. Get started and wire up per-agent cost attribution on your most expensive agent first.