Open the OpenAI usage dashboard right now and you will see a chart of your spend. The number is real, it is your actual spend, and it is almost useless for running a business, because it answers a question nobody is asking. Nobody wants to know "what did the account spend yesterday". People want to know which agent, which customer, and which feature spent it. The account-level number cannot tell you any of those, and the gap between the number you have and the number you need is where most cost problems hide.
This post is the breakdown most teams reconstruct by hand a few months in, usually during an incident. For the incident itself, see the anatomy of an £8,900 token bill; this is the build, not the postmortem.
Why the provider number cannot answer your questions
Three structural problems with the provider dashboard.
It aggregates by key. Most teams ship several agents behind one API key because rotating per-agent keys is operational friction nobody signs up for early. The moment two agents share a key, the provider can no longer separate them, and neither can you from its dashboard.
It reports late. Usage data lands on a delay, aggregated by day. A loop that starts at 09:00 is not visible as a distinct spike until the next day's bucket fills, by which point the loop has run for hours.
It has no idea what an agent is. The provider sees requests. It does not see your support-triage agent, your enterprise customer, or your lead-scoring feature, because those are concepts that live in your code. Nothing the provider can show you will ever carry those labels, because you never sent them.
The fix for all three is the same: compute cost at the point of the run, in your own code, tagged with the dimensions you care about.
The unit that makes this work: the cost-tagged run
The whole system rests on one primitive, a record per agent run that carries enough to compute and attribute cost:
{
  "run_id": "run_018f4c2a...",
  "agent_id": "support-triage",
  "model": "gpt-4o",
  "input_tokens": 2480,
  "output_tokens": 514,
  "customer_id": "cus_8821",
  "feature": "inbox-autoreply",
  "parent_run_id": null,
  "started_at": "2026-06-09T09:14:02Z",
  "finished_at": "2026-06-09T09:14:05Z",
  "status": "ok"
}
The price is not in that payload, and that is deliberate. You resolve the model name against a rate card at ingest time, so the cost is computed consistently and stays correct even when you backfill or when a provider changes prices. The run carries the facts; the rate card turns the facts into money.
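A minimal sketch of what resolving against a rate card could look like. The `Rate` class, the `resolve_cost` function, and the effective-date lookup are assumptions for illustration, and the prices are placeholders, not real provider rates. Dating each rate is one way to keep backfills correct after a price change, because an old run is priced with the rate that applied when it ran.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class Rate:
    model: str
    effective_from: datetime
    input_per_million: Decimal   # price per 1M input tokens
    output_per_million: Decimal  # price per 1M output tokens


# Illustrative placeholder prices, NOT real provider rates.
RATE_CARD = [
    Rate("gpt-4o", datetime(2026, 1, 1, tzinfo=timezone.utc), Decimal("2.00"), Decimal("8.00")),
    Rate("gpt-4o", datetime(2026, 7, 1, tzinfo=timezone.utc), Decimal("1.50"), Decimal("6.00")),
]


def resolve_cost(run: dict, rate_card: list = RATE_CARD) -> Decimal:
    """Price a run with the rate that was in effect when the run started."""
    started = datetime.fromisoformat(run["started_at"].replace("Z", "+00:00"))
    in_effect = [r for r in rate_card
                 if r.model == run["model"] and r.effective_from <= started]
    if not in_effect:
        raise LookupError(f"no rate for {run['model']} at {run['started_at']}")
    rate = max(in_effect, key=lambda r: r.effective_from)
    return (run["input_tokens"] * rate.input_per_million
            + run["output_tokens"] * rate.output_per_million) / Decimal(1_000_000)


run = {"model": "gpt-4o", "input_tokens": 2480, "output_tokens": 514,
       "started_at": "2026-06-09T09:14:02Z"}
cost = resolve_cost(run)  # in whatever currency the rate card uses
```

Raising on an unknown model is a deliberate choice in this sketch: a run that silently prices at zero is harder to spot than one that fails at ingest.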
From this one row, every breakdown below is a GROUP BY.
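To make the GROUP BY concrete, here is a small in-memory sketch. It assumes each run already carries a resolved `cost` field attached at ingest; that field name, the `breakdown` helper, and the sample rows are hypothetical. In a real system this would more likely be a SQL query over the runs table.

```python
from collections import defaultdict
from decimal import Decimal


def breakdown(runs, dimension):
    """Sum cost per (day, dimension value), like GROUP BY day, <dimension>."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for run in runs:
        day = run["started_at"][:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        key = run.get(dimension) or "(untagged)"
        totals[(day, key)] += run["cost"]
    return dict(totals)


# Hypothetical runs with cost already resolved at ingest.
runs = [
    {"agent_id": "support-triage", "customer_id": "cus_8821", "feature": "inbox-autoreply",
     "started_at": "2026-06-09T09:14:02Z", "cost": Decimal("0.010")},
    {"agent_id": "support-triage", "customer_id": "cus_1042", "feature": "inbox-autoreply",
     "started_at": "2026-06-09T10:02:11Z", "cost": Decimal("0.020")},
    {"agent_id": "lead-enrichment", "customer_id": "cus_8821", "feature": None,
     "started_at": "2026-06-09T11:30:00Z", "cost": Decimal("0.400")},
]

by_agent = breakdown(runs, "agent_id")
by_customer = breakdown(runs, "customer_id")
by_feature = breakdown(runs, "feature")
```

Runs with a missing tag land in an explicit "(untagged)" bucket rather than disappearing, so the breakdowns still sum to the account total.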
Breakdown 1: by agent
Group by agent_id, sum the resolved cost, bucket by day. This is the breakdown you alert on, because an agent is the unit that goes wrong. A loop, a runaway retry, a context window that quietly doubled after a prompt change; all of them show up first as one agent's line pulling away from the others.
The metric that matters here is not raw spend but cost-per-successful-run. Raw spend tells you an agent is expensive; cost-per-successful-run tells you whether it is expensive because it does a lot of valuable work or because it is burning tokens on failures and retries. An agent whose cost-per-run is flat but whose cost-per-successful-run is climbing is failing more often, and that is a signal a raw total will hide inside its own average.
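One plausible way to compute the two metrics, sketched under assumptions: cost-per-successful-run is taken here as total spend (failures included) divided by the number of successful runs, runs carry a resolved `cost` field, and any status other than "ok" counts as a failure. The `per_run_metrics` name and the "error" status value are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def per_run_metrics(runs):
    """Return cost-per-run and cost-per-successful-run for each agent."""
    agg = defaultdict(lambda: {"cost": Decimal(0), "runs": 0, "ok": 0})
    for run in runs:
        a = agg[run["agent_id"]]
        a["cost"] += run["cost"]
        a["runs"] += 1
        a["ok"] += run["status"] == "ok"  # True counts as 1
    return {
        agent: {
            "cost_per_run": a["cost"] / a["runs"],
            # Failed runs still cost money, so total spend is divided by successes only.
            "cost_per_successful_run": a["cost"] / a["ok"] if a["ok"] else None,
        }
        for agent, a in agg.items()
    }


# Hypothetical day: four runs at 0.10 each, half of them failed.
runs = (
    [{"agent_id": "research-agent", "status": "ok", "cost": Decimal("0.10")}] * 2
    + [{"agent_id": "research-agent", "status": "error", "cost": Decimal("0.10")}] * 2
)
metrics = per_run_metrics(runs)["research-agent"]
```

In this example cost-per-run is 0.10 while cost-per-successful-run is 0.20: the per-run figure looks unchanged, and the doubling only shows up once failures are charged against the successes.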
Set a baseline per agent and alert on deviation from it, not on an absolute threshold. An absolute budget tells you the month is over budget after the damage is done. A baseline tells you this morning that research-agent is spending three times what it spent yesterday, which is the only version of the signal that is actionable.
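A baseline check can be very small. The sketch below assumes the baseline is the median of the same time window on recent days and that a threefold jump is worth an alert; the function name, the median, the factor, and the minimum-history guard are all illustrative choices, not a prescribed method.

```python
from statistics import median


def exceeds_baseline(history, current, factor=3.0, min_history=3):
    """True if `current` is more than `factor` times the median of `history`."""
    if len(history) < min_history:
        return False  # not enough data to trust a baseline yet
    baseline = median(history)
    return baseline > 0 and current > factor * baseline


# Hypothetical spend for one agent over the same morning window on previous days.
previous_mornings = [4.0, 5.0, 4.5]
alert = exceeds_baseline(previous_mornings, 16.0)
```

Comparing like-for-like windows (this morning against previous mornings) is what lets the check fire within hours instead of waiting for the daily total.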
Breakdown 2: by customer
Group by customer_id and you get cost-to-serve, which is the number that decides whether a plan is profitable.
This breakdown earns its keep in two ways. First, it protects margins on flat-rate plans. A single power user on a £99 tier who runs your most expensive agent two hundred times a day can quietly cost more than they pay, and you will never see it in an account-level dashboard that is blind to individual customers. Per-customer attribution turns "this plan is profitable on average" into "this plan is profitable for everyone except these four accounts", which is a fact you can act on.
Second, it is the foundation for usage-based pricing if you ever move that way. You cannot bill for what you cannot measure. The same per-run cost data that protects your flat-rate margins is exactly what you would meter against if you added a usage component later.
The only requirement is that your agent knows which customer it is acting for at run time, which it almost always does, and passes that id into the event. One field.
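As a rough sketch of the margin check, the snippet below sums cost-to-serve per customer for a month and lists the accounts that cost more than their plan price. The `customers_over_plan` helper, the resolved `cost` field, and the sample figures are assumptions for illustration.

```python
from collections import defaultdict
from decimal import Decimal


def customers_over_plan(runs, monthly_price):
    """Return {customer_id: cost minus price} for customers who cost more than they pay."""
    cost_to_serve = defaultdict(Decimal)
    for run in runs:
        if run.get("customer_id"):
            cost_to_serve[run["customer_id"]] += run["cost"]
    return {
        customer: cost - monthly_price[customer]
        for customer, cost in cost_to_serve.items()
        if customer in monthly_price and cost > monthly_price[customer]
    }


# Hypothetical month: both customers are on the £99 tier.
monthly_price = {"cus_8821": Decimal("99"), "cus_1042": Decimal("99")}
runs = (
    [{"customer_id": "cus_8821", "cost": Decimal("0.40")}] * 300   # heavy user: 120.00
    + [{"customer_id": "cus_1042", "cost": Decimal("0.40")}] * 50  # light user: 20.00
)
losses = customers_over_plan(runs, monthly_price)
```

Here the plan is comfortably profitable on average (140.00 of cost against 198 of revenue), yet one of the two accounts is underwater by 21.00, which is exactly the case the average hides.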
Breakdown 3: by feature
Group by feature and you can finally answer "is this surface worth its token bill". A feature that costs £1,200 a month in tokens and drives retention is a bargain. The same £1,200 on a feature nobody uses is pure waste, and without the breakdown the two are indistinguishable inside the account total.
This is the breakdown product teams want and almost never have. It turns the LLM line on the P&L from a fixed cost-of-doing-business into a per-feature investment you can reason about, cut, or double down on. When someone proposes shipping a new agent-backed feature, the feature breakdown on the existing ones is the evidence for what it will plausibly cost.
Multi-agent: rolling up the call tree
Real systems are not flat. A lead-enrichment run calls company-research, which calls a summariser. If each of those emits its own run with no linkage, your dashboard shows three unrelated rows and the true cost of one lead enrichment is scattered across them.
The fix is a parent run id. The child run records the id of the run that spawned it; the dashboard renders the tree and rolls child cost up into the parent.
lead-enrichment £0.41 (rollup)
├─ company-research £0.18
├─ person-research £0.15
└─ summariser £0.08
Finance sees £0.41 per lead. Engineering sees that about four-fifths of it (£0.33) is research calls and knows where to optimise. Same data, two altitudes, because the tree is preserved instead of flattened.
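The rollup itself is a short recursive sum over the parent links. This is a sketch under assumptions: the run ids are made up, each run carries a resolved `cost` field, the parent's own cost is taken as zero so the figures match the tree above, and the parent links are assumed to contain no cycles.

```python
from collections import defaultdict
from decimal import Decimal


def rollup(runs):
    """Return {run_id: own cost plus the cost of all descendants}."""
    own = {r["run_id"]: r["cost"] for r in runs}
    children = defaultdict(list)
    for r in runs:
        if r["parent_run_id"] is not None:
            children[r["parent_run_id"]].append(r["run_id"])

    totals = {}

    def total(run_id):
        # Memoised so each subtree is summed once.
        if run_id not in totals:
            totals[run_id] = own[run_id] + sum(
                (total(child) for child in children.get(run_id, [])), Decimal(0)
            )
        return totals[run_id]

    for run_id in own:
        total(run_id)
    return totals


# Hypothetical ids; the parent's own cost is assumed to be zero.
runs = [
    {"run_id": "run_lead", "agent_id": "lead-enrichment", "parent_run_id": None, "cost": Decimal("0.00")},
    {"run_id": "run_co", "agent_id": "company-research", "parent_run_id": "run_lead", "cost": Decimal("0.18")},
    {"run_id": "run_person", "agent_id": "person-research", "parent_run_id": "run_lead", "cost": Decimal("0.15")},
    {"run_id": "run_sum", "agent_id": "summariser", "parent_run_id": "run_lead", "cost": Decimal("0.08")},
]
totals = rollup(runs)
```

Because every run keeps its own total, the same result serves both views: `totals["run_lead"]` is the per-lead figure, and the child entries show where it went.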
What you actually have to do
The whole thing is less work than it sounds.
- Emit one event per agent run with agent_id, model, token counts, and started_at/finished_at. That is the floor; it already gives you the agent breakdown.
- Add customer_id and feature where the agent knows them. Two optional fields, and you unlock the other two breakdowns.
- Add parent_run_id when one agent calls another. The SDK propagates this automatically; by hand it is one value threaded through the call.
- Resolve cost against a maintained rate card at ingest, so the numbers are right and stay right.
That is the build whether you do it yourself or use AgentPing. The SDK call that wraps your agent carries these fields and the rate card ships maintained for the major providers, so you get all three breakdowns from one integration. See the Spend features for the dashboard side.
The provider dashboard is the bill. It is not the answer. The moment you are running more than one agent on a shared key, the absence of attribution is already costing you the ability to see your own spend. Get started and wire up agent-level attribution on your most expensive agent first.