How to track OpenAI costs by agent, customer, and feature

The provider dashboard gives you one number for the whole account. That number cannot answer the three questions you actually have. Here is how to break spend down by agent, by customer, and by feature, and what each breakdown unlocks.

Open the OpenAI usage dashboard right now and you will see a chart of total spend. The total is a real number, it is your real spend, and it is almost useless for running a business, because it answers a question nobody is asking. Nobody wants to know "what did the account spend yesterday". People want to know which agent, which customer, and which feature spent it. The account-level number cannot tell you any of those, and the gap between the number you have and the number you need is where most cost problems hide.

This post is the breakdown most teams reconstruct by hand a few months in, usually during an incident. For the incident itself, see the anatomy of an £8,900 token bill; this is the build, not the postmortem.


Why the provider number cannot answer your questions

Three structural problems with the provider dashboard.

It aggregates by key. Most teams ship several agents behind one API key because rotating per-agent keys is operational friction nobody signs up for early. The moment two agents share a key, the provider can no longer separate them, and neither can you from its dashboard.

It reports late. Usage data lands on a delay, aggregated by day. A loop that starts at 09:00 is not visible as a distinct spike until the next day's bucket fills, by which point the loop has run for hours.

It has no idea what an agent is. The provider sees requests. It does not see your support-triage agent, your enterprise customer, or your lead-scoring feature, because those are concepts that live in your code. Nothing the provider can show you will ever carry those labels, because you never sent them.

The fix for all three is the same: compute cost at the point of the run, in your own code, tagged with the dimensions you care about.


The unit that makes this work: the cost-tagged run

The whole system rests on one primitive, a record per agent run that carries enough to compute and attribute cost:

{
  "run_id":       "run_018f4c2a...",
  "agent_id":     "support-triage",
  "model":        "gpt-4o",
  "input_tokens": 2480,
  "output_tokens": 514,
  "customer_id":  "cus_8821",
  "feature":      "inbox-autoreply",
  "parent_run_id": null,
  "started_at":   "2026-06-09T09:14:02Z",
  "finished_at":  "2026-06-09T09:14:05Z",
  "status":       "ok"
}

The price is not in that payload, and that is deliberate. You resolve the model name against a rate card at ingest time, so the cost is computed consistently and stays correct even when you backfill or when a provider changes prices. The run carries the facts; the rate card turns the facts into money.
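As a minimal sketch of that ingest step, assuming a rate card keyed by model name with prices per million tokens: the figures in RATE_CARD are placeholders made up for illustration, not real provider prices, and resolve_cost is a hypothetical helper name.

```python
from decimal import Decimal

# Hypothetical rate card: GBP per 1M tokens. Placeholder figures for
# illustration only; substitute the prices you actually pay.
RATE_CARD = {
    "gpt-4o": {"input": Decimal("2.00"), "output": Decimal("8.00")},
}

PER_MILLION = Decimal(1_000_000)


def resolve_cost(run: dict, rate_card: dict = RATE_CARD) -> Decimal:
    """Price one run against the rate card, using only facts the run carries."""
    rates = rate_card.get(run["model"])
    if rates is None:
        # Fail loudly rather than silently recording a zero cost.
        raise KeyError(f"no rate for model {run['model']!r}")
    input_cost = Decimal(run["input_tokens"]) * rates["input"] / PER_MILLION
    output_cost = Decimal(run["output_tokens"]) * rates["output"] / PER_MILLION
    return input_cost + output_cost


# Token counts taken from the example payload above.
run = {"model": "gpt-4o", "input_tokens": 2480, "output_tokens": 514}
cost = resolve_cost(run)
```

Decimal is used instead of float so that many small per-run amounts add up without rounding drift. Because the run stores tokens rather than money, re-pricing a backfill is a matter of calling the same function with a different rate card.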

From this one row, every breakdown below is a GROUP BY.
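In a database that is literally a GROUP BY; in plain Python the same idea might look like the sketch below. It assumes each run already has a resolved cost field attached at ingest, and sum_cost_by is a hypothetical helper, not part of any SDK.

```python
from collections import defaultdict
from decimal import Decimal


def sum_cost_by(runs, key):
    """Sum resolved cost per value of `key`, skipping runs that lack the tag."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for run in runs:
        value = run.get(key)
        if value is not None:
            totals[value] += run["cost"]
    return dict(totals)


# Illustrative rows; ids and costs are invented for the example.
runs = [
    {"agent_id": "support-triage", "customer_id": "cus_8821",
     "feature": "inbox-autoreply", "cost": Decimal("0.02")},
    {"agent_id": "support-triage", "customer_id": "cus_1040",
     "feature": "inbox-autoreply", "cost": Decimal("0.03")},
    {"agent_id": "research-agent", "customer_id": "cus_8821",
     "feature": None, "cost": Decimal("0.10")},
]

by_agent = sum_cost_by(runs, "agent_id")
by_customer = sum_cost_by(runs, "customer_id")
by_feature = sum_cost_by(runs, "feature")
```

The same three rows produce three different views, depending only on which tag you group by. Runs without a tag are left out of that view instead of being lumped under an empty label.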


Breakdown 1: by agent

Group by agent_id, sum the resolved cost, bucket by day. This is the breakdown you alert on, because an agent is the unit that goes wrong. A loop, a runaway retry, a context window that quietly doubled after a prompt change: all of them show up first as one agent's line pulling away from the others.

The metric that matters here is not raw spend but cost-per-successful-run. Raw spend tells you an agent is expensive; cost-per-successful-run tells you whether it is expensive because it does a lot of valuable work or because it is burning tokens on failures and retries. An agent whose cost-per-run is flat but whose cost-per-successful-run is climbing is failing more often, and that is a signal a raw total will hide inside its own average.
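A small sketch of the two unit costs side by side, assuming each run carries a resolved cost and a status where "ok" means success. The "error" value and the unit_costs helper are assumptions for illustration.

```python
from decimal import Decimal


def unit_costs(runs):
    """Return (cost per run, cost per successful run) for one agent's runs."""
    total = sum((r["cost"] for r in runs), Decimal(0))
    n_all = len(runs)
    n_ok = sum(1 for r in runs if r["status"] == "ok")
    per_run = total / n_all if n_all else None
    # Failed runs still cost tokens, so they stay in the numerator
    # and only drop out of the denominator.
    per_success = total / n_ok if n_ok else None
    return per_run, per_success


# Four runs at the same cost, half of them failing.
runs = [
    {"cost": Decimal("0.10"), "status": "ok"},
    {"cost": Decimal("0.10"), "status": "ok"},
    {"cost": Decimal("0.10"), "status": "error"},
    {"cost": Decimal("0.10"), "status": "error"},
]
per_run, per_success = unit_costs(runs)
```

Here cost-per-run is £0.10 and looks healthy, while cost-per-successful-run is £0.20, because half the spend produced nothing.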

Set a baseline per agent and alert on deviation from it, not on an absolute threshold. An absolute budget tells you the month is over budget after the damage is done. A baseline tells you, this morning, that research-agent is spending three times what it spent yesterday, which is the only version of the signal that is actionable.
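One way to sketch a baseline check, assuming you keep a short history of daily spend per agent. The median baseline, the 3x ratio, and the minimum-spend floor are illustrative choices, not recommended values.

```python
from statistics import median


def deviates_from_baseline(history, today, ratio=3.0, min_spend=1.0):
    """Flag an agent whose spend today is far above its own recent baseline.

    history:   recent daily spend figures for this agent
    today:     spend so far today
    ratio:     multiple of the baseline that counts as a deviation (assumed)
    min_spend: floor that keeps tiny agents from paging anyone (assumed)
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = median(history)
    return today >= min_spend and today > ratio * baseline


# A hypothetical week of daily spend for research-agent, in pounds.
last_week = [4.0, 5.0, 4.5, 5.5, 5.0, 4.8, 5.2]
alert_on_spike = deviates_from_baseline(last_week, today=16.0)
alert_on_normal = deviates_from_baseline(last_week, today=6.0)
```

The median is used rather than the mean so that one earlier bad day does not raise the baseline and mask the next one.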


Breakdown 2: by customer

Group by customer_id and you get cost-to-serve, which is the number that decides whether a plan is profitable.

This breakdown earns its keep in two ways. First, it protects margins on flat-rate plans. A single power user on a £99 tier who runs your most expensive agent two hundred times a day can quietly cost more than they pay, and you will never see it in a dashboard that cannot tell one customer from another. Per-customer attribution turns "this plan is profitable on average" into "this plan is profitable for everyone except these four accounts", which is a fact you can act on.

Second, it is the foundation for usage-based pricing if you ever move that way. You cannot bill for what you cannot measure. The same per-run cost data that protects your flat-rate margins is exactly what you would meter against if you added a usage component later.

The only requirement is that your agent knows which customer it is acting for at run time, which it almost always does, and passes that id into the event. One field.
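The flat-rate margin check can be sketched as a comparison between monthly cost-to-serve and plan price. The customer ids, monthly costs, and the unprofitable_customers helper below are invented for illustration.

```python
from decimal import Decimal


def unprofitable_customers(monthly_cost, plan_price):
    """Return (customer_id, shortfall) pairs where token cost exceeds the plan price."""
    losses = {
        customer: cost - plan_price[customer]
        for customer, cost in monthly_cost.items()
        if customer in plan_price and cost > plan_price[customer]
    }
    # Worst offenders first.
    return sorted(losses.items(), key=lambda item: item[1], reverse=True)


# Hypothetical month: both customers are on the £99 tier.
monthly_cost = {"cus_8821": Decimal("131.40"), "cus_1040": Decimal("12.75")}
plan_price = {"cus_8821": Decimal("99"), "cus_1040": Decimal("99")}

over_budget = unprofitable_customers(monthly_cost, plan_price)
```

In this made-up month, cus_8821 costs £32.40 more than the plan brings in, which is the kind of account the average hides.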


Breakdown 3: by feature

Group by feature and you can finally answer "is this surface worth its token bill". A feature that costs £1,200 a month in tokens and drives retention is a bargain. The same £1,200 on a feature nobody uses is pure waste, and without the breakdown the two are indistinguishable inside the account total.

This is the breakdown product teams want and almost never have. It turns the LLM line on the P&L from a fixed cost-of-doing-business into a per-feature investment you can reason about, cut, or double down on. When someone proposes shipping a new agent-backed feature, the feature breakdown on the existing ones is the evidence for what it will plausibly cost.


Multi-agent: rolling up the call tree

Real systems are not flat. A lead-enrichment run calls company-research, person-research, and a summariser. If each of those emits its own run with no linkage, your dashboard shows a handful of unrelated rows and the true cost of one lead enrichment is scattered across them.

The fix is a parent run id. The child run records the id of the run that spawned it; the dashboard renders the tree and rolls child cost up into the parent.

lead-enrichment          £0.41   (rollup)
├─ company-research      £0.18
├─ person-research       £0.15
└─ summariser            £0.08

Finance sees £0.41 per lead. Engineering sees that roughly four-fifths of it (£0.33 of £0.41) is research calls and knows where to optimise. Same data, two altitudes, because the tree is preserved instead of flattened.
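A rollup like the one in the tree above can be sketched in a few lines. It assumes every run has a resolved cost and that parent links form a tree with no cycles. The run ids are hypothetical, and the parent's own cost is set to zero here only so the totals match the figures shown.

```python
from collections import defaultdict
from decimal import Decimal


def rollup_costs(runs):
    """Return {run_id: own cost plus the cost of every descendant run}."""
    own = {}
    children = defaultdict(list)
    for run in runs:
        own[run["run_id"]] = run["cost"]
        if run["parent_run_id"] is not None:
            children[run["parent_run_id"]].append(run["run_id"])

    totals = {}

    def total(run_id):
        # Memoised depth-first sum; assumes the parent links contain no cycles.
        if run_id not in totals:
            child_sum = sum((total(c) for c in children[run_id]), Decimal(0))
            totals[run_id] = own[run_id] + child_sum
        return totals[run_id]

    for run_id in own:
        total(run_id)
    return totals


runs = [
    {"run_id": "run_root", "agent_id": "lead-enrichment",
     "parent_run_id": None, "cost": Decimal("0.00")},
    {"run_id": "run_c1", "agent_id": "company-research",
     "parent_run_id": "run_root", "cost": Decimal("0.18")},
    {"run_id": "run_c2", "agent_id": "person-research",
     "parent_run_id": "run_root", "cost": Decimal("0.15")},
    {"run_id": "run_c3", "agent_id": "summariser",
     "parent_run_id": "run_root", "cost": Decimal("0.08")},
]
totals = rollup_costs(runs)
```

The per-run rows stay in the data untouched; the rollup is a derived view, so engineering can still read each child's own cost.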


What you actually have to do

The whole thing is less work than it sounds.

  • Emit one event per agent run with agent_id, model, token counts, and started_at / finished_at. That is the floor; it already gives you the agent breakdown.
  • Add customer_id and feature where the agent knows them. Two optional fields, and you unlock the other two breakdowns.
  • Add parent_run_id when one agent calls another. The AgentPing SDK propagates this automatically; by hand it is one value threaded through the call.
  • Resolve cost against a maintained rate card at ingest, so the numbers are right and stay right.

That is the build whether you do it yourself or use AgentPing. The SDK call that wraps your agent carries these fields and the rate card ships maintained for the major providers, so you get all three breakdowns from one integration. See the Spend features for the dashboard side.


The provider dashboard is the bill. It is not the answer. The moment you are running more than one agent on a shared key, the absence of attribution is already costing you the ability to see your own spend. Get started and wire up agent-level attribution on your most expensive agent first.

Frequently asked questions
Why is the OpenAI usage dashboard not enough to track agent costs?
The OpenAI usage dashboard aggregates by API key and by day, at the account level. If three agents share one key, the dashboard shows their combined spend as a single rising line and gives you no way to split it. It also reports after the fact, on a delay, so a runaway agent is already a day or more into the damage by the time the number updates. To answer "which agent spent this" you need attribution computed at the point of the run, tagged with an agent id, not a report read back from the provider.
What three dimensions should I attribute LLM cost to?
Agent, customer, and feature. Agent tells you which part of your system is expensive and is the dimension you alert on. Customer tells you your cost-to-serve, which is what you need to keep a plan profitable and to spot a single account that is quietly costing you more than it pays. Feature tells you whether a specific product surface is worth its token bill. All three come from the same per-run event; you just attach the tags when you emit it.
Do I need to change my agent code to attribute cost?
You need to attach a small amount of metadata to each run: an agent id always, and a customer id or feature tag where you have them. With the AgentPing SDK that is a few fields on the call that wraps your agent. With a homegrown setup you log the model, prompt tokens, completion tokens, and those tags to a table and join against a rate card. Either way the attribution lives in your wrapper, because the provider has no idea your agents or customers exist; you are the only one who can tell the data which run belongs to which.
How do I track cost when one agent calls another?
Propagate a parent run id. When a top-level agent invokes a sub-agent or a tool that is itself an agent, the child run records the parent id, and the dashboard rolls child costs up into the parent. That way finance sees one number per top-level invocation while engineering can drill into where inside the call tree the spend actually went. Without parent linkage, a multi-agent workflow looks like a pile of unrelated runs and you lose the ability to say what a single user action truly cost.
What is cost-per-successful-run and why track it?
It is total agent spend divided by the number of runs that actually succeeded, rather than by all runs. It is the honest unit cost, because a run that failed, retried five times, and produced nothing still cost you tokens. Tracking raw cost-per-run hides waste inside a healthy-looking average; tracking cost-per-successful-run surfaces it, and it is the number you want in front of you for any budgeting or pricing conversation.