Here is a bill most teams are paying and have never looked at closely. Your agent has a long system prompt: instructions, a tool schema, a handful of few-shot examples, maybe a chunk of retrieved context that stays constant across a session. That prefix might be two or three thousand tokens. It is identical on every single call. And on every single call, most teams pay full input price for it, as if the provider had never seen it before.
Prompt caching is the mechanism for not doing that, and it is one of the highest-return, lowest-effort cost levers available to an agent team. It is also one of the easiest to get silently wrong. This post covers how it works, when it pays, and how to know which side of the line you are on. For the broader picture, see why your token bill keeps growing.
What caching actually does
When you send a prompt, the provider has to process the input tokens before it can generate anything. Prompt caching lets the provider store the processed form of a stable prefix and reuse it. You mark the part of the prompt that does not change, the provider keeps it warm for a short time-to-live, and any request that arrives within that window sharing the same prefix reads from the cache instead of reprocessing it from scratch.
The reused tokens are billed at a fraction of the normal input rate, often between 10% and 50% depending on the provider and cache tier, and the request is faster because the expensive prefix processing is skipped. You pay full freight once to warm the cache (some providers add a small write premium on top), then a steep discount on every reuse inside the cache lifetime.
The shape that matters: the discount applies only to the cached prefix. The variable tail of your prompt and the entire output are billed normally. So the saving you realise on input is roughly the discount (one minus the cached rate) times the fraction of your input tokens that live in the stable prefix, and only on calls that actually hit the cache.
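As a rough illustration of that arithmetic, here is a minimal sketch. The function name and parameters are hypothetical, the 0.1 cached-rate multiplier is an assumed example rather than any provider's actual price, and any write premium is ignored.

```python
def input_saving_fraction(prefix_tokens: int, tail_tokens: int,
                          cached_rate_multiplier: float,
                          hit_rate: float = 1.0) -> float:
    """Fraction of input spend saved by caching (write premium ignored).

    cached_rate_multiplier is the cached price as a share of the normal
    input price, e.g. 0.1 if cached tokens cost 10% of the full rate.
    """
    total = prefix_tokens + tail_tokens
    if total == 0:
        return 0.0
    prefix_fraction = prefix_tokens / total   # share of input that is cacheable
    discount = 1.0 - cached_rate_multiplier   # what a cache hit takes off the price
    return discount * prefix_fraction * hit_rate


# Assumed example: 3,000-token prefix, 1,000-token tail, cached rate at 10%.
# discount 0.9 * prefix fraction 0.75 = 0.675 of input spend saved.
print(round(input_saving_fraction(3000, 1000, 0.1), 3))
```

Output tokens are untouched by any of this, so the saving on the total bill is smaller than the saving on input.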
The agents that benefit most
Caching pays in direct proportion to two things: how big your stable prefix is, and how often you reuse it before it expires.
The ideal candidate is an agent with a long fixed system prompt and a short variable payload, called frequently. A classifier with a 2,500-token instruction-and-examples block that processes a 200-token ticket, fired hundreds of times an hour, is the dream case. More than 90% of its input is the stable prefix, and the reuse rate is high enough that the cache never goes cold. Caching can take a large bite out of that agent's input bill.
The poor candidate is an agent with a short prompt, a huge variable payload, or a low call rate. If your prefix is 200 tokens and your variable input is 4,000, caching the prefix barely moves the bill. If you call the agent once an hour and the cache lifetime is five minutes, every call is a cold miss and you may even be paying the write premium for nothing.
The honest summary: caching is close to free money for high-frequency, fixed-prefix agents, and close to a no-op for low-frequency, variable-heavy ones. Knowing which of yours are which is the whole game.
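To make the good and poor candidates concrete, the sketch below compares input cost with caching against input cost without it. The multipliers (cached reads at 0.1x, cache writes at 1.25x) and the hit rates are illustrative assumptions, not any particular provider's pricing; substitute the figures from your own rate card.

```python
def relative_input_cost(prefix_tokens: int, tail_tokens: int, hit_rate: float,
                        cached_multiplier: float = 0.1,
                        write_multiplier: float = 1.25) -> float:
    """Input cost with caching divided by input cost without caching.

    On a hit the prefix is billed at cached_multiplier; on a miss it is
    billed at write_multiplier because the cache is being re-warmed.
    The tail is always billed at the full rate. Below 1.0 means a saving.
    """
    prefix_price = hit_rate * cached_multiplier + (1.0 - hit_rate) * write_multiplier
    with_cache = prefix_tokens * prefix_price + tail_tokens
    without_cache = prefix_tokens + tail_tokens
    return with_cache / without_cache


# The classifier from the text: big prefix, small tail, cache rarely cold.
busy_classifier = relative_input_cost(2500, 200, hit_rate=0.99)

# Same prompt called once an hour against a short lifetime: every call misses.
hourly_agent = relative_input_cost(2500, 200, hit_rate=0.0)

# Short prefix, huge variable payload: hits barely matter.
payload_heavy = relative_input_cost(200, 4000, hit_rate=0.99)

print(f"{busy_classifier:.2f} {hourly_agent:.2f} {payload_heavy:.2f}")
```

Under these assumptions the busy classifier pays well under a fifth of its uncached input cost, the hourly agent pays more than it would without caching, and the payload-heavy agent saves only a few percent.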
The trap: caching that silently breaks
This is the part that turns a saving into a liability. Caching keys on an exact prefix match. The provider reuses the cache only if the front of your prompt is byte-for-byte what it was last time.
So consider what breaks the match:
- A timestamp or a "current date" injected near the top of the system prompt.
- A per-user or per-session value placed before the stable instructions.
- A tool list that gets reordered because it is built from a map with no stable ordering.
- A system message assembled by string concatenation whose pieces shift around.
Any of these moves the prefix, and the moment the prefix moves, every call is a cache miss billed at the full input rate. The agent keeps working perfectly. The output is unchanged. The only thing that changes is the bill, and it changes in a direction nobody has a dashboard pointed at. A prompt refactor that looks purely cosmetic can quietly double an agent's input cost by pushing a variable value ahead of the cacheable block.
This is the single most common caching failure: not that teams never turn it on, but that they turn it on, see the saving, and then break it months later with an innocent-looking change and never notice.
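One way to guard against these failure modes is to build the prefix deterministically and fingerprint it. The sketch below is an assumption-laden illustration: `STABLE_INSTRUCTIONS`, `build_prompt`, and `prefix_fingerprint` are hypothetical names, and how the prefix is actually marked as cacheable depends on your provider's API.

```python
import hashlib
import json

# Hypothetical fixed instruction block; nothing variable belongs in here.
STABLE_INSTRUCTIONS = "You are a support-ticket classifier. Answer with one label."


def build_prompt(tools: dict, variable_context: str) -> tuple[str, str]:
    """Return (stable_prefix, variable_tail).

    Tools are emitted in sorted-name order with sorted keys, so the bytes
    do not depend on how the dict happened to be populated.
    """
    tool_block = json.dumps(
        [tools[name] for name in sorted(tools)],
        sort_keys=True,
        separators=(",", ":"),
    )
    prefix = STABLE_INSTRUCTIONS + "\n" + tool_block
    # Timestamps, user IDs and session values go in the tail, never the prefix.
    return prefix, variable_context


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the exact prefix bytes; a change means the cache key moved."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


tools_a = {"search": {"name": "search"}, "lookup": {"name": "lookup"}}
tools_b = {"lookup": {"name": "lookup"}, "search": {"name": "search"}}

prefix_a, _ = build_prompt(tools_a, "Current date: 2024-01-01. Ticket: ...")
prefix_b, _ = build_prompt(tools_b, "Current date: 2024-01-02. Ticket: ...")
print(prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b))  # True
```

Logging the fingerprint per run, or asserting on it in a test, gives a cheap signal when a refactor changes the prefix.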
How to know if you are actually saving
You cannot manage what you cannot see, and caching is invisible without token-level instrumentation. Providers report a cached-token count on each response. The question is whether your observability captures it.
If it does, you can compute the number that matters: your cache hit rate per agent, and the realised discount in money. You can watch that hit rate over time and catch the day it falls off a cliff because someone moved a timestamp. You can see which agents are benefiting and which are paying the write premium for reuse that never comes.
If it does not, you are flying blind on a line item that can swing by large percentages. The default assumption "we enabled caching, so it is working" is exactly the assumption that lets a broken prefix run for months.
The instrumentation is the same per-run event you already want for cost attribution generally. The run records input_tokens, output_tokens, and a cached_tokens count; the rate card applies the cached rate to the cached portion and the full rate to the rest. Now your per-agent cost reflects caching accurately, and your cache hit rate is just another series on the chart, with the same baseline alerting as any other spend metric. A hit rate that drops is a spend anomaly like any other, and it pages you the morning it happens rather than surfacing on the invoice.
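A minimal sketch of that per-run event and rate card might look like the following. The `Run` and `RateCard` types and the prices are hypothetical, and the code assumes `input_tokens` already includes the cached portion; providers differ on this, so check how yours reports usage before subtracting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    agent: str
    input_tokens: int    # assumed to INCLUDE cached_tokens
    output_tokens: int
    cached_tokens: int


@dataclass(frozen=True)
class RateCard:
    # Prices per million tokens; the values used below are made up.
    input_per_mtok: float
    cached_per_mtok: float
    output_per_mtok: float


def run_cost(run: Run, card: RateCard) -> float:
    """Cost of one run: cached rate on the cached portion, full rate on the rest."""
    uncached = run.input_tokens - run.cached_tokens
    return (
        uncached * card.input_per_mtok
        + run.cached_tokens * card.cached_per_mtok
        + run.output_tokens * card.output_per_mtok
    ) / 1_000_000


def cache_hit_rate(runs: list[Run]) -> float:
    """Token-weighted hit rate: cached input tokens over all input tokens."""
    total_input = sum(r.input_tokens for r in runs)
    if total_input == 0:
        return 0.0
    return sum(r.cached_tokens for r in runs) / total_input


card = RateCard(input_per_mtok=3.0, cached_per_mtok=0.3, output_per_mtok=15.0)
runs = [
    Run("classifier", input_tokens=2700, output_tokens=50, cached_tokens=2500),
    Run("classifier", input_tokens=2700, output_tokens=50, cached_tokens=0),
]
print(run_cost(runs[0], card), run_cost(runs[1], card), cache_hit_rate(runs))
```

Grouping the runs by `agent` before calling `cache_hit_rate` gives the per-agent series described above.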
A short checklist
- Identify your high-frequency agents with a long, fixed prefix. They are where the money is.
- Move everything stable to the front of the prompt and everything variable to the back, so the cacheable prefix is as large and as stable as possible.
- Never inject a timestamp, per-user value, or unordered list ahead of the stable block. That one rule prevents most silent breakage.
- Capture cached_tokens per run and compute hit rate per agent.
- Baseline the hit rate and alert on a drop, so a prompt refactor that breaks the cache pages you instead of surfacing on the bill a month later.
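The last item can be as simple as comparing the current hit rate to a recent baseline. This is a minimal sketch with a hypothetical function name and an assumed 20% relative-drop threshold; a real alerting pipeline would likely use whatever baseline logic it already applies to other spend metrics.

```python
from statistics import median


def hit_rate_dropped(history: list[float], current: float,
                     max_relative_drop: float = 0.2) -> bool:
    """True if the current hit rate is well below the recent baseline.

    history holds recent per-period hit rates for one agent; the median
    is used as the baseline so a single odd period does not skew it.
    """
    if not history:
        return False          # no baseline yet, nothing to compare against
    baseline = median(history)
    if baseline == 0:
        return False          # agent was never benefiting from the cache
    return current < baseline * (1.0 - max_relative_drop)


recent = [0.91, 0.93, 0.92, 0.90]
print(hit_rate_dropped(recent, 0.89))  # normal wobble
print(hit_rate_dropped(recent, 0.05))  # someone moved a timestamp
```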
Prompt caching is rare among cost levers in that it costs almost nothing to claim and asks for no quality trade-off. The catch is that claiming it once is not the same as keeping it, and keeping it requires seeing it. AgentPing captures cached-token counts per run and applies the cached rate in the Spend view, so your hit rate is monitored like any other metric. Get started and check whether the agent you think is cached actually is.