Prompt caching: the discount you are probably not claiming

Most agents send the same long system prompt on every single call and pay full price for it every time. Prompt caching can cut that portion of the bill by up to 90%. Here is how it works, when it pays, and how to tell whether you are leaving the money on the table.

Here is a bill most teams are paying and have never looked at closely. Your agent has a long system prompt: instructions, a tool schema, a handful of few-shot examples, maybe a chunk of retrieved context that stays constant across a session. That prefix might be two or three thousand tokens. It is identical on every single call. And on every single call, most teams pay full input price for it, as if the provider had never seen it before.

Prompt caching is the mechanism for not doing that, and it is one of the highest-return, lowest-effort cost levers available to an agent team. It is also one of the easiest to get silently wrong. This post is how it works, when it pays, and how to know which side of the line you are on. For the broader picture, see why your token bill keeps growing.


What caching actually does

When you send a prompt, the provider has to process the input tokens before it can generate anything. Prompt caching lets the provider store the processed form of a stable prefix and reuse it. You mark the part of the prompt that does not change, the provider keeps it warm for a short time-to-live, and any request that arrives within that window sharing the same prefix reads from the cache instead of reprocessing it from scratch.

The reused tokens are billed at a fraction of the normal input rate, often between 10% and 50% depending on the provider and cache tier, and the request is faster because the expensive prefix processing is skipped. You are paying full freight once to warm the cache, then a steep discount on every reuse inside the lifetime.

The shape that matters: the discount applies only to the cached prefix. The variable tail of your prompt and the entire output are billed normally. So the saving you realise is the discount rate times the fraction of your tokens that live in the stable prefix.


The agents that benefit most

Caching pays in direct proportion to two things: how big your stable prefix is, and how often you reuse it before it expires.

The ideal candidate is an agent with a long fixed system prompt and a short variable payload, called frequently. A classifier with a 2,500-token instruction-and-examples block that processes a 200-token ticket, fired hundreds of times an hour, is the dream case. Almost all of its input is the stable prefix, and the reuse rate is high enough that the cache never goes cold. Caching can take a large bite out of that agent's bill.

The poor candidate is an agent with a short prompt, a huge variable payload, or a low call rate. If your prefix is 200 tokens and your variable input is 4,000, caching the prefix barely moves the bill. If you call the agent once an hour and the cache lifetime is five minutes, every call is a cold miss and you may even be paying the write premium for nothing.

The honest summary: caching is close to free money for high-frequency, fixed-prefix agents, and close to a no-op for low-frequency, variable-heavy ones. Knowing which of yours are which is the whole game.


The trap: caching that silently breaks

This is the part that turns a saving into a liability. Caching keys on an exact prefix match. The provider reuses the cache only if the front of your prompt is byte-for-byte what it was last time.

So consider what breaks the match:

  • A timestamp or a "current date" injected near the top of the system prompt.
  • A per-user or per-session value placed before the stable instructions.
  • A tool list that gets reordered because it is built from a map with no stable ordering.
  • A system message assembled by string concatenation whose pieces shift around.

Any of these moves the prefix, and the moment the prefix moves, every call is a cache miss billed at the full input rate. The agent keeps working perfectly. The output is unchanged. The only thing that changes is the bill, and it changes in a direction nobody has a dashboard pointed at. A prompt refactor that looks purely cosmetic can quietly double an agent's input cost by pushing a variable value ahead of the cacheable block.

This is the single most common caching failure: not that teams never turn it on, but that they turn it on, see the saving, and then break it months later with an innocent-looking change and never notice.


How to know if you are actually saving

You cannot manage what you cannot see, and caching is invisible without token-level instrumentation. Providers report a cached-token count on each response. The question is whether your observability captures it.

If it does, you can compute the number that matters: your cache hit rate per agent, and the realised discount in money. You can watch that hit rate over time and catch the day it falls off a cliff because someone moved a timestamp. You can see which agents are benefiting and which are paying the write premium for reuse that never comes.

If it does not, you are flying blind on a line item that can swing by large percentages. The default assumption "we enabled caching, so it is working" is exactly the assumption that lets a broken prefix run for months.

The instrumentation is the same per-run event you already want for cost attribution generally. The run records input_tokens, output_tokens, and a cached_tokens count; the rate card applies the cached rate to the cached portion and the full rate to the rest. Now your per-agent cost reflects caching accurately, and your cache hit rate is just another series on the chart, with the same baseline alerting as any other spend metric. A hit rate that drops is a spend anomaly like any other, and it pages you the morning it happens rather than surfacing on the invoice.


A short checklist

  • Identify your high-frequency agents with a long, fixed prefix. They are where the money is.
  • Move everything stable to the front of the prompt and everything variable to the back, so the cacheable prefix is as large and as stable as possible.
  • Never inject a timestamp, per-user value, or unordered list ahead of the stable block. That one rule prevents most silent breakage.
  • Capture cached_tokens per run and compute hit rate per agent.
  • Baseline the hit rate and alert on a drop, so a prompt refactor that breaks the cache pages you instead of surfacing on the bill a month later.

Prompt caching is rare among cost levers in that it costs almost nothing to claim and asks for no quality trade-off. The catch is that claiming it once is not the same as keeping it, and keeping it requires seeing it. AgentPing captures cached-token counts per run and applies the cached rate in the Spend view, so your hit rate is monitored like any other metric. Get started and check whether the agent you think is cached actually is.

FAQ frequently asked
What is prompt caching?
Prompt caching lets a provider store the processed form of a stable prefix of your prompt (a long system prompt, a tool schema, a big set of few-shot examples) so that subsequent requests reusing that prefix are billed at a steep discount and processed faster. You mark the cacheable portion, the provider keeps it warm for a short time-to-live, and any request that arrives within that window and shares the prefix reads from the cache instead of reprocessing the whole thing. The reused tokens are charged at a fraction of the normal input rate.
How much does prompt caching actually save?
It depends on the provider and the cache type, but cached input tokens are commonly billed at somewhere between 10% and 50% of the normal input rate, with the largest discounts on the highest reuse. The saving only applies to the cached prefix, not the variable part of the prompt or the output, so the real-world reduction in your bill depends on what fraction of your tokens live in the stable prefix. Agents with a long fixed system prompt and a short variable payload benefit most.
When is prompt caching not worth it?
When your stable prefix is short, when your requests are spread far enough apart that the cache expires between them, or when there is some writing overhead to populate the cache that you never recoup because reuse is too low. Some providers charge a small premium to write to the cache on the first call, so caching a prefix you only reuse once or twice can cost more than it saves. Caching pays when a substantial, stable prefix is reused many times inside the cache lifetime.
How do I know if my agent is benefiting from caching?
You need per-run visibility into cached versus uncached input tokens. Providers report a cached-token count on each response; if your observability captures it, you can compute your cache hit rate and see the realised discount per agent. Without that breakdown you are guessing, and the most common failure is assuming caching is on and working when a prompt change quietly broke the prefix and dropped your hit rate to near zero.
Why would caching silently stop working?
Caching keys on an exact prefix match. If anything near the front of your prompt changes between requests (a timestamp, a per-user value, a reordered tool list, a dynamically built system message) the prefix no longer matches and every call becomes a cache miss billed at full rate. These regressions are invisible without token-level monitoring, because the agent still works perfectly; only the bill changes, and it changes in a direction nobody is watching.