All articles
LLM Economics

Your Load Balancer Is Quietly Inflating Your LLM Bill

Standard load balancers scatter requests across servers at random, which breaks prefix caching and makes you pay full price for tokens you already cached. Prefix-aware routing fixes it.

Jun 22, 2026 · 4 min read
Your Load Balancer Is Quietly Inflating Your LLM Bill

Key takeaways

  • Prefix caching lets a server skip recomputing a prompt it has already seen, and providers often bill those cached tokens at a steep discount.
  • A standard load balancer routes by chance, so repeat requests rarely land on the server holding the cache. The cache misses, and you pay full price.
  • Prefix-aware (KV-cache-aware) routing sends requests with the same prefix to the same server, recovering the discount and cutting latency.
  • The cache discount you penciled into your margin model is optimistic unless your routing is cache-aware.
  • Model both a naive and a cache-aware hit rate before you trust the savings.

What is prefix caching, in plain terms?

When a model reads your prompt, it builds an internal cache (the KV cache) for those tokens. If the next request starts with the same prefix, say the same long system prompt, the server can reuse that work instead of redoing it. Less compute means lower cost and faster responses, and many providers bill those reused input tokens at a large discount. For apps that send the same system prompt on every call, prefix caching is one of the easiest wins on the menu.

Why do standard load balancers break it?

A standard load balancer spreads requests evenly or randomly across servers for resilience and balance. That is exactly wrong for caching. The cache lives on whichever server handled the first request. When your repeat request lands on a different server by chance, that server has no cache, so it recomputes from scratch and you pay full price. The cruel twist: the more servers you add to scale, the lower your odds of a cache hit. Your infrastructure ends up fighting your cost optimization.

How big is the hidden cost?

It is the gap between the cache discount you assumed and the hit rate you actually get. Say you modeled an 80% cache hit and your routing delivers 20%. Most of the discount you priced into your margins never shows up. That number is illustrative, but the trap is real: the savings look great on paper and quietly evaporate in production, and nothing in your dashboard screams about it.

How do you fix it? Prefix-aware routing

The fix is to route by content, not by chance. Prefix-aware, or KV-cache-aware, routing inspects the start of each request and sends matching prefixes to the same server, so the cache is there when the request arrives. You keep the resilience of running multiple servers and stop throwing away the cache on every other call. Hit rates climb, latency drops, and the discount you modeled finally lands in the bill.

What this means for your pricing model

Here is the contrarian part. Most pricing spreadsheets treat the cached-token discount as a flat percentage, a clean line in the margin math. It is not clean. Your real discount depends on infrastructure you may not have tuned, and it can swing with a config change. Before you bank those savings in a plan or a price cut, model two scenarios: a naive hit rate and a cache-aware one. The gap between them is your margin's exposure to a routing decision most teams never audit.

The takeaway: prefix caching only pays off if your routing respects it, so model the realistic hit rate before you price against it. You can run both the naive and cache-aware scenarios on token costs and margins in Calcaas in a few minutes.

Frequently asked questions

What is KV cache routing?

KV cache routing, also called prefix-aware routing, sends requests that share a prompt prefix to the same server so they hit that server's existing cache. It replaces random load balancing for LLM traffic, raising cache hit rates and recovering the provider's cached-token discount.

Why do load balancers reduce cache hits?

Standard load balancers distribute requests evenly or randomly for resilience, ignoring where a prompt's cache lives. Because the cache sits on the server that handled the first request, randomly routed repeats usually land elsewhere and miss it, forcing a full recompute at full price.

Does prefix caching actually lower cost?

Yes, when requests reuse the same prefix and routing sends them to the cached server. Providers often bill cached input tokens at a large discount, and skipping recomputation also cuts latency. The savings only materialize if your routing is cache-aware, otherwise the hit rate stays low.

How should this affect my pricing model?

Do not assume the full cached-token discount. Model a realistic cache hit rate, then a cache-aware one, and price against the conservative case. That keeps your margins honest if routing underperforms or a config change quietly drops your hit rate. Place this JSON-LD inside a `<script type="application/ld+json">` tag in the page head. The questions and answers must match the visible FAQ text exactly.

More from the blog

The Margin Memo

Pricing math, in your inbox.

One short note a week on AI pricing, token economics, and margin. No spam, unsubscribe anytime.