LLM Inference Cost at Scale: Napkin Math for Founders

To estimate LLM inference cost, multiply tokens per request by requests per month, divide by one million, and multiply by your blended price per million tokens. Then stress-test each assumption before you trust the total.

Jun 23, 2026 · 4 min read


Key takeaways

Inference cost is mostly token volume times price per token. Get the token count right before anything else.
Output tokens are usually priced at several times the input rate, so chatty responses dominate the bill.
Tiny per-request token differences compound into large monthly numbers once you hit real traffic.
Self-hosting swaps a clean per-token price for fixed GPU hours plus utilization risk.
Model a realistic token profile per user, not a best-case demo.

Why does napkin math beat a pricing spreadsheet you never finish?

Most founders overthink inference cost. You do not need a perfect model on day one. You need a defensible estimate you can act on this week. Napkin math gets you roughly 80 percent of the answer with three inputs: average tokens per request, requests per active user per month, and your blended price per million tokens. Everything else is refinement.

The value is not the exact number. It is seeing which input your cost is most sensitive to, so you know what to measure carefully and what to ignore.

How do you estimate cost per request?

Start with one request. Count input tokens (the prompt, system message, retrieved context, and history) and expected output tokens. A rough rule: one token is about four characters of English, so 1,000 tokens is roughly 750 words.

Then apply price. Providers bill input and output separately, and output is typically the more expensive side. For example, say a model charges $0.50 per 1M input tokens and $1.50 per 1M output tokens. A request with 2,000 input and 500 output tokens costs (2,000 / 1,000,000 x $0.50) plus (500 / 1,000,000 x $1.50), which is $0.001 plus $0.00075, about $0.00175 per request. Treat those rates as illustrative and plug in your provider's current numbers.
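The arithmetic above fits in a few lines of Python. This is a minimal sketch: the function name and parameters are made up for illustration, and the rates are the same illustrative ones from the example, not any provider's real prices.

```python
def cost_per_request(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request, with input and output billed separately.

    Prices are in dollars per 1M tokens, which is how most providers quote them.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Illustrative rates only: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
cost = cost_per_request(2_000, 500, input_price_per_m=0.50, output_price_per_m=1.50)
print(f"${cost:.5f} per request")
```

Swap in your own token counts and your provider's current rates, and keep the two prices separate rather than averaging them up front.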

Why do output tokens quietly dominate the bill?

Because they are priced higher and because generation is where products get verbose. A summarizer that returns three bullets behaves very differently from an agent that writes long explanations or chains tool calls. If your output-to-input ratio creeps up, your cost curve bends with it.

One practical addition to the usual napkin framing: track output share as its own metric. If output tokens are 60 to 80 percent of your spend, prompt trimming barely helps, and your real lever is shorter, more structured responses.
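Output share is cheap to compute once you log input and output tokens separately. The sketch below uses a hypothetical helper and the illustrative $0.50 and $1.50 rates. The second token profile (2,000 in, 2,000 out) is an assumed example of a verbose product, not measured data.

```python
def output_share(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Fraction of spend that goes to output tokens (0.0 to 1.0).

    The per-million divisor cancels out, so raw token counts times
    per-million prices are enough for a ratio.
    """
    input_cost = input_tokens * input_price_per_m
    output_cost = output_tokens * output_price_per_m
    total = input_cost + output_cost
    return output_cost / total if total else 0.0


# Profile from the worked example: 2,000 input and 500 output tokens.
share_example = output_share(2_000, 500, 0.50, 1.50)

# Assumed verbose profile: same prompt, four times the output.
share_verbose = output_share(2_000, 2_000, 0.50, 1.50)

print(f"example: {share_example:.0%}, verbose: {share_verbose:.0%}")
```

With these numbers the example request sits a little above 40 percent output share, while the verbose profile lands at 75 percent, which is the range where trimming prompts stops moving the bill.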

How does this scale to a monthly bill?

Multiply up. Say cost per request is about $0.00175 and an active user makes 300 requests a month. That is roughly $0.53 per user per month in raw inference. At 1,000 active users, around $525. At 50,000, north of $26,000. The per-request number felt like rounding error; the monthly total decides your gross margin.

This is the moment to compare per-user cost against per-user price. If you charge $15 a month and inference runs $0.53, you have room. If a power user makes 5,000 requests, that same user costs roughly $8.75, and a flat plan starts leaking margin.
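Here is the same scaling as a rough sketch, using the figures from this section. The function names are hypothetical, and the margin counts raw inference only. It ignores hosting, support, payment fees, and everything else that sits between revenue and gross margin.

```python
COST_PER_REQUEST = 0.00175  # illustrative figure from the worked example


def monthly_cost_per_user(requests_per_month, cost_per_request=COST_PER_REQUEST):
    """Raw inference cost for one user over a month."""
    return requests_per_month * cost_per_request


def inference_margin(price_per_month, inference_cost):
    """Share of the subscription price left after raw inference cost."""
    return (price_per_month - inference_cost) / price_per_month


typical = monthly_cost_per_user(300)    # typical active user
power = monthly_cost_per_user(5_000)    # power user on the same flat plan

for users in (1_000, 50_000):
    print(f"{users:>6} users: ${typical * users:,.0f} per month")

for label, user_cost in (("typical", typical), ("power", power)):
    margin = inference_margin(15.00, user_cost)
    print(f"{label}: ${user_cost:.2f} per month, {margin:.0%} left of a $15 plan")
```

The useful part is running it over your real distribution of requests per user instead of a single average, because a handful of power users can pull the blended margin down further than the mean suggests.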

When does self-hosting beat per-token API pricing?

Self-hosting replaces a variable per-token price with fixed GPU hours. A rented accelerator costs the same whether you run it at 10 percent or 90 percent utilization, so the math only works once you have steady, high volume to keep the hardware busy. Below that line, per-token APIs are usually cheaper and far less operational hassle. Above it, owned or reserved capacity can cut unit cost sharply, but you take on utilization and reliability risk.

The honest version: most early products should stay on per-token APIs until a clear, sustained workload justifies the switch.
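To see where that line sits for your own workload, a rough break-even sketch helps. Every number below is a placeholder assumption. The GPU hourly rate and peak throughput are invented for illustration, and the $0.70 blended API price is simply the example request's $0.00175 spread over its 2,500 tokens. The sketch also ignores engineering time, redundancy, and idle capacity you keep for traffic spikes.

```python
def self_host_cost_per_m(gpu_hourly_rate, peak_tokens_per_second, utilization):
    """Effective dollars per 1M tokens on hardware billed by the hour."""
    if utilization <= 0:
        raise ValueError("utilization must be greater than 0")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return gpu_hourly_rate / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_hourly_rate, peak_tokens_per_second, api_price_per_m):
    """Utilization at which self-hosting matches the API's blended price."""
    peak_tokens_per_hour = peak_tokens_per_second * 3600
    api_cost_at_peak = peak_tokens_per_hour / 1_000_000 * api_price_per_m
    return gpu_hourly_rate / api_cost_at_peak


# Hypothetical inputs. Replace all three with your own measurements and quotes.
GPU_HOURLY_RATE = 3.00           # dollars per GPU hour (assumed)
PEAK_TOKENS_PER_SECOND = 2_000   # sustained throughput at full load (assumed)
API_BLENDED_PRICE_PER_M = 0.70   # dollars per 1M tokens, from the example above

for utilization in (0.10, 0.90):
    unit_cost = self_host_cost_per_m(GPU_HOURLY_RATE, PEAK_TOKENS_PER_SECOND, utilization)
    print(f"{utilization:.0%} utilization: ${unit_cost:.2f} per 1M tokens")

break_even = break_even_utilization(
    GPU_HOURLY_RATE, PEAK_TOKENS_PER_SECOND, API_BLENDED_PRICE_PER_M
)
print(f"break-even utilization: {break_even:.0%}")
```

If the break-even comes out above 100 percent, the hardware cannot beat the API price at any load under these assumptions. If it comes out at a level you can only hit during peak hours, the API is probably still the cheaper option.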

The takeaway: get tokens per request and output share right, multiply out to a monthly number, and compare it to price before you optimize anything. If you want to skip the arithmetic, you can model token costs, user tiers, and margins side by side in Calcaas.

Frequently asked questions

How do I estimate LLM inference cost quickly?

Multiply average tokens per request by requests per month, divide by one million, then multiply by your blended price per million tokens. Count input and output tokens separately because they are priced differently. This napkin estimate is usually accurate enough to guide pricing decisions.

Why are output tokens more expensive than input tokens?

Output tokens are generated one at a time, while a prompt can be processed in parallel, so generation costs providers more per token. They typically price output higher as a result, often at several times the input rate. That means verbose responses, not long prompts, usually drive most of your bill.

Does self-hosting LLMs save money?

Only at sustained high utilization. Self-hosting converts a per-token price into fixed GPU hours, so it pays off when you keep the hardware busy. At low or bursty volume, per-token APIs are usually cheaper and simpler.

What token profile should I use for estimates?

Use a realistic average per active user, not a best-case demo. Track output share as its own number, since it often dominates spend and changes which optimizations actually matter.
