TPU 8i vs NVIDIA Rubin and B200: Cost Per Token for LLM Inference (2026)

The accelerator with the best benchmark is not always the cheapest per token, because cost per token depends on price per hour, real throughput, and how much migration and lock-in you have to amortize.

Jun 23, 2026 · 4 min read

Key takeaways

  • Cost per token, not raw benchmark throughput, is the number that decides inference economics.
  • Cost per token = effective hourly price / (tokens per second x 3,600 x utilization).
  • A faster accelerator can still lose on cost if its hourly price or your utilization is worse.
  • Migration cost (porting, re-tuning, lock-in) should be amortized into per-token cost.
  • Re-run the comparison per model and per workload; rankings move with batch size and sequence length.

Why is cost per token the only fair comparison?

Because benchmarks measure speed, and bills measure money. TPU 8i, NVIDIA Rubin, and B200 each post different throughput on different models, but throughput alone tells you nothing about margin. Two accelerators can hit similar tokens per second and still differ on cost per token if their hourly prices or achievable utilization differ.

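The formula to hold in your head: cost per token equals effective hourly price divided by (tokens per second × 3,600 × utilization). With the price in dollars per hour and utilization expressed as a fraction between 0 and 1, the result is dollars per token; multiply by 1,000,000 to get dollars per million tokens. Every comparison reduces to those three inputs.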
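A minimal sketch of that formula in Python follows. The function names and the input figures are illustrative assumptions, not published prices or benchmark results for any accelerator.

```python
def cost_per_token(hourly_price_usd: float,
                   tokens_per_second: float,
                   utilization: float) -> float:
    """Dollars per token = hourly price / (tokens/s * 3,600 * utilization)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be a fraction in (0, 1]")
    # Tokens actually served in one billed hour.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price_usd / tokens_per_hour


def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Same quantity, scaled to the more readable dollars per million tokens."""
    return cost_per_token(hourly_price_usd, tokens_per_second, utilization) * 1_000_000


# Hypothetical inputs: $10 per hour, 5,000 tokens/s, 50% utilization.
example = cost_per_million_tokens(10.0, 5000.0, 0.5)
print(f"${example:.2f} per million tokens")
```

Halving utilization doubles the result, which is why utilization deserves as much scrutiny as the headline throughput figure.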

How do you turn a benchmark into a cost number?

Take the published throughput for your model class, apply your realistic utilization, and divide the effective hourly price by the resulting hourly token volume. Do it for each accelerator under the same conditions: same model, same sequence lengths, same batch strategy. Then you can rank by dollars per million tokens instead of by a headline tokens-per-second figure.

This is where rankings often flip. An accelerator that looks fastest on a spec sheet can land mid-pack on cost once its hourly price and your utilization are folded in.
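The sketch below shows how such a flip can happen. The three accelerators, their prices, throughputs, and utilizations are invented placeholders (they do not represent TPU 8i, Rubin, or B200), chosen only so that the fastest option is not the cheapest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str                 # hypothetical label, not a real product
    hourly_price_usd: float   # effective price per hour (assumed)
    tokens_per_second: float  # throughput on your model (assumed)
    utilization: float        # fraction of each hour spent serving (assumed)

    def usd_per_million_tokens(self) -> float:
        tokens_per_hour = self.tokens_per_second * 3600 * self.utilization
        return self.hourly_price_usd / tokens_per_hour * 1_000_000


# Placeholder figures, measured under the same model, sequence lengths, and batching.
candidates = [
    Accelerator("accel_a", hourly_price_usd=12.0, tokens_per_second=6000.0, utilization=0.50),
    Accelerator("accel_b", hourly_price_usd=6.0, tokens_per_second=4000.0, utilization=0.60),
    Accelerator("accel_c", hourly_price_usd=8.0, tokens_per_second=5000.0, utilization=0.55),
]

by_speed = sorted(candidates, key=lambda a: a.tokens_per_second, reverse=True)
by_cost = sorted(candidates, key=lambda a: a.usd_per_million_tokens())

print("Fastest first: ", [a.name for a in by_speed])
print("Cheapest first:", [a.name for a in by_cost])
for a in by_cost:
    print(f"{a.name}: ${a.usd_per_million_tokens():.3f} per million tokens")
```

With these made-up inputs, the accelerator with the highest throughput ends up with the highest cost per million tokens, because its hourly price and lower utilization outweigh its speed.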

What about migration cost and lock-in?

This is the line most comparisons leave out. Moving between TPU and NVIDIA stacks is not free: you may re-tune serving code, swap kernels or runtimes, re-validate quality, and accept some platform lock-in. A 10 to 20 percent per-token saving can evaporate if you only run that workload for a few months, because the migration effort has to be amortized over the tokens you actually serve.

Treat migration as a fixed cost spread across projected token volume. The break-even question is not "is it cheaper per token?" but "is it cheaper per token for long enough to pay back the switch?"
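One way to put a number on that payback period is sketched below. The migration cost, per-token prices, and monthly volume are hypothetical, and the function names are illustrative.

```python
def break_even_tokens(migration_cost_usd: float,
                      current_usd_per_mtok: float,
                      new_usd_per_mtok: float) -> float:
    """Tokens you must serve on the new platform before the switch pays for itself."""
    saving_per_mtok = current_usd_per_mtok - new_usd_per_mtok
    if saving_per_mtok <= 0:
        # No per-token saving means the migration never pays back.
        return float("inf")
    return migration_cost_usd / saving_per_mtok * 1_000_000


def break_even_months(migration_cost_usd: float,
                      current_usd_per_mtok: float,
                      new_usd_per_mtok: float,
                      tokens_per_month: float) -> float:
    """Same break-even point, expressed in months at a steady monthly volume."""
    tokens = break_even_tokens(migration_cost_usd, current_usd_per_mtok, new_usd_per_mtok)
    return tokens / tokens_per_month


# Hypothetical case: $50,000 migration, 15% saving ($1.00 -> $0.85 per million tokens),
# 50 billion tokens served per month.
months = break_even_months(50_000.0, 1.00, 0.85, 50e9)
print(f"Break-even after about {months:.1f} months")
```

Under these assumed figures the switch takes roughly six to seven months to pay back, so a workload expected to last only a few months would lose money on it.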

When should you switch accelerators?

When the per-token saving, net of migration and amortized over realistic volume, clears a meaningful margin and your workload is stable enough to stay put. For short-lived or fast-changing workloads, the cheapest move is often to stay on what you have. For large, durable inference workloads, even a modest per-token gap is worth chasing.
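That rule can be written as a rough decision helper. This is a sketch under stated assumptions: the 5 percent threshold for a "meaningful margin" is an arbitrary placeholder, and all prices, volumes, and lifetimes are hypothetical.

```python
def should_switch(current_usd_per_mtok: float,
                  new_usd_per_mtok: float,
                  tokens_per_month: float,
                  expected_months: float,
                  migration_cost_usd: float,
                  min_net_saving_fraction: float = 0.05) -> bool:
    """True if the saving, net of migration, clears the chosen margin over the workload's life."""
    total_mtok = tokens_per_month * expected_months / 1_000_000
    stay_cost = current_usd_per_mtok * total_mtok
    if stay_cost <= 0:
        return False  # nothing to save on a workload with no spend
    switch_cost = new_usd_per_mtok * total_mtok + migration_cost_usd
    net_saving_fraction = (stay_cost - switch_cost) / stay_cost
    return net_saving_fraction >= min_net_saving_fraction


# Same hypothetical prices, two different workload lifetimes.
short_lived = should_switch(1.00, 0.85, 50e9, expected_months=3, migration_cost_usd=50_000.0)
durable = should_switch(1.00, 0.85, 50e9, expected_months=24, migration_cost_usd=50_000.0)
print("Switch for a 3-month workload: ", short_lived)
print("Switch for a 24-month workload:", durable)
```

The per-token gap is identical in both calls; only the expected lifetime changes, and with it the answer.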

The takeaway: convert every benchmark to dollars per million tokens at your utilization, then subtract amortized migration before you crown a winner. You can model accelerator cost per token and compare options side by side in Calcaas.

Frequently asked questions

Is TPU cheaper than NVIDIA B200 for LLM inference?

It depends on the model, the effective hourly price, and achievable utilization. Cost per token, not raw throughput, decides it, and the ranking can change between models and batch settings. Always normalize before concluding.

How do I calculate cost per token for an accelerator?

Divide the effective hourly price by the product of tokens per second, 3,600, and utilization. That converts a benchmark and a rental rate into dollars per token, which you can compare directly across TPU, Rubin, and B200.

Does migration cost matter when switching GPUs or TPUs?

Yes. Porting serving code, re-tuning, re-validating quality, and lock-in are real costs. Amortize them over projected token volume; a small per-token saving may not pay back a switch for a short-lived workload.

Which accelerator should I pick for inference in 2026?

The one with the lowest cost per token at your real utilization, after accounting for amortized migration and your tolerance for lock-in. For durable, high-volume workloads, even a modest per-token edge compounds.
