TPU 8i vs NVIDIA Rubin and B200: Cost Per Token for LLM Inference (2026)

The accelerator with the best benchmark is not always the cheapest per token, because cost per token depends on price per hour, real throughput, and how much migration and lock-in you have to amortize.

Jun 23, 2026 · 4 min read

Key takeaways

Cost per token, not raw benchmark throughput, is the number that decides inference economics.
Cost per token = effective hourly price / (tokens per second x 3,600 x utilization).
A faster accelerator can still lose on cost if its hourly price or your utilization is worse.
Migration cost (porting, re-tuning, lock-in) should be amortized into per-token cost.
Re-run the comparison per model and per workload; rankings move with batch size and sequence length.

Why is cost per token the only fair comparison?

Because benchmarks measure speed, and bills measure money. TPU 8i, NVIDIA Rubin, and B200 each post different throughput on different models, but throughput alone tells you nothing about margin. Two accelerators can hit similar tokens per second and still differ on cost per token if their hourly prices or achievable utilization differ.

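The formula to hold in your head: cost per token equals effective hourly price divided by (tokens per second x 3,600 x utilization). The 3,600 converts seconds to hours, and utilization is a fraction between 0 and 1. Every comparison reduces to those three inputs: price, throughput, and utilization.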
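As a minimal sketch, the formula can be written as a small Python helper. The function names and the sample inputs are illustrative placeholders, not vendor prices or measured throughput.

```python
def cost_per_token(hourly_price_usd: float,
                   tokens_per_second: float,
                   utilization: float) -> float:
    """Dollars per token = hourly price / (tokens/s x 3,600 x utilization)."""
    if hourly_price_usd < 0:
        raise ValueError("hourly_price_usd must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be a fraction in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price_usd / tokens_per_hour


def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Same quantity, scaled to the more readable dollars per million tokens."""
    return cost_per_token(hourly_price_usd, tokens_per_second, utilization) * 1_000_000


# Placeholder inputs (assumptions): $10/hour, 5,000 tokens/s, 50% utilization.
example = cost_per_million_tokens(10.0, 5000.0, 0.5)
print(f"${example:.2f} per million tokens")  # $1.11 per million tokens
```

Halving utilization doubles the result, which is why a benchmark figure quoted at full load can understate what you actually pay.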

How do you turn a benchmark into a cost number?

Take the published throughput for your model class, apply your realistic utilization, and divide the effective hourly price by the resulting hourly token volume. Do it for each accelerator under the same conditions: same model, same sequence lengths, same batch strategy. Then you can rank by dollars per million tokens instead of by a headline tokens-per-second figure.

This is where rankings often flip. An accelerator that looks fastest on a spec sheet can land mid-pack on cost once its hourly price and your utilization are folded in.
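The flip can be illustrated with a short Python sketch. The three accelerators below are hypothetical, and their prices, throughputs, and utilizations are invented placeholders chosen only to show the effect; substitute your own quotes and measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    hourly_price_usd: float   # effective price you actually pay per hour
    tokens_per_second: float  # throughput for your model and batch strategy
    utilization: float        # fraction of each hour spent serving, 0 to 1

    def usd_per_million_tokens(self) -> float:
        tokens_per_hour = self.tokens_per_second * 3600 * self.utilization
        return self.hourly_price_usd / tokens_per_hour * 1_000_000


# Hypothetical candidates with placeholder numbers (not real vendor data).
candidates = [
    Accelerator("accel_fast", hourly_price_usd=9.0, tokens_per_second=6000.0, utilization=0.5),
    Accelerator("accel_mid", hourly_price_usd=6.0, tokens_per_second=4000.0, utilization=0.6),
    Accelerator("accel_slow", hourly_price_usd=5.0, tokens_per_second=2500.0, utilization=0.6),
]

by_speed = sorted(candidates, key=lambda a: a.tokens_per_second, reverse=True)
by_cost = sorted(candidates, key=lambda a: a.usd_per_million_tokens())

print("By throughput:", [a.name for a in by_speed])
for a in by_cost:
    print(f"{a.name}: ${a.usd_per_million_tokens():.3f} per million tokens")
```

With these placeholder inputs the fastest candidate lands second on cost, behind a slower option that is cheaper per hour and easier to keep busy.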

What about migration cost and lock-in?

This is the line most comparisons leave out. Moving between TPU and NVIDIA stacks is not free: you may re-tune serving code, swap kernels or runtimes, re-validate quality, and accept some platform lock-in. A 10 to 20 percent per-token saving can evaporate if you only run that workload for a few months, because the migration effort has to be amortized over the tokens you actually serve.

Treat migration as a fixed cost spread across projected token volume. The break-even question is not "is it cheaper per token?" but "is it cheaper per token for long enough to pay back the switch?"
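A rough Python sketch of that payback calculation follows. The migration cost, per-million prices, and monthly volume are assumed placeholders, and the function names are illustrative.

```python
from typing import Optional


def breakeven_tokens(migration_cost_usd: float,
                     current_usd_per_million: float,
                     new_usd_per_million: float) -> Optional[float]:
    """Tokens you must serve on the new platform before the switch pays back.

    Returns None when the new platform is not cheaper per token.
    """
    saving_per_million = current_usd_per_million - new_usd_per_million
    if saving_per_million <= 0:
        return None
    return migration_cost_usd / saving_per_million * 1_000_000


def months_to_payback(migration_cost_usd: float,
                      current_usd_per_million: float,
                      new_usd_per_million: float,
                      tokens_per_month: float) -> Optional[float]:
    """Break-even volume expressed as months at a steady monthly token volume."""
    tokens = breakeven_tokens(migration_cost_usd,
                              current_usd_per_million,
                              new_usd_per_million)
    if tokens is None:
        return None
    return tokens / tokens_per_month


# Placeholder scenario (assumptions): $50,000 migration effort, a 15% saving
# ($1.00 -> $0.85 per million tokens), 20 billion tokens served per month.
months = months_to_payback(50_000, 1.00, 0.85, 20e9)
print(f"Payback in about {months:.1f} months")
```

Under these assumed inputs the payback period is well over a year, so a workload expected to last only a few months would never recover the migration cost.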

When should you switch accelerators?

When the per-token saving, net of migration and amortized over realistic volume, clears a meaningful margin and your workload is stable enough to stay put. For short-lived or fast-changing workloads, the cheapest move is often to stay on what you have. For large, durable inference workloads, even a modest per-token gap is worth chasing.
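One way to turn that rule into a check is sketched below in Python. The 5 percent `min_margin` threshold and all input figures are assumptions for illustration; pick a margin that matches your own risk tolerance.

```python
def amortized_usd_per_million(base_usd_per_million: float,
                              migration_cost_usd: float,
                              projected_tokens: float) -> float:
    """New platform's cost per million tokens with migration spread over volume."""
    if projected_tokens <= 0:
        raise ValueError("projected_tokens must be positive")
    return base_usd_per_million + migration_cost_usd / projected_tokens * 1_000_000


def should_switch(current_usd_per_million: float,
                  new_usd_per_million: float,
                  migration_cost_usd: float,
                  projected_tokens: float,
                  min_margin: float = 0.05) -> bool:
    """True when the net saving, after amortized migration, clears min_margin.

    min_margin is a hypothetical threshold expressed as a fraction of the
    current cost per million tokens.
    """
    effective_new = amortized_usd_per_million(new_usd_per_million,
                                              migration_cost_usd,
                                              projected_tokens)
    net_saving_fraction = (current_usd_per_million - effective_new) / current_usd_per_million
    return net_saving_fraction >= min_margin


# Placeholder scenarios (assumptions): same prices, different workload lifetimes.
short_lived = should_switch(1.00, 0.85, 50_000, projected_tokens=1e11)
durable = should_switch(1.00, 0.85, 50_000, projected_tokens=2e12)
print("Short-lived workload, switch?", short_lived)  # False
print("Durable workload, switch?", durable)          # True
```

The same 15 percent headline saving produces opposite answers because only the projected token volume changes between the two scenarios.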

The takeaway: convert every benchmark to dollars per million tokens at your utilization, then subtract amortized migration before you crown a winner. You can model accelerator cost per token and compare options side by side in Calcaas.

Frequently asked questions

Is TPU cheaper than NVIDIA B200 for LLM inference?

It depends on the model, the effective hourly price, and achievable utilization. Cost per token, not raw throughput, decides it, and the ranking can change between models and batch settings. Always normalize before concluding.

How do I calculate cost per token for an accelerator?

Divide the effective hourly price by tokens per second times 3,600 times utilization. That converts a benchmark and a rental rate into dollars per token, which you can compare directly across TPU, Rubin, and B200.

Does migration cost matter when switching GPUs or TPUs?

Yes. Porting serving code, re-tuning, re-validating quality, and lock-in are real costs. Amortize them over projected token volume; a small per-token saving may not pay back a switch for a short-lived workload.

Which accelerator should I pick for inference in 2026?

The one with the lowest cost per token at your real utilization, after accounting for amortized migration and your tolerance for lock-in. For durable, high-volume workloads, even a modest per-token edge compounds.
