NVIDIA B200 Cloud Pricing in 2026: How to Compare Per-Hour GPU Costs
B200 rental prices vary widely across clouds, so the number that matters is not dollars per hour but dollars per million tokens once you factor in throughput and utilization.
Jun 23, 2026 · 3 min read
Key takeaways
B200 on-demand rates differ substantially between hyperscalers and specialist GPU clouds.
Per-hour price is the headline; cost per 1M tokens is the metric that decides margins.
Reserved and spot capacity can cut the hourly rate sharply versus on-demand.
Utilization is the hidden multiplier: an idle B200 still bills full price.
Convert GPU-hour cost to per-token cost before comparing against per-token API pricing.
Why is B200 pricing so different across providers?
Because providers are not selling the same thing. A hyperscaler bundles support, networking, compliance, and reliability into the hourly rate; a specialist GPU cloud often strips that down to compete on raw price. Add region, contract length, and on-demand versus reserved versus spot, and the same chip can carry very different sticker prices.
The honest approach is to normalize: same GPU count, same commitment level, same region, then look at the spread rather than any single quote.
What number should you actually compare?
Not dollars per hour. Dollars per million tokens. A cheap hourly rate on a GPU you cannot keep busy is more expensive than a higher rate at full utilization.
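The bridge is throughput. If a B200 instance sustains T tokens per second for your model and batch settings, it produces T × 3,600 tokens per hour at full load. Divide the hourly rate by that volume to get cost per token, then multiply by 1,000,000 for cost per million tokens. If the GPU is busy only part of the time, divide by that busy fraction as well, because idle hours still bill. The provider with the lower hourly price does not always win once throughput and utilization differ.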
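The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the function name, parameter names, and the numbers in the usage lines are all hypothetical placeholders, not real B200 rates or measured throughput.

```python
def cost_per_million_tokens(hourly_rate, tokens_per_second, utilization=1.0):
    """Convert a GPU-hour price into dollars per one million tokens.

    hourly_rate       -- dollars per instance-hour (placeholder input)
    tokens_per_second -- sustained throughput while the GPU is busy
    utilization       -- fraction of billed time spent serving, in (0, 1]
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually produced in one billed hour.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


# Illustrative numbers only: a lower hourly rate can lose at low utilization.
cheap_but_idle = cost_per_million_tokens(3.00, 1000, utilization=0.30)
pricey_but_busy = cost_per_million_tokens(4.50, 1000, utilization=0.90)
print(f"cheap but idle:  ${cheap_but_idle:.2f} per 1M tokens")
print(f"pricey but busy: ${pricey_but_busy:.2f} per 1M tokens")
```

With these placeholder inputs the instance with the higher hourly rate comes out cheaper per token, which is the point of normalizing before comparing.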
How do reserved and spot pricing change the picture?
A lot. On-demand is the most expensive way to rent a B200. Committing to reserved capacity typically lowers the effective hourly rate in exchange for a term, and spot or preemptible instances can be cheaper still in exchange for interruption risk. For steady inference workloads, reserved capacity usually wins; for bursty or batch jobs, spot can be the cheapest route if your system tolerates interruptions.
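One way to make the spot trade-off concrete is to inflate the spot rate by the share of work lost to interruptions. The sketch below assumes a deliberately simple model in which a fixed fraction of spot compute is wasted on preempted work that must be redone; that model, the function names, and every rate shown are hypothetical and are not taken from any provider.

```python
def effective_hourly_rate(hourly_rate, lost_fraction=0.0):
    """Hourly cost per hour of useful work.

    lost_fraction is the assumed share of billed time wasted on
    interrupted work that has to be redone (0 for on-demand/reserved).
    """
    if not 0 <= lost_fraction < 1:
        raise ValueError("lost_fraction must be in [0, 1)")
    return hourly_rate / (1 - lost_fraction)


def cheapest_option(options):
    """Return the name of the option with the lowest effective rate.

    options maps a name to a (hourly_rate, lost_fraction) tuple.
    """
    return min(options, key=lambda name: effective_hourly_rate(*options[name]))


# Placeholder rates, not real quotes.
quotes = {
    "on_demand": (4.00, 0.0),
    "reserved": (3.00, 0.0),
    "spot": (2.00, 0.20),  # assumes 20% of spot time is lost to preemption
}
print(cheapest_option(quotes))
```

Under this model, spot stays ahead only while the lost fraction is small; raise it far enough and reserved capacity becomes the cheaper route for the same work.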
When does renting a B200 beat per-token API pricing?
When you have enough sustained, predictable volume to keep the GPU busy. Self-hosting on rented B200s turns a per-token price into a fixed hourly cost, so the per-token math only beats an API once utilization is high. Below that break-even point, a managed per-token API is usually cheaper and simpler. Above it, self-hosted throughput can undercut API pricing, but you absorb utilization and ops risk.
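The break-even point can be estimated by solving for the utilization at which self-hosted cost per million tokens equals the API price. The following is a rough sketch under stated assumptions: it ignores ops overhead and treats throughput as constant, and the function name and all inputs are hypothetical placeholders.

```python
def break_even_utilization(hourly_rate, tokens_per_second, api_price_per_million):
    """Utilization at which renting matches a per-token API price.

    Returns a fraction; a value above 1.0 means the rented GPU cannot
    match the API price even when busy 100% of the time.
    """
    if tokens_per_second <= 0 or api_price_per_million <= 0:
        raise ValueError("throughput and API price must be positive")
    # Cost per 1M tokens at full utilization.
    full_load_cost = hourly_rate / (tokens_per_second * 3600) * 1_000_000
    # Cost scales as 1 / utilization, so break-even is the ratio.
    return full_load_cost / api_price_per_million


# Placeholder inputs, not real prices or benchmarks.
u = break_even_utilization(hourly_rate=3.60, tokens_per_second=1000,
                           api_price_per_million=2.00)
print(f"break-even utilization: {u:.0%}")
```

If your realistic sustained utilization sits comfortably above the returned value, renting is worth modeling in more detail; if it sits below, the per-token API is the cheaper option under these assumptions.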
The takeaway: collect per-hour rates, then convert every one of them to cost per million tokens at realistic utilization before you choose. You can model GPU-hour cost, throughput, and per-token economics side by side in Calcaas.
Frequently asked questions
How much does it cost to rent an NVIDIA B200 in the cloud?
It varies widely by provider, region, and commitment, with on-demand being the most expensive and reserved or spot capacity considerably cheaper. Rather than rely on one headline rate, compare normalized quotes for the same GPU count and term.
Is per-hour GPU price the right way to compare providers?
No. Per-hour price is only the input. The decision metric is cost per million tokens, which depends on throughput and utilization. A higher hourly rate at full utilization can be cheaper per token than a low rate on an idle GPU.
Should I rent B200s or use a per-token API?
Rent when you have sustained, high utilization that keeps the hardware busy; otherwise a per-token API is usually cheaper and simpler. Convert your expected GPU-hour cost to per-token cost and compare directly.
What makes B200 cloud pricing drop?
Longer commitments through reserved capacity and interruption-tolerant spot or preemptible instances both lower the effective hourly rate versus on-demand. Region and provider type also matter.