Self-Hosting vs API: The Real Cost Math Behind '1/6 the Price'

Self-hosting an open LLM can cost a fraction of a frontier API, but only when your GPUs stay busy. The honest comparison is GPU dollars per hour divided by your actual throughput, versus the API price per token.

Jun 21, 2026 · 3 min read

Key takeaways

A deployment guide claims that self-hosting GLM-5.2 (a 744B-parameter MoE with 1M context) costs about 1/6 as much as a frontier API for coding workloads.
Self-hosting does not automatically lower cost per token. It converts a variable per-token cost into a fixed per-hour cost.
The real comparison is (GPU $/hour / your tokens per hour) vs API $/token.
Utilization decides everything. At low utilization, the API often wins despite a higher sticker price.
Below your breakeven, self-hosting is more expensive, not cheaper.

Why can '1/6 the cost' be misleading?

The claim is real, but it quietly compares two different kinds of cost. An API charges you per token: do nothing, pay nothing. Self-hosting on rented H200s or B200s charges you per hour, whether the GPUs are slammed or idle. So '1/6 the cost' puts a fixed hourly cost next to a variable per-token cost and compares them as if they were the same unit. They are not.
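To make the fixed-versus-variable difference concrete, here is a minimal Python sketch of the two monthly bills. The GPU rate, API price, and 720-hour month are made-up illustrative numbers, not figures from the deployment guide.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, GPUs rented around the clock


def api_monthly_bill(tokens: float, api_dollars_per_million: float) -> float:
    """Variable cost: scales with tokens, and is zero when you send nothing."""
    return tokens / 1_000_000 * api_dollars_per_million


def self_host_monthly_bill(gpu_dollars_per_hour: float,
                           hours: float = HOURS_PER_MONTH) -> float:
    """Fixed cost: depends only on rented hours, not on tokens served."""
    return gpu_dollars_per_hour * hours


# Hypothetical prices, for illustration only.
API_DOLLARS_PER_MILLION = 6.0
GPU_DOLLARS_PER_HOUR = 34.0

for tokens in (0, 1_000_000_000, 10_000_000_000):
    api = api_monthly_bill(tokens, API_DOLLARS_PER_MILLION)
    rig = self_host_monthly_bill(GPU_DOLLARS_PER_HOUR)
    print(f"{tokens:>14,} tokens  API ${api:>10,.2f}  self-host ${rig:>10,.2f}")
```

The self-host column never moves. Only the token count changes which side is cheaper.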

What is the honest self-host vs API comparison?

The honest comparison is not $/token vs $/token. It is:

(GPU $/hour / your actual tokens per hour) vs API $/token

The left side is controlled entirely by one thing the headline never mentions: utilization.
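The formula above is a single division. A minimal sketch, with hypothetical inputs you would replace with your own measurements:

```python
def effective_cost_per_token(gpu_dollars_per_hour: float,
                             tokens_per_hour: float) -> float:
    """Self-host cost per token: hourly GPU spend divided by tokens actually served."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return gpu_dollars_per_hour / tokens_per_hour


# Hypothetical inputs, for illustration only.
GPU_DOLLARS_PER_HOUR = 32.0      # assumed rate for the whole deployment
TOKENS_PER_HOUR = 8_000_000      # assumed: averaged over a real hour of your traffic
API_DOLLARS_PER_MILLION = 10.0   # assumed API price per 1M tokens

self_host_per_million = (
    effective_cost_per_token(GPU_DOLLARS_PER_HOUR, TOKENS_PER_HOUR) * 1_000_000
)
print(f"self-host: ${self_host_per_million:.2f} per 1M tokens")
print(f"API:       ${API_DOLLARS_PER_MILLION:.2f} per 1M tokens")
```

The number that matters is `TOKENS_PER_HOUR`: it has to come from your real traffic, not from a peak benchmark.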

Why does GPU utilization decide everything?

Say a self-host setup hits its cheap per-token rate at, for example, 80-90% GPU utilization, throughput humming, batches full. At that load, 1/6 of the API cost can be real. Now run the same rig at 15% utilization because your traffic is spiky and daytime-only. Your tokens per hour collapse, but your $/hour does not. The effective cost per token can quietly cross above the API you were trying to beat. The sticker price assumes best-case load; your actual cost depends on your traffic shape.
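Here is a rough sketch of that collapse. It assumes tokens per hour scale linearly with utilization (utilization treated as a fraction of peak throughput), which is a simplification, and every number in it is hypothetical.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            peak_tokens_per_hour: float,
                            utilization: float) -> float:
    """Effective $ per 1M tokens, assuming throughput = utilization * peak."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_hour * utilization
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Hypothetical rig and API price, for illustration only.
GPU_DOLLARS_PER_HOUR = 34.0
PEAK_TOKENS_PER_HOUR = 40_000_000
API_DOLLARS_PER_MILLION = 6.0

for utilization in (0.85, 0.50, 0.15, 0.05):
    cost = cost_per_million_tokens(
        GPU_DOLLARS_PER_HOUR, PEAK_TOKENS_PER_HOUR, utilization
    )
    verdict = "self-host cheaper" if cost < API_DOLLARS_PER_MILLION else "API cheaper"
    print(f"{utilization:>4.0%}  ${cost:6.2f} per 1M tokens  ({verdict})")
```

With these made-up inputs, the rig lands at about 1/6 of the API price at 85% utilization and sits roughly at parity around 15%, before any ops overhead is counted.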

Three cost regimes

High, steady volume: self-hosting can genuinely win, since fixed cost amortizes across full GPUs. The 1/6 claim lives here.
Spiky or low volume: APIs usually win, because you only pay when you call and someone else eats the idle time.
The messy middle: a real breakeven calculation, not a vibe. You need your tokens per hour and your GPU $/hour to know.
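One way to turn the three regimes into numbers is to solve for the utilization where both sides are equal. The sketch below assumes the same linear throughput model (tokens per hour = utilization times peak). The `regime` helper and its 25% band around breakeven are arbitrary illustrative choices, not a rule from the guide.

```python
def breakeven_utilization(gpu_dollars_per_hour: float,
                          peak_tokens_per_hour: float,
                          api_dollars_per_million: float) -> float:
    """Fraction of peak throughput where self-host $/token equals API $/token.

    A result above 1.0 means self-hosting never wins on this hardware.
    """
    api_dollars_per_token = api_dollars_per_million / 1_000_000
    return gpu_dollars_per_hour / (api_dollars_per_token * peak_tokens_per_hour)


def regime(utilization: float, breakeven: float, margin: float = 0.25) -> str:
    """Rough label; the band around breakeven is a hypothetical safety margin."""
    if utilization >= breakeven * (1 + margin):
        return "self-host likely wins"
    if utilization <= breakeven * (1 - margin):
        return "API likely wins"
    return "messy middle: run the full calculation"


# Hypothetical inputs, for illustration only.
b = breakeven_utilization(34.0, 40_000_000, 6.0)
print(f"breakeven utilization: {b:.1%}")
for u in (0.85, 0.15, 0.05):
    print(f"{u:>4.0%}: {regime(u, b)}")
```

This covers raw GPU cost only. Ops overhead pushes the real breakeven higher.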

How do you model self-host vs API before you migrate?

1. Estimate real throughput at your traffic shape, not the benchmark's. Daytime-only or bursty traffic sets your effective tokens per hour.
2. Get the all-in GPU $/hour, including idle hours, not just active ones.
3. Divide to get your true effective $/token, then compare to the API.
4. Add the operational tax: ops time, reliability, scaling. The API price bundles these in; your self-host bill does not.
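The four steps above can be sketched as a small model. The `SelfHostPlan` class, its field names, and all the numbers are hypothetical placeholders, and the ops figure in particular is a guess you would replace with your own estimate.

```python
from dataclasses import dataclass


@dataclass
class SelfHostPlan:
    gpu_dollars_per_hour: float    # step 2: all-in rate for the whole deployment
    billed_hours_per_month: float  # step 2: every rented hour, idle ones included
    busy_hours_per_month: float    # step 1: hours that actually carry traffic
    tokens_per_busy_hour: float    # step 1: measured at your traffic shape
    ops_dollars_per_month: float   # step 4: ops time, reliability, scaling

    def tokens_per_month(self) -> float:
        return self.busy_hours_per_month * self.tokens_per_busy_hour

    def dollars_per_month(self) -> float:
        gpu_bill = self.gpu_dollars_per_hour * self.billed_hours_per_month
        return gpu_bill + self.ops_dollars_per_month

    def dollars_per_million_tokens(self) -> float:
        """Step 3: total monthly spend divided by tokens actually served."""
        tokens = self.tokens_per_month()
        if tokens <= 0:
            raise ValueError("plan serves no tokens")
        return self.dollars_per_month() / tokens * 1_000_000


# Hypothetical daytime-only workload: 8 busy hours a day, GPUs rented 24/7.
plan = SelfHostPlan(
    gpu_dollars_per_hour=34.0,
    billed_hours_per_month=720,
    busy_hours_per_month=240,
    tokens_per_busy_hour=20_000_000,
    ops_dollars_per_month=5_000.0,
)
API_DOLLARS_PER_MILLION = 6.0  # assumed API price

self_host = plan.dollars_per_million_tokens()
print(f"self-host: ${self_host:.2f} per 1M tokens")
print(f"API:       ${API_DOLLARS_PER_MILLION:.2f} per 1M tokens")
print("self-host wins" if self_host < API_DOLLARS_PER_MILLION else "API wins")
```

Keeping billed hours and busy hours as separate fields is the point of the design: the gap between them is the idle time you pay for.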

A 1M-context, 744B MoE running at FP8 is impressive engineering. But the cost story is not 'self-host is 6x cheaper.' It is '6x cheaper at high utilization, and possibly more expensive below your breakeven.' You can model self-host $/hour divided by throughput against API $/token across utilization levels in Calcaas before you commit a GPU budget.

Frequently asked questions

Is self-hosting an LLM cheaper than using an API?

Only at high, steady GPU utilization. Self-hosting trades a per-token cost for a fixed per-hour GPU cost, so it is cheaper only when throughput is high enough to push your effective cost per token below the API price.

What is the breakeven point for self-hosting an LLM?

It is the utilization level where (GPU $/hour / your tokens per hour) equals the API price per token. Above it, self-hosting wins; below it, the API is cheaper.

Why does GPU utilization matter for inference cost?

Because a rented GPU costs the same per hour whether busy or idle. Low utilization means fewer tokens per hour over the same fixed cost, which raises your effective cost per token.

What hidden costs come with self-hosting an LLM?

Idle GPU hours, ops and on-call time, reliability engineering, and scaling. None appear in the per-hour GPU rate, but all are bundled into an API's per-token price.
