Self-Hosting vs API: The Real Cost Math Behind '1/6 the Price'
Self-hosting an open LLM can cost a fraction of a frontier API, but only when your GPUs stay busy. The honest comparison is GPU dollars per hour divided by your actual throughput, versus the API price per token.
Jun 21, 2026
Key takeaways
A deployment guide claims self-hosting GLM-5.2 (a 744B-parameter MoE with 1M context) costs about 1/6 of a frontier API on coding.
Self-hosting does not automatically lower cost per token. It converts a variable per-token cost into a fixed per-hour cost.
The real comparison is (GPU $/hour / your tokens per hour) vs API $/token.
Utilization decides everything. At low utilization, the API often wins despite a higher sticker price.
Below your breakeven, self-hosting is more expensive, not cheaper.
Why can '1/6 the cost' be misleading?
The claim is real, but it quietly compares two different kinds of cost. An API charges you per token: do nothing, pay nothing. Self-hosting on rented H200s or B200s charges you per hour, whether the GPUs are slammed or idle. So '1/6 the cost' converts a variable cost into a fixed one and then compares them as if they were the same unit. They are not.
What is the honest self-host vs API comparison?
The honest comparison is not $/token vs $/token. It is:
(GPU $/hour / your actual tokens per hour) vs API $/token
The left side is controlled entirely by one thing the headline never mentions: utilization.
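As a minimal sketch of that formula, the function below divides a GPU bill by observed throughput. The function name and all dollar and token figures are hypothetical placeholders, not numbers from the deployment guide.

```python
def effective_cost_per_token(gpu_dollars_per_hour: float, tokens_per_hour: float) -> float:
    """Self-host cost per token: GPU $/hour divided by tokens actually served per hour."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return gpu_dollars_per_hour / tokens_per_hour


if __name__ == "__main__":
    # Hypothetical inputs: a $34/hour cluster serving 17M tokens per hour,
    # compared against an API priced at $12 per million tokens.
    self_host = effective_cost_per_token(34.0, 17_000_000)
    api = 12.0 / 1_000_000
    print(f"self-host: ${self_host * 1e6:.2f} per million tokens")
    print(f"api:       ${api * 1e6:.2f} per million tokens")
```

Both sides end up in the same unit, dollars per token, which is the only form in which the comparison is fair.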
Why does GPU utilization decide everything?
Say a self-hosted setup reaches its cheap per-token rate at, for example, 80-90% GPU utilization, with throughput high and batches full. At that load, 1/6 of the API cost can be real. Now run the same rig at 15% utilization because your traffic is spiky and daytime-only. Your tokens per hour collapse, but your $/hour does not. The effective cost per token can quietly climb above that of the API you were trying to beat. The sticker price assumes best-case load; your actual cost depends on your traffic shape.
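To make the effect concrete, here is a small sweep over utilization levels. It assumes, as a simplification, that tokens per hour scale linearly with utilization. The $34/hour rate, 20M tokens/hour peak, and $12 per million API price are invented for illustration.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            peak_tokens_per_hour: float,
                            utilization: float) -> float:
    """Effective self-host $ per million tokens at a given utilization (0 < u <= 1).

    Assumes throughput scales linearly with utilization, which is a simplification.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_hour * utilization
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    GPU_DOLLARS_PER_HOUR = 34.0        # hypothetical cluster rate
    PEAK_TOKENS_PER_HOUR = 20_000_000  # hypothetical throughput ceiling
    API_DOLLARS_PER_MILLION = 12.0     # hypothetical API price

    for u in (0.10, 0.15, 0.25, 0.50, 0.85):
        cost = cost_per_million_tokens(GPU_DOLLARS_PER_HOUR, PEAK_TOKENS_PER_HOUR, u)
        verdict = "self-host cheaper" if cost < API_DOLLARS_PER_MILLION else "API cheaper"
        print(f"{u:>4.0%}  ${cost:6.2f}/M tokens  {verdict}")
```

With these placeholder inputs, the self-host cost is about 1/6 of the API price at 85% utilization, roughly at parity at 15%, and above the API price at 10%, all before any ops overhead is counted.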
Three cost regimes
High, steady volume: self-hosting can genuinely win, since fixed cost amortizes across full GPUs. The 1/6 claim lives here.
Spiky or low volume: APIs usually win, because you only pay when you call and someone else eats the idle time.
The messy middle: a real breakeven calculation, not a vibe. You need your tokens per hour and your GPU $/hour to know.
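For the messy middle, the breakeven can be stated as a volume: the tokens per hour at which the GPU bill equals what the API would have charged. The sketch below uses hypothetical function names and prices, and ignores ops costs.

```python
def breakeven_tokens_per_hour(gpu_dollars_per_hour: float,
                              api_dollars_per_million_tokens: float) -> float:
    """Tokens per hour at which self-host $/token equals the API $/token."""
    if api_dollars_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    api_dollars_per_token = api_dollars_per_million_tokens / 1_000_000
    return gpu_dollars_per_hour / api_dollars_per_token


def cheaper_option(avg_tokens_per_hour: float,
                   gpu_dollars_per_hour: float,
                   api_dollars_per_million_tokens: float) -> str:
    """Return 'self-host' above the breakeven volume, otherwise 'api'."""
    threshold = breakeven_tokens_per_hour(gpu_dollars_per_hour,
                                          api_dollars_per_million_tokens)
    return "self-host" if avg_tokens_per_hour > threshold else "api"


if __name__ == "__main__":
    # Hypothetical: $34/hour of GPUs against a $12 per million token API.
    print(round(breakeven_tokens_per_hour(34.0, 12.0)))
    print(cheaper_option(17_000_000, 34.0, 12.0))
    print(cheaper_option(1_000_000, 34.0, 12.0))
```

The average passed in should be taken over every billed hour, idle ones included, otherwise the result flatters self-hosting.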
How do you model self-host vs API before you migrate?
1. Estimate real throughput at your traffic shape, not the benchmark's. Daytime-only or bursty traffic sets your effective tokens per hour.
2. Get the all-in GPU $/hour, including idle hours, not just active ones.
3. Divide to get your true effective $/token, then compare to the API.
4. Add the operational tax: ops time, reliability, scaling. The API price bundles these in; your self-host bill does not.
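The four steps above can be sketched as a one-day model. Everything here is an assumption for illustration: the `SelfHostPlan` fields, the flat daily ops estimate, the rule that demand above capacity is simply not served, and all the numbers.

```python
from dataclasses import dataclass


@dataclass
class SelfHostPlan:
    gpu_dollars_per_hour: float   # all-in rate, billed for all 24 hours
    peak_tokens_per_hour: float   # throughput ceiling measured on your workload
    ops_dollars_per_day: float    # rough estimate of ops, on-call, reliability work


def daily_comparison(plan: SelfHostPlan,
                     hourly_demand_tokens: list[float],
                     api_dollars_per_million_tokens: float) -> dict[str, float]:
    """Compare one day of self-hosting with paying an API for the same tokens."""
    if len(hourly_demand_tokens) != 24:
        raise ValueError("expected 24 hourly demand values")
    # Step 1: real throughput at your traffic shape, capped by capacity.
    served = sum(min(d, plan.peak_tokens_per_hour) for d in hourly_demand_tokens)
    if served <= 0:
        raise ValueError("no tokens served")
    # Steps 2 and 4: idle hours are billed too, plus the operational tax.
    self_host_dollars = plan.gpu_dollars_per_hour * 24 + plan.ops_dollars_per_day
    api_dollars = served / 1_000_000 * api_dollars_per_million_tokens
    # Step 3: divide to get the effective rate.
    return {
        "average_utilization": served / (plan.peak_tokens_per_hour * 24),
        "self_host_dollars": self_host_dollars,
        "self_host_dollars_per_million": self_host_dollars / served * 1_000_000,
        "api_dollars": api_dollars,
    }


if __name__ == "__main__":
    plan = SelfHostPlan(gpu_dollars_per_hour=34.0,
                        peak_tokens_per_hour=20_000_000,
                        ops_dollars_per_day=100.0)
    # Hypothetical daytime-only traffic: 10 busy hours, 14 idle hours.
    busy_day = [0.0] * 8 + [12_000_000.0] * 10 + [0.0] * 6
    for key, value in daily_comparison(plan, busy_day, 12.0).items():
        print(f"{key}: {value:,.2f}")
```

Rerunning it with the busy-hour volume cut from 12M to 4M tokens flips the result in favor of the API, which is the kind of sensitivity worth checking before committing a budget.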
A 1M-context, 744B MoE running at FP8 is impressive engineering. But the cost story is not 'self-host is 6x cheaper.' It is '6x cheaper at high utilization, and possibly more expensive below your breakeven.' You can model self-host $/hour divided by throughput against API $/token across utilization levels in Calcaas before you commit a GPU budget.
Frequently asked questions
Is self-hosting an LLM cheaper than using an API?
Only at high, steady GPU utilization. Self-hosting trades a per-token cost for a fixed per-hour GPU cost, so it is cheaper only when throughput is high enough to push your effective cost per token below the API price.
What is the breakeven point for self-hosting an LLM?
It is the utilization level where (GPU $/hour / your tokens per hour) equals the API price per token. Above it, self-hosting wins; below it, the API is cheaper.
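As a rough worked form, assume tokens per hour scale linearly with utilization (a simplification the guide does not state). Write G for GPU $/hour, T for peak tokens per hour, p for the API $/token, and u for utilization; these symbols are introduced here for illustration. The second line covers a setup that costs 1/k of the API price at utilization u_0.

```latex
\[
\frac{G}{u\,T} = p
\quad\Longrightarrow\quad
u^{*} = \frac{G}{p\,T}
\]
\[
\frac{G}{u_{0}\,T} = \frac{p}{k}
\quad\Longrightarrow\quad
u^{*} = \frac{u_{0}}{k}
\]
```

Under that assumption, a setup that is 1/6 the API price at 90% utilization breaks even near 15%, before ops costs are added.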
Why does GPU utilization matter for inference cost?
Because a rented GPU costs the same per hour whether busy or idle. Low utilization means fewer tokens per hour over the same fixed cost, which raises your effective cost per token.
What hidden costs come with self-hosting an LLM?
Idle GPU hours, ops and on-call time, reliability engineering, and scaling. None appear in the per-hour GPU rate, but all are bundled into an API's per-token price.