Self-Hosting vs API: When Local LLMs Actually Cost Less

Local open models can run inference at near-zero marginal cost when you reuse hardware you already own, but they are rarely truly free once you count electricity, throughput limits, and engineering time.

Jun 23, 2026 · 4 min read

Key takeaways

Free local inference usually means no per-token API bill, not zero cost.
Reusing existing hardware (a dev machine, an idle server) is what makes local models cheap.
Per-token APIs win for low or bursty volume; local wins for steady, high-volume background jobs.
Quality and latency trade-offs decide whether a smaller local model is good enough.
Compare local cost per task against the API cost per task before committing.

What does free local inference really mean?

It means you stopped paying per token to a provider, not that the work costs nothing. When a team runs an open model locally to handle a background task like triaging pull requests or issues, the marginal cost of each run drops toward the cost of electricity and the time the machine is busy. If the hardware was already paid for and otherwise idle, that can round to almost nothing.

The honest accounting adds three buckets the API hides: hardware (or its amortized share), electricity and cooling, and the engineering time to set up and maintain the stack. Free is the marginal view; cheap is the full view.

When do local models genuinely beat an API?

For steady, high-volume, latency-tolerant work. Background automation is the sweet spot: tasks that run constantly, do not need the strongest frontier model, and can wait a second or two. Repo triage, classification, tagging, and bulk enrichment all fit. Here local inference on owned hardware can undercut per-token API pricing by a wide margin because you are not paying a premium per call.

When does a per-token API still win?

For low, bursty, or quality-critical work. If volume is small, the API bill is trivial and not worth replacing with hardware and maintenance. If load is spiky, you would buy hardware that sits idle most of the time, which is the opposite of efficient. And if the task needs top-tier reasoning, a smaller local model may not clear the quality bar, and a cheaper-looking option that produces worse results is not actually cheaper.
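The idle-hardware point can be made concrete with a small sketch. It assumes you bought hardware for the workload and spread its cost over the hours it is actually busy. The function name, the 150-dollar monthly amortization figure, and the utilization levels are hypothetical values chosen for illustration, not measurements.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12


def hardware_cost_per_busy_hour(monthly_hardware_cost: float, utilization: float) -> float:
    """Amortized hardware cost attributed to each hour the machine does real work.

    utilization is the fraction of the month the machine is busy (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    busy_hours = HOURS_PER_MONTH * utilization
    return monthly_hardware_cost / busy_hours


# Hypothetical: 150 USD/month of amortized hardware at three utilization levels.
for u in (0.05, 0.25, 0.80):
    cost = hardware_cost_per_busy_hour(150.0, u)
    print(f"{u:.0%} utilized: ${cost:.2f} per busy hour")
```

The lower the utilization, the more each busy hour has to carry, which is why spiky workloads tend to favor a per-token API.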

How do you decide? A quick founder playbook.

Estimate cost per task both ways. For the API, that is tokens per task times the provider's rate. For local, it is the machine's running cost for the time the task takes, plus an amortized slice of hardware and setup. Then multiply by tasks per month.
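A minimal sketch of that estimate is shown here. It assumes the provider prices input and output tokens separately per million tokens, which is common but not universal. Every name and number is a hypothetical placeholder to replace with your own figures.

```python
from dataclasses import dataclass


@dataclass
class ApiPricing:
    input_rate_per_million: float   # USD per 1M input tokens (hypothetical)
    output_rate_per_million: float  # USD per 1M output tokens (hypothetical)


def api_cost_per_task(input_tokens: int, output_tokens: int, pricing: ApiPricing) -> float:
    """Tokens per task times the provider's rate."""
    total = (input_tokens * pricing.input_rate_per_million
             + output_tokens * pricing.output_rate_per_million)
    return total / 1_000_000


def local_cost_per_task(seconds_per_task: float, power_watts: float,
                        electricity_per_kwh: float, fixed_monthly_cost: float,
                        tasks_per_month: int) -> float:
    """Running cost for the time the task takes, plus an amortized slice
    of hardware and setup (fixed_monthly_cost spread over monthly volume)."""
    if tasks_per_month <= 0:
        raise ValueError("tasks_per_month must be positive")
    energy_kwh = (power_watts / 1000) * (seconds_per_task / 3600)
    running = energy_kwh * electricity_per_kwh
    amortized = fixed_monthly_cost / tasks_per_month
    return running + amortized


# Hypothetical inputs; plug in your own.
tasks = 30_000
pricing = ApiPricing(input_rate_per_million=0.50, output_rate_per_million=1.50)
api = api_cost_per_task(2_000, 300, pricing)
local = local_cost_per_task(seconds_per_task=4, power_watts=300,
                            electricity_per_kwh=0.30,
                            fixed_monthly_cost=50.0, tasks_per_month=tasks)

for name, cost in (("API", api), ("local", local)):
    print(f"{name}: ${cost:.5f} per task, ${cost * tasks:.2f} per month")
```

Because the amortized slice is divided by monthly volume, the local cost per task falls as volume rises while the API cost per task stays flat.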

For example, say a triage task is cheap per call on an API but you run tens of thousands a month. The API total adds up, while the same job on an idle machine you already own costs little beyond power. Flip the volume to a few hundred a month and the API almost always wins on simplicity. Treat these scenarios as illustrative and plug in your own numbers.
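One way to find where the comparison flips is to solve for the break-even volume. The sketch below assumes a simple linear model: the API cost scales with volume, and the local cost is a fixed monthly amount plus a small running cost per task. The function name and all figures are hypothetical.

```python
import math
from typing import Optional


def break_even_tasks_per_month(api_cost_per_task: float,
                               local_running_cost_per_task: float,
                               local_fixed_monthly_cost: float) -> Optional[int]:
    """Smallest monthly volume at which local is no more expensive than the API.

    Returns None when local never catches up, i.e. when its running cost
    per task is not below the API cost per task.
    """
    saving_per_task = api_cost_per_task - local_running_cost_per_task
    if saving_per_task <= 0:
        return None
    return math.ceil(local_fixed_monthly_cost / saving_per_task)


# Hypothetical: 0.2 cents per task on the API, 0.01 cents of power per task
# locally, and 40 USD/month of amortized hardware and maintenance time.
volume = break_even_tasks_per_month(0.002, 0.0001, 40.0)
print(f"Local breaks even at about {volume} tasks per month")
```

Below the break-even volume the API is cheaper; above it, local wins under these assumptions. On hardware that is already paid for and otherwise idle, the fixed term shrinks and so does the break-even point.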

The takeaway: local models are not free; they are cheap at high, steady volume on hardware you already own. Model local cost per task against API cost per task in Calcaas before you migrate a workload.

Frequently asked questions

Are local LLMs really free to run?

Not exactly. You stop paying per-token API fees, but you take on hardware, electricity, and engineering time. If you reuse hardware that is already paid for and otherwise idle, the marginal cost per run can be very low, close to the cost of electricity.

When is self-hosting cheaper than an API?

For steady, high-volume, latency-tolerant tasks like background triage or bulk classification. At that scale, running an open model on owned hardware can undercut per-token API pricing significantly because there is no per-call premium.

When should I just use a per-token API?

When volume is low or bursty, or the task needs top-tier quality. Small or infrequent workloads do not justify buying and maintaining hardware, and a weaker local model that lowers output quality is not a real saving.

How do I compare local vs API cost?

Estimate cost per task both ways, then multiply by monthly volume. For the API use tokens per task times the rate; for local use the machine's running cost plus amortized hardware and setup. The cheaper option flips with volume.
