The Inference Inflection: Why AI Margins Now Live in Tokens, Not Training

In short: the cost and margin of an AI product have moved from one-time training to per-request inference, so your unit economics now rise and fall with token costs.

Jul 2, 2026 · 4 min read

The Inference Inflection: Why AI Margins Now Live in Tokens, Not Training

Key takeaways

Industry leaders are openly reframing AI around inference. Sam Altman says OpenAI has "to become an AI inference company now," and Noam Brown calls inference compute "a strategic resource, currently undervalued."
NVIDIA's Jensen Huang argues compute demand has risen by roughly 10,000x as AI shifted from training to inference, with usage up around 100x.
For builders, this means your cost of goods sold is now dominated by tokens served per user, not model training.
Open-model price pressure is real: cited examples include Qwen 3.5 Plus around $3 per 1M output tokens and MiMo-V2.5 Pro near $1/$3 per 1M input/output tokens.
The highest-leverage move now is modeling inference cost per user and per tier before you set your price.

What is the inference inflection?

The "inference inflection" is the point where inference, not training, becomes the main driver of both compute demand and product value. Latent Space's AINews framed it after a run of unusually blunt comments: Sam Altman saying OpenAI has to "become an AI inference company," and researcher Noam Brown calling inference compute an undervalued strategic resource. At NVIDIA GTC, Jensen Huang put it plainly: "the inference inflection has arrived."

The practical version for anyone shipping software: training is a fixed, one-time cost that a handful of labs pay. Inference is the variable cost you pay every single time a user hits your product. That is the number that shows up in your margin.

Why does inference matter more than training now?

Because usage exploded. Huang's framing is that compute demand has climbed roughly 10,000x, with raw usage up around 100x, as models moved from being trained to being run constantly for reasoning, agents, and everyday tasks. Every time an agent thinks, reads, or acts, it generates tokens, and every token is an inference call.

For a SaaS founder that reframes the cost question. Training economics belong to OpenAI, Anthropic, and their peers. Inference economics belong to you. Your gross margin is essentially revenue per user minus the tokens that user burns.

How does this change your unit economics?

It makes token math your core financial model. Say, illustratively, a power user triggers several agent steps per session and each session consumes a meaningful chunk of input and output tokens. At current prices that can be cents per session, which sounds trivial until you multiply by heavy users on a flat monthly plan. A small number of power users can quietly erase the margin funded by everyone else.

The fix is not guesswork. Before you price a tier, model the expected tokens per user, apply real per-model rates, and check the resulting margin at both the median and the heavy-usage end. That is exactly the kind of simulation you can run in Calcaas.

What about provider price pressure?

It is moving in your favor, if you shop around. The same AINews roundup flagged how fast capable open models are getting cheaper, pointing to Qwen 3.5 Plus at roughly $3 per 1M output tokens and MiMo-V2.5 Pro near $1/$3 per 1M tokens. Token efficiency is improving too: IBM's Granite 4.1 8B reportedly used only 4M output tokens on one intelligence benchmark versus 78M for a comparable peer, which is a large gap in real spend even before headline price differences.

The lesson is not "always pick the cheapest model." It is that provider choice is now a margin lever, and the right model depends on your workload, quality bar, and token profile. Comparing providers on your actual usage, not their marketing table, is where the savings hide.

How do you get ahead of it?

Treat inference as COGS and design your pricing around it: model tokens per user, compare providers on your real workload, and set tiers that stay profitable at the heavy end. You can model all of this, across providers and user tiers, in Calcaas.

Frequently asked questions

What is the inference inflection?

It is the shift, named in Latent Space's AINews and echoed by leaders like Sam Altman and Jensen Huang, where inference rather than training becomes the primary driver of AI compute demand and product value. In business terms, the recurring cost of running models now matters more than the one-time cost of training them.

Why is inference now more important than training for founders?

Training is a fixed cost paid by a few large labs, while inference is a variable cost you pay on every user request. As usage scales, inference becomes your dominant cost of goods sold and the main input to your gross margin.

How do I calculate AI inference cost per user?

Estimate the input and output tokens a typical user generates over a billing period, multiply by your model's per-token rates, and compare that against the revenue that user pays. Doing this for both median and heavy users reveals whether your pricing tiers stay profitable.

Does switching to cheaper open models fix AI margins?

It can help, but only if the cheaper model meets your quality bar for the workload. Provider choice is a real margin lever, yet the best option depends on your token profile and use case, so compare models on your actual usage rather than headline prices.

ShareX LinkedIn Facebook

More from the blog

Claude Sonnet 5 Pricing: What $2/$10 per Million Tokens Means for Your Margins

LLM Economics

Jul 1, 20264 min read

Claude Sonnet 5 Pricing: What $2/$10 per Million Tokens Means for Your Margins

Claude Sonnet 5 launches at $2 per million input tokens and $10 per million output tokens (introductory, through August 31, 2026), then $3/$15 - but a new tokenizer means your real cost depends on tokens per task, not the sticker rate.

The Economy of Tokens: Why Faster Inference Doesn't Always Cut Your AI Bill

LLM Economics

Jun 30, 20264 min read

The Economy of Tokens: Why Faster Inference Doesn't Always Cut Your AI Bill

Faster inference frameworks like DeepSeek's DSpark speed up output by 60 to 85%, but if you call a hosted API you pay per token, not per second, so your bill only drops when you control the serving stack or cut the tokens themselves.

Claude Sonnet 5 Pricing: What the Cheaper Agent Model Really Costs

LLM Economics

Jun 30, 20264 min read

Claude Sonnet 5 Pricing: What the Cheaper Agent Model Really Costs

Claude Sonnet 5 launches at $2 per million input tokens and $10 per million output tokens (introductory pricing through August 31, 2026), less than half the price of Opus 4.8, but a new tokenizer and a scheduled rate increase mean your real cost depends on the workload you run.

The Margin Memo

Pricing math, in your inbox.

One short note a week on AI pricing, token economics, and margin. No spam, unsubscribe anytime.