Three years ago, running a GPT-4-class model cost roughly $20 per million tokens. Today the same caliber of output runs at $0.40 per million — a 50x drop in sticker price alone, and closer to 1,000x when you factor in throughput gains on modern hardware. And yet the global inference market ballooned from $12 billion in 2023 to an estimated $55 billion this year. Something doesn't add up. Or rather, it adds up perfectly — just not the way most people expect.

The Spending Paradox

In 2023, training consumed 67% of all AI compute. By 2026, the ratio has flipped: serving models now eats 67% of compute budgets. The market grew roughly 4.5x over a period in which unit costs fell roughly 1,000x, which means aggregate token volume exploded by something like 4,500x. Every dollar saved got reinvested into more tokens.
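
The implied volume growth falls out of dividing market growth by the unit-cost decline. A quick sanity check on the figures above (the 1,000x is the rough cost-decline estimate from the opening paragraph):

```python
# Sanity-check the implied token-volume growth from the figures above.
market_2023 = 12e9   # inference market, USD
market_2026 = 55e9
cost_drop = 1000     # approximate decline in cost per token over the window

market_growth = market_2026 / market_2023   # ~4.6x
volume_growth = market_growth * cost_drop   # ~4,600x

print(f"market growth: {market_growth:.1f}x")
print(f"implied token-volume growth: {volume_growth:,.0f}x")
```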

Four Forces, Compounding

The 1,000x number isn't a single breakthrough. It's four independent vectors that multiply:

Driver                                     Per-generation gain   How it works
Hardware (Hopper → Blackwell → Rubin)      2–3x                  Better silicon, higher FLOPS/watt
Software                                   2–3x                  Continuous batching, PagedAttention, KV-cache compression
Architecture (dense → MoE)                 3–5x                  Sparse activation — only ~25% of params fire per token
Quantization & distillation                2–4x                  FP8/FP4 formats, knowledge distillation to smaller models

Stack the midpoints: 2.5 × 2.5 × 4 × 3 ≈ 75x per cycle. Two full cycles since late 2022 would compound to roughly 5,600x; even allowing for overlap between the categories, the observed 1,000x aggregate decline is comfortably within reach. The math checks out.
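
The compounding is easy to verify with the midpoint figures from the table above. A small sketch (the per-driver midpoints are the table's, the cycle math is mine):

```python
import math

# Midpoint gain per generation for each driver, from the table above.
drivers = {
    "hardware": 2.5,
    "software": 2.5,
    "architecture (MoE)": 4.0,
    "quantization & distillation": 3.0,
}

per_cycle = math.prod(drivers.values())  # drivers multiply, not add

# How many cycles does a 1,000x aggregate decline require?
cycles_for_1000x = math.log(1000) / math.log(per_cycle)

print(f"per-cycle gain: {per_cycle:.0f}x")
print(f"cycles needed for 1,000x: {cycles_for_1000x:.2f}")
```

The answer is about 1.6 cycles, which is why "roughly two cycles since late 2022" more than covers the observed 1,000x decline.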

What makes this unusual compared to, say, CPU performance over the same window is that software and architecture improvements are moving as fast as silicon. Moore's Law gave us transistor density on a predictable curve. The token economy has multiple independent curves all tilting downward at once, and they compound multiplicatively rather than additively. That's why the decline looks exponential rather than linear.

Model-level innovations deserve more credit than they usually get. The shift from dense transformers to Mixture-of-Experts alone delivered a 3–5x cost reduction without any hardware change whatsoever — pure algorithmic leverage. Combine that with speculative decoding (draft with a small model, verify with the big one) and you can serve frontier-quality completions at near-mid-tier cost.
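
The speculative decoding loop is simple enough to sketch. This is a toy version with stand-in models and a greedy accept-on-exact-match rule (real implementations verify against the target's sampling distribution, and run the k verification steps as one batched forward pass — that batching is where the speedup comes from):

```python
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[str]], str],  # big model: next token given context
    draft_next: Callable[[List[str]], str],   # small model: cheap guess
    prompt: List[str],
    max_tokens: int = 8,
    k: int = 4,                               # draft tokens proposed per round
) -> List[str]:
    """Greedy speculative decoding: draft k tokens cheaply, verify with the
    target model, keep the longest agreeing prefix plus one corrected token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        # 1. Draft model proposes k tokens.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Target model verifies; accept until the first disagreement.
        for t in draft:
            correct = target_next(out)
            out.append(correct)   # always emit the target's token
            if correct != t:      # disagreement: discard the rest of the draft
                break
            if len(out) - len(prompt) >= max_tokens:
                break
    return out[len(prompt):]

# Toy demo: both models emit tokens by context length, so they always agree.
demo = speculative_decode(lambda c: str(len(c)), lambda c: str(len(c)),
                          ["<s>"], max_tokens=5)
print(demo)  # → ['1', '2', '3', '4', '5']
```

Because every emitted token comes from the target model, the output is identical to what the big model would have produced alone; the draft model only determines how many target steps get amortized per round.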

Jevons Runs on CUDA

William Stanley Jevons observed in 1865 that as coal engines became more efficient, total coal consumption went up. Cheaper energy per unit of work made entirely new categories of work rational.

Same mechanism, token by token. At $20/million tokens, you call the API only when a human explicitly asks. At $0.40, you run speculative completions on every incoming request, pre-generate five candidates and pick the best, or let an agent loop through 50 tool calls to close a ticket. At $0.05 — where quantized open-source models sit today — you embed model calls into background daemons that never stop running.

The cost floor hasn't just lowered. It has collapsed enough to change what "using AI" means architecturally.

What $0.05 Makes Viable

An agent making 200 LLM calls to resolve one support ticket now costs about $0.01 in tokens, assuming roughly a thousand tokens per call. In 2023, the same workflow ran about $4. Always-on summarization of every Slack thread, every email, every ticket — not because someone asked, but because it's cheaper than not doing it. Edge devices running distilled 1–3B parameter models handle tasks that once required round-trips to GPU clusters.
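
The ticket arithmetic above is worth making explicit. The tokens-per-call figure is an assumption, not a number from any deployment:

```python
calls_per_ticket = 200
tokens_per_call = 1_000        # assumed average; adjust for your workload
price_2023 = 20 / 1e6          # $20 per million tokens
price_now = 0.05 / 1e6         # $0.05 per million tokens

tokens = calls_per_ticket * tokens_per_call  # 200,000 tokens per ticket
cost_2023 = tokens * price_2023
cost_now = tokens * price_now

print(f"2023 cost per ticket: ${cost_2023:.2f}")  # $4.00
print(f"today's cost per ticket: ${cost_now:.2f}")  # $0.01
```

At a penny per ticket, the token bill disappears into rounding error next to the human cost of handling the same ticket manually.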

The shift: from "model calls as a service you invoke" to "model calls as ambient infrastructure."

The Blackwell Numbers

Real deployment data from Q1 2026 makes the hardware story concrete.

Sully.ai migrated medical note generation from Hopper to Blackwell and reported a 90% cost drop per query — a full 10x reduction — while response times improved 65%. Latitude, a gaming platform, went from 20 cents per million tokens on Hopper MoE models to 5 cents on Blackwell with NVFP4 quantization: a 4x win. Decagon slashed customer-service query costs by 6x while keeping latency under 400 milliseconds.

The GB200 NVL72 rack claims a 10x cost-per-token improvement for reasoning MoE models over the previous generation. And NVIDIA is already previewing Rubin with another promised 10x over Blackwell. If that trajectory holds — and hardware roadmaps are the one thing Jensen actually delivers on schedule — mid-tier model serving drops below $0.01 per million tokens by 2028.
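
Extrapolating that roadmap is a one-liner. The starting price is Latitude's Blackwell figure from above; treating Rubin's promised 10x as a single uniform step is my simplification:

```python
price_blackwell = 0.05   # $/M tokens, Latitude's Blackwell + NVFP4 figure
rubin_gain = 10          # NVIDIA's claimed generation-over-generation gain

price_rubin = price_blackwell / rubin_gain
print(f"projected Rubin-era price: ${price_rubin:.3f}/M tokens")  # $0.005
```

Half a cent per million tokens clears the sub-$0.01 threshold the paragraph above projects for 2028, with room to spare for the usual slippage between claimed and delivered gains.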

At that price point, the token itself is effectively free. The cost becomes the engineer's time to build the system around it.

Who Loses in a Deflationary Token Economy

Pure API resellers. When the cost floor drops faster than you can build margin, the only defensible positions are owning the silicon or owning the model. The "we host open-source weights on rented GPUs" layer is in a pricing death spiral where last quarter's aggressive rate is this quarter's above-market embarrassment.

The other casualty: teams that spent 2024 training best-in-class dense 70B models. The market moved to MoE architectures delivering equivalent quality at 3–5x lower serving cost. That training investment doesn't buy a deployment advantage anymore. The question isn't "how cheap can a token get?" — it's "what do you build when tokens are practically free?"