Q1 2026 delivered more custom inference silicon than any quarter in history. Google deployed Ironwood. Amazon shipped Trainium3. Microsoft lit up Maia 200 in Azure. Meta put four generations of MTIA on its roadmap. And NVIDIA responded at GTC with the B300 Blackwell Ultra — a 1,400-watt dual-die monster that pushes 15 PFLOPS of FP4 compute.

For anyone running inference at scale, the question has shifted from "which NVIDIA GPU?" to "do I even need NVIDIA?"

Here's the scorecard.

|              | Ironwood (Google) | Trainium3 (Amazon)     | Maia 200 (Microsoft) | B300 (NVIDIA)       |
|--------------|-------------------|------------------------|----------------------|---------------------|
| Process      | Custom 5nm-class  | TSMC 3nm               | TSMC 3nm             | TSMC 4NP (dual-die) |
| FP8 Perf.    | 4.6 PFLOPS        | 2.52 PFLOPS            | ~7.5 PFLOPS¹         | ~7.5 PFLOPS²        |
| HBM          | 192 GB HBM3e      | 144 GB HBM3e           | 216 GB HBM3e         | 288 GB HBM3e        |
| Mem. BW      | 7.37 TB/s         | 4.9 TB/s               | 7.0 TB/s             | 8.0 TB/s            |
| TDP          | 157 W             | —                      | —                    | 1,400 W             |
| Max Scale    | 9,216 chips/pod   | 1M chips (UltraServer) | —                    | 72 GPUs/rack        |
| Where        | GCP only          | AWS only               | Azure only           | Multi-cloud         |

¹ Estimated from Microsoft's "3× Trainium3 FP4" claim. ² Half of B300's 15 PFLOPS FP4 rating.

Raw compute doesn't decide inference economics. Two other metrics do: performance per watt and performance per dollar.

The Efficiency Gap

Google's Ironwood pulls 157 watts. NVIDIA's B300 pulls 1,400. Even adjusting for the B300's higher absolute FP8 throughput (~7.5 vs. 4.6 PFLOPS), Ironwood delivers roughly 29 TFLOPS per watt compared to the B300's ~5.4. That's about a 5.5× efficiency advantage.
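
The arithmetic is simple enough to check directly. A minimal sketch using only the spec-sheet figures quoted above (1 PFLOPS = 1,000 TFLOPS):

```python
# Performance-per-watt from the quoted spec-sheet figures.
# Chip TDP only; real deployments add host, networking, and cooling power.

chips = {
    "Ironwood": {"fp8_pflops": 4.6, "tdp_w": 157},
    "B300":     {"fp8_pflops": 7.5, "tdp_w": 1400},
}

for name, c in chips.items():
    tflops_per_watt = c["fp8_pflops"] * 1000 / c["tdp_w"]
    print(f"{name}: {tflops_per_watt:.1f} TFLOPS/W")

ratio = (4.6 * 1000 / 157) / (7.5 * 1000 / 1400)
print(f"Efficiency ratio: {ratio:.1f}x")  # ~5.5x
```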

At data center scale, power is the binding constraint — not chip cost. Electricity, cooling, and the physical infrastructure to deliver sustained megawatts to a single building are what actually gate how much inference you can run. A 9,216-chip Ironwood pod producing 42.5 ExaFLOPS of inference compute does it at a fraction of the power envelope an equivalent NVIDIA deployment would demand. The gap compounds at scale: one Ironwood pod replaces roughly the output of several full NVIDIA racks while drawing a fraction of the wattage. That translates directly into fewer transformers, fewer backup generators, and less chilled water.
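
A back-of-envelope version of that pod comparison, using only the numbers in this article (chip TDP only, so the real gap in facility power would be larger once cooling and networking are counted):

```python
# Ironwood pod throughput and power vs. a B300 fleet matched on FP8 throughput.
# Spec-sheet arithmetic only -- excludes cooling, networking, and host overhead.

ironwood_chips = 9216
ironwood_pflops = 4.6        # FP8, per chip
ironwood_tdp_w = 157

b300_pflops = 7.5            # FP8, per chip (half its 15 PFLOPS FP4 rating)
b300_tdp_w = 1400

pod_exaflops = ironwood_chips * ironwood_pflops / 1000
pod_power_mw = ironwood_chips * ironwood_tdp_w / 1e6
print(f"Ironwood pod: {pod_exaflops:.1f} EFLOPS at {pod_power_mw:.2f} MW")

# B300 count needed to match the pod's FP8 throughput, and its power draw
b300_count = pod_exaflops * 1000 / b300_pflops
b300_power_mw = b300_count * b300_tdp_w / 1e6
print(f"Equivalent B300s: ~{b300_count:.0f} at {b300_power_mw:.1f} MW")
```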

Google claims 2× the perf/watt over its own TPU v5p, and the migration numbers back it up. Companies moving from H100 clusters to TPU v6e pods — the previous generation — already reported 3.8× cost improvements. Character.AI cut its serving costs by that margin and went public with the figures. Waymark did the same. These weren't toy benchmarks; they were production inference workloads running real user traffic. With Ironwood pushing the efficiency envelope even further, the gap between Google's custom silicon and general-purpose GPUs on a per-watt basis is now wide enough that ignoring it requires active justification, not just inertia.

Trainium3's Scale Bet

Amazon's play isn't per-chip dominance. Trainium3's specs look modest — 2.52 PFLOPS FP8, 144 GB HBM3e — but UltraServer connects up to one million chips with 40% better energy efficiency per chip and 4× at the rack level. Both Anthropic and OpenAI run workloads on it. When your deployment is measured in hundreds of thousands of chips, aggregate throughput and interconnect topology matter more than any single die.
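
The aggregate framing is worth making concrete. A one-line spec-sheet check (power omitted, since Trainium3's TDP isn't quoted here):

```python
# Aggregate FP8 throughput if an UltraServer deployment were maxed out.
# Pure spec-sheet arithmetic; real sustained utilization would be far lower.
trainium3_pflops = 2.52          # FP8, per chip
ultraserver_max_chips = 1_000_000

aggregate_eflops = trainium3_pflops * ultraserver_max_chips / 1000
print(f"Max aggregate FP8: {aggregate_eflops:,.0f} EFLOPS")
```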

Maia 200's Brute Force

Microsoft's entry matches the B300 on raw FP8 compute — ~7.5 PFLOPS, 140 billion transistors on TSMC 3nm, 272 MB of on-chip SRAM — while carrying 72 GB less memory (216 GB vs. 288 GB). It went live in US Central, purpose-built for the large-batch inference Azure runs internally for Copilot and OpenAI's API. Microsoft claims 30% better performance per dollar than its previous-generation fleet, which leaned heavily on B200s.

Where NVIDIA Still Wins

The B300 leads on three dimensions that still matter.

Memory. 288 GB of HBM3e is unmatched. For large MoE models or long-context inference that blows past 192 GB (Ironwood) or 144 GB (Trainium3), there's no substitute. Memory capacity determines whether a model fits on a single chip or requires sharding across multiple devices — and sharding introduces latency, complexity, and failure modes that no amount of raw FLOPS can offset. As context windows grow and mixture-of-experts architectures proliferate (Mixtral, DBRX, and their successors all blow past 100 GB for full-precision weights), the B300's memory headroom becomes a genuine architectural advantage, not just a spec-sheet win.
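
The fit-or-shard decision above reduces to simple arithmetic. A rough sketch that counts only weights plus a KV-cache budget (activations and framework overhead ignored; the 140B/80 GB model figures are illustrative assumptions, not vendor numbers):

```python
# Does a model fit in one chip's HBM, or does it need sharding?
# Counts dense weights plus a KV-cache budget; ignores activations and
# runtime overhead. The example model below is hypothetical.

def fits(params_b, bytes_per_param, kv_cache_gb, hbm_gb):
    """Return (needed_gb, fits) for weights plus KV cache vs. HBM capacity."""
    needed = params_b * bytes_per_param + kv_cache_gb
    return needed, needed <= hbm_gb

# A hypothetical 140B-parameter model served in FP8 (1 byte/param)
# with 80 GB reserved for long-context KV cache:
needed, ok = fits(params_b=140, bytes_per_param=1, kv_cache_gb=80, hbm_gb=192)
print(f"Ironwood (192 GB): need {needed} GB -> {'fits' if ok else 'shard'}")

needed, ok = fits(params_b=140, bytes_per_param=1, kv_cache_gb=80, hbm_gb=288)
print(f"B300 (288 GB): need {needed} GB -> {'fits' if ok else 'shard'}")
```

On these assumed numbers the model shards on Ironwood but fits on a single B300, which is exactly the headroom argument.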

Software. CUDA's ecosystem advantage compounds over a decade of tooling, profilers, and library optimization. Custom silicon means custom toolchains — TensorFlow/JAX on TPUs, Neuron SDK on Trainium, proprietary stacks on Maia. Migration cost is real. A team with 50,000 lines of CUDA-optimized inference code doesn't switch because a benchmark looks better. They switch when the total cost of rewriting, revalidating, and redeploying is less than the infrastructure savings — and for most teams, that crossover hasn't happened yet.
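
That crossover can be modeled crudely. A toy break-even sketch — every input below is a hypothetical placeholder, not a figure from this article, except the 3.8× reduction cited earlier:

```python
# Toy crossover model for the switching decision: months until cumulative
# infrastructure savings repay a one-time migration. All inputs are
# hypothetical placeholders -- substitute your own numbers.

def breakeven_months(rewrite_cost_usd, monthly_infra_usd, savings_fraction):
    """Months until cumulative savings equal the one-time migration cost."""
    monthly_savings = monthly_infra_usd * savings_fraction
    return rewrite_cost_usd / monthly_savings

# e.g. a $600k rewrite against a $250k/month GPU bill, at the 3.8x-style
# cost reduction reported for H100-to-TPU moves (~74% savings)
months = breakeven_months(600_000, 250_000, 1 - 1 / 3.8)
print(f"Break-even: {months:.1f} months")
```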

Availability. You can rent a B300 from dozens of providers tomorrow. Every custom chip locks you to a single cloud. If that provider changes pricing or deprecates an instance type, your leverage is exactly zero.

What This Means for Your Bill

The inference cost curve has already collapsed — from $20 per million tokens for GPT-4-class output in late 2022 to roughly $0.40 today. Custom silicon accelerates that trajectory, but the savings come attached to a cloud contract.

If you're running over 10,000 GPU-hours per month on a single cloud and your workloads skew inference-heavy, benchmark against that provider's custom silicon this quarter. The reported 3–4× cost reductions from H100-to-TPU migrations aren't marketing — they're audited infrastructure spend. If you need multi-cloud portability or run mixed training-and-inference workloads, NVIDIA remains the default. Just an increasingly expensive one.
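
A quick screen for that 10,000 GPU-hour threshold, applying the 3.8× reduction cited earlier (the hourly rate is an assumed on-demand figure; use your contracted rate):

```python
# Rough monthly-savings screen for the GPU-hours rule of thumb above.
# The hourly rate is a hypothetical on-demand figure, not a quoted price.

gpu_hours_per_month = 10_000
usd_per_gpu_hour = 4.00          # assumed H100-class on-demand rate
reported_reduction = 3.8         # H100-to-TPU migration figure cited above

current_bill = gpu_hours_per_month * usd_per_gpu_hour
migrated_bill = current_bill / reported_reduction
print(f"Current: ${current_bill:,.0f}/mo -> after migration: ${migrated_bill:,.0f}/mo")
print(f"Monthly savings: ${current_bill - migrated_bill:,.0f}")
```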

NVIDIA's monopoly on serious inference compute ended sometime in the last six months. Whether the savings from custom silicon justify the lock-in depends on your scale, your cloud commitment, and how much you trust your provider not to change the deal. For a growing number of teams doing the arithmetic, the answer is already yes.