Anthropic just told us where the inference money is going, and it's not where most people expected. Last week's Broadcom filing revealed that Anthropic, now running at a $30 billion annual revenue clip, has committed to roughly 3.5 gigawatts of next-generation TPU compute starting in 2027. That's not a typo. Three point five gigawatts. For context, a single large data center runs on about 100-150 MW. Anthropic is reserving enough power for something like 25-30 massive facilities, all running Google's custom silicon instead of Nvidia GPUs.

From $9B to $30B in a Matter of Months

The revenue trajectory here matters because it explains why Anthropic can make infrastructure bets this size. At the end of 2025, the company was at roughly $9 billion in annualized revenue. By April 2026, that number had tripled. The customer base has followed: over 1,000 enterprise customers now spend more than $1 million annually, up from around 500 in February.

When a company growing at that rate commits to a specific chip architecture years in advance, it's not a casual preference — it's a calculated bet on where $/token goes next. And Anthropic, which also runs workloads on AWS Trainium and Nvidia hardware, chose to make its largest single infrastructure commitment on Google TPUs.

What Ironwood Actually Brings to the Table

The TPU at the center of this deal is Google's seventh-generation Ironwood, and the raw specs explain part of the rationale.

| Spec | Ironwood (TPU v7) | Nvidia B200 | Delta |
|---|---|---|---|
| Peak FP8 | 4,614 TFLOPS | ~4,500 TFLOPS | +2.5% |
| HBM per chip | 192 GB | 192 GB | Even |
| HBM bandwidth | 7.37 TB/s | 8 TB/s | -8% |
| Max cluster | 9,216 chips | 576 GPUs (NVL72×8) | 16× |
| Perf/watt vs prior gen | ~2× | n/a | Ironwood leads |

On a per-chip basis, these two are close enough that the comparison is almost boring. The B200 edges ahead on memory bandwidth; Ironwood has a slight FP8 advantage. Neither gap is decisive on its own.

The real divergence is in the last two rows. Ironwood clusters scale to 9,216 chips over Google's optical circuit switching fabric, while Nvidia's NVL72 racks top out at 72 GPUs per NVLink domain before you start paying a networking tax across domains. And the roughly 2× generational improvement in performance per watt over Trillium's already-decent efficiency compounds fast when your power bill is measured in gigawatts.
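The arithmetic behind that scale-up gap is simple; here is a quick sketch using only the numbers quoted above:

```python
# Scale-up domain arithmetic, using the figures quoted above.
ironwood_pod = 9_216   # chips per Ironwood pod over optical circuit switching
nvl72_domain = 72      # GPUs per NVLink scale-up domain
nvl_cluster = 576      # eight NVL72 racks stitched together (NVL72 x 8)

print(ironwood_pod / nvl72_domain)  # 128.0 -> NVLink domains needed to match one pod
print(ironwood_pod / nvl_cluster)   # 16.0  -> the 16x delta in the table
```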

The 30% TCO Question

Here's where it gets interesting for anyone running an inference budget. Multiple analyses peg Ironwood's total cost of ownership at approximately 30% below an equivalent Nvidia deployment. That gap comes from three places: power efficiency (lower watts per FLOP), interconnect efficiency (optical switching avoids expensive InfiniBand fabrics at scale), and vertical integration (Google controls the chip, the compiler, the orchestration layer, and the cooling infrastructure).

Thirty percent doesn't sound dramatic until you multiply it by Anthropic's scale. If you're spending, say, $5 billion a year on inference compute (a plausible number given their revenue), a 30% TCO reduction means $1.5 billion in annual savings. That's not a rounding error. That's the difference between profitability and burning cash.
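As a back-of-envelope check, here is that arithmetic as a tiny script. The $5 billion baseline and the 30% total reduction are the illustrative figures from this section; the way the 30% is split across the three components is an assumption for illustration, not a published breakdown:

```python
# Back-of-envelope TCO savings. The $5B baseline and 30% total reduction are
# the illustrative figures from this section; the per-component split below
# is an assumption for illustration, not a published breakdown.
baseline_spend = 5_000_000_000  # $/year on inference compute (illustrative)

tco_reduction = {
    "power_efficiency": 0.12,       # lower watts per FLOP
    "interconnect": 0.10,           # optical switching vs. InfiniBand-style fabrics
    "vertical_integration": 0.08,   # chip + compiler + orchestration + cooling
}

total_reduction = sum(tco_reduction.values())      # ~0.30
annual_savings = baseline_spend * total_reduction  # ~$1.5B

print(f"Total TCO reduction: {total_reduction:.0%}")
print(f"Annual savings on a ${baseline_spend / 1e9:.0f}B budget: ${annual_savings / 1e9:.1f}B")
```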

But there's a catch. That 30% number assumes you're running workloads that map well to Google's JAX/XLA stack. Ironwood's compiler toolchain is narrower than CUDA's ecosystem. If your model architecture or serving framework doesn't play nice with XLA, you eat the migration cost — and migration costs are where TCO estimates quietly fall apart.
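For a concrete sense of what "maps well to JAX/XLA" means, here is a minimal, generic JAX sketch (not Anthropic's stack): a jitted function is traced once and compiled by XLA for whatever backend is present, TPU, GPU, or CPU, provided every op it uses has an XLA lowering.

```python
# Minimal JAX/XLA sketch: a jitted forward pass compiles via XLA for the
# available backend (TPU, GPU, or CPU). Generic illustration only.
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the local backend
def mlp_forward(params, x):
    w1, b1, w2, b2 = params
    h = jax.nn.relu(x @ w1 + b1)
    return h @ w2 + b2

key = jax.random.PRNGKey(0)
k1, k2, kx = jax.random.split(key, 3)
params = (
    jax.random.normal(k1, (512, 2048)), jnp.zeros(2048),
    jax.random.normal(k2, (2048, 512)), jnp.zeros(512),
)
x = jax.random.normal(kx, (8, 512))

print(jax.devices())                # e.g. [TpuDevice(...)] on a TPU VM
print(mlp_forward(params, x).shape) # (8, 512)
```

The catch in the paragraph above is exactly the code that doesn't look like this: custom CUDA kernels, Triton ops, or serving frameworks with no XLA path.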

Why Nvidia Should Be Worried (and Why They're Not Panicking)

Nvidia isn't sitting still. Jensen Huang announced at GTC that the Vera Rubin platform will deliver "10× cheaper inference than Blackwell," which would obliterate the current TPU cost advantage if it holds. The company also acquired Groq's inference technology for $20 billion in December and is shipping the Groq 3 LPX platform — a rack of 128 LPUs that claims 35× higher throughput per megawatt when paired with Vera Rubin NVL72.

Translation: Nvidia is hedging. GPUs for training, LPUs for inference, and a networking stack to tie them together. They're building an inference-specific answer precisely because they see companies like Anthropic voting with their power budgets.

Meanwhile, the GPU rental market tells its own story. H100 one-year contracts have jumped roughly 40% since October, from $1.70/hr to $2.35/hr. On-demand capacity is essentially sold out across all tiers. The demand curve for Nvidia hardware hasn't broken; it's just that the biggest buyers are starting to diversify their supply chain.
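Worked out per GPU, the quoted contract move looks like this (simple arithmetic on the rates above):

```python
# What the quoted H100 contract-rate move means per GPU over a one-year term.
old_rate, new_rate = 1.70, 2.35   # $/GPU-hour, figures quoted above
hours_per_year = 24 * 365

increase = new_rate / old_rate - 1                   # ~0.38, i.e. roughly 40%
extra_cost = (new_rate - old_rate) * hours_per_year  # ~$5,700 more per GPU-year

print(f"Rate increase: {increase:.0%}")
print(f"Added cost per GPU per year: ${extra_cost:,.0f}")
```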

What This Means for Everyone Else

If you're not Anthropic-sized, the immediate takeaway is simpler than it looks. The hyperscalers are bifurcating the inference market: custom silicon for their largest customers (TPUs, Trainium, whatever Microsoft is cooking), and Nvidia GPUs for everyone in the long tail.

For mid-tier inference workloads (say, a fine-tuned 70B model serving a few thousand concurrent users), Nvidia still owns the ecosystem. CUDA, TensorRT-LLM, vLLM: the entire serving stack assumes Nvidia hardware. Switching to TPUs means rewriting your deployment pipeline, retraining your ops team, and accepting vendor lock-in to Google Cloud.
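To make that concrete, here is roughly what the Nvidia-centric path looks like with vLLM, as a minimal offline-inference sketch; the checkpoint name and GPU count are placeholders, not recommendations:

```python
# Minimal vLLM sketch of the Nvidia-centric serving path described above.
# The checkpoint name and tensor_parallel_size are placeholders; a 70B model
# in 16-bit weights typically needs to be sharded across several 80 GB GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder for your fine-tuned 70B
    tensor_parallel_size=4,                      # shard across 4 GPUs (assumption)
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the Ironwood vs. B200 tradeoff."], sampling)
print(outputs[0].outputs[0].text)
```

Everything in that snippet, from the CUDA kernels underneath to the tensor-parallel sharding, assumes Nvidia hardware; the TPU equivalent is a different toolchain, not a config change.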

The real question is whether Ironwood's pricing eventually trickles down to smaller Cloud TPU instances. Right now, Google Cloud's TPU v5p pricing runs about $4.20/hr per chip. If Ironwood instances land at a meaningful discount to equivalent H100/B200 pricing — and Google has every incentive to make that happen — the calculus changes for a much larger set of customers.
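One way to see why: cost per token is roughly the hourly rate divided by sustained throughput, so the comparison hinges on both numbers. A toy sketch follows; only the hourly rates come from this article, and the throughput figures are placeholders that vary enormously with model, batch size, and context length:

```python
# Toy $/million-tokens comparison. The hourly rates are the figures quoted in
# this article; the tokens/sec numbers are placeholders, since real throughput
# depends on model, batch size, context length, and serving stack.
def dollars_per_million_tokens(hourly_rate, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate / tokens_per_hour * 1_000_000

options = {
    "TPU v5p @ $4.20/hr": (4.20, 3_000),    # hypothetical sustained tokens/sec
    "H100 1-yr @ $2.35/hr": (2.35, 2_000),  # hypothetical sustained tokens/sec
}

for name, (rate, tps) in options.items():
    print(f"{name}: ${dollars_per_million_tokens(rate, tps):.2f} per million tokens")
```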

Until then, Anthropic's 3.5 GW commitment is a leading indicator, not a trailing one. The biggest buyer of inference compute in the world just decided that the future of $/token runs through custom silicon. The rest of the market will take a few years to catch up — or prove them wrong.