Twelve months ago, if you asked an ML platform team what kept them up at night, the answer was GPU availability. Fair enough — H100 lead times stretched past six months, and spot prices on secondary markets were absurd. But something shifted in Q1 2026 that most teams haven't fully internalized yet: the constraint moved. The bottleneck isn't the GPU die anymore. It's the memory stacked on top of it.
HBM Sold Out, Everywhere, Through Year-End
SK Hynix controls 62% of the HBM market. Micron holds 21%. Samsung trails at 17%, still working through qualification delays with NVIDIA. All three have the same message for anyone asking about availability: sold out through 2026.
Micron's fiscal Q1 2026 earnings tell the story in a single line — $13.64 billion in revenue, up 57% year-over-year, with gross margins above 50%. For a memory company that was scraping by at 22% margins two years prior, that's not a recovery. That's a regime change. The DRAM industry consolidated from ten players in 2009 down to three today, and those three are printing money on AI demand.
The demand side isn't hard to parse. Every major hyperscaler — Microsoft, Google, Meta, Amazon — is in an arms race, and HBM is the ammunition. TrendForce's characterization is blunt: the reallocation of memory capacity toward AI datacenters is "permanent." Every wafer allocated to an HBM stack for a data center accelerator is a wafer denied to consumer devices.
The Numbers That Hurt
DRAM prices rose 50–55% in Q1 2026 versus Q4 2025. TrendForce called it "unprecedented." Counterpoint Research puts the actual figure even higher in some segments — 80–90%. GDDR7, the memory used in consumer graphics cards, saw a 246% price spike through 2025 alone.
Here's what that means across GPU generations:
| Component | H100 (2023) | B200 (2025) | Rubin (2026 est.) |
|---|---|---|---|
| HBM capacity | 80 GB (HBM3) | 192 GB (HBM3e) | 288+ GB (HBM4) |
| Bandwidth | 3.35 TB/s | 8.0 TB/s | 11+ TB/s |
| Est. memory cost share | ~35% of BOM | ~45% of BOM | ~55% of BOM |
Stare at that last row. When memory was a third of a GPU's bill of materials, it was a line item. When it crosses half, it is the budget.
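If you want to pressure-test that last row yourself, the sketch below multiplies each generation's capacity by an assumed per-GB memory price. The $/GB values are illustrative guesses, not quoted contract figures; swap in your own.

```python
# Back-of-envelope HBM cost per accelerator across the generations in the table.
# The $/GB figures are ILLUSTRATIVE ASSUMPTIONS, not quoted contract prices.
hbm_price_per_gb = {"HBM3": 12.0, "HBM3e": 15.0, "HBM4": 20.0}  # assumed USD/GB

gpus = [("H100", "HBM3", 80), ("B200", "HBM3e", 192), ("Rubin (est.)", "HBM4", 288)]

for name, mem, gb in gpus:
    cost = gb * hbm_price_per_gb[mem]
    print(f"{name:<13} {gb:>3} GB {mem:<6} -> ~${cost:,.0f} of memory per package")
```

Even with these placeholder prices, the memory bill per package roughly sextuples across three generations; capacity growth and per-GB price growth compound.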
Packaging: The Bottleneck Behind the Bottleneck
Even if Samsung and SK Hynix hit their production targets — both kicked off HBM4 mass production in February, right on schedule — the chips still need packaging. TSMC's CoWoS (Chip-on-Wafer-on-Substrate) process is the only viable path for high-end AI accelerators, and NVIDIA has locked up over 60% of that capacity through 2027.
TSMC committed $56 billion in capex to scale CoWoS from roughly 35,000 wafers/month in late 2024 to 130,000–150,000 by end of 2026. Progress is real. But advanced packaging has become as difficult and capital-intensive as wafer fabrication itself — yield improvements are incremental, not exponential. To bridge the gap, TSMC is farming out 240,000–270,000 wafers annually to OSAT partners like Amkor and SPIL, which helps throughput but introduces quality variance that some customers aren't thrilled about.
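For a sense of scale, here's the arithmetic those figures imply. This is a rough sketch; treating the annual OSAT volume as a flat monthly rate is a simplifying assumption.

```python
# Rough end-2026 CoWoS throughput implied by the figures above. Spreading the
# annual OSAT outsourcing evenly across months is a simplifying assumption.
in_house = (130_000, 150_000)       # TSMC wafers/month target, end-2026
osat_per_year = (240_000, 270_000)  # wafers/year via Amkor, SPIL, etc.

low = in_house[0] + osat_per_year[0] // 12    # 150,000
high = in_house[1] + osat_per_year[1] // 12   # 172,500
baseline = 35_000  # late-2024 wafers/month
print(f"~{low:,}-{high:,} wafers/month, ~{low/baseline:.1f}-{high/baseline:.1f}x late 2024")
```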
Net effect: even when HBM4 chips exist in sufficient quantity, the packaging pipeline throttles how fast they reach finished accelerators.
What This Does to Your $/Token
The irony is thick. Inference cost per token has fallen roughly 50× since late 2022 — from $20 per million tokens for GPT-4-class models to around $0.40 today. Blackwell-based deployments push that down another 2.5× compared to Hopper. The software and architecture side of optimization has been spectacular: continuous batching, FP8 quantization via vLLM and SGLang, workload-aware scheduling.
But hardware procurement is moving the other direction. If you're buying or renting capacity right now, a growing slice of what you pay isn't for compute — it's for memory bandwidth. An H100 at $2.69/hr on one provider versus $9.98/hr on another isn't just a margin spread; it reflects how each vendor absorbed (or passed through) the memory cost spike.
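One way to see what that spread means in practice is to convert the hourly rate into a cost per million tokens. The throughput number below is an assumed placeholder, not a measured benchmark; plug in your own figure.

```python
# Convert a GPU rental rate into $/M tokens. The throughput is an ASSUMED
# placeholder for one H100 serving a mid-size model, not a measured benchmark.
ASSUMED_TOKENS_PER_SEC = 2_500

def usd_per_million_tokens(hourly_rate_usd: float,
                           tokens_per_sec: float = ASSUMED_TOKENS_PER_SEC) -> float:
    tokens_per_hour = tokens_per_sec * 3_600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

for rate in (2.69, 9.98):  # the two H100 prices quoted above
    print(f"${rate:.2f}/hr -> ~${usd_per_million_tokens(rate):.2f} per million tokens")
```

At the assumed throughput, that's roughly $0.30 versus $1.11 per million tokens for the same silicon, which is why provider selection now matters as much as model selection.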
For teams running large models, the arithmetic is simple but ugly. A 70B-parameter model at 16-bit precision needs roughly 140 GB just for weights. Add KV cache overhead at scale — about 0.5 MB per token on a 7B model, growing roughly linearly with parameter count — and you're in 192+ GB territory. That means HBM3e or HBM4, which means paying the full premium.
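The same arithmetic in code, as a minimal sketch. Scaling the 7B KV figure linearly with parameter count follows the simplification above; grouped-query-attention architectures need considerably less per token.

```python
# Minimum HBM for weights plus KV cache at 16-bit precision. Linear scaling of
# the ~0.5 MB/token (7B) figure with parameter count is a SIMPLIFICATION;
# GQA/MQA architectures cut the KV footprint substantially.
def min_hbm_gb(params_b: float, resident_tokens: int) -> tuple[float, float]:
    weights_gb = params_b * 2                # 2 bytes per parameter at FP16/BF16
    kv_mb_per_token = 0.5 * (params_b / 7)   # scaled from the 7B figure
    kv_gb = resident_tokens * kv_mb_per_token / 1024
    return weights_gb, kv_gb

w, kv = min_hbm_gb(70, resident_tokens=4 * 4096)  # four concurrent 4k contexts
print(f"weights ~{w:.0f} GB + KV ~{kv:.0f} GB = ~{w + kv:.0f} GB")  # ~140 + ~80 = ~220
```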
Gaming Takes the Hit
NVIDIA slashed RTX 50-series production by 30–40% in H1 2026. The RTX 50 Super lineup, originally planned for early this year, slipped to Q3 at best. PC vendors including Lenovo, Dell, HP, Acer, and ASUS warned of 15–20% price hikes across their lineups.
This isn't just a gamer grievance. It's a market signal. Consumer GPUs use GDDR, not HBM, but wafer capacity is fungible at the fab level. When SK Hynix can sell high-bandwidth stacks at 50%+ margins versus GDDR at commodity pricing, the allocation decision makes itself.
The Horizon
Samsung plans to reach 250,000 HBM wafers/month by late 2026, up from 170,000 currently. SK Hynix is scaling its Icheon and Cheongju fabs in parallel. Micron's CHIPS Act–funded Idaho facility won't produce wafers until mid-2027. Supply-chain analysts tentatively pencil in "some stabilization" around 2027, which in semiconductor-speak means "still tight, just less catastrophically so."
The HBM total addressable market is projected at $100 billion by 2028, up from $35 billion in 2025. That's not a shortage that resolves — it's a market that restructures around a new center of gravity.
Plan your AI infrastructure budget accordingly. The die is getting cheaper per FLOP. The memory feeding it is not.