For the first time in the MLPerf inference benchmarks, AMD posted numbers that don't require mental gymnastics to interpret. The MI355X didn't just participate — it tied NVIDIA's B200 on one key test, beat it on another, and crossed a million tokens per second in multi-node configurations. MLCommons published the Inference v6.0 results on April 1, and the implications for anyone paying cloud GPU bills are worth working through.

The headline number, unpacked

AMD scaled 94 MI355X GPUs across multiple nodes to surpass one million tokens per second on Llama 2 70B and GPT-OSS-120B. On a single node, one MI355X pushed 100,282 tokens per second on the Llama 2 70B Server benchmark — a 3.1x improvement over the MI325X, AMD's previous-generation part that most of the industry treated as "interesting but not threatening."

The silicon behind these results: 288 GB of HBM3E per chip, 8 TB/s of memory bandwidth, and native FP4/FP6 support delivering 20 PFLOPS in reduced-precision formats. That last figure is roughly double the B200's FP4/FP6 throughput, which explains a lot about where AMD's gains come from.

Blackwell, meet competition

MLPerf's value is that it forces vendors onto the same playing field. Here's how the MI355X stacked up against NVIDIA's B200 on Llama 2 70B, single-node:

Scenario       MI355X relative to B200
Offline        100% (tied)
Server         97%
Interactive    119%

And against NVIDIA's newer B300, which most cloud providers haven't deployed yet:

Scenario       MI355X relative to B300
Offline        92%
Server         93%
Interactive    104%

That Interactive score deserves a second look. Interactive mode simulates real conversational workloads — chatbots, coding assistants, agent loops with streaming output. It's the closest MLPerf gets to what production inference actually looks like. AMD winning that scenario against both B200 and B300 isn't a rounding error.

Outside MLPerf, the picture holds up. On DeepSeek-R1 using vLLM, AMD reports 1.4x higher throughput at scale compared to B200. In FP4 specifically, the MI355X runs DeepSeek-R1 about 20% faster. The advantage concentrates at 64–128 concurrent requests — exactly the concurrency range where most real-world serving happens and where memory bandwidth becomes the bottleneck.

The dollar question

Benchmarks are interesting. Cost per token is what determines procurement decisions.

B200 cloud pricing is mature and legible: $2.25/hr at the cheapest providers, $4–5/hr at most neo-clouds, roughly $12–16/hr from hyperscalers. The market average sits around $4.79/hr per GPU.
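Hourly price only matters relative to throughput. As a back-of-the-envelope sketch (the function name is hypothetical, and pairing the market-average B200 price with the MI355X's single-node MLPerf result is an illustrative assumption, not a published figure):

```python
def cost_per_million_tokens(price_per_gpu_hour: float, tokens_per_second: float) -> float:
    """Effective serving cost: dollars per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / (tokens_per_hour / 1_000_000)

# e.g. a GPU rented at the $4.79/hr market average, sustaining the
# MI355X's 100,282 tok/s MLPerf Server result on Llama 2 70B:
print(round(cost_per_million_tokens(4.79, 100_282), 4))  # 0.0133
```

The same function works in the other direction: fix the throughput, vary the hourly rate, and you can see exactly how much of a price premium a given performance edge justifies.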

MI355X pricing barely exists yet. DigitalOcean announced MI350X GPU Droplets in February with MI355X liquid-cooled racks arriving next quarter. Crusoe has early listings. A handful of others are in various stages of deployment. But you can't walk up to twenty providers and spin up a cluster the way you can with Blackwell.

AMD claims 40% more tokens generated per dollar of infrastructure cost versus B200. If that survives contact with production — which is a big qualifier — the math is straightforward: a company spending $2M a month on inference at B200 rates would pay roughly $1.43M for the same throughput on MI355X hardware, saving close to $6.9M a year. The kind of number that gets a CFO's attention.
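The arithmetic behind that claim fits in one line. A sketch (the function name is hypothetical, and the 40% figure is AMD's own claim, not independently verified):

```python
def equivalent_mi355x_spend(b200_monthly_spend: float,
                            tokens_per_dollar_gain: float = 0.40) -> float:
    """Monthly spend needed on MI355X to serve the same token volume,
    assuming AMD's claimed tokens-per-dollar advantage holds in production."""
    return b200_monthly_spend / (1 + tokens_per_dollar_gain)

monthly = equivalent_mi355x_spend(2_000_000)     # ~ $1.43M
annual_savings = (2_000_000 - monthly) * 12      # ~ $6.9M
```

Note that a 40% tokens-per-dollar gain translates to roughly a 29% cost reduction, not 40% — dividing by 1.4 rather than multiplying by 0.6.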

Three reasons to stay skeptical

ROCm is not CUDA. The software gap has narrowed substantially, but it hasn't closed. Production teams consistently report that migrating from NVIDIA to AMD involves integration overhead — driver edge cases, kernel optimizations that don't port cleanly, monitoring and profiling tools built around NVIDIA assumptions. If your ops team burns a week chasing a memory leak in a ROCm kernel, your tokens-per-dollar calculation goes sideways. Framework support through vLLM and SGLang has improved dramatically, but the long tail of tooling still favors green.

Supply remains the constraint. You can rent B200 instances from over twenty providers right now. MI355X? Three or four, with waitlists. The broader GPU supply crunch — H100 rental prices have climbed 40% since October, on-demand capacity is sold out across the board — could theoretically benefit AMD if they deliver volume. But the memory shortage hitting NVIDIA hits AMD too: both architectures depend on the same constrained HBM3E supply chain. AMD can't just will more chips into existence because the benchmarks look good.

MLPerf is a beauty contest. Both vendors submit meticulously optimized configurations that showcase peak performance under ideal conditions. Your actual deployment — with its specific model architecture, batch size distribution, sequence length profile, latency SLAs, and whatever else makes your workload uniquely yours — may tell a very different story. The MI355X's Interactive advantage may evaporate once you add streaming output constraints and a 200ms p99 target on a 405B parameter model.
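If you do run your own evaluation, judge it on tail latency computed from raw per-request measurements, not on averages. A minimal sketch, using the nearest-rank percentile convention (one common definition among several):

```python
import math

def percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest observed latency such that
    pct% of requests finished at or below it."""
    s = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(s)))
    return s[rank - 1]

# Checking a 200ms p99 SLA over 100 recorded request latencies:
samples = [120.0] * 99 + [850.0]   # one slow outlier
meets_sla = percentile(samples, 99) <= 200.0   # True
```

The example also shows why p99 on small samples is forgiving: a single outlier in 100 requests doesn't move it. Collect enough requests that the 99th rank is well populated before drawing conclusions.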

So who should move?

Mid-size inference operators running 50–200 GPUs with contracts up for renewal: get MI355X evaluation hardware. The performance gap has closed enough that not testing AMD is leaving potential savings unexamined.

Hyperscalers and 1,000+ GPU shops: you probably already have MI355X in your test racks. Your internal numbers matter more than MLPerf.

Startups building their first inference cluster: B200 remains the pragmatic choice. NVIDIA's ecosystem depth, documentation quality, and community knowledge base reduce the operational risk that matters most when your engineering team is small and their time is your scarcest resource.

The real shift here isn't about one benchmark. It's that "AMD or NVIDIA for inference?" has become a legitimate procurement question instead of a rhetorical one. Whether AMD can convert benchmark parity into actual cloud availability before NVIDIA ships the next generation — that's the multimillion-dollar question.