AI Infrastructure · 4 min read

NVIDIA Blackwell tops MLPerf Training 6.0 with scale and speed

NVIDIA won every MLPerf Training v6.0 test, submitting across all workloads and scaling DeepSeek-V3 to 8,192 Blackwell Ultra GPUs.

The Brieftide

TL;DR

  • NVIDIA won every MLPerf Training v6.0 test, submitting across all workloads and scaling DeepSeek-V3 to 8,192 Blackwell Ultra GPUs.
  • Its platform was the only one to submit results across all new and existing workloads, including massive MoE models such as DeepSeek-V3.
  • It set multiple time-to-train records, including DeepSeek-V3 (671B MoE) trained in 2.02 minutes on 8,192 GPUs.

NVIDIA Blackwell delivered a clean sweep in MLPerf Training v6.0 on June 16, 2026, winning every benchmark and posting the fastest time-to-train and the highest per-accelerator performance on every test. NVIDIA's platform was also the only one to submit results across all new and existing workloads, including massive MoE models such as DeepSeek-V3.

What did NVIDIA achieve in MLPerf Training v6.0?

NVIDIA won every MLPerf Training v6.0 benchmark and set multiple time-to-train records, including DeepSeek-V3 (671B MoE) trained in 2.02 minutes on 8,192 GPUs. Other MLPerf v6.0 highlights from NVIDIA submissions include GPT-OSS 20B trained in 7.43 minutes on 512 GPUs, Llama 3.1 405B in 7.07 minutes on 8,192 GPUs, and Llama 3.1 8B in 4.46 minutes on 1,024 GPUs.
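One rough way to read these runs across different cluster sizes is to multiply time-to-train by GPU count, which gives aggregate GPU-minutes. The sketch below does that arithmetic for the four results quoted above. The `gpu_minutes` helper and the dictionary layout are illustrative only, and GPU-minutes is not an official MLPerf metric.

```python
# Time-to-train figures quoted in the paragraph above: (minutes, GPU count).
results = {
    "DeepSeek-V3 671B": (2.02, 8192),
    "GPT-OSS 20B": (7.43, 512),
    "Llama 3.1 405B": (7.07, 8192),
    "Llama 3.1 8B": (4.46, 1024),
}


def gpu_minutes(minutes: float, gpus: int) -> float:
    """Aggregate accelerator time: wall-clock minutes times GPU count."""
    return minutes * gpus


for name, (minutes, gpus) in results.items():
    total = gpu_minutes(minutes, gpus)
    print(f"{name:18s} {gpus:>5d} GPUs x {minutes:5.2f} min = {total:10.1f} GPU-minutes")
```

Because the models and convergence targets differ, these totals are not comparable across benchmarks; they only indicate how much aggregate accelerator time each run consumed.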

Those results came from NVIDIA platform configurations such as the GB300 NVL72 system and GB200 NVL72. NVIDIA notes that its GB300 NVL72 design connects 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single system using NVIDIA NVLink and NVLink Switch, and that cloud partners scaled up to 8,192 Blackwell GPUs across production data centers for several entries.

How did Blackwell scale to thousands of GPUs?

NVIDIA combined hardware changes, scale-out networking, and deep software work to run MoE and dense models at hyperscale while hiding communication behind compute. On the hardware side, GB300 (Blackwell Ultra) systems increased memory and power budgets compared with GB200, and GB300 NVL72 linked 72 Blackwell Ultra GPUs with 36 Grace CPUs as one NVLink domain.

For scale-out fabric, NVIDIA used Spectrum-X Ethernet with Advanced Adaptive Routing to distribute traffic packet-by-packet and ConnectX SuperNICs to handle out-of-order delivery. Spectrum-X Congestion Control detects incast and paces senders before buffers overflow, which NVIDIA says helps keep all-to-all communication hidden behind compute at large scale.
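To illustrate the general idea of per-packet load balancing paired with receiver-side reordering, here is a toy Python model. It is not NVIDIA's implementation: the round-robin `spray` function, the fixed per-path delays, and the `reorder` buffer are hypothetical simplifications, and the buffer is only a stand-in for whatever the SuperNIC does in hardware.

```python
from typing import Iterable, List, Tuple


def spray(num_packets: int, path_delays: List[int]) -> List[Tuple[int, int]]:
    """Send packet i on path i % len(path_delays).

    Returns (arrival_time, seq) pairs sorted by arrival time.
    """
    arrivals = []
    for seq in range(num_packets):
        path = seq % len(path_delays)
        # Packet leaves at time `seq` and takes that path's (assumed) delay.
        arrivals.append((seq + path_delays[path], seq))
    return sorted(arrivals)


def reorder(arrivals: Iterable[Tuple[int, int]]) -> List[int]:
    """Deliver packets in sequence order, holding back any that arrive early."""
    pending = set()
    next_seq = 0
    delivered = []
    for _, seq in arrivals:
        pending.add(seq)
        # Release every packet that is now contiguous with what was delivered.
        while next_seq in pending:
            pending.remove(next_seq)
            delivered.append(next_seq)
            next_seq += 1
    return delivered


arrivals = spray(num_packets=8, path_delays=[5, 1, 3, 2])
wire_order = [seq for _, seq in arrivals]
print("arrival order:", wire_order)
print("after reorder:", reorder(arrivals))
```

Spreading packets over paths with unequal delays scrambles the arrival order, which is why per-packet routing needs an endpoint that can restore sequence.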

The software stack delivered several targeted optimizations. NVIDIA implemented full-iteration CUDA graphs for token-dropless MoEs, removing CPU-GPU synchronization and offloading the entire iteration to the GPU. CuTe DSL enabled advanced kernel fusions and dynamic tile scheduling, yielding "more than 8% end-to-end benefit on DeepSeek-v3 and a 93% end-to-end speedup on GPT-OSS," according to NVIDIA. An MXFP8 attention block moved attention computation to 8-bit precision, accelerating it without affecting the required attention math. Router and HybridEP fusions, together with FP32 math in router kernels, produced a 5x kernel speedup and about a 5% end-to-end gain.
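The router figures above pair a 5x kernel speedup with roughly a 5% end-to-end gain. Under Amdahl's law, and assuming the 5% gain means the iteration ran about 1.05x faster (the source does not define it), that pairing implies the affected kernels took roughly 6% of iteration time beforehand. The helper names below are illustrative.

```python
def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of runtime
    becomes `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(overall_speedup: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law to recover the runtime share of the accelerated part."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


# Assumption: "about a 5% end-to-end gain" is read as a 1.05x speedup.
f = implied_fraction(overall_speedup=1.05, kernel_speedup=5.0)
print(f"implied share of iteration time: {f:.1%}")
print(f"round-trip check: {end_to_end_speedup(f, 5.0):.3f}x")
```

The same inversion shows why large kernel-level speedups often translate into single-digit end-to-end gains: the gain is capped by how much of the iteration the kernel occupied in the first place.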

NVIDIA also improved the 1F1B (One Forward, One Backward) all-to-all overlap in Megatron-Core, capturing full iterations in CUDA graphs and prioritizing communication streams to achieve nearly 100% all-to-all (A2A) communication overlap, which delivered an overall 8% performance benefit. Pipeline layout changes and MXFP8 use reduced pipeline imbalance to less than 1%, translating to a 4% end-to-end saving.
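For readers unfamiliar with the 1F1B pattern, the sketch below generates the standard non-interleaved 1F1B ordering of forward (F) and backward (B) micro-batch steps for one pipeline stage. It illustrates only the schedule shape, where steady-state forward and backward steps alternate. It is not Megatron-Core code and does not model the all-to-all overlap itself; the function name and signature are hypothetical.

```python
from typing import List


def one_f_one_b(stage: int, num_stages: int, num_microbatches: int) -> List[str]:
    """Return the F/B step order for `stage` (0-indexed) under non-interleaved 1F1B."""
    # Warmup: earlier stages run extra forwards so later stages have work to do.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup

    steps = []
    fwd = bwd = 0
    for _ in range(warmup):
        steps.append(f"F{fwd}")
        fwd += 1
    # Steady state: one forward, then one backward, repeated.
    for _ in range(steady):
        steps.append(f"F{fwd}")
        fwd += 1
        steps.append(f"B{bwd}")
        bwd += 1
    # Cooldown: drain the backward passes still outstanding from warmup.
    for _ in range(warmup):
        steps.append(f"B{bwd}")
        bwd += 1
    return steps


for stage in range(4):
    print(stage, " ".join(one_f_one_b(stage, num_stages=4, num_microbatches=6)))
```

In the steady state each stage always has a forward step next to a backward step, which is the adjacency an overlap scheme can exploit to run one step's communication alongside the other's compute.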

Why it matters

These results show work across the stack rather than a single-component win: hardware (GB300 NVL72 with Blackwell Ultra GPUs), networking (Spectrum-X and ConnectX), and software (cuDNN, Transformer Engine, CuTe DSL, Megatron-Core, Megatron Bridge) were tuned together to convert silicon into usable cluster-scale throughput. For organizations training very large models, the combination of scale-out networking and synchronization-free GPU execution directly reduces time-to-train for the largest MoE and dense models.

What to watch

Watch future MLPerf submissions for whether other platforms match NVIDIA’s cross-workload coverage and whether the full-iteration CUDA graphs and CuTe DSL techniques appear in competing stacks. Also follow subsequent MLCommons entries to see if large-scale submissions extend beyond the 8,192-GPU scale used for several NVIDIA results.

Sources: NVIDIA MLPerf Training v6.0 submission summary and Table 1 results retrieved from mlcommons.org entries cited by NVIDIA on June 16, 2026.

Selected NVIDIA MLPerf Training v6.0 time-to-train results

Benchmark         Size        System       GPUs   Time to train
DeepSeek-V3       671B (MoE)  GB300 NVL72  8,192  2.02 mins
GPT-OSS           20B (MoE)   GB300 NVL72  512    7.43 mins
Llama 3.1         405B        GB200 NVL72  8,192  7.07 mins
Llama 3.1         8B          GB200 NVL72  1,024  4.46 mins
Llama 2 70B LoRA  70B LoRA    GB300 NVL72  512    0.4 mins
FLUX.1            -           GB300 NVL72  512    17.1 mins
DLRM-dcnv2        -           GB300 NVL72  64     0.67 mins

Written by The Brieftide · Source: NVIDIA
