NVIDIA Blackwell tops MLPerf Training v6.0 with scale and speed
NVIDIA won every MLPerf Training v6.0 test, submitting across all workloads and scaling DeepSeek-V3 to 8,192 Blackwell Ultra GPUs.
TL;DR
- NVIDIA won every MLPerf Training v6.0 test, submitting across all workloads and scaling DeepSeek-V3 to 8,192 Blackwell Ultra GPUs.
- NVIDIA's platform was also the only one with results submitted across all new and existing workloads, including massive MoE models such as DeepSeek-V3.
- NVIDIA set multiple time-to-train records, including DeepSeek-V3 (671B MoE) trained in 2.02 minutes on 8,192 GPUs.
NVIDIA Blackwell swept MLPerf Training v6.0 on June 16, 2026, winning every benchmark with both the fastest time-to-train and the highest per-accelerator performance on each test. It was also the only platform with results submitted across all new and existing workloads, including massive mixture-of-experts (MoE) models such as DeepSeek-V3.
What did NVIDIA achieve in MLPerf Training v6.0?
NVIDIA won every MLPerf Training v6.0 benchmark and set multiple time-to-train records, including DeepSeek-V3 (671B MoE) trained in 2.02 minutes on 8,192 GPUs. Other MLPerf v6.0 highlights from NVIDIA submissions include GPT-OSS 20B trained in 7.43 minutes on 512 GPUs, Llama 3.1 405B in 7.07 minutes on 8,192 GPUs, and Llama 3.1 8B in 4.46 minutes on 1,024 GPUs.
Those results came from NVIDIA platform configurations such as the GB300 NVL72 and GB200 NVL72 systems. NVIDIA notes that its GB300 NVL72 design connects 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single system using NVIDIA NVLink and NVLink Switch, and that cloud partners scaled up to 8,192 Blackwell GPUs across production data centers for several entries.
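To compare runs at different scales, it can help to multiply GPU count by wall-clock time. The sketch below does that for the four results quoted above. GPU-minutes is a derived figure used here for illustration, not an MLPerf metric, and the names `RESULTS`, `gpu_minutes`, and `summarize` are hypothetical.

```python
# Figures copied from the results quoted above: (benchmark, GPUs, minutes).
RESULTS = [
    ("DeepSeek-V3 671B", 8192, 2.02),
    ("GPT-OSS 20B", 512, 7.43),
    ("Llama 3.1 405B", 8192, 7.07),
    ("Llama 3.1 8B", 1024, 4.46),
]


def gpu_minutes(gpus: int, minutes: float) -> float:
    """Aggregate accelerator time: GPU count times wall-clock minutes."""
    return gpus * minutes


def summarize(results):
    """Map each benchmark name to its GPU-minutes, rounded to two decimals."""
    return {name: round(gpu_minutes(gpus, mins), 2) for name, gpus, mins in results}


if __name__ == "__main__":
    for name, total in summarize(RESULTS).items():
        print(f"{name}: {total:,.2f} GPU-minutes")
```

A short wall-clock time on 8,192 GPUs can still consume more total accelerator time than a longer run on 512 GPUs, which is why time-to-train and per-accelerator performance are reported separately.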
How did Blackwell scale to thousands of GPUs?
NVIDIA combined hardware changes, scale-out networking, and deep software work to run MoE and dense models at hyperscale while hiding communication behind compute. On the hardware side, the Blackwell Ultra-based GB300 systems raised memory and power budgets relative to GB200, and GB300 NVL72 linked 72 Blackwell Ultra GPUs with 36 Grace CPUs as one NVLink domain.
For scale-out fabric, NVIDIA used Spectrum-X Ethernet with Advanced Adaptive Routing to distribute traffic packet-by-packet and ConnectX SuperNICs to handle out-of-order delivery. Spectrum-X Congestion Control detects incast and paces senders before buffers overflow, which NVIDIA says helps keep all-to-all communication hidden behind compute at large scale.
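The difference between pinning whole flows to links and distributing traffic packet by packet can be seen in a toy model. The sketch below is not the Spectrum-X algorithm: the link-selection rules, the flow sizes, and the function names are all assumptions made for illustration. It only shows why per-packet placement evens out link load, and why the receiver then has to restore packet order.

```python
def per_flow_load(flow_sizes, num_links):
    """Pin every flow to a single link (static, hash-style placement).

    The link is chosen as flow_index % num_links, a stand-in for a header hash.
    """
    load = [0] * num_links
    for idx, size in enumerate(flow_sizes):
        load[idx % num_links] += size
    return load


def per_packet_load(flow_sizes, num_links):
    """Place packets one at a time on the currently least-loaded link."""
    load = [0] * num_links
    for size in flow_sizes:
        for _ in range(size):
            target = load.index(min(load))
            load[target] += 1
    return load


def reorder(arrivals):
    """Receiver-side fix-up: sort (sequence_number, payload) pairs by sequence."""
    return [payload for _, payload in sorted(arrivals)]


if __name__ == "__main__":
    flows = [900, 100, 100, 100]  # hypothetical packet counts per flow
    print("per-flow  :", per_flow_load(flows, 4))
    print("per-packet:", per_packet_load(flows, 4))
    print("reordered :", reorder([(2, "c"), (0, "a"), (1, "b")]))
```

With one large flow among small ones, static placement leaves a single link carrying most of the traffic, while per-packet placement spreads the same packets evenly at the cost of out-of-order arrival.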
The software stack delivered multiple targeted optimizations. NVIDIA implemented full-iteration CUDA graphs for token-dropless MoEs to remove CPU-GPU synchronization and offload the entire iteration to the GPU. CuTe DSL enabled advanced kernel fusions and dynamic tile scheduling, yielding "more than 8% end-to-end benefit on DeepSeek-v3 and a 93% end-to-end speedup on GPT-OSS," according to NVIDIA. An MXFP8 attention block moved attention math to 8-bit precision, accelerating attention without altering the attention computation the model requires. Router and HybridEP fusions, together with FP32 math in router kernels, produced a 5x kernel speedup and about a 5% end-to-end gain.
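A 5x kernel speedup producing only about a 5% end-to-end gain is what Amdahl's law predicts when the kernel is a small share of each iteration. The sketch below solves for that share, assuming the 5% figure means an overall speedup of 1.05x (an interpretation, not something NVIDIA states). The function names are hypothetical.

```python
def kernel_time_fraction(kernel_speedup: float, overall_speedup: float) -> float:
    """Solve Amdahl's law for the fraction f of iteration time spent in the kernel.

    overall_speedup = 1 / ((1 - f) + f / kernel_speedup)
    """
    time_saved = 1.0 - 1.0 / overall_speedup
    return time_saved / (1.0 - 1.0 / kernel_speedup)


def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


if __name__ == "__main__":
    f = kernel_time_fraction(kernel_speedup=5.0, overall_speedup=1.05)
    print(f"implied kernel share of iteration time: {f:.1%}")
    print(f"round-trip overall speedup: {amdahl_speedup(f, 5.0):.3f}")
```

Under that assumption the router kernels would account for roughly 6% of iteration time before the optimization, so even an infinitely fast kernel could not save much more than that.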
NVIDIA also improved the 1F1B (One Forward, One Backward) all-to-all overlap in Megatron-Core, capturing full iterations in CUDA graphs and prioritizing communication streams to achieve nearly 100% all-to-all (A2A) communication overlap, which delivered an overall 8% performance benefit. Pipeline layout changes and the use of MXFP8 cut pipeline imbalance to less than 1%, translating to a 4% end-to-end saving.
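The effect of communication overlap on iteration time can be sketched with a simple model: only the portion of all-to-all time that is not hidden behind compute adds to the iteration. The timings below are hypothetical numbers chosen only to show the arithmetic, not measurements from NVIDIA's runs, and the model assumes there is always enough compute available to hide the overlapped share.

```python
def iteration_time(compute_ms: float, a2a_ms: float, overlap: float) -> float:
    """Iteration time when a fraction `overlap` of all-to-all time hides behind compute."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    exposed_ms = a2a_ms * (1.0 - overlap)  # communication left on the critical path
    return compute_ms + exposed_ms


if __name__ == "__main__":
    compute_ms, a2a_ms = 100.0, 10.0  # hypothetical per-iteration timings
    for overlap in (0.0, 0.5, 1.0):
        total = iteration_time(compute_ms, a2a_ms, overlap)
        print(f"overlap {overlap:.0%}: {total:.1f} ms per iteration")
```

In this model the benefit of reaching full overlap is bounded by how much communication was still exposed beforehand, which is why overlap gains are reported as single-digit percentages rather than multiples.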
Why it matters
These results show work across the stack rather than a single-component win: hardware (GB300 with Blackwell Ultra), networking (Spectrum-X and ConnectX), and software (cuDNN, Transformer Engine, CuTe DSL, Megatron-Core, Megatron Bridge) were tuned together to convert silicon into usable cluster-scale throughput. For organizations training very large models, the combination of scale-out networking and synchronization-free GPU execution directly reduces time-to-train for the largest MoE and dense models.
What to watch
Watch future MLPerf submissions for whether other platforms match NVIDIA’s cross-workload coverage and whether the full-iteration CUDA graphs and CuTe DSL techniques appear in competing stacks. Also follow subsequent MLCommons entries to see if large-scale submissions extend beyond the 8,192-GPU scale used for several NVIDIA results.
Sources: NVIDIA MLPerf Training v6.0 submission summary and Table 1 results retrieved from mlcommons.org entries cited by NVIDIA on June 16, 2026.
| Benchmark | Model size | System | Scale | Time to train |
|---|---|---|---|---|
| DeepSeek-V3 | 671B (MoE) | GB300 NVL72 | 8,192 GPUs | 2.02 mins |
| GPT-OSS | 20B (MoE) | GB300 NVL72 | 512 GPUs | 7.43 mins |
| Llama 3.1 | 405B | GB200 NVL72 | 8,192 GPUs | 7.07 mins |
| Llama 3.1 | 8B | GB200 NVL72 | 1,024 GPUs | 4.46 mins |
| Llama 2 70B LoRA | 70B LoRA | GB300 NVL72 | 512 GPUs | 0.4 mins |
| FLUX.1 | - | GB300 NVL72 | 512 GPUs | 17.1 mins |
| DLRM-dcnv2 | - | GB300 NVL72 | 64 GPUs | 0.67 mins |
Written by The Brieftide · Source: NVIDIA