Benchmarks & Evals · 6 min read

NVIDIA Nemotron 3 Ultra NVFP4 checkpoint: Model Optimizer, 5.9x

Model Optimizer quantized Nemotron 3 Ultra to NVFP4, shrinking BF16 weights from 1,121 GB to 352.3 GB.

The Brieftide

TL;DR

  • 01 Model Optimizer quantized Nemotron 3 Ultra to NVFP4, shrinking BF16 weights from 1,121 GB to 352.3 GB.
  • 02 NVIDIA quantized Nemotron 3 Ultra (550B) to NVFP4 using NVIDIA Model Optimizer on Jun 26, 2026, producing a single NVFP4 checkpoint that runs on both Hopper and Blackwell hardware.
  • 03 NVIDIA used a mix of per-layer precisions and several quantization calibration strategies to build a high-quality NVFP4 checkpoint, rather than naively converting every layer to NVFP4.

NVIDIA quantized Nemotron 3 Ultra (550B) to NVFP4 using NVIDIA Model Optimizer on Jun 26, 2026, producing a single NVFP4 checkpoint that runs on both Hopper and Blackwell hardware. The model shrank from 1,121 GB in BF16 to 352.3 GB (a 3.2x reduction). The NVFP4 checkpoint achieves up to 5.9x higher inference throughput than GLM-5.1 754B FP4 on decode-heavy workloads while matching BF16 accuracy across nearly every benchmark.

How did NVIDIA produce the NVFP4 checkpoint?

NVIDIA used a mix of per-layer precisions and several quantization calibration strategies to build a high-quality NVFP4 checkpoint, rather than naively converting every layer to NVFP4. Embedding, output classification and MTP layers remained BF16; MoE routed experts were quantized to NVFP4; MoE shared experts and Mamba mixer linears used FP8 per-tensor; KV cache used FP8; Mamba SSM cache moved from FP32 to FP16 with stochastic rounding. This per-layer assignment reduced the model size to 352.3 GB while preserving accuracy.
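The per-layer assignment above can be pictured as a small rule table. The following is a minimal sketch only: the layer-name patterns (`*moe.experts.*`, `*lm_head*`, and so on) and the `precision_for` helper are hypothetical and are not taken from the Nemotron checkpoint or from Model Optimizer. Only the precision labels come from the post.

```python
from fnmatch import fnmatchcase

# Hypothetical name patterns; real module names in the checkpoint may differ.
# Rules are checked in order, so MTP layers stay BF16 even if they contain experts.
PRECISION_RULES = [
    ("*embed*", "BF16"),                        # embedding
    ("*lm_head*", "BF16"),                      # output classification
    ("*mtp*", "BF16"),                          # Multi-Token Prediction layers
    ("*moe.experts.*", "NVFP4"),                # MoE routed experts
    ("*moe.shared_experts.*", "FP8 per-tensor"),
    ("*mamba.mixer.*", "FP8 per-tensor"),       # Mamba mixer linears
]

# Cache precisions as described in the post.
CACHE_PRECISION = {
    "kv_cache": "FP8",
    "mamba_ssm_cache": "FP16 (stochastic rounding)",
}


def precision_for(layer_name):
    """Return the precision of the first matching rule, or None if the post does not say."""
    for pattern, precision in PRECISION_RULES:
        if fnmatchcase(layer_name, pattern):
            return precision
    return None


if __name__ == "__main__":
    for name in ["layers.3.moe.experts.7.up_proj",
                 "layers.3.moe.shared_experts.up_proj",
                 "layers.5.mamba.mixer.in_proj",
                 "lm_head"]:
        print(name, "->", precision_for(name))
```

Returning `None` for unmatched layers is deliberate: the post does not list a precision for every layer type, so the sketch does not guess one.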

The team emphasized that a single checkpoint can serve both Hopper and Blackwell by converting the weight format at runtime: on Hopper the serving framework switches to W4A16, and on Blackwell it uses native W4A4. W4A16 was necessary to fit Multi-Token Prediction (MTP) on Hopper because W8A8’s larger memory footprint left too little headroom to host MTP; W4A16 matches or beats W8A8 across the board according to the post.
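As a rough illustration of the single-checkpoint idea, a serving layer could pick the execution mode from the GPU generation. The post does not describe how the serving framework actually dispatches, so the `serving_mode` function and the use of the CUDA compute-capability major version (9 for Hopper, 10 and above for Blackwell) are assumptions made for this sketch.

```python
def serving_mode(cc_major):
    """Map a CUDA compute-capability major version to a weight/activation mode.

    Illustrative only: the real dispatch logic is not described in the post.
    """
    if cc_major >= 10:
        # Blackwell: FP4 weights and FP4 activations run natively.
        return "W4A4"
    if cc_major == 9:
        # Hopper: keep 4-bit weights in memory, compute with 16-bit activations.
        return "W4A16"
    raise ValueError(f"unsupported compute capability major version: {cc_major}")


if __name__ == "__main__":
    for major in (9, 10):
        print(major, "->", serving_mode(major))
```

In both branches the stored weights stay 4-bit, which is why one checkpoint file can serve both GPU generations in this picture.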

What scaling and calibration choices mattered?

FP4 has only eight representable magnitudes, so the choice of per-block scale has a large effect on quantization error. NVIDIA evaluated max (absmax) scaling, mean squared error (MSE) scaling, GPTQ, and a custom four-over-six method. Max scaling sets the block scale from the block's absolute maximum, which preserves the largest weight but can compress the other values. MSE scaling minimizes reconstruction error, but it did not reliably improve downstream benchmarks despite reducing per-tensor weight error by 27.1% in the Nemotron 3 Ultra experiments.
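To make the difference between max and MSE scaling concrete, here is a minimal pure-Python sketch for one block of weights. It assumes the standard E2M1 magnitude grid for FP4 and keeps the block scale as an ordinary float; real NVFP4 stores block scales in FP8 alongside a per-tensor scale, which this sketch ignores. The function names and the grid search over candidate scales are illustrative choices, not NVIDIA's implementation.

```python
# E2M1 magnitudes representable in FP4 (the sign is a separate bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def dequantized(block, scale):
    """Quantize each weight to the nearest FP4 magnitude, then scale it back."""
    out = []
    for w in block:
        # Values beyond 6 * scale are clipped to 6 by the nearest-value search.
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(mag * scale if w >= 0 else -mag * scale)
    return out


def mse(block, scale):
    """Mean squared reconstruction error of the block at the given scale."""
    q = dequantized(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, q)) / len(block)


def max_scale(block):
    """Max (absmax) scaling: map the largest magnitude onto the top grid value, 6."""
    absmax = max(abs(w) for w in block)
    return absmax / 6.0 if absmax > 0 else 1.0


def mse_scale(block, steps=100):
    """MSE scaling: search scales between 0.5x and 1.0x of the max scale."""
    base = max_scale(block)
    candidates = [base * (0.5 + 0.5 * i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: mse(block, s))


if __name__ == "__main__":
    block = [0.9, -0.1, 0.45, 0.3, -0.62, 0.05, 0.2, -0.33,
             0.71, 0.0, -0.15, 0.4, 0.52, -0.08, 0.27, 0.6]
    print("max-scale MSE:", mse(block, max_scale(block)))
    print("mse-scale MSE:", mse(block, mse_scale(block)))
```

Because the candidate list includes the max scale itself, the searched scale can never have a higher block MSE than max scaling. As the article notes, lower weight error of this kind did not reliably translate into better downstream benchmark scores.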

Four-over-six lets each block choose between an M=4 and an M=6 grid to reduce rounding error near the 4-to-6 gap, the widest step in the FP4 grid. Applying four-over-six to the routed-expert weights raised the global per-tensor weight scale by 1.75x, and across all 49,152 projection weights in the model’s 48 MoE expert layers it cut the median reconstruction MSE by 16.4% compared to standard max calibration. At the balanced 5.03-BPE operating point, four-over-six delivered 98.5% median recovery relative to BF16, ahead of max calibration (96.8%) and MSE (98.4%).
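The per-block choice can be sketched as follows, under the assumption that "M=4 or M=6" means mapping the block's absolute maximum onto grid value 4 or grid value 6 and keeping whichever gives the lower reconstruction error. This is an illustrative reading of the description, not NVIDIA's code: the helper names are invented, the block scale is a plain float, and the per-tensor scale adjustment mentioned above is not modeled.

```python
# E2M1 magnitudes representable in FP4 (the sign is a separate bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def block_mse(block, scale):
    """Mean squared error after rounding block / scale to the nearest FP4 magnitude."""
    total = 0.0
    for w in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        total += (abs(w) - mag * scale) ** 2
    return total / len(block)


def four_over_six(block):
    """Try mapping absmax to 6 and to 4; return (M, scale, mse) for the better one."""
    absmax = max(abs(w) for w in block)
    if absmax == 0:
        return 6.0, 1.0, 0.0
    best = None
    for m in (6.0, 4.0):          # ties keep M=6, the standard max-scaling choice
        scale = absmax / m
        err = block_mse(block, scale)
        if best is None or err < best[2]:
            best = (m, scale, err)
    return best


if __name__ == "__main__":
    # Short blocks for readability; NVFP4 blocks hold 16 elements.
    print(four_over_six([6.0, 5.0, 5.0, 5.0]))   # values sitting in the 4-to-6 gap
    print(four_over_six([6.0, 1.0, 0.5, 2.0]))   # values already on the M=6 grid
```

With M=4 the top of the grid is unused, but the remaining steps are evenly finer relative to the block maximum, so weights that would otherwise fall between 4 and 6 round with less error.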

The team also swept effective bits-per-element (BPE) across five operating points from 4.85 to 7.19 and used AutoQuantize (mtq.auto_quantize) to meet each bit budget. NVFP4’s minimum practical effective BPE is 4.5 once per-block and per-tensor scales are amortized. The AA-LCR benchmark gave a clear signal for choosing an operating point: raising BPE from 4.85 to 5.03 improved AA-LCR by 2.4 points in the sweep described.
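The 4.5 figure follows from simple arithmetic, sketched below. The sketch assumes the usual NVFP4 layout of 4-bit elements with one 8-bit scale per 16-element block, and treats the per-tensor scale as negligible. The 90/10 split in `hypothetical_mix` is invented purely to show how a weighted average works; it is not the composition NVIDIA reported.

```python
def nvfp4_bpe(elem_bits=4, scale_bits=8, block_size=16):
    """Bits per element once the per-block scale is amortized over the block."""
    return elem_bits + scale_bits / block_size


def effective_bpe(groups):
    """Weighted average bits per element over (num_elements, bits_per_element) groups."""
    total_elements = sum(n for n, _ in groups)
    return sum(n * bits for n, bits in groups) / total_elements


if __name__ == "__main__":
    print("NVFP4 floor:", nvfp4_bpe())

    # Invented split: 90% of elements in NVFP4, 10% in FP8.
    hypothetical_mix = [(90, nvfp4_bpe()), (10, 8)]
    print("hypothetical mix:", effective_bpe(hypothetical_mix))

    # Rough cross-check from the article's sizes, assuming the 1,121 GB
    # baseline is stored entirely at 16 bits per element.
    print("implied by checkpoint sizes:", 16 * 352.3 / 1121)
```

The last line comes out near 5.03 bits per element, which lines up with the balanced operating point quoted above, although the post does not state that the shipped checkpoint size was derived this way.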

Why it matters

This work shows that FP4 floating-point formats can deliver real, usable gains when paired with careful per-layer decisions and calibrated scaling. The single-checkpoint approach that adapts to Hopper and Blackwell removes a deployment friction point for teams targeting mixed NVIDIA fleets. For large models where memory and KV cache matter, cutting a 1,121 GB BF16 model to 352.3 GB while retaining near-BF16 accuracy materially lowers infrastructure cost and increases throughput.

What to watch

NVIDIA plans to release NVFP4_FOUR_OVER_SIX_CFG in the upcoming NVIDIA Model Optimizer 0.46 release in July, and the post points to the Nemotron 3 Ultra PTQ example as a hands-on recipe. Watch for external reproductions of the 5.9x throughput claim and for other model teams publishing their BPE sweeps and four-over-six results against BF16 baselines.

Key NVFP4 checkpoint metrics vs BF16 baseline
Item | BF16 baseline | NVFP4 checkpoint
Model size (GB) | 1,121 | 352.3
Size reduction | n/a | 3.2x
Throughput vs GLM-5.1 754B FP4 (decode-heavy) | n/a | up to 5.9x higher
Minimum effective bits-per-element (BPE) | 16 (BF16) | 4.5 (NVFP4 minimum with overhead)
Median recovery vs BF16 (balanced 5.03-BPE) | 100% (reference) | 98.5% (four-over-six)
Median reconstruction MSE reduction (MoE projection weights) | n/a | 16.4% reduction vs max calibration (49,152 weights, 48 MoE layers)

Written by The Brieftide · Source: NVIDIA
