NVIDIA Blackwell: DFlash boosts LLM inference up to 15x
DFlash’s block-diffusion speculative decoding raises throughput up to 15x on Blackwell and integrates with TensorRT-LLM, vLLM and SGLang.
TL;DR
- DFlash’s block-diffusion speculative decoding raises throughput up to 15x on Blackwell and integrates with TensorRT-LLM, vLLM and SGLang.
- NVIDIA is using DFlash block-diffusion speculative decoding on Blackwell GPUs to raise LLM inference throughput by up to 15x, according to results the company published on Jun 23, 2026.
- The method drafts full token blocks in parallel and verifies them with the target model, and NVIDIA shows the gains across multiple models and inference frameworks.
NVIDIA is using DFlash block-diffusion speculative decoding on Blackwell GPUs to raise LLM inference throughput by up to 15x, according to results the company published on Jun 23, 2026. The method drafts full token blocks in parallel and verifies them with the target model, and NVIDIA shows the gains across multiple models and inference frameworks.
How much faster is DFlash on Blackwell?
DFlash delivers up to 15x higher throughput for gpt-oss-120b on an eight-GPU NVIDIA DGX B300 Blackwell Ultra system at the same interactivity target, and it nearly doubles interactivity for Llama 3.1 8B versus EAGLE-3 at matched concurrency. NVIDIA’s latency-throughput Pareto curve for gpt-oss-120b shows DFlash exceeding autoregressive decoding by more than 15x in the 500-600 tokens/sec per user interactivity range, and outperforming EAGLE-3 by about 1.5x in that region.
NVIDIA’s published tables compare EAGLE-3 and DFlash across Speed-Bench datasets: for gpt-oss-120b the average speedup rises from 1.7x with EAGLE-3 to 2.3x with DFlash; for Llama 3.1 8B it rises from 2.2x to 2.8x. On single-GPU runs versus autoregressive decoding, DFlash reaches up to 5.8x on Gemma 4 31B (Math500) using vLLM and up to 5.1x on Qwen3 8B (Math500) using SGLang.
How does DFlash speculative decoding work?
DFlash replaces an autoregressive drafter with a block-diffusion drafter that predicts a block of masked future tokens in a single forward pass, then lets the target model verify the draft in parallel. That design turns sequential drafting into parallel GPU work while preserving the target model’s output distribution through verification.
The method pairs three techniques: block-diffusion drafting to propose multiple tokens at once, target hidden-state conditioning so the drafter uses context features from the target model, and KV injection to insert those target context features into the drafter’s key-value projections across layers. The target model still performs verification and accepts the longest valid prefix, so correctness remains tied to the target model’s outputs.
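The draft-then-verify loop described above can be illustrated with a toy sketch. It is not NVIDIA's or the DFlash authors' code: `target_next` is a hypothetical stand-in for the target model, the drafter is simulated by hand, and verification is assumed to be greedy (exact match against the target's own next token). Production systems score every drafted position in one target forward pass and may use a sampling-based acceptance rule; the loop here is sequential only for readability.

```python
from typing import Callable, List

Token = int


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(context) * 31 + len(context)) % 50


def greedy_continuation(
    context: List[Token], n: int, next_token: Callable[[List[Token]], Token] = target_next
) -> List[Token]:
    """What plain autoregressive decoding would emit for the next n tokens."""
    out: List[Token] = []
    for _ in range(n):
        out.append(next_token(context + out))
    return out


def verify_block(
    context: List[Token],
    draft: List[Token],
    next_token: Callable[[List[Token]], Token] = target_next,
) -> List[Token]:
    """Accept the longest draft prefix the target agrees with.

    The target always contributes one token per call: either a correction
    at the first mismatch or a bonus token after a fully accepted block.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the bad draft token
            return accepted
        accepted.append(tok)
    accepted.append(next_token(context + accepted))  # whole block accepted
    return accepted


context = [3, 1, 4]
truth = greedy_continuation(context, 4)
# Simulated drafter: right for two tokens, wrong on the third.
draft = truth[:2] + [(truth[2] + 1) % 50] + truth[3:]
print(verify_block(context, draft) == truth[:3])  # True
```

Whatever the drafter proposes, the emitted tokens match what the target would have produced on its own under greedy decoding, which is the sense in which verification keeps correctness tied to the target model. A better drafter only changes how many tokens each target pass yields.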
NVIDIA highlights that this parallel drafting is well matched to Blackwell Ultra hardware. Each Blackwell Ultra GPU comprises two reticle-sized dies connected by a 10 TB/s chip-to-chip interconnect, yielding a unified compute domain with 160 SMs and 640 fifth-generation Tensor Cores and exposing up to 15 PFLOPS of dense NVFP4 compute. Because drafting and verification run as parallel work, the system can put more of that compute to use while holding per-user token latency steady.
How can developers adopt DFlash?
Researchers released the DFlash paper in February 2026, and the team has posted 20 DFlash checkpoints on Hugging Face with recipes for Blackwell and Hopper GPUs. NVIDIA notes integration into TensorRT-LLM and community support in vLLM and SGLang. On vLLM, swapping EAGLE-3 for a DFlash checkpoint requires only a configuration change, with no code changes, and runs via the open-source Speculators library, which connects the drafter to the target model’s hidden states. On SGLang, migration requires updating the speculative algorithm and providing the matching DFlash checkpoint.
Why it matters
DFlash attacks the decode bottleneck that limits GPU utilization when autoregressive models produce tokens sequentially. By drafting blocks in parallel and keeping verification with the target model, DFlash shifts work from sequential memory movement to parallel compute, which increases throughput at fixed latency targets. For teams running interactive coding, reasoning, or agentic workloads, that can mean serving many more concurrent users without increasing per-user token latency.
What to watch
Look for independent benchmarks of the DFlash checkpoints and the paper’s recipes across real-world production stacks, and for wider adoption in SGLang, vLLM and TensorRT-LLM deployments. Also watch whether future drafters expand block sizes or change KV injection strategies, since those choices will affect acceptance rates and end-to-end latency.
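To see why block size and acceptance rate matter together, here is a back-of-the-envelope sketch. It assumes each drafted token is accepted independently with the same probability `alpha`, a common simplification in speculative-decoding analysis and not something NVIDIA reports for DFlash. The `alpha` values and block sizes in the loop are illustrative only.

```python
def expected_tokens_per_pass(alpha: float, block_size: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes i.i.d. acceptance with probability `alpha` per drafted token.
    The target always adds one token (a correction or a bonus), so the
    expectation is the geometric sum 1 + alpha + ... + alpha**block_size.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    if alpha == 1.0:
        return float(block_size + 1)
    return (1.0 - alpha ** (block_size + 1)) / (1.0 - alpha)


# Illustrative sweep: larger blocks help only while acceptance stays high.
for alpha in (0.6, 0.8, 0.9):
    row = [round(expected_tokens_per_pass(alpha, k), 2) for k in (4, 8, 16)]
    print(alpha, row)
```

Under this model the yield per pass is capped at `1 / (1 - alpha)` no matter how large the block, so a bigger block pays off only if the drafter's acceptance rate holds up at later positions.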
| Speed-Bench dataset | gpt-oss-120b, EAGLE-3 | gpt-oss-120b, DFlash | Llama 3.1 8B, EAGLE-3 | Llama 3.1 8B, DFlash |
|---|---|---|---|---|
| Coding | 1.8x | 2.6x | 2.3x | 3.0x |
| RAG | 1.7x | 2.3x | 2.4x | 3.1x |
| Reasoning | 1.8x | 2.3x | 2.5x | 2.8x |
| Writing | 1.5x | 1.8x | 2.3x | 2.7x |
| Multilingual | 1.8x | 2.6x | 1.4x | 2.4x |
| Summarization | 1.6x | 2.0x | 2.3x | 2.6x |
| Average | 1.7x | 2.3x | 2.2x | 2.8x |
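As a sanity check on the table, the short script below recomputes the Average row and the relative gain of DFlash over EAGLE-3. The column order (gpt-oss-120b EAGLE-3, gpt-oss-120b DFlash, Llama 3.1 8B EAGLE-3, Llama 3.1 8B DFlash) is inferred from the averages quoted in the article text, so treat that mapping as an assumption.

```python
# Per-dataset speedups copied from the table; column order is an inference.
COLUMNS = (
    "gpt-oss-120b EAGLE-3",
    "gpt-oss-120b DFlash",
    "Llama 3.1 8B EAGLE-3",
    "Llama 3.1 8B DFlash",
)
speedups = {
    "Coding": (1.8, 2.6, 2.3, 3.0),
    "RAG": (1.7, 2.3, 2.4, 3.1),
    "Reasoning": (1.8, 2.3, 2.5, 2.8),
    "Writing": (1.5, 1.8, 2.3, 2.7),
    "Multilingual": (1.8, 2.6, 1.4, 2.4),
    "Summarization": (1.6, 2.0, 2.3, 2.6),
}


def column_means(rows: dict) -> tuple:
    """Arithmetic mean of each column, rounded to one decimal like the table."""
    n = len(rows)
    width = len(COLUMNS)
    return tuple(round(sum(r[i] for r in rows.values()) / n, 1) for i in range(width))


means = column_means(speedups)
print(dict(zip(COLUMNS, means)))

# Relative gain of DFlash over EAGLE-3, computed from the rounded averages.
gain_gpt_oss = means[1] / means[0]
gain_llama = means[3] / means[2]
print(f"gpt-oss-120b: {gain_gpt_oss:.2f}x, Llama 3.1 8B: {gain_llama:.2f}x")
```

The recomputed means reproduce the Average row, and the ratios put DFlash roughly 1.3x ahead of EAGLE-3 on average for both models.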
Written by The Brieftide · Source: NVIDIA