
NVIDIA CuTe DSL fused MoE kernels deliver 1.3x–2.1x speedups

Custom fused MLP kernels in cuDNN Frontend enable sync-free MoE CUDA graphs and boost kernel-level speed by 1.3x–2.1x.

The Brieftide

TL;DR

  • Custom fused MLP kernels in cuDNN Frontend enable sync-free MoE CUDA graphs and boost kernel-level speed by 1.3x–2.1x.
  • The announcement appeared on Jun 15, 2026, in a post by Rachit Garg and Matthew Nicely, and documents both microbenchmark gains and end-to-end pretraining results.
  • The new kernels fuse GroupGemm with downstream operations into three patterns: GroupGemm + Quantize, GroupGemm + Activation + Quantize/Transpose, and GroupGemm + dActivation + Quantize/Transpose.

NVIDIA has introduced a family of fused MLP kernels, written with the CuTe DSL, that target mixture-of-experts (MoE) training bottlenecks. They deliver 1.3x–2.1x kernel-level speedups and enable sync-free MoE execution with full-iteration CUDA graphs. The announcement appeared on Jun 15, 2026, in a post by Rachit Garg and Matthew Nicely, and documents both microbenchmark gains and end-to-end pretraining results.

What NVIDIA built

The new kernels fuse GroupGemm with downstream operations into three patterns: GroupGemm + Quantize, GroupGemm + Activation + Quantize/Transpose, and GroupGemm + dActivation + Quantize/Transpose. They natively support GLU-style activations including SwiGLU, GeGLU, and sReLU, and they add epilogue capabilities such as feature scaling, tensor clamping, and bias addition. The kernels also handle low-precision formats MXFP8 and NVFP4 in the fused path.
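As a rough illustration of what the quantize step in these patterns involves, here is a minimal pure-Python sketch of block-scaled quantization in the spirit of MXFP8, where each block of 32 values shares one power-of-two scale. It is not NVIDIA's kernel code: the function names are hypothetical, the scale-selection rule is a simplification, and rounding to four significant bits only approximates FP8 E4M3 encoding (subnormals are ignored).

```python
import math

BLOCK = 32        # MX formats share one scale across 32 consecutive values
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def round_to_e4m3_grid(x):
    """Keep 4 significant bits (1 implicit + 3 mantissa); ignores subnormals."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    return math.ldexp(round(mantissa * 16) / 16, exponent)

def quantize_block(values):
    """Return (scale, quantized values) for one block of up to BLOCK values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    # Smallest power-of-two scale that brings the block maximum into range.
    scale = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))
    return scale, [round_to_e4m3_grid(v / scale) for v in values]

def quantize_row(row):
    """Split a row into blocks and quantize each block independently."""
    return [quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK)]

def dequantize_row(blocks):
    """Multiply every stored value by its block's scale."""
    return [q * scale for scale, quantized in blocks for q in quantized]

# Two blocks with very different magnitudes end up with different scales.
row = [0.01 * i for i in range(32)] + [100.0 * (i + 1) for i in range(32)]
blocks = quantize_row(row)
restored = dequantize_row(blocks)
```

In an unfused path, a step like this would read the BF16 GEMM output back from global memory; the fused patterns described above apply it while the result is still on chip.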

For GLU activations, the team repacks weights so that the same thread block can access both the input and gate columns, allowing the GLU combination to be computed in the GEMM epilogue without extra global memory reads and writes. For quantization, the fused kernels produce the low-precision outputs and any transposed versions needed for backprop, eliminating separate BF16 read/write passes and avoiding extra per-tensor amax memory passes for NVFP4.
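The weight-repacking idea can be sketched in plain Python. The example below interleaves gate and input ("up") weight columns in tile-sized chunks so that each output tile of a single GEMM holds matching gate and up values, and the SwiGLU combination can happen per tile. The tile size, function names, and interleave granularity are assumptions for illustration, not details from NVIDIA's post.

```python
import math

def matmul(a, b):
    """Plain matrix product of row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu_unfused(x, w_gate, w_up):
    """Reference: two GEMMs followed by a separate elementwise pass."""
    gate, up = matmul(x, w_gate), matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(gr, ur)] for gr, ur in zip(gate, up)]

def repack(w_gate, w_up, tile):
    """Interleave gate and up columns in chunks of `tile` columns."""
    packed = []
    for gate_row, up_row in zip(w_gate, w_up):
        row = []
        for start in range(0, len(gate_row), tile):
            row += gate_row[start:start + tile] + up_row[start:start + tile]
        packed.append(row)
    return packed

def swiglu_fused(x, w_packed, tile):
    """One GEMM; each 2*tile-wide output tile is combined in its 'epilogue'."""
    out = []
    for acc_row in matmul(x, w_packed):
        row = []
        for start in range(0, len(acc_row), 2 * tile):
            chunk = acc_row[start:start + 2 * tile]
            half = len(chunk) // 2  # first half is gate, second half is up
            row += [silu(g) * u for g, u in zip(chunk[:half], chunk[half:])]
        out.append(row)
    return out

TILE = 2  # hypothetical tile width
x = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]]
w_gate = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
w_up = [[0.2 * (i - j) for j in range(4)] for i in range(3)]

reference = swiglu_unfused(x, w_gate, w_up)
fused = swiglu_fused(x, repack(w_gate, w_up, TILE), TILE)
```

With a plain concatenated layout, a tile would see only gate columns or only up columns, so the combination would need a second pass over the full GEMM output.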

To remove CPU launch and synchronization overhead, the GroupGemm kernels track tokens per group in GPU memory rather than relying on host-side shape queries. That design removes the need for host-device synchronization before kernel launches and allows a full training iteration to be captured as a single CUDA graph.
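A host-side simulation shows why keeping token counts in device memory matters for graph replay. In the sketch below, the "kernel" always iterates over a fixed-capacity buffer and derives expert boundaries from a counts buffer at run time, so the same captured call can be replayed after the counts change. Everything here is a hypothetical stand-in: `capture` only mimics CUDA graph capture, and the Python list `counts` plays the role of a GPU buffer.

```python
from bisect import bisect_right
from itertools import accumulate

def grouped_gemm(x, weights, tokens_per_group, out):
    """Stand-in for a device kernel. The loop bound (the "launch shape") is the
    fixed buffer capacity; group boundaries come from tokens_per_group."""
    ends = list(accumulate(tokens_per_group))  # exclusive end row per expert
    for row in range(len(x)):
        if row >= ends[-1]:
            out[row] = [0.0] * len(out[row])   # unused padding row
            continue
        expert = bisect_right(ends, row)       # expert that owns this row
        out[row] = [sum(a * b for a, b in zip(x[row], col))
                    for col in zip(*weights[expert])]

def capture(kernel, *buffers):
    """Mimic graph capture: freeze the call together with its buffers."""
    return lambda: kernel(*buffers)

CAPACITY, HIDDEN, EXPERTS = 6, 2, 3
x = [[1.0] * HIDDEN for _ in range(CAPACITY)]
# Expert e scales its tokens by (e + 1), so results are easy to read.
weights = [[[float(e + 1) if i == j else 0.0 for j in range(HIDDEN)]
            for i in range(HIDDEN)] for e in range(EXPERTS)]
counts = [2, 3, 1]                       # pretend this lives in GPU memory
out = [[0.0] * HIDDEN for _ in range(CAPACITY)]

graph = capture(grouped_gemm, x, weights, counts, out)
graph()
first = [row[0] for row in out]

counts[:] = [1, 1, 2]                    # routing changes; buffer updated in place
graph()                                  # replayed without re-capturing
second = [row[0] for row in out]
```

If the launch shape depended on the counts, the host would have to read them back before every launch, which is the synchronization point the kernels are designed to avoid.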

The kernels expose runtime controls to improve multi-kernel overlap: dynamic scheduling to overlap communication and parallelism work, and a configurable cluster margin that limits the number of SMs a kernel uses so other kernels can run concurrently.

Performance and availability

In unit microbenchmarks, the fused kernels accelerate the forward pass by up to 1.3x and the backward pass by up to 2.1x compared to unfused execution paths. NVIDIA reports end-to-end training throughput gains of up to 8% in a DeepSeek-V3 pretraining setup and up to 93% in a GPT-OSS pretraining setup when these kernels are integrated and allowed to enable sync-free CUDA graphs.
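A back-of-the-envelope model helps relate the kernel-level and end-to-end figures. The sketch below applies Amdahl's law under the simplifying assumption that the fused kernels speed up a fixed fraction of each iteration and nothing else changes. The fractions it produces are illustrative, not numbers from NVIDIA's post.

```python
def end_to_end_speedup(fraction, kernel_speedup):
    """Amdahl's law: only `fraction` of the iteration time gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def required_fraction(target_speedup, kernel_speedup):
    """Fraction of iteration time the kernels must cover to hit the target."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / kernel_speedup)

# Share of iteration time needed to explain an 8% end-to-end gain.
low = required_fraction(1.08, 2.1)    # if every affected kernel ran 2.1x faster
high = required_fraction(1.08, 1.3)   # if every affected kernel ran 1.3x faster

# Share needed to explain a 93% gain from kernel speed alone.
extreme = required_fraction(1.93, 2.1)

print(f"8% gain:  {low:.0%} to {high:.0%} of iteration time")
print(f"93% gain: {extreme:.0%} of iteration time")
```

Under this model, an 8% gain corresponds to roughly 14% to 32% of iteration time spent in the affected kernels, while a 93% gain would require about 92% even at a uniform 2.1x. That suggests the larger figure reflects more than faster kernels, which is consistent with NVIDIA tying the result to sync-free CUDA graphs.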

The kernels are already available in the cuDNN Frontend (v1.23.0+). They can be invoked directly from cuDNN Frontend, via NVIDIA Transformer Engine (v2.15+) through transformer_engine.pytorch.ops, or through NVIDIA Megatron-Core (26.04-alpha.rc2+). The cuDNN Frontend wrapper compiles a kernel on first invocation and caches the compiled object for reuse; NVIDIA says it is working on Ahead-of-Time compilation support to cache cubins on disk.
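The compile-on-first-use behavior can be pictured as a cache keyed by kernel configuration. The sketch below is a generic in-memory version of that pattern, not the cuDNN Frontend implementation: `KernelCache`, `fake_compile`, and the configuration fields are hypothetical names chosen for illustration.

```python
class KernelCache:
    """Compile a kernel the first time a configuration is seen, then reuse it."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.compile_count = 0

    def get(self, **config):
        key = tuple(sorted(config.items()))  # hashable, order-independent key
        if key not in self._cache:
            self.compile_count += 1          # the expensive path, taken once
            self._cache[key] = self._compile_fn(**config)
        return self._cache[key]

def fake_compile(pattern, activation, out_dtype):
    """Placeholder for an expensive JIT compile; returns a callable 'kernel'."""
    label = f"{pattern}/{activation}/{out_dtype}"
    return lambda: label

cache = KernelCache(fake_compile)
k1 = cache.get(pattern="gemm_act_quant", activation="swiglu", out_dtype="mxfp8")
k2 = cache.get(out_dtype="mxfp8", activation="swiglu", pattern="gemm_act_quant")
k3 = cache.get(pattern="gemm_act_quant", activation="geglu", out_dtype="mxfp8")
```

A cache like this lives only as long as the process, so every new run pays the compile cost again. Persisting compiled objects on disk, as NVIDIA says it plans to do with Ahead-of-Time compilation, would address that.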

Why it matters

MoE blocks expose three systemic bottlenecks: activation functions that become memory bound, CPU-bound overhead for per-expert token bookkeeping, and the memory cost of quantization. By fusing GEMM with activation and quantization and by moving token-shape tracking onto the GPU, these kernels directly reduce memory traffic and eliminate a common host-side synchronization point. The result is higher Tensor Core utilization and fewer CPU stalls during large-scale MoE pretraining, which matters for teams running capacity-hungry models where every percentage point of throughput shortens wall-clock time and reduces cloud spend.

What to watch

NVIDIA is working on additional fusion patterns, support for more frameworks including JAX, heuristics to pick kernels to compile, activation recompute, and Ahead-of-Time compilation to reduce compile cost. Track the arrival of AOT support in cuDNN Frontend and new framework bindings as the next concrete milestones that will determine how broadly and easily these kernels are adopted in production MoE training pipelines.

Baseline vs fused CuTe DSL kernels (reported)

| Item | Reference | Fused CuTe DSL kernels |
| --- | --- | --- |
| Forward pass speed | baseline | up to 1.3x |
| Backward pass speed | baseline | up to 2.1x |
| DeepSeek-V3 end-to-end throughput | baseline | up to 8% improvement |
| GPT-OSS end-to-end throughput | baseline | up to 93% improvement |

Written by The Brieftide · Source: NVIDIA
