AI Infrastructure · 4 min read

NVIDIA TensorRT 11.0: Multi-GPU inference, context parallelism

TensorRT 11.0 adds native IDistCollectiveLayer multi-device primitives; Cosmos 3 and FLUX.1 benchmarks favor DeepSpeed Ulysses for extreme context lengths.

The Brieftide

TL;DR

  • TensorRT 11.0 adds native IDistCollectiveLayer multi-device primitives; Cosmos 3 and FLUX.1 benchmarks favor DeepSpeed Ulysses for extreme context lengths.
  • NVIDIA TensorRT 11.0 introduced native multi-device inference support on Jun 25, 2026, adding distributed communication primitives so a single network can execute across multiple GPUs.
  • TensorRT 11.0 adds native multi-device inference via IDistCollectiveLayer and integrates NVIDIA NCCL, enabling the full set of NCCL distributed collectives to be used inside TensorRT networks.

NVIDIA TensorRT 11.0 introduced native multi-device inference support on Jun 25, 2026, adding distributed communication primitives so a single network can execute across multiple GPUs. The release integrates NVIDIA NCCL collectives and new IDistCollectiveLayer primitives, and benchmarks on single-node 8-GPU systems compare AllGather KV, Ring Attention, and DeepSpeed Ulysses for generative media pipelines.

What did TensorRT 11.0 add?

TensorRT 11.0 adds native multi-device inference via IDistCollectiveLayer and integrates NVIDIA NCCL, enabling the full set of NCCL distributed collectives to be used inside TensorRT networks. The supported collectives listed in the release are AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, AlltoAll, Gather, and Scatter. Distributed communication layers can now be inserted directly into a network. The usual deployment runs the same network on every GPU rank, although in principle each rank may run a different model.
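To make the data movement behind four of these collectives concrete, here is a minimal single-process sketch in plain Python. It is not the TensorRT or NCCL API: the source does not show how IDistCollectiveLayer is configured, so the function names below are hypothetical and each "rank" is just a list. Reductions are assumed to be sums.

```python
# Toy model of collective semantics; every "rank" is a Python list.
# Function names are illustrative, not TensorRT or NCCL calls.

def all_reduce(bufs):
    """Every rank ends with the elementwise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]

def all_gather(bufs):
    """Every rank ends with all buffers concatenated in rank order."""
    gathered = [x for buf in bufs for x in buf]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs):
    """Buffers are summed, then rank r keeps only chunk r of the result."""
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // len(bufs)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(bufs))]

def all_to_all(bufs):
    """Rank src sends chunk dst of its buffer to rank dst."""
    n = len(bufs)
    chunk = len(bufs[0]) // n
    return [
        [x for src in range(n) for x in bufs[src][dst * chunk:(dst + 1) * chunk]]
        for dst in range(n)
    ]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
for op in (all_reduce, all_gather, reduce_scatter, all_to_all):
    print(op.__name__, op(ranks))
```

In a real deployment each rank is a separate GPU and the exchange runs over NCCL; the sketch only shows which elements end up where.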

The release also highlights a production deployment path: models can be authored in PyTorch, converted out-of-framework using Torch-TensorRT, and deployed as optimized TensorRT engines in C++ inference applications. NVIDIA explicitly cites edge-ready multi-device production deployments as a target use case.

How do context parallelism strategies compare for long-sequence media workloads?

For long-sequence attention workloads, the release focuses on context parallelism and evaluates three strategies: AllGather KV, Ring Attention, and DeepSpeed Ulysses. In AllGather KV, each rank exchanges its key and value shards via an AllGather before computing attention. This costs one additional collective per attention block but shrinks the local Q×Kᵀ multiplication in proportion to the number of ranks. Ring Attention overlaps communication and computation in a ring topology, streaming K and V between neighbors; combined with an online softmax, this means full-size K and V tensors never need to be materialized on any GPU. DeepSpeed Ulysses partitions each sample along the sequence dimension, performs an all-to-all on Q, K, and V so each GPU sees the full sequence for a subset of heads, computes attention in parallel, then uses a second all-to-all to regroup the outputs.
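The three strategies above can be sketched as a single-process simulation in plain Python that checks each one against ordinary full-sequence attention. This is an illustration of the data flow only, not NVIDIA's implementation: all function names are hypothetical, no real communication happens, the Ulysses sketch assumes one attention head per rank, and the online softmax is updated one key at a time for brevity where real kernels work on blocks.

```python
import math
import random

def attention(q, k, v):
    """Reference single-head softmax attention; q, k, v are lists of d-dim rows."""
    d = len(q[0])
    out = []
    for qi in q:
        s = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / sum(w) for c in range(d)])
    return out

def concat(shards):
    """Join per-rank shards along the sequence axis (what an AllGather produces)."""
    return [row for shard in shards for row in shard]

def allgather_kv(q_sh, k_sh, v_sh):
    """Each rank gathers full K and V, then attends with only its local Q rows."""
    k_full, v_full = concat(k_sh), concat(v_sh)
    return [attention(q, k_full, v_full) for q in q_sh]

def ring_attention(q_sh, k_sh, v_sh):
    """K/V shards rotate around the ring; an online softmax folds each one in."""
    P, d = len(q_sh), len(q_sh[0][0])
    outputs = []
    for rank, q in enumerate(q_sh):
        m = [-math.inf] * len(q)      # running row maximum
        z = [0.0] * len(q)            # running softmax denominator
        acc = [[0.0] * d for _ in q]  # running unnormalized output
        for step in range(P):
            src = (rank - step) % P   # shard held by this rank after `step` hops
            for kj, vj in zip(k_sh[src], v_sh[src]):
                for i, qi in enumerate(q):
                    s = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                    m_new = max(m[i], s)
                    scale, p = math.exp(m[i] - m_new), math.exp(s - m_new)
                    z[i] = z[i] * scale + p
                    acc[i] = [a * scale + p * c for a, c in zip(acc[i], vj)]
                    m[i] = m_new
        outputs.append([[a / z[i] for a in acc[i]] for i in range(len(q))])
    return outputs

def all_to_all(send):
    """send[src][dst] becomes recv[dst][src]."""
    return [list(col) for col in zip(*send)]

def ulysses(q_sh, k_sh, v_sh):
    """Inputs are [rank][token][head][d]; assumes one head per rank for brevity."""
    P, t = len(q_sh), len(q_sh[0])
    def to_heads(sh):
        # First all-to-all: rank src sends head h of its local tokens to rank h.
        recv = all_to_all([[[tok[h] for tok in sh[src]] for h in range(P)] for src in range(P)])
        return [concat(chunks) for chunks in recv]  # [head][full sequence][d]
    q, k, v = to_heads(q_sh), to_heads(k_sh), to_heads(v_sh)
    out = [attention(q[h], k[h], v[h]) for h in range(P)]
    # Second all-to-all: rank h sends token block dst back to rank dst.
    recv = all_to_all([[out[h][dst * t:(dst + 1) * t] for dst in range(P)] for h in range(P)])
    return [[[recv[r][h][i] for h in range(P)] for i in range(t)] for r in range(P)]

def head(sh, h):
    """Select one head from [rank][token][head][d] shards."""
    return [[tok[h] for tok in shard] for shard in sh]

def max_diff(a, b):
    """Largest absolute elementwise difference between two nested lists."""
    flat = lambda x: [x] if isinstance(x, float) else [y for e in x for y in flat(e)]
    return max(abs(u - w) for u, w in zip(flat(a), flat(b)))

random.seed(0)
P, t, d = 4, 3, 8  # ranks (also heads here), tokens per rank, head dimension
def rand_shards():
    return [[[[random.gauss(0, 1) for _ in range(d)] for _ in range(P)]
             for _ in range(t)] for _ in range(P)]
Q, K, V = rand_shards(), rand_shards(), rand_shards()

U = ulysses(Q, K, V)
worst = {"allgather_kv": 0.0, "ring": 0.0, "ulysses": 0.0}
for h in range(P):
    qh, kh, vh = head(Q, h), head(K, h), head(V, h)
    ref = attention(concat(qh), concat(kh), concat(vh))
    worst["allgather_kv"] = max(worst["allgather_kv"], max_diff(concat(allgather_kv(qh, kh, vh)), ref))
    worst["ring"] = max(worst["ring"], max_diff(concat(ring_attention(qh, kh, vh)), ref))
    worst["ulysses"] = max(worst["ulysses"], max_diff(concat(head(U, h)), ref))
print(worst)  # largest deviation from full attention, per strategy
```

All three produce the same attention output up to floating-point rounding; they differ in what is communicated and when, which is what the benchmarks measure.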

Benchmarks were run on a single node with 8 GPUs using two generative pipelines: a video pipeline based on NVIDIA Cosmos 3 and an image pipeline based on FLUX.1. For diffusion-based media generation at extreme context lengths, described as "in the order of tens of thousands of input tokens," DeepSpeed Ulysses consistently delivered the lowest latency. The results also show Ring Attention scaling strongly up to 4 GPUs in the FLUX.1 image-generation case.

Why it matters

Long-context attention dominates compute and memory costs in high-resolution image and multi-frame video diffusion pipelines because attention scales quadratically with sequence length. Providing native multi-GPU inference primitives inside TensorRT means developers can keep runtime optimizations such as kernel fusion, memory planning, and quantization, while scaling beyond single-GPU memory and compute limits. The direct integration with NCCL allows TensorRT to inherit topology-aware transport choices across NVLink, NVSwitch, PCIe, and InfiniBand for inference collectives.
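As a rough back-of-the-envelope check on the quadratic claim, the sketch below counts the floating-point operations in one head's Q×Kᵀ product and shows how splitting the query rows across ranks divides the per-rank work. The sequence lengths and the head dimension of 128 are illustrative assumptions, not figures from the release, and the function name is hypothetical.

```python
def qk_flops(seq_len, head_dim, ranks=1):
    """FLOPs per rank for one head's Q x K^T when Q rows are split across ranks.

    Each of the (seq_len / ranks) local query rows takes a dot product of
    length head_dim with all seq_len keys: 2 FLOPs per multiply-add.
    """
    return 2 * (seq_len // ranks) * seq_len * head_dim

HEAD_DIM = 128  # assumed for illustration
for n in (16_384, 32_768):
    single = qk_flops(n, HEAD_DIM)
    split = qk_flops(n, HEAD_DIM, ranks=8)
    print(f"{n} tokens: {single:.3e} FLOPs on 1 GPU, {split:.3e} per GPU on 8")
```

Doubling the sequence length quadruples the single-GPU cost, while spreading the query rows over 8 ranks cuts the per-rank cost by 8, before any communication overhead is counted.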

Those two factors combined make it easier to convert PyTorch research models into optimized C++ TensorRT deployments that can span multiple GPUs, including deployments targeted at edge hardware.

What to watch

Check performance on your own models for context lengths in the "tens of thousands" of tokens to confirm whether DeepSpeed Ulysses or Ring Attention is preferable. NVIDIA provides a working sample in the TensorRT repository and directs users to download TensorRT 11 from the NVIDIA Developer Portal as the next step.

Written by The Brieftide · Source: NVIDIA
