NVIDIA TensorRT 11.0: Multi-GPU inference, context parallelism
TensorRT 11.0 adds native IDistCollectiveLayer multi-device primitives; Cosmos 3 and FLUX.1 benchmarks favor DeepSpeed Ulysses for extreme context lengths.
TL;DR
- TensorRT 11.0 adds native IDistCollectiveLayer multi-device primitives; Cosmos 3 and FLUX.1 benchmarks favor DeepSpeed Ulysses for extreme context lengths.
- NVIDIA TensorRT 11.0 introduced native multi-device inference support on Jun 25, 2026, adding distributed communication primitives so a single network can execute across multiple GPUs.
- TensorRT 11.0 adds native multi-device inference via IDistCollectiveLayer and integrates NVIDIA NCCL, enabling the full set of NCCL distributed collectives to be used inside TensorRT networks.
NVIDIA TensorRT 11.0 introduced native multi-device inference support on Jun 25, 2026, adding distributed communication primitives so a single network can execute across multiple GPUs. The release integrates NVIDIA NCCL collectives and new IDistCollectiveLayer primitives, and benchmarks on single-node 8-GPU systems compare AllGather KV, Ring Attention, and DeepSpeed Ulysses for generative media pipelines.
What did TensorRT 11.0 add?
TensorRT 11.0 adds native multi-device inference via IDistCollectiveLayer and integrates NVIDIA NCCL, enabling the full set of NCCL distributed collectives to be used inside TensorRT networks. The supported collectives listed in the release are AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, AlltoAll, Gather, and Scatter. Distributed communication layers can now be inserted directly into a network, and the same network can be deployed across multiple GPU ranks; in principle, each rank could also run a different model.
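To make the data movement behind a few of these collectives concrete, here is a minimal pure-Python sketch that simulates ranks as list entries. It illustrates the general semantics of the collectives only; it does not use TensorRT or NCCL, and the function names are hypothetical helpers, not part of either API.

```python
def all_reduce(buffers):
    """Every rank receives the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]


def all_gather(shards):
    """Every rank receives the concatenation of all ranks' shards."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]


def reduce_scatter(buffers):
    """Sum element-wise across ranks, then rank r keeps only chunk r."""
    world = len(buffers)
    total = [sum(vals) for vals in zip(*buffers)]
    chunk = len(total) // world
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_to_all(chunks):
    """chunks[src][dst] is sent from src to dst; result[dst][src] is what dst received."""
    world = len(chunks)
    return [[chunks[src][dst] for src in range(world)] for dst in range(world)]


# Two simulated ranks, each holding a small buffer.
ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_reduce(ranks))
print(all_gather([[1, 2], [3, 4]]))
print(reduce_scatter(ranks))
print(all_to_all([["a0", "a1"], ["b0", "b1"]]))
```

In this toy model, AlltoAll is a transpose of the (source, destination) grid, which is the data movement the DeepSpeed Ulysses strategy relies on.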
The release also highlights a production deployment path: models can be authored in PyTorch, converted out-of-framework using Torch-TensorRT, and deployed as optimized TensorRT engines in C++ inference applications. The authors explicitly cite edge-ready multi-device production deployments as a target use case.
How do context parallelism strategies compare for long-sequence media workloads?
For long-sequence attention workloads, the TensorRT 11.0 release focuses on context parallelism and evaluates three strategies: AllGather KV, Ring Attention, and DeepSpeed Ulysses. In AllGather KV, each rank exchanges its key and value shards via an AllGather before computing attention, which costs one additional collective per attention block while shrinking the local Q×Kᵀ multiplication in proportion to the number of ranks. Ring Attention overlaps communication and computation in a ring topology, streaming K and V shards between ranks; combined with an online softmax, this means full-size K and V tensors never need to be materialized on any GPU. DeepSpeed Ulysses partitions each sample along the sequence dimension, performs an all-to-all on Q, K, and V so that each GPU sees the full sequence for a subset of attention heads, computes attention in parallel, and then uses a second all-to-all to regroup the outputs.
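The AllGather KV and Ring Attention strategies can be sketched in a few lines of pure Python for a single attention head. This is a numerical illustration of the general techniques, not TensorRT code: ranks are simulated as list entries, communication is replaced by list indexing, and every function name, shape, and seed below is an assumption made for the example. The ring variant keeps a running maximum and denominator per query row (the online softmax), so each rank only ever holds one K/V shard at a time.

```python
import math
import random


def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, with q, k, v as lists of row vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def ring_attention(q_shards, k_shards, v_shards):
    """Each rank keeps its Q shard and processes one K/V shard per ring step."""
    world = len(q_shards)
    d = len(q_shards[0][0])
    dv = len(v_shards[0][0])
    outputs = []
    for rank in range(world):
        q = q_shards[rank]
        m = [-math.inf] * len(q)       # running row maximum
        z = [0.0] * len(q)             # running softmax denominator
        acc = [[0.0] * dv for _ in q]  # running unnormalized output
        for step in range(world):
            src = (rank - step) % world  # shard that arrives at this step
            k, v = k_shards[src], v_shards[src]
            for i, qi in enumerate(q):
                scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
                new_m = max(m[i], max(scores))
                scale = math.exp(m[i] - new_m)  # rescale earlier partial sums
                w = [math.exp(s - new_m) for s in scores]
                z[i] = z[i] * scale + sum(w)
                acc[i] = [a * scale + sum(wj * vj[c] for wj, vj in zip(w, v))
                          for c, a in enumerate(acc[i])]
                m[i] = new_m
        outputs.append([[a / z[i] for a in acc[i]] for i in range(len(q))])
    return outputs


def shard(rows, world):
    """Split rows evenly along the sequence dimension."""
    n = len(rows) // world
    return [rows[r * n:(r + 1) * n] for r in range(world)]


def flatten(parts):
    return [row for part in parts for row in part]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


random.seed(0)
seq, dim, world = 8, 4, 4
q, k, v = ([[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq)]
           for _ in range(3))

ref = full_attention(q, k, v)
# AllGather KV: each rank attends from its Q shard to the fully gathered K and V.
allgather_kv = flatten([full_attention(qs, k, v) for qs in shard(q, world)])
ring = flatten(ring_attention(shard(q, world), shard(k, world), shard(v, world)))

err_allgather = max_abs_diff(ref, allgather_kv)
err_ring = max_abs_diff(ref, ring)
print(err_allgather, err_ring)
```

Both variants should agree with the single-device reference up to floating-point rounding; the difference between them is what each rank must hold in memory and when the communication happens.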
Benchmarks were run on a single node with 8 GPUs using two generative pipelines: a video pipeline based on NVIDIA Cosmos 3 and an image pipeline based on FLUX.1. For diffusion-based media generation at extreme context lengths, described as "in the order of tens of thousands of input tokens," the tests show DeepSpeed Ulysses consistently delivering the lowest latency. The report also notes that Ring Attention provides strong scaling up to 4 GPUs in the FLUX.1 image-generation case.
Why it matters
Long-context attention dominates compute and memory costs in high-resolution image and multi-frame video diffusion pipelines because attention scales quadratically with sequence length. Providing native multi-GPU inference primitives inside TensorRT means developers can keep runtime optimizations such as kernel fusion, memory planning, and quantization, while scaling beyond single-GPU memory and compute limits. The direct integration with NCCL allows TensorRT to inherit topology-aware transport choices across NVLink, NVSwitch, PCIe, and InfiniBand for inference collectives.
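As a rough illustration of the quadratic growth, the sketch below counts the entries of the Q×Kᵀ score matrix per attention head, and the share one rank computes when queries are split evenly across ranks while each rank still attends to all keys. The sequence lengths are hypothetical round numbers in the "tens of thousands" range, not figures from the benchmarks.

```python
def score_entries(seq_len, ranks=1):
    """Score-matrix entries one rank computes per head, assuming queries
    are split evenly across ranks and every rank attends to all keys."""
    assert seq_len % ranks == 0, "sequence must divide evenly across ranks"
    return (seq_len // ranks) * seq_len


# Hypothetical sequence lengths; doubling the length quadruples the work.
for n in (16_384, 32_768, 65_536):
    print(n, score_entries(n), score_entries(n, ranks=8))
```

At 32,768 tokens a single device computes 2^30 (about 1.07 billion) score entries per head; split across 8 ranks, each rank computes 2^27 (about 134 million).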
Together, retained runtime optimizations and NCCL integration make it easier to convert PyTorch research models into optimized C++ TensorRT deployments that span multiple GPUs, including deployments targeted at edge hardware.
What to watch
Check performance on your own models for context lengths in the "tens of thousands" of tokens to confirm whether DeepSpeed Ulysses or Ring Attention is preferable. NVIDIA provides a working sample in the TensorRT repository and directs users to download TensorRT 11 from the NVIDIA Developer Portal as the next step.
Written by The Brieftide · Source: NVIDIA