NVIDIA Confidential Computing: 98% performance, Blackwell GPUs
NVIDIA’s Confidential Computing secures models and data on Blackwell (HGX B300) while typically adding under 8% overhead to throughput and per-token latency.
TL;DR
- NVIDIA’s Confidential Computing secures models and data on Blackwell (HGX B300) while typically adding under 8% overhead to throughput and per-token latency.
- CC adds only small steady-state overheads to inference, typically under 8% for throughput and per-token latency across the tested workloads.
- NVIDIA’s measurements on an HGX B300 running Qwen 3.5 397B-A17B-FP8 show Δ% vs OFF values ranging from -2.0% to -7.5% at ISL/OSL = 1024/1024 and from -3.5% to -0.9% for the other tested ISL/OSL setting, depending on concurrency.
NVIDIA Confidential Computing (CC) embeds hardware-rooted security across Blackwell GPUs while keeping inference performance near native levels: NVIDIA measured CC at up to 98% of the performance of non-CC solutions using Qwen 3.5 on an HGX B300, with typical steady-state overheads under 8%.
What performance impact does Confidential Computing introduce?
CC adds only small steady-state overheads to inference, typically under 8% for throughput and per-token latency across the tested workloads. NVIDIA’s measurements on an HGX B300 running Qwen 3.5 397B-A17B-FP8 show Δ% vs OFF values (the change with CC on relative to CC off) ranging from -2.0% to -7.5% at ISL/OSL (input/output sequence length) = 1024/1024 and from -3.5% to -0.9% for the other tested setting, depending on concurrency.
The evaluation table in the post reports per-concurrency deltas for Throughput/GPU and Median TPOT (time per output token) across concurrency levels 4, 8, 16, 32, 64, 128 and 256. The largest throughput delta reported is -7.5% at concurrency 128 (ISL/OSL = 1024/1024), while many common concurrency points sit in the -2% to -6% range.
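The headline "98%" figure and the per-concurrency deltas are two views of the same number. The sketch below shows the arithmetic, assuming "Δ% vs OFF" means (CC on - CC off) / CC off; that definition and the function names are assumptions, not taken from NVIDIA's post. The sample values are the first delta column of the table in this brief.

```python
def delta_pct(cc_on: float, cc_off: float) -> float:
    """Relative change of the CC-on measurement versus CC-off, in percent (assumed definition)."""
    return (cc_on - cc_off) / cc_off * 100.0


def retained_pct(delta: float) -> float:
    """Share of CC-off throughput retained for a given throughput delta."""
    return 100.0 + delta


# Throughput deltas keyed by concurrency, copied from the first delta column of the table.
throughput_delta = {4: -2.0, 8: -2.6, 16: -5.3, 32: -6.3, 64: -6.2, 128: -7.5, 256: -4.6}

best = max(throughput_delta, key=throughput_delta.get)    # smallest overhead
worst = min(throughput_delta, key=throughput_delta.get)   # largest overhead

print(f"best:  concurrency {best}, retained {retained_pct(throughput_delta[best]):.1f}%")
print(f"worst: concurrency {worst}, retained {retained_pct(throughput_delta[worst]):.1f}%")
```

Under that reading, a -2.0% throughput delta corresponds to retaining 98% of CC-off throughput, and the -7.5% point corresponds to 92.5%.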
How does CC protect models and data during inference?
CC places a hardware root of trust on Blackwell GPUs: a private signing key is fused at manufacture and is never exposed to software, firmware, or the host. Remote attestation must succeed before any secrets are provisioned. The NVIDIA Remote Attestation Service (NRAS) validates a signed evidence bundle (the GPU hardware report combined with CPU TEE measurements) against a known-good reference integrity manifest.
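The comparison against a reference manifest can be pictured as follows. This is a minimal illustration only: it is not the NRAS API, the component names and manifest layout are hypothetical, and it deliberately omits the signature check on the evidence bundle that a real verifier performs first.

```python
import hashlib
import hmac

# Hypothetical reference integrity manifest: component name -> expected measurement (hex digest).
REFERENCE_MANIFEST = {
    "gpu_firmware": hashlib.sha256(b"known-good gpu firmware").hexdigest(),
    "cpu_tee": hashlib.sha256(b"known-good guest image").hexdigest(),
}


def find_mismatches(evidence: dict[str, str], manifest: dict[str, str]) -> list[str]:
    """Return the components whose reported measurement is missing or differs from the manifest."""
    mismatches = []
    for name, expected in manifest.items():
        reported = evidence.get(name)
        # compare_digest avoids leaking how many leading characters matched.
        if reported is None or not hmac.compare_digest(reported, expected):
            mismatches.append(name)
    return mismatches


good_evidence = dict(REFERENCE_MANIFEST)
bad_evidence = dict(REFERENCE_MANIFEST, gpu_firmware=hashlib.sha256(b"modified").hexdigest())

print(find_mismatches(good_evidence, REFERENCE_MANIFEST))
print(find_mismatches(bad_evidence, REFERENCE_MANIFEST))
```

An empty result means every measured component matched its reference value; any non-empty result would block provisioning.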
After attestation, confidential workloads run in a verified Confidential VM and can receive secrets such as model decryption keys. According to NVIDIA, "the attestation handshake is typically a one-time startup event," so attestation adds no latency to individual inference requests once the workload is running. NVLink encryption and system-level protections extend confidential computing across multiple GPUs (up to 8 on HGX B200 and HGX B300).
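The "attest once at startup, then serve" pattern described above can be sketched as a small gate around key release. Everything here is hypothetical scaffolding for illustration (the class, the `attest` and `fetch_key` callables, and the placeholder inference); it does not reflect how NVIDIA's stack or any key broker is actually implemented.

```python
class ConfidentialSession:
    """Releases the model key only after a successful one-time attestation."""

    def __init__(self, attest, fetch_key):
        self._attest = attest        # hypothetical callable: True if the evidence verifies
        self._fetch_key = fetch_key  # hypothetical callable: returns the model decryption key
        self._key = None
        self.attestations = 0

    def start(self):
        # Attestation happens here, once, before any secret is provisioned.
        self.attestations += 1
        if not self._attest():
            raise PermissionError("attestation failed; key withheld")
        self._key = self._fetch_key()

    def infer(self, prompt):
        # No attestation on the request path; only a check that startup succeeded.
        if self._key is None:
            raise RuntimeError("session not started")
        return f"handled {prompt!r}"  # placeholder for real inference


session = ConfidentialSession(attest=lambda: True, fetch_key=lambda: b"model-key")
session.start()
replies = [session.infer(p) for p in ("a", "b", "c")]
print(len(replies), session.attestations)
```

The counter makes the point of the paragraph concrete: three requests are served, but attestation ran a single time.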
What changes were needed to preserve performance in CC mode?
NVIDIA and partner projects adjusted inference software to avoid CC-induced bottlenecks in secure launch latency and encrypted host-to-device bandwidth. Key optimizations include a CC-safe autotuner in FlashInfer that uses the GPU global timer register, an async device-to-host copy worker in SGLang to restore compute/copy overlap, and piecewise CUDA graph replay for prefill and mixed batches to reduce kernel launch overhead.
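The idea behind an async device-to-host copy worker is to move copies off the compute path so the next step can begin while the previous result is still being transferred. The sketch below is a CPU-only analogy using a thread and a queue; it is not SGLang's implementation, and `compute` and `copy_to_host` are stand-ins that simply sleep.

```python
import queue
import threading
import time


def compute(step):
    time.sleep(0.01)  # stand-in for a GPU forward step
    return step * step


def copy_to_host(value, sink):
    time.sleep(0.01)  # stand-in for an encrypted device-to-host copy
    sink.append(value)


def run_serial(steps):
    """Each copy blocks the next compute step."""
    sink = []
    for step in range(steps):
        copy_to_host(compute(step), sink)
    return sink


def run_overlapped(steps):
    """A background worker drains copies while the main thread keeps computing."""
    sink = []
    pending = queue.Queue()

    def copy_worker():
        while True:
            value = pending.get()
            if value is None:  # sentinel: no more work
                return
            copy_to_host(value, sink)

    worker = threading.Thread(target=copy_worker)
    worker.start()
    for step in range(steps):
        pending.put(compute(step))
    pending.put(None)
    worker.join()
    return sink


for runner in (run_serial, run_overlapped):
    start = time.perf_counter()
    runner(5)
    print(runner.__name__, round(time.perf_counter() - start, 3))
```

Both variants produce the same results in the same order; the overlapped one hides most of the copy time behind compute, which is the overlap the optimization aims to restore in CC mode.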
The benchmark setup used SGLang (Server) with the docker image docker.io/lmsysorg/sglang:v0.5.12-cu130, CUDA 13.2, NVIDIA driver 595.71.05, an Intel TDX platform, and a VM with GPU passthrough. Tested input/output token lengths included 8192/1024 and 1024/1024, at concurrency levels from 4 to 256 concurrent requests.
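For readers who want to reproduce the sweep, the reported test matrix can be enumerated as below. This is only a sketch of the grid named in this brief (two ISL/OSL pairs and seven concurrency levels); the dictionary keys are illustrative and do not correspond to any SGLang benchmark flags.

```python
from itertools import product

# ISL/OSL pairs and concurrency levels as reported in this brief.
ISL_OSL = [(8192, 1024), (1024, 1024)]
CONCURRENCY = [4, 8, 16, 32, 64, 128, 256]

# One entry per benchmark point; each would be run with CC off and again with CC on.
grid = [
    {"isl": isl, "osl": osl, "concurrency": c}
    for (isl, osl), c in product(ISL_OSL, CONCURRENCY)
]

print(len(grid))
print(grid[0])
```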
Why it matters
Hardware-rooted confidentiality removes a major obstacle for enterprises that must protect training data, proprietary models, or regulated workloads during active inference. The measured overheads, such as the reported up-to-7.5% throughput hit in some configurations and frequent sub-5% deltas elsewhere, make a case that organizations can enable stronger protection while keeping production performance viable.
Those protection and performance characteristics matter for deployments requiring attestable integrity and encrypted interconnects across multiple GPUs, since CC combines a silicon-level signing key, NRAS-based attestation, and NVLink encryption to protect both code and data in use.
What to watch
Watch upstream framework integrations and community adoption: NVIDIA notes work with inference framework projects and lists SGLang and FlashInfer changes (PRs referenced) as fixes that reduce CC overheads. A useful milestone will be equivalent CC-optimized merges into mainstream inference frameworks and wider, independently reproduced benchmarks on additional models and hardware configurations.
| Concurrency | Δ% vs OFF | Δ% vs OFF | Δ% vs OFF | Δ% vs OFF |
|---|---|---|---|---|
| 4 | -2.0% | -1.6% | -3.5% | -3.6% |
| 8 | -2.6% | -2.4% | -2.8% | -2.9% |
| 16 | -5.3% | -4.9% | -2.8% | -3.0% |
| 32 | -6.3% | -7.8% | -1.0% | -0.9% |
| 64 | -6.2% | -6.8% | -2.3% | -2.4% |
| 128 | -7.5% | -8.1% | -3.5% | -3.5% |
| 256 | -4.6% | -4.1% | -3.6% | -3.7% |
Written by The Brieftide · Source: NVIDIA