NVIDIA XR AI public beta: build intelligent XR agents
Open-source XR AI library in public beta connects AR glasses and headsets to GPU-accelerated Cosmos, Nemotron and MCP services.
TL;DR
- Open-source XR AI library in public beta connects AR glasses and headsets to GPU-accelerated Cosmos, Nemotron and MCP services.
- The repository includes sample agents, model-server launchers, MCP servers, web clients and core media infrastructure so developers can prototype intelligent XR agents.
- NVIDIA XR AI is a modular foundation for building intelligent XR agents that combine live camera and microphone streams, multimodal models, enterprise connectors, and optional spatial rendering.
On Jun 16, 2026, NVIDIA released XR AI in public beta. The open-source library connects AR glasses, AI glasses, and XR headsets to GPU-accelerated AI services running in cloud, data center, workstation, or edge environments. The repository includes sample agents, model-server launchers, MCP servers, web clients and core media infrastructure so developers can prototype intelligent XR agents.
What is NVIDIA XR AI and what does it include?
NVIDIA XR AI is a modular foundation for building intelligent XR agents that combine live camera and microphone streams, multimodal models, enterprise connectors, and optional spatial rendering. The stack centers on an XR Media Hub for routing media, NVIDIA Cosmos VLMs for visual grounding, NVIDIA Nemotron models for language and tool calling, Model Context Protocol (MCP) servers for enterprise connectivity, and optional CloudXR for rendered spatial content.
The public beta repository documents how video pixels can remain in shared memory while metadata flows through the system, enabling agents to retrieve image data only when required and letting developers swap clients, models, MCP servers, orchestration frameworks, and deployment environments without rebuilding agents.
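The pixels-stay-put, metadata-flows idea can be illustrated with a minimal Python sketch built on the standard library's multiprocessing.shared_memory. This is not the XR Media Hub API: FrameMeta, publish_frame and fetch_pixels are hypothetical names chosen for illustration, and the real library may organize frames quite differently.

```python
from dataclasses import dataclass
from multiprocessing import shared_memory


@dataclass(frozen=True)
class FrameMeta:
    """Hypothetical descriptor that travels through the system instead of pixels."""
    shm_name: str        # name of the shared-memory block holding the pixels
    participant_id: str  # which client the frame came from
    width: int
    height: int
    channels: int

    @property
    def nbytes(self) -> int:
        return self.width * self.height * self.channels


def publish_frame(pixels: bytes, participant_id: str, width: int, height: int,
                  channels: int = 3):
    """Copy pixels into shared memory once and return (block, metadata)."""
    expected = width * height * channels
    if expected == 0 or len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pixels)}")
    shm = shared_memory.SharedMemory(create=True, size=expected)
    shm.buf[:expected] = pixels
    return shm, FrameMeta(shm.name, participant_id, width, height, channels)


def fetch_pixels(meta: FrameMeta) -> bytes:
    """Attach to the block named in the metadata only when pixels are needed."""
    shm = shared_memory.SharedMemory(name=meta.shm_name)
    try:
        # The OS may round the block size up, so slice to the real frame size.
        return bytes(shm.buf[:meta.nbytes])
    finally:
        shm.close()


if __name__ == "__main__":
    block, meta = publish_frame(bytes(2 * 2 * 3), "alice", width=2, height=2)
    try:
        # Most components would only ever look at `meta`; an agent that
        # decides it needs the image calls fetch_pixels(meta).
        print(meta.participant_id, len(fetch_pixels(meta)))
    finally:
        block.close()
        block.unlink()
```

In this sketch only the small FrameMeta record would be passed between components; the image bytes are read from shared memory by whichever component decides it needs them.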
How do developers build a working XR agent?
Developers can clone the public beta repository and run sample agents to reach a working multimodal agent in a few steps. The repo instructions begin with git clone https://github.com/NVIDIA/xr-ai.git, then start shared AI services using the example command sequence shown in the repository (cd agent-samples/model-servers; uv sync; uv run model_servers), and run a sensor-first example with uv run simple_vlm_example.
The model server stack in the repository includes nvidia/parakeet-tdt-0.6b-v3 for speech-to-text, nvidia/Cosmos-Reason1-7B for vision-language reasoning, nvidia/Llama-3.1-Nemotron-Nano-8B-v1 for fast language responses, and NVIDIA-Nemotron-3-Nano-30B-A3B for deeper tool-calling workflows. The simple_vlm_example prints a web client URL and authentication token. Once a client connects, it streams camera and microphone data to the XR Media Hub, speech is converted to text, the latest frame is analyzed by a Cosmos-powered VLM path, and the agent returns both text and synthesized audio. NVIDIA sums up the result: "This is now a working intelligent XR agent." The repository also includes MCP servers such as vlm-mcp, video-mcp, render-mcp, oxr-mcp, vec-mcp, and transcript-mcp for XR-specific enterprise workflows.
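The request flow described above (speech to text, latest frame to a VLM, reply as text plus synthesized audio) might be sketched in Python as follows. This is not the code of simple_vlm_example: SimpleVlmAgent, AgentReply and the three injected callables are hypothetical stand-ins for the real speech-to-text, vision-language and speech-synthesis services.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentReply:
    text: str     # answer shown to the user
    audio: bytes  # synthesized speech for the same answer


class SimpleVlmAgent:
    """Hypothetical sensor-first agent: keeps only the newest camera frame."""

    def __init__(
        self,
        transcribe: Callable[[bytes], str],             # stand-in for speech-to-text
        describe_frame: Callable[[bytes, str], str],    # stand-in for the VLM call
        synthesize: Callable[[str], bytes],             # stand-in for text-to-speech
    ) -> None:
        self._transcribe = transcribe
        self._describe_frame = describe_frame
        self._synthesize = synthesize
        self._latest_frame: Optional[bytes] = None

    def on_frame(self, frame: bytes) -> None:
        # Overwrite rather than queue: only the latest frame is analyzed.
        self._latest_frame = frame

    def on_utterance(self, audio: bytes) -> AgentReply:
        question = self._transcribe(audio)
        if self._latest_frame is None:
            text = "I have not received a camera frame yet."
        else:
            text = self._describe_frame(self._latest_frame, question)
        return AgentReply(text=text, audio=self._synthesize(text))


if __name__ == "__main__":
    # Trivial fake services so the sketch runs without any model servers.
    agent = SimpleVlmAgent(
        transcribe=lambda audio: audio.decode("utf-8"),
        describe_frame=lambda frame, q: f"Saw {len(frame)} bytes while asked: {q}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    agent.on_frame(b"\x00" * 16)
    print(agent.on_utterance(b"What am I looking at?").text)
```

Injecting the three services as callables mirrors the article's point that models can be swapped without rebuilding the agent, though the actual repository may wire them up differently.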
Why it matters
XR AI addresses an integration gap: devices are available but end-to-end AI experiences require live media routing, multimodal models, enterprise data access, and orchestration. By separating media transport, model services, tool access, orchestration, and client delivery, XR AI reduces unnecessary inference and data movement while enabling multi-user and multi-agent scenarios where participant identity routes responses back to the correct client. That mix makes it practical to prototype hands-busy workflows for field service, remote assistance, industrial operations, healthcare, and training.
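As a rough illustration of participant identity routing responses back to the correct client, the Python sketch below keeps one outbox per connected participant. ResponseRouter and its methods are hypothetical names, not part of the XR AI library, and a real deployment would deliver messages over a network connection rather than an in-memory queue.

```python
from collections import deque
from typing import Deque, Dict, List


class ResponseRouter:
    """Hypothetical router: one outbox per participant identity."""

    def __init__(self) -> None:
        self._outboxes: Dict[str, Deque[str]] = {}

    def connect(self, participant_id: str) -> None:
        # Register a client; reconnecting keeps any undelivered messages.
        self._outboxes.setdefault(participant_id, deque())

    def route(self, participant_id: str, message: str) -> None:
        # Deliver an agent response only to the participant who asked.
        if participant_id not in self._outboxes:
            raise KeyError(f"unknown participant: {participant_id}")
        self._outboxes[participant_id].append(message)

    def drain(self, participant_id: str) -> List[str]:
        # Return and clear everything queued for this participant.
        outbox = self._outboxes[participant_id]
        messages = list(outbox)
        outbox.clear()
        return messages


if __name__ == "__main__":
    router = ResponseRouter()
    router.connect("technician-1")
    router.connect("technician-2")
    router.route("technician-1", "Valve A is open.")
    router.route("technician-2", "Panel B is missing a fastener.")
    print(router.drain("technician-1"))
```

Keying every response by participant identity is what would let several users, or several agents, share the same media and model services without seeing each other's answers.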
The project is already drawing applied research interest: the Cong Lab at the Stanford School of Medicine and the Wang Lab at Princeton have explored XR and AI workflows for stem cell therapy research, and Siemens is exploring XR AI together with NVIDIA DGX Spark in a research context for factory engineering tasks. The inclusion of tools such as NVIDIA Video Search and Summarization (VSS) points toward searchable visual knowledge capture and retrieval over time.
What to watch
Watch whether research pilots from academic labs and Siemens move toward production deployments and whether the public repo attracts integrations for domain-specific MCP servers and RAG pipelines. Also note adoption signals around the NeMo Agent Toolkit examples for MCP integration and multi-agent orchestration, which the repository references as an orchestration option.
Written by The Brieftide · Source: NVIDIA