Coding Agents · June 17, 2026 · 4 min read

ENPIRE: Nvidia's AI coding agents train robots to 99% success

A fleet of eight dual-arm robot stations using ENPIRE hit up to 99 percent success on tasks like Push-T and pin insertion.

The Brieftide · June 17, 2026

TL;DR

01 A fleet of eight dual-arm robot stations using ENPIRE hit up to 99 percent success on tasks like Push-T and pin insertion.
02 Researchers from Nvidia, Carnegie Mellon University, and UC Berkeley built ENPIRE, a system that uses AI coding agents to train real robots with minimal human oversight.
03 The team ran a fleet of eight dual-arm YAM robot stations and achieved, the study says, "up to 99 percent success" on demanding tasks including the Push-T test and pin insertion.

Researchers from Nvidia, Carnegie Mellon University, and UC Berkeley built ENPIRE, a system that uses AI coding agents to train real robots with minimal human oversight. The team ran a fleet of eight dual-arm YAM robot stations and achieved, the study says, "up to 99 percent success" on demanding tasks including the Push-T test and pin insertion.

How does ENPIRE work?

ENPIRE runs as a two-phase feedback loop on real hardware. First, the agent builds the evaluation tools and a safe workspace with a small amount of human input. Then it runs autonomously, writing and editing its own training code. In phase one, the agent needs only a few minutes of example video showing successful and failed attempts to produce its own reward function, along with automatic reset and success checks. In phase two, the agent reads papers, forms hypotheses, and selects methods such as behavior cloning or reinforcement learning based on real-world success signals.

The reward functions and evaluation tools are task specific but reusable. For pin insertion the agent combined visual alignment, gripper height, and estimated force. For closing a cable tie it combined two camera angles to avoid false positives and pushed reaction time below 150 milliseconds.
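A success check that combines several signals can be pictured roughly as follows. This is a minimal sketch, not the code the agents wrote: the class `PinObservation`, the functions `pin_inserted` and `cable_tie_closed`, the units, and every threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PinObservation:
    # All fields and units are assumptions made for this sketch.
    alignment_error_mm: float  # visual offset between pin and hole
    gripper_height_mm: float   # gripper height above the fixture
    estimated_force_n: float   # estimated contact force


def pin_inserted(obs, max_error_mm=1.0, max_height_mm=5.0, min_force_n=2.0):
    """Report success only when all three signals agree (thresholds are made up)."""
    aligned = obs.alignment_error_mm <= max_error_mm
    seated = obs.gripper_height_mm <= max_height_mm
    in_contact = obs.estimated_force_n >= min_force_n
    return aligned and seated and in_contact


def cable_tie_closed(camera_a_closed, camera_b_closed):
    """Require both camera views to agree, so one misleading view cannot pass."""
    return camera_a_closed and camera_b_closed


print(pin_inserted(PinObservation(0.4, 3.0, 2.5)))   # all three conditions hold
print(pin_inserted(PinObservation(0.4, 12.0, 2.5)))  # gripper too high
print(cable_tie_closed(True, False))                 # views disagree
```

Requiring every signal to agree trades some missed successes for fewer false positives, which matches the motivation the article gives for the two-camera check.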

How well do the agents perform and scale?

A fleet of eight agents reduced time to solution and improved absolute success. On the Push-T test, going from one agent to eight cut the time to full success from about five hours to two hours; for pin insertion the time dropped from over 90 minutes to roughly 40. The study reports success rates of up to 99 percent on tasks such as Push-T, sorting pins into a box, and cutting a cable tie, and notes that the pin insertion strategy converged to 100 percent success faster than a comparable human-in-the-loop method.
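The reported times imply a speedup well short of eightfold. The arithmetic below is a back-of-the-envelope sketch that treats the article's approximate figures as exact values (five hours to two hours, 90 minutes to 40 minutes). The function names are hypothetical, and "parallel efficiency" is the standard speedup-per-worker ratio rather than a metric taken from the study.

```python
def speedup(time_one_agent, time_n_agents):
    """How many times faster the fleet reached the goal than a single agent."""
    return time_one_agent / time_n_agents


def parallel_efficiency(time_one_agent, time_n_agents, n_agents):
    """Speedup divided by fleet size; 1.0 would mean perfect linear scaling."""
    return speedup(time_one_agent, time_n_agents) / n_agents


# Approximate figures from the article, treated as exact for illustration.
push_t_speedup = speedup(5.0, 2.0)             # hours
pin_speedup = speedup(90.0, 40.0)              # minutes
push_t_eff = parallel_efficiency(5.0, 2.0, 8)
pin_eff = parallel_efficiency(90.0, 40.0, 8)

print(f"Push-T: {push_t_speedup:.2f}x speedup, {push_t_eff:.0%} efficiency")
print(f"Pin insertion: {pin_speedup:.2f}x speedup, {pin_eff:.0%} efficiency")
# prints:
# Push-T: 2.50x speedup, 31% efficiency
# Pin insertion: 2.25x speedup, 28% efficiency
```

Because the article gives the single-agent pin insertion time only as "over 90 minutes", the 2.25x figure is best read as a rough lower bound.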

The system coordinates experiments across stations through Git. Each station's agent tests different hypotheses in parallel, shares results via version control, and adopts successful recipes discovered elsewhere in the fleet. The researchers tested three current coding agents: Codex with GPT-5.5, Claude Code with Opus 4.7, and Kimi Code with Kimi K2.6; Codex performed best in most cases.
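The sharing step can be sketched in a few lines. In this illustration an in-memory list stands in for the shared Git history, and the names `Result`, `best_recipe`, and `next_recipe`, the recipe labels, and the success rates are all made-up placeholders rather than values or code from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    station: int
    recipe: str          # label for a training recipe (hypothetical)
    success_rate: float  # measured on the real robot, between 0 and 1


def best_recipe(shared_log):
    """Return the highest-scoring result in the shared log, or None if it is empty."""
    return max(shared_log, key=lambda r: r.success_rate, default=None)


def next_recipe(own_recipe, own_rate, shared_log):
    """Adopt a peer's recipe only when it beats the station's own result."""
    best = best_recipe(shared_log)
    if best is not None and best.success_rate > own_rate:
        return best.recipe
    return own_recipe


# Placeholder results standing in for what stations would commit to Git.
shared_log = [
    Result(1, "behavior_cloning", 0.62),
    Result(2, "reinforcement_learning", 0.81),
    Result(3, "behavior_cloning_augmented", 0.74),
]

print(next_recipe("behavior_cloning", 0.62, shared_log))
```

A real fleet would also have to handle merge conflicts and stale results, which this sketch leaves out.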

ENPIRE also exposes the gap between simulation and the real world. All three agents solved the Push-T task in simulation, but two out of three failed in the real environment. The authors attribute real-world failures to unpredictable dynamics, friction, and object movement. In the RoboCasa simulation ENPIRE beat an end-to-end vision-language-action model called GR00T and a tool-based approach without autoresearch called CaP-X.

To measure efficiency the researchers propose two metrics: Mean Robot Utilization (MRU), which tracks how much research time robots actually spend working, and Mean Token Utilization (MTU), which counts language model usage per minute. They also show skill transfer: experience from pin insertion helped agents slot GPUs into a motherboard using the robot arms.
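One plausible reading of the two metrics is sketched below. The article describes MRU and MTU only in words, so the formulas, function names, and sample numbers here are assumptions, and the definitions in the paper may differ.

```python
def mean_robot_utilization(active_minutes, total_minutes):
    """Average fraction of research time each robot spent working (assumed formula)."""
    fractions = [a / t for a, t in zip(active_minutes, total_minutes)]
    return sum(fractions) / len(fractions)


def mean_token_utilization(tokens_used, total_minutes):
    """Average language model tokens consumed per minute per agent (assumed formula)."""
    rates = [k / t for k, t in zip(tokens_used, total_minutes)]
    return sum(rates) / len(rates)


# Made-up numbers for two stations over a 60-minute session.
active = [30.0, 45.0]
total = [60.0, 60.0]
tokens = [120_000, 60_000]

print(mean_robot_utilization(active, total))  # 0.625
print(mean_token_utilization(tokens, total))  # 1500.0
```

Under this reading, a robot that sits idle while its agent reads logs lowers MRU without lowering MTU, which is the pattern the study flags as fleets grow.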

Why it matters

ENPIRE reduces the constant human involvement that has long slowed robot learning by automating evaluation, experiment design, and code edits. The fleet model shows clear time savings as multiple agents explore hypotheses in parallel, and the agents' ability to write reusable reward checks cuts engineering overhead. At the same time, the study documents practical limits: agents spend large amounts of time reading logs and summarizing peers' results, per-robot utilization falls as fleets grow, and token costs scale faster than performance gains.

What to watch

Track MRU and MTU as the next concrete signals: rising MRU or falling MTU per successful skill would show tighter efficiency, while improvements closing the gap between simulation and real-world success would confirm robustness. Also watch whether agent-discovered checks and training recipes generalize beyond the tested tasks and hardware.

ENPIRE task outcomes, timing and agent notes

Item	Success rate	Time with one agent	Time with eight agents	Notes
Push-T	up to 99 percent	about five hours	two hours	Solved in simulation by all three; in real world two of three agents failed
Pin insertion	100 percent (strategy converged faster than human-in-the-loop)	over 90 minutes	roughly 40 minutes	Agent built check using visual alignment, gripper height, estimated force
Close cable tie	up to 99 percent	—	—	Combined two camera angles to avoid false positives; reaction time below 150 milliseconds
Agent comparison	varied	—	—	Codex (GPT-5.5) performed best in most cases; Claude Code (Opus 4.7) and Kimi Code (Kimi K2.6) were also tested

Written by The Brieftide · Source: The Decoder
