PEVA whole-body egocentric video prediction with 16s rollouts
An autoregressive conditional diffusion transformer trained on Nymeria predicts next-frame egocentric video from past frames and a 48-D whole-body pose action.
TL;DR
- An autoregressive conditional diffusion transformer trained on Nymeria predicts next-frame egocentric video from past frames and a 48-D whole-body pose action.
- Berkeley AI Research built PEVA, a model that predicts egocentric video from human actions by conditioning on past video frames and an action specifying a desired change in 3D pose.
- PEVA conditions video prediction on a structured, high-dimensional action representation that captures whole-body kinematics.
Berkeley AI Research built PEVA, a model that predicts egocentric video from human actions by conditioning on past video frames and an action specifying a desired change in 3D pose. PEVA is trained on Nymeria, a dataset pairing real-world egocentric video with body pose capture, and it can generate atomic actions, simulate counterfactuals, and produce coherent long rollouts, including 16-second videos.
How PEVA represents action and models video
PEVA conditions video prediction on a structured, high-dimensional action representation that captures whole-body kinematics. Each action encodes global translation and relative joint rotations based on the body kinematic tree: 3 degrees of freedom for root translation plus 15 upper-body joints, each with 3 Euler angles for its relative rotation, giving a 48-dimensional action space (3 + 15 × 3 = 48). Motion-capture data are aligned with video via timestamps, converted from global coordinates to a pelvis-centered local frame for position and orientation invariance, and normalized before training.
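The 48-dimensional layout can be illustrated with a minimal sketch. The function names, the row-vector convention, and the per-dimension standardization are assumptions made for illustration; the source says only that the data are moved to a pelvis-centered frame and normalized.

```python
import numpy as np

NUM_JOINTS = 15  # upper-body joints, as stated in the text


def to_pelvis_frame(points, pelvis_pos, pelvis_rot):
    """Express world-space points (N, 3) in a pelvis-centered frame.

    pelvis_rot is assumed to be a 3x3 world-from-pelvis rotation matrix,
    so local = R^T (p - t), written here for row vectors as (p - t) @ R.
    """
    points = np.asarray(points, dtype=np.float64)
    return (points - np.asarray(pelvis_pos, dtype=np.float64)) @ np.asarray(pelvis_rot)


def build_action(root_translation, joint_euler):
    """Flatten root translation (3,) and joint Euler angles (15, 3) into 48-D."""
    root_translation = np.asarray(root_translation, dtype=np.float64)
    joint_euler = np.asarray(joint_euler, dtype=np.float64)
    if root_translation.shape != (3,):
        raise ValueError("root_translation must have shape (3,)")
    if joint_euler.shape != (NUM_JOINTS, 3):
        raise ValueError("joint_euler must have shape (15, 3)")
    return np.concatenate([root_translation, joint_euler.ravel()])


def normalize(actions, eps=1e-8):
    """Standardize each action dimension (a hypothetical normalization choice)."""
    actions = np.asarray(actions, dtype=np.float64)
    return (actions - actions.mean(axis=0)) / (actions.std(axis=0) + eps)


action = build_action([0.1, 0.0, 0.0], np.zeros((NUM_JOINTS, 3)))
print(action.shape)  # (48,)
```

The first three entries hold the root translation and the remaining 45 hold the joint angles, which matches the 3 + 15 × 3 count above.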
Architecturally, PEVA is an autoregressive conditional diffusion transformer. The design adapts the Conditional Diffusion Transformer used in Navigation World Models to whole-body human motion by adding three extensions: random timeskips to learn both short-term and longer-term dynamics, sequence-level training that applies loss over each frame prefix, and action embeddings that concatenate all actions at time t into a 1D tensor to condition each AdaLN layer for high-dimensional motion. At test time PEVA encodes context frames with a VAE encoder, adds noise to the target frame latent, and progressively denoises that latent using the diffusion model. To speed inference, the model restricts attention so that within-image attention is applied only to the target frame and context cross-attention is applied only for the last frame.
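As a rough illustration of action-conditioned AdaLN, the sketch below shows generic adaptive layer-norm modulation in NumPy. It is not PEVA's code: the single linear projection, the tensor sizes, and the use of two action vectors per step are all assumptions, and a real model would learn the projection.

```python
import numpy as np


def adaln_modulate(tokens, cond, weight, bias, eps=1e-6):
    """Layer-normalize tokens (T, D), then scale and shift them using cond (C,).

    weight (C, 2D) and bias (2D,) stand in for a learned projection that maps
    the conditioning vector to one scale and one shift per feature.
    """
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normed = (tokens - mean) / np.sqrt(var + eps)
    scale, shift = np.split(cond @ weight + bias, 2)
    return normed * (1.0 + scale) + shift


rng = np.random.default_rng(0)
actions_at_t = rng.normal(size=(2, 48))  # hypothetical: two 48-D actions at step t
cond = actions_at_t.reshape(-1)          # concatenated into one 1-D tensor, shape (96,)
tokens = rng.normal(size=(4, 8))         # 4 latent tokens with 8 features each
weight = rng.normal(size=(cond.size, 16)) * 0.01
bias = np.zeros(16)

out = adaln_modulate(tokens, cond, weight, bias)
print(out.shape)  # (4, 8)
```

Because the scale and shift depend only on the conditioning vector, every token in the layer is modulated by the same action signal.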
PEVA generates multi-frame predictions with an autoregressive rollout strategy. Starting from context frames, the model encodes them, appends the current action, predicts the next frame, adds that predicted frame to the context while dropping the oldest frame, and repeats this process for each action in the sequence before decoding predicted latents to pixel space with a VAE decoder.
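The rollout loop described above can be sketched as follows. `predict_next` is a hypothetical stand-in for the diffusion model's denoising step, and the toy predictor exists only to make the example runnable; VAE encoding and decoding are omitted.

```python
from collections import deque

import numpy as np


def rollout(context_latents, actions, predict_next):
    """Autoregressively predict one latent per action with a sliding context window."""
    # maxlen makes append() drop the oldest latent automatically.
    context = deque(context_latents, maxlen=len(context_latents))
    predicted = []
    for action in actions:
        next_latent = predict_next(list(context), action)
        context.append(next_latent)
        predicted.append(next_latent)
    return predicted


def toy_predict_next(context, action):
    """Placeholder dynamics: shift the latest latent by the action (not a real model)."""
    return context[-1] + action


context_latents = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
actions = [np.array([1.0])] * 3
frames = rollout(context_latents, actions, toy_predict_next)
print([float(f[0]) for f in frames])  # [3.0, 4.0, 5.0]
```

Each predicted latent becomes context for the next step, so errors can accumulate over long horizons; that is why coherent 16-second rollouts are a notable result.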
Atomic actions, rollouts and planning
To probe causal effects of joint-level motions on the egocentric view, PEVA is evaluated on atomic actions such as body movements (move forward, rotate left, rotate right) and hand movements (moving the left or right hand up, down, left, or right). The paper includes visual samples of these atomic actions and shows the model can maintain visual and semantic consistency over extended prediction horizons, with examples of coherent 16-second rollouts conditioned on full-body motion.
PEVA also enables visual planning by simulating multiple candidate action sequences and scoring them by perceptual similarity to a goal using LPIPS. The authors frame planning as an energy minimization problem and optimize action sequences with the Cross-Entropy Method, following the approach introduced in Navigation World Models (arXiv:2412.03572). Planning examples optimize sequences for either the left or right arm while holding other body parts fixed; representative cases include raising the right arm toward a mixing stick and reaching toward a kettle, though the method can miss coordinated multi-limb adjustments because it optimizes only a subset of body parts.
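A minimal Cross-Entropy Method sketch shows the shape of this optimization. The quadratic `toy_energy` is an assumption standing in for the LPIPS distance between a simulated final frame and the goal image, and the population size, elite count, and iteration count are arbitrary illustrative values.

```python
import numpy as np


def cem_plan(energy, horizon, action_dim, iters=20, pop=64, elite=8, seed=0):
    """Minimize energy(actions) over action sequences of shape (horizon, action_dim)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        samples = rng.normal(mean, std, size=(pop, horizon, action_dim))
        scores = np.array([energy(s) for s in samples])
        # Refit the distribution to the lowest-energy candidates.
        elites = samples[np.argsort(scores)[:elite]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean


target = np.full((3, 2), 0.5)  # hypothetical goal expressed in action space


def toy_energy(actions):
    """Stand-in for a perceptual score: squared distance to the target sequence."""
    return float(np.sum((actions - target) ** 2))


plan = cem_plan(toy_energy, horizon=3, action_dim=2)
print(toy_energy(plan) < toy_energy(np.zeros((3, 2))))
```

In a setup like the one described, each call to the energy function would require a full model rollout, so the candidate count directly drives planning cost. Restricting the search to one arm, as the paper's examples do, would amount to sampling only that arm's action dimensions and holding the rest fixed.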
Quantitatively, the paper reports that PEVA consistently outperforms baselines on perceptual metrics, maintains coherence over long horizons, and exhibits scaling properties where larger models lead to better performance. The evaluation includes comparisons on atomic action performance, video-quality metrics such as FID over time, and baseline perceptual metrics.
Why it matters
PEVA moves world models closer to embodied agents by conditioning video prediction on physically grounded, high-dimensional human actions from an egocentric perspective. Its combination of a structured 48-dimensional action space, autoregressive diffusion modeling, and action-conditioned rollouts demonstrates one way to simulate how whole-body motion changes first-person visuals, a core capability for visual planning and for anticipating outcomes in real-world tasks.
At the same time, the current approach highlights remaining gaps: planning is limited to simulating candidate arm actions without full trajectory optimization or closed-loop control, and the model lacks explicit conditioning on task intent or object-centric semantics. Those limitations constrain immediate application to robust robot control or interactive closed-loop systems.
What to watch
The next milestones to look for are extensions of PEVA to closed-loop control or interactive environments and integration of explicit goal conditioning or object-centric representations; achieving full trajectory optimization or coordinated multi-limb planning would be a concrete sign the approach is ready for embodied control. Another key signal will be demonstrations that move beyond LPIPS-based scoring to task metrics or closed-loop success rates in manipulation or navigation tasks.
Written by The Brieftide · Source: Berkeley AI Research