Multimodal AI · June 17, 2026 · 6 min read

MolmoMotion language-guided 3D motion forecasting released

MolmoMotion predicts object-centered 3D point trajectories from video and text; the release includes MolmoMotion-1M (1.16M videos) and PointMotionBench (2.7K clips).

The Brieftide · June 17, 2026

TL;DR

01 MolmoMotion predicts object-centered 3D point trajectories from video and text; the release includes MolmoMotion-1M (1.16M videos) and PointMotionBench (2.7K clips).
02 Given an RGB observation, a set of 3D query points, and a written action description, MolmoMotion predicts the future 3D trajectories of those points over the next few seconds.
03 MolmoMotion represents motion as object-attached 3D points in a shared world frame and uses Molmo 2 as its backbone to connect language instructions to image content.

MolmoMotion, a language-guided 3D motion forecasting model, was released on June 17, 2026, together with MolmoMotion-1M, a dataset drawn from 1.16M videos, and PointMotionBench, a human-validated benchmark containing 2.7K clips. Given an RGB observation, a set of 3D query points, and a written action description, MolmoMotion predicts the future 3D trajectories of those points over the next few seconds.

What is MolmoMotion and how does it represent motion?

MolmoMotion represents motion as object-attached 3D points in a shared world frame and uses Molmo 2 as its backbone to connect language instructions to image content. The model receives a short video history, an action description, and initial 3D query coordinates. It identifies the referred object and its query points, then predicts the future 3D trajectory of each point. Because the points are expressed in a world frame rather than the camera frame, the trajectories remain stable across camera motion and are directly usable by downstream systems.
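To make the world-frame idea concrete, here is a minimal sketch of why world coordinates stay put when the camera moves. The class names, field names, and the camera-to-world convention (p_world = R * p_cam + t) are assumptions made for illustration; the source does not describe MolmoMotion's actual input format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Matrix3 = Tuple[Point3D, Point3D, Point3D]


@dataclass
class CameraPose:
    """Hypothetical camera-to-world pose: p_world = R * p_cam + t."""
    rotation: Matrix3
    translation: Point3D


@dataclass
class ForecastRequest:
    """Hypothetical bundle of the three inputs the article lists."""
    frames: List[str]            # placeholder for the RGB video history
    action: str                  # written action description
    query_points: List[Point3D]  # initial 3D query coordinates (world frame)


def camera_to_world(p_cam: Point3D, pose: CameraPose) -> Point3D:
    """Map a camera-frame point into the shared world frame."""
    rotated = [sum(r * c for r, c in zip(row, p_cam)) for row in pose.rotation]
    x, y, z = (a + b for a, b in zip(rotated, pose.translation))
    return (x, y, z)


# Pose A: camera sits at the world origin with no rotation.
pose_a = CameraPose(
    rotation=((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),
    translation=(0.0, 0.0, 0.0),
)
# Pose B: camera rotated 90 degrees about z and shifted 0.5 along x.
pose_b = CameraPose(
    rotation=((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    translation=(0.5, 0.0, 0.0),
)

# The same physical point has different camera-frame coordinates per pose...
seen_from_a = (1.0, 2.0, 3.0)
seen_from_b = (2.0, -0.5, 3.0)

# ...but a single world-frame coordinate.
world_a = camera_to_world(seen_from_a, pose_a)
world_b = camera_to_world(seen_from_b, pose_b)

request = ForecastRequest(
    frames=["frame_000.png", "frame_001.png"],  # illustrative file names
    action="push the mug to the left",
    query_points=[world_a],
)
```

Both `world_a` and `world_b` come out as the same coordinate, which is the property that lets a downstream planner consume the trajectory without tracking where the camera was.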

MolmoMotion comes in two variants. The autoregressive variant, MolmoMotion-AR, writes out future coordinates step by step as quantized text, which encourages smooth rollouts when the future is well-defined. The flow-matching variant, MolmoMotion-FM, predicts trajectories in continuous 3D space by transforming noise into motion, and is designed to better represent uncertainty when multiple plausible futures exist.
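The two output styles can be sketched in a few lines. Everything below is an assumption for illustration: the 1 cm bin grid, the text layout, and especially `toy_velocity`, which stands in for a learned network by peeking at a known target. The source gives none of these details.

```python
import random
from typing import Tuple

Point3D = Tuple[float, float, float]

# --- Autoregressive style: coordinates written as quantized text -----------
# ASSUMPTION: 1 cm bins over [-5 m, 5 m]; the real bin layout is not stated.
LOW, HIGH, STEP = -5.0, 5.0, 0.01


def quantize(value: float) -> int:
    """Clamp a coordinate to the range and snap it to the nearest bin index."""
    clamped = min(max(value, LOW), HIGH)
    return round((clamped - LOW) / STEP)


def encode_point(p: Point3D) -> str:
    """Write one 3D point as space-separated bin indices."""
    return " ".join(str(quantize(v)) for v in p)


def decode_point(text: str) -> Point3D:
    """Invert encode_point, up to half a bin of rounding error."""
    x, y, z = (LOW + int(tok) * STEP for tok in text.split())
    return (x, y, z)


# --- Flow-matching style: integrate a velocity field from noise ------------
def toy_velocity(x: Point3D, t: float, target: Point3D) -> Point3D:
    """Stand-in for a learned velocity field (ASSUMPTION).

    For a straight-line path from noise to target, the velocity that
    carries x to the target by t = 1 is (target - x) / (1 - t).
    """
    vx, vy, vz = ((g - xi) / (1.0 - t) for xi, g in zip(x, target))
    return (vx, vy, vz)


def sample_flow(target: Point3D, steps: int = 10, seed: int = 0) -> Point3D:
    """Euler-integrate from Gaussian noise at t = 0 toward t = 1."""
    rng = random.Random(seed)
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    dt = 1.0 / steps
    for k in range(steps):
        v = toy_velocity(x, k * dt, target)
        x = (x[0] + dt * v[0], x[1] + dt * v[1], x[2] + dt * v[2])
    return x


point = (0.123, -1.0, 2.5)
tokens = encode_point(point)
restored = decode_point(tokens)
sampled = sample_flow(target=point)
```

In this toy setup every noise seed lands on the same target because the velocity field is an oracle. A learned field conditioned on video and text would instead map different noise draws to different plausible futures, which is the property the article attributes to the flow-matching variant.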

How does MolmoMotion perform and what data was released?

MolmoMotion outperforms the existing 3D motion forecasting methods tested on the authors' human-validated benchmark and ships with a large training corpus. The team assembled MolmoMotion-1M, described as the largest corpus of action-described, object-grounded 3D point trajectories to date, drawn from 1.16M videos, spanning 736 motion types and 5.6K distinct objects. PointMotionBench contains 2.7K clips, 111 object categories, and 61 motion types and measures how closely predicted 3D point trajectories match actual future motion.
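One common way to score how closely a forecast matches the real future is the mean Euclidean (L2) distance between predicted and ground-truth points. The sketch below assumes that metric and a simple nested-list layout; the source does not spell out PointMotionBench's exact scoring formula, so treat this as an illustration rather than the benchmark's definition.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[List[Point3D]]  # [time step][query point] -> (x, y, z)


def mean_l2_error(pred: Trajectory, truth: Trajectory) -> float:
    """Average Euclidean distance over all time steps and query points."""
    if len(pred) != len(truth):
        raise ValueError("trajectories must cover the same number of steps")
    total, count = 0.0, 0
    for pred_step, true_step in zip(pred, truth):
        if len(pred_step) != len(true_step):
            raise ValueError("each step must hold the same number of points")
        for p, q in zip(pred_step, true_step):
            total += math.dist(p, q)
            count += 1
    if count == 0:
        raise ValueError("trajectories are empty")
    return total / count


# Two time steps, two query points; one predicted point is off by 5 units.
truth = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
]
pred = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.0, 4.0, 1.0), (1.0, 0.0, 1.0)],
]
error = mean_l2_error(pred, truth)  # (0 + 0 + 5 + 0) / 4
```

A score of this kind rewards forecasts that land where the object actually went, not ones that merely look plausible.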

On PointMotionBench, MolmoMotion beat pixel-space video generators, parametric 3D methods, and a constant-velocity baseline across a range of objects, scenes, and actions. In robotics simulation, a control policy built on MolmoMotion succeeded on 76.3% of pick-and-place tasks versus 56.0% for the same policy built on Molmo 2. The MolmoMotion-based policy also learned faster, reaching 51% success after 10K training steps where the Molmo 2 version reached 19% at the same point. After fine-tuning on DROID, a large robot manipulation dataset, MolmoMotion reached in about 2K training steps the real-robot test L2 error that the Molmo 2 baseline needed 12K steps to reach.
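For context on the simplest comparison point, a constant-velocity baseline is usually taken to mean "keep moving at the last observed velocity." The sketch below follows that usual definition; the source does not say how the authors configured theirs, so the function name and the per-step velocity estimate are assumptions.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def constant_velocity_forecast(
    history: List[Point3D], horizon: int
) -> List[Point3D]:
    """Extrapolate one point by repeating its last observed per-step motion."""
    if len(history) < 2:
        raise ValueError("need at least two observations to estimate velocity")
    prev, last = history[-2], history[-1]
    vx, vy, vz = (b - a for a, b in zip(prev, last))
    return [
        (last[0] + k * vx, last[1] + k * vy, last[2] + k * vz)
        for k in range(1, horizon + 1)
    ]


# A point that moved +1.0 in x and +0.5 in z over the last step.
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
future = constant_velocity_forecast(observed, horizon=3)
```

A baseline like this ignores both the scene and the action description, so it cannot anticipate a change of direction. That makes it a useful floor for judging language-conditioned forecasters.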

MolmoMotion's predicted paths can also steer image-to-video models. Used to guide a generator, MolmoMotion improved motion quality on all five motion-related metrics the authors measured and beat a larger image-to-video model on four of the five metrics.

Why it matters

MolmoMotion shifts forecasting from pixel-space or category-specific templates to compact, object-centric 3D trajectories that are view-stable and directly consumable by robot planners and video generators. The dataset scale (1.16M videos) and the human-validated PointMotionBench (2.7K clips) give the field a common training resource and an evaluation that tests whether predicted motion matches true future motion rather than plausibility alone. The robotics results show those trajectories can materially speed learning and raise task success in control policies.

Limitations

Training used eight query points per object, a choice the authors note limits dense surface representation and handling of complex deformable motion. The automatic pipeline that produced MolmoMotion-1M filters, smooths, and clips tracks to mitigate depth and tracking noise, but the paper acknowledges raw tracks from unconstrained video remain challenging.
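As a rough picture of what filtering, smoothing, and clipping a noisy track can look like, here is a small sketch. It is not the authors' pipeline. The thresholds, the moving-average smoother, and the reading of "clip" as truncating a track at its first implausible jump are all assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def clip_at_jump(track: List[Point3D], max_step: float) -> List[Point3D]:
    """Truncate the track just before the first implausibly large jump."""
    for i in range(1, len(track)):
        if math.dist(track[i - 1], track[i]) > max_step:
            return track[:i]
    return list(track)


def smooth(track: List[Point3D], window: int = 3) -> List[Point3D]:
    """Centered moving average; the window shrinks at the track ends."""
    half = window // 2
    out: List[Point3D] = []
    for i in range(len(track)):
        chunk = track[max(0, i - half): i + half + 1]
        x, y, z = (sum(axis) / len(chunk) for axis in zip(*chunk))
        out.append((x, y, z))
    return out


def clean_track(
    track: List[Point3D], max_step: float = 0.5, min_len: int = 3
) -> Optional[List[Point3D]]:
    """Clip, drop tracks that end up too short, then smooth the rest."""
    clipped = clip_at_jump(track, max_step)
    if len(clipped) < min_len:
        return None  # filtered out as unreliable
    return smooth(clipped)


# A steady track with one depth glitch at index 4.
noisy = [
    (0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.0, 1.0),
    (0.3, 0.0, 1.0), (5.0, 0.0, 9.0), (0.5, 0.0, 1.0),
]
cleaned = clean_track(noisy)

# A track that glitches almost immediately is discarded.
too_short = clean_track([(0.0, 0.0, 1.0), (4.0, 0.0, 1.0), (0.1, 0.0, 1.0)])
```

Heuristics like these trade coverage for reliability: the glitch is removed, but so is everything the tracker saw after it.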

What to watch

Look for community reuse of MolmoMotion-1M and PointMotionBench to test alternative forecasters, and for follow-up work that increases point density to handle deformable objects. Two concrete milestones: independent teams publishing comparisons on PointMotionBench, and evidence that MolmoMotion-guided policies generalize across more robot platforms and manipulation tasks.

Selected MolmoMotion vs Molmo 2 baseline results

Item	MolmoMotion	Molmo 2 baseline
Pick-and-place success rate (simulation)	76.3%	56.0%
Success at 10K training steps (simulation)	51%	19%
Steps to match baseline test L2 error (real robots)	≈2K steps	12K steps
Training videos / scale	MolmoMotion-1M: 1.16M videos	n/a
Benchmark clips	PointMotionBench: 2.7K clips	n/a
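The gaps implied by the reported figures are easy to derive. The short calculation below uses only numbers stated in this brief; the variable names are illustrative, and because the 2K figure is approximate, the step ratio is approximate too.

```python
# Reported figures as (MolmoMotion, Molmo 2 baseline) pairs.
sim_success_pct = (76.3, 56.0)      # pick-and-place success, simulation
success_at_10k_pct = (51.0, 19.0)   # success after 10K training steps
steps_to_match_l2 = (2_000, 12_000) # about 2K vs 12K steps on real robots

# Absolute gains in percentage points.
success_gain_pp = round(sim_success_pct[0] - sim_success_pct[1], 1)
early_gain_pp = round(success_at_10k_pct[0] - success_at_10k_pct[1], 1)

# Relative improvement in final simulation success rate.
relative_gain_pct = round(100 * success_gain_pp / sim_success_pct[1], 1)

# How many times fewer training steps were needed (approximate).
step_ratio = steps_to_match_l2[1] / steps_to_match_l2[0]
```

That works out to a 20.3-point gain in simulation success, a 32-point lead at 10K steps, and roughly six times fewer steps to reach the same real-robot error.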

Written by The Brieftide · Source: Hugging Face
