Google Antigravity launch: DeepMind's multimodal reasoning model
DeepMind announced Google Antigravity this week, a multimodal model built for spatial and physical reasoning across images and video.
TL;DR
- DeepMind announced Google Antigravity this week, a multimodal model designed to reason about space, motion and physical interactions across images and video.
- The release includes a research writeup and interactive demos that highlight tasks such as object-centric reasoning, short-term prediction and vision-language question answering.
- Model weights were not published at announcement; access appears to be limited to hosted demos and the research paper for now.
DeepMind announced Google Antigravity this week, a multimodal model designed to reason about space, motion and physical interactions across images and video. The release includes a research writeup and interactive demos that highlight tasks such as object-centric reasoning, short-term prediction and vision-language question answering.
Antigravity connects a vision front end to temporal and physics-oriented reasoning components, aiming to bridge scene perception with causal and spatial inference. DeepMind positions the model for problems that require understanding object behavior and physical relations, not just captioning or classification.
What Antigravity does
Antigravity combines image and video encoders with a multimodal fusion stage and a dedicated reasoning module. The model accepts single images, short video clips and textual prompts, producing answers to questions about movement, object permanence, forces and likely next frames. Demonstrations released with the paper include tasks such as predicting trajectories behind occlusion, answering spatial questions about object arrangements and interpreting short video segments to infer interactions.
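To make the occlusion task concrete, here is a toy baseline for filling in positions while an object is hidden. It is not Antigravity's method, which the release does not describe at this level; it simply assumes constant velocity, and the function name `fill_occluded` and the track format are made up for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def fill_occluded(track: List[Optional[Point]]) -> List[Point]:
    """Fill occluded (None) positions by constant-velocity extrapolation.

    Toy baseline only: velocity is estimated from the two most recent
    positions (observed or already extrapolated) and carried forward
    one step per missing frame.
    """
    filled: List[Point] = []
    for obs in track:
        if obs is not None:
            filled.append(obs)
        elif len(filled) >= 2:
            (x1, y1), (x2, y2) = filled[-2], filled[-1]
            filled.append((2 * x2 - x1, 2 * y2 - y1))
        elif len(filled) == 1:
            # No velocity estimate yet: hold the last known position.
            filled.append(filled[-1])
        else:
            raise ValueError("track must start with at least one visible frame")
    return filled


# A ball moving right at 1 unit per frame is hidden for two frames.
track = [(0.0, 0.0), (1.0, 0.0), None, None, (4.0, 0.0)]
print(fill_occluded(track))
```

A learned model is expected to beat this kind of baseline when objects bounce, collide or change speed while hidden, which is where physical reasoning matters.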
The architecture emphasizes explicit temporal processing and a physics-aware reasoning layer. According to the technical material, the system uses a vision encoder to extract per-frame features, a temporal module to aggregate motion information, and a reasoning transformer that conditions on both visual tokens and symbolic-like scene representations. Output heads target tasks including open-ended language responses, multiple-choice question answering and short-horizon video prediction.
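The three stages described above can be sketched as a simple data flow. This is a minimal illustration under heavy assumptions, not DeepMind's implementation: every function name is hypothetical, the "encoder" is a brightness centroid rather than a neural network, the "reasoning" step is a hand-written rule rather than a transformer, and text conditioning is left out.

```python
from typing import List, Sequence

Vector = List[float]
Frame = Sequence[Sequence[float]]  # grayscale grid: rows of pixel intensities


def encode_frame(frame: Frame) -> Vector:
    """Stand-in vision encoder: brightness-weighted centroid [x, y]."""
    total = sum(sum(row) for row in frame)
    if total == 0:
        raise ValueError("frame has no bright pixels to locate")
    cx = sum(x * v for row in frame for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(frame) for v in row) / total
    return [cx, cy]


def aggregate_motion(per_frame: List[Vector]) -> Vector:
    """Stand-in temporal module: average per-frame displacement."""
    if len(per_frame) < 2:
        return [0.0 for _ in per_frame[0]]
    steps = len(per_frame) - 1
    return [(last - first) / steps
            for first, last in zip(per_frame[0], per_frame[-1])]


def reason(motion: Vector) -> str:
    """Stand-in reasoning stage: a fixed rule over horizontal motion."""
    dx = motion[0]
    if abs(dx) < 1e-9:
        return "not moving horizontally"
    return "moving right" if dx > 0 else "moving left"


def describe_motion(clip: List[Frame]) -> str:
    """Run the toy pipeline: encode each frame, aggregate, then reason."""
    tokens = [encode_frame(frame) for frame in clip]
    return reason(aggregate_motion(tokens))


# One bright pixel shifts one column to the right in each frame.
clip = [
    [[1, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 1, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 1, 0], [0, 0, 0, 0]],
]
print(describe_motion(clip))
```

The point of the sketch is the separation of concerns: per-frame features first, motion aggregated across frames second, and an answer produced only from those intermediate representations.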
Training mixes supervised datasets, curated simulation data and internet-scale visual-text pairs. The inclusion of simulated physics scenes aims to improve generalization on tasks that require causal inference about object dynamics. DeepMind describes evaluation on a blend of existing benchmarks and newly designed probes focused on physical plausibility and spatial reasoning.
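One common way to mix data sources like these is weighted sampling per training example. The sketch below assumes that approach; the release does not state how the mix is implemented, and the source names and proportions in `MIX` are placeholders, not figures from DeepMind.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical mixture weights, chosen only for illustration.
MIX: Dict[str, float] = {
    "supervised": 0.3,
    "simulation": 0.2,
    "web_image_text": 0.5,
}


def sample_sources(n: int, mix: Dict[str, float], seed: int = 0) -> List[str]:
    """Pick a data source for each of n training examples by weight."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000, MIX))
for name in MIX:
    print(name, counts[name])
```

Raising the simulation weight would expose the model to more physics scenes per step, at the cost of seeing less real-world imagery, which is the trade-off behind the simulation-to-real sensitivity discussed later in the piece.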
Performance, availability and limits
DeepMind reports improvements on internal and public tests that emphasize spatial and physical understanding, while noting the model is not optimized as a generalist conversational assistant. Benchmarks highlighted in the release focus on robustness to occlusion, multi-object reasoning and short-term trajectory prediction. The blog materials show Antigravity outperforming several vision-language baselines on these targeted probes, though improvements are narrower on broad visual question answering suites.
The team released interactive demos to illustrate capabilities, but model weights were not published at announcement. Access appears to be limited to hosted demos and the research paper for now; follow-up releases may expand availability or provide API access depending on deployment decisions. DeepMind flags common failure modes in the writeup, including overconfidence on ambiguous scenes and sensitivity to distribution shifts between simulation and real video.
Why it matters
Antigravity signals a renewed push by DeepMind to build multimodal systems that combine perception with structured reasoning about physics and space. Systems that better infer object dynamics and causal relations can improve robotics perception, video understanding and downstream safety checks for visual decisions. The model's release narrows a gap between perception-focused vision models and reasoning-oriented language systems, showing practical progress on tasks where both capabilities must work together.
Written by The Brieftide · Source: Google DeepMind (deepmind.google)