DeepMind Veo 3.1 release: Ingredients-to-Video with vertical video
Veo 3.1 improves consistency and creative variation for short clips, adds vertical video output and finer control over motion and framing.
TL;DR
- Veo 3.1 improves consistency and creative variation for short clips, adds vertical video output and finer control over motion and framing.
- DeepMind has released Veo 3.1, the latest update to its Ingredients-to-Video generative system, adding features to improve consistency, increase creative variation and support vertical video output.
- The update targets short-form clip generation and gives users finer control over motion, pacing and camera framing.
DeepMind has released Veo 3.1, the latest update to its Ingredients-to-Video generative system, adding features to improve consistency, increase creative variation and support vertical video output. The update targets short-form clip generation and gives users finer control over motion, pacing and camera framing.
Veo 3.1 arrives as an iterative upgrade rather than a ground-up redesign. The release focuses on three areas cited by the developer: improving temporal coherence so objects and characters remain stable across frames, expanding the model's creative palette to produce less repetitive motion, and exposing explicit control parameters for the vertical and portrait aspect ratios used across social platforms.
What’s new in Veo 3.1
The headline additions include a suite of control tokens and conditioning signals that let users set clip length, camera motion style, and framing constraints. DeepMind says the controls are calibrated to work across both landscape and vertical canvases, with presets for common mobile formats.
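As a rough illustration of what such a control set could look like in client code, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ClipControls` class, its field names, the camera style names and the preset sizes are hypothetical and are not taken from any published Veo interface.

```python
from dataclasses import dataclass

# Hypothetical preset names and pixel sizes, chosen only for illustration.
ASPECT_PRESETS = {
    "landscape_16_9": (1280, 720),
    "vertical_9_16": (720, 1280),
}

# Hypothetical camera motion styles.
CAMERA_STYLES = {"static", "pan", "dolly", "handheld"}


@dataclass(frozen=True)
class ClipControls:
    """Illustrative container for clip length, camera style and framing."""

    duration_s: float = 6.0
    camera_style: str = "static"
    aspect_preset: str = "vertical_9_16"
    framing_margin: float = 0.1  # fraction of the frame kept clear at each edge

    def __post_init__(self):
        # Reject values that no preset or style could satisfy.
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")
        if self.camera_style not in CAMERA_STYLES:
            raise ValueError(f"unknown camera_style: {self.camera_style!r}")
        if self.aspect_preset not in ASPECT_PRESETS:
            raise ValueError(f"unknown aspect_preset: {self.aspect_preset!r}")
        if not 0.0 <= self.framing_margin < 0.5:
            raise ValueError("framing_margin must be in [0, 0.5)")

    def resolution(self):
        """Return (width, height) in pixels for the chosen preset."""
        return ASPECT_PRESETS[self.aspect_preset]

    def is_vertical(self):
        """True when the canvas is taller than it is wide."""
        width, height = self.resolution()
        return height > width
```

Validating the controls up front, before any generation request is made, is one way to keep a single control set usable across both landscape and vertical canvases.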
On quality, Veo 3.1 applies expanded temporal attention windows and refined frame interpolation to reduce flicker and identity drift. The update also introduces stochastic variation layers intended to diversify motion trajectories while preserving object identity. DeepMind highlights sample outputs that show more natural head and limb motion, steadier object placement, and fewer texture artifacts.
For vertical video, the model includes aspect-aware conditioning so generated camera movement and composition respect portrait-focused subject placement. The system provides automatic center-of-interest heuristics for single-subject shots and adjustable margins for multi-subject framing.
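DeepMind has not published how these heuristics are computed. The Python sketch below shows one simple way a center-of-interest rule with margins could work, assuming subjects are given as pixel bounding boxes. The function names, the 9:16 default and the margin definition are all hypothetical.

```python
def union_box(boxes):
    """Smallest box (x0, y0, x1, y1) that covers every subject box."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return min(xs0), min(ys0), max(xs1), max(ys1)


def portrait_window(frame_w, frame_h, boxes, margin=0.1, aspect=(9, 16)):
    """Place a full-height portrait window over the subjects.

    Returns (window, fits). window is (x0, y0, x1, y1) in pixels.
    fits is True when every subject, padded by margin * window width
    on the left and right, lies inside the window.
    """
    # Window width follows from the frame height and the target aspect.
    win_w = min(frame_w, frame_h * aspect[0] / aspect[1])

    # Centre the window on the horizontal midpoint of all subjects.
    x0, _, x1, _ = union_box(boxes)
    centre = (x0 + x1) / 2
    left = centre - win_w / 2

    # Clamp so the window never leaves the source frame.
    left = max(0.0, min(left, frame_w - win_w))

    pad = margin * win_w
    fits = (x0 - pad >= left) and (x1 + pad <= left + win_w)
    return (left, 0.0, left + win_w, float(frame_h)), fits
```

For a single subject the union box is just that subject's box, so the same function covers both the single-subject and multi-subject cases. When `fits` is false, a caller could fall back to a wider composition or a smaller margin.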
Veo 3.1 continues to rely on the Ingredients-to-Video approach, which combines high-level scene "ingredients" such as character descriptions, action prompts and reference images with generative modules that synthesize motion and render frames. The update refines how those inputs are fused, and adds user-facing sliders for pacing, jitter tolerance and motion creativity.
How it works
Under the hood, the pipeline retains a staged architecture: a scene planner converts textual and visual ingredients into an abstract motion plan, a motion synthesizer samples plausible trajectories consistent with that plan, and a renderer produces the final frames. The new controls operate at planning and motion synthesis stages, changing the probability distributions used to sample trajectories and camera paths.
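To make the staged structure concrete, here is a toy Python sketch of a planner, a motion synthesizer and a renderer. It is not DeepMind's implementation. The one-dimensional "subject position", the dictionary keys, the `motion_creativity` scaling factor and every function name are assumptions, chosen only to show a control changing the distribution that trajectories are sampled from.

```python
import random
from dataclasses import dataclass


@dataclass
class MotionPlan:
    start: float    # subject position in the first frame (0..1)
    end: float      # subject position in the last frame (0..1)
    n_frames: int
    spread: float   # standard deviation used when sampling positions


def plan_scene(ingredients, controls):
    """Planner stage: turn ingredients and controls into an abstract plan."""
    n_frames = int(round(controls["duration_s"] * controls["fps"]))
    # Hypothetical mapping from a creativity slider to sampling spread.
    spread = controls["motion_creativity"] * 0.05
    return MotionPlan(ingredients["subject_x"], ingredients["target_x"],
                      n_frames, spread)


def synthesize_motion(plan, rng):
    """Synthesizer stage: sample one trajectory around the planned path."""
    path = []
    for i in range(plan.n_frames):
        t = i / (plan.n_frames - 1) if plan.n_frames > 1 else 0.0
        mean = plan.start + t * (plan.end - plan.start)
        path.append(rng.gauss(mean, plan.spread))
    return path


def render(path):
    """Renderer stand-in: emit one record per frame instead of pixels."""
    return [{"frame": i, "subject_x": round(x, 4)} for i, x in enumerate(path)]


def generate(ingredients, controls, seed):
    """Run the three stages with a seeded random generator."""
    rng = random.Random(seed)
    plan = plan_scene(ingredients, controls)
    return render(synthesize_motion(plan, rng))
```

In this toy version, setting `motion_creativity` to 0 collapses the sampling spread so every seed yields the same straight-line path, while larger values let different seeds produce different trajectories around the same plan.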
Improvements in temporal coherence come from two technical tweaks. First, the system increases the effective attention span across frames so the model can reference a longer history when predicting the next frame. Second, it introduces a consistency loss during training that penalizes identity changes across time. For creative variation, the model applies conditional noise schedules and diversity-promoting objectives to encourage varied but plausible motion.
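The brief does not give the exact form of either change. As a generic illustration, the Python sketch below shows a causal attention mask with a configurable frame window and a simple consistency penalty on per-frame identity embeddings. Both functions, and the choice of a mean squared difference between consecutive frames, are assumptions and not the published Veo 3.1 formulation.

```python
def causal_window_mask(n_frames, window):
    """mask[i][j] is True when frame i may attend to frame j.

    Frame i sees itself and up to window - 1 earlier frames, so a
    larger window means a longer usable history.
    """
    return [[0 <= i - j < window for j in range(n_frames)]
            for i in range(n_frames)]


def temporal_consistency_loss(embeddings):
    """Mean squared change between consecutive identity embeddings.

    embeddings is a list of equal-length float lists, one per frame.
    A stable identity gives 0.0; drift across frames raises the value.
    """
    if len(embeddings) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(embeddings, embeddings[1:]):
        # Average squared difference over embedding dimensions.
        total += sum((c - p) ** 2 for p, c in zip(prev, curr)) / len(curr)
    return total / (len(embeddings) - 1)
```

Widening `window` from 2 to 3 lets the last of four frames look back one frame further, and a sequence of identical embeddings scores zero under the penalty, which is the behaviour a term like this would reward during training.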
Deployment options remain similar to prior Veo releases: the model can be run on dedicated inference infrastructure and accepts both text-only and mixed text-plus-image prompts. DeepMind emphasizes that runtime controls let producers trade off determinism for variety depending on production needs.
Why it matters
Veo 3.1 narrows the gap between single-frame image models and short-form video tools by addressing common failure modes such as flicker and identity drift while adding controls tailored for mobile vertical formats. Content creators and app developers who need reliable short clips with predictable framing will find the new controls useful, while researchers can study the trade-offs between temporal consistency and motion diversity introduced by the update.
- Input ingredients: Text prompts, reference images, aspect ratio and control tokens (pacing, motion style, framing)
- Scene planner: Converts ingredients into an abstract motion and camera plan with center-of-interest heuristics
- Motion synthesizer: Samples trajectories with diversity controls and temporal attention across frames
- Renderer: Produces per-frame pixels, applies interpolation and consistency loss adjustments
- Output formatting: Finalizes aspect ratio, applies vertical presets and exports short-form clip
Primary source
Google DeepMind
deepmind.google