Multimodal AI · July 2, 2026 · 4 min read

VideoFlexTok: Flexible video tokens cut cost, enable 10s clips

VideoFlexTok uses coarse-to-fine, variable-length tokens to reach gFVD and ViCLIP scores comparable to a 5.2B-parameter 3D-grid baseline with a 1.1B-parameter model.

The Brieftide · July 2, 2026

TL;DR

01 VideoFlexTok uses coarse-to-fine, variable-length tokens to reach gFVD and ViCLIP scores comparable to a 5.2B-parameter 3D-grid baseline with a 1.1B-parameter model.
02 The Apple Machine Learning writeup reports comparable gFVD and ViCLIP Score performance while using a 1.1B-parameter model versus a 5.2B 3D-grid baseline.
03 The paper frames this as an alternative to the de facto spatiotemporal 3D-grid tokenization, which forces downstream models to predict low-level detail uniformly across space and time.

VideoFlexTok, a paper published in July 2026 for ICML, represents videos as variable-length, coarse-to-fine token sequences and uses a generative flow decoder to reconstruct realistic videos from any token count. The Apple Machine Learning writeup reports comparable gFVD and ViCLIP Score performance while using a 1.1B-parameter model versus a 5.2B 3D-grid baseline.

How does VideoFlexTok work?

VideoFlexTok encodes a video as a flexible-length sequence of tokens arranged coarse-to-fine: early tokens emergently capture abstract information such as semantics and motion, while later tokens add fine-grained detail, and a generative flow decoder reconstructs realistic video from any chosen token count. The paper frames this as an alternative to the de facto spatiotemporal 3D-grid tokenization, which forces downstream models to predict low-level detail uniformly across space and time.

The method lets downstream models receive fewer tokens when appropriate and more tokens when detail is needed, adapting token count to task requirements. That flexibility also permits encoding longer videos for the same token budget compared with 3D-grid tokenizers, because token count no longer scales strictly with spatiotemporal resolution.
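As a minimal sketch of what "fewer tokens when appropriate" could look like, assume the token sequence is ordered coarse-to-fine so that any prefix is itself a usable, coarser encoding. The function name keep_prefix and the integer token ids below are hypothetical and are not taken from the paper or the writeup.

```python
from typing import List, Sequence


def keep_prefix(tokens: Sequence[int], budget: int) -> List[int]:
    """Keep the first `budget` tokens of a coarse-to-fine ordered sequence.

    Assumption: earlier tokens carry coarse content (semantics, motion) and
    later tokens add detail, so truncating from the right drops detail first.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return list(tokens[:budget])


# Hypothetical token ids for one clip, ordered coarse (left) to fine (right).
clip_tokens = list(range(672))

# A downstream model could take a short prefix for a coarse task
# and the full sequence when fine detail matters.
coarse = keep_prefix(clip_tokens, 16)
full = keep_prefix(clip_tokens, len(clip_tokens))
print(len(coarse), len(full))
```

In this sketch the token budget is a per-request choice rather than a fixed function of resolution and frame count, which is the property the brief attributes to the method.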

How does VideoFlexTok compare to 3D-grid tokenization on quality and scale?

The paper shows that VideoFlexTok attains comparable generation quality on the gFVD and ViCLIP Score benchmarks while using a model roughly five times smaller (1.1B versus 5.2B parameters), which it presents as evidence of more efficient training than with 3D-grid tokens. The authors present this size gap as an example of improved efficiency rather than a single exhaustive benchmark.

Beyond model-size comparisons, VideoFlexTok enables long video generation without prohibitive compute by compressing long clips into far fewer tokens: the paper describes training a text-to-video model on 10-second, 81-frame videos using only 672 tokens, which the authors state is eight times fewer than a comparable 3D-grid tokenizer. The writeup also names a generative flow decoder as the mechanism that allows realistic reconstructions from variable token counts.
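The figures quoted above imply a few derived numbers that are easy to check by hand. The script below is only arithmetic on values stated in this brief. The 3D-grid token count is inferred from the "eight times fewer" statement rather than reported directly, and the variable names are illustrative.

```python
# Values quoted in the brief.
flex_tokens = 672        # tokens for one 10-second, 81-frame clip
reduction_factor = 8     # "eight times fewer" than a comparable 3D-grid tokenizer
frames = 81
seconds = 10
flex_params_b = 1.1      # model size in billions of parameters
grid_params_b = 5.2      # 3D-grid baseline size in billions of parameters

# Derived quantities (inferred here, not reported in the source).
implied_grid_tokens = flex_tokens * reduction_factor
tokens_per_frame = flex_tokens / frames
tokens_per_second = flex_tokens / seconds
size_ratio = grid_params_b / flex_params_b

print(f"implied 3D-grid tokens: {implied_grid_tokens}")
print(f"tokens per frame: {tokens_per_frame:.1f}")
print(f"tokens per second: {tokens_per_second:.1f}")
print(f"model size ratio: {size_ratio:.2f}x")
```

On these numbers the size gap works out to about 4.7x, and the clip averages a little over eight tokens per frame.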

What came before this work?

VideoFlexTok follows a string of tokenization research aimed at reducing redundant tokens and decoupling token count from video duration. The authors link two earlier related works: TrajTok, published March 17, 2026 at CVPR, which focuses on trajectory-based tokens but relies on external segmentation and tracking pipelines, and FlexTok, published February 19, 2025, which resamples images into 1D token sequences of flexible length.

The VideoFlexTok authors are Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, and Amir Zamir. The page carries a Swiss Federal Institute of Technology Lausanne affiliation marker, and Andrei Atanov's listing indicates work done while at Apple.

Why it matters

Token efficiency changes the trade-offs for video models: compressing long clips into 672 tokens and matching benchmark scores with a 1.1B model versus a 5.2B baseline reduces both memory and compute pressure for training and inference. That makes experiments with longer temporal context tractable and lowers the entry barrier for researchers and teams that cannot afford very large 3D-grid models.

What to watch

Watch for follow-up papers and open evaluations that reproduce the 1.1B versus 5.2B comparison and the 10-second, 81-frame, 672-token training setup. Also watch whether downstream text-to-video systems adopt variable-length, coarse-to-fine tokenization in place of fixed 3D grids, and whether benchmarks start reporting gFVD and ViCLIP Score under those long-video settings.

VideoFlexTok vs 3D-grid tokenization (key figures)

Item	VideoFlexTok	3D-grid tokenization
Model size to match gFVD and ViCLIP Score	1.1B	5.2B
Relative model size	5x smaller	baseline
Tokens to encode 10s, 81-frame video	672 tokens	8x more than VideoFlexTok

Written by The Brieftide · Source: Apple Machine Learning
