Multimodal AI · June 4, 2025 · 5 min read

AGI Is Not Multimodal: Why LLMs Lack Embodied World Models

The essay argues multimodal scaling and next-token training yield syntax and heuristics, not the embodied world models AGI needs.

The Brieftide · June 4, 2025

TL;DR

01 The essay argues multimodal scaling and next-token training yield syntax and heuristics, not the embodied world models AGI needs.
02 An essay argues multimodal scaling will not produce human-level AGI, and that embodiment and interaction must be treated as primary rather than an afterthought.
03 The central claim is that true AGI must be able to solve problems that originate in physical reality. Examples the essay gives include repairing a car, untying a knot, and preparing food.

An essay in The Gradient argues multimodal scaling will not produce human-level AGI, and that embodiment and interaction must be treated as primary rather than an afterthought. It contends that large language models trained with a next-token prediction objective likely learn bags of heuristics and syntactic rules, not the physical world models needed for sensorimotor reasoning, motion planning, or social coordination.

What the essay argues

The central claim is that true AGI must be able to solve problems that originate in physical reality. Examples the essay gives include repairing a car, untying a knot, and preparing food. Those tasks, it says, require a "form of intelligence that is fundamentally situated in something like a physical world model," not mere symbol manipulation.

The essay challenges the idea that next-token prediction induces human-like world models. Instead, it argues LLMs often retain only the information necessary to predict the next token and so can learn "comprehensive sets of idiosyncratic heuristics." A highlighted example is work on Othello: researchers inferred the board state from a transformer's hidden state, but the essay argues that this result does not generalize to natural language because Othello resides in the land of symbols while many physical tasks do not. A related critique, cited in the essay, notes that OthelloGPT learned rules that do not hold for all games, such as the rule, observed in a blog post, that "if the token for B4 does not appear before A4 in the input string, then B4 is empty."
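The gap between a token-order rule and a tracked board state can be illustrated with a toy sketch. This is not the essay's or the blog post's code. It encodes one possible reading of the quoted B4/A4 rule, and the function names and move sequences are hypothetical. The moves are illustrative tokens and are not checked for legality. The tracker relies only on the fact that Othello discs are flipped but never removed, so a square is occupied if it held a starting disc or has been played.

```python
# Toy contrast between a surface heuristic over tokens and explicit state tracking.
# Assumption: move strings are illustrative tokens; legality is NOT checked.

START_SQUARES = {"D4", "D5", "E4", "E5"}  # occupied before the first move


def heuristic_b4_empty(moves):
    """One reading of the quoted rule: B4 is empty if B4 does not precede A4."""
    if "A4" not in moves:
        return None  # the rule does not fire without an A4 token
    return "B4" not in moves[: moves.index("A4")]


def tracked_b4_empty(moves):
    """State tracking: a square is empty unless it started occupied or was played."""
    occupied = START_SQUARES | set(moves)
    return "B4" not in occupied


game_one = ["C4", "B4", "A4"]  # B4 is played before A4
game_two = ["C4", "A4", "B4"]  # B4 is played after A4

for game in (game_one, game_two):
    print(game, heuristic_b4_empty(game), tracked_b4_empty(game))
```

In the first sequence both functions agree that B4 is occupied. In the second, the token-order rule still reports B4 as empty even though it has been played, which is the kind of rule that works for many games but not all of them.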

The piece breaks linguistic intelligence into syntax, semantics, and pragmatics and argues LLMs may reduce semantics and pragmatics to syntax. It sketches a thought experiment in which a system could embed semantic constraints into new syntactic categories and special production rules learned from massive corpora, a strategy that would mimic correct outputs without producing genuine world understanding.
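A rough sketch of that thought experiment, under stated assumptions: the tiny lexicon, the category names, and the `accepts` function below are all hypothetical and do not come from the essay. The point is only that a constraint that looks semantic (only food gets eaten) can be folded into finer-grained syntactic categories and special production rules.

```python
# Hypothetical toy grammar: a "semantic" constraint is encoded purely as
# extra syntactic categories, with no model of chefs, soup, or wrenches.

LEXICON = {
    "chef": "N_ANIMATE",
    "soup": "N_EDIBLE",
    "wrench": "N_TOOL",
    "eats": "V_EAT",
    "repairs": "V_REPAIR",
}

# Special production rules: (subject category, verb category, object category).
PRODUCTIONS = {
    ("N_ANIMATE", "V_EAT", "N_EDIBLE"),
    ("N_ANIMATE", "V_REPAIR", "N_TOOL"),
}


def accepts(sentence):
    """Return True if the sentence matches one of the category patterns."""
    words = [w for w in sentence.lower().split() if w != "the"]
    if len(words) != 3 or any(w not in LEXICON for w in words):
        return False
    return tuple(LEXICON[w] for w in words) in PRODUCTIONS


print(accepts("the chef eats the soup"))
print(accepts("the chef eats the wrench"))
```

The recognizer accepts the first sentence and rejects the second, yet it does so by table lookup alone. Nothing in it represents what eating is, which is the sense in which correct outputs need not imply world understanding.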

Evidence, context, and the Bitter Lesson

The essay locates current successes in scale: these models "scaled effectively on hardware we already had," and proponents of scale have been seduced by those results. It revisits Sutton's Bitter Lesson and argues the maxim has been misread as forbidding structural assumptions. In contrast, the essay says it is precisely when humans think deeply about the structure of intelligence that major advancements occur, and that multimodal proponents implicitly assume particular structures about modalities and how to sew them together.

To underline inefficiency, the author uses an analogy: training purely by scale is like training "a pile of one trillion ants for one billion years to mimic the form and function of a Formula 1 race car; eventually it gets there, but wow was the process inefficient." The essay calls for either careful thinking about how to unite modalities or an alternative approach that makes embodiment and interaction the core cognitive process.

The piece also points to subfields that explicitly use world models to solve physical tasks, naming model-based reinforcement learning, task and motion planning in robotics, and causal world modeling as relevant approaches that show how high-fidelity physical predictions are leveraged in practice.
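As a minimal sketch of what using a world model to plan can look like, the example below searches over short action sequences by rolling each one forward through an explicit transition function. Everything here is an assumption for illustration: the point-mass dynamics, the time step `DT`, the action set, and the names `step` and `plan` are hypothetical and are not taken from the essay or from any specific library. Practical model-based reinforcement learning and task-and-motion planners are far more elaborate.

```python
from itertools import product

DT = 0.1  # assumed time step, in seconds


def step(state, accel):
    """Hypothetical world model: a point mass on a line, state = (position, velocity)."""
    pos, vel = state
    vel = vel + accel * DT
    pos = pos + vel * DT
    return (pos, vel)


def plan(state, goal, horizon=4, actions=(-1.0, 0.0, 1.0)):
    """Exhaustively roll out action sequences and keep the one ending nearest the goal."""
    best_seq, best_err = None, float("inf")
    for seq in product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)  # predict the next physical state
        err = abs(s[0] - goal)
        if err < best_err:
            best_seq, best_err = seq, err
    return best_seq


print(plan((0.0, 0.0), goal=1.0))
```

Starting at rest with a goal far to the right, the search picks full positive acceleration at every step. The planner never sees text. It chooses actions by predicting physical states, which is the kind of prediction the essay says these subfields rely on.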

Why it matters

If LLMs primarily learn syntactic heuristics rather than grounded models of the world, then apparent language competence is a misleading proxy for general intelligence. The argument implies multimodal architectures assembled from modality-specific components, optimized by scale alone, may fail to reach human-level AGI on tasks that require real-world interaction and planning. This reframes priorities for research and engineering toward embodiment, interaction, and explicit world-modeling.

What to watch

Look for demonstrations that go beyond token prediction: models that predict the next physical state given a history of states, or multimodal systems that perform credible motion planning, sensorimotor reasoning, or sustained social coordination. Evidence that LLMs are running high-fidelity physics-style simulations in latent space, rather than only reproducing token patterns, would contradict the essay's central claim.

Written by The Brieftide · Source: The Gradient
