Multimodal AIJuly 2, 20265 min read

AGI Maze benchmark: LLMs fail at maze world-modeling

Alexey Potapov's AGI Maze offers grid mazes and a clean API where vanilla LLMs fail to form persistent world models; a message-history.

The BrieftideJuly 2, 2026

TL;DR

01Alexey Potapov's AGI Maze offers grid mazes and a clean API where vanilla LLMs fail to form persistent world models; a message-history.
02Alexey Potapov submitted AGI Maze to arXiv on 1 Jul 2026 (arXiv:2607.00627).
03AGI Maze provides a family of grid-based mazes, no high-dimensional sensory inputs, and a simple API so researchers can vary difficulty and partial observability to require explicit world models.

Alexey Potapov submitted AGI Maze to arXiv on 1 Jul 2026 (arXiv:2607.00627). AGI Maze is a lightweight framework of grid-based maze tasks with a clean API and multiple difficulty regimes, designed to force agents to construct and use persistent, manipulable world-state representations rather than rely on next-token pattern completion.

How does AGI Maze work?

AGI Maze provides a family of grid-based mazes, no high-dimensional sensory inputs, and a simple API so researchers can vary difficulty and partial observability to require explicit world models. The paper frames the problem by contrasting LLMs' default operating mode of predicting the next token from a static context with the demands of environments that are partially observable, stateful, and require memory and structured hypotheses about hidden state.

The framework deliberately avoids complex perceptual inputs, focusing instead on tasks where the challenge is representation and memory. That design lets evaluators change difficulty regimes and step budgets while keeping the observation space low dimensional, so failures point to gaps in internal world-modeling rather than to vision or signal processing.

How did LLMs perform on the benchmark?

Initial evaluation found that several vanilla LLMs fail to represent mazes internally at LLM inference time, and cannot reliably solve even small mazes within reasonable step budgets. Potapov reports an initial set of experiments on simple mazes showing this basic failure mode.

To probe remedies, the paper introduces a baseline agent that is allowed to use its message history as a working memory to construct descriptions of observations at agentic runtime. That baseline, in Potapov's results, improves performance over the vanilla inference-only approach but remains insufficient: even with message-history working memory, an LLM agent could not reliably solve small mazes within a step budget that the author says is more than enough for humans.

Those two concrete points anchor the paper's claims: vanilla LLMs do not form lasting internal maze representations during standard inference, and a straightforward working-memory trick — what the paper calls using "message history as a working memory" — helps but does not solve the shortcoming.

Why it matters

The failure of LLMs on AGI Maze isolates weaknesses that are easy to miss in text-only benchmarks that do not impose partial observability or stateful environments. If models can answer questions by pattern completion but cannot maintain manipulable representations of a simple external world, then many behaviors labeled as reasoning may not generalize to tasks that require persistent, structured hypotheses about hidden state.

AGI Maze gives a crisp, low-overhead way to test this. The framework exposes whether an agent truly tracks world state across actions and observations, a capability relevant to robotics, planning, and any interactive system that must remember and act on hidden facts.

What to watch

Look for wider adoption of AGI Maze evaluations and for follow-up papers that report quantitative metrics and agent designs specifically aimed at persistent world models. The next useful milestones will be public benchmarks built on the AGI Maze API, comparative leaderboards, and agent architectures that go beyond message-history working memory to explicitly construct and update internal state models.

Paper reference: Alexey Potapov, "AGI Maze as a Benchmark Framework for World-Modeling Agents," arXiv:2607.00627, submitted 1 Jul 2026. DOI: 10.48550/arXiv.2607.00627.

Initial evaluation: vanilla LLMs vs baseline agent

Item
Internal maze representation during inference	fail to represent mazes internally at LLM inference time	constructs descriptions from message history but not full internal model
Solve small mazes within a human-scale step budget	cannot reliably solve simple mazes	improves performance but still insufficient to reliably solve
Sensory complexity	framework uses low-dimensional observations (no high-dimensional inputs)	framework uses low-dimensional observations (no high-dimensional inputs)

Written by The Brieftide · Source: arXiv

The Brieftide Daily · 06:00

Briefs like this one, in your inbox every morning.

FreeOne email a dayEvery claim sourcedUnsubscribe in one click

Continue reading

MMIR-TCM: multimodal TCM AI framework outperforms GPT-4o, Gemini

MMIR-TCM pairs Memory-SAM, fine-tuned Qwen3-VL and a Qwen3 RAG pipeline.

The BrieftideDAILY BRIEF

MIT Masked IRL: LLMs help robots clarify and ignore cues

MIT’s Masked IRL uses two LLMs to clarify vague prompts, cut demonstration data nearly fivefold.

The BrieftideDAILY BRIEF

Multimodal LLM evaluation: four missing capabilities (2026)

A paper by Po-han Li et al. finds benchmarks miss temporal-spatial coherence, physical-world understanding.

The BrieftideDAILY BRIEF

ReMMD: Multilingual Multi-Image Benchmark and Agent Release

ReMMD introduces ReMMDBench (500 samples, 2,756 images) and ReMMD-Agent; GPT-5.2 yields 41.80% accuracy and 39.12% macro-F1.