AGI Maze benchmark: LLMs fail at maze world-modeling
Alexey Potapov's AGI Maze offers grid mazes and a clean API where vanilla LLMs fail to form persistent world models; a message-history.
TL;DR
- 01Alexey Potapov's AGI Maze offers grid mazes and a clean API where vanilla LLMs fail to form persistent world models; a message-history.
- 02Alexey Potapov submitted AGI Maze to arXiv on 1 Jul 2026 (arXiv:2607.00627).
- 03AGI Maze provides a family of grid-based mazes, no high-dimensional sensory inputs, and a simple API so researchers can vary difficulty and partial observability to require explicit world models.
Alexey Potapov submitted AGI Maze to arXiv on 1 Jul 2026 (arXiv:2607.00627). AGI Maze is a lightweight framework of grid-based maze tasks with a clean API and multiple difficulty regimes, designed to force agents to construct and use persistent, manipulable world-state representations rather than rely on next-token pattern completion.
How does AGI Maze work?
AGI Maze provides a family of grid-based mazes, no high-dimensional sensory inputs, and a simple API so researchers can vary difficulty and partial observability to require explicit world models. The paper frames the problem by contrasting LLMs' default operating mode of predicting the next token from a static context with the demands of environments that are partially observable, stateful, and require memory and structured hypotheses about hidden state.
The framework deliberately avoids complex perceptual inputs, focusing instead on tasks where the challenge is representation and memory. That design lets evaluators change difficulty regimes and step budgets while keeping the observation space low dimensional, so failures point to gaps in internal world-modeling rather than to vision or signal processing.
How did LLMs perform on the benchmark?
Initial evaluation found that several vanilla LLMs fail to represent mazes internally at LLM inference time, and cannot reliably solve even small mazes within reasonable step budgets. Potapov reports an initial set of experiments on simple mazes showing this basic failure mode.
To probe remedies, the paper introduces a baseline agent that is allowed to use its message history as a working memory to construct descriptions of observations at agentic runtime. That baseline, in Potapov's results, improves performance over the vanilla inference-only approach but remains insufficient: even with message-history working memory, an LLM agent could not reliably solve small mazes within a step budget that the author says is more than enough for humans.
Those two concrete points anchor the paper's claims: vanilla LLMs do not form lasting internal maze representations during standard inference, and a straightforward working-memory trick — what the paper calls using "message history as a working memory" — helps but does not solve the shortcoming.
Why it matters
The failure of LLMs on AGI Maze isolates weaknesses that are easy to miss in text-only benchmarks that do not impose partial observability or stateful environments. If models can answer questions by pattern completion but cannot maintain manipulable representations of a simple external world, then many behaviors labeled as reasoning may not generalize to tasks that require persistent, structured hypotheses about hidden state.
AGI Maze gives a crisp, low-overhead way to test this. The framework exposes whether an agent truly tracks world state across actions and observations, a capability relevant to robotics, planning, and any interactive system that must remember and act on hidden facts.
What to watch
Look for wider adoption of AGI Maze evaluations and for follow-up papers that report quantitative metrics and agent designs specifically aimed at persistent world models. The next useful milestones will be public benchmarks built on the AGI Maze API, comparative leaderboards, and agent architectures that go beyond message-history working memory to explicitly construct and update internal state models.
Paper reference: Alexey Potapov, "AGI Maze as a Benchmark Framework for World-Modeling Agents," arXiv:2607.00627, submitted 1 Jul 2026. DOI: 10.48550/arXiv.2607.00627.
| Item | |||
|---|---|---|---|
| Internal maze representation during inference | fail to represent mazes internally at LLM inference time | constructs descriptions from message history but not full internal model | |
| Solve small mazes within a human-scale step budget | cannot reliably solve simple mazes | improves performance but still insufficient to reliably solve | |
| Sensory complexity | framework uses low-dimensional observations (no high-dimensional inputs) | framework uses low-dimensional observations (no high-dimensional inputs) |
Written by The Brieftide · Source: arXiv
The Brieftide Daily · 06:00
Briefs like this one, in your inbox every morning.
Continue reading
More in Multimodal AIMMIR-TCM: multimodal TCM AI framework outperforms GPT-4o, Gemini
MMIR-TCM pairs Memory-SAM, fine-tuned Qwen3-VL and a Qwen3 RAG pipeline.
MIT Masked IRL: LLMs help robots clarify and ignore cues
MIT’s Masked IRL uses two LLMs to clarify vague prompts, cut demonstration data nearly fivefold.
Multimodal LLM evaluation: four missing capabilities (2026)
A paper by Po-han Li et al. finds benchmarks miss temporal-spatial coherence, physical-world understanding.
ReMMD: Multilingual Multi-Image Benchmark and Agent Release
ReMMD introduces ReMMDBench (500 samples, 2,756 images) and ReMMD-Agent; GPT-5.2 yields 41.80% accuracy and 39.12% macro-F1.