Multimodal AI · 5 min read

RecurrReason benchmark: Recurrent Reasoning on symbolic puzzles

A 10,817-puzzle dataset tests Tower of Hanoi, River Crossing, Block World and Checkers Jumping with BFS-optimal trajectories.

The Brieftide

TL;DR

  • A 10,817-puzzle dataset tests Tower of Hanoi, River Crossing, Block World and Checkers Jumping with BFS-optimal trajectories.
  • The dataset contains 10,817 unique puzzles and 285,933 moves, each with BFS-optimal trajectories and a single interpretable difficulty parameter N in {1,…,10}.
  • The paper states the code and dataset will be open-sourced upon acceptance.

Gowrav Mannem and co-authors posted "Recurrent Reasoning on Symbolic Puzzles with Sequence Models" to arXiv on 19 Apr 2026, introducing RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles. The dataset contains 10,817 unique puzzles and 285,933 moves, each with BFS-optimal trajectories and a single interpretable difficulty parameter N in {1,…,10}.

What is RecurrReason?

RecurrReason is a labelled, difficulty-controlled benchmark composed of Tower of Hanoi, River Crossing, Block World, and Checkers Jumping puzzles, totalling 10,817 unique instances and 285,933 moves with BFS-optimal trajectories. The benchmark exposes a single integer difficulty parameter N in {1,…,10} so researchers can scale problems in a controlled way and evaluate robustness across in-distribution and harder out-of-distribution settings.
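
To make "BFS-optimal trajectory" concrete, the sketch below runs breadth-first search over Tower of Hanoi states and returns a shortest move list. It is an illustration only, not the authors' generator (their code has not been released): the function name hanoi_bfs, the state encoding, and the reading of N as the number of disks are all assumptions.

```python
from collections import deque


def hanoi_bfs(n_disks):
    """Return a shortest list of (src_peg, dst_peg) moves for n_disks disks.

    Hypothetical helper: assumes difficulty N means the number of disks.
    """
    # A state is three tuples, one per peg, listing disks bottom-to-top.
    # Larger integers are larger disks.
    start = (tuple(range(n_disks, 0, -1)), (), ())
    goal = ((), (), tuple(range(n_disks, 0, -1)))

    # parent maps each visited state to (previous_state, move_taken).
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        for src in range(3):
            if not state[src]:
                continue  # nothing to move from an empty peg
            disk = state[src][-1]
            for dst in range(3):
                if dst == src:
                    continue
                if state[dst] and state[dst][-1] < disk:
                    continue  # a larger disk may not sit on a smaller one
                pegs = list(state)
                pegs[src] = state[src][:-1]
                pegs[dst] = state[dst] + (disk,)
                nxt = tuple(pegs)
                if nxt not in parent:
                    parent[nxt] = (state, (src, dst))
                    queue.append(nxt)

    # Walk back from the goal to recover the move sequence.
    moves = []
    node = goal
    while parent[node] is not None:
        prev, move = parent[node]
        moves.append(move)
        node = prev
    moves.reverse()
    return moves
```

Because BFS visits states in order of distance from the start, the first path that reaches the goal is a shortest one. Under these assumptions, hanoi_bfs(3) returns 7 moves, matching the known minimum of 2^N − 1 moves for N disks.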

The authors designed the suite to address weaknesses in existing reasoning benchmarks, namely that many only check whether a model can produce a valid answer rather than whether solutions are minimal, robust, and stable under controlled difficulty scaling. The paper states the code and dataset will be open-sourced upon acceptance.

How were sequence models tested and what were the results?

The paper benchmarks two Transformer families, an encoder-decoder model described as T5-style and a decoder-only model described as GPT-2-style, trained on instances with N = 1 to 7 and evaluated on held-out in-distribution instances plus harder out-of-distribution instances with N = 8 to 10. Fine-tuned pre-trained T5 achieves 97.27% validation accuracy and 81.00% out-of-distribution accuracy specifically on Block World.
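
The difficulty-based split described above can be sketched as follows. The record format (a dict with an "N" key), the function names, and the exact-match scoring rule are assumptions made for illustration; the source does not specify how the paper stores puzzles or computes accuracy.

```python
TRAIN_N = range(1, 8)   # N = 1..7, the training / in-distribution range
OOD_N = range(8, 11)    # N = 8..10, the harder out-of-distribution range


def split_by_difficulty(puzzles):
    """Partition puzzle records by their difficulty parameter N.

    Hypothetical format: each record is a dict with an integer "N" field.
    """
    in_dist = [p for p in puzzles if p["N"] in TRAIN_N]
    ood = [p for p in puzzles if p["N"] in OOD_N]
    return in_dist, ood


def exact_match_accuracy(predictions, references):
    """Percentage of predicted trajectories identical to the reference.

    Assumed metric: a prediction counts only if every move matches.
    """
    if not references:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return 100.0 * correct / len(references)
```

Scoring the two partitions separately is what lets a validation number and an out-of-distribution number be reported side by side.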

By contrast, all models scored 0.00% on River Crossing under all conditions, indicating that some puzzle structures remain completely unsolved by the evaluated sequence models. The authors report that architecture choice was a stronger determinant of success than model scale, and that pre-training only transferred to puzzles whose transition functions are locally structured.

Why it matters

The benchmark exposes where the apparent reasoning strengths of large language models break down: requiring minimal, BFS-optimal solutions under controlled difficulty scaling reveals brittleness that simple correctness checks miss. The combination of a per-puzzle difficulty N and a large set of optimal moves lets researchers separate errors caused by the length and depth of reasoning from errors caused by model architecture or pre-training mismatch. On Block World, fine-tuned pre-trained T5 drops from 97.27% validation accuracy to 81.00% out-of-distribution accuracy, a gap of 16.27 percentage points that quantifies the brittleness. Equally striking is the universal 0.00% score on River Crossing, which signals that some symbolic transition structures are effectively out of reach for the tested sequence-model families.

The paper's failure-mode findings matter for anyone using sequence models on algorithmic or symbolic tasks: architecture and the local structure of state transitions can dominate whether pre-training helps at all.

What to watch

Look for the authors to release the code and dataset upon acceptance, which will enable replication and broader stress-testing of architectures across the four puzzles. The next concrete milestone is whether other model families or training regimes can move River Crossing above 0.00% and whether pre-training transfer can be extended beyond locally structured transition functions.

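Benchmark snapshot: Block World and River Crossing results

Model | Block World validation accuracy | Block World OOD accuracy | River Crossing accuracy
Encoder-decoder (T5-style, fine-tuned pre-trained) | 97.27% | 81.00% | 0.00%
Decoder-only (GPT-2-style) | N/A | N/A | 0.00%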

Written by The Brieftide · Source: arXiv
