RecurrReason benchmark: Recurrent Reasoning on symbolic puzzles
A 10,817-puzzle dataset tests Tower of Hanoi, River Crossing, Block World and Checkers Jumping with BFS-optimal trajectories.
TL;DR
- A 10,817-puzzle dataset tests Tower of Hanoi, River Crossing, Block World and Checkers Jumping with BFS-optimal trajectories.
- The dataset contains 10,817 unique puzzles and 285,933 moves, each with BFS-optimal trajectories and a single interpretable difficulty parameter N in {1,…,10}.
- The paper states the code and dataset will be open-sourced upon acceptance.
Gowrav Mannem and co-authors posted "Recurrent Reasoning on Symbolic Puzzles with Sequence Models" to arXiv on 19 Apr 2026, introducing RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles. The dataset contains 10,817 unique puzzles and 285,933 moves, each with BFS-optimal trajectories and a single interpretable difficulty parameter N in {1,…,10}.
What is RecurrReason?
RecurrReason is a labelled, difficulty-controlled benchmark composed of Tower of Hanoi, River Crossing, Block World, and Checkers Jumping puzzles, totalling 10,817 unique instances and 285,933 moves with BFS-optimal trajectories. The benchmark exposes a single integer difficulty parameter N in {1,…,10} so researchers can scale problems in a controlled way and evaluate robustness across in-distribution and harder out-of-distribution settings.
The authors designed the suite to address weaknesses in existing reasoning benchmarks, namely that many only check whether a model can produce a valid answer rather than whether solutions are minimal, robust, and stable under controlled difficulty scaling. The paper states the code and dataset will be open-sourced upon acceptance.
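To make "BFS-optimal trajectory" concrete, here is a minimal sketch of breadth-first search over Tower of Hanoi states. It is an illustration only, not the authors' generator, which has not been released. The function name `hanoi_bfs`, the state encoding, and the reading of N as the number of disks are all assumptions made for this example.

```python
from collections import deque


def hanoi_bfs(n_disks, n_pegs=3):
    """Return a shortest list of (src, dst) moves from peg 0 to the last peg.

    Hypothetical helper for illustration; treating N as the disk count is
    an assumption, not something the brief confirms.
    """
    # A state is a tuple of pegs; each peg lists its disks bottom to top,
    # with larger integers standing for larger disks.
    full_stack = tuple(range(n_disks, 0, -1))
    start = (full_stack,) + ((),) * (n_pegs - 1)
    goal = ((),) * (n_pegs - 1) + (full_stack,)

    # parent maps each visited state to (previous state, move taken).
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        for src in range(n_pegs):
            if not state[src]:
                continue  # nothing to move from an empty peg
            disk = state[src][-1]
            for dst in range(n_pegs):
                if dst == src:
                    continue
                if state[dst] and state[dst][-1] < disk:
                    continue  # cannot place a disk on a smaller one
                pegs = list(state)
                pegs[src] = state[src][:-1]
                pegs[dst] = state[dst] + (disk,)
                nxt = tuple(pegs)
                if nxt not in parent:
                    parent[nxt] = (state, (src, dst))
                    queue.append(nxt)

    # Walk parent links back from the goal to recover the move sequence.
    moves = []
    node = goal
    while parent[node] is not None:
        prev, move = parent[node]
        moves.append(move)
        node = prev
    moves.reverse()
    return moves


if __name__ == "__main__":
    print(len(hanoi_bfs(3)))  # 7, the known minimum 2**3 - 1
```

Because BFS visits states in order of distance from the start, the first path it finds to the goal is a shortest one, which is what makes such a trajectory usable as a minimality reference.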
How were sequence models tested and what were the results?
The paper benchmarks two Transformer families, an encoder-decoder model described as T5-style and a decoder-only model described as GPT-2-style, trained on instances with N = 1 to 7 and evaluated on held-out in-distribution instances plus harder out-of-distribution instances with N = 8 to 10. Fine-tuned pre-trained T5 achieves 97.27% validation accuracy and 81.00% out-of-distribution accuracy specifically on Block World.
By contrast, all models scored 0.00% on River Crossing under all conditions, indicating that some puzzle structures remain completely unsolved by the evaluated sequence models. The authors report that architecture choice was a stronger determinant of success than model scale, and that pre-training only transferred to puzzles whose transition functions are locally structured.
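The evaluation protocol described above (train on N = 1 to 7, test on held-out in-distribution instances and on N = 8 to 10) can be sketched as a difficulty-based split plus an accuracy function. The record layout with an `"n"` field, both function names, and the use of exact-match scoring are assumptions for illustration; the brief does not say how the paper computes accuracy.

```python
def split_by_difficulty(puzzles, train_max_n=7, ood_max_n=10):
    """Partition puzzle records into in-distribution and OOD sets by N.

    Assumes each record is a dict with an integer "n" field (hypothetical).
    """
    in_dist = [p for p in puzzles if 1 <= p["n"] <= train_max_n]
    ood = [p for p in puzzles if train_max_n < p["n"] <= ood_max_n]
    return in_dist, ood


def exact_match_accuracy(predictions, references):
    """Fraction of predicted move sequences identical to the reference.

    Exact match is an assumed metric, chosen here only for simplicity.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(1 for p, r in zip(predictions, references) if p == r)
    return hits / len(references)


if __name__ == "__main__":
    puzzles = [{"n": n} for n in range(1, 11)]
    in_dist, ood = split_by_difficulty(puzzles)
    print(len(in_dist), len(ood))  # 7 3
```

Splitting on N rather than at random is what makes the harder set genuinely out of distribution: no instance at N = 8 to 10 is ever seen in training.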
Why it matters
The benchmark exposes where apparent reasoning strengths of large language models break down: minimal, BFS-optimal solutions and controlled difficulty scaling reveal brittleness that simple correctness checks miss. The combination of per-puzzle difficulty N and a large set of optimal moves lets researchers separate errors caused by length and depth of reasoning from errors caused by model architecture or pre-training mismatch. On Block World, the drop from 97.27% validation accuracy to 81.00% out-of-distribution accuracy quantifies how much performance degrades once difficulty exceeds the training range. Equally striking is the universal 0.00% score on River Crossing, which signals that some symbolic transition structures are effectively out of reach for the tested sequence-model families.
The paper's failure-mode findings matter for anyone using sequence models on algorithmic or symbolic tasks: architecture and the local structure of state transitions can dominate whether pre-training helps at all.
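The difference between a merely valid answer and a minimal one can be illustrated with a small checker for three-peg Tower of Hanoi. This is a sketch under stated assumptions, not the paper's scoring code: the function name, the (src, dst) move format, and the returned dictionary are hypothetical. The optimal length 2^n - 1 for three pegs is a standard result.

```python
def check_hanoi_solution(n_disks, moves):
    """Replay moves on a 3-peg board and report validity, success, minimality.

    Hypothetical checker for illustration; moves are (src, dst) peg indices.
    """
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for src, dst in moves:
        legal_index = src in (0, 1, 2) and dst in (0, 1, 2) and src != dst
        if not legal_index or not pegs[src]:
            return {"valid": False, "solved": False, "optimal": False}
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            # A larger disk may never sit on a smaller one.
            return {"valid": False, "solved": False, "optimal": False}
        pegs[dst].append(pegs[src].pop())

    solved = pegs[2] == list(range(n_disks, 0, -1))
    # The minimum number of moves for 3-peg Hanoi is 2**n - 1.
    optimal = solved and len(moves) == 2 ** n_disks - 1
    return {"valid": True, "solved": solved, "optimal": optimal}


if __name__ == "__main__":
    shortest = [(0, 1), (0, 2), (1, 2)]
    padded = [(0, 1), (1, 0), (0, 1), (0, 2), (1, 2)]
    print(check_hanoi_solution(2, shortest))
    print(check_hanoi_solution(2, padded))
```

Both sequences in the usage example solve the two-disk puzzle, but only the first is minimal. A correctness-only check would score them the same, which is the distinction the benchmark is built to surface.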
What to watch
Look for the authors to release the code and dataset upon acceptance, which will enable replication and broader stress-testing of architectures across the four puzzles. The next concrete milestone is whether other model families or training regimes can move River Crossing above 0.00% and whether pre-training transfer can be extended beyond locally structured transition functions.
| Model | Block World validation accuracy | Block World OOD accuracy | River Crossing accuracy |
|---|---|---|---|
| Encoder-decoder (T5-style, fine-tuned pre-trained) | 97.27% | 81.00% | 0.00% |
| Decoder-only (GPT-2-style) | N/A | N/A | 0.00% |
Written by The Brieftide · Source: arXiv