Age of LLM benchmark: 1v1 reasoning, diplomacy, reliability
Arnaud Ricci's Age of LLM runs 54 matches and 5,258 actions to test 15 LLMs under fog of war, diplomacy and strict JSON reliability.
TL;DR
- Arnaud Ricci's Age of LLM runs 54 matches and 5,258 actions to test 15 LLMs under fog of war, diplomacy and strict JSON reliability.
- Arnaud Ricci's Age of LLM, submitted 23 Jun 2026, introduces a turn-based 1v1 benchmark where two large language models face off on a 13x7 grid to destroy the enemy base.
- Ricci built a reliability layer so every turn must follow a strict JSON schema; any illegal action is silently discarded.
Arnaud Ricci's Age of LLM, submitted 23 Jun 2026, introduces a turn-based 1v1 benchmark where two large language models face off on a 13x7 grid to destroy the enemy base. The benchmark ran 54 matches, recorded 5,258 actions and evaluated 15 reasoning models under three deliberate stressors: fog of war, full diplomacy and a strict JSON-based reliability rule that silently discards illegal actions.
What is Age of LLM?
Age of LLM is a strategic game benchmark designed to expose how LLMs reason and behave under adversarial uncertainty: it is a 1v1, turn-based match on a 13x7 grid with fog of war, explicit diplomacy channels and hidden uranium for nuclear actions. The engine is private and each match uses a fresh random map seed and opponent to mitigate data contamination, while models receive a near rule-only prompt with no build-order advice (two tactical seed phrases were present during data collection, see Section 2.7).
Ricci built a reliability layer so every turn must follow a strict JSON schema; any illegal action is silently discarded. The paper includes 25 pages, 8 figures and 4 tables, and appends the verbatim system prompt and engine resolution pseudocode. All correlations are reported with p-values, 95% bootstrap confidence intervals and Spearman's rho, with additional tests including a Steiger test and a Bradley-Terry fit.
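To make the reliability rule concrete, here is a minimal sketch of a validator that silently drops malformed actions. Only the 13x7 grid and the silent-discard behaviour come from the brief; the field names (`actions`, `type`, `x`, `y`, `text`) and the action vocabulary are hypothetical, and the paper's actual schema (given in its appended system prompt) may look quite different.

```python
import json

# The 13x7 grid is from the brief; everything else here is an assumption.
GRID_W, GRID_H = 13, 7
ALLOWED_TYPES = {"move", "attack", "build", "message"}  # hypothetical vocabulary


def is_legal(action) -> bool:
    """Check one action against the hypothetical schema."""
    if not isinstance(action, dict):
        return False
    kind = action.get("type")
    if not isinstance(kind, str) or kind not in ALLOWED_TYPES:
        return False
    if kind == "message":
        return isinstance(action.get("text"), str)
    x, y = action.get("x"), action.get("y")
    for value in (x, y):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            return False
    return 0 <= x < GRID_W and 0 <= y < GRID_H


def parse_turn(raw: str) -> list:
    """Return only the well-formed actions from a model's raw turn output.

    Unparseable JSON yields an empty list, and malformed actions are dropped
    without raising, which mimics a silent discard.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    actions = payload.get("actions")
    if not isinstance(actions, list):
        return []
    return [action for action in actions if is_legal(action)]
```

In this sketch the model gets no error message back, so the ratio of kept to submitted actions is the only trace of a mistake. A real engine would also check legality against game state (for example, targeting a tile hidden by fog), which this format-only check does not attempt.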
What did the benchmark find?
The core findings are empirical: the nuclear rush dominates, military conquest is rarer but faster, diplomacy is frequent yet rarely consummated, and many illegal actions reflect belief-tracking failures. The nuclear rush shows a 78% rate on the rules-coherent v0.11+ sub-corpus and an 85% rate corpus-wide. Military conquest occurs less often but resolves faster (12.3 versus 18.9 turns). Diplomacy generates many messages, ceasefires and ultimatums, yet it is almost never consummated. Approximately 58% of illegal actions are fog/state errors, which the author frames as a measure of belief-tracking.
Ricci also flags a tentative link between reliability and winning: a weak, exploratory association is reported tying adherence to the JSON reliability requirement with match success, but the paper labels this as the least established result. The corpus itself is small, unbalanced and not side-swapped, so the ranking of models is presented as a preliminary descriptive view rather than a definitive leaderboard. The author releases the replay format, an isometric viewer and all replays, and offers engine source on request.
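For readers who want to see what a rank correlation with a bootstrap interval involves, the sketch below computes Spearman's rho and a percentile bootstrap confidence interval using only the standard library. It is an illustration of the general technique, not the paper's analysis code: the function names are ours, and the resampling unit and bootstrap variant the author used are not stated in this brief.

```python
import random
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation, or None when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([xs[i] for i in idx], [ys[i] for i in idx])
        if rho is not None:  # skip degenerate resamples with no variance
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

With `xs` as a per-model reliability score and `ys` as a per-model win rate, `spearman(xs, ys)` gives the point estimate and `bootstrap_ci(xs, ys)` the interval. With only 15 models, such an interval will tend to be wide, which is consistent with the author labelling this result exploratory.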
Why it matters
Age of LLM shifts evaluation from static benchmarks to interactive, adversarial settings that stress planning, hidden information and communication. The dominance of the nuclear rush under secret-simultaneous launch rules, described as a largely mechanical sole-launcher signature rather than a pure deterrence failure, reveals how environment rules can produce brittle, predictable strategies. The high share of fog/state illegal actions (about 58%) points to persistent gaps in belief-tracking and state awareness, which are relevant for any application where models must act on limited, evolving information.
Framing reliability as a measurable axis via a strict JSON schema is pragmatic: it converts a soft concept into a testable constraint and uncovers how often models violate required formats under pressure. The dataset of turn-by-turn traces and messages provides a lens on spontaneous deception, per-model decision patterns and how diplomacy emerges or fails in LLM play, all useful for researchers studying multiagent interaction and trust.
What to watch
Look for broader, balanced corpora with side-swapped matches, which would address the paper's small, unbalanced sample and test whether the nuclear-rush dominance and the reliability–winning link persist. Also watch for a public release of the engine source, currently offered on request, and for follow-up work that uses the released replays and viewer to reproduce or extend the reported p-values, 95% bootstrap confidence intervals and Bradley-Terry analyses.
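Anyone reproducing the ranking from the released replays would need a Bradley-Terry fit over pairwise results. The sketch below uses the standard minorization-maximization update for that model; the function name and the win-matrix layout are our assumptions, and the paper's own fitting procedure is not described in this brief.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a square win-count matrix.

    wins[i][j] is the number of times player i beat player j.
    Returns strengths normalized to sum to 1. Under the model, the
    probability that i beats j is p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = 0.0
            for j in range(n):
                if j == i or p[i] + p[j] == 0:
                    continue
                # Games played between i and j, weighted by current strengths.
                denom += (wins[i][j] + wins[j][i]) / (p[i] + p[j])
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        if total == 0:  # no recorded wins at all; nothing to fit
            return p
        p = [value / total for value in new_p]
    return p
```

A model with zero wins is driven to a strength of 0 by this update, and a sparse or disconnected match graph leaves strengths poorly determined. Both are real risks in a 54-match corpus spread over 15 models, which is one reason balanced, side-swapped follow-ups matter.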
Written by The Brieftide · Source: arXiv