Benchmarks & Evals · June 18, 2026 · 6 min read

RTSGameBench: RTS benchmark for strategic reasoning by VLMs

RTSGameBench evaluates vision-language models in Beyond All Reason using mini-games.

The Brieftide · June 18, 2026

TL;DR

01 RTSGameBench evaluates vision-language models in Beyond All Reason using mini-games.
02 RTSGameBench, submitted to arXiv on 17 June 2026 (arXiv:2606.18950), is a new benchmark designed to probe strategic reasoning in vision-language models.
03 RTSGameBench is a multifaceted benchmark that combines diverse gameplay, diagnostic mini-games and an extensible generation pipeline.

RTSGameBench, submitted to arXiv on 17 June 2026 (arXiv:2606.18950), is a new benchmark designed to probe strategic reasoning in vision-language models. The paper, authored by San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa and Jonghyun Choi, builds its evaluation on Beyond All Reason, a large-scale RTS game that expands the battlefield and strategy diversity.

What is RTSGameBench and what does it include?

RTSGameBench is a multifaceted benchmark that combines diverse gameplay, diagnostic mini-games and an extensible generation pipeline. The benchmark is built on Beyond All Reason, provides matchup-structured evaluations, includes mini-games each targeting a single strategic competency, and adds a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles.

The authors frame real-time strategy games as a natural testbed because they require coordination with allies, adaptation to opponents, and long-horizon planning under partial observability. RTSGameBench aims to diagnose which of those competencies vision-language models struggle with.
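To make the idea of competency-level diagnosis concrete, here is a minimal sketch of how per-episode outcomes could be grouped by the single competency each mini-game targets. The record fields, mini-game names, competency labels and the example episodes are all hypothetical illustrations, not the paper's schema or its reported results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniGameResult:
    """One mini-game episode; field names are hypothetical."""
    mini_game: str    # illustrative identifier, not from the paper
    competency: str   # the single competency this mini-game targets
    success: bool     # whether the model met the mini-game's goal


def success_rate_by_competency(results):
    """Return {competency: fraction of successful episodes}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in results:
        totals[r.competency] += 1
        wins[r.competency] += int(r.success)
    return {c: wins[c] / totals[c] for c in totals}


# Made-up episodes purely for illustration; these are not reported results.
example_results = [
    MiniGameResult("ally_support_01", "coordination", False),
    MiniGameResult("ally_support_02", "coordination", True),
    MiniGameResult("scout_fog_01", "partial_observability", True),
    MiniGameResult("long_build_01", "long_horizon_planning", False),
]

rates = success_rate_by_competency(example_results)
weakest = min(rates, key=rates.get)  # competency with the lowest success rate
```

Because each mini-game maps to exactly one competency, a low rate in one bucket points at that skill rather than at a mix of confounded factors.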

How does RTSGameBench evaluate vision-language models?

RTSGameBench evaluates models through three linked mechanisms: diverse matchups across an expanded battlefield, targeted mini-games for diagnostic assessment, and an iterative generator that creates new scenarios from free-form queries. The benchmark also includes RTSGameAgent, an agent that manages units using a finite-state machine with agentic memory to let VLMs operate at scale.
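The paper's generator is described only at a high level here, so the following is a loose sketch of one way a query-to-scenario loop that improves over cycles might look. The `propose` and `critique` callables, the scoring scheme and the toy stand-ins are assumptions made for illustration; they are not the authors' pipeline.

```python
from typing import Callable, Optional


def evolve_mini_game(
    query: str,
    propose: Callable[[str, Optional[str]], str],
    critique: Callable[[str], tuple[float, str]],
    cycles: int = 3,
) -> tuple[str, float]:
    """Keep the best-scoring scenario seen across a fixed number of cycles."""
    best_spec, best_score = "", float("-inf")
    feedback = None
    for _ in range(cycles):
        spec = propose(query, feedback)    # draft or revise a scenario
        score, feedback = critique(spec)   # score it and collect feedback
        if score > best_score:
            best_spec, best_score = spec, score
    return best_spec, best_score


# Toy stand-ins so the sketch runs without a model or a game engine.
def toy_propose(query, feedback):
    base = f"scenario for: {query}"
    return base if feedback is None else f"{base} [revised: {feedback}]"


def toy_critique(spec):
    # Pretend revised scenarios are better; a real critic might run the game.
    score = 1.0 if "revised" in spec else 0.5
    return score, "add a clear win condition"


spec, score = evolve_mini_game(
    "defend a base with two allies", toy_propose, toy_critique
)
```

In a real setting, `propose` would presumably call a model and `critique` would validate the scenario in the game, but both are placeholders in this sketch.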

The mini-games isolate specific strategic skills so the benchmark can attribute failures to particular competencies. The self-evolving generation framework converts free-form queries into new mini-games and refines those generated scenarios over successive cycles. RTSGameAgent provides a practical layer for managing many units: it is implemented as a finite-state machine (FSM) augmented with agentic memory, which bridges VLM outputs and in-game control.
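As a rough illustration of what a finite-state machine with a small memory could look like for a single unit, consider the sketch below. The states, events, transition table and memory format are hypothetical and chosen for readability; the source does not describe RTSGameAgent's actual states or interfaces.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto


class UnitState(Enum):
    IDLE = auto()
    MOVING = auto()
    ATTACKING = auto()
    RETREATING = auto()


# Allowed transitions: (current state, event) -> next state. All hypothetical.
TRANSITIONS = {
    (UnitState.IDLE, "order_move"): UnitState.MOVING,
    (UnitState.MOVING, "enemy_sighted"): UnitState.ATTACKING,
    (UnitState.MOVING, "arrived"): UnitState.IDLE,
    (UnitState.ATTACKING, "low_health"): UnitState.RETREATING,
    (UnitState.ATTACKING, "enemy_destroyed"): UnitState.IDLE,
    (UnitState.RETREATING, "arrived"): UnitState.IDLE,
}


@dataclass
class UnitController:
    unit_id: int
    state: UnitState = UnitState.IDLE
    # Bounded log of recent transitions that could be summarised for a model.
    memory: deque = field(default_factory=lambda: deque(maxlen=5))

    def handle(self, event: str) -> UnitState:
        """Apply an event; unknown (state, event) pairs leave the state unchanged."""
        next_state = TRANSITIONS.get((self.state, event), self.state)
        self.memory.append((self.state.name, event, next_state.name))
        self.state = next_state
        return self.state

    def memory_summary(self) -> str:
        """Render the remembered transitions as one compact line of text."""
        return "; ".join(f"{s} --{e}--> {n}" for s, e, n in self.memory)


unit = UnitController(unit_id=1)
for event in ["order_move", "enemy_sighted", "low_health", "arrived"]:
    unit.handle(event)
```

The appeal of this split is that the model only needs to emit high-level events or orders, while the state machine handles per-unit bookkeeping and the memory keeps a short history to feed back into later prompts.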

What did the authors find when testing state-of-the-art VLMs?

The paper reports that multiple state-of-the-art vision-language models underperform as task demands increase in coordination and scale. In particular, models struggled when matchups demanded tighter multiagent coordination and larger task scale. Those empirical results are the benchmark's core validation point and motivate the mini-game diagnostics and the self-evolving generator.

The submission notes that existing RTS benchmarks offer limited evaluation scope, fixed scenario coverage and lack systematic competency diagnosis. RTSGameBench intends to address those gaps by widening scenario diversity through Beyond All Reason and by making the benchmark extensible.

Why it matters

RTSGameBench exposes gaps in strategic reasoning that standard VLM evaluations miss: long-horizon planning, partial observability and multiagent coordination. By combining focused mini-games with a generator that expands scenarios, the benchmark creates a repeatable path to probe specific failures and improvements. Systems that claim broad visual and linguistic competence will face a stronger, more targeted test as a result.

The inclusion of an agent layer, RTSGameAgent, means the benchmark does not stop at high-level prompts: it provides an operational route to test models in large-scale, multiunit settings where coordination and memory matter.

What to watch

Watch whether the self-evolving generation framework demonstrably improves diagnostic coverage over successive cycles and whether RTSGameBench attracts published leaderboards or model submissions. The next concrete signal will be comparative evaluations showing improved performance on the mini-games that target multiagent coordination.

Authors and submission details: the paper is arXiv:2606.18950, submitted 17 June 2026; the first two authors, San Kim and Daechul Ahn, contributed equally.

Written by The Brieftide · Source: arXiv

