Open Source AI · June 17, 2026 · 5 min read

Policy gradient vs game-theory: MIT benchmark shows generalists

MIT researchers present a benchmarking suite showing policy gradient-trained neural networks beat game-theory algorithms on five imperfect-information games.

The Brieftide · June 17, 2026

TL;DR

01 MIT researchers present a benchmarking suite showing policy gradient-trained neural networks beat game-theory algorithms on five imperfect-information games.
02 The work, published June 17, 2026 and presented in April at the International Conference on Learning Representations in Rio de Janeiro, also releases the benchmark code for others to use.
03 They ran experiments on five imperfect-information, two-player zero-sum games and measured performance with exploitability, a worst-case opponent metric.

MIT researchers presented a benchmarking suite that showed policy gradient-trained neural networks outperformed specialized game-theoretic algorithms in experiments on five imperfect-information games. The work, published June 17, 2026 and presented in April at the International Conference on Learning Representations in Rio de Janeiro, also releases the benchmark code for others to use.

What did the researchers test?

They ran experiments on five imperfect-information, two-player zero-sum games and measured performance with exploitability, a worst-case opponent metric. The five games were two versions of Phantom Tic-Tac-Toe, two imperfect-information variants of Hex, and Liar’s Dice. The team includes Sobhan Mohammadpour and Gabriele Farina from MIT, plus co-authors Max Rudolph, Nathan Lichtlé, Alexandre Bayen, J. Zico Kolter, Amy X. Zhang, Eugene Vinitsky, and Samuel Sokota.

How did the benchmark work and what were the results?

The benchmark compares algorithms by computing exploitability and then testing agents head-to-head; lower exploitability means closer to perfect play. The researchers pushed exploitability measurement to games that can include as many as 30 billion states, whereas previous work typically used exploitability on games about 100,000 times smaller. In those experiments, neural networks trained with policy gradient methods achieved lower exploitability scores than networks trained using game-theory-based algorithms, and policy gradient agents beat the game-theory agents in subsequent head-to-head matches. Samuel Sokota summarized the finding: "Our study showed that policy gradient methods can work better than these specialized algorithms." The team also emphasizes that their benchmark is intended as a neutral testing ground rather than a proposal for a new winning algorithm.
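To make the term "policy gradient" concrete, here is a minimal sketch of the update on a toy problem. It assumes a softmax policy over three actions playing rock-paper-scissors against a fixed, hypothetical opponent. This is not the training setup from the study, which used neural networks on far larger games; the payoff table, opponent mix, learning rate, and function names are all illustrative assumptions.

```python
import math
from typing import List

# Row player's payoffs in rock-paper-scissors (illustrative, not from the study).
PAYOFF: List[List[float]] = [
    [0.0, -1.0, 1.0],   # rock     vs rock, paper, scissors
    [1.0, 0.0, -1.0],   # paper
    [-1.0, 1.0, 0.0],   # scissors
]


def softmax(logits: List[float]) -> List[float]:
    """Turn unnormalized logits into action probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits: List[float], opponent: List[float], lr: float) -> List[float]:
    """One exact gradient-ascent step on expected payoff for a softmax policy."""
    probs = softmax(logits)
    # Expected payoff of each action against the opponent's mixed strategy.
    action_values = [
        sum(q * PAYOFF[a][c] for c, q in enumerate(opponent))
        for a in range(len(PAYOFF))
    ]
    value = sum(p * v for p, v in zip(probs, action_values))
    # For a softmax policy: d(value)/d(logit_a) = probs[a] * (action_values[a] - value).
    return [
        theta + lr * p * (v - value)
        for theta, p, v in zip(logits, probs, action_values)
    ]


def train(opponent: List[float], steps: int = 500, lr: float = 1.0) -> List[float]:
    """Run repeated policy gradient steps from a uniform starting policy."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = policy_gradient_step(logits, opponent, lr)
    return softmax(logits)


if __name__ == "__main__":
    # Hypothetical opponent that plays rock 60% of the time.
    learned = train([0.6, 0.2, 0.2])
    print([round(p, 3) for p in learned])
```

Against this rock-heavy opponent the policy shifts almost all of its probability to paper, the best response. A fixed opponent is a deliberate simplification; the sketch says nothing about how such updates behave when both players adapt at once.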

The benchmarking software is freely available and integrates with OpenSpiel. Sobhan Mohammadpour noted users do not need large clusters to run it, saying, "You don't need a supercomputer. You can run it on an ordinary laptop."

Why does exploitability matter here?

Exploitability measures how well an agent fares against the worst-case adversary, making it a strict, adversarial yardstick for strategic play. The researchers focused on exploitability because it captures how close an agent is to optimal play when opponents can fully exploit predictable strategies. Pushing exploitability measurement to games with up to 30 billion states exposed differences that smaller-scale benchmarks had missed, and the results challenged the long-standing assumption that specialized game-theoretic algorithms would necessarily dominate policy gradient methods in imperfect-information settings.
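As a small illustration of the metric, the sketch below computes exploitability for a pair of mixed strategies in rock-paper-scissors. It assumes a common convention for two-player zero-sum games: average, over both seats, what a best-responding opponent could win. The study's games are vastly larger and involve hidden information, so this shows only the idea, not the benchmark's code; the payoff table and function names are hypothetical.

```python
from typing import List

# Row player's payoffs; the column player receives the negation (zero-sum).
RPS: List[List[float]] = [
    [0.0, -1.0, 1.0],   # rock     vs rock, paper, scissors
    [1.0, 0.0, -1.0],   # paper
    [-1.0, 1.0, 0.0],   # scissors
]


def best_response_value_vs_row(payoff: List[List[float]], row_strategy: List[float]) -> float:
    """Highest payoff the column player can secure against row_strategy."""
    n_cols = len(payoff[0])
    col_values = [
        -sum(p * payoff[r][c] for r, p in enumerate(row_strategy))
        for c in range(n_cols)
    ]
    return max(col_values)


def best_response_value_vs_col(payoff: List[List[float]], col_strategy: List[float]) -> float:
    """Highest payoff the row player can secure against col_strategy."""
    row_values = [
        sum(q * payoff[r][c] for c, q in enumerate(col_strategy))
        for r in range(len(payoff))
    ]
    return max(row_values)


def exploitability(
    payoff: List[List[float]], row_strategy: List[float], col_strategy: List[float]
) -> float:
    """Average best-response gain over both seats; zero at a Nash equilibrium."""
    return 0.5 * (
        best_response_value_vs_row(payoff, row_strategy)
        + best_response_value_vs_col(payoff, col_strategy)
    )


if __name__ == "__main__":
    uniform = [1.0 / 3.0] * 3
    always_rock = [1.0, 0.0, 0.0]
    print("uniform:", exploitability(RPS, uniform, uniform))
    print("always rock:", exploitability(RPS, always_rock, always_rock))
```

The uniform strategy cannot be exploited, so its score is zero, while the predictable always-rock strategy loses a full point per game to an opponent who answers with paper. Lower is better, which matches how the article reads the benchmark's scores.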

Why it matters

The findings shift a common assumption in multi-agent learning: general-purpose policy gradient methods can outperform specialized game-theory algorithms in certain strategic settings. That matters beyond recreational games because the paper frames "game" as any multi-agent strategic interaction with hidden information, including negotiations, trading scenarios, and military operations. The practical implications are twofold: researchers need robust, scale-aware benchmarks to compare algorithms fairly, and practitioners should reconsider algorithm choices for large-scale imperfect-information problems.

What to watch

Adoption and independent replication of the released benchmark in OpenSpiel will be the immediate test. Success would look like other groups reproducing lower exploitability for policy gradient agents on the same five games, and then extending the benchmark to additional multi-agent domains with hidden information.

Paper: "Reevaluating policy gradient methods for imperfect-information games." Presented in April at ICLR; published June 17, 2026. Key figures: five games tested; exploitability calculations scaled to games with as many as 30 billion states. Authors: Sobhan Mohammadpour and Gabriele Farina (MIT), Max Rudolph, Nathan Lichtlé, Alexandre Bayen, J. Zico Kolter, Amy X. Zhang, Eugene Vinitsky, Samuel Sokota.

Benchmark comparison: policy gradient methods vs game-theoretic algorithms

Item	Policy gradient methods	Game-theoretic algorithms
Benchmark outcome (five games)	Lower exploitability, won head-to-head	Higher exploitability, lost head-to-head
Games tested	Phantom Tic-Tac-Toe variants; Hex variants; Liar's Dice	Phantom Tic-Tac-Toe variants; Hex variants; Liar's Dice
Scales handled	Exploitability measured up to 30 billion states	Previous exploitability work used games ~100,000 times smaller
Availability	Benchmark code released, integrates with OpenSpiel	Evaluated within the same released benchmark

Written by The Brieftide · Source: MIT News · AI
