Policy gradient vs game-theory: MIT benchmark shows generalists
MIT researchers present a benchmarking suite showing policy gradient-trained neural networks beat game-theory algorithms on five imperfect-information games.
TL;DR
- MIT researchers present a benchmarking suite showing policy gradient-trained neural networks beat game-theory algorithms on five imperfect-information games.
- The work, published June 17, 2026 and presented in April at the International Conference on Learning Representations in Rio de Janeiro, also releases the benchmark code for others to use.
- They ran experiments on five imperfect-information, two-player zero-sum games and measured performance with exploitability, a worst-case opponent metric.
MIT researchers presented a benchmarking suite showing that policy gradient-trained neural networks outperformed specialized game-theoretic algorithms in experiments on five imperfect-information games. The work, published June 17, 2026 and presented in April at the International Conference on Learning Representations in Rio de Janeiro, also releases the benchmark code for others to use.
What did the researchers test?
They ran experiments on five imperfect-information, two-player zero-sum games and measured performance with exploitability, a worst-case opponent metric. The five games were two versions of Phantom Tic-Tac-Toe, two imperfect-information variants of Hex, and Liar’s Dice. The team includes Sobhan Mohammadpour and Gabriele Farina from MIT, plus co-authors Max Rudolph, Nathan Lichtlé, Alexandre Bayen, J. Zico Kolter, Amy X. Zhang, Eugene Vinitsky, and Samuel Sokota.
How did the benchmark work and what were the results?
The benchmark compares algorithms by computing exploitability and then testing agents head-to-head; lower exploitability means closer to perfect play. The researchers pushed exploitability measurement to games that can include as many as 30 billion states, whereas previous work typically used exploitability on games about 100,000 times smaller. In those experiments, neural networks trained with policy gradient methods achieved lower exploitability scores than networks trained using game-theory-based algorithms, and policy gradient agents beat the game-theory agents in subsequent head-to-head matches. Samuel Sokota summarized the finding: "Our study showed that policy gradient methods can work better than these specialized algorithms." The team also emphasizes that their benchmark is intended as a neutral testing ground rather than a proposal for a new winning algorithm.
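To make the head-to-head step concrete, here is a minimal Python sketch. It is not the benchmark's code: it assumes a tiny matrix game (matching pennies) in place of the paper's games, and the names `expected_payoff` and `head_to_head`, along with the dict-based agent representation, are hypothetical. Each agent supplies one strategy per seat, and the score is averaged over both seats.

```python
from fractions import Fraction

# Matching pennies, payoffs to the row player (row wins when the coins match).
# Zero-sum: the column player receives the negative of each entry.
MATCHING_PENNIES = [[1, -1],
                    [-1, 1]]


def expected_payoff(payoff, row_strategy, col_strategy):
    """Expected payoff to the row player under two mixed strategies."""
    return sum(
        row_strategy[i] * col_strategy[j] * payoff[i][j]
        for i in range(len(row_strategy))
        for j in range(len(col_strategy))
    )


def head_to_head(payoff, agent_a, agent_b):
    """Average payoff to agent A when the agents play each seat once.

    Each agent is a dict with a "row" and a "col" mixed strategy
    (a hypothetical representation chosen for this sketch).
    """
    a_as_row = expected_payoff(payoff, agent_a["row"], agent_b["col"])
    # In the column seat, agent A receives the negative of the row payoff.
    a_as_col = -expected_payoff(payoff, agent_b["row"], agent_a["col"])
    return Fraction(a_as_row + a_as_col, 2)


half = Fraction(1, 2)
uniform = {"row": [half, half], "col": [half, half]}   # equilibrium play
always_heads = {"row": [1, 0], "col": [1, 0]}          # predictable agent
counter = {"row": [1, 0], "col": [0, 1]}               # tailored to always_heads

print(head_to_head(MATCHING_PENNIES, uniform, always_heads))  # 0
print(head_to_head(MATCHING_PENNIES, counter, always_heads))  # 1
print(head_to_head(MATCHING_PENNIES, counter, uniform))       # 0
```

In this toy setting the tailored agent beats the predictable one but gains nothing against equilibrium play, which illustrates why a head-to-head score depends on the opponent chosen and why it is useful to report it alongside exploitability.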
The benchmarking software is freely available and integrates with OpenSpiel. Sobhan Mohammadpour noted users do not need large clusters to run it, saying, "You don't need a supercomputer. You can run it on an ordinary laptop."
Why does exploitability matter here?
Exploitability measures how well an agent fares against the worst-case adversary, making it a strict, adversarial yardstick for strategic play. The researchers focused on exploitability because it captures how close an agent is to optimal play when opponents can fully exploit predictable strategies. Pushing exploitability measurement to games with up to 30 billion states exposed differences that smaller-scale benchmarks had missed, and the results challenged the long-standing assumption that specialized game-theoretic algorithms would necessarily dominate policy gradient methods in imperfect-information settings.
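As a rough illustration of the metric, the Python sketch below computes exploitability for rock-paper-scissors written as a payoff matrix. This is a toy under stated assumptions, not the researchers' implementation: the paper's games have hidden information and far larger state spaces, where finding a best response requires searching the game tree. The function name `exploitability` and the convention used here (the two players' best-response gains, averaged) are choices made for this sketch.

```python
from fractions import Fraction

# Rock-paper-scissors payoffs to the row player; the game is zero-sum,
# so the column player receives the negative of each entry.
# Action order for both players: rock, paper, scissors.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]


def exploitability(payoff, row_strategy, col_strategy):
    """Average gain available to each player from switching to a best response.

    Assumed convention for this sketch: the sum of both players'
    best-response gains, divided by two. Zero means neither player
    can be exploited.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Best payoff the row player can reach against the fixed column strategy.
    row_best = max(
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    )
    # Lowest row payoff the column player can force against the fixed row strategy.
    col_best = min(
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    )
    # The value of the current profile cancels when the two gains are added.
    return Fraction(row_best - col_best, 2)


third = Fraction(1, 3)
uniform = [third, third, third]
always_rock = [1, 0, 0]
rock_heavy = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

print(exploitability(RPS, uniform, uniform))          # 0
print(exploitability(RPS, always_rock, always_rock))  # 1
print(exploitability(RPS, rock_heavy, uniform))       # 1/8
```

The uniform strategy scores zero because no opponent can gain against it, while the more predictable strategies score higher. Lower values therefore indicate play closer to optimal.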
Why it matters
The findings shift a common assumption in multi-agent learning: general-purpose policy gradient methods can outperform specialized game-theory algorithms in certain strategic settings. That matters beyond recreational games because the paper frames "game" as any multi-agent strategic interaction with hidden information, including negotiations, trading scenarios, and military operations. The practical implications are twofold: researchers need robust, scale-aware benchmarks to compare algorithms fairly, and practitioners should reconsider algorithm choices for large-scale imperfect-information problems.
What to watch
Adoption and independent replication of the released benchmark in OpenSpiel will be the immediate test. Success would look like other groups reproducing lower exploitability for policy gradient agents on the same five games, and then extending the benchmark to additional multi-agent domains with hidden information.
Paper: "Reevaluating policy gradient methods for imperfect-information games." Presented in April at ICLR; published June 17, 2026. Key figures: five games tested; exploitability calculations scaled to games with as many as 30 billion states. Authors: Sobhan Mohammadpour and Gabriele Farina (MIT), Max Rudolph, Nathan Lichtlé, Alexandre Bayen, J. Zico Kolter, Amy X. Zhang, Eugene Vinitsky, Samuel Sokota.
| Item | Policy gradient methods | Game-theoretic algorithms |
|---|---|---|
| Benchmark outcome (five games) | Lower exploitability, won head-to-head | Higher exploitability, lost head-to-head |
| Games tested | Phantom Tic-Tac-Toe variants; Hex variants; Liar's Dice | Phantom Tic-Tac-Toe variants; Hex variants; Liar's Dice |
| Scales handled | Exploitability measured up to 30 billion states | Previous exploitability work used games ~100,000 times smaller |
| Availability | Benchmark code released, integrates with OpenSpiel | Evaluated within the same released benchmark |
Written by The Brieftide · Source: MIT News · AI