Mastermind raises GPT-5.5 vulnerability pass rate to 84.5%
A strategy-grounded planner trained with SFT and milestone-based GRPO boosts repository-scale PoC reproduction on CyberGym.
TL;DR
- A strategy-grounded planner trained with SFT and milestone-based GRPO boosts repository-scale PoC reproduction on CyberGym.
- Mastermind, a strategy-grounded learning framework for repository-level vulnerability reproduction, appears in a paper submitted to arXiv on 2 Jul 2026.
- The authors evaluate Mastermind on CyberGym with 260 training tasks and 200 held-out evaluation tasks and report that its planner raises the pass rate to 84.5% when paired with GPT-5.5.
Mastermind, a strategy-grounded learning framework for repository-level vulnerability reproduction, appears in a paper submitted to arXiv on 2 Jul 2026. The authors evaluate Mastermind on CyberGym with 260 training tasks and 200 held-out evaluation tasks and report that its planner raises the pass rate to 84.5% when paired with GPT-5.5.
What is Mastermind and how does it work?
Mastermind is a dual-loop framework that separates transferable strategy learning from task-specific experience. The framework trains a planner via supervised fine-tuning (SFT) and milestone-based GRPO to learn reusable vulnerability-reproduction strategies, while an experience loop stores task-local strategy records to guide later attempts. The planner is trained independently of the executor, which lets the same learned strategies improve multiple frozen executors without changing their action-generation capability.
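To make the training signal more concrete, here is a minimal sketch of how group-relative advantages could be computed from milestone-based rewards. GRPO in its usual form scores each sampled output against the mean and standard deviation of its own sampling group. The milestone names, the equal-weight scoring in `milestone_reward`, and the sample group are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

# Hypothetical milestones for one reproduction task (assumed, not from the paper).
MILESTONES = ("locate_vulnerable_code", "build_target", "reach_sink", "trigger_crash")


def milestone_reward(reached):
    """Fraction of milestones reached; equal weighting is an assumption."""
    return sum(1 for m in MILESTONES if m in reached) / len(MILESTONES)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four strategies sampled by the planner for the same task,
# each represented by the set of milestones its rollout reached.
group = [
    {"locate_vulnerable_code"},
    {"locate_vulnerable_code", "build_target"},
    {"locate_vulnerable_code", "build_target", "reach_sink", "trigger_crash"},
    set(),
]
rewards = [milestone_reward(reached) for reached in group]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

Under this scheme a strategy that reaches more milestones than its siblings gets a positive advantage even when no sample fully reproduces the vulnerability, which is one plausible reason to prefer milestone rewards over a pass/fail signal.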
The paper frames strategy as the primary learning unit because it is compact enough to optimize, concrete enough to guide execution, and stable enough to store and reuse. The executor that performs actions is left frozen in experiments; the planner supplies higher-level strategies to steer that executor across attempts.
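The division of labor described above can be sketched as a small loop. This is a toy illustration under stated assumptions: `StrategyRecord`, `ExperienceStore`, `run_task`, and the stub planner and executor are hypothetical names invented here, and the paper's actual interfaces may differ.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyRecord:
    strategy: str
    passed: bool


@dataclass
class ExperienceStore:
    """Task-local memory: records are keyed by task and not shared across tasks."""
    records: dict = field(default_factory=dict)

    def add(self, task_id, record):
        self.records.setdefault(task_id, []).append(record)

    def for_task(self, task_id):
        return list(self.records.get(task_id, []))


def run_task(task_id, planner, executor, store, max_attempts=3):
    """Planner proposes a strategy, the frozen executor acts, the outcome is stored."""
    for _ in range(max_attempts):
        history = store.for_task(task_id)
        strategy = planner(task_id, history)
        passed = executor(task_id, strategy)
        store.add(task_id, StrategyRecord(strategy, passed))
        if passed:
            return True
    return False


# Stubs standing in for the trained planner and a frozen executor (hypothetical).
def toy_planner(task_id, history):
    tried = {record.strategy for record in history}
    for candidate in ("strategy A", "strategy B"):
        if candidate not in tried:
            return candidate
    return "strategy C"


def toy_executor(task_id, strategy):
    # Pretend only one strategy succeeds on this task.
    return strategy == "strategy B"


store = ExperienceStore()
solved = run_task("task-001", toy_planner, toy_executor, store)
print(solved, [(r.strategy, r.passed) for r in store.for_task("task-001")])
```

The executor is passed in as an opaque callable, so swapping one frozen executor for another requires no change to the planner or the store.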
How well does Mastermind perform on benchmarks?
On CyberGym, using 260 training tasks and 200 held-out evaluation tasks, Mastermind achieves an 84.5% pass rate with GPT-5.5 as the frozen executor. That 84.5% pass rate outperforms several baselines run with the same executor: open-book PoC context at 60.0%, Best-of-8 sampling at 63.0%, and iterative improvement at 77.0%.
The planner also transfers to other executors. The authors report that the same planner improves GPT-5.4 mini from 45.0% to 60.0% and GLM 5.1 from 58.5% to 71.0%. These numbers are presented as evidence that learning high-level strategies produces gains across different frozen executors.
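The reported percentages are easier to compare as task counts and percentage-point gains. The short script below does that arithmetic. It assumes each pass rate is the share of the 200 held-out tasks that passed, so the task counts are derived here rather than quoted from the paper.

```python
EVAL_TASKS = 200  # held-out evaluation tasks reported for CyberGym

# Pass rates (%) with GPT-5.5 as the frozen executor, as reported.
gpt55_rates = {
    "open-book PoC context": 60.0,
    "Best-of-8 sampling": 63.0,
    "iterative improvement": 77.0,
    "Mastermind": 84.5,
}

# (without planner, with planner) pass rates (%) for the other executors.
transfer_rates = {
    "GPT-5.4 mini": (45.0, 60.0),
    "GLM 5.1": (58.5, 71.0),
}


def tasks_passed(rate, n=EVAL_TASKS):
    """Convert a percentage into a task count, assuming rate = passed / n."""
    return round(rate / 100 * n)


def point_gain(before, after):
    """Absolute gain in percentage points."""
    return after - before


counts = {name: tasks_passed(rate) for name, rate in gpt55_rates.items()}
gains_over_baselines = {
    name: point_gain(rate, gpt55_rates["Mastermind"])
    for name, rate in gpt55_rates.items()
    if name != "Mastermind"
}
transfer_gains = {name: point_gain(*pair) for name, pair in transfer_rates.items()}

print(counts)
print(gains_over_baselines)
print(transfer_gains)
```

Read this way, the gap between Mastermind and the strongest baseline, iterative improvement, is 7.5 percentage points, while the transfer gains on the other two executors are 15.0 and 12.5 points.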
Why it matters
Mastermind changes where learning effort is concentrated: on reusable, high-level strategies rather than full action trajectories. That division lets a single trained planner improve multiple executors without touching their action policies, which lowers the friction for applying improvements across models. The empirical results on CyberGym, including the 84.5% pass rate with GPT-5.5 and the cross-model lifts for GPT-5.4 mini and GLM 5.1, show the approach can move repository-scale vulnerability reproduction metrics by sizable margins.
What to watch
Look for follow-up work that publishes the planner checkpoints or applies Mastermind-style strategy learning to broader software engineering tasks and real-world codebases. Other key signals will be replication on benchmarks beyond CyberGym and any details on how milestone-based GRPO is tuned in practice.
References
The claims above come from the paper "Mastermind: Strategy-grounded Learning for Repository-Scale Vulnerability Reproduction" by Mingzhe Du et al., submitted to arXiv on 2 Jul 2026 (arXiv:2607.01764).
Written by The Brieftide · Source: arXiv