AISI study: Benchmarks underestimate AI agents' compute gains
The UK's AI Security Institute found fixed token budgets can understate capability.
TL;DR
- The UK's AI Security Institute found fixed token budgets can understate capability.
- The UK's AI Security Institute (AISI) tested frontier models across seven benchmarks and found that fixed test-time token budgets systematically underestimate what AI agents can do.
- AISI found that an agent's measured performance is a curve that rises with test-time compute, so a score measured at a restricted budget is a lower bound on capability, not a ceiling.
The UK's AI Security Institute (AISI) tested frontier models across seven benchmarks and found that fixed test-time token budgets systematically underestimate what AI agents can do. When models were given larger compute budgets, measured success rates rose, by as much as about 25 percent on some software engineering tasks, and some cyber problems only became solvable with budgets in the tens of millions of tokens.
What did AISI find?
AISI found that an agent's measured performance is a curve that rises with test-time compute, so a score measured at a restricted budget is a lower bound on capability, not a ceiling. In cybersecurity, roughly 8 percent of tasks were solved only when the budget exceeded 10 million tokens, and some required 50 million tokens to be solvable. On software engineering benchmarks (TerminalBench 2.0, SWE-Bench Pro), success rates jumped about 25 percent when the token budget increased from one million to ten million. For math and academic tasks (Humanity's Last Exam), gains were around 22 percent up to a budget of five million tokens.
How do token budgets change model performance?
Larger token budgets extend the time horizon and raise success rates; newer models benefit more from extra compute than older ones. A current frontier model's time horizon grew from about 40 minutes at a budget of 2.5 million tokens to roughly four hours at 50 million tokens. Across the frontier, the horizon shifts from about two hours to 14 hours when the budget jumps from 2.5 to 50 million tokens. AISI also shows the relationship between human task time and token consumption follows a power law across 211 software engineering tasks and 78 cyber tasks: a one-minute task costs the agent thousands of tokens, a one-hour task costs millions, and a one-week task costs billions. A concrete illustration: the cyber task "The Last Ones" takes a human expert about 20 hours and no tested model could solve it with fewer than 30 million tokens.
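The power-law relationship described above can be illustrated with a small fit in log-log space. This is a minimal sketch, not AISI's analysis: the three anchor points are order-of-magnitude placeholders read off the article's wording ("thousands", "millions", "billions"), a calendar week is assumed for the one-week task, and the names `fit_power_law`, `ANCHORS`, and `estimated_tokens` are hypothetical.

```python
import math


def fit_power_law(points):
    """Least-squares fit of tokens = a * minutes**b in log-log space.

    points: iterable of (human_minutes, tokens) pairs.
    Returns the tuple (a, b).
    """
    xs = [math.log(minutes) for minutes, _ in points]
    ys = [math.log(tokens) for _, tokens in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # exponent of the power law
    a = math.exp(mean_y - b * mean_x)    # tokens for a one-minute task
    return a, b


# Illustrative anchors only (NOT AISI data): rough orders of magnitude
# taken from the article's phrasing, with a calendar week assumed.
ANCHORS = [
    (1, 1e3),             # one-minute task: thousands of tokens
    (60, 1e6),            # one-hour task: millions of tokens
    (7 * 24 * 60, 1e9),   # one-week task: billions of tokens
]

a, b = fit_power_law(ANCHORS)


def estimated_tokens(human_minutes):
    """Token cost predicted by the fitted curve for a task of given length."""
    return a * human_minutes ** b


if __name__ == "__main__":
    print(f"fitted exponent b = {b:.2f}")
    print(f"20-hour task estimate = {estimated_tokens(20 * 60):.2e} tokens")
```

With these placeholder anchors the fitted exponent is greater than 1, meaning token cost grows faster than human task time. The real exponent depends on AISI's per-task measurements, which this article does not reproduce.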
AISI distinguishes three axes of improvement as models advance: reach (harder tasks become solvable), reliability (the same task gets solved more often), and efficiency (the same task needs fewer tokens). The institute observed that newer models (examples tested include GPT-5, GPT-5.5, Opus 4.5, Opus 4.8, and Sonnet 4.5) gain more from higher budgets than older ones.
Why it matters
Reporting capability as a single fixed score can mislead deployment, economic, and risk decisions by undercounting what systems can achieve with more compute. AISI emphasizes that test budgets shape apparent progress. At a 2.5 million token budget, the institute had previously estimated that the time horizon of frontier models on cyber tasks doubles roughly every 4.7 months. At 50 million tokens, the trend becomes about 60 percent steeper, with doubling every 40 to 50 days instead of every 67 to 91 days. Falling per-token costs would make higher test-time budgets cheaper to run, which could bring previously unaffordable capabilities within reach and magnify the practical gap between low- and high-budget evaluations.
AISI frames the practical takeaway in measurement terms: "If we keep treating capability as a fixed score rather than a curve over compute, we will keep being surprised by what these systems can do when more is spent on them." The institute now runs frontier models through tests at several different budgets and uses "minimum informative budgets" to check whether a model's reach stops growing with extra compute; only then does a result count as meaningful.
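One way to picture a multi-budget check of this kind is sketched below. The article does not give AISI's definition of a "minimum informative budget", so this is only one possible reading: the function `plateau_budget`, its `tolerance` parameter, and the sample curves are all hypothetical and are not taken from the study.

```python
def plateau_budget(curve, tolerance=0.02):
    """Return the smallest tested budget beyond which success stops growing.

    curve: iterable of (budget_tokens, success_rate) pairs from runs of the
        same model on the same tasks at different budgets.
    tolerance: largest later gain in success rate still treated as "flat".

    Returns None when the curve is still rising at the largest tested
    budget, i.e. when no plateau can be confirmed from these runs.
    """
    points = sorted(curve)
    for i, (budget, rate) in enumerate(points[:-1]):
        best_later = max(r for _, r in points[i + 1:])
        if best_later - rate <= tolerance:
            return budget
    return None


# Made-up numbers for illustration only.
flattening = [
    (1_000_000, 0.30),
    (2_500_000, 0.41),
    (10_000_000, 0.55),
    (50_000_000, 0.56),
]
still_rising = [
    (1_000_000, 0.10),
    (10_000_000, 0.30),
    (50_000_000, 0.50),
]

if __name__ == "__main__":
    print(plateau_budget(flattening))    # budget where the curve levels off
    print(plateau_budget(still_rising))  # None: more compute is still helping
```

Under this reading, a `None` result means the evaluation has not yet found the model's ceiling, so a score from the largest tested budget should be treated as a lower bound.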
What to watch
See whether other evaluators adopt multi-budget testing or publish methods to predict high-budget performance from cheaper runs. Also watch token-cost trajectories: if running larger test-time budgets becomes materially cheaper, the capability gap between low- and high-budget evaluations will become a practical policy and safety concern.
| Item | Lower budget | Higher budget | Reported effect |
|---|---|---|---|
| Software engineering (TerminalBench 2.0, SWE-Bench Pro) | 1 million tokens | 10 million tokens | success rates jumped about 25 percent |
| Math / academic (Humanity's Last Exam) | not specified | up to 5 million tokens | gain around 22 percent |
| Cybersecurity task solvability | ≤10 million tokens | >10 million tokens (some 50 million) | about 8 percent of tasks only solved above 10 million tokens; some required 50 million |
| Time horizon, frontier models | 2.5 million tokens | 50 million tokens | horizon shifts from about 2 hours to 14 hours (one current frontier model: about 40 minutes → about 4 hours) |
| Doubling time of cyber time horizon | 2.5 million tokens | 50 million tokens | earlier estimate of roughly 4.7 months at 2.5 million tokens; at 50 million tokens, doubling every 40–50 days instead of every 67–91 days (about 60 percent steeper) |
Written by The Brieftide · Source: The Decoder