
AISI study: Benchmarks underestimate AI agents' compute gains

The UK's AI Security Institute found fixed token budgets can understate capability.

The Brieftide

TL;DR

  • The UK's AI Security Institute found fixed token budgets can understate capability.
  • The UK's AI Security Institute (AISI) tested frontier models across seven benchmarks and found that fixed test-time token budgets systematically underestimate what AI agents can do.
  • AISI found that an agent's measured performance is a curve that rises with test-time compute, so a score taken at a small evaluation budget is a lower bound on capability, not a ceiling.

The UK's AI Security Institute (AISI) tested frontier models across seven benchmarks and found that fixed test-time token budgets systematically underestimate what AI agents can do. When models were given larger compute budgets, measured success rates rose, by about 25 percent on some software engineering tasks, and some cyber problems became solvable only above tens of millions of tokens.

What did AISI find?

AISI found that an agent's measured performance is a curve that rises with test-time compute, so a score taken at a small evaluation budget is a lower bound on capability, not a ceiling. Across domains, the institute reports that roughly 8 percent of cybersecurity tasks were solved only when the budget exceeded 10 million tokens, and some cyber tasks required 50 million tokens to be solvable. On software engineering benchmarks (TerminalBench 2.0, SWE-Bench Pro), success rates jumped about 25 percent when the token budget increased from one million to ten million. For math and academic tasks (Humanity's Last Exam), gains were around 22 percent up to a budget of five million tokens.

How do token budgets change model performance?

Larger token budgets extend the time horizon and raise success rates; newer models benefit more from extra compute than older ones. A current frontier model's time horizon grew from about 40 minutes at a budget of 2.5 million tokens to roughly four hours at 50 million tokens. Across the frontier, the horizon shifts from about two hours to 14 hours when the budget jumps from 2.5 to 50 million tokens. AISI also shows the relationship between human task time and token consumption follows a power law across 211 software engineering tasks and 78 cyber tasks: a one-minute task costs the agent thousands of tokens, a one-hour task costs millions, and a one-week task costs billions. A concrete illustration: the cyber task "The Last Ones" takes a human expert about 20 hours and no tested model could solve it with fewer than 30 million tokens.
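The power-law relationship described above can be sketched as a straight-line fit in log-log space. This is a minimal illustration, not AISI's method: the function names are hypothetical, and the three data points are order-of-magnitude placeholders loosely based on the "thousands, millions, billions" description (a week is assumed to be seven full days), not AISI measurements.

```python
import math

def fit_power_law(minutes, tokens):
    """Least-squares fit of log(tokens) = log(a) + b * log(minutes).

    Returns (a, b) for the model tokens = a * minutes ** b.
    """
    xs = [math.log(m) for m in minutes]
    ys = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def predict_tokens(a, b, minutes):
    """Token cost predicted by the fitted power law."""
    return a * minutes ** b

# Placeholder points (ASSUMED, not AISI data): one minute, one hour, one week.
human_minutes = [1, 60, 7 * 24 * 60]
agent_tokens = [5e3, 5e6, 5e9]

a, b = fit_power_law(human_minutes, agent_tokens)
print(f"exponent b = {b:.2f}")
print(f"predicted tokens for a 20-hour task: {predict_tokens(a, b, 20 * 60):.3g}")
```

An exponent above 1 in such a fit would mean token cost grows faster than human task time, which is why long tasks fall out of reach quickly under a fixed budget.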

AISI distinguishes three axes of improvement as models advance: reach (harder tasks become solvable), reliability (the same task gets solved more often), and efficiency (the same task needs fewer tokens). The institute observed that newer models (those tested include GPT-5, GPT-5.5, Opus 4.5, Opus 4.8, and Sonnet 4.5) gain more from higher budgets than older ones.
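To make the three axes concrete, here is one way they could be computed from per-attempt run logs. This is a sketch under stated assumptions: the record layout (a solved flag plus tokens used), the function names, and the sample numbers are all hypothetical, and AISI's actual definitions may differ.

```python
from statistics import median

def solved_tasks(runs):
    """Tasks solved in at least one attempt."""
    return {task for task, attempts in runs.items() if any(ok for ok, _ in attempts)}

def success_rate(attempts):
    """Fraction of attempts that solved the task."""
    return sum(1 for ok, _ in attempts if ok) / len(attempts)

def median_success_tokens(attempts):
    """Median tokens used across successful attempts only."""
    return median(tokens for ok, tokens in attempts if ok)

def compare(old_runs, new_runs):
    old_solved, new_solved = solved_tasks(old_runs), solved_tasks(new_runs)
    shared = sorted(old_solved & new_solved)
    return {
        # Reach: tasks the new model solves that the old one never did.
        "reach": sorted(new_solved - old_solved),
        # Reliability: change in success rate on tasks both models can solve.
        "reliability": {t: success_rate(new_runs[t]) - success_rate(old_runs[t]) for t in shared},
        # Efficiency: how many times fewer tokens the new model needs (old / new).
        "efficiency": {t: median_success_tokens(old_runs[t]) / median_success_tokens(new_runs[t]) for t in shared},
    }

# Hypothetical attempt logs: task -> list of (solved, tokens_used).
old_runs = {
    "task_a": [(True, 400_000), (False, 1_000_000), (False, 1_000_000), (True, 600_000)],
    "task_b": [(False, 1_000_000)] * 4,
}
new_runs = {
    "task_a": [(True, 200_000), (True, 300_000), (True, 250_000), (False, 1_000_000)],
    "task_b": [(True, 8_000_000), (False, 10_000_000), (False, 10_000_000), (False, 10_000_000)],
}

result = compare(old_runs, new_runs)
print(result)
```

Keeping the axes separate matters because a fixed low budget can hide gains in reach while still showing gains in reliability and efficiency on easier tasks.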

Why it matters

Measuring capability as a single fixed score can mislead deployment, economic and risk decisions by undercounting what systems can achieve with more compute. AISI emphasizes that test budgets shape apparent progress. At a 2.5 million token budget, the institute had previously estimated that the time horizon of frontier models on cyber tasks doubles roughly every 4.7 months; at 50 million tokens, that trend becomes about 60 percent steeper, with doubling happening every 40 to 50 days instead of every 67 to 91 days. Falling per-token costs would make higher test-time budgets cheaper to run, which could bring previously unaffordable capabilities within reach and magnify the practical gap between low- and high-budget evaluations.
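The link between a doubling time and the steepness of a trend can be worked through with a few lines of arithmetic. This is a sketch, assuming "steeper" means more doublings per day on a log scale; the function names are hypothetical, and pairing the ends of the two quoted ranges (67 with 40 days, 91 with 50 days) is an assumption made only for illustration.

```python
def doublings_per_day(doubling_days):
    """Slope of the trend on a log2 scale: doublings completed per day."""
    return 1.0 / doubling_days

def steepness_ratio(low_budget_days, high_budget_days):
    """How many times steeper the high-budget trend is than the low-budget one."""
    return doublings_per_day(high_budget_days) / doublings_per_day(low_budget_days)

def growth_factor(doubling_days, elapsed_days):
    """Multiplicative growth of the time horizon after elapsed_days."""
    return 2 ** (elapsed_days / doubling_days)

# Doubling times quoted in the article, in days; the pairing is assumed.
for low, high in [(67, 40), (91, 50)]:
    print(f"{low} -> {high} days: {steepness_ratio(low, high):.2f}x steeper")

# Compounded over a year, the difference in doubling time is large.
print(f"one-year growth at 91-day doubling: {growth_factor(91, 365):.1f}x")
print(f"one-year growth at 50-day doubling: {growth_factor(50, 365):.1f}x")
```

Because growth compounds, even a modest change in doubling time produces a large difference in the projected time horizon after a year.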

AISI frames the practical takeaway in measurement terms: "If we keep treating capability as a fixed score rather than a curve over compute, we will keep being surprised by what these systems can do when more is spent on them." The institute now runs frontier models through tests at several different budgets and uses "minimum informative budgets" to check whether a model's reach stops growing with extra compute; only then does a result count as meaningful.
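The article does not say how AISI computes a "minimum informative budget", so the following is only one possible reading: run the same task set at several budgets and take the smallest budget beyond which no larger tested budget solves any new task. The function name, the `min_new_tasks` threshold, and the sample data are all hypothetical.

```python
def minimum_informative_budget(solved_by_budget, min_new_tasks=1):
    """Smallest tested budget after which reach stops growing.

    solved_by_budget maps a token budget to the set of task ids solved at
    that budget. Returns None if reach is still growing at the largest
    tested budget, since a plateau cannot be confirmed without a larger run.
    """
    budgets = sorted(solved_by_budget)
    for i, budget in enumerate(budgets[:-1]):
        # Everything solved at this budget or any smaller one.
        reached = set().union(*(solved_by_budget[b] for b in budgets[: i + 1]))
        # Everything solved at any larger tested budget.
        later = set().union(*(solved_by_budget[b] for b in budgets[i + 1:]))
        if len(later - reached) < min_new_tasks:
            return budget
    return None

# Hypothetical results: reach plateaus at 10 million tokens.
plateau_runs = {
    1_000_000: {"t1"},
    5_000_000: {"t1", "t2"},
    10_000_000: {"t1", "t2", "t3"},
    50_000_000: {"t1", "t2", "t3"},
}
# Hypothetical results: reach is still growing at the largest budget.
growing_runs = {
    1_000_000: {"t1"},
    10_000_000: {"t1", "t2"},
}

print(minimum_informative_budget(plateau_runs))
print(minimum_informative_budget(growing_runs))
```

Returning `None` when reach is still growing reflects the article's point that such a result should not yet be counted as meaningful.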

What to watch

See whether other evaluators adopt multi-budget testing or publish methods to predict high-budget performance from cheaper runs. Also watch token-cost trajectories: if running larger test-time budgets becomes materially cheaper, the capability gap between low- and high-budget evaluations will become a practical policy and safety concern.

Selected AISI findings: performance at low vs high token budgets

Item | Low budget | High budget | Result
Software engineering (TerminalBench 2.0, SWE-Bench Pro) | 1 million tokens | 10 million tokens | success rates jumped about 25 percent
Math / academic (Humanity's Last Exam) | ≤5 million tokens | 5 million tokens | gain around 22 percent
Cybersecurity tasks solvability | ≤10 million tokens | >10 million tokens (some 50M) | about 8 percent of tasks only solved above 10M; some required 50M
Time horizon, frontier models | 2.5 million tokens | 50 million tokens | horizon shifts from about 2 hours to 14 hours (current model: 40 min → ~4 hours)
Doubling rate of time horizon | 2.5M budget | 50M budget | doubling every 4.7 months → every 40–50 days (about 60% steeper)

Written by The Brieftide · Source: The Decoder
