PACE: Proxy predicts agentic benchmark scores (under 4% MAE)
PACE-Bench uses a small subset of non-agentic tests to predict agentic benchmark scores.
TL;DR
- PACE-Bench uses a small subset of non-agentic tests to predict agentic benchmark scores.
- The framework combines target-relevance local selection with globally informative global selection to curate the subset and produce PACE-Bench.
- The paper frames the problem as replacing full agentic runs, which can be expensive and require complex infrastructure, with a compact, fast-to-run proxy.
PACE constructs a lightweight proxy for expensive agentic evaluations by selecting a compact subset of non-agentic test instances and fitting a regression that maps a model's scores on that subset to its score on the target agentic benchmark. The framework combines target-relevance local selection with globally informative global selection to curate the subset and produce PACE-Bench.
How does PACE build a proxy benchmark?
PACE draws candidate instances from existing non-agentic benchmarks that span atomic capabilities, then selects a small set of them using two complementary strategies: target-relevance local selection and globally informative global selection. A regression fitted on that compact subset maps a model's subset scores to its score on the target agentic benchmark.
The paper frames the problem as replacing full agentic runs, which can be expensive and require complex infrastructure, with a compact, fast-to-run proxy. The candidate pool intentionally spans atomic capabilities such as reasoning and code generation, so the selected instances together act as a compact signature of a model's skills relevant to agentic behavior.
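The select-then-regress idea can be illustrated with a minimal sketch. The brief does not give the paper's actual selection criteria or regression form, so everything here is an assumption made for illustration: absolute Pearson correlation with the target stands in for target-relevance local selection, cross-model variance stands in for globally informative global selection, and a one-variable least-squares line on the mean subset score stands in for the regression. The function names and the toy numbers are hypothetical, not taken from the paper.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_instances(scores, target, k_local, k_global):
    """scores[m][i] is model m's score on instance i; target[m] is its agentic score."""
    n_instances = len(scores[0])
    columns = [[row[i] for row in scores] for i in range(n_instances)]
    # Assumed stand-in for local selection: instances most correlated with the target.
    by_relevance = sorted(range(n_instances),
                          key=lambda i: -abs(pearson(columns[i], target)))
    local = by_relevance[:k_local]
    # Assumed stand-in for global selection: remaining instances that best
    # separate models overall (highest variance across models).
    remaining = [i for i in range(n_instances) if i not in local]
    by_spread = sorted(remaining, key=lambda i: -pvariance(columns[i]))
    return sorted(local + by_spread[:k_global])


def subset_mean(row, subset):
    """Mean score of one model over the selected instances."""
    return mean(row[i] for i in subset)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


# Toy data (made up): 5 models x 4 non-agentic instances, scored pass/fail.
scores = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
target = [0.10, 0.20, 0.30, 0.40, 0.50]  # made-up agentic scores

selected = select_instances(scores, target, k_local=1, k_global=1)
proxy_scores = [subset_mean(row, selected) for row in scores]
slope, intercept = fit_line(proxy_scores, target)

# Predict the agentic score of an unseen (hypothetical) model from the subset alone.
new_model = [1, 0, 1, 1]
predicted = slope * subset_mean(new_model, selected) + intercept
print(selected, round(predicted, 3))
```

Only the selected instances need to be run for a new model, which is where the cost saving described in the paper would come from. A real proxy would use many more models and instances than this toy example.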
How well does PACE predict agentic performance?
PACE-Bench predicts agentic scores with a leave-one-out cross-validation mean absolute error (MAE) under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%. The experiments covered 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks, and the proxy runs at well under 1% of the cost of a full agentic evaluation. The results are reported in the paper, which was submitted on 2 Jul 2026.
The paper contrasts this approach with running full agentic benchmarks such as SWE-Bench and GAIA, which it describes as expensive, time-consuming, and infrastructure-heavy; a single full agentic evaluation can cost thousands of dollars and take days. By contrast, non-agentic tests that exercise atomic capabilities are fast and cheap to run, and a carefully chosen subset of them can predict agentic outcomes reliably according to the reported metrics.
The authors also analyze which proxy instances are selected for each target, showing that the chosen instances reveal which skills individual agentic benchmarks uniquely demand. The concrete proxy benchmark derived in the study is called PACE-Bench.
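For readers less familiar with the three headline metrics in this section, the sketch below shows one common way to compute them: leave-one-out MAE, Spearman rank correlation, and pairwise ranking accuracy. It assumes a single proxy score per model and a simple least-squares line as the predictor. The paper's exact evaluation protocol is not described in this brief, so the helper names and the sample numbers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def loo_predictions(xs, ys):
    """Predict each model's score from a fit on all the other models."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds


def mae(pred, true):
    """Mean absolute error between predictions and true scores."""
    return mean(abs(p - t) for p, t in zip(pred, true))


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def pairwise_accuracy(pred, true):
    """Share of model pairs (with distinct true scores) ordered correctly."""
    pairs = [(i, j) for i, j in combinations(range(len(true)), 2)
             if true[i] != true[j]]
    correct = sum((pred[i] - pred[j]) * (true[i] - true[j]) > 0 for i, j in pairs)
    return correct / len(pairs)


# Made-up numbers for six hypothetical models (not results from the paper).
proxy = [0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
agentic = [0.10, 0.22, 0.18, 0.35, 0.41, 0.52]

preds = loo_predictions(proxy, agentic)
loo_mae = mae(preds, agentic)
rho = spearman(preds, agentic)
pair_acc = pairwise_accuracy(preds, agentic)
print(f"LOO MAE={loo_mae:.3f}  Spearman={rho:.2f}  pairwise acc={pair_acc:.2f}")
```

Leave-one-out matters here because the study has only 14 models: each model's prediction comes from a fit that never saw that model, so the error estimate is not inflated by fitting and testing on the same data.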
Why it matters
PACE offers a practical alternative to repeatedly running full agentic evaluations during model development, selection, and routing. If a small, cheap set of non-agentic tests can predict agentic scores with MAE under 4% and maintain Spearman correlation above 0.80, teams can iterate and compare models far more cheaply and quickly than with full agentic runs. That changes where development effort goes: less on orchestration and cost management for agentic runs, and more on constructing and validating compact proxies.
The claim that PACE-Bench achieves those accuracy and ranking metrics for 14 models and 4 target agentic benchmarks suggests the method could be useful both for internal development cycles and for lightweight monitoring or model routing in production, where frequent full agent runs are impractical.
What to watch
Watch whether PACE-Bench and its selected-instance analyses are reproduced beyond the four agentic targets and 14 models examined in the paper, and whether practitioners adopt the framework for model selection, development, and routing. Also watch for public release of the proxy instance sets and regression code, since the paper shows the approach depends on curated subsets and a fitted mapping from subset scores to agentic outcomes.
Written by The Brieftide · Source: arXiv