Benchmarks & Evals · 4 min read

TxBench-PP: 100 preclinical pharmacology tasks, top score 59.3%

TxBench-PP is a verifiable benchmark of 100 small-molecule preclinical decisions, run across 11 models and 4,800 trajectories.

The Brieftide

TL;DR

  • TxBench-PP is a verifiable benchmark of 100 small-molecule preclinical decisions, run across 11 models and 4,800 trajectories.
  • TxBench-PP, released in a paper submitted 17 Jun 2026, is a verifiable benchmark that evaluates AI agents on 100 realistic small-molecule preclinical pharmacology decisions.
  • TxBench-PP is a focused slice of a larger TherapeuticsBench effort and contains 100 evaluations indexed by program stage, assay type, and task structure.

TxBench-PP, released in a paper submitted 17 Jun 2026, is a verifiable benchmark that evaluates AI agents on 100 realistic small-molecule preclinical pharmacology decisions. The benchmark runs across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, and tests whether agents can draw correct conclusions from assay data rather than recall literature.

What is TxBench-PP and what does it test?

TxBench-PP is a focused slice of a larger TherapeuticsBench effort and contains 100 evaluations indexed by program stage, assay type, and task structure. It covers mechanism-of-action and pharmacodynamic reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents must inspect files in a coding environment and return structured answers, which are graded deterministically.

The paper describes TxBench-PP as a "verifiable benchmark for small-molecule preclinical pharmacology" and positions it to test workflow decisions made during preclinical programs rather than memorized facts.

How were agents evaluated and which models were tested?

Agents received realistic workflow snapshots, could inspect files programmatically, and returned structured outputs that were graded by deterministic rules. The authors ran 16 model-harness configurations, totaling 4,800 trajectories, to measure endpoint performance across the 100 benchmark evaluations.
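This brief does not reproduce the paper's grading rules. As a rough illustration of what deterministic grading of a structured answer can look like, here is a minimal Python sketch. The `grade` function, the field names, the tolerance, and the example task are all hypothetical and are not taken from TxBench-PP.

```python
import math


def grade(answer: dict, expected: dict, rel_tol: float = 0.05) -> bool:
    """Return True only if every expected field is matched.

    Hypothetical rules, not the benchmark's: numeric fields must fall
    within a relative tolerance; all other fields must match exactly.
    """
    for key, want in expected.items():
        if key not in answer:
            return False
        got = answer[key]
        want_is_number = isinstance(want, (int, float)) and not isinstance(want, bool)
        if want_is_number:
            got_is_number = isinstance(got, (int, float)) and not isinstance(got, bool)
            if not got_is_number or not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif got != want:
            return False
    return True


# Made-up task: name the most potent compound and report its IC50.
expected = {"most_potent_compound": "cmpd_07", "ic50_nM": 12.0}

print(grade({"most_potent_compound": "cmpd_07", "ic50_nM": 12.3}, expected))  # True
print(grade({"most_potent_compound": "cmpd_03", "ic50_nM": 12.3}, expected))  # False
```

Because the same answer always receives the same verdict, a grader of this kind needs no human or model judge, which is what makes results comparable across configurations.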

The study reports results across 11 underlying models within those configurations. Each attempt produced an endpoint pass or fail under the benchmark's deterministic grading, and results are aggregated per configuration over 300 endpoint attempts. The top system, for example, passed 178 of its 300.
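The figure of 300 attempts per configuration is consistent with the reported totals. A quick check in Python, assuming trajectories are split evenly across configurations and evaluations (the brief does not state this explicitly):

```python
TRAJECTORIES = 4_800
CONFIGURATIONS = 16
EVALUATIONS = 100

# Assumption: every configuration ran the same number of trajectories.
per_config = TRAJECTORIES // CONFIGURATIONS

# Assumption: every evaluation was attempted the same number of times.
attempts_per_eval = per_config // EVALUATIONS

print(per_config)         # 300
print(attempts_per_eval)  # 3
```

Under that assumption, each configuration attempted each of the 100 evaluations three times.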

How well did current AI agents perform?

No system reliably recovered preclinical pharmacology decisions across the benchmark. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts, 178 out of 300 (95% confidence interval, 51.1 to 67.6). The next-best configuration, GPT-5.5 / Pi, passed 55.3% of endpoint attempts, 166 out of 300 (95% confidence interval, 47.0 to 63.6).

Those two results are the clearest performance markers in the paper. Their confidence intervals overlap (51.1 to 67.6 versus 47.0 to 63.6), so the 4-point gap between Claude Opus 4.8 / Pi and GPT-5.5 / Pi sits within the reported uncertainty.
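The pass rates are simple proportions and can be reproduced from the counts. The sketch below also computes a textbook normal-approximation (Wald) interval for comparison. That method is an assumption chosen for illustration; the brief does not say how the paper's intervals were calculated, and `pass_rate_wald` is a hypothetical helper.

```python
import math


def pass_rate_wald(passed: int, attempts: int, z: float = 1.96):
    """Pass rate with a normal-approximation (Wald) 95% interval.

    Treats every attempt as independent, which is an assumption and
    may not match the method used in the paper.
    """
    p = passed / attempts
    half_width = z * math.sqrt(p * (1 - p) / attempts)
    return p, p - half_width, p + half_width


results = {"Claude Opus 4.8 / Pi": 178, "GPT-5.5 / Pi": 166}

for name, passed in results.items():
    p, lo, hi = pass_rate_wald(passed, 300)
    print(f"{name}: {100 * p:.1f}% (Wald 95% CI {100 * lo:.1f} to {100 * hi:.1f})")
```

The point estimates match the reported 59.3% and 55.3%. The Wald intervals come out narrower than the reported ones (roughly 53.8 to 64.9 for the top configuration, against 51.1 to 67.6 in the paper). That would be consistent with a method that accounts for repeated attempts on the same evaluation, though this is a guess rather than something the source states.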

Why it matters

TxBench-PP shifts evaluation from text recall to decision-making on raw assay outputs, exposing gaps between language-model proficiency and program-level scientific reasoning. The benchmark’s scale and determinism (100 evaluations, 16 configurations, 4,800 trajectories) make it a reproducible probe of where agents fail in tasks such as causal target validation and translational efficacy. Those failures matter to teams trying to use AI to shorten interpretation and decision loops in drug discovery.

What to watch

Researchers and vendors will likely publish follow-up runs against TxBench-PP; improvements in pass rates on the same 100 evaluations will be the clearest signal of progress. The milestones to watch are any configuration that exceeds the Claude Opus 4.8 / Pi mark of 59.3% and whether newer models narrow the spread in results across model-harness configurations.

Authors and provenance: the paper, titled "TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology," lists Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, and Kenny Workman as authors and was submitted to arXiv on 17 Jun 2026.

TxBench-PP key metrics and top configuration results

Item                    Value                              Details
Benchmark scope         100 evaluations                    Indexed by program stage, assay type, task structure
Test scale              16 model-harness configurations    4,800 trajectories total
Models covered          11 models
Claude Opus 4.8 / Pi    178                                178/300 endpoint attempts, 59.3% (95% CI 51.1-67.6)
GPT-5.5 / Pi            166                                166/300 endpoint attempts, 55.3% (95% CI 47.0-63.6)

Written by The Brieftide · Source: arXiv
