TxBench-PP: 100 preclinical pharmacology tasks, top score 59.3%
TxBench-PP is a verifiable benchmark of 100 small-molecule preclinical decisions, run across 11 models and 4,800 trajectories.
TL;DR
- TxBench-PP is a verifiable benchmark of 100 small-molecule preclinical decisions, run across 11 models and 4,800 trajectories.
- TxBench-PP, released in a paper submitted 17 Jun 2026, is a verifiable benchmark that evaluates AI agents on 100 realistic small-molecule preclinical pharmacology decisions.
- TxBench-PP is a focused slice of a larger TherapeuticsBench effort and contains 100 evaluations indexed by program stage, assay type, and task structure.
TxBench-PP, released in a paper submitted 17 Jun 2026, is a verifiable benchmark that evaluates AI agents on 100 realistic small-molecule preclinical pharmacology decisions. The benchmark runs across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, and tests whether agents can draw correct conclusions from assay data rather than recall literature.
What is TxBench-PP and what does it test?
TxBench-PP is a focused slice of a larger TherapeuticsBench effort and contains 100 evaluations indexed by program stage, assay type, and task structure. The benchmark covers mechanism-of-action and pharmacodynamic reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy, and it requires agents to inspect files in a coding environment and return structured answers graded deterministically.
The paper describes TxBench-PP as a "verifiable benchmark for small-molecule preclinical pharmacology" and positions it to test workflow decisions made during preclinical programs rather than memorized facts.
How were agents evaluated and which models were tested?
Agents received realistic workflow snapshots, could inspect files programmatically, and returned structured outputs that were graded by deterministic rules. The authors ran 16 model-harness configurations, totaling 4,800 trajectories, to measure endpoint performance across the 100 benchmark evaluations.
The study reports results across 11 underlying models within those configurations. Each attempt produced an endpoint pass or fail under the benchmark's deterministic grading. That setup produced per-configuration aggregates such as the top system passing 178 of 300 endpoint attempts.
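The grading setup described above can be illustrated with a minimal sketch. The brief does not publish the benchmark's answer schema or grading rules, so everything named here (the `grade_endpoint` function, the `advance_compound` and `ic50_nM` fields, the 5% numeric tolerance) is a hypothetical stand-in rather than the authors' grader. The only figures taken from the source are the 4,800 trajectories and 16 configurations, which work out to 300 endpoint attempts per configuration.

```python
import math

# From the brief: 4,800 trajectories over 16 model-harness configurations.
ATTEMPTS_PER_CONFIG = 4800 // 16  # 300


def grade_endpoint(answer: dict, expected: dict, rel_tol: float = 0.05) -> bool:
    """Hypothetical deterministic rule: every expected field must match.

    Booleans and strings must match exactly; numbers must fall within a
    fixed relative tolerance. The rule and the tolerance are assumptions.
    """
    for key, target in expected.items():
        if key not in answer:
            return False
        value = answer[key]
        if isinstance(target, (bool, str)):
            if value != target:
                return False
        else:
            # Numeric field: reject non-numeric answers, then compare.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            if not math.isclose(value, target, rel_tol=rel_tol):
                return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Fraction of endpoint attempts that passed."""
    return sum(results) / len(results)


# Illustrative data only; these are not benchmark tasks or real answers.
expected = {"advance_compound": True, "ic50_nM": 42.0}
attempts = [
    {"advance_compound": True, "ic50_nM": 43.1},   # within tolerance
    {"advance_compound": False, "ic50_nM": 42.0},  # wrong decision
    {"advance_compound": True},                    # missing field
]
results = [grade_endpoint(a, expected) for a in attempts]
print(results, pass_rate(results))
```

The design point is that the same answer always receives the same grade, with no judgment call by a human or model rater, which is what allows pass counts such as 178 of 300 to be compared across configurations.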
How well did current AI agents perform?
No system reliably recovered preclinical pharmacology decisions across the benchmark. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts, 178 out of 300 (95% confidence interval, 51.1 to 67.6). The next-best configuration, GPT-5.5 / Pi, passed 55.3% of endpoint attempts, 166 out of 300 (95% confidence interval, 47.0 to 63.6).
Those two results are the clearest performance markers in the paper. The gap between them is 4.0 percentage points (12 of 300 endpoint attempts), and the two reported confidence intervals overlap.
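The headline percentages follow directly from the pass counts, as the short sketch below shows. It also computes a naive normal-approximation 95% interval that treats all 300 attempts as independent. That method is an assumption chosen for illustration: the brief does not say how the paper's intervals were computed, and the function name `pass_rate_with_naive_ci` is hypothetical.

```python
import math


def pass_rate_with_naive_ci(passed: int, attempts: int, z: float = 1.96):
    """Pass rate plus a normal-approximation interval.

    Assumes independent attempts, which is not claimed by the paper.
    """
    p = passed / attempts
    half_width = z * math.sqrt(p * (1 - p) / attempts)
    return p, p - half_width, p + half_width


# Pass counts as reported in the brief.
for name, passed in [("Claude Opus 4.8 / Pi", 178), ("GPT-5.5 / Pi", 166)]:
    p, lo, hi = pass_rate_with_naive_ci(passed, 300)
    print(f"{name}: {100 * p:.1f}% (naive 95% CI {100 * lo:.1f} to {100 * hi:.1f})")
```

The naive interval comes out at roughly plus or minus 5.6 points, narrower than the reported intervals of roughly plus or minus 8.3 points. A wider reported interval would be consistent with a method that accounts for repeated attempts on the same task, but that is an inference, not something the brief confirms.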
Why it matters
TxBench-PP shifts evaluation from text recall to decision-making on raw assay outputs, exposing gaps between language-model proficiency and program-level scientific reasoning. The benchmark's scale and determinism (100 evaluations, 16 configurations, 4,800 trajectories) make it a reproducible probe of where agents fail in tasks such as causal target validation and translational efficacy. Those failures matter to teams trying to use AI to shorten interpretation and decision loops in drug discovery.
What to watch
Researchers and vendors will likely publish follow-up runs against TxBench-PP; improvements in pass rates on the same 100 evaluations will be the clearest signal of progress. The next milestones to watch are any configurations that exceed the Claude Opus 4.8 / Pi baseline of 59.3%, and whether additional models narrow the spread in pass rates seen across the 16 model-harness configurations.
Authors and provenance: the paper, titled "TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology," lists Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, and Kenny Workman as authors and was submitted to arXiv on 17 Jun 2026.
| Item | Value | Detail |
|---|---|---|
| Benchmark scope | 100 evaluations | Indexed by program stage, assay type, task structure |
| Test scale | 16 model-harness configurations | 4,800 trajectories total |
| Models covered | 11 models | |
| Claude Opus 4.8 / Pi | 178 of 300 endpoint attempts passed | 59.3% (95% CI 51.1-67.6) |
| GPT-5.5 / Pi | 166 of 300 endpoint attempts passed | 55.3% (95% CI 47.0-63.6) |
Written by The Brieftide · Source: arXiv