Benchmarks & Evals · June 18, 2026 · 4 min read

TxBench-PP: 100 preclinical pharmacology tasks, top score 59.3%

TxBench-PP is a verifiable benchmark of 100 small-molecule preclinical decisions, evaluated across 11 models and 4,800 trajectories.

The Brieftide · June 18, 2026

TL;DR

01 TxBench-PP is a verifiable benchmark of 100 small-molecule preclinical decisions, evaluated across 11 models and 4,800 trajectories.
02 TxBench-PP, released in a paper submitted 17 Jun 2026, evaluates AI agents on realistic small-molecule preclinical pharmacology decisions.
03 TxBench-PP is a focused slice of a larger TherapeuticsBench effort, with its 100 evaluations indexed by program stage, assay type, and task structure.

TxBench-PP, released in a paper submitted 17 Jun 2026, is a verifiable benchmark that evaluates AI agents on 100 realistic small-molecule preclinical pharmacology decisions. The benchmark runs across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, and tests whether agents can draw correct conclusions from assay data rather than recall literature.

What is TxBench-PP and what does it test?

TxBench-PP is a focused slice of a larger TherapeuticsBench effort and contains 100 evaluations indexed by program stage, assay type, and task structure. The benchmark covers mechanism-of-action and pharmacodynamic reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy, and it requires agents to inspect files in a coding environment and return structured answers graded deterministically.

The paper describes TxBench-PP as a "verifiable benchmark for small-molecule preclinical pharmacology" and positions it to test workflow decisions made during preclinical programs rather than memorized facts.

How were agents evaluated and which models were tested?

Agents received realistic workflow snapshots, could inspect files programmatically, and returned structured outputs that were graded by deterministic rules. The authors ran 16 model-harness configurations, totaling 4,800 trajectories, to measure endpoint performance across the 100 benchmark evaluations.
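To make "graded by deterministic rules" concrete, here is a minimal Python sketch of a rule-based grader for a structured answer. It is an illustration only: the field names (`selected_compound`, `ic50_nm`), the 5% tolerance, and the `grade` function are hypothetical and are not taken from the paper, whose actual grading rules may differ.

```python
import math

def grade(answer: dict, expected: dict, rel_tol: float = 0.05) -> bool:
    """Return True only if every expected field is matched by the answer.

    Hypothetical rules: numbers must agree within a relative tolerance,
    strings must match exactly after trimming and lowercasing.
    The same inputs always give the same verdict.
    """
    for key, want in expected.items():
        if key not in answer:
            return False  # a missing field fails the endpoint
        got = answer[key]
        if isinstance(want, (int, float)) and not isinstance(want, bool):
            if not isinstance(got, (int, float)) or isinstance(got, bool):
                return False
            if not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif isinstance(want, str):
            if not isinstance(got, str):
                return False
            if got.strip().lower() != want.strip().lower():
                return False
        elif got != want:
            return False
    return True

# Hypothetical reference answer for one evaluation.
expected = {"selected_compound": "CMPD-7", "ic50_nm": 120.0}

print(grade({"selected_compound": "cmpd-7", "ic50_nm": 118.0}, expected))
print(grade({"selected_compound": "CMPD-3", "ic50_nm": 120.0}, expected))
```

The first call passes (same compound, potency within tolerance) and the second fails (wrong compound). Because no model or human judgment is involved, rerunning the grader on the same trajectory cannot change the result, which is the property the benchmark relies on.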

The study reports results across 11 underlying models within those configurations. Each attempt produced an endpoint pass or fail under the benchmark's deterministic grading. That setup produced per-configuration aggregates such as the top system passing 178 of 300 endpoint attempts.
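The headline counts are mutually consistent, which a few lines of Python can confirm. The sketch assumes each trajectory corresponds to one endpoint attempt, and the per-evaluation figure further assumes attempts are spread evenly over the 100 evaluations; neither assumption is stated explicitly in this summary.

```python
TRAJECTORIES = 4_800
CONFIGURATIONS = 16
EVALUATIONS = 100

# Attempts per model-harness configuration.
assert TRAJECTORIES % CONFIGURATIONS == 0
per_config = TRAJECTORIES // CONFIGURATIONS

# Assumption: attempts are divided evenly across evaluations.
per_eval = per_config // EVALUATIONS

print(per_config)  # 300, matching the "out of 300" denominators
print(per_eval)    # 3
```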

How well did current AI agents perform?

No system reliably recovered preclinical pharmacology decisions across the benchmark. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts, 178 out of 300 (95% confidence interval, 51.1 to 67.6). The next-best configuration, GPT-5.5 / Pi, passed 55.3% of endpoint attempts, 166 out of 300 (95% confidence interval, 47.0 to 63.6).
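The reported percentages follow directly from the pass counts. The sketch below recomputes them and, for comparison only, a simple normal-approximation (Wald) interval that treats all 300 attempts as independent. This is not the paper's interval method, which is not described here; the function name `wald_interval` is illustrative.

```python
import math

def wald_interval(passes: int, attempts: int, z: float = 1.96):
    """Pass rate with a naive normal-approximation 95% interval."""
    p = passes / attempts
    half_width = z * math.sqrt(p * (1 - p) / attempts)
    return p, p - half_width, p + half_width

results = {"Claude Opus 4.8 / Pi": 178, "GPT-5.5 / Pi": 166}

for name, passes in results.items():
    p, lo, hi = wald_interval(passes, 300)
    print(f"{name}: {100 * p:.1f}% "
          f"(naive interval {100 * lo:.1f} to {100 * hi:.1f})")
```

The naive intervals come out at roughly 53.8 to 64.9 and 49.7 to 61.0, narrower than the reported 51.1 to 67.6 and 47.0 to 63.6. One plausible explanation is that the paper's intervals account for correlation between repeated attempts on the same evaluation, but that is an inference rather than something the source states.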

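Those two results are the clearest performance markers in the paper. The gap between them is 4.0 percentage points (12 of 300 attempts), and the two reported confidence intervals overlap, so the ordering of the top two configurations should be read with caution.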

Why it matters

TxBench-PP shifts evaluation from text recall to decision-making on raw assay outputs, exposing gaps between language-model proficiency and program-level scientific reasoning. The benchmark’s scale and determinism (100 evaluations, 16 configurations, 4,800 trajectories) make it a reproducible probe of where agents fail in tasks such as causal target validation and translational efficacy. Those failures matter to teams trying to use AI to shorten interpretation and decision loops in drug discovery.

What to watch

Researchers and vendors will likely publish follow-up runs against TxBench-PP; improvements in pass rates on the same 100 evaluations will be the clearest signal of progress. The next milestones to watch are any configurations that exceed the Claude Opus 4.8 / Pi result of 59.3% and whether additional models reduce variability across the 16 model-harness configurations.

Authors and provenance: the paper, titled "TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology," lists Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, and Kenny Workman as authors and was submitted to arXiv on 17 Jun 2026.

TxBench-PP key metrics and top configuration results

Item	Value	Details
Benchmark scope	100 evaluations	Indexed by program stage, assay type, and task structure
Test scale	16 model-harness configurations	4,800 trajectories total
Models covered	11 models	Underlying models within the 16 configurations
Claude Opus 4.8 / Pi	178 passes	178/300 endpoint attempts, 59.3% (95% CI 51.1 to 67.6)
GPT-5.5 / Pi	166 passes	166/300 endpoint attempts, 55.3% (95% CI 47.0 to 63.6)

Written by The Brieftide · Source: arXiv
