Coding Agents · July 2, 2026 · 5 min read

PHREEQC-MCQ-200 benchmark: Tool-augmented scientific agents

A 200-question benchmark derived from 21 validated PHREEQC scenarios tests agents' ability to construct inputs.

The Brieftide · July 2, 2026

TL;DR

01 A 200-question benchmark derived from 21 validated PHREEQC scenarios tests agents' ability to construct inputs.
02 The benchmark requires agents to construct simulator inputs, execute PHREEQC, inspect structured outputs and commit to final answers.
03 The authors frame the task so agents must author PHREEQC input files, run the simulator, read structured outputs and then answer questions that depend on those outputs.

Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi and Maziar Raissi published PHREEQC-MCQ-200 on arXiv on 1 July 2026, presenting a diagnostic benchmark of 200 multiple-choice questions derived from 21 validated PHREEQC scenarios. The benchmark requires agents to construct simulator inputs, execute PHREEQC, inspect structured outputs and commit to final answers.

What is PHREEQC-MCQ-200?

PHREEQC-MCQ-200 is a 200-item multiple-choice benchmark for tool-augmented agents that exercises end-to-end workflows around PHREEQC, a deterministic aqueous-geochemistry simulator. It is drawn from 21 validated PHREEQC scenarios and was submitted to arXiv on 1 July 2026. The authors frame the task so agents must author PHREEQC input files, run the simulator, read structured outputs and then answer questions that depend on those outputs.
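The author-run-read-answer loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's harness: the scenario (a calcium-bicarbonate water checked against calcite), the helper names `build_phreeqc_input`, `parse_selected_output` and `choose_answer`, and the file name `selected.tsv` are all assumptions made for the example. The execution step is left out because the PHREEQC invocation depends on the local install; the sample text below stands in for the simulator's tab-separated selected output.

```python
import csv
import io


def build_phreeqc_input(ph: float, ca_mmol: float, alkalinity_mmol: float) -> str:
    """Return a small PHREEQC input deck as text (illustrative scenario only)."""
    lines = [
        "SOLUTION 1",
        "    temp       25",
        f"    pH         {ph}",
        "    units      mmol/kgw",
        f"    Ca         {ca_mmol}",
        f"    Alkalinity {alkalinity_mmol} as HCO3",
        "SELECTED_OUTPUT",
        "    -file      selected.tsv",  # hypothetical output file name
        "    -reset     false",
        "    -pH        true",
        "    -saturation_indices Calcite",
        "END",
    ]
    return "\n".join(lines) + "\n"


def parse_selected_output(text: str) -> list[dict[str, float]]:
    """Parse tab-separated output with padded headers into one dict per row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [[cell.strip() for cell in row] for row in reader]
    rows = [row for row in rows if any(row)]  # drop blank lines
    header, *data = rows
    # Empty header cells (from trailing tabs) are skipped.
    return [
        {name: float(value) for name, value in zip(header, row) if name}
        for row in data
    ]


def choose_answer(si_calcite: float) -> str:
    """Map a saturation index onto a multiple-choice style answer."""
    return "supersaturated" if si_calcite > 0 else "undersaturated"


# Stand-in for what the simulator would write; values are made up.
SAMPLE_SELECTED = "          pH\t  si_Calcite\t\n         7.5\t     -0.2500\t\n"

deck = build_phreeqc_input(ph=7.5, ca_mmol=1.0, alkalinity_mmol=2.0)
result = parse_selected_output(SAMPLE_SELECTED)[0]
print(choose_answer(result["si_Calcite"]))
```

Each step is a separate place where a trajectory can fail: a malformed deck, a misread column, or a wrong mapping from number to answer option.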

The paper is listed as arXiv:2607.00436, with the DOI link https://doi.org/10.48550/arXiv.2607.00436. According to the arXiv record, it was uploaded on Wednesday, 1 July 2026 and runs to 30 pages with 2 figures.

How do tools affect agent performance on the benchmark?

Simulator access improves aggregate accuracy across multiple frontier and mid-tier model families, but the improvement is not monotonic and introduces regressions on items agents previously solved without tools. In experiments reported in the abstract, granting agents access to PHREEQC substantially raised overall accuracy, yet tool-augmented agents also lost some items they had answered correctly in the non-tool setting, so average accuracy alone hides specific failures.
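To see how a higher average can coexist with regressions, it helps to compare the two settings item by item. The sketch below assumes per-question correctness is available as a mapping from question ID to a boolean for each setting; the function name `retention_report`, the question IDs and the toy results are hypothetical and not taken from the paper.

```python
def retention_report(no_tool: dict[str, bool], with_tool: dict[str, bool]) -> dict[str, object]:
    """Compare per-item correctness between a no-tool and a tool-augmented run."""
    shared = sorted(no_tool.keys() & with_tool.keys())
    if not shared:
        raise ValueError("the two runs share no question IDs")
    # Items the tool setting newly solves, and items it loses.
    gained = [q for q in shared if not no_tool[q] and with_tool[q]]
    lost = [q for q in shared if no_tool[q] and not with_tool[q]]
    return {
        "accuracy_no_tool": sum(no_tool[q] for q in shared) / len(shared),
        "accuracy_with_tool": sum(with_tool[q] for q in shared) / len(shared),
        "gained": gained,
        "lost": lost,
    }


# Toy data: accuracy rises from 0.50 to 0.75, yet q1 regresses.
no_tool_run = {"q1": True, "q2": False, "q3": False, "q4": True}
tool_run = {"q1": False, "q2": True, "q3": True, "q4": True}

report = retention_report(no_tool_run, tool_run)
print(report)
```

In this toy case the aggregate improves while one previously solved item is lost, which is the pattern average accuracy alone would hide.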

The paper further shows that output-access protocol matters: a table-of-contents style interface can reduce token cost while preserving or improving accuracy for stronger models, but that same interface degrades performance for mid-tier models that cannot reliably navigate structured simulator outputs. The authors therefore treat scientific tool use as an end-to-end diagnostic challenge rather than a simple capability to call an external program.
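One way to picture a table-of-contents style interface is a wrapper that first shows the agent only section titles and sizes, then returns a section body on request. The sketch below is an assumption about what such an interface could look like, not the protocol described in the paper: the dash-delimited header format, the sample text and the names `split_sections` and `table_of_contents` are invented for illustration.

```python
import re

# A section header is a title wrapped in runs of dashes, e.g. "-----Title-----".
SECTION_RE = re.compile(r"^-{3,}\s*([A-Za-z].*?)\s*-{3,}$")

# Simplified stand-in for simulator output; values are made up.
SAMPLE_OUTPUT = """\
-----Solution composition-----
Ca    1.000e-03
C(4)  2.000e-03
-----Description of solution-----
pH  =  7.500
-----Saturation indices-----
Calcite  -0.25
"""


def split_sections(output: str) -> dict[str, str]:
    """Split output into {title: body}; a repeated title overwrites the earlier one."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in output.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def table_of_contents(sections: dict[str, str]) -> list[str]:
    """Return the compact listing an agent would see before fetching any body."""
    return [
        f"{index}. {title} (lines={len(body.splitlines())})"
        for index, (title, body) in enumerate(sections.items(), start=1)
    ]


sections = split_sections(SAMPLE_OUTPUT)
print("\n".join(table_of_contents(sections)))
# The agent then requests only the section it needs:
print(sections["Saturation indices"])
```

The token saving comes from sending the short listing instead of the full output. The cost is an extra navigation decision, which is where a model that cannot reliably pick the right section would lose accuracy.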

Why it matters

PHREEQC-MCQ-200 forces agent evaluations to expose where the computation chain breaks, not just whether a final answer is correct. By requiring agents to build inputs, execute PHREEQC and parse structured results, the benchmark surfaces trajectory failures, output-access sensitivity and item-level retention losses that average accuracy does not reveal. For teams building scientific agents, those failure modes change how one measures progress and where engineering effort should focus: parsing, output presentation and stepwise verification become testable targets.

The benchmark therefore reframes scientific tool use as a chain-of-trust problem: more grounded execution can improve results for many items, yet introduce new errors that only an item-level diagnostic benchmark will reveal.

What to watch

Look for community adoption of the 200-question PHREEQC-MCQ-200 suite and for follow-up studies that report item-level retention and output-access sensitivity alongside aggregate accuracy. Also watch for work that quantifies the tradeoffs the authors identify between token-cost saving interfaces, like a table-of-contents view, and the navigational abilities of mid-tier models.

Paper and provenance

PHREEQC-MCQ-200 is available on arXiv as arXiv:2607.00436 (submitted 1 Jul 2026). The authors are Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi and Maziar Raissi. The arXiv entry notes the PDF and TeX sources and lists the paper length as 30 pages with 2 figures.

When referencing or reproducing the work, cite the arXiv record and the DOI given in the submission metadata.

Written by The Brieftide · Source: arXiv
