First Proof Second Batch: arXiv tests AI on 10 math problems
An arXiv paper submitted 16 Jun 2026 evaluated several AI systems on ten research-level math problems and published solutions and referee reports.
TL;DR
- An arXiv paper submitted 16 Jun 2026 evaluated several AI systems on ten research-level math problems and published solutions and referee reports.
- First Proof Second Batch, submitted to arXiv on 16 Jun 2026 as arXiv:2606.18119, tested several AI systems on a set of ten research-level mathematics problems.
- The paper evaluated several AI systems on ten problems drawn from a broad range of mathematical fields; the problems arose naturally in the research process of the contributors.
First Proof Second Batch, submitted to arXiv on 16 Jun 2026 as arXiv:2606.18119, tested several AI systems on a set of ten research-level mathematics problems. The paper, authored by Mohammed Abouzaid, Nikhil Srivastava, Rachel Ward, and Lauren Williams, presents the problems, the authors' methodology, and the results, and links to supplementary materials including human solutions, AI-generated solutions, and referee reports and logs.
What did the paper test and who contributed the problems?
The paper evaluated several AI systems on ten problems drawn from a broad range of mathematical fields; the problems arose naturally in the research process of the contributors. The ten problem contributors named in the arXiv entry are: Dariusz Kalociński and Theodore A. Slaman; Richard Schwartz; Aleksa Milojevic and Benny Sudakov; Larry Guth; Oleg Butkovsky, Jonathan Mattingly, and Lorenzo Zambotti; Joshua Evan Greene and Duncan McCoy; Sucharit Sarkar; Sam Payne and Jidong (Jayden) Wang; Sylvie Corteel and John Lentfer; and Srivatsav Kunnawalkam Elayavalli.
The paper frames its scope as an assessment of "the ability of current AI systems to correctly solve research-level mathematics problems" and covers problems across diverse subfields, rather than focusing on a single domain.
How did the authors document methodology and results?
The paper provides the test problems, the evaluation methodology, and the results, and it links to supplementary documents containing human solutions, AI-generated solutions, and referee reports and logs for the AI-generated solutions. The arXiv entry offers the submission in three formats (PDF, an experimental HTML view, and TeX source) and lists MSC class 68T01 for the manuscript.
The submission record identifies the four authors (Mohammed Abouzaid, Nikhil Srivastava, Rachel Ward, and Lauren Williams) and gives the arXiv identifier arXiv:2606.18119 (cs.AI). The entry also supplies an arXiv-issued DOI via DataCite, marked as pending registration, and a submission timestamp: [v1] Tue, 16 Jun 2026 16:21:33 UTC.
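For readers who want to sanity-check the record, new-style arXiv identifiers encode the submission year and month as YYMM before the dot, so arXiv:2606.18119 should agree with the June 2026 timestamp. The following is a minimal sketch of that check, not anything taken from the paper or from an arXiv tool: the function names `parse_arxiv_id` and `parse_timestamp` are hypothetical, and the timestamp parsing assumes an English locale and the exact format quoted in the record.

```python
import re
from datetime import datetime, timezone

# New-style arXiv identifiers look like YYMM.NNNNN
# (four-digit sequence numbers were used before 2015).
ARXIV_ID = re.compile(r"^arXiv:(\d{2})(\d{2})\.(\d{4,5})$")


def parse_arxiv_id(identifier: str) -> dict:
    """Split an identifier such as 'arXiv:2606.18119' into year, month, sequence."""
    match = ARXIV_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {identifier!r}")
    yy, mm, sequence = match.groups()
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in identifier: {identifier!r}")
    return {"year": 2000 + int(yy), "month": month, "sequence": sequence}


def parse_timestamp(stamp: str) -> datetime:
    """Parse a record timestamp such as 'Tue, 16 Jun 2026 16:21:33 UTC'."""
    parsed = datetime.strptime(stamp, "%a, %d %b %Y %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    info = parse_arxiv_id("arXiv:2606.18119")
    submitted = parse_timestamp("Tue, 16 Jun 2026 16:21:33 UTC")
    # The identifier's year and month should match the submission timestamp.
    consistent = (info["year"], info["month"]) == (submitted.year, submitted.month)
    print(info, submitted.isoformat(), consistent)
```

The check only confirms that the identifier and the timestamp in the record agree with each other; it says nothing about the paper's contents.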
Why it matters
This work places concrete research problems, contributed by active researchers, at the center of an AI evaluation. That choice forces systems to confront the kind of open-ended reasoning and domain expertise that define mathematical research, rather than toy or textbook exercises. Publishing AI-generated solutions alongside human solutions and referee logs creates a traceable record that other researchers can inspect, reproduce, and critique, which matters for judging whether AI output actually meets the standards of mathematical proof.
Testing on ten problems gives the study focused breadth: enough variety to reveal cross-field weaknesses, while keeping each case examinable by human experts. The presence of referee reports and logs signals an attempt to move beyond bare claims to documented scrutiny, a higher bar than many prior evaluations set.
What to watch
Watch the supplementary referee reports and logs for whether independent experts endorse the AI-generated proofs as correct and complete. The concrete signals that will move this conversation are reproducible referee findings in the linked materials and follow-up work that attempts to replicate the paper's methodology on additional problems or with named AI systems.
The arXiv entry provides the materials needed for that scrutiny: problem statements, the authors' methodology, AI outputs, human solutions, and referee documentation, all tied to arXiv:2606.18119 and the listed author team. Researchers tracking AI progress in research-level mathematics should inspect those supplementary files and the referee logs first.
Written by The Brieftide · Source: arXiv