Mask-Proof: LLM pipeline and Mask-ProofBench with 292 tasks
Mask-Proof turns real proofs into masked-step tasks and ships Mask-ProofBench of 292 problems with an LLM equivalence judge.
TL;DR
- Mask-Proof turns real proofs into masked-step tasks and ships Mask-ProofBench of 292 problems with an LLM equivalence judge.
- The project publishes Mask-ProofBench with 292 curated problems, evaluates 17 models, and uses an LLM-based equivalence judge that the authors report agrees 96.8% with expert annotators.
- The pipeline produces masked-step tasks that are automatically checkable rather than relying solely on final answers or costly expert grading.
Mask-Proof, presented in an arXiv paper submitted 13 Jun 2026 by Jierui Zhang and coauthors, is an LLM-based pipeline that converts real mathematical proofs into automatically checkable masked-step tasks. The project publishes Mask-ProofBench with 292 curated problems, evaluates 17 models, and uses an LLM-based equivalence judge that the authors report agrees 96.8% with expert annotators.
How does Mask-Proof work?
Mask-Proof masks key formula steps inside real proofs, supplies the surrounding context for each mask, and asks models to reconstruct the missing steps; reconstructions are judged by an LLM-based equivalence evaluator that uses repeated votes for stability. The pipeline produces masked-step tasks that are automatically checkable rather than relying solely on final answers or costly expert grading.
The paper describes three linked pieces: a masking procedure that selects important formula steps, a context extraction that frames each masked step within the surrounding proof, and an LLM equivalence judge that aggregates repeated votes to decide whether a reconstruction matches the masked target. The authors position this flow as a way to measure step-level reasoning in long proofs across diverse sources.
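To make the flow concrete, here is a minimal Python sketch of two of those pieces: hiding one step of a proof and aggregating repeated judge votes. It is an illustration under stated assumptions, not the authors' implementation. The names `MaskedTask`, `make_masked_task`, `majority_vote`, `MASK_TOKEN`, and `toy_judge` are hypothetical, the number of votes and the strict-majority rule are assumed, and the toy judge is a plain string comparison standing in for an LLM call.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the paper's actual token is not given here


@dataclass
class MaskedTask:
    context: str  # proof text with the chosen step replaced by MASK_TOKEN
    target: str   # the hidden step the model must reconstruct


def make_masked_task(steps: List[str], index: int) -> MaskedTask:
    """Hide steps[index] and keep every other step as surrounding context."""
    context = "\n".join(
        MASK_TOKEN if i == index else step for i, step in enumerate(steps)
    )
    return MaskedTask(context=context, target=steps[index])


def majority_vote(
    judge: Callable[[str, str], bool],
    candidate: str,
    target: str,
    n_votes: int = 5,  # assumed value; the brief does not state how many votes are used
) -> bool:
    """Query the judge n_votes times and accept only on a strict majority."""
    votes = Counter(judge(candidate, target) for _ in range(n_votes))
    return votes[True] > n_votes / 2


def toy_judge(candidate: str, target: str) -> bool:
    """Stand-in for an LLM equivalence judge: whitespace-insensitive string match."""
    return "".join(candidate.split()) == "".join(target.split())


steps = ["a = b + c", "a^2 = (b + c)^2", "a^2 = b^2 + 2bc + c^2"]
task = make_masked_task(steps, 1)
accepted = majority_vote(toy_judge, "a^2=(b+c)^2", task.target)
rejected = majority_vote(toy_judge, "a^2 = b + c", task.target)
```

With a deterministic judge like `toy_judge`, every vote is identical, so voting adds nothing. Repeated votes only matter when the judge is stochastic, as a sampled LLM would be.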
How well does Mask-Proof perform in evaluations?
Mask-Proof yields a benchmark called Mask-ProofBench with 292 curated problems across diverse research areas; the authors evaluated 17 models and report that reasoning-enhanced models beat standard models by 12% to 27%. The equivalence evaluator attained 96.8% agreement with expert annotators, which the paper highlights as enabling reproducible, comparable measurement of step-level mathematical reasoning.
The experiments compare model classes across the 17 models and find that reasoning-enhanced variants outperform standard models by 12% to 27%. The authors say the evaluator's 96.8% agreement supports automatic checking of masked-step reconstructions and reduces dependence on manual expert grading. Benchmark data, annotations, and code are available at a URL given in the paper.
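For readers unfamiliar with how a judge-versus-expert agreement figure is typically computed, the sketch below shows simple percent agreement. This is an assumption: the summary here does not say whether the paper's 96.8% is raw agreement or a chance-corrected statistic. The function name `agreement_rate` and the labels are made up for illustration and are not data from the paper.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[bool], expert_labels: Sequence[bool]) -> float:
    """Fraction of items where the automatic judge and the expert give the same verdict."""
    if len(judge_labels) != len(expert_labels):
        raise ValueError("label sequences must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labelled item")
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)


# Illustrative labels only: True means "reconstruction judged equivalent".
judge = [True, True, False, True, False]
expert = [True, True, False, False, False]
rate = agreement_rate(judge, expert)  # 4 of 5 verdicts match
```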
Why it matters
Current evaluations of mathematical reasoning often emphasize final answers or require expensive expert grading, which leaves step-level capability under-measured. Mask-Proof addresses that gap by recasting proof evaluation as masked-step reconstruction tasks that can be judged automatically. If the pipeline and its LLM-based judge are adopted, researchers would gain a scalable, reproducible tool for comparing step-level reasoning across models and proof sources.
That shift matters for any effort that relies on trustworthy AI assistance in research-level mathematics: automated, high-agreement evaluation of intermediate steps makes it easier to validate where model help is reliable and where human oversight remains necessary.
What to watch
Adoption of Mask-ProofBench and external replication of the reported 96.8% evaluator agreement by independent expert studies will be the clearest confirmation that the pipeline generalizes beyond the authors' experiments. Also watch for community releases or forks of the benchmark and code linked in the paper, and for follow-up papers that apply Mask-Proof to new mathematical subfields.
Paper and metadata: arXiv:2606.15258 (submitted 13 Jun 2026) by Jierui Zhang et al.; DOI https://doi.org/10.48550/arXiv.2606.15258.
Written by The Brieftide · Source: arXiv