Benchmarks & Evals · June 19, 2026 · 6 min read

Process-Verified RL for Theorem Proving via Lean, tactic rewards

Minsu Kim and Se-Young Yun show Lean can supply tactic-level verified rewards during reinforcement learning for theorem proving.

The Brieftide · June 19, 2026

TL;DR

01 Minsu Kim and Se-Young Yun show Lean can supply tactic-level verified rewards during reinforcement learning for theorem proving.
02 The approach treats the prover not merely as an end-of-run verifier but as an oracle that can inspect and score intermediate tactic steps.
03 Tactic-level supervision outperformed outcome-only baselines in most settings across the experiments reported, and produced improvements on benchmark suites including MiniF2F and ProofNet.

Process-Verified Reinforcement Learning for Theorem Proving via Lean, a paper by Minsu Kim and Se-Young Yun submitted to arXiv on 18 Jun 2026 (arXiv:2606.20068), demonstrates that the Lean proof assistant can act as a symbolic process oracle, providing both outcome-level and fine-grained tactic-level verified feedback to reinforcement learning agents.

How does Lean provide process-level rewards?

Lean supplies both outcome-level and tactic-level verified feedback during training. The authors describe parsing each proof attempt into a tactic sequence, then using Lean's elaboration to mark which steps are locally sound and which step fails first. The result is a set of dense, verifier-grounded credit signals rooted in type theory, rather than a single binary reward.
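To make the step-marking idea concrete, here is a minimal sketch in Python. It assumes the per-tactic check results have already been obtained from Lean and arrive as a list of booleans; the function names and the three labels ("sound", "first_error", "unchecked") are hypothetical and are not taken from the paper.

```python
from typing import List, Optional


def first_error_index(step_ok: List[bool]) -> Optional[int]:
    """Return the index of the earliest failing tactic, or None if all pass."""
    for i, ok in enumerate(step_ok):
        if not ok:
            return i
    return None


def label_tactics(step_ok: List[bool]) -> List[str]:
    """Label each tactic relative to the earliest failure.

    Assumption (not from the paper): tactics after the first failure are
    labeled "unchecked", since the proof state they run in is no longer valid.
    """
    err = first_error_index(step_ok)
    labels = []
    for i in range(len(step_ok)):
        if err is None or i < err:
            labels.append("sound")
        elif i == err:
            labels.append("first_error")
        else:
            labels.append("unchecked")
    return labels


# Hypothetical proof attempt with four tactics; the third one fails.
print(label_tactics([True, True, False, True]))
```

Under these assumptions, one proof attempt yields a label per tactic instead of a single pass or fail bit for the whole attempt.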

The paper incorporates these structured rewards into a GRPO-style reinforcement learning objective, and introduces first-error propagation and first-token credit methods that balance outcome- and process-level advantages. That architecture treats the prover not merely as an end-of-run verifier but as an oracle that can inspect and score intermediate tactic steps.
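As a rough illustration of how outcome- and process-level signals might be blended in a GRPO-style update, the sketch below normalizes each reward within a group of sampled proof attempts (the group-relative advantage used by GRPO) and then mixes the two. The mixing weight, the use of a population standard deviation, and the choice of process reward are assumptions made for illustration; the paper's exact objective may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def blended_advantages(
    outcome_rewards: List[float],
    process_rewards: List[float],
    weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> List[float]:
    """Mix outcome-level and process-level group-relative advantages."""
    a_out = group_relative_advantages(outcome_rewards)
    a_proc = group_relative_advantages(process_rewards)
    return [(1.0 - weight) * o + weight * p for o, p in zip(a_out, a_proc)]


# Four sampled attempts for one theorem (illustrative numbers only).
outcome = [1.0, 0.0, 0.0, 1.0]   # 1.0 if the whole proof verified
process = [1.0, 0.5, 0.0, 1.0]   # e.g. fraction of tactics verified before the first error
print(blended_advantages(outcome, process))
```

In this sketch the two failed attempts share the same outcome reward but receive different blended advantages, because one of them got further before its first error.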

What did the experiments show?

Tactic-level supervision outperformed outcome-only baselines in most settings across the experiments reported, and produced improvements on benchmark suites including MiniF2F and ProofNet. The authors evaluated their methods using STP-Lean and DeepSeek-Prover-V1.5 agents, and found that tactic-level signals yielded better learning than using only a single binary verification signal at the end of a proof attempt.

The paper highlights two concrete methods integrated into the learning objective: first-error propagation, which traces credit back to the earliest failing tactic, and first-token credit, which assigns reward based on the initial token in a tactic sequence. These methods are applied within a GRPO-style objective that the authors use to blend process-level and outcome-level advantages during policy updates.
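One possible reading of how the two methods combine at the token level is sketched below. It assumes each tactic occupies a known span of tokens in the generated proof, broadcasts the outcome-level advantage to every token, and places a process-level penalty on the first token of the earliest failing tactic. The function name, the span format, and the penalty value are hypothetical, and the paper's actual formulation may differ.

```python
from typing import List, Optional, Tuple


def token_advantages(
    num_tokens: int,
    tactic_spans: List[Tuple[int, int]],  # (start, end) token indices, end exclusive
    first_error: Optional[int],           # index of the earliest failing tactic, or None
    outcome_adv: float,
    process_penalty: float = 1.0,         # hypothetical magnitude
) -> List[float]:
    """Build per-token advantages for one proof attempt."""
    # Outcome-level advantage is shared by every token in the sequence.
    adv = [outcome_adv] * num_tokens
    if first_error is not None:
        # Locate the earliest failing tactic, then put the process-level
        # term on that tactic's first token only.
        start, _ = tactic_spans[first_error]
        adv[start] -= process_penalty
    return adv


# Three tactics of two tokens each; the second tactic fails first.
spans = [(0, 2), (2, 4), (4, 6)]
print(token_advantages(6, spans, first_error=1, outcome_adv=-0.5))
```

The point of the sketch is only that the extra signal is localized: tokens in tactics that Lean accepted are not penalized beyond the shared outcome term.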

Why it matters

Symbolic proof assistants are traditionally used only as verifiers at evaluation time. The paper argues that they can also act as "process-level reward oracles during training," and demonstrates that doing so produces measurable improvements on established formal-reasoning benchmarks. That combination offers a path toward reinforcement learning frameworks that pair the scalability of language-model-based provers with the reliability of symbolic verification rooted in type theory.

Using tactic-level, verifier-grounded feedback changes the credit-assignment problem for theorem-proving agents. Instead of a sparse binary signal indicating only success or failure, agents get dense, locally checked signals that can speed learning and reduce wasted exploration on provably invalid tactic sequences.

What to watch

Watch whether tactic-level supervision scales to theorem-proving benchmarks beyond MiniF2F and ProofNet, and whether other proof assistants can serve as process oracles in the same way. Follow-up work that reports numerical comparisons on additional benchmarks, or applies first-error propagation and first-token credit to other prover architectures, would help show how general the approach is.

Bibliographic note: the paper is available on arXiv as arXiv:2606.20068 and was submitted on 18 Jun 2026 by Minsu Kim and Se-Young Yun.

Tactic-level supervision versus outcome-only baselines

Item	Tactic-level supervision	Outcome-only baseline
Performance on MiniF2F	delivering improvements	baseline
Performance on ProofNet	delivering improvements	baseline
Across experimental settings	outperforms in most settings	outcome-only

Written by The Brieftide · Source: arXiv

