Process-Verified RL for Theorem Proving via Lean: Tactic-Level Rewards
Minsu Kim and Se-Young Yun show Lean can supply tactic-level verified rewards during reinforcement learning for theorem proving.
TL;DR
- Minsu Kim and Se-Young Yun show Lean can supply tactic-level verified rewards during reinforcement learning for theorem proving.
- The approach treats the prover not merely as an end-of-run verifier but as an oracle that can inspect and score intermediate tactic steps.
- Tactic-level supervision outperformed outcome-only baselines in most settings across the reported experiments and improved results on benchmark suites including MiniF2F and ProofNet.
Process-Verified Reinforcement Learning for Theorem Proving via Lean, a paper by Minsu Kim and Se-Young Yun submitted to arXiv on 18 Jun 2026 (arXiv:2606.20068), demonstrates that the Lean proof assistant can act as a symbolic process oracle, providing both outcome-level and fine-grained tactic-level verified feedback to reinforcement learning agents.
How does Lean provide process-level rewards?
Lean supplies both outcome-level and tactic-level verified feedback during training. The authors describe parsing each proof attempt into a tactic sequence, then using Lean's elaboration to mark which steps are locally sound and which step fails first. The result is dense, verifier-grounded credit signals rooted in type theory, rather than a single binary reward.
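The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tactic_labels`, the label values (+1, -1, 0), and the choice to treat tactics after the first failure as neutral are all assumptions, and the index of the earliest failing tactic is taken as an input rather than obtained from Lean.

```python
from typing import List, Optional


def tactic_labels(num_tactics: int, first_error: Optional[int]) -> List[float]:
    """Label each tactic of one proof attempt (hypothetical scheme).

    first_error is the index of the earliest failing tactic as reported by
    some checker, or None when the whole proof verifies.
    """
    if first_error is None:
        # Every step checked, so every step is marked locally sound.
        return [1.0] * num_tactics
    if not 0 <= first_error < num_tactics:
        raise ValueError("first_error must index a tactic in the attempt")

    labels: List[float] = []
    for i in range(num_tactics):
        if i < first_error:
            labels.append(1.0)    # locally sound step before the failure
        elif i == first_error:
            labels.append(-1.0)   # earliest failing step
        else:
            labels.append(0.0)    # assumption: steps after the failure stay neutral
    return labels


if __name__ == "__main__":
    print(tactic_labels(4, 2))
    print(tactic_labels(3, None))
```

Even in this toy form, a failed attempt yields one label per tactic instead of a single pass/fail bit for the whole proof.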
The paper incorporates these structured rewards into a GRPO-style reinforcement learning objective, and introduces first-error propagation and first-token credit methods that balance outcome- and process-level advantages. That architecture treats the prover not merely as an end-of-run verifier but as an oracle that can inspect and score intermediate tactic steps.
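To make the idea of blending outcome- and process-level advantages concrete, here is a rough sketch. The group-relative normalization (reward minus group mean, divided by group standard deviation) follows the commonly described GRPO formulation; the additive blend and the weight `lam` are illustrative assumptions, and the paper's actual objective may differ.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize outcome rewards within one group of sampled proofs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def blend(outcome_adv: float, labels: List[float], lam: float = 0.5) -> List[float]:
    """Per-tactic advantage for one attempt (hypothetical additive blend).

    outcome_adv is the attempt's group-relative advantage; labels are
    per-tactic process signals; lam is an assumed mixing weight.
    """
    return [outcome_adv + lam * label for label in labels]


if __name__ == "__main__":
    # Four sampled attempts at one theorem: two verified (1.0), two failed (0.0).
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
    # Second attempt failed at its third tactic.
    print(blend(advantages[1], [1.0, 1.0, -1.0, 0.0]))
```

Under this sketch, tactics in the same failed attempt receive different advantages, with the earliest failing step penalized most.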
What did the experiments show?
Tactic-level supervision outperformed outcome-only baselines in most settings across the experiments reported, and produced improvements on benchmark suites including MiniF2F and ProofNet. The authors evaluated their methods using STP-Lean and DeepSeek-Prover-V1.5 agents, and found that tactic-level signals yielded better learning than using only a single binary verification signal at the end of a proof attempt.
The paper highlights two concrete methods integrated into the learning objective: first-error propagation, which traces credit back to the earliest failing tactic, and first-token credit, which assigns reward based on the initial token in a tactic sequence. These methods are applied within a GRPO-style objective that the authors use to blend process-level and outcome-level advantages during policy updates.
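One possible reading of first-token credit is sketched below: each tactic's credit is placed on the first token of that tactic, and the remaining tokens get zero. The function name, the input shapes, and this interpretation are assumptions made for illustration; the brief does not specify the exact token-level rule.

```python
from typing import List


def first_token_credit(token_counts: List[int], tactic_credit: List[float]) -> List[float]:
    """Spread per-tactic credit over tokens, first token only (hypothetical).

    token_counts[i] is the number of tokens in tactic i;
    tactic_credit[i] is the credit assigned to tactic i.
    """
    if len(token_counts) != len(tactic_credit):
        raise ValueError("one credit value is required per tactic")

    per_token: List[float] = []
    for n_tokens, credit in zip(token_counts, tactic_credit):
        if n_tokens <= 0:
            raise ValueError("each tactic must contain at least one token")
        per_token.append(credit)                  # first token carries the credit
        per_token.extend([0.0] * (n_tokens - 1))  # remaining tokens get none
    return per_token


if __name__ == "__main__":
    # Two tactics of 3 and 2 tokens: the first sound, the second failing.
    print(first_token_credit([3, 2], [1.0, -1.0]))
```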
Why it matters
Symbolic proof assistants are traditionally used only as verifiers at evaluation time. The paper argues that they can also act as "process-level reward oracles during training," and demonstrates that doing so produces measurable improvements on established formal-reasoning benchmarks. That combination offers a path toward reinforcement learning frameworks that pair the scalability of language-model-based provers with the reliability of symbolic verification rooted in type theory.
Using tactic-level, verifier-grounded feedback changes the credit-assignment problem for theorem-proving agents. Instead of a sparse binary signal indicating only success or failure, agents get dense, locally checked signals that can speed learning and reduce wasted exploration on provably invalid tactic sequences.
What to watch
Watch whether tactic-level supervision scales to theorem-proving benchmarks beyond MiniF2F and ProofNet, and whether other proof assistants can serve as process oracles in the same way. Subsequent work that reports numerical comparisons on additional benchmarks, or applies first-error propagation and first-token credit to other prover architectures, would help show how general the approach is.
Bibliographic note: the paper is available on arXiv as arXiv:2606.20068 and was submitted on 18 Jun 2026 by Minsu Kim and Se-Young Yun.
| Item | Tactic-level supervision | Compared against |
|---|---|---|
| Performance on MiniF2F | improved | baseline |
| Performance on ProofNet | improved | baseline |
| Across experimental settings | outperforms in most settings | outcome-only baselines |
Written by The Brieftide · Source: arXiv