Mistral Leanstral 1.5 aces formal math benchmarks, finds code bugs
Open-source Leanstral 1.5 (Apache 2.0) scores 100% on miniF2F, solves 587 of 672 PutnamBench problems, and found five previously unknown bugs across 57 repositories.
TL;DR
- Open-source Leanstral 1.5 (Apache 2.0) scores 100% on miniF2F, solves 587 of 672 PutnamBench problems, and found five previously unknown bugs across 57 repositories.
- Mistral AI released Leanstral 1.5 on Jul 4, 2026, a free open-source model (Apache 2.0) built for formal verification in the Lean 4 programming language.
- The model hits 100 percent on miniF2F, solves 587 of 672 problems on PutnamBench, and posts top open-source results on FATE-H and FATE-X with scores of 87 and 34 percent respectively.
What results did Leanstral 1.5 produce?
Leanstral 1.5 scored 100 percent on miniF2F and solved 587 of the 672 PutnamBench problems, while scoring 87 percent on FATE-H and 34 percent on FATE-X. The miniF2F benchmark covers problems from high school level up to math olympiad difficulty; PutnamBench draws its 672 problems from the Putnam math competition; FATE-H and FATE-X test master's and doctoral-level algebra tasks in areas such as group theory and ring theory.
Mistral positions Leanstral 1.5 as the top open-source model on PutnamBench, FATE-H and FATE-X. The only system noted as beating it on PutnamBench is the closed-source Aleph Prover.
How was Leanstral 1.5 trained and tested?
Mistral trained Leanstral 1.5 with a pipeline that included mid-training, supervised fine-tuning, and reinforcement learning, and aimed the model mainly at mathematical formal verification in Lean 4. The model is available through Hugging Face and via a free API.
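For readers unfamiliar with Lean 4, the short sketch below shows what a formally verified statement looks like. It is a toy illustration only: the theorem name `add_comm_toy` is made up for this example, and neither statement is taken from miniF2F, PutnamBench, FATE, or anything Mistral published.

```lean
-- Illustrative only: a toy statement, not drawn from any benchmark.
-- Lean accepts the theorem only if the supplied proof type-checks.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, checked by evaluation.
example : 2 + 2 = 4 := rfl
```

A model built for this setting has to produce the part after `:=`, and Lean either accepts the proof or rejects it. That is why a benchmark result can be checked mechanically rather than graded by hand.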
Beyond benchmark scores, the team ran a hands-on code-verification test where Leanstral 1.5 scanned 57 open-source repositories and identified five previously unknown bugs. One of the flagged issues was an overflow bug in the Rust library varinteger. Mistral reports these findings alongside the formal-math benchmark results.
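The source does not describe the varinteger bug in detail, so the sketch below is not the library's code and not the defect Mistral reported. It only illustrates, under assumed conventions, the general class of problem: a variable-length integer decoder has to reject inputs whose bits would not fit in the target type. The function name `decode_varint_checked` and the LEB128-style encoding (7 payload bits per byte, high bit as a continuation flag) are assumptions made for this example.

```rust
// Hypothetical illustration: NOT the varinteger crate's code and NOT the
// bug reported by Mistral. Assumes an unsigned LEB128-style encoding.
//
// Returns the decoded value and the number of bytes consumed, or None if
// the input is unterminated or would not fit in a u64.
fn decode_varint_checked(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        let shift = 7 * i as u32;
        let chunk = (byte & 0x7f) as u64;
        // Reject any chunk whose bits would be shifted past bit 63.
        // An unchecked decoder would silently drop those bits instead.
        if shift >= 64 || (chunk << shift) >> shift != chunk {
            return None;
        }
        value |= chunk << shift;
        // A clear high bit marks the final byte of the integer.
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    // Ran out of input before seeing a terminating byte.
    None
}
```

Called on the bytes `[0xAC, 0x02]`, this returns `Some((300, 2))`. A ten-byte input whose final byte carries more than one payload bit returns `None` rather than a truncated number. The explicit check is the kind of property a verification tool would try to prove holds for every possible input.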
Why it matters
An open-source model matching or surpassing other open implementations on formal-math benchmarks lowers the barrier for researchers and developers working in theorem proving and software verification. Lean 4 is designed for formally verifying mathematical proofs and software correctness; a model trained for that task with reported wins on miniF2F, PutnamBench, FATE-H and FATE-X becomes a practical tool for both automated proof search and code auditing.
The report that Leanstral 1.5 found five previously unknown bugs in a 57-repository sweep signals that formal-verification models can surface real defects in production code, not just toy proofs. The presence of a closed-source system, Aleph Prover, still outperforming on PutnamBench also frames this as a competitive space where open and closed offerings vie on high-stakes benchmarks.
What to watch
Watch for independent replication of the 57-repo bug sweep and for how the Lean 4 community adopts Leanstral 1.5 in tooling and CI pipelines. The next clear milestone will be whether an open-source model matches or exceeds Aleph Prover on PutnamBench, and whether follow-up evaluations extend beyond competition-math and algebra benchmarks into broader formal-verification workloads.
Mistral's primary data points: 100 percent on miniF2F, 587 of 672 PutnamBench problems solved, FATE-H at 87 percent, FATE-X at 34 percent, and five previously unknown bugs found across 57 scanned repositories. The model is released under Apache 2.0 and is available on Hugging Face and via a free API.
Written by The Brieftide · Source: The Decoder