QMFOL benchmark: QMFOLBench with 2880 logic instances
QMFOL generates monadic first-order logic problems and ships QMFOLBench, a 2880-instance benchmark for measuring LLM deductive reasoning.
TL;DR
- QMFOL generates monadic first-order logic problems and ships QMFOLBench, a 2880-instance benchmark for measuring LLM deductive reasoning.
- QMFOL, an automated framework by Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette and Kailong Wang, was submitted to arXiv on 18 Jun 2026 (arXiv:2606.20227).
- It generates monadic first-order logic reasoning tasks and assembles QMFOLBench, a benchmark of 2880 instances spanning 960 configurations, to measure deductive reasoning in modern language models.
QMFOL, an automated framework by Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette and Kailong Wang, was submitted to arXiv on 18 Jun 2026 (arXiv:2606.20227). It generates monadic first-order logic reasoning tasks and assembles QMFOLBench, a benchmark of 2880 instances with 960 configurations, to measure deductive reasoning in modern language models.
What is QMFOL and how does it generate tasks?
QMFOL is an automated generator for monadic first-order logic tasks with quantifiable, controllable complexity. It constructs formal logical structures from conjunction and disjunction patterns, with explicit controls for reasoning depth, width, label types and distractors. The framework then translates those structures into natural language via large language models and enforces logical consistency through round-trip verification with an external prover.
The paper describes how QMFOL builds logical structures from parameterized patterns so researchers can vary the exact shape of problems. The pipeline (formal structure, natural-language rendering, external-prover verification) is intended to balance semantic diversity against logical consistency, a balance the authors say existing benchmarks fail to strike.
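To make the depth, width and distractor controls concrete, here is a minimal sketch in Python. It is not the authors' generator: it covers only a conjunction-only fragment with rules of the form "for all x, P1(x) and ... and Pw(x) implies Q(x)", and a simple forward-chaining loop stands in for the external prover. The names `Rule`, `generate` and `entails` are hypothetical and chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # Read as: for all x, (AND of body predicates on x) -> head(x).
    body: frozenset
    head: str


def generate(depth, width, distractors, seed=0):
    """Build a toy problem: starting facts, shuffled rules, and a goal predicate."""
    rng = random.Random(seed)
    facts = {f"P0_{i}" for i in range(width)}
    rules = []
    prev = sorted(facts)
    for d in range(1, depth + 1):
        # Each layer derives `width` new predicates from the whole previous layer.
        layer = [f"P{d}_{i}" for i in range(width)]
        for head in layer:
            rules.append(Rule(frozenset(prev), head))
        prev = layer
    goal = prev[0]
    for k in range(distractors):
        # A distractor depends on a predicate that is never established,
        # so it can never fire and never contributes to the goal.
        rules.append(Rule(frozenset({f"D{k}_missing"}), f"D{k}"))
    rng.shuffle(rules)
    return facts, rules, goal


def entails(facts, rules, goal):
    """Forward chaining to a fixed point (a stand-in for an external prover)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.body <= known and rule.head not in known:
                known.add(rule.head)
                changed = True
    return goal in known


facts, rules, goal = generate(depth=3, width=2, distractors=4)
label = "True" if entails(facts, rules, goal) else "Unknown"
```

In this fragment, removing any starting fact blocks the chain, so the same structure yields an "Unknown" instance. Producing "False" labels would require negation, which this sketch leaves out.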
How did models perform on QMFOLBench?
QMFOLBench comprises 2880 instances covering 960 configurations; the authors evaluated six large reasoning models (LRMs) and two LLMs and found that performance declines and computational overhead rises as logical complexity increases. Models scored higher on True-labeled tasks than on False or Unknown labels, and their accuracy was sensitive to semantic variation introduced in the natural-language rendering.
The paper reports two consistent patterns: increased logical complexity reduced model accuracy and raised computational cost, and different label types produced measurable performance gaps, with True labels easier for the evaluated models than False or Unknown. The benchmark is presented as scalable and reliable for constructing deductive reasoning testbeds with these controllable axes.
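A per-label breakdown like the one described above can be computed with a few lines of Python. This is a sketch under stated assumptions: the `(gold, predicted)` record format and the function name `accuracy_by_label` are hypothetical, and the sample records are made up for illustration, not results from the paper. The only figure taken from the source is the ratio of 2880 instances to 960 configurations, which works out to three instances per configuration on average; how the instances are actually distributed is not stated here.

```python
from collections import defaultdict


def accuracy_by_label(records):
    """Return accuracy per gold label from (gold, predicted) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for gold, predicted in records:
        totals[gold] += 1
        correct[gold] += int(gold == predicted)
    return {label: correct[label] / totals[label] for label in totals}


# Illustrative, made-up predictions from a model that over-answers "True".
records = [
    ("True", "True"),
    ("True", "True"),
    ("False", "True"),
    ("False", "False"),
    ("Unknown", "True"),
    ("Unknown", "Unknown"),
]
per_label = accuracy_by_label(records)

# Average instances per configuration, from the figures in the article.
instances_per_config = 2880 // 960
```

Grouping the same records by depth or width instead of by label would give the complexity curves the article refers to.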
Why it matters
Benchmarks that expose specific failure modes matter because they let researchers and engineers target exact weaknesses. QMFOL's explicit controls for depth, width, distractors and label types make it possible to isolate whether a model fails because of combinatorial complexity, semantic paraphrase, or label ambiguity. That specificity should help developers design targeted training or prompting interventions and let evaluators compare models on identical, parameterized reasoning challenges.
What to watch
Look for broader adoption of QMFOLBench configurations in future papers and model leaderboards, and for external groups to reuse the round-trip verification approach to ensure logical consistency in language-rendered test cases. The next concrete milestone will be independent replication of the reported performance trends across a wider set of models and decoding settings.
QMFOL and QMFOLBench aim to give the community a repeatable, tunable way to stress deductive reasoning in language models, with clear numeric controls and an external prover in the loop, as laid out in the arXiv submission dated 18 Jun 2026 (arXiv:2606.20227).
Written by The Brieftide · Source: arXiv