Foundation Models · 6 min read

QMFOL benchmark: QMFOLBench with 2880 logic instances

QMFOL generates monadic first-order logic problems and ships QMFOLBench with 2880 instances to measure LLM deductive reasoning.

The Brieftide

TL;DR

  • QMFOL generates monadic first-order logic problems and ships QMFOLBench with 2880 instances to measure LLM deductive reasoning.
  • QMFOL, an automated framework by Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette and Kailong Wang, was submitted to arXiv on 18 Jun 2026 (arXiv:2606.20227).
  • It generates monadic first-order logic reasoning tasks and assembles QMFOLBench, a benchmark of 2880 instances with 960 configurations, to measure deductive reasoning in modern language models.

What is QMFOL and how does it generate tasks?

QMFOL is an automated generator for monadic first-order logic tasks with quantifiable, controllable complexity. It constructs formal logical structures from conjunction and disjunction patterns, with explicit controls for reasoning depth, width, label types and distractors. The framework then translates those structures into natural language using large language models and enforces logical consistency through round-trip verification with an external prover.
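To make the idea of controllable axes concrete, here is a minimal Python sketch of how such a configuration space could be enumerated. Everything in it is an assumption for illustration: the `TaskConfig` class, the field meanings and the value ranges are hypothetical and are not taken from the paper, and the resulting grid size is unrelated to the 960 configurations in QMFOLBench.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical bundle of the controls the article lists."""
    depth: int        # assumed meaning: length of the inference chain
    width: int        # assumed meaning: premises combined at each step
    pattern: str      # "conjunction" or "disjunction"
    label: str        # "True", "False" or "Unknown"
    distractors: int  # number of irrelevant rules mixed in


def config_grid(depths, widths, patterns, labels, distractor_counts):
    """Return every combination of the given axis values."""
    return [
        TaskConfig(*combo)
        for combo in product(depths, widths, patterns, labels, distractor_counts)
    ]


# Illustrative axis values only, not the benchmark's actual settings.
grid = config_grid(
    depths=range(1, 4),
    widths=range(1, 3),
    patterns=("conjunction", "disjunction"),
    labels=("True", "False", "Unknown"),
    distractor_counts=(0, 2),
)
print(len(grid))  # 3 * 2 * 2 * 3 * 2 = 72 combinations
```

Holding all fields but one fixed across a set of tasks is what lets an evaluator attribute an accuracy change to a single axis.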

The paper describes how QMFOL builds logical structures from parametrized patterns so researchers can vary the exact shape of each problem. The three-stage pipeline (formal structure, natural-language rendering, external-prover verification) is intended to balance semantic diversity against logical consistency, a balance the authors say existing benchmarks fail to strike.
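The verification step can be illustrated with a toy example. The Python sketch below is one plausible reading of a round-trip check, not the authors' implementation: a tiny forward-chaining routine stands in for the external prover, the LLM translation steps are omitted, and all names (`Rule`, `label`, `round_trip_consistent`) are hypothetical. The stand-in prover handles only conjunctive rules about a single individual, so "Unknown" here means "not derived by this toy procedure".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # Reads as: for all x, if every premise holds of x, so does the conclusion.
    premises: tuple
    conclusion: str


def negate(literal):
    """Toggle the '~' prefix that marks a negated predicate."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def closure(facts, rules):
    """Forward-chain until no rule adds a new literal."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conclusion not in known and all(p in known for p in rule.premises):
                known.add(rule.conclusion)
                changed = True
    return known


def label(facts, rules, query):
    """Classify a query as 'True', 'False' or 'Unknown'."""
    known = closure(facts, rules)
    if query in known:
        return "True"
    if negate(query) in known:
        return "False"
    return "Unknown"


def round_trip_consistent(original, recovered, query):
    """Check that a structure parsed back from text keeps the original label."""
    return label(*original, query) == label(*recovered, query)


facts = {"bird"}
rules = [
    Rule(("bird",), "wings"),
    Rule(("wings", "light"), "flies"),  # conjunction of two premises
    Rule(("bird",), "~fish"),
    Rule(("cat",), "purrs"),            # distractor: never fires
]
print(label(facts, rules, "wings"))  # True
print(label(facts, rules, "fish"))   # False
print(label(facts, rules, "flies"))  # Unknown

# A lossy rendering that dropped the first rule changes the label.
lossy = (facts, rules[1:])
print(round_trip_consistent((facts, rules), lossy, "wings"))  # False
```

In a real pipeline the `recovered` structure would come from parsing the generated natural-language text back into logic, and a full theorem prover would replace `closure`.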

How did models perform on QMFOLBench?

QMFOLBench comprises 2880 instances covering 960 configurations; the authors evaluated six large reasoning models (LRMs) and two LLMs and found that performance declines and computational overhead rises as logical complexity increases. Models scored higher on True-labeled tasks than on False or Unknown labels, and their accuracy was sensitive to semantic variation introduced in the natural-language rendering.

The paper reports two consistent patterns: increased logical complexity reduced model accuracy and raised computational cost, and different label types produced measurable performance gaps, with True labels easier for the evaluated models than False or Unknown. The benchmark is presented as scalable and reliable for constructing deductive reasoning testbeds with these controllable axes.
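Breakdowns like these come from grouping per-instance results by one axis at a time. The Python sketch below shows that aggregation under stated assumptions: the record fields (`gold`, `pred`, `depth`) and the `accuracy_by` helper are hypothetical, and the six records are invented toy data, not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return accuracy per distinct value of records[i][key]."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["gold"] == rec["pred"])
    return {group: hits[group] / totals[group] for group in totals}


# Invented toy records for illustration only.
records = [
    {"gold": "True", "pred": "True", "depth": 1},
    {"gold": "True", "pred": "True", "depth": 2},
    {"gold": "False", "pred": "True", "depth": 2},
    {"gold": "False", "pred": "False", "depth": 1},
    {"gold": "Unknown", "pred": "False", "depth": 2},
    {"gold": "Unknown", "pred": "Unknown", "depth": 1},
]

print(accuracy_by(records, "gold"))   # {'True': 1.0, 'False': 0.5, 'Unknown': 0.5}
print(accuracy_by(records, "depth"))  # depth 1 -> 1.0, depth 2 -> 1/3
```

The same helper works for any recorded axis, so a per-label gap and a per-depth decline can be read off the same result file.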

Why it matters

Benchmarks that expose specific failure modes matter because they let researchers and engineers target exact weaknesses. QMFOL's explicit controls for depth, width, distractors and label types make it possible to isolate whether a model fails because of combinatorial complexity, semantic paraphrase, or label ambiguity. That specificity should help developers design targeted training or prompting interventions and let evaluators compare models on identical, parameterized reasoning challenges.

What to watch

Look for broader adoption of QMFOLBench configurations in future papers and model leaderboards, and for external groups to reuse the round-trip verification approach to ensure logical consistency in language-rendered test cases. The next concrete milestone will be independent replication of the reported performance trends across a wider set of models and decoding settings.

QMFOL and QMFOLBench aim to give the community a repeatable, tunable way to stress deductive reasoning in language models, with clear numeric controls and an external prover in the loop, as laid out in the arXiv submission dated 18 Jun 2026 (arXiv:2606.20227).

Written by The Brieftide · Source: arXiv
