Foundation Models · June 20, 2026 · 6 min read

QMFOL benchmark: QMFOLBench with 2880 logic instances

QMFOL generates monadic first-order logic problems and ships QMFOLBench, a 2880-instance benchmark for measuring LLM deductive reasoning.

The Brieftide · June 20, 2026

QMFOL, an automated framework by Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette and Kailong Wang, was submitted to arXiv on 18 Jun 2026 (arXiv:2606.20227). It generates monadic first-order logic reasoning tasks and assembles QMFOLBench, a benchmark of 2880 instances with 960 configurations, to measure deductive reasoning in modern language models.

What is QMFOL and how does it generate tasks?

QMFOL is an automated generator for monadic first-order logic tasks with quantifiable, controllable complexity. It constructs formal logical structures from conjunction and disjunction patterns, with explicit controls for reasoning depth, width, label types and distractors. The framework then translates those structures into natural language using large language models and enforces logical consistency through round-trip verification with an external prover.
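To make the depth, width and distractor controls concrete, here is a minimal sketch of a parameterized generator. It is not the authors' implementation: the names `Rule`, `Problem`, `generate` and `forward_chain` are hypothetical, the sketch covers conjunction patterns only (no disjunction), and it produces only goals that are provable, so it does not model the False or Unknown label types.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Reads as: for all x, if every predicate in `body` holds of x, `head` holds of x."""
    body: tuple[str, ...]
    head: str


@dataclass
class Problem:
    facts: set[str]    # unary predicates asserted of a single individual
    rules: list[Rule]
    goal: str          # unary predicate to prove of that individual


def generate(depth: int, width: int, distractors: int, seed: int = 0) -> Problem:
    """Build a proof chain of `depth` steps, each needing `width` premises."""
    rng = random.Random(seed)
    facts = {"C0"}
    rules = []
    for k in range(1, depth + 1):
        # Each step needs the previous conclusion plus (width - 1) side facts.
        side = [f"S{k}_{j}" for j in range(width - 1)]
        facts.update(side)
        rules.append(Rule(body=(f"C{k - 1}", *side), head=f"C{k}"))
    for i in range(distractors):
        # Distractor rules never fire: their premise D{i} is never asserted.
        rules.append(Rule(body=(f"D{i}",), head=f"E{i}"))
    rng.shuffle(rules)  # hide the proof order from the reader
    return Problem(facts=facts, rules=rules, goal=f"C{depth}")


def forward_chain(problem: Problem) -> set[str]:
    """Return every predicate derivable from the facts by applying the rules."""
    known = set(problem.facts)
    changed = True
    while changed:
        changed = False
        for rule in problem.rules:
            if rule.head not in known and all(b in known for b in rule.body):
                known.add(rule.head)
                changed = True
    return known


problem = generate(depth=3, width=2, distractors=4)
derived = forward_chain(problem)
```

In this toy setup, raising `depth` lengthens the chain of inferences, raising `width` adds premises per step, and `distractors` adds rules that are irrelevant to the goal.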

The paper describes how QMFOL builds logical structures from parametrized patterns so researchers can vary the exact shape of each problem. The pipeline of formal structure, natural-language rendering and external-prover verification is intended to balance semantic diversity against logical consistency, a balance the authors say existing benchmarks fail to strike.
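The round-trip idea can be illustrated with a small, self-contained sketch. Everything here is an assumption for illustration: `render` is a fixed English template standing in for the LLM rendering step, `parse` is a regex reader standing in for translation back to logic, and `verdict` is a simple forward-chaining check standing in for the external prover, which this brief does not name. The sketch handles conjunctive rules only and distinguishes just "True" from "Unknown".

```python
import re


def render(facts, rules, goal):
    """Stand-in for the LLM step: turn a formal structure into English."""
    lines = [f"Alex is {f}." for f in sorted(facts)]
    for body, head in rules:
        lines.append(f"Everyone who is {' and '.join(body)} is {head}.")
    lines.append(f"Question: is Alex {goal}?")
    return "\n".join(lines)


def parse(text):
    """Read the English rendering back into facts, rules and a goal."""
    facts, rules, goal = set(), [], None
    for line in text.splitlines():
        if m := re.fullmatch(r"Alex is (\w+)\.", line):
            facts.add(m.group(1))
        elif m := re.fullmatch(r"Everyone who is (.+) is (\w+)\.", line):
            rules.append((tuple(m.group(1).split(" and ")), m.group(2)))
        elif m := re.fullmatch(r"Question: is Alex (\w+)\?", line):
            goal = m.group(1)
    return facts, rules, goal


def verdict(facts, rules, goal):
    """Stand-in prover: 'True' if the goal is derivable, else 'Unknown'."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return "True" if goal in known else "Unknown"


def round_trip_ok(text, facts, rules, goal):
    """Accept a rendering only if its parsed form keeps the original verdict."""
    return verdict(*parse(text)) == verdict(facts, rules, goal)


facts = {"kind", "brave"}
rules = [(("kind", "brave"), "trusted")]
goal = "trusted"

faithful = render(facts, rules, goal)
# Simulate a rendering that silently drops one premise.
lossy = "\n".join(l for l in faithful.splitlines() if l != "Alex is brave.")
```

With these inputs, `round_trip_ok(faithful, facts, rules, goal)` accepts the rendering, while the same check on `lossy` rejects it because the dropped premise changes the verdict from "True" to "Unknown". A real pipeline would face free-form LLM text rather than a template, so the parsing step would be far less trivial than the regexes used here.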

How did models perform on QMFOLBench?

QMFOLBench comprises 2880 instances covering 960 configurations. The authors evaluated six large reasoning models (LRMs) and two LLMs and found that performance declines and computational overhead rises as logical complexity increases. Models scored higher on True-labeled tasks than on False- or Unknown-labeled ones, and their accuracy was sensitive to semantic variation introduced in the natural-language rendering.

The paper reports two consistent patterns: increased logical complexity reduced model accuracy and raised computational cost, and different label types produced measurable performance gaps, with True labels easier for the evaluated models than False or Unknown. The benchmark is presented as scalable and reliable for constructing deductive reasoning testbeds with these controllable axes.
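Breakdowns like these come from grouping per-instance results by one controlled axis at a time. The sketch below shows one way to do that. The record fields (`label`, `depth`, `prediction`) and the helper `accuracy_by` are hypothetical, and the six records are made-up placeholders chosen only to exercise the code, not results from the paper.

```python
from collections import defaultdict

# Arithmetic only: 2880 instances over 960 configurations averages three per
# configuration. The brief does not say how instances are actually distributed.
AVG_INSTANCES_PER_CONFIG = 2880 / 960


def accuracy_by(records, key):
    """Group records by `key` and return the fraction answered correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["prediction"] == r["label"])
    return {k: hits[k] / totals[k] for k in totals}


# Placeholder records (NOT data from the paper).
records = [
    {"label": "True", "depth": 1, "prediction": "True"},
    {"label": "True", "depth": 2, "prediction": "True"},
    {"label": "False", "depth": 1, "prediction": "False"},
    {"label": "False", "depth": 2, "prediction": "Unknown"},
    {"label": "Unknown", "depth": 1, "prediction": "Unknown"},
    {"label": "Unknown", "depth": 2, "prediction": "True"},
]

by_label = accuracy_by(records, "label")
by_depth = accuracy_by(records, "depth")
```

Calling the same helper with `"label"` and then `"depth"` separates a label-type gap from a complexity trend, which is the kind of comparison the benchmark's controllable axes are meant to support.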

Why it matters

Benchmarks that expose specific failure modes matter because they let researchers and engineers target exact weaknesses. QMFOL's explicit controls for depth, width, distractors and label types make it possible to isolate whether a model fails because of combinatorial complexity, semantic paraphrase, or label ambiguity. That specificity should help developers design targeted training or prompting interventions and let evaluators compare models on identical, parameterized reasoning challenges.

What to watch

Look for broader adoption of QMFOLBench configurations in future papers and model leaderboards, and for external groups to reuse the round-trip verification approach to ensure logical consistency in language-rendered test cases. The next concrete milestone will be independent replication of the reported performance trends across a wider set of models and decoding settings.

QMFOL and QMFOLBench aim to give the community a repeatable, tunable way to stress deductive reasoning in language models, with clear numeric controls and an external prover in the loop, as laid out in the arXiv submission dated 18 Jun 2026 (arXiv:2606.20227).

Written by The Brieftide · Source: arXiv
