CombEval: Benchmarking combinatorial counting in 11 LLMs

CombEval is a dynamic, solver-verified benchmark for combinatorial counting that tests 11 LLMs across varied object types.

The Brieftide

TL;DR

  • CombEval is a dynamic, solver-verified benchmark for combinatorial counting that tests 11 LLMs across varied object types.
  • CombEval, a dynamic benchmark for combinatorial counting, landed on arXiv on 18 Jun 2026 as arXiv:2606.19788.
  • The framework supports systematic variation of object type, entity scale, constraint count, and reasoning depth, unlike static collections that lack such programmatic control.

CombEval, a dynamic benchmark for combinatorial counting, landed on arXiv on 18 Jun 2026 as arXiv:2606.19788. The paper, by Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang and Yi Chang, introduces a testbed that generates solver-verified natural-language counting problems and evaluates 11 large language models under direct and code-augmented settings.

What is CombEval and how does it work?

CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints; it then renders controlled natural-language counting problems with exact solver-verified answers. The framework supports systematic variation of object type, entity scale, constraint count, and reasoning depth, unlike static collections that lack such programmatic control.

The paper positions the Cofola specification as the core representational device: problems are defined in a typed formalism that allows automatic generation and exact verification by solvers. The authors state that the code and generated benchmark suites are publicly available (the link is given in the arXiv abstract), and the submission is noted as under review.
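To make the "specification plus solver-verified answer" idea concrete, here is a minimal sketch in Python. It is not Cofola syntax and not the authors' solver: the `CountingProblem` class and the `left_of`, `not_adjacent`, and `brute_force_count` helpers are hypothetical names invented for illustration, and exhaustive enumeration stands in for a real exact solver.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Tuple

Arrangement = Tuple[str, ...]
Constraint = Callable[[Arrangement], bool]


@dataclass(frozen=True)
class CountingProblem:
    """Hypothetical mini-spec: count orderings of `entities` in a row
    that satisfy every constraint. Not the Cofola format."""
    entities: Tuple[str, ...]
    constraints: Tuple[Constraint, ...] = ()


def left_of(a: str, b: str) -> Constraint:
    """Relative positional constraint: `a` stands somewhere before `b`."""
    return lambda arr: arr.index(a) < arr.index(b)


def not_adjacent(a: str, b: str) -> Constraint:
    """`a` and `b` must not occupy neighbouring positions."""
    return lambda arr: abs(arr.index(a) - arr.index(b)) != 1


def brute_force_count(problem: CountingProblem) -> int:
    """Exact answer by enumerating all orderings (a stand-in for a solver;
    only feasible at small entity scale)."""
    return sum(
        all(check(arr) for check in problem.constraints)
        for arr in permutations(problem.entities)
    )


# Illustrative instance: four people in a row, two constraints.
problem = CountingProblem(
    entities=("Ann", "Bob", "Cy", "Dee"),
    constraints=(left_of("Ann", "Bob"), not_adjacent("Cy", "Dee")),
)
answer = brute_force_count(problem)
print(answer)  # 6
```

Because the problem is data rather than free text, changing the entity tuple or the constraint tuple yields a new instance whose ground truth is recomputed automatically, which is the kind of programmatic control the article attributes to CombEval.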

How did the 11 LLMs perform?

Across both direct prompting and code-augmented settings, the 11 evaluated models showed consistent weaknesses: they remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. The authors report systematic failure modes rather than isolated mistakes, and they back that claim with an error analysis that highlights failures in constraint interpretation and in applying counting principles.

The paper contrasts CombEval with static collections: CombEval offers solver-verified answers and systematic parameter variation, which the authors use to isolate when models fail (for example, when object indistinguishability or nested dependencies are involved). The evaluation reported in the abstract does not list per-model scores; it summarizes cross-model brittleness on the four named problem types and identifies two broad sources of error: constraint interpretation and misapplied counting principles.
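A small worked example shows why indistinguishable elements are an easy place to misapply a counting principle. The instance below (five balls, two red, two blue, one green) is an illustration chosen here, not a problem taken from the paper; the function names are likewise hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def multiset_arrangements(items: str) -> int:
    """Closed form n! / (m1! * m2! * ...), where m_i are the multiplicities
    of the indistinguishable elements."""
    total = factorial(len(items))
    for multiplicity in Counter(items).values():
        total //= factorial(multiplicity)
    return total


def brute_force(items: str) -> int:
    """Enumerate every ordering and keep only the distinct ones."""
    return len(set(permutations(items)))


balls = "RRBBG"  # two red, two blue, one green

naive = factorial(len(balls))           # treats all five balls as distinct
correct = multiset_arrangements(balls)  # accounts for identical balls
checked = brute_force(balls)            # independent exact check

print(naive, correct, checked)  # 120 30 30
```

Treating the balls as distinct overcounts by a factor of 2! · 2! = 4. An exact reference answer makes that kind of error detectable as a wrong count rather than a matter of interpretation.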

Why it matters

CombEval matters because it supplies a diagnostic testbed with exact answers and controlled difficulty axes, enabling researchers to probe specific combinatorial failure modes. By generating problems from typed specifications and verifying answers with solvers, the framework removes ambiguity about ground truth and lets evaluations focus on reasoning errors rather than dataset noise. That makes it possible to measure how changes in object type, entity scale, constraint count, or reasoning depth affect model behavior.

Researchers and model developers can use those controlled variations to track whether improvements come from better prompt engineering, code augmentation, or deeper changes to model architecture and training. The authors’ identification of failures in constraint interpretation and counting principles points to concrete targets for future interventions.

What to watch

Watch for the public code and generated benchmark suites linked from the arXiv abstract, and for subsequent versions of the submission while it is under review. The next concrete signals will be evaluation splits or per-model breakdowns published by the authors, and any follow-up work that uses CombEval’s controlled variations to show improved handling of ordered objects, indistinguishable elements, positional constraints, or nested dependencies.

Details and provenance: CombEval appears on arXiv as arXiv:2606.19788, submitted 18 Jun 2026, authored by Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang and Yi Chang. The abstract states that the framework evaluates 11 LLMs under direct and code-augmented settings and that the code and generated suites are publicly available via a link given there.

CombEval versus static collections

Item                                                         | CombEval | Static collections
Solver-verified answers                                      | Yes      | No
Dynamic generation                                           | Yes      | No
Systematic variation of object type                          | Yes      | Limited
Control over entity scale, constraint count, reasoning depth | Yes      | No
Designed for diagnostic error analysis                       | Yes      | No

Written by The Brieftide · Source: arXiv
