CombEval: Benchmarking combinatorial counting in 11 LLMs
CombEval is a dynamic, solver-verified benchmark for combinatorial counting that tests 11 LLMs across varied object types.
TL;DR
- CombEval is a dynamic, solver-verified benchmark for combinatorial counting that tests 11 LLMs across varied object types.
- CombEval, a dynamic benchmark for combinatorial counting, landed on arXiv on 18 Jun 2026 as arXiv:2606.19788.
- The framework supports systematic variation of object type, entity scale, constraint count, and reasoning depth, unlike static collections that lack such programmatic control.
CombEval, a dynamic benchmark for combinatorial counting, landed on arXiv on 18 Jun 2026 as arXiv:2606.19788. The paper, by Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang and Yi Chang, introduces a testbed that generates solver-verified natural-language counting problems and evaluates 11 large language models under direct and code-augmented settings.
What is CombEval and how does it work?
CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints; it then renders controlled natural-language counting problems with exact solver-verified answers. The framework supports systematic variation of object type, entity scale, constraint count, and reasoning depth, unlike static collections that lack such programmatic control.
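The specify, render, and verify loop described above can be sketched in a few lines. This is a toy illustration only, not Cofola syntax and not the authors' generator: the `CountingSpec` class, the `render` and `solve` helpers, and the example names are all hypothetical, and the "solver" here is plain brute-force enumeration.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import Callable, List, Tuple

Arrangement = Tuple[str, ...]


@dataclass
class CountingSpec:
    """Hypothetical typed problem spec (an assumption, not the Cofola format)."""
    entities: Tuple[str, ...]          # named entities in the problem
    object_type: str                   # kind of combinatorial object to count
    constraints: List[Callable[[Arrangement], bool]] = field(default_factory=list)


def not_adjacent(a: str, b: str) -> Callable[[Arrangement], bool]:
    """Constraint: a and b may not occupy neighbouring positions."""
    return lambda p: abs(p.index(a) - p.index(b)) != 1


def render(spec: CountingSpec, constraint_text: str) -> str:
    """Turn the spec into a natural-language counting question."""
    names = ", ".join(spec.entities)
    return f"In how many ways can {names} stand in a line if {constraint_text}?"


def solve(spec: CountingSpec) -> int:
    """Exact answer by enumerating every object and filtering on the constraints."""
    if spec.object_type != "permutation":
        raise ValueError("this sketch only handles permutations")
    return sum(
        1 for p in permutations(spec.entities)
        if all(check(p) for check in spec.constraints)
    )


spec = CountingSpec(
    entities=("Ann", "Bob", "Cy", "Dee"),
    object_type="permutation",
    constraints=[not_adjacent("Ann", "Bob")],
)
question = render(spec, "Ann and Bob do not stand next to each other")
answer = solve(spec)  # 4! total minus 2 * 3! adjacent line-ups = 12
print(question, "->", answer)
```

Because the question text and the exact answer both come from the same specification, the ground truth cannot drift from the wording, which is the property the paper attributes to solver verification.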
The paper positions the Cofola specification as the core representational device: problems are defined in a typed formalism that allows automatic generation and exact verification by solvers. The authors state that the code and generated benchmark suites are publicly available via a link in the abstract, and the submission is noted as under review.
How did the 11 LLMs perform?
Across both direct prompting and code-augmented settings, the 11 evaluated models showed consistent weaknesses: they remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. The authors report systematic failure modes rather than isolated mistakes, and they back those claims with an error analysis that highlights failures in constraint interpretation and counting principles.
The paper contrasts CombEval with static collections: CombEval offers solver-verified answers and systematic parameter variation, which the authors use to isolate when models fail (for example, when object indistinguishability or nested dependencies are involved). The evaluation reported in the abstract does not list per-model scores; it summarizes cross-model brittleness on the four named problem types and identifies two broad sources of error: constraint interpretation and misapplied counting principles.
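To make two of the named problem types concrete, here is a small sketch of why indistinguishable elements and relative positional constraints are easy to miscount. The examples are illustrative assumptions, not items from the benchmark; each count is checked by brute-force enumeration against the standard closed form.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def arrangements_brute(items):
    """Count visually distinct line-ups by enumerating and de-duplicating."""
    return len(set(permutations(items)))


def arrangements_formula(items):
    """Multinomial count: n! divided by the factorial of each repeat count."""
    total = factorial(len(items))
    for repeats in Counter(items).values():
        total //= factorial(repeats)
    return total


def left_of_count(people, a, b):
    """Count line-ups in which a stands somewhere to the left of b."""
    return sum(1 for p in permutations(people) if p.index(a) < p.index(b))


# Indistinguishable elements: two identical red balls plus one blue, one green.
balls = ["red", "red", "blue", "green"]
naive = factorial(len(balls))            # 24, wrongly treats the reds as distinct
correct = arrangements_formula(balls)    # 4! / 2! = 12
assert arrangements_brute(balls) == correct

# Relative positional constraint: by symmetry, exactly half of all line-ups qualify.
people = "ABCDE"
constrained = left_of_count(people, "A", "B")   # 5! / 2 = 60
print(naive, correct, constrained)
```

The gap between `naive` and `correct` is the kind of counting-principle error the paper's error analysis describes, while misreading "to the left of" as "immediately to the left of" would be a constraint-interpretation error.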
Why it matters
CombEval matters because it supplies a diagnostic testbed with exact answers and controlled difficulty axes, enabling researchers to probe specific combinatorial failure modes. By generating problems from typed specifications and verifying answers with solvers, the framework removes ambiguity about ground truth and lets evaluations focus on reasoning errors rather than dataset noise. That makes it possible to measure how changes in object type, entity scale, constraint count, or reasoning depth affect model behavior.
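A controlled sweep over the four difficulty axes might look like the sketch below. The axis names come from the paper's description; the specific values, the `sweep` and `one_axis_slice` helpers, and the idea of slicing against a baseline are assumptions made for illustration.

```python
from itertools import product

# Axis names follow the article; the values are hypothetical placeholders.
AXES = {
    "object_type": ["permutation", "subset", "multiset"],
    "entity_scale": [4, 8],
    "constraint_count": [0, 1, 2],
    "reasoning_depth": [1, 2],
}


def sweep(axes):
    """Return every combination of axis values as a list of config dicts."""
    keys = list(axes)
    return [dict(zip(keys, values)) for values in product(*(axes[k] for k in keys))]


def one_axis_slice(configs, axis, baseline):
    """Keep configs that match the baseline on every axis except `axis`."""
    return [
        c for c in configs
        if all(c[k] == baseline[k] for k in c if k != axis)
    ]


configs = sweep(AXES)                     # 3 * 2 * 3 * 2 = 36 configurations
baseline = configs[0]                     # permutation, scale 4, 0 constraints, depth 1
by_constraints = one_axis_slice(configs, "constraint_count", baseline)
for config in by_constraints:
    print(config)
```

Holding three axes fixed while varying the fourth is what lets a change in accuracy be attributed to that axis rather than to an uncontrolled mix of difficulty factors.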
Researchers and model developers can use those controlled variations to track whether improvements come from better prompt engineering, code augmentation, or deeper changes to model architecture and training. The authors’ identification of failures in constraint interpretation and counting principles points to concrete targets for future interventions.
What to watch
Watch for the public code and generated benchmark suites linked from the abstract, and for subsequent versions of the submission while it is under review. The next concrete signals will be evaluation splits or per-model breakdowns published by the authors, and any follow-up work that uses CombEval’s controlled variations to show improved handling of ordered objects, indistinguishable elements, positional constraints, or nested dependencies.
Details and provenance: CombEval appears on arXiv as arXiv:2606.19788, submitted 18 Jun 2026, authored by Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang and Yi Chang. The abstract states that the framework evaluates 11 LLMs under direct and code-augmented settings and that the code and generated suites are publicly available via a link given there.
| Feature | CombEval | Static collections |
|---|---|---|
| Solver-verified answers | Yes | No |
| Dynamic generation | Yes | No |
| Systematic variation of object type | Yes | Limited |
| Control over entity scale, constraint count, reasoning depth | Yes | No |
| Designed for diagnostic error analysis | Yes | No |
Written by The Brieftide · Source: arXiv