Foundation Models · June 18, 2026 · 5 min read

DeFAb: Defeasible Abduction Benchmark, 372,648+ instances

DeFAb converts four decades of publicly funded knowledge bases into 372,648+ formally grounded instances for defeasible abduction.

The Brieftide · June 18, 2026

TL;DR

01 DeFAb converts four decades of publicly funded knowledge bases into 372,648+ formally grounded instances for defeasible abduction.
02 The release includes 372,648+ instances materialized into 33.75M rules from 18 sources and three difficulty levels with polynomial-time verifiable gold standards.
03 The authors also release two variants, DeFAb-Hard and CONJURE.

DeFAb, introduced on arXiv on 17 Jun 2026 by Patrick Cooper and Alvaro Velasquez, is a large dataset and pipeline that turns four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction. The release includes 372,648+ instances materialized into 33.75M rules from 18 sources and three difficulty levels with polynomial-time verifiable gold standards.

What is DeFAb and how is it built?

DeFAb is a dataset and generation pipeline that pairs taxonomic hierarchies with behavioral property graphs to produce formally grounded defeasible-abduction instances; it materializes 33.75M rules from 18 sources into 372,648+ instances across three levels. The pipeline combines OpenCyc, YAGO, and Wikidata taxonomies with ConceptNet and UMLS property graphs, and enforces polynomial-time checks for valid derivation, conservativity, and minimality so every gold hypothesis is verifiable.
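To make the three checks concrete, here is a minimal sketch in Python. It is not the paper's verifier: it assumes a simplified setting of propositional definite rules with plain forward chaining and no defeaters, and every name in it (`closure`, `derives`, `conservative`, `minimal`, `verify`, the toy atoms) is hypothetical. Under that monotone assumption, minimality can be checked by removing one rule at a time, which keeps all three checks polynomial.

```python
# Toy analogue of the three checks named above (derivation, conservativity,
# minimality). Assumption: propositional definite rules, no defeaters.
Rule = tuple[frozenset[str], str]  # (body atoms, head atom)


def closure(facts, rules):
    """Forward-chain the rules over the facts until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


def derives(facts, rules, hypothesis, target):
    """Valid derivation: theory plus hypothesis entails the target."""
    return target in closure(facts, list(rules) + list(hypothesis))


def conservative(facts, rules, hypothesis, protected):
    """Conservativity: no protected atom changes status after the addition."""
    before = closure(facts, rules)
    after = closure(facts, list(rules) + list(hypothesis))
    return all((atom in before) == (atom in after) for atom in protected)


def minimal(facts, rules, hypothesis, target):
    """Minimality: dropping any single rule loses the target.

    Single-rule removal is enough only because this toy logic is monotone.
    """
    hyp = list(hypothesis)
    return not any(
        derives(facts, rules, hyp[:i] + hyp[i + 1:], target)
        for i in range(len(hyp))
    )


# Hypothetical toy theory: penguins are birds; we must explain "swims".
facts = {"penguin"}
rules = [(frozenset({"penguin"}), "bird")]
protected = {"flies"}  # expectations the hypothesis must not disturb
good = [(frozenset({"penguin"}), "swims")]
bad = [(frozenset({"bird"}), "swims"), (frozenset({"bird"}), "flies")]


def verify(hypothesis, target="swims"):
    return (
        derives(facts, rules, hypothesis, target)
        and conservative(facts, rules, hypothesis, protected)
        and minimal(facts, rules, hypothesis, target)
    )


print(verify(good), verify(bad))  # prints: True False
```

The `bad` hypothesis explains the target but also makes the penguin fly and carries a redundant rule, so it fails both conservativity and minimality.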

How do models and symbolic systems perform on DeFAb?

A rule-based logic solver resolves every instance in the benchmark in under 50 microseconds with 100% accuracy. By contrast, the best frontier language model reaches at most 65% and falls to 23.5% under a rendering-robust evaluation. Across four surface renderings, rendering-robust Level 2 accuracy for four frontier models ranges from 7.8% to 23.5%. Chain-of-thought prompting introduces variance of about 36 percentage points, and a matched contamination control isolates a +19.4 percentage-point gap on Level 3.
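The brief does not spell out how the rendering-robust score is computed. One plausible reading, assumed here and not confirmed by the source, is that an instance counts as solved only if the model answers correctly under every surface rendering. The sketch below uses made-up results and hypothetical names (`results`, `per_rendering_accuracy`, `rendering_robust_accuracy`) to show why such a score sits below any single-rendering accuracy.

```python
# Assumption: "rendering-robust" means correct under ALL renderings.
# The data below is invented for illustration, not taken from the paper.
results = {
    "i1": {"r1": True, "r2": True, "r3": True, "r4": True},
    "i2": {"r1": True, "r2": False, "r3": True, "r4": True},
    "i3": {"r1": False, "r2": False, "r3": False, "r4": False},
    "i4": {"r1": True, "r2": True, "r3": False, "r4": True},
}


def per_rendering_accuracy(results):
    """Accuracy computed separately for each surface rendering."""
    renderings = sorted({r for outcome in results.values() for r in outcome})
    return {
        r: sum(outcome[r] for outcome in results.values()) / len(results)
        for r in renderings
    }


def rendering_robust_accuracy(results):
    """Fraction of instances answered correctly under every rendering."""
    return sum(all(outcome.values()) for outcome in results.values()) / len(results)


print(per_rendering_accuracy(results))
print(rendering_robust_accuracy(results))
```

In this toy data each rendering scores 50% or 75% on its own, while only one instance in four survives all renderings.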

The authors also release difficulty variants. DeFAb-Hard is a 235-instance Level 3 variant where the best model achieved 53.3% compared with 100% for the symbolic verifier. CONJURE is a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain; a pilot on CONJURE found zero novel concepts.

Why does a verifiable defeasible-abduction benchmark matter?

DeFAb makes logical rigor the metric for creativity by requiring hypotheses to pass polynomial-time checks for derivation, conservativity, and minimality, so systems that generate fluent but theory-destroying prose score poorly. The gap between symbolic and model performance is wide: the solver is exact and resolves instances in under 50 microseconds, while models drop to under 24% under rendering-robust tests. That gap shows foundation models do not reliably internalize the kind of defeasible reasoning DeFAb measures. The paper also highlights that the verifier can serve as an exact reward signal for preference optimization methods (DPO, RLVR/GRPO), opening a concrete training objective tied to formal correctness.
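As a rough illustration of what an exact reward signal could look like in practice, the sketch below turns a pass/fail verifier into binary rewards, DPO-style (chosen, rejected) pairs, and GRPO-style group-normalized advantages. It is an assumption-laden sketch, not the paper's training setup: `verify` is a stand-in callable, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def binary_rewards(candidates, verify):
    """Reward 1.0 for candidates the verifier accepts, 0.0 otherwise."""
    return [1.0 if verify(c) else 0.0 for c in candidates]


def preference_pairs(candidates, rewards):
    """Pair every accepted candidate with every rejected one (chosen, rejected)."""
    accepted = [c for c, r in zip(candidates, rewards) if r == 1.0]
    rejected = [c for c, r in zip(candidates, rewards) if r == 0.0]
    return [(a, b) for a in accepted for b in rejected]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Stand-in verifier: pretend only hypothesis "a" passes all checks.
candidates = ["a", "b", "c", "d"]
rewards = binary_rewards(candidates, verify=lambda c: c == "a")
pairs = preference_pairs(candidates, rewards)
advantages = group_relative_advantages(rewards)

print(rewards)
print(pairs)
print(advantages)
```

Because the reward comes from a formal check and not a learned judge, every pair and every advantage traces back to a verifiable pass or fail.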

What to watch next

A concrete milestone will be models narrowing the matched contamination control gap of +19.4 percentage points on Level 3 or pushing rendering-robust Level 2 accuracy above 23.5%, the highest figure reported. Other signals include whether any model exceeds the 53.3% best score on DeFAb-Hard or whether future CONJURE experiments produce novel, kernel-verified definitions. The authors have released the dataset, code, and an evaluation harness under an MIT license at the URLs provided in the paper.

Verified accuracy and runtime: symbolic solver versus models

Item	Raw accuracy (%)	Rendering-robust accuracy (%)	Runtime	Notes
Rule-based logic solver	100	N/A	<50 µs	Polynomial-time verifier; resolves every instance
Best frontier language model	65	23.5	N/A	Best raw accuracy 65%; drops under rendering-robust eval
Frontier models (Level 2 range)	N/A	7.8-23.5	N/A	Rendering-robust Level 2 accuracy range over four renderings
DeFAb-Hard best model	53.3	N/A	N/A	235-instance Level 3 variant; symbolic solver 100%

Written by The Brieftide · Source: arXiv
