
DeFAb: Defeasible Abduction Benchmark, 372,648+ instances

DeFAb converts four decades of publicly funded knowledge bases into 372,648+ formally grounded instances for defeasible abduction.

The Brieftide

TL;DR

  • DeFAb converts four decades of publicly funded knowledge bases into 372,648+ formally grounded instances for defeasible abduction.
  • The release comprises 372,648+ instances built from 33.75M rules drawn from 18 sources, across three difficulty levels, with gold standards verifiable in polynomial time.
  • The authors also release two further variants: DeFAb-Hard (235 Level 3 instances) and CONJURE (560 Lean 4/Mathlib instances).

DeFAb, introduced on arXiv on 17 Jun 2026 by Patrick Cooper and Alvaro Velasquez, is a large dataset and pipeline that turns four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction. The release comprises 372,648+ instances built from 33.75M rules drawn from 18 sources, across three difficulty levels, with gold standards verifiable in polynomial time.

What is DeFAb and how is it built?

DeFAb is a dataset and generation pipeline that pairs taxonomic hierarchies with behavioral property graphs to produce formally grounded defeasible-abduction instances; it materializes 33.75M rules from 18 sources into 372,648+ instances across three levels. The pipeline combines OpenCyc, YAGO, and Wikidata taxonomies with ConceptNet and UMLS property graphs, and enforces polynomial-time checks for valid derivation, conservativity, and minimality so every gold hypothesis is verifiable.
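The three checks named above can be illustrated with a toy verifier. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Rule` class, the priority-based conflict handling, single-step rules that fire only on facts, the caller-supplied `protected` set, and the leave-one-out reading of minimality are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset  # atoms that must all be present among the facts
    head: str        # concluded literal; "~p" is the negation of "p"
    priority: int    # assumption: the higher priority wins a conflict


def neg(lit):
    """Return the complementary literal."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def conclusions(facts, rules):
    """Facts plus every applicable rule head that beats all of its rivals."""
    applicable = [r for r in rules if r.body <= facts]
    out = set(facts)
    for r in applicable:
        rivals = [s for s in applicable if s.head == neg(r.head)]
        if all(s.priority < r.priority for s in rivals):
            out.add(r.head)
    return out


def passes(facts, rules, hypothesis, target, protected):
    """Derivation and conservativity for one candidate hypothesis."""
    before = conclusions(facts, rules)
    after = conclusions(facts, rules | hypothesis)
    derives = target in after
    # Protected conclusions that held before must still hold afterwards.
    conservative = (protected & before) <= after
    return derives and conservative


def verify(facts, rules, hypothesis, target, protected):
    """Derivation, conservativity, and leave-one-out (local) minimality."""
    if not passes(facts, rules, hypothesis, target, protected):
        return False
    # Dropping any single rule must break the hypothesis.
    return all(
        not passes(facts, rules, hypothesis - {h}, target, protected)
        for h in hypothesis
    )


facts = {"bird", "penguin"}
rules = {
    Rule(frozenset({"bird"}), "flies", 1),
    Rule(frozenset({"bird"}), "has_feathers", 1),
}
target = "~flies"
protected = {"has_feathers"}

good = {Rule(frozenset({"penguin"}), "~flies", 2)}
padded = good | {Rule(frozenset({"penguin"}), "swims", 1)}
destructive = good | {Rule(frozenset({"penguin"}), "~has_feathers", 2)}

print(verify(facts, rules, good, target, protected))         # True
print(verify(facts, rules, padded, target, protected))       # False: not minimal
print(verify(facts, rules, destructive, target, protected))  # False: not conservative
```

Each check here is a handful of set operations over the rule list, which is why verification of this kind stays cheap. The benchmark's actual definitions of conservativity and minimality may differ from the ones assumed in this sketch.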

How do models and symbolic systems perform on DeFAb?

A rule-based logic solver resolves every instance in the benchmark in under 50 microseconds with 100% accuracy. By contrast, the best frontier language model peaks at 65% and falls to 23.5% under a rendering-robust evaluation. Across four surface renderings, rendering-robust Level 2 accuracy for four frontier models ranges from 7.8% to 23.5%. Chain-of-thought prompting introduces variance of about 36 percentage points, and a matched contamination control isolates a +19.4 percentage-point gap on Level 3 against cleaner splits.
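To make the difference between raw and rendering-robust accuracy concrete, here is a small sketch. It assumes, as one plausible reading not confirmed by the source, that an instance counts toward rendering-robust accuracy only if the model answers it correctly under every surface rendering. The rendering names and the result values are hypothetical.

```python
def per_rendering_accuracy(results):
    """Accuracy for each rendering, given {instance_id: {rendering: bool}}."""
    renderings = sorted({name for outcome in results.values() for name in outcome})
    return {
        name: sum(outcome.get(name, False) for outcome in results.values()) / len(results)
        for name in renderings
    }


def rendering_robust_accuracy(results):
    """Share of instances answered correctly under every rendering (assumed definition)."""
    return sum(all(outcome.values()) for outcome in results.values()) / len(results)


# Hypothetical outcomes for four instances under four hypothetical renderings.
results = {
    "i1": {"formal": True, "prose": True, "table": True, "code": True},
    "i2": {"formal": True, "prose": True, "table": False, "code": True},
    "i3": {"formal": False, "prose": True, "table": True, "code": True},
    "i4": {"formal": False, "prose": False, "table": False, "code": False},
}

print(per_rendering_accuracy(results))
# {'code': 0.75, 'formal': 0.5, 'prose': 0.75, 'table': 0.5}
print(rendering_robust_accuracy(results))  # 0.25
```

Under this reading, the robust score can never exceed the lowest per-rendering score, so a model that looks adequate on each rendering separately can still score far lower once consistency across renderings is required.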

The authors also release difficulty variants. DeFAb-Hard is a 235-instance Level 3 variant where the best model achieved 53.3% compared with 100% for the symbolic verifier. CONJURE is a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain; a pilot on CONJURE found zero novel concepts.

Why does a verifiable defeasible-abduction benchmark matter?

DeFAb makes logical rigor the metric for creativity: hypotheses must pass polynomial-time checks for derivation, conservativity, and minimality, so systems that generate fluent but theory-destroying prose score poorly. The gap between symbolic and model performance is wide. The solver is exact and answers in under 50 microseconds, while models drop below 24% under rendering-robust tests, which shows that foundation models do not reliably internalize the kind of defeasible reasoning DeFAb measures. The paper also highlights that the verifier can serve as an exact reward signal for post-training methods such as DPO and RLVR/GRPO, giving a concrete training objective tied to formal correctness.
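As an illustration of how a pass/fail verifier could feed those training methods, the sketch below turns binary verdicts into GRPO-style group-relative advantages and into chosen/rejected pairs of the kind DPO consumes. It is a hedged example rather than the paper's training recipe: `toy_reward` is a hypothetical stand-in for the real verifier, the candidate strings are invented, and the use of the population standard deviation is an assumption, since implementations differ on that detail.

```python
from statistics import mean, pstdev

# Hypothetical stand-in for the verifier: only this hypothesis is accepted.
ACCEPTED = {"penguin => ~flies"}


def toy_reward(hypothesis):
    """Binary reward: 1.0 if the (stand-in) verifier accepts, else 0.0."""
    return 1.0 if hypothesis in ACCEPTED else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def preference_pairs(candidates, rewards):
    """Pair every accepted candidate with every rejected one."""
    chosen = [c for c, r in zip(candidates, rewards) if r == 1.0]
    rejected = [c for c, r in zip(candidates, rewards) if r == 0.0]
    return [(c, j) for c in chosen for j in rejected]


# Four hypothetical samples for one prompt.
candidates = [
    "penguin => ~flies",
    "bird => ~flies",
    "penguin => ~flies, penguin => swims",
    "penguin => ~has_feathers",
]
rewards = [toy_reward(c) for c in candidates]

advantages = group_relative_advantages(rewards)
pairs = preference_pairs(candidates, rewards)

print(rewards)     # [1.0, 0.0, 0.0, 0.0]
print(len(pairs))  # 3
```

Because the reward is exact, a group in which every sample fails (or every sample passes) yields all-zero advantages and no preference pairs, so the training signal comes only from prompts where the model is sometimes right.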

What to watch next

A concrete milestone will be models narrowing the +19.4 percentage-point matched-contamination gap on Level 3 or pushing rendering-robust Level 2 accuracy above 23.5%, the top of the reported range. Other signals include whether any model exceeds the 53.3% best score on DeFAb-Hard and whether future CONJURE experiments produce novel, kernel-verified definitions. The authors have released the dataset, code, and an evaluation harness under an MIT license at the URLs given in the paper.

Verified accuracy and runtime: symbolic solver versus models

| Item | Accuracy (%) | Rendering-robust accuracy (%) | Runtime | Notes |
| --- | --- | --- | --- | --- |
| Rule-based logic solver | 100 | N/A | <50 µs | Polynomial-time verifier; resolves every instance |
| Best frontier language model | 65 | 23.5 | N/A | Best raw accuracy 65%; drops under rendering-robust eval |
| Frontier models (Level 2 range) | N/A | 7.8-23.5 | N/A | Rendering-robust Level 2 accuracy range over four renderings |
| DeFAb-Hard best model | 53.3 | N/A | N/A | 235-instance Level 3 variant; symbolic solver 100% |

Written by The Brieftide · Source: arXiv
