Theoria, a verification architecture on arXiv, certifies 105 of 185 HLE-Verified Gold problems
Theoria rewrites candidate solutions into typed state transitions with explicit justifications and certifies 105 of 185 HLE-Verified Gold problems.
TL;DR
- Theoria rewrites candidate solutions into typed state transitions with explicit justifications and certifies 105 of 185 HLE-Verified Gold problems.
- Every certification produces a human-readable proof trace in which each step can be independently challenged.
- On GPQA Diamond (n = 65), Theoria's certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
Theoria, a verification architecture described in an arXiv paper, rewrites candidate AI solutions into sequences of typed state transitions, each licensed by an explicit justification. It certifies 105 of 185 HLE-Verified Gold text-only expert problems at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human-readable proof trace in which each step can be independently challenged.
What is Theoria and how does it verify solutions?
Theoria verifies answers by converting a candidate solution into a sequence of typed state transitions and requiring an explicit justification for every change. Its foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for. Each transition can be licensed by a citation, a computation, or a problem-given fact, and each is independently auditable, so hidden premises surface as unlicensed mutations rather than passing silently.
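The completeness-of-change idea can be illustrated with a minimal Python sketch. This is not Theoria's implementation: the brief does not give the paper's schema, so `Justification`, `Transition`, `changed_keys`, and `unlicensed_mutations` are hypothetical names, and modelling a proof state as a flat dictionary is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema for illustration; not taken from the Theoria paper.
JustificationKind = Literal["citation", "computation", "given"]


@dataclass(frozen=True)
class Justification:
    kind: JustificationKind
    licenses: frozenset  # state keys this justification accounts for
    note: str = ""


@dataclass(frozen=True)
class Transition:
    before: dict
    after: dict
    justifications: tuple


def changed_keys(before: dict, after: dict) -> set:
    """Return keys that were added, removed, or given a new value."""
    missing = object()  # sentinel so an absent key never equals a real value
    return {
        key
        for key in before.keys() | after.keys()
        if before.get(key, missing) != after.get(key, missing)
    }


def unlicensed_mutations(transition: Transition) -> set:
    """Return changed keys that no justification claims to license."""
    licensed = set()
    for justification in transition.justifications:
        licensed |= justification.licenses
    return changed_keys(transition.before, transition.after) - licensed


# The step justifies the parity change but silently introduces a domain
# restriction, which plays the role of a hidden premise.
step = Transition(
    before={"n": 12, "parity": "unknown"},
    after={"n": 12, "parity": "even", "domain": "positive integers"},
    justifications=(
        Justification("computation", frozenset({"parity"}), "12 % 2 == 0"),
    ),
)
print(unlicensed_mutations(step))  # {'domain'}
```

In this toy model a transition passes the check only when the returned set is empty. Checking that each justification is itself valid (that the citation exists, that the computation is correct) is a separate step the sketch does not attempt.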
Theoria's design sits between formal proof assistants and scalar LLM judges: proof assistants give formal certainty but cannot cover most problems, while scalar LLM judges offer coverage but produce opaque scores. Theoria keeps coverage by operating over informal reasoning states while enforcing an auditable structure: typed states, explicit justifications, and stepwise traces that humans can read and challenge.
How did Theoria perform compared with holistic LLM judges?
On HLE-Verified Gold (185 text-only expert problems), Theoria certified 105 problems at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]); holistic LLM judges achieve comparable precision at matched coverage but fail on different problems, with Jaccard overlap between their sets of accepted problems ranging from 0.14 to 0.36. On GPQA Diamond (n = 65), Theoria's certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
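For readers unfamiliar with the Wilson score interval quoted above, here is a short Python sketch of the standard formula. The count of 96 correct certifications is an assumption inferred from 91.4% of 105; the brief does not state it directly, and `wilson_interval` is a name chosen for this example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 96 of the 105 certified answers were correct (96 / 105 = 91.4%).
low, high = wilson_interval(96, 105)
print(f"{low:.1%}, {high:.1%}")  # 84.5%, 95.4%
```

Under that assumed count, the computed bounds match the interval reported for HLE-Verified Gold.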
Theoria shows a clearer advantage on adversarial inputs: on 95 adversarially poisoned proofs across 15 domains, structured judges caught 94.7% while holistic judging caught 83.2% (p = 0.0017). The overall gap of 11.5 percentage points concentrates in specific error classes predicted by Theoria's formal analysis: hidden premises (90.6% detection vs. 62.5% for holistic judges, a 28-percentage-point difference) and fabricated citations (100% detection vs. 90%). The two approaches performed identically on arithmetic and theorem-misapplication errors, where Theoria's analysis predicted no advantage.
Those numbers imply complementarity rather than outright dominance: holistic judges and Theoria reach similar aggregate precision on some benchmarks but diverge strongly on which problems each accepts, reflected in the low Jaccard scores.
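Jaccard overlap is the size of the intersection of two sets divided by the size of their union. The sketch below shows the calculation in Python on made-up problem IDs; the sets are hypothetical and are not data from the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a & b| / |a | b|; returns 0.0 for two empty sets by convention."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical accepted-problem sets, for illustration only.
structured_accepts = {"p1", "p2", "p3", "p4", "p5"}
holistic_accepts = {"p4", "p5", "p6", "p7", "p8", "p9"}

# 2 shared problems out of 9 distinct ones.
print(round(jaccard(structured_accepts, holistic_accepts), 2))  # 0.22
```

A value near 0.2 means the two judges agree on only about a fifth of the problems that either one accepts, which is the kind of divergence the reported 0.14 to 0.36 range describes.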
Why it matters
Theoria narrows the gap between formal verification and opaque LLM judging by producing auditable, stepwise proof traces while keeping broad coverage. That matters for trust and red-team evaluations: the architecture specifically surfaces hidden premises and fabricated citations, the failure modes where structured, stepwise verification shows measurable gains. The low overlap with holistic judges suggests that combining structured traces with holistic scoring could improve overall coverage and safety by covering different weaknesses.
What to watch
Look for reproduction of these results on additional benchmarks beyond HLE-Verified Gold and GPQA Diamond, and for whether the low Jaccard overlap (0.14 to 0.36) between accepted problems persists across more datasets. The next confirmatory signals will be wider evaluations that break down where the 11.5 percentage-point gap concentrates and whether ensembles of structured and holistic judges reduce complementary failures.
| Benchmark or error class | Theoria (structured judging) | Holistic LLM judges |
|---|---|---|
| HLE-Verified Gold (n = 185) | 105 certified; 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]) | Comparable precision at matched coverage; Jaccard overlap 0.14 to 0.36 |
| GPQA Diamond (n = 65) | 97.1% certified precision (Wilson CI [85.1%, 99.5%]) | Not specified |
| Adversarial poisoned proofs (n = 95, 15 domains) | 94.7% caught | 83.2% caught (p = 0.0017) |
| Hidden premises detection | 90.6% | 62.5% (28 pp difference) |
| Fabricated citations detection | 100% | 90% |
Written by The Brieftide · Source: arXiv