VeryTrace: Verifying reasoning traces with a compilable DSL
Zero-shot verification-and-repair framework that formalizes traces into a compilable DSL and uses deterministic checks plus targeted LLM audits.
TL;DR
- Zero-shot verification-and-repair framework that formalizes traces into a compilable DSL and uses deterministic checks plus targeted LLM audits.
- The paper, by Ninghan Zhong, Ahmet Ege Tanriverdi, Kaan Kale and Sriram Vishwanath, was accepted at the LM4Plan Workshop @ ICML 2026.
- The DSL enforces an explicit trace structure so each step’s inputs and outputs are visible to automated checks.
VeryTrace is a zero-shot verification-and-repair framework that formalizes natural-language reasoning traces into a structured, compilable representation. The paper was submitted to arXiv on 23 Jun 2026 (arXiv:2606.24124) and accepted at the LM4Plan Workshop @ ICML 2026; its authors are Ninghan Zhong, Ahmet Ege Tanriverdi, Kaan Kale and Sriram Vishwanath.
What is VeryTrace and how does it work?
VeryTrace converts free-form Chain-of-Thought traces into a Domain-Specific Language that makes dependencies explicit, mechanizes quantitative content as executable expressions and structures semantic inferences via deduction schemas. The system pairs that compilable DSL with a hybrid verifier that runs deterministic checks for computational correctness, dependency resolution and constraint satisfaction, and uses targeted LLM audits for semantic judgments that cannot be mechanized.
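To make "explicit dependencies" concrete, here is a minimal sketch of what a structured trace step could look like. It is not the paper's DSL: the `Step` class, its field names, the `threshold_comparison` schema label and the `unresolved_dependencies` helper are all hypothetical, chosen only to show how naming each step's inputs lets a program spot a reference to something that was never derived.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One hypothetical trace step; field names are illustrative, not the paper's DSL."""
    name: str                                      # identifier later steps can refer to
    deps: List[str] = field(default_factory=list)  # names of earlier steps used as inputs
    expr: Optional[str] = None                     # executable expression, for quantitative content
    schema: Optional[str] = None                   # deduction-schema label, for semantic inferences


def unresolved_dependencies(trace: List[Step]) -> List[Tuple[str, str]]:
    """Return (step, missing_dep) pairs where a step uses a name not defined earlier."""
    defined = set()
    problems = []
    for step in trace:
        for dep in step.deps:
            if dep not in defined:
                problems.append((step.name, dep))
        defined.add(step.name)
    return problems


trace = [
    Step("speed", expr="120 / 2"),
    Step("distance", deps=["speed"], expr="speed * 3"),
    # "fuel" is never derived anywhere in the trace, so this step is flagged.
    Step("conclusion", deps=["distance", "fuel"], schema="threshold_comparison"),
]
print(unresolved_dependencies(trace))  # [('conclusion', 'fuel')]
```

In free-form prose the missing "fuel" value would be easy to overlook; once dependencies are declared, finding it is a simple lookup.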
The DSL enforces an explicit trace structure so each step’s inputs and outputs are visible to automated checks. Deterministic components evaluate executable expressions and constraints, while the LLM audits intervene only when a judgment requires world knowledge or other non-mechanizable inference. This design enables the framework to locate errors at the step level and to attempt automated repairs to incorrect steps.
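The division of labor described above can be sketched as a small routing loop. This is an illustration under assumptions, not the authors' implementation: the dictionary step format, the `evaluate` and `verify` functions, and the `audit` callback (a stand-in for a targeted LLM call) are all hypothetical.

```python
import ast
import operator

# Arithmetic operators the toy evaluator accepts; anything else is rejected.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expr, env):
    """Evaluate a small arithmetic expression over named earlier results."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


def verify(trace, audit):
    """Return the name of the first failing step, or None if every step passes.

    Steps with an "expr" are checked deterministically; steps with only a
    "claim" are handed to `audit`, a placeholder for a targeted LLM audit.
    """
    env = {}
    for step in trace:
        if "expr" in step:
            if abs(evaluate(step["expr"], env) - step["value"]) > 1e-9:
                return step["name"]
            env[step["name"]] = step["value"]
        elif not audit(step["claim"]):
            return step["name"]
    return None


trace = [
    {"name": "a", "expr": "12 * 4", "value": 48},
    {"name": "b", "expr": "a + 10", "value": 59},  # wrong: 48 + 10 is 58
    {"name": "c", "claim": "b is large enough to cover the cost"},
]
print(verify(trace, audit=lambda claim: True))  # b
```

Only step `c` would ever reach the auditor here; the arithmetic slip in `b` is caught without any model call, which is the point of reserving LLM judgment for steps that cannot be mechanized.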
How did VeryTrace perform on benchmarks?
VeryTrace improved accuracy over zero-shot baselines on state-of-the-art LLMs across three diverse domains: competition mathematics (AIME 2025), robotics planning (LLM-BabyBench) and kinship reasoning (CLUTRR). The paper emphasizes these three datasets as its evaluation suite and highlights that the improvements come without domain-specific training or in-context examples.
The reported gains come from replacing unconstrained natural-language traces with a compilable formalism that admits deterministic verification, then supplementing those checks with targeted LLM audits for the remaining semantic gaps. The authors attribute step-level error localization and repair to this combination of mechanized checks and selective LLM scrutiny.
Why does this approach matter?
Formalizing reasoning traces addresses a core weakness of Chain-of-Thought prompting: errors in early steps can silently propagate to confident but incorrect conclusions. VeryTrace shifts some of the verification burden away from brittle natural-language interpretation into executable structure, letting deterministic procedures catch arithmetic mistakes and broken dependencies while reserving LLM judgment for genuinely semantic decisions.
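A small worked example shows how an early slip propagates and why step-level checks localize it. The trace, the helper names `locally_wrong` and `repair`, and the repair strategy itself (plain recomputation from corrected upstream values) are assumptions made for illustration; the source does not describe VeryTrace's repair procedure at this level of detail.

```python
# Each step: (name, function of earlier values, value claimed in the trace).
trace = [
    ("total", lambda v: 17 * 6, 112),                # slip: 17 * 6 is 102
    ("per_person", lambda v: v["total"] / 4, 28.0),  # consistent with the wrong total
    ("with_tip", lambda v: v["per_person"] + 2, 30.0),
]


def locally_wrong(trace):
    """Names of steps whose claimed value disagrees with their own claimed inputs."""
    claimed, bad = {}, []
    for name, fn, value in trace:
        if fn(claimed) != value:
            bad.append(name)
        claimed[name] = value
    return bad


def repair(trace):
    """Recompute every step in order from corrected upstream values."""
    fixed = {}
    for name, fn, _ in trace:
        fixed[name] = fn(fixed)
    return fixed


print(locally_wrong(trace))       # ['total']
print(repair(trace)["with_tip"])  # 27.5
```

The two downstream steps are internally consistent, so a reader checking only the last line would see nothing wrong with the answer 30.0. Checking each step against its own inputs flags `total` alone, and recomputing from there changes the final answer to 27.5.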
That split reduces blind reliance on post-hoc LLM judgment across all steps and, according to the paper, produces more precise and generalizable improvements across multiple domains without extra training. For practitioners, that means a pathway to more auditable multi-step reasoning when exactness and step-level accountability matter.
What to watch
Look for the LM4Plan Workshop @ ICML 2026 presentation and for any code, data or demos linked from the arXiv entry, whose paper page includes a "Code, Data and Media Associated with this Article" section. Two things will determine how quickly adopters can evaluate and extend the work: how the approach scales to broader, noisier real-world traces, and whether community code appears alongside the workshop presentation.
References and provenance: the paper is arXiv:2606.24124 [cs.AI], "VeryTrace: Verifying Reasoning Traces through Compilable Formalism and Structured Verification," submitted 23 Jun 2026, authors Ninghan Zhong, Ahmet Ege Tanriverdi, Kaan Kale and Sriram Vishwanath, accepted at LM4Plan Workshop @ ICML 2026.
Written by The Brieftide · Source: arXiv