Radical AI Interpretability: a framework for reading model beliefs
Daniel A. Herrmann and Benjamin A. Levinstein tie radical-interpretation philosophy to mechanistic tools to define when attributions of belief and desire are justified.
TL;DR
- Daniel A. Herrmann and Benjamin A. Levinstein tie radical-interpretation philosophy to mechanistic tools to define when attributions of belief and desire are justified.
- Radical AI Interpretability, a manuscript by Daniel A. Herrmann and Benjamin A. Levinstein, was submitted to arXiv on 25 Jun 2026 as arXiv:2606.26523.
Radical AI Interpretability, a manuscript by Daniel A. Herrmann and Benjamin A. Levinstein, was submitted to arXiv on 25 Jun 2026 as arXiv:2606.26523. The draft, slated to appear as a Cambridge Element in the Philosophy of Artificial Intelligence, lays out a formal framework for reading AI systems as agents and testing when attributions of belief and desire are justified.
What is the framework?
The framework treats AI systems as agents and asks, "given the computational facts about a system, how do we solve for its beliefs, desires, and meanings?" It combines the philosophical tradition of radical interpretation with mechanistic interpretability tools, and proposes criteria for both representationalist and interpretationist approaches tied to concrete tests current tools can run.
The authors frame the core project as solving for attitudes from computational structure. They situate radical interpretation, a philosophical approach that infers meaning from behavior and context, alongside mechanistic interpretability methods that probe model internals. The paper offers operational criteria and links them to tests interpretability researchers can perform.
How would researchers test these attributions?
Researchers should not treat beliefs, desires, or propositional structure in isolation, because attributions are jointly constrained and methods that fix one while measuring the others inherit distortions. The paper ties each theoretical stance, representationalist and interpretationist, to tests that mechanistic tools can carry out.
Practically, Herrmann and Levinstein argue that mechanistic interpretability can measure both a model's attitudes and its propositional structure, and that those measurements must be cross-checked. A method that, for example, freezes a representationalist reading of a concept while probing desires will propagate whatever distortion the fixed representation introduced. The manuscript makes this holism central: attitudes constrain propositional structure, that structure constrains possible attitudes, and empirically tractable tests should target both.
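The joint-constraint point can be illustrated with a toy decision model. The sketch below is not taken from the manuscript: the betting setup, the function names accepts and implied_win_value, and all numbers are assumptions chosen for illustration. It shows two agents with different belief and desire pairs that behave identically, and how fixing the wrong belief shifts the desire an interpreter recovers.

```python
from fractions import Fraction


def accepts(belief, win_value, bet):
    """Return True if a toy expected-value maximizer takes the bet.

    belief:    probability the agent assigns to the proposition being true
    win_value: how much the agent values one unit of winnings (toy "desire")
    bet:       (win, loss) pair; win is gained if true, loss is paid if false
    """
    win, loss = bet
    return belief * win_value * win - (1 - belief) * loss > 0


def implied_win_value(assumed_belief, indifference_ratio):
    """Solve for the desire parameter after the interpreter fixes a belief.

    At indifference, belief * v * win == (1 - belief) * loss, so
    v == (loss / win) * (1 - belief) / belief.
    """
    return indifference_ratio * (1 - assumed_belief) / assumed_belief


# Hypothetical bets offered to the agent, as (win, loss) pairs.
BETS = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 7)]

# Two different (belief, desire) pairs. Both are illustrative assumptions.
agent_a = (Fraction(1, 2), Fraction(2))
agent_b = (Fraction(2, 3), Fraction(1))

choices_a = [accepts(*agent_a, bet) for bet in BETS]
choices_b = [accepts(*agent_b, bet) for bet in BETS]
print("same behavior:", choices_a == choices_b)

# The loss/win ratio at which agent_a is exactly indifferent.
ratio = agent_a[0] * agent_a[1] / (1 - agent_a[0])

# An interpreter who wrongly fixes agent_a's belief at 2/3 recovers a
# desire parameter that differs from agent_a's actual one.
recovered = implied_win_value(Fraction(2, 3), ratio)
print("true win_value:", agent_a[1], "recovered:", recovered)
```

In this toy setup, behavior depends only on a combination of the belief and desire parameters, so neither can be pinned down from choices alone. Any error in the parameter that is held fixed is carried directly into the estimate of the other.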
Why it matters
The paper makes the normative claim that reliable attributions of goals and deception are safety-relevant: understanding a system's goals or detecting deception lets deployers ground their trust or distrust of its behavior in evidence. By providing criteria that link philosophical accounts to mechanistic tests, the framework aims to make claims about beliefs and desires experimentally checkable rather than purely interpretive.
Herrmann and Levinstein place safety at the center of the interpretability problem. They state that the ability to read beliefs and desires off internals matters increasingly for safety, whether the objective is to understand a model's goals or to detect deception. That reframes interpretability from descriptive analysis to an empirical enterprise with concrete pass/fail tests.
What to watch
Look for the manuscript's publication as a Cambridge Element in the Philosophy of Artificial Intelligence and for follow-up work that operationalizes the specific tests the authors attach to representationalist and interpretationist criteria. The arXiv submission date, 25 Jun 2026, marks the start of public scrutiny and uptake.
If the community adopts the proposed criteria, the next signals will be papers demonstrating end-to-end tests: one set showing joint measurement of propositional structure and attitudes, and another exposing cases where piecemeal attribution produces detectable distortions. Those experiments will test whether the framework moves interpretability from metaphysics to measurable practice.
Acknowledgements and metadata: the arXiv entry lists the authors as Daniel A. Herrmann and Benjamin A. Levinstein and identifies the version as arXiv:2606.26523. The PDF and TeX source are available from the arXiv record, and the authors label the draft as forthcoming in Cambridge Elements in the Philosophy of Artificial Intelligence.
Written by The Brieftide · Source: arXiv