Agentic Bootstrap and the m-value: exposing forking analysis paths
AI agents reproduced 72% of a human ideological gap; the paper introduces Agentic Bootstrap and the m-value to map analysis variation.
TL;DR
- AI agents reproduced 72% of a human ideological gap; the paper introduces Agentic Bootstrap and the m-value to map analysis variation.
- The paper further reports that 86% of opposing AI-generated analyses passed independent AI review and 78% passed majority human expert review.
- Applied to the human immigration study, the authors found that 13.5% of reported human analyses fell in the most extreme 5% of the analysis space, meaning those reports had m-values below 0.05.
"The Agentic Garden of Forking Paths", a paper by Jiacheng Miao, Jonathan K Pritchard and James Zou submitted to arXiv on 1 Jul 2026, shows that AI agents can capture much of the analytical variation among human researchers and make those hidden forking paths observable. The authors introduce the m-value (short for multiverse value) and a procedure called Agentic Bootstrap, which uses AI agents to sample plausible analysis paths and estimate how extreme a reported claim is within the space of defensible analyses.
What did the authors measure and find?
They measured how much of the analytical divergence among human researchers AI agents can reproduce. In a study where 42 human research teams analyzed the same immigration dataset, AI agents reproduced 72% of the human ideological gap in reported effect estimates. The work also spanned four high-stakes domains, where the authors observed that assigning different personas to AI agents was sufficient for the agents to report divergent, often opposing, conclusions from the same data and question, with findings systematically aligned with each persona's beliefs. The paper further reports that 86% of opposing AI-generated analyses passed independent AI review and 78% passed majority human expert review.
The authors interpret these numbers as evidence that different, methodologically defensible analytical choices can yield opposing conclusions and that AI agents capture much of that variation while making the space of choices explicit.
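One way to read a figure like "72% of the human ideological gap" is as a ratio of two gaps in effect estimates. The sketch below is an illustration under that assumption, not the paper's computation: the function names, the use of group means, and the toy numbers are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def ideological_gap(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Difference in mean reported effect between two groups of analysts.

    Assumption: the gap is measured as a difference of group means.
    The source does not state the exact definition.
    """
    return mean(group_a) - mean(group_b)


def gap_reproduced(
    human_a: Sequence[float],
    human_b: Sequence[float],
    agent_a: Sequence[float],
    agent_b: Sequence[float],
) -> float:
    """Share of the human gap that the persona-assigned agents reproduce."""
    human_gap = ideological_gap(human_a, human_b)
    if human_gap == 0:
        raise ValueError("human gap is zero; the ratio is undefined")
    return ideological_gap(agent_a, agent_b) / human_gap


# Toy numbers, made up purely for illustration.
share = gap_reproduced(
    human_a=[2, 4], human_b=[0, 2],   # human gap: 3 - 1 = 2
    agent_a=[1, 3], agent_b=[0, 2],   # agent gap: 2 - 1 = 1
)
print(share)  # 0.5
```

Under this reading, a value of 0.72 would mean the gap between agent personas is 72% as large as the gap between the corresponding human groups.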
How do the m-value and Agentic Bootstrap work?
The m-value is the probability that an analysis path would produce a claim at least as extreme as the reported one, and Agentic Bootstrap estimates this probability by using AI agents to sample plausible analysis paths. In practice the authors use AI agents with assigned personas to generate many plausible analytic pipelines, treat that ensemble as a multiverse of defensible analyses, and compute how frequently paths in that ensemble produce effects as extreme as the reported claim.
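The definition above can be sketched as an empirical proportion over sampled analysis paths. This is a minimal illustration, not the authors' implementation: it assumes each sampled path yields a single numeric effect estimate, and it assumes "at least as extreme" means at least as far from the median of the sampled effects, in either direction. The function name `m_value` and its parameters are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def m_value(
    sampled_effects: Sequence[float],
    reported_effect: float,
    center: Optional[float] = None,
) -> float:
    """Estimate the share of sampled analysis paths whose effect is at
    least as extreme as the reported one.

    Assumption: extremeness is the absolute distance from `center`,
    which defaults to the median of the sampled effects.
    """
    if len(sampled_effects) == 0:
        raise ValueError("need at least one sampled analysis path")
    if center is None:
        center = median(sampled_effects)
    reported_distance = abs(reported_effect - center)
    # Count the paths that land at least as far from the center.
    hits = sum(
        1 for effect in sampled_effects
        if abs(effect - center) >= reported_distance
    )
    return hits / len(sampled_effects)


# Toy ensemble of five sampled paths, made up for illustration.
paths = [-0.2, -0.1, 0.0, 0.1, 0.2]
print(m_value(paths, reported_effect=0.2))  # 0.4
```

In the toy ensemble, two of the five paths are at least as far from the median as the reported effect of 0.2, so the estimate is 0.4. A one-sided definition, counting only paths that are extreme in the direction of the reported claim, would be an equally plausible reading of the text.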
Applied to the human immigration study, the authors found that 13.5% of reported human analyses fell in the most extreme 5% of the analysis space, meaning those reports had m-values below 0.05. The paper positions Agentic Bootstrap as a way to make the distribution of plausible analyses observable and to convert that distribution into a credibility criterion for scientific claims.
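The 13.5% figure is a share of reported analyses whose m-value falls below 0.05. A minimal sketch of that tally follows, assuming the m-values have already been computed for each report; the function name and the toy values are hypothetical. As a point of comparison, if reports behaved like random draws from the analysis space, roughly 5% of them would be expected to fall in the most extreme 5%.

```python
from typing import Sequence


def share_below(m_values: Sequence[float], threshold: float = 0.05) -> float:
    """Fraction of reported analyses whose m-value is below `threshold`."""
    if len(m_values) == 0:
        raise ValueError("need at least one m-value")
    return sum(1 for m in m_values if m < threshold) / len(m_values)


# Toy m-values for four hypothetical reports, made up for illustration.
reports = [0.01, 0.20, 0.04, 0.50]
print(share_below(reports))  # 0.5
```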
Why does this matter?
If a nontrivial share of published analyses sits in the extreme tail of an otherwise defensible analysis space, then selective exploration and reporting can drive strong claims even when every step looks methodologically plausible. The authors argue that the central problem may often not be flawed analysis but the selective choice among a large set of defensible analyses. AI agents lower the cost of exploring that set and therefore can amplify the problem by making selective reporting inexpensive and scalable. The reported figures underline two tensions: AI agents replicate human ideological gaps at scale, and many divergent analyses still pass independent AI and majority human expert review, which makes detecting problematic reports by inspection difficult.
What are the limits and what did the authors not claim?
The paper demonstrates alignment between persona-driven agent outputs and human ideological variation, and it proposes a statistical object, the m-value, plus a sampling procedure, Agentic Bootstrap. The authors do not offer a fixed cutoff that settles scientific disputes; they recommend evaluating evidence both by a single reported analysis and by that analysis's position within the distribution of reasonably possible analyses.
What to watch
Watch for independent replications of the paper's key numbers: the 72% reproduction of the human ideological gap, the 13.5% of human analyses with m<0.05, and the reported review pass rates of 86% by AI review and 78% by majority human expert review. Also watch for early adopters of Agentic Bootstrap or m-value reporting in domains where analytical flexibility has major policy or clinical consequences.
The arXiv entry for the paper is available at https://doi.org/10.48550/arXiv.2607.01507.
Written by The Brieftide · Source: arXiv