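Trust Between AI Agents: measuring formation, breakage and recovery
Yujiao Chen proposes a costly-verification trust metric; four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) cut verification by roughly 60-85% when paired with a reliable teammate.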
TL;DR
- Yujiao Chen proposes a costly-verification trust metric; four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) cut verification by roughly 60-85% when paired with a reliable teammate.
- The experiment frames trust as the observable drop in checking a teammate's work inside a cooperative survival game, comparing agents' behavior to a memoryless version of the same model.
- Verification is costly: in the game, checking a teammate consumes resources, while trusting a wrong answer can be fatal.
Yujiao Chen's paper Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems defines a behavioral way to measure trust between language-model agents using a costly verification task. The experiment frames trust as the observable drop in checking a teammate's work inside a cooperative survival game, comparing agents' behavior to a memoryless version of the same model.
How the measure works
The paper treats verification as costly: in the cooperative survival game, checking a teammate consumes resources while trusting a wrong answer can be fatal. Reduced verification relative to a memoryless baseline is the proposed signal of trust. The method therefore translates an internal disposition into an observable action: how often agents expend limited resources to verify teammates' outputs. Chen describes this as a "behavioral measure based on costly verification."
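This article does not reproduce the paper's exact formula. As a minimal sketch, assuming the trust signal is the relative drop in verification rate against the memoryless baseline, the "verification discount" could be computed as below. The function names and the per-round boolean log format are hypothetical, and the example numbers are illustrative rather than taken from the paper.

```python
from typing import Sequence


def verification_rate(checks: Sequence[bool]) -> float:
    """Fraction of decision rounds in which the agent paid to verify a teammate."""
    if not checks:
        raise ValueError("need at least one round")
    return sum(checks) / len(checks)


def verification_discount(with_memory: Sequence[bool], memoryless: Sequence[bool]) -> float:
    """Relative reduction in verification versus the memoryless baseline.

    0.0 means no change; 0.75 means the agent verifies 75% less often than its
    memoryless counterpart; negative values mean it verifies more.
    Hypothetical formula, not taken from the paper.
    """
    baseline = verification_rate(memoryless)
    if baseline == 0:
        raise ValueError("baseline never verifies; discount undefined")
    return 1.0 - verification_rate(with_memory) / baseline


if __name__ == "__main__":
    # Illustrative logs: the memoryless agent checks 20 of 20 rounds,
    # the agent with memory checks only 5 of 20.
    memoryless = [True] * 20
    with_memory = [True] * 5 + [False] * 15
    print(f"{verification_discount(with_memory, memoryless):.0%}")  # 75%
```

Normalizing by the memoryless baseline, rather than using the raw rate, keeps the metric from rewarding a model that simply verifies rarely regardless of its teammate's track record.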
Experiments use six frontier model snapshots. Four named snapshots, Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro, show large reductions in verification when paired with a consistently reliable teammate. Two smaller snapshots show little or no such adjustment. The paper treats failures and their sequencing as interventions to study trust breakage and recovery: failures reverse the verification discount, and the pattern of failures changes subsequent behavior.
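One way to set up the sequencing intervention, sketched under the assumption that failures are injected at chosen rounds of a fixed-length game, is to build a clustered schedule and a spread schedule with the same number of failures. `clustered_schedule` and `spread_schedule` are hypothetical helpers, not the paper's code.

```python
def clustered_schedule(n_rounds: int, n_failures: int, start: int) -> list[bool]:
    """Teammate fails on n_failures consecutive rounds beginning at `start`."""
    if start < 0 or start + n_failures > n_rounds:
        raise ValueError("cluster does not fit in the game")
    return [start <= r < start + n_failures for r in range(n_rounds)]


def spread_schedule(n_rounds: int, n_failures: int) -> list[bool]:
    """Same number of failures, spaced as evenly as possible across the game."""
    if not 0 < n_failures <= n_rounds:
        raise ValueError("need between 1 and n_rounds failures")
    step = n_rounds / n_failures
    # Place one failure near the middle of each equal-width segment.
    fail_rounds = {int(i * step + step / 2) for i in range(n_failures)}
    return [r in fail_rounds for r in range(n_rounds)]


if __name__ == "__main__":
    clustered = clustered_schedule(20, 4, start=8)
    spread = spread_schedule(20, 4)
    print([r for r, failed in enumerate(clustered) if failed])  # [8, 9, 10, 11]
    print([r for r, failed in enumerate(spread) if failed])     # [2, 7, 12, 17]
```

Holding the failure count fixed isolates timing as the only variable, which is what a clustered-versus-spread comparison requires.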
What the experiments found
Across the six snapshots, the paper reports several consistent patterns. When paired with a reliably performing teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) "reduce verification by roughly 60-85%." The two smaller snapshots show little or no comparable adjustment.
When failures occur the earlier discount disappears, but models diverge in how they allocate renewed scrutiny. Some models concentrate renewed checks on the culprit, while others respond by becoming more cautious toward the entire team. Recovery from failures is slower than initial trust formation. The paper also finds that clustered failures sustain suspicion far longer than the same number of failures spread out over time.
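To separate culprit-focused scrutiny from team-wide caution, one could measure what share of the post-failure rise in verification is aimed at the teammate who failed. The sketch below assumes per-teammate verification rates before and after a failure are available; the "targeting index" and its name are illustrative assumptions, not a measure defined in the paper.

```python
def targeting_index(before: dict[str, float], after: dict[str, float], culprit: str) -> float:
    """Share of the post-failure increase in verification aimed at the culprit.

    Near 1.0: renewed scrutiny concentrated on the culprit.
    Near 1 / team size: scrutiny raised uniformly across the team.
    Returns NaN if verification did not rise for anyone.
    """
    # Only count increases; a drop in checking one teammate is not "scrutiny".
    increases = {m: max(after[m] - before[m], 0.0) for m in before}
    total = sum(increases.values())
    if total == 0:
        return float("nan")
    return increases[culprit] / total


if __name__ == "__main__":
    before = {"A": 0.2, "B": 0.2, "C": 0.2}
    targeted = {"A": 0.9, "B": 0.2, "C": 0.2}   # only the culprit A is rechecked
    team_wide = {"A": 0.6, "B": 0.6, "C": 0.6}  # everyone is rechecked equally
    print(round(targeting_index(before, targeted, "A"), 2))   # 1.0
    print(round(targeting_index(before, team_wide, "A"), 2))  # 0.33
```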
The behavioral differences map to practical outcomes in the game. Models that form trust verify less, decide more quickly, and achieve higher payoffs. By contrast, persistent over-verification correlates with indecision rather than improved safety.
Why it matters
Chen's approach turns an internal, hard-to-measure psychological concept into a concrete operational metric that can be evaluated before deployment. The differences across snapshots imply that model choice and snapshot selection will affect how multi-agent systems allocate scarce verification resources, how quickly they reach decisions, and how they respond to teammate failures. The paper argues that governance should focus on calibration of trust dispositions rather than defaulting to maximal suspicion.
What to watch
Look for follow-up work that ties the behavioral metric to specific architectural or training differences between snapshots, and for studies that apply the costly-verification measure to larger multi-agent systems. Also watch for experiments that test whether calibration interventions reduce the recovery lag after failures or change whether models target scrutiny at culprits versus whole teams.
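Testing whether an intervention shortens recovery requires a concrete lag measure. A minimal sketch, assuming a per-round verification-rate series (for example, a rolling average): formation lag counts rounds from the start of the game until the rate first falls to a target level, and recovery lag counts rounds from the last failure until it falls back to that level. The threshold and the helper name are hypothetical, and the example series is illustrative, not data from the paper.

```python
from typing import Optional, Sequence


def rounds_until_below(rates: Sequence[float], start: int, threshold: float) -> Optional[int]:
    """Rounds after `start` until the verification rate first falls to `threshold` or below."""
    for offset, rate in enumerate(rates[start:]):
        if rate <= threshold:
            return offset
    return None  # the rate never dropped that low within the game


if __name__ == "__main__":
    # Illustrative series: trust forms, a failure at round 5 spikes checking.
    rates = [1.0, 0.8, 0.5, 0.3, 0.25, 0.9, 0.85, 0.7, 0.6, 0.45, 0.3]
    print(rounds_until_below(rates, start=0, threshold=0.3))  # 3 (formation lag)
    print(rounds_until_below(rates, start=5, threshold=0.3))  # 5 (recovery lag)
```

Using the same threshold for both lags makes them directly comparable, so "recovery slower than formation" becomes a simple inequality between the two counts.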
The paper, by Yujiao Chen, was submitted to arXiv on 12 Jun 2026 as arXiv:2606.14923 [cs.AI] (DOI: https://doi.org/10.48550/arXiv.2606.14923). The arXiv entry also links to code, data, and media.
| Snapshot | With a reliable teammate | After failures | Recovery | Outcomes and implications |
|---|---|---|---|---|
| Claude Opus 4.6 | Reduce verification by roughly 60-85% | Varies: some concentrate scrutiny on culprit, others widen caution | Recovery slower than formation | Verify less, decide more quickly, achieve higher payoffs |
| Claude Sonnet 4.6 | Reduce verification by roughly 60-85% | Varies: some concentrate scrutiny on culprit, others widen caution | Recovery slower than formation | Verify less, decide more quickly, achieve higher payoffs |
| GPT-5.1 | Reduce verification by roughly 60-85% | Varies: some concentrate scrutiny on culprit, others widen caution | Recovery slower than formation | Verify less, decide more quickly, achieve higher payoffs |
| Gemini 3.1 Pro | Reduce verification by roughly 60-85% | Varies: some concentrate scrutiny on culprit, others widen caution | Recovery slower than formation | Verify less, decide more quickly, achieve higher payoffs |
| Two smaller snapshots | Little or no such adjustment | Not specified; models differ in responses | Recovery slower than formation | Persistent over-verification associated with indecision rather than safety |
| General findings | Reduced verification relative to memoryless baseline measures trust | Failures reverse the verification discount; clustered failures sustain suspicion longer | Recovery is slower than formation | Calibration, not maximal suspicion, recommended for governance |
Written by The Brieftide · Source: arXiv