AI Safety · June 16, 2026 · 5 min read

Trust Between AI Agents: measuring formation, breakage

Yujiao Chen proposes a costly-verification trust metric; four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) cut verification by roughly 60-85% when paired with a reliable teammate.

The Brieftide · June 16, 2026

TL;DR

01 Yujiao Chen proposes a costly-verification trust metric; four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) cut verification by roughly 60-85% when paired with a reliable teammate.
02 The experiment frames trust as the observable drop in checking a teammate's work inside a cooperative survival game, comparing agents' behavior to a memoryless version of the same model.
03 The paper treats verification as costly: in the cooperative survival game, checking a teammate consumes resources while trusting a wrong answer can be fatal.

Yujiao Chen's paper, "Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems," defines a behavioral way to measure trust between language-model agents using a costly verification task. The experiment frames trust as the observable drop in how often an agent checks a teammate's work inside a cooperative survival game, relative to a memoryless version of the same model.

How the measure works

The paper treats verification as costly: in the cooperative survival game, checking a teammate consumes resources while trusting a wrong answer can be fatal. Reduced verification relative to a memoryless baseline is the proposed signal of trust. The method therefore translates an internal disposition into an observable action: how often agents expend limited resources to verify teammates' outputs. Chen describes this as a "behavioral measure based on costly verification."
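The article does not give the paper's exact formula, so the following is only a minimal sketch of one plausible reading: trust as the relative drop in verification rate against the memoryless baseline. The function names and the sample logs are hypothetical and are not taken from the paper.

```python
def verification_rate(checks):
    """Fraction of rounds in which the agent paid to verify a teammate.

    `checks` is a list of booleans, one per round (True = verified).
    """
    if not checks:
        raise ValueError("need at least one round")
    return sum(checks) / len(checks)


def trust_discount(with_memory, memoryless):
    """Relative drop in verification versus the memoryless baseline.

    0.0 means no change; 1.0 means the agent stopped verifying entirely.
    This definition is an assumption made for illustration.
    """
    baseline = verification_rate(memoryless)
    if baseline == 0:
        raise ValueError("baseline never verifies; discount is undefined")
    return 1.0 - verification_rate(with_memory) / baseline


# Invented logs: the baseline checks 8 of 10 rounds, the agent with
# memory of a reliable teammate checks 2 of 10.
memoryless = [True] * 8 + [False] * 2
with_memory = [True] * 2 + [False] * 8
discount = trust_discount(with_memory, memoryless)
print(f"trust discount: {discount:.2f}")
```

Under this reading, a reported reduction of "roughly 60-85%" would correspond to a discount between about 0.60 and 0.85.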

Experiments use six frontier model snapshots. Four named snapshots, Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro, show large reductions in verification when paired with a consistently reliable teammate. Two smaller snapshots show little or no such adjustment. The paper treats failures and their sequencing as interventions to study trust breakage and recovery: failures reverse the verification discount, and the pattern of failures changes subsequent behavior.

What the experiments found

Across the six snapshots the paper reports several consistent patterns. When paired with a reliably performing teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) "reduce verification by roughly 60-85%." The two smaller snapshots show little or no adjustment.

When failures occur the earlier discount disappears, but models diverge in how they allocate renewed scrutiny. Some models concentrate renewed checks on the culprit, while others respond by becoming more cautious toward the entire team. Recovery from failures is slower than initial trust formation. The paper also finds that clustered failures sustain suspicion far longer than the same number of failures spread out over time.
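Two of these observations lend themselves to simple measurements. The sketch below shows one hypothetical way to quantify them: the share of post-failure checks aimed at the teammate that failed, and the number of rounds before verification returns to its pre-failure level. The function names, the tolerance parameter, and the sample numbers are assumptions for illustration, not the paper's definitions or data.

```python
def culprit_share(checks_by_teammate, culprit):
    """Share of post-failure checks directed at the teammate that failed.

    Near 1.0 suggests targeted scrutiny; near 1/N (for N teammates)
    suggests caution spread across the whole team.
    """
    total = sum(checks_by_teammate.values())
    if total == 0:
        return 0.0
    return checks_by_teammate.get(culprit, 0) / total


def recovery_lag(rates_after_failure, pre_failure_rate, tolerance=0.05):
    """Rounds until the verification rate is back within `tolerance`
    of its pre-failure level. Returns None if that never happens
    inside the observed window.
    """
    for rounds_elapsed, rate in enumerate(rates_after_failure):
        if rate <= pre_failure_rate + tolerance:
            return rounds_elapsed
    return None


# Invented example: teammate "A" failed.
targeted = culprit_share({"A": 9, "B": 1, "C": 0}, culprit="A")
team_wide = culprit_share({"A": 4, "B": 3, "C": 3}, culprit="A")
lag = recovery_lag([0.9, 0.7, 0.5, 0.3, 0.2], pre_failure_rate=0.2)
print(targeted, team_wide, lag)
```

Running the same lag measurement on a clustered failure schedule and on a spread-out schedule with the same number of failures would be one way to compare how long suspicion persists in each case.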

The behavioral differences map to practical outcomes in the game. Models that form trust verify less, decide more quickly, and achieve higher payoffs. By contrast, persistent over-verification correlates with indecision rather than improved safety.
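One way to see why blanket verification can cost more than it saves is a standard expected-value comparison. This is a generic illustration, not the paper's model: the function name, the probabilities, and the cost figures are all made up.

```python
def should_verify(p_error, loss_if_wrong, check_cost):
    """Verify only when the expected loss from trusting a teammate
    exceeds the cost of checking. All inputs are assumed to be in the
    same (hypothetical) resource units.
    """
    return p_error * loss_if_wrong > check_cost


# A teammate with a 2% error rate is not worth checking at this cost;
# one with a 20% error rate is.
reliable = should_verify(p_error=0.02, loss_if_wrong=100.0, check_cost=5.0)
unreliable = should_verify(p_error=0.20, loss_if_wrong=100.0, check_cost=5.0)
print(reliable, unreliable)
```

Under these assumptions, an agent that keeps checking a teammate with a very low error rate pays the check cost every round for little expected benefit.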

Why it matters

Chen's approach turns an internal, hard-to-measure psychological concept into a concrete operational metric that can be evaluated before deployment. The differences across snapshots imply that model choice and snapshot selection will affect how multi-agent systems allocate scarce verification resources, how quickly they reach decisions, and how they respond to teammate failures. The paper argues that governance should focus on calibration of trust dispositions rather than defaulting to maximal suspicion.

What to watch

Look for follow-up work that ties the behavioral metric to specific architectural or training differences between snapshots, and for studies that apply the costly-verification measure to larger multi-agent systems. Also watch for experiments that test whether calibration interventions reduce the recovery lag after failures or change whether models target scrutiny at culprits versus whole teams.

Additional details: the paper was submitted to arXiv on 12 Jun 2026 under arXiv:2606.14923 [cs.AI], authored by Yujiao Chen. It includes the DOI link https://doi.org/10.48550/arXiv.2606.14923 and provides code, data, and media links in the arXiv entry.

Behavioral outcomes across six model snapshots

Item	Verification with a reliable teammate	Response to failures	Recovery	Outcomes in the game
Claude Opus 4.6	Reduce verification by roughly 60-85%	Varies: some concentrate scrutiny on culprit, others widen caution	Recovery slower than formation	Verify less, decide more quickly, achieve higher payoffs
Claude Sonnet 4.6	Reduce verification by roughly 60-85%	Varies: some concentrate scrutiny on culprit, others widen caution	Recovery slower than formation	Verify less, decide more quickly, achieve higher payoffs
GPT-5.1	Reduce verification by roughly 60-85%	Varies: some concentrate scrutiny on culprit, others widen caution	Recovery slower than formation	Verify less, decide more quickly, achieve higher payoffs
Gemini 3.1 Pro	Reduce verification by roughly 60-85%	Varies: some concentrate scrutiny on culprit, others widen caution	Recovery slower than formation	Verify less, decide more quickly, achieve higher payoffs
Two smaller snapshots	Little or no such adjustment	Not specified; models differ in responses	Recovery slower than formation	Persistent over-verification associated with indecision rather than safety
General findings	Reduced verification relative to memoryless baseline measures trust	Failures reverse the verification discount; clustered failures sustain suspicion longer	Recovery is slower than formation	Calibration, not maximal suspicion, recommended for governance

Written by The Brieftide · Source: arXiv
