Multi-Agent LLM Teams Hold Experts Back, ICML: 41.1% loss
An ICML paper finds that self-organizing multi-agent LLM teams underperform their best member.
TL;DR
- Self-organizing multi-agent LLM teams underperform their best member, according to an ICML paper.
- The authors (Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao and James Zou) present experiments across human-inspired and frontier ML benchmarks.
- The worst-case drop reported was a 41.1% performance loss on ML benchmarks, and the paper frames this as a persistent gap between team output and the best individual member.
The ICML paper "Multi-Agent Teams Hold Experts Back," published July 2026, finds that self-organizing multi-agent LLM systems routinely fail to match their best member, sometimes incurring performance losses of up to 41.1% on ML benchmarks. The authors — Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao and James Zou — present experiments across human-inspired and frontier ML benchmarks.
What did the researchers test and find?
Across a set of human-inspired and frontier machine learning benchmarks, self-organizing LLM teams consistently failed to match their expert agent, even when team members were explicitly told who the expert was. The worst-case drop reported was a 41.1% performance loss on ML benchmarks, and the paper frames this as a persistent gap between team output and the best individual member.
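To make the headline number concrete, here is a minimal sketch of how a team's shortfall against its expert could be expressed. The formula (expert score minus team score, divided by expert score) is an assumption for illustration; this brief does not state the paper's exact metric, and the scores below are hypothetical.

```python
def relative_loss(expert_score: float, team_score: float) -> float:
    """Fraction of the expert's score that the team fails to reach.

    Assumed definition (not taken from the paper):
    (expert - team) / expert. Positive means the team underperforms.
    """
    if expert_score <= 0:
        raise ValueError("expert_score must be positive")
    return (expert_score - team_score) / expert_score


def has_strong_synergy(expert_score: float, team_score: float) -> bool:
    """Strong synergy: the team meets or exceeds its best member."""
    return team_score >= expert_score


# Hypothetical scores, chosen only to illustrate the arithmetic.
expert, team = 0.80, 0.60
print(f"relative loss: {relative_loss(expert, team):.1%}")
print(f"strong synergy: {has_strong_synergy(expert, team)}")
```

Under this assumed definition, a 41.1% loss would mean the team reached only 58.9% of what its expert scored alone.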
The study contrasts these LLM teams with expectations drawn from organizational psychology about human teams, where strong synergy can allow a team to meet or exceed its top performer. Instead, the paper documents that unconstrained coordination among agents does not spontaneously produce that kind of synergy. Experiments let agents interact freely rather than follow fixed roles, workflows, or aggregation rules, isolating the dynamics of self-organization.
Why do LLM teams fail to use experts effectively?
Expert leveraging, rather than expert identification, is the primary bottleneck: teams tend to average expert and non-expert views rather than weight expertise appropriately. The authors identify a conversational tendency they call "integrative compromise," where discussions pull toward an average of opinions, and they show that this behavior increases with team size and correlates negatively with task performance.
The paper separates two failure modes. Identification failures would mean teams do not recognise who the expert is. The authors report that identification was not the main issue. Instead, teams recognised expertise but did not leverage it, producing outcomes worse than simply following the expert. The integrative compromise pattern explains why adding more agents can amplify the problem rather than dilute it.
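A toy numeric model can illustrate why averaging hurts more as teams grow. This is a sketch under strong assumptions, not the paper's experimental setup: answers are single numbers, "integrative compromise" is modeled as an unweighted mean, and all values (`TRUTH`, `EXPERT_ESTIMATE`, `NOVICE_ESTIMATE`) are hypothetical.

```python
from statistics import mean

TRUTH = 100.0            # hypothetical ground-truth answer
EXPERT_ESTIMATE = 98.0   # hypothetical expert, close to the truth
NOVICE_ESTIMATE = 60.0   # hypothetical non-expert, far from the truth


def team_estimates(team_size: int) -> list[float]:
    """One expert (index 0) plus team_size - 1 identical non-experts."""
    return [EXPERT_ESTIMATE] + [NOVICE_ESTIMATE] * (team_size - 1)


def compromise(estimates: list[float]) -> float:
    """Unweighted average: a stand-in for integrative compromise."""
    return mean(estimates)


def defer_to_expert(estimates: list[float], expert_index: int = 0) -> float:
    """Adopt the identified expert's answer unchanged."""
    return estimates[expert_index]


for n in (2, 4, 8):
    est = team_estimates(n)
    err_avg = abs(compromise(est) - TRUTH)
    err_exp = abs(defer_to_expert(est) - TRUTH)
    print(f"n={n}: compromise error={err_avg:.2f}, expert error={err_exp:.2f}")
```

In this toy setting the expert's error stays at 2 while the compromise error rises from 21 (two agents) to 35.25 (eight agents), because each added non-expert pulls the average further from the expert's answer.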
How does consensus-seeking affect robustness to adversaries?
Consensus-seeking behavior increases robustness to adversarial agents, creating a trade-off between alignment and effective use of expertise. While integrative compromise degrades peak performance on benchmarks without adversaries, the same averaging behavior makes teams less vulnerable when some agents behave adversarially.
The paper highlights this tension explicitly: consensus mechanisms that protect a team from adversarial influence also reduce the team’s ability to capitalise on a high-performing expert. Designers of multi-agent systems must weigh resilience against adversarial agents versus the need to respect and amplify expert reasoning.
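The tension can be sketched with two hypothetical decision policies over numeric answers. This is an illustrative toy, not the paper's adversarial protocol: the policy names `average` and `defer`, the scenarios, and every value are assumptions.

```python
from statistics import mean

TRUTH = 100.0  # hypothetical ground-truth answer


def average(estimates: list[float]) -> float:
    """Consensus-style policy: every agent's answer counts equally."""
    return mean(estimates)


def defer(estimates: list[float], trusted_index: int = 0) -> float:
    """Expert-style policy: adopt the trusted agent's answer unchanged."""
    return estimates[trusted_index]


def error(answer: float) -> float:
    """Absolute distance from the ground truth."""
    return abs(answer - TRUTH)


# Scenario A: the trusted agent (index 0) is a genuine expert.
honest = [98.0, 70.0, 70.0, 70.0]
# Scenario B: the trusted agent is adversarial and reports a wildly wrong answer.
adversarial = [0.0, 70.0, 70.0, 70.0]

for name, est in (("honest expert", honest), ("adversarial agent", adversarial)):
    print(f"{name}: defer error={error(defer(est)):.1f}, "
          f"average error={error(average(est)):.1f}")
```

Deferring wins when the trusted agent is honest (error 2 versus 23) and loses badly when that agent is adversarial (error 100 versus 47.5), which is the shape of the trade-off described above.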
Why it matters
The results expose a practical limitation for systems that rely on self-organizing multi-agent LLMs as autonomous collaborators. Deployments built on free-form agent interaction should expect coordination dynamics that suppress expert contributions and produce substantial performance losses; the paper documents losses of up to 41.1% on ML benchmarks. Organizations building multi-agent workflows that need to preserve expert-level output must therefore consider mechanisms beyond unconstrained interaction.
What to watch
Look for follow-up work that targets expert leveraging: concrete weighting rules, role scaffolds, or interaction protocols that preserve robustness while enabling teams to defer to known experts. The paper’s authors published their results at ICML in July 2026; progress should show up in subsequent multi-agent papers that test interventions for expert weighting and report results on the same human-inspired and frontier ML benchmarks used here.
Authors and affiliations: Aneesh Pappu†, Batu El†, Hancheng Cao‡, Carmelo di Nolfo, Yanchao Sun, Meng Cao, James Zou† († Stanford University; ‡ Emory University).
| Item | Finding | Detail |
|---|---|---|
| Maximum reported performance loss | 41.1% | On ML benchmarks |
| Expert identification vs leveraging | Identification not primary bottleneck | Expert leveraging is primary bottleneck |
| Typical conversational pattern | "Integrative compromise" | Averaging expert and non-expert views |
| Effect of team size | Compromise increases with team size | Correlates negatively with performance |
| Robustness to adversarial agents | Improved by consensus-seeking | Creates trade-off with expertise utilization |
Written by The Brieftide · Source: Apple Machine Learning