Benchmarks & Evals · June 18, 2026 · 5 min read

MultiCom multi-agent framework: ComRate dataset and 84.7% accuracy

The paper introduces ComRate, a dataset of 2.5 million community notes from X, and MultiCom, a persona-guided multi-agent rating framework.

The Brieftide · June 18, 2026

TL;DR

01 The paper introduces ComRate, a dataset of 2.5 million community notes from X, and MultiCom, a persona-guided multi-agent rating framework.
02 The paper pairs a large-scale dataset with a simulation-based evaluator that outputs structured, explainable judgments and an aggregation algorithm that combines votes and diagnostic signals.
03 The authors are Changxi Wen, Shuning Zhang, Bohao Chu, Yuwei Chuai, Hui Wang, Dai Shi, Xin Yi, and Hewu Li; the paper was submitted on 3 Jun 2026.

Changxi Wen and seven coauthors submitted "Towards Multi-Agent-Simulation-Based Community Note Evaluation" to arXiv on 3 Jun 2026. The paper releases ComRate, a dataset of 2.5 million community notes and over 209 million ratings sourced from X, and presents MultiCom, a persona-guided multi-agent rating framework that attains an average accuracy of 84.7% on the evaluation set. It pairs the large-scale dataset with a simulation-based evaluator that outputs structured, explainable judgments and an aggregation algorithm that combines votes and diagnostic signals.

What did the paper release?

The paper released two things: ComRate, a large-scale dataset, and MultiCom, a new evaluator. ComRate comprises 2.5 million community notes and over 209 million ratings collected from X. MultiCom achieves 84.7% average accuracy, 68.3% balanced accuracy, and 60.1% macro-F1 on the evaluation set. The authors are Changxi Wen, Shuning Zhang, Bohao Chu, Yuwei Chuai, Hui Wang, Dai Shi, Xin Yi, and Hewu Li, and the paper was submitted on 3 Jun 2026. The arXiv entry also links to the PDF, the TeX source, and a "Code, Data and Media Associated with this Article" section.

ComRate aims to capture the scale of community-based fact-checking and the cross-consensus ratings used on X. Its headline numbers, 2.5 million notes and more than 209 million rater evaluations, are the main concrete resource the paper provides for follow-up work.

How does MultiCom work?

MultiCom simulates a diverse rater population by clustering contributors in a matrix-factorized rater space and prompting persona agents to produce structured assessments following the official community notes rating schema. Persona agents output explainable judgments including confidence, agreement signals, and reasons, and an out-of-fold calibrated aggregation algorithm combines raw votes and diagnostic reason signals for the final prediction.
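The clustering step can be illustrated with a small sketch. It assumes each contributor already has a low-dimensional factor vector obtained from matrix factorization of the note-rating matrix. The toy vectors, the use of k-means, the cluster count, and the `persona_prompt` helper are all illustrative assumptions, not details taken from the paper.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def persona_prompt(cluster_id, centroid):
    """Hypothetical prompt text describing one rater cluster to a persona agent."""
    coords = ", ".join(f"{x:.2f}" for x in centroid)
    return (
        f"You rate community notes as a typical member of rater cluster {cluster_id} "
        f"(factor centroid: {coords}). Answer using the rating schema fields."
    )

# Toy rater factors (invented): two well-separated groups in a 2-D factor space.
rater_factors = [
    [-1.0, 0.2], [-0.9, 0.1], [-1.1, 0.3],
    [1.0, -0.1], [0.9, 0.0], [1.1, -0.2],
]
centroids, labels = kmeans(rater_factors, k=2)
prompts = [persona_prompt(c, centroids[c]) for c in range(2)]
```

Each cluster then yields one persona prompt, so the simulated rater pool reflects the spread of the factor space rather than a single average rater.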

In more detail, the pipeline clusters contributors to model heterogeneity in the rater pool, instantiates persona-guided agents that generate structured assessments (not just raw binary votes), and then aggregates those agent outputs with a calibration step. The authors emphasize structured, explainable outputs: each agent reports a confidence score, agreement signals, and textual reasons, which serve as diagnostic features for aggregation.
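A minimal sketch of the aggregation idea follows, under stated assumptions. The `Assessment` fields, the reason tags, the penalty weight, and the scoring rule are hypothetical. A simple out-of-fold threshold search stands in for the paper's calibration procedure, which the summary does not specify. The sketch only shows the general pattern: combine votes with reason signals, then fit the decision rule on folds that exclude the note being predicted.

```python
from dataclasses import dataclass, field

# Hypothetical reason tags that count against a note; not the official schema.
NEGATIVE_REASONS = {"missing_sources", "misleading"}

@dataclass
class Assessment:
    """One persona agent's structured output (field names are illustrative)."""
    helpful: bool
    confidence: float  # expected in [0, 1]
    reasons: list = field(default_factory=list)

def note_score(assessments, penalty=0.2):
    """Confidence-weighted helpful share, reduced by the share of agents flagging a negative reason."""
    total = sum(a.confidence for a in assessments)
    if total == 0:
        return 0.5
    vote = sum(a.confidence for a in assessments if a.helpful) / total
    flagged = sum(any(r in NEGATIVE_REASONS for r in a.reasons) for a in assessments)
    return max(0.0, vote - penalty * flagged / len(assessments))

def best_threshold(scores, labels):
    """Midpoint threshold with the highest accuracy on the given (training) notes."""
    distinct = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])] or [0.5]
    return max(candidates, key=lambda t: sum((s >= t) == y for s, y in zip(scores, labels)))

def out_of_fold_predict(scores, labels, n_folds=3):
    """Predict each note with a threshold fitted only on the other folds."""
    preds = [None] * len(scores)
    for fold in range(n_folds):
        train = [i for i in range(len(scores)) if i % n_folds != fold]
        held = [i for i in range(len(scores)) if i % n_folds == fold]
        t = best_threshold([scores[i] for i in train], [labels[i] for i in train])
        for i in held:
            preds[i] = scores[i] >= t
    return preds

# Toy example (invented data): three agents assess one note.
agents = [
    Assessment(True, 0.9),
    Assessment(True, 0.6),
    Assessment(False, 0.5, ["missing_sources"]),
]
score = note_score(agents)

# Toy scores and gold labels for six notes.
scores = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
gold = [True, True, False, False, True, False]
preds = out_of_fold_predict(scores, gold)
```

Fitting the threshold out of fold keeps each note's own label from influencing its prediction, which is the point of out-of-fold calibration regardless of the specific calibrator used.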

How was MultiCom evaluated and how did it perform?

According to the paper, MultiCom outperforms the alternative methods it was compared against, which the abstract does not name, reaching an average accuracy of 84.7%, a balanced accuracy of 68.3%, and a macro-F1 of 60.1%. These three metrics are the specific performance numbers the authors report for their approach on the held-out evaluation set.
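The three metrics weight errors differently, which is one reason they can diverge as they do here. The sketch below computes them from scratch on an invented, imbalanced toy example. The label names and predictions are assumptions for illustration only and say nothing about the paper's actual label set or class proportions.

```python
def accuracy(y_true, y_pred):
    """Share of items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(t == c for t in y_true)
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    f1s = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Invented toy data: 8 majority-class notes, 2 minority-class notes.
y_true = ["not_helpful"] * 8 + ["helpful"] * 2
# The classifier gets every majority note right but misses one minority note.
y_pred = ["not_helpful"] * 8 + ["helpful", "not_helpful"]

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

On this toy data, accuracy is 0.90, balanced accuracy is 0.75, and macro-F1 is about 0.80: a single missed minority-class item barely moves accuracy but costs half of that class's recall. A gap like the one between 84.7% and 68.3% is consistent with uneven per-class performance, though the abstract does not give class proportions.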

The paper frames the experiments as demonstrating that persona-guided multi-agent simulation plus calibrated aggregation yields more reliable predictions than the alternatives the authors compare against, though the abstract provides only the summary metrics above.

Why it matters

Community-based fact-checking systems depend on cross-consensus signals, but human-rated cross-consensus is often slow and sparse. A 2.5 million-note dataset, combined with a simulation framework that produces structured, explainable judgments and reaches 84.7% average accuracy, could materially lower the barrier to automated or semi-automated evaluation of community notes. Platforms and researchers testing moderation pipelines or rater-behavior models now have a large empirical resource and a concrete evaluation baseline to compare against.

What to watch

Check the paper's "Code, Data and Media Associated with this Article" section on the arXiv page for the ComRate and MultiCom artifacts and any release notes. The next concrete signals will be public dataset and code availability, independent replication of the 84.7% average accuracy, and whether follow-up work reports macro-F1 above 60.1% on the same evaluation set.

References: the arXiv submission "Towards Multi-Agent-Simulation-Based Community Note Evaluation" (submitted 3 Jun 2026) by Changxi Wen et al.

Written by The Brieftide · Source: arXiv
