ReMMD: Multilingual Multi-Image Benchmark and Agent Release
ReMMD introduces ReMMDBench (500 samples, 2,756 images) and ReMMD-Agent, which reaches 41.80% five-way accuracy and 39.12% macro-F1 when run with GPT-5.2.
TL;DR
- ReMMD pairs a real-world benchmark (ReMMDBench) with an agentic verifier (ReMMD-Agent) designed for long, multilingual posts with multiple images.
- ReMMDBench contains 500 samples and 2,756 images.
- Run with GPT-5.2, ReMMD-Agent achieves 41.80% five-way accuracy and 39.12% macro-F1 while cutting cost compared with prior agents.
ReMMD, published by Chenhao Dang, Dantong Zhu, Jun Yang, Conghui He, and Weijia Li on 23 Jun 2026 (arXiv:2606.24112), delivers a realistic multilingual multi-image verification framework and a persistent-memory agent for multimodal misinformation detection. The accompanying ReMMDBench contains 500 samples and 2,756 images; ReMMD-Agent achieves 41.80% five-way accuracy and 39.12% macro-F1 using GPT-5.2 while cutting cost compared with prior agents.
What is ReMMD and what does ReMMDBench contain?
ReMMD is a framework combining a real-world benchmark and an agentic verifier designed for long, multilingual posts with multiple images. ReMMDBench includes 500 samples and 2,756 images, covering five monolingual language settings, two cross-lingual settings, three text-length tiers, and multi-image posts. Its annotations comprise five-way veracity labels, eight distortion labels, evidence provenance, and rationales, giving researchers multi-faceted ground truth for verification.
The authors built the benchmark to reflect modern viral posts that mix long multilingual narratives, several images, mixed provenance, and subtle text-image framing errors. The paper states that existing benchmarks and methods often isolate short captions, single images, binary labels, or one manipulation source; ReMMDBench aims to close that realism gap.
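To make the annotation layers concrete, here is a minimal Python sketch of what one benchmark record could look like. It is not the released data format: the class names, field names, tier names, and the use of integer indices for labels are all assumptions made for illustration. Only the counts (five veracity classes, eight distortion labels, three length tiers) come from the brief.

```python
from dataclasses import dataclass, field
from typing import List

NUM_VERACITY_CLASSES = 5    # five-way veracity labels (from the brief)
NUM_DISTORTION_LABELS = 8   # eight distortion labels (from the brief)
LENGTH_TIERS = ("short", "medium", "long")  # hypothetical tier names


@dataclass
class EvidenceItem:
    """One piece of evidence with its provenance (hypothetical layout)."""
    source: str   # where the evidence came from, e.g. a URL or archive id
    snippet: str  # the supporting or refuting text


@dataclass
class BenchSample:
    """Hypothetical record for one multi-image post."""
    sample_id: str
    text: str
    languages: List[str]   # one code if monolingual, two if cross-lingual
    length_tier: str
    image_paths: List[str]
    veracity: int          # class index in [0, 5)
    distortions: List[int] = field(default_factory=list)  # indices in [0, 8)
    evidence: List[EvidenceItem] = field(default_factory=list)
    rationale: str = ""

    def validate(self) -> None:
        """Raise ValueError if any field falls outside the assumed ranges."""
        if not 0 <= self.veracity < NUM_VERACITY_CLASSES:
            raise ValueError("veracity index out of range")
        if any(not 0 <= d < NUM_DISTORTION_LABELS for d in self.distortions):
            raise ValueError("distortion index out of range")
        if self.length_tier not in LENGTH_TIERS:
            raise ValueError("unknown length tier")
        if not self.image_paths:
            # Assumption: every post carries at least one image.
            raise ValueError("a post needs at least one image")

    @property
    def is_cross_lingual(self) -> bool:
        """True when the post involves more than one language."""
        return len(set(self.languages)) > 1
```

Keeping veracity as a single index and distortions as a list reflects one plausible reading of the brief, in which a post has one veracity verdict but may exhibit several distortions; the paper's actual schema may differ.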
How does ReMMD-Agent work and how does it perform?
ReMMD-Agent is a persistent-memory verifier that decomposes posts into atomic points, builds a reusable evidence set, and predicts structured L1/L2/L3 outputs; it is evaluated against proprietary systems, open LVLMs, MMD-Agent, and T2-Agent. Across those comparators, the paper reports ReMMD-Agent achieves the best five-way veracity performance, with 41.80% accuracy and 39.12% macro-F1 when run with GPT-5.2.
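Accuracy and macro-F1 are reported side by side because they can diverge on a multi-class task: accuracy weights every sample equally, while macro-F1 weights every class equally. The sketch below computes both for a five-way label set in plain Python. It illustrates the standard definitions, not the paper's evaluation script, and the function name and the convention of scoring a class with no true or predicted instances as 0 are assumptions.

```python
from typing import Sequence, Tuple


def accuracy_and_macro_f1(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    num_classes: int = 5,
) -> Tuple[float, float]:
    """Return (accuracy, macro-F1) for integer class labels in [0, num_classes)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)

    per_class_f1 = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class that never appears scores 0.
        per_class_f1.append(2 * tp / denom if denom else 0.0)

    macro_f1 = sum(per_class_f1) / num_classes
    return accuracy, macro_f1
```

For instance, with true labels `[0, 0, 1, 1]`, predictions `[0, 1, 1, 1]`, and `num_classes=2`, accuracy is 0.75 while macro-F1 is about 0.733, because the under-predicted class pulls the class-averaged score down.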
The verifier’s persistent-memory design lets it reuse gathered evidence across a post’s atomic claims, which the authors argue lowers repeated evidence-search cost. Concretely, the paper states ReMMD-Agent reduces cost by 17.5% relative to MMD-Agent and by 79.9% relative to T2-Agent. The project materials are provided at the paper’s URL for inspection and reuse.
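The cost argument can be illustrated with a small evidence cache: if several atomic points in one post trigger the same search, a memory that persists across points pays for that search only once. The sketch below is a simplified illustration, not ReMMD-Agent's implementation. `EvidenceMemory`, `gather_evidence`, the query-normalization rule, and the idea that each atomic point maps to a list of search queries are all hypothetical.

```python
from typing import Callable, Dict, List


class EvidenceMemory:
    """Caches search results per normalized query so repeats cost nothing."""

    def __init__(self, search_fn: Callable[[str], List[str]]) -> None:
        self._search_fn = search_fn              # the expensive external search
        self._store: Dict[str, List[str]] = {}   # normalized query -> evidence
        self.search_calls = 0                    # number of real searches made

    def lookup(self, query: str) -> List[str]:
        # Assumed normalization: lowercase and collapse whitespace.
        key = " ".join(query.lower().split())
        if key not in self._store:
            self.search_calls += 1
            self._store[key] = self._search_fn(query)
        return self._store[key]


def gather_evidence(
    point_queries: Dict[str, List[str]],
    memory: EvidenceMemory,
) -> Dict[str, List[str]]:
    """Collect evidence for each atomic point, reusing the shared memory."""
    evidence: Dict[str, List[str]] = {}
    for point, queries in point_queries.items():
        collected: List[str] = []
        for query in queries:
            collected.extend(memory.lookup(query))
        evidence[point] = collected
    return evidence
```

If two atomic points both query the same event photo, `memory.search_calls` grows by one rather than two, which is the kind of saving the persistent-memory design is meant to deliver. How the real agent keys, stores, and retrieves evidence is described in the paper, not here.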
Why it matters
Multimodal misinformation increasingly combines long multilingual text and multiple images, so benchmarks that restrict to single-image, short-caption or binary-label setups underrepresent real-world difficulty. ReMMDBench supplies multi-image, multilingual, multi-label examples with provenance and rationales, enabling methods to train and be evaluated on the kinds of evidence search and cross-lingual reasoning that human verifiers face. The persistent-memory approach in ReMMD-Agent suggests a concrete systems path to reduce evidence-search cost while producing structured verification outputs, a practical trade-off for deployment where repeated lookups drive expense.
What to watch
Check the project URL included with the paper for code, data and evaluation scripts and for external reproductions of the GPT-5.2 results. The next concrete signals will be community adoption of ReMMDBench for cross-system comparisons and independent evaluations confirming the paper’s reported 41.80% accuracy and 39.12% macro-F1 under open and proprietary model setups.
References: ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection, Chenhao Dang et al., arXiv:2606.24112 (submitted 23 Jun 2026).
| Item | ReMMD-Agent | vs. MMD-Agent | vs. T2-Agent |
|---|---|---|---|
| Five-way accuracy (GPT-5.2) | 41.80% | n/a | n/a |
| Macro-F1 (GPT-5.2) | 39.12% | n/a | n/a |
| Reported cost comparison | baseline | ReMMD-Agent reduces cost by 17.5% | ReMMD-Agent reduces cost by 79.9% |
Written by The Brieftide · Source: arXiv