MKG-RAG-Bench: new benchmark for multimodal KG retrieval
A cross-domain benchmark with two multimodal knowledge graphs and LLM-curated QA splits evaluates retrieval's role in KG-augmented generation.
TL;DR
- A cross-domain benchmark with two multimodal knowledge graphs and LLM-curated QA splits evaluates retrieval's role in KG-augmented generation.
- Xiaochen Wang and four coauthors published MKG-RAG-Bench on arXiv (arXiv:2606.26458), submitted on 24 Jun 2026 and accepted by KDD'26.
- The paper introduces a cross-domain benchmark explicitly designed to measure retrieval quality in multimodal knowledge graph-augmented generation systems.
Xiaochen Wang and four coauthors published MKG-RAG-Bench on arXiv (arXiv:2606.26458), submitted on 24 Jun 2026 and accepted by KDD'26. The paper introduces a cross-domain benchmark explicitly designed to measure retrieval quality in multimodal knowledge graph-augmented generation systems.
What is MKG-RAG-Bench?
MKG-RAG-Bench is a cross-domain benchmark built from two multimodal knowledge graphs, covering general and medical domains, designed to evaluate retrieval as a first-class target in knowledge graph-augmented generation. The benchmark pairs those graphs with carefully aligned question-answering datasets so researchers can separately measure retrieval and downstream generation performance.
The authors position MKG-RAG-Bench against existing RAG evaluations, arguing that prior benchmarks largely overlook the distinctive retrieval challenges that arise when knowledge is multimodal and structured as a knowledge graph. The benchmark focuses on retrieval bottlenecks that stem from heterogeneous multimodal content and alignment difficulties across modalities.
How was the benchmark constructed?
The benchmark pipeline uses a large language model to curate the dataset: it filters low-utility knowledge, generates structurally grounded queries with exact supervision, and systematically covers diverse modality configurations. That LLM-based curation pipeline is a core part of the benchmark's construction, according to the paper.
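The three curation stages described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the paper's implementation: the `Fact` and `QAExample` schemas, the `curate` function, and the two rule-based helpers, which stand in for the LLM calls the pipeline would actually make.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Fact:
    """One knowledge-graph fact (hypothetical schema, not the paper's)."""
    fact_id: str
    head: str
    relation: str
    tail: str
    tail_modality: str  # e.g. "text" or "image"


@dataclass(frozen=True)
class QAExample:
    question: str
    answer: str
    gold_fact_ids: Tuple[str, ...]  # exact retrieval supervision
    modality_config: str


def curate(
    facts: List[Fact],
    is_useful: Callable[[Fact], bool],
    write_question: Callable[[Fact], str],
) -> Dict[str, List[QAExample]]:
    """Filter facts, build one grounded QA example per kept fact, bucket by modality."""
    buckets: Dict[str, List[QAExample]] = {}
    for fact in facts:
        # Stage 1: drop low-utility knowledge.
        if not is_useful(fact):
            continue
        # Stage 2: tie the query to the exact fact that answers it.
        example = QAExample(
            question=write_question(fact),
            answer=fact.tail,
            gold_fact_ids=(fact.fact_id,),
            modality_config=fact.tail_modality,
        )
        # Stage 3: keep examples grouped so every modality setting is covered.
        buckets.setdefault(example.modality_config, []).append(example)
    return buckets


# Rule-based stand-ins for what would be LLM judgments (assumptions).
def is_useful(fact: Fact) -> bool:
    return fact.relation != "same_as"


def write_question(fact: Fact) -> str:
    return f"What is the {fact.relation.replace('_', ' ')} of {fact.head}?"


# Toy facts invented for illustration only.
facts = [
    Fact("f1", "aspirin", "drug_class", "NSAID", "text"),
    Fact("f2", "aspirin", "same_as", "acetylsalicylic acid", "text"),
    Fact("f3", "Eiffel Tower", "image", "eiffel.jpg", "image"),
]
splits = curate(facts, is_useful, write_question)
for config, examples in splits.items():
    print(config, [(e.question, e.gold_fact_ids) for e in examples])
```

Because each example records the IDs of the facts it was generated from, a retriever's output can later be checked against those IDs without involving a generator at all.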
Concrete construction choices include selecting two multimodal knowledge graphs spanning different domains (general and medical) and creating QA splits that enable controlled evaluation of both retrieval and generation. The paper highlights the need for exact supervision in queries so experiments can isolate retrieval performance from generation quality.
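To see why exact supervision lets retrieval be scored on its own, here is a minimal sketch. It assumes each query ships with a set of gold knowledge-graph item IDs. The ID format, the function names, and the choice of recall@k and reciprocal rank are illustrative assumptions, not necessarily the metrics the paper reports.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Fraction of gold items that appear among the top-k retrieved items."""
    if not gold_ids:
        raise ValueError("gold_ids must be non-empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """1 / rank of the first gold item, or 0.0 if none was retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in gold_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical retriever output and gold evidence for one query.
ranked = ["img:17", "triple:4", "triple:9", "img:2"]
gold = {"triple:9", "img:2"}

print(recall_at_k(ranked, gold, k=3))  # 0.5: one of the two gold items is in the top 3
print(reciprocal_rank(ranked, gold))   # first gold item sits at rank 3
```

Neither function looks at a generated answer, so a score computed this way reflects the retriever alone.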
How well do current retrievers perform on MKG-RAG-Bench?
Experiments across representative retriever families and modality settings show that effective multimodal retrieval remains challenging, and that retrieval quality strongly determines generation outcomes. The authors report that, in their extensive experiments, retrieval remains a critical bottleneck for end-to-end MKG-RAG performance.
The paper does not publish a single winner retriever; rather, it emphasizes that retrievers originally designed for unstructured corpora poorly serve the heterogeneous, multimodal knowledge in MKG-RAG scenarios. By comparing retriever families and modality configurations, the benchmark exposes where retrieval fails and how those failures cascade into degraded generation.
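One way to read such a comparison is to break results down by retriever and modality setting, then check how answer accuracy differs when retrieval hits versus misses. The sketch below assumes a hypothetical per-query log format. The records are made-up placeholders, not results from the paper, and `retriever_a` and `retriever_b` are invented names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Placeholder per-query records, invented purely to exercise the code.
records = [
    {"retriever": "retriever_a", "modality": "text", "retrieval_hit": True, "answer_correct": True},
    {"retriever": "retriever_a", "modality": "text", "retrieval_hit": True, "answer_correct": True},
    {"retriever": "retriever_a", "modality": "image", "retrieval_hit": False, "answer_correct": False},
    {"retriever": "retriever_a", "modality": "image", "retrieval_hit": True, "answer_correct": True},
    {"retriever": "retriever_b", "modality": "image", "retrieval_hit": False, "answer_correct": False},
    {"retriever": "retriever_b", "modality": "image", "retrieval_hit": False, "answer_correct": True},
]


def rate(rows: List[dict], field: str) -> float:
    """Share of rows where the boolean field is True."""
    return sum(1 for r in rows if r[field]) / len(rows)


def summarize(rows: List[dict]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Retrieval hit rate and answer accuracy per (retriever, modality) pair."""
    grouped: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for r in rows:
        grouped[(r["retriever"], r["modality"])].append(r)
    return {
        key: {
            "hit_rate": rate(group, "retrieval_hit"),
            "accuracy": rate(group, "answer_correct"),
        }
        for key, group in grouped.items()
    }


def accuracy_by_hit(rows: List[dict]) -> Dict[bool, float]:
    """Answer accuracy split by whether retrieval found gold evidence."""
    out: Dict[bool, float] = {}
    for hit in (True, False):
        subset = [r for r in rows if r["retrieval_hit"] is hit]
        if subset:
            out[hit] = rate(subset, "answer_correct")
    return out


for key, stats in summarize(records).items():
    print(key, stats)
print(accuracy_by_hit(records))
```

A large gap between the hit and miss groups on real logs would be the cascade the paper describes: when the retriever misses the gold evidence, the generator has little to ground its answer in.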
Why it matters
The benchmark tackles a concrete gap: multimodal knowledge is heterogeneous, difficult to align across modalities, and often poorly served by retrievers designed for unstructured corpora. That makes retrieval the likely choke point for deploying knowledge graph-augmented generation in domains that rely on images, structured facts, and other non-text modalities.
By isolating retrieval as an explicit evaluation target and providing LLM-curated, exactly supervised QA splits across two domain graphs, MKG-RAG-Bench gives researchers a way to measure retrieval improvements independently of generation advances. Because the paper finds that retrieval quality strongly determines generation outcomes, better retrievers on this benchmark should yield more reliable, grounded generation for applications that depend on multimodal graphs.
What to watch
The paper has been accepted by KDD'26, so expect further discussion and evaluation results in the KDD'26 proceedings or presentations. Check the paper's arXiv page for the "Code, Data and Media" links the authors provide to access benchmark assets and reproduce the experiments.
Source: arXiv:2606.26458, "MKG-RAG-Bench: Benchmarking Retrieval in Multimodal Knowledge Graph-Augmented Generation," submitted 24 Jun 2026 and accepted by KDD'26.
Written by The Brieftide · Source: arXiv