CaVe-VLM-CoT: Interpretable VLM scores 87.1% on ScienceQA
Modular agentic-RAG pipeline enforces step-level citations, achieving 87.1% accuracy on ScienceQA and a 56.6% CaVeScore.
TL;DR
- Modular agentic-RAG pipeline enforces step-level citations, achieving 87.1% accuracy on ScienceQA and a 56.6% CaVeScore.
- The system reports 87.1% accuracy and a 56.6% CaVeScore on ScienceQA, and 55.2% accuracy with a 35.7% CaVeScore on MMMU (30 subjects).
- CaVe-VLM-CoT is a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through explicit, step-level citation and verification.
CaVe-VLM-CoT, authored by Sneha Rao, Shaina Raza and Dhanesh Ramachandram and submitted 16 Jun 2026, introduces a modular, interpretable Vision-Language Model framework built as a five-stage closed-loop pipeline. The system reports 87.1% accuracy and a 56.6% CaVeScore on ScienceQA, and 55.2% accuracy with a 35.7% CaVeScore on MMMU (30 subjects).
What is CaVe-VLM-CoT?
CaVe-VLM-CoT is a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through explicit, step-level citation and verification. It defines five stages: Extractor, Retriever, Solver, Citation Injector, and Verifier, and introduces a suite of 23 component-wise metrics anchored by CaVeScore, a composite metric combining accuracy, citation precision and recall, attribution, and evidence grounding.
The authors position the framework to address hallucinations in Vision-Language Models by routing detected verification failures back to the Extractor for targeted re-retrieval. The design requires no architectural or prompt modifications to existing models, according to the submission.
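To make the idea of a composite metric concrete, here is a minimal Python sketch. It is not the authors' formula: the brief does not give the weights or the exact definitions behind CaVeScore, so the equal-weight mean, the set-based precision and recall, and the names `citation_precision_recall`, `Components` and `composite_score` are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


def citation_precision_recall(cited: set[str], supporting: set[str]) -> tuple[float, float]:
    """Precision and recall of the evidence IDs a reasoning step cites,
    measured against the IDs that actually support that step."""
    if not cited or not supporting:
        # Convention chosen for this sketch: an empty side scores zero.
        return 0.0, 0.0
    hits = len(cited & supporting)
    return hits / len(cited), hits / len(supporting)


@dataclass
class Components:
    """The five ingredients the brief lists for CaVeScore, each in [0, 1]."""
    accuracy: float
    citation_precision: float
    citation_recall: float
    attribution: float
    evidence_grounding: float


def composite_score(c: Components, weights: dict[str, float] | None = None) -> float:
    """Weighted mean of the components.

    ASSUMPTION: equal weights by default; the actual CaVeScore weighting
    is not stated in the source.
    """
    values = vars(c)
    weights = weights or {name: 1.0 for name in values}
    total = sum(weights[name] for name in values)
    return sum(values[name] * weights[name] for name in values) / total
```

A step that cites evidence `{"e1", "e2"}` when `{"e2", "e3", "e4"}` actually supports it would score a citation precision of 1/2 and a recall of 1/3 under these definitions. Keeping the components separate before averaging is what lets a reader see which ingredient pulls the composite down.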
How does the closed-loop pipeline work and how was it measured?
The pipeline flows from Extractor to Retriever to Solver to Citation Injector to Verifier; when the Verifier finds ungrounded claims, it sends feedback to the Extractor, which triggers another retrieval pass. The framework measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding via 23 component-wise metrics, summarized by CaVeScore, which combines accuracy, citation precision and recall, attribution, and evidence grounding.
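The control flow described above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: the stage signatures, the `Step` record, the `run_closed_loop` name and the `max_rounds` stopping rule are all assumptions, since the brief does not say how stages exchange data or when the loop gives up.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One reasoning step plus the evidence IDs cited for it (hypothetical record)."""
    text: str
    citations: list[str] = field(default_factory=list)


def run_closed_loop(
    question: str,
    extract: Callable[[str, list[str]], list[str]],        # question + feedback -> queries
    retrieve: Callable[[list[str]], dict[str, str]],       # queries -> {evidence_id: text}
    solve: Callable[[str, dict[str, str]], list[Step]],    # question + evidence -> steps
    inject_citations: Callable[[list[Step], dict[str, str]], list[Step]],
    verify: Callable[[list[Step], dict[str, str]], list[str]],  # -> feedback, empty if grounded
    max_rounds: int = 3,  # ASSUMPTION: the source does not state a stopping rule
) -> tuple[list[Step], int, bool]:
    """Run Extractor -> Retriever -> Solver -> Citation Injector -> Verifier,
    feeding Verifier feedback back to the Extractor until every claim is grounded
    or the round budget is spent. Returns (steps, rounds_used, grounded)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: list[str] = []
    steps: list[Step] = []
    for round_no in range(1, max_rounds + 1):
        queries = extract(question, feedback)
        evidence = retrieve(queries)
        steps = inject_citations(solve(question, evidence), evidence)
        feedback = verify(steps, evidence)
        if not feedback:
            return steps, round_no, True
    return steps, max_rounds, False
```

Passing the stages in as plain callables mirrors the modular claim in the brief: any stage can be swapped without touching the loop, and the underlying model is only ever called from inside a stage.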
According to the submission, CaVe-VLM-CoT achieved 87.1% accuracy and a 56.6% CaVeScore on ScienceQA. On MMMU (30 subjects) the framework achieved 55.2% accuracy and a 35.7% CaVeScore. The authors highlight that these results came without changes to model architectures or prompts, relying instead on the agentic-RAG control loop and citation-injection mechanism.
Why it matters
The framework shifts the focus from monolithic model changes to structured process controls: explicit citation injection and verifier-driven feedback create a traceable chain from input image and retrieved evidence to final answer. That traceability addresses two common failure modes: visually unfaithful outputs and unverified claims. The 23-metric suite and CaVeScore provide concrete, component-level measurements researchers can use to diagnose whether errors originate in retrieval, reasoning, attribution, or cross-modal alignment.
For practitioners, the claim that the pipeline needs no architecture or prompt changes matters because it suggests existing VLM deployments could adopt the approach as a modular layer to improve faithfulness. For benchmarkers, the paired accuracy and CaVeScore figures on ScienceQA and MMMU offer both task performance and evidence-grounding diagnostics rather than a single aggregated number.
What to watch
Look for code, data and demos linked to the submission and any follow-up evaluations on other VLM benchmarks or larger-scale vision-language tasks. The next concrete signal will be adoption of the CaVeScore or the 23 component-wise metrics by other groups, and published comparisons that separate gains from retrieval improvements versus gains from the verification-feedback loop.
Authors and submission details: the work is by Sneha Rao, Shaina Raza and Dhanesh Ramachandram, submitted to arXiv on 16 Jun 2026. Key evaluation points from the paper: 87.1% accuracy and 56.6% CaVeScore on ScienceQA; 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).
Written by The Brieftide · Source: arXiv