Multimodal AI | June 18, 2026 | 4 min read

CaVe-VLM-CoT: Interpretable VLM scores 87.1% on ScienceQA

Modular agentic-RAG pipeline enforces step-level citations, achieving 87.1% accuracy on ScienceQA and a 56.6% CaVeScore.

The Brieftide | June 18, 2026

TL;DR

01 Modular agentic-RAG pipeline enforces step-level citations, achieving 87.1% accuracy on ScienceQA and a 56.6% CaVeScore.
02 The system reports 87.1% accuracy and a 56.6% CaVeScore on ScienceQA, and 55.2% accuracy with a 35.7% CaVeScore on MMMU (30 subjects).
03 CaVe-VLM-CoT is a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through explicit, step-level citation and verification.

CaVe-VLM-CoT, authored by Sneha Rao, Shaina Raza and Dhanesh Ramachandram and submitted 16 Jun 2026, introduces a modular, interpretable Vision-Language Model framework built as a five-stage closed-loop pipeline. The system reports 87.1% accuracy and a 56.6% CaVeScore on ScienceQA, and 55.2% accuracy with a 35.7% CaVeScore on MMMU (30 subjects).

What is CaVe-VLM-CoT?

CaVe-VLM-CoT is a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through explicit, step-level citation and verification. It defines five stages: Extractor, Retriever, Solver, Citation Injector, and Verifier, and introduces a suite of 23 component-wise metrics anchored by CaVeScore, a composite metric combining accuracy, citation precision and recall, attribution, and evidence grounding.
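To make the idea of a composite metric concrete, here is a minimal Python sketch of set-based citation precision and recall for a single reasoning step, rolled into one score together with the other named components. This article does not give the actual CaVeScore formula or its weights, so the unweighted mean below, the function names, and the input values are all assumptions for illustration.

```python
def citation_precision_recall(cited: set[str], supporting: set[str]) -> tuple[float, float]:
    """Set-based precision and recall for one reasoning step.

    cited      -- evidence IDs the step actually cites
    supporting -- evidence IDs that genuinely support the step (gold labels)
    """
    hits = len(cited & supporting)
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(supporting) if supporting else 0.0
    return precision, recall


def composite_score(components: dict[str, float]) -> float:
    """ASSUMPTION: unweighted mean of component scores in [0, 1].

    The real CaVeScore weighting is not described in this article.
    """
    values = list(components.values())
    return sum(values) / len(values)


# The step cites e1, e2, e4; only e1, e2, e3 truly support it.
precision, recall = citation_precision_recall(
    cited={"e1", "e2", "e4"},
    supporting={"e1", "e2", "e3"},
)

# Illustrative inputs only; these are not numbers from the paper.
score = composite_score({
    "accuracy": 1.0,
    "citation_precision": precision,
    "citation_recall": recall,
    "attribution": 0.5,
    "evidence_grounding": 0.5,
})
print(round(precision, 3), round(recall, 3), round(score, 3))
```

The point of the sketch is that a correct answer (accuracy of 1.0) can still earn a much lower composite score when its citations are imprecise or incomplete, which is consistent with the reported CaVeScore figures sitting below the accuracy figures.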

The authors position the framework to address hallucinations in Vision-Language Models by routing detected verification failures back to the Extractor for targeted re-retrieval. The design requires no architectural or prompt modifications to existing models, according to the submission.

How does the closed-loop pipeline work and how was it measured?

The pipeline runs in order: Extractor, Retriever, Solver, Citation Injector, Verifier. When the Verifier finds an ungrounded claim, it sends feedback to the Extractor, which triggers targeted re-retrieval. The framework measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding via 23 component-wise metrics, summarized by CaVeScore, which combines accuracy, citation precision and recall, attribution, and evidence grounding.
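The control flow described above can be sketched as a small loop. This is a toy, not the authors' implementation: each stage is a hypothetical stub (a real Solver would call a VLM, and a real Retriever would query an evidence store), the keyword `INDEX` stands in for retrieval, and the `max_rounds` cap is an assumption added so the loop always terminates.

```python
from dataclasses import dataclass, replace

# Toy keyword index standing in for a real retriever (illustrative only).
INDEX = {"light": "doc_light", "chlorophyll": "doc_chlorophyll"}


@dataclass(frozen=True)
class Step:
    claim: str
    key_term: str                    # term the supporting evidence must cover
    citations: tuple[str, ...] = ()


def extractor(question: str, failed: list[Step]) -> list[str]:
    """Pull query terms from the question, plus terms for any failed steps."""
    terms = [word.strip("?.,").lower() for word in question.split()]
    return terms + [step.key_term for step in failed]


def retriever(terms: list[str]) -> set[str]:
    """Look up evidence IDs for the terms the index knows about."""
    return {INDEX[term] for term in terms if term in INDEX}


def solver(question: str, evidence: set[str]) -> list[Step]:
    """Stub reasoning chain; a real Solver would call the VLM."""
    return [
        Step("Leaves receive light.", "light"),
        Step("Chlorophyll reflects green wavelengths.", "chlorophyll"),
    ]


def citation_injector(steps: list[Step], evidence: set[str]) -> list[Step]:
    """Attach to each step the retrieved evidence that covers its key term."""
    return [
        replace(step, citations=tuple(sorted(evidence & {INDEX[step.key_term]})))
        for step in steps
    ]


def verifier(steps: list[Step]) -> list[Step]:
    """Return the steps that carry no supporting citation."""
    return [step for step in steps if not step.citations]


def run_pipeline(question: str, max_rounds: int = 3) -> tuple[list[Step], int]:
    failed: list[Step] = []
    for round_number in range(1, max_rounds + 1):
        evidence = retriever(extractor(question, failed))
        steps = citation_injector(solver(question, evidence), evidence)
        failed = verifier(steps)     # ungrounded steps feed the next Extractor call
        if not failed:               # every step is grounded, so stop looping
            break
    return steps, round_number


steps, rounds = run_pipeline("Why do leaves reflect green light?")
for step in steps:
    print(step.claim, "->", step.citations)
print("rounds used:", rounds)
```

In this toy run the first pass retrieves evidence only for "light", so the second step fails verification. The failed step is handed back to the Extractor, which adds "chlorophyll" to the query terms, and the second pass grounds every step.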

According to the submission, CaVe-VLM-CoT achieved 87.1% accuracy and a 56.6% CaVeScore on ScienceQA. On MMMU (30 subjects) the framework achieved 55.2% accuracy and a 35.7% CaVeScore. The authors highlight that these results came without changes to model architecture or prompts, relying instead on the agentic-RAG control loop and citation-injection mechanism.

Why it matters

The framework shifts the focus from monolithic model changes to structured process controls: explicit citation injection and verifier-driven feedback create a traceable chain from input image and retrieved evidence to final answer. That traceability addresses two common failure modes: visually unfaithful outputs and unverified claims. The 23-metric suite and CaVeScore provide concrete, component-level measurements researchers can use to diagnose whether errors originate in retrieval, reasoning, attribution, or cross-modal alignment.
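As a rough illustration of component-level diagnosis, the sketch below groups metrics by pipeline concern and reports the weakest group. The group names, metric names, and values are placeholders invented for this example; they are not the paper's 23 metrics or its results.

```python
# Placeholder metric values for illustration; not results from the paper.
metrics = {
    "retrieval": {"recall_at_k": 0.9, "precision_at_k": 0.7},
    "reasoning": {"step_validity": 0.8},
    "attribution": {"citation_precision": 0.5, "citation_recall": 0.6},
    "cross_modal": {"visual_grounding": 0.75},
}


def weakest_component(grouped: dict[str, dict[str, float]]) -> tuple[str, dict[str, float]]:
    """Average each group's metrics and return the lowest-scoring group."""
    averages = {
        name: sum(values.values()) / len(values)
        for name, values in grouped.items()
    }
    return min(averages, key=averages.get), averages


name, averages = weakest_component(metrics)
print("weakest component:", name)
for group, value in averages.items():
    print(f"  {group}: {value:.2f}")
```

With these placeholder values the attribution group scores lowest, which would point a researcher at the citation step rather than at retrieval or reasoning.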

For practitioners, the claim that the pipeline needs no architecture or prompt changes matters because it suggests existing VLM deployments could adopt the approach as a modular layer to improve faithfulness. For benchmarkers, the paired accuracy and CaVeScore figures on ScienceQA and MMMU offer both task performance and evidence-grounding diagnostics rather than a single aggregated number.

What to watch

Look for code, data and demos linked to the submission and any follow-up evaluations on other VLM benchmarks or larger-scale vision-language tasks. The next concrete signal will be adoption of the CaVeScore or the 23 component-wise metrics by other groups, and published comparisons that separate gains from retrieval improvements versus gains from the verification-feedback loop.

Authors and submission details: the work is by Sneha Rao, Shaina Raza and Dhanesh Ramachandram, submitted to arXiv on 16 Jun 2026. Key evaluation points from the paper: 87.1% accuracy and 56.6% CaVeScore on ScienceQA; 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

Figure: CaVe-VLM-CoT pipeline components and feedback loop