Multimodal AI · 6 min read

Reliability-Aware Inference reduces visual hallucinations in MLLMs

A retrieval-augmented, reliability-aware framework lifted ImageNet-100 accepted accuracy from 85.84% to 88.88% (89.04% coverage) and cut the hallucination-like accepted wrong-answer rate from 14.16% to 11.12%.

The Brieftide

TL;DR

  • 01 A retrieval-augmented, reliability-aware inference framework targets visual hallucinations in multimodal large language models by estimating reliability per instance, without retraining the base model.
  • 02 Experiments on ImageNet-100 show accepted prediction accuracy rising from 85.84% to 88.88% at 89.04% coverage, and the hallucination-like accepted wrong-answer rate falling from 14.16% to 11.12%.
  • 03 The authors constructed an external visual evidence database and used nearest-neighbor retrieval over normalized pretrained visual embeddings to supply evidence for each instance.

Pratheswaran Hariharan, Haiping Xu and Donghui Yan submitted a paper on 14 June 2026 proposing a retrieval-augmented, reliability-aware inference framework to reduce visual hallucinations in multimodal large language models. Experiments on ImageNet-100 show accepted prediction accuracy rising from 85.84% to 88.88% at 89.04% coverage, and the hallucination-like accepted wrong-answer rate falling from 14.16% to 11.12%.

What did the paper build and how does it work?

The authors constructed an external visual evidence database and used nearest-neighbor retrieval over normalized pretrained visual embeddings to supply evidence for each instance. Retrieved evidence feeds multiple reliability indicators (similarity strength, class-support agreement, evidence margin, entropy-based uncertainty and an aggregate reliability score), and a decision gate then chooses to accept the prediction, answer with caution, or abstain and fall back. A multimodal response-generation layer produces the final user-facing output conditioned on that reliability decision.
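The indicator-and-gate flow described above can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the function names, the unweighted averaging used for the aggregate score, the thresholds (0.75 and 0.5) and the response wording are all assumptions, since the brief does not give the authors' formulas.

```python
import math
from collections import Counter


def reliability_indicators(neighbors, predicted_label):
    """Compute per-instance indicators from retrieved evidence.

    neighbors: list of (similarity, label) pairs for the top-k retrieved
    items; similarities are assumed to lie in [0, 1].
    """
    k = len(neighbors)
    counts = Counter(label for _, label in neighbors)
    support = [c / k for _, c in counts.most_common()]  # class shares, descending

    # Similarity strength: mean similarity of the retrieved neighbors.
    similarity = sum(s for s, _ in neighbors) / k
    # Class-support agreement: share of neighbors matching the prediction.
    agreement = counts.get(predicted_label, 0) / k
    # Evidence margin: gap between the two best-supported classes.
    margin = support[0] - (support[1] if len(support) > 1 else 0.0)
    # Entropy-based uncertainty, scaled to [0, 1] by its maximum, log(k).
    entropy = -sum(p * math.log(p) for p in support)
    uncertainty = entropy / math.log(k) if k > 1 else 0.0
    # Aggregate score: plain average (assumed, not the paper's weighting).
    score = (similarity + agreement + margin + (1.0 - uncertainty)) / 4
    return {
        "similarity": similarity,
        "agreement": agreement,
        "margin": margin,
        "uncertainty": uncertainty,
        "score": score,
    }


def decision_gate(score, accept_at=0.75, caution_at=0.5):
    """Map the aggregate score to a decision (thresholds are hypothetical)."""
    if score >= accept_at:
        return "accept"
    if score >= caution_at:
        return "caution"
    return "abstain"


def render_response(label, decision):
    """Condition the user-facing text on the gate's decision."""
    if decision == "accept":
        return f"This image shows: {label}."
    if decision == "caution":
        return f"This may be {label}, but the supporting evidence is mixed."
    return "There is not enough reliable evidence to answer."


neighbors = [(0.9, "cat"), (0.8, "cat"), (0.85, "cat"), (0.7, "dog")]
indicators = reliability_indicators(neighbors, "cat")
decision = decision_gate(indicators["score"])
print(decision, "->", render_response("cat", decision))
```

With three of four neighbors agreeing, the toy example lands in the cautious band; unanimous, high-similarity neighbors would push the score above the accept threshold.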

The system emphasizes instance-level reliability rather than retraining the base multimodal model. Database construction relies on pretrained visual embeddings and normalized feature representations; retrieval provides concrete visual neighbors that the framework uses to estimate trustworthiness before presenting an answer.
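As a rough illustration of the database and retrieval step, the sketch below L2-normalizes embeddings and ranks stored items by cosine similarity. It assumes embeddings have already been produced by some pretrained visual encoder; the helper names and the two-dimensional toy vectors are hypothetical, and a real system would likely use an approximate-nearest-neighbor index rather than a linear scan.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_database(embeddings, labels):
    """Store (normalized embedding, label) pairs as the evidence database."""
    return [(l2_normalize(e), lab) for e, lab in zip(embeddings, labels)]


def retrieve(database, query, k=3):
    """Return the k most similar entries as (similarity, label) pairs."""
    q = l2_normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, emb)), lab) for emb, lab in database
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" standing in for real encoder outputs.
database = build_database(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    ["cat", "cat", "dog"],
)
for similarity, label in retrieve(database, [2.0, 0.0], k=2):
    print(f"{label}: {similarity:.3f}")
```

Normalizing both the stored vectors and the query means the query's magnitude does not affect the ranking, only its direction.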

How well did it perform on ImageNet-100?

On ImageNet-100 the framework improved accepted-prediction accuracy from 85.84% to 88.88% at 89.04% coverage and reduced the hallucination-like accepted wrong-answer rate from 14.16% to 11.12%. These are the primary empirical results the authors report.
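The reported figures are consistent with reading the accepted wrong-answer rate as the complement of accepted accuracy (100 − 85.84 = 14.16 and 100 − 88.88 = 11.12), which works out to a relative reduction of about 21% in accepted wrong answers. The sketch below shows how such selective-prediction metrics could be computed; the function name and the toy records are hypothetical, and whether "answer with caution" counts as accepted in the paper is not stated in this brief.

```python
def selective_metrics(records):
    """Compute coverage, accepted accuracy and accepted wrong-answer rate.

    records: list of (accepted, correct) boolean pairs, one per test image.
    """
    accepted = [correct for was_accepted, correct in records if was_accepted]
    if not accepted:
        raise ValueError("no accepted predictions to score")
    coverage = len(accepted) / len(records)
    accuracy = sum(accepted) / len(accepted)
    return coverage, accuracy, 1.0 - accuracy


# Toy data: 8 of 10 predictions accepted, 7 of those 8 correct.
records = [(True, True)] * 7 + [(True, False)] + [(False, False)] * 2
coverage, accuracy, wrong_rate = selective_metrics(records)
print(f"coverage={coverage:.2%} accuracy={accuracy:.2%} wrong={wrong_rate:.2%}")

# Relative reduction implied by the reported rates (14.16% -> 11.12%).
relative_reduction = (14.16 - 11.12) / 14.16
print(f"relative reduction in accepted wrong answers: {relative_reduction:.1%}")
```

Abstained instances lower coverage but do not enter the accuracy denominator, which is why accuracy on the accepted subset can rise when the gate filters out weakly supported predictions.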

The experiments show that integrating retrieval evidence, several reliability signals, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models. The paper is 28 pages long and includes nine figures documenting the system and experimental results.

Why it matters

Multimodal models can produce overconfident visual errors when visual evidence is weak or ambiguous; the paper targets that failure mode with an evidence-backed, per-instance reliability estimate. By tying a decision gate and response generation to measurable signals such as similarity strength and entropy, the approach gives platforms a way to accept, hedge, or abstain based on explicit indicators rather than raw model confidence. That matters for any application where presenting an incorrect visual answer is costly: on ImageNet-100 the framework cut the accepted wrong-answer rate from 14.16% to 11.12%.

What to watch

See whether the retrieval-augmented reliability signals generalize beyond ImageNet-100 to larger or more diverse vision-language benchmarks, and whether practitioners adopt the decision-gate pattern in deployed multimodal pipelines. Future work and replications that apply the same evidence and reliability indicators to other datasets will test how broadly the reported accuracy and wrong-answer improvements hold.

References and provenance: the paper "Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference" by Pratheswaran Hariharan, Haiping Xu and Donghui Yan, submitted to arXiv on 14 Jun 2026 (28 pages, 9 figures).

Architecture of the retrieval-augmented reliability-aware inference system
Input image / MLLM prediction
→ External visual evidence database (pretrained visual embeddings, normalized features)
→ Nearest-neighbor retrieval
→ Reliability indicators (similarity strength, class-support agreement, evidence margin, entropy uncertainty, aggregate score)
→ Decision gate (accept | answer with caution | abstain/fallback)
→ Multimodal response-generation layer (conditioned on reliability decision)

Written by The Brieftide · Source: arXiv
