Multimodal AI · June 19, 2026 · 5 min read

Qwen 2.5 7B vs XGBoost: detecting LLM epistemic blind spots

Paper (submitted 17 Jun 2026, accepted at EIML@ICML 2026) finds Qwen 2.5 7B's verbalized confidence stays 0.856–0.937 while accuracy varies.

The Brieftide · June 19, 2026

TL;DR

01 Paper (submitted 17 Jun 2026, accepted at EIML@ICML 2026) finds Qwen 2.5 7B's verbalized confidence stays 0.856–0.937 while accuracy varies.
02 Akshat Dasula, Prasanna Desikan and Jaideep Srivastava submitted a paper on 17 Jun 2026 that tests whether a large language model can recognize the limits of its knowledge on structured clinical data.
03 The authors compare Qwen 2.5 7B and XGBoost, and report concrete failures of LLM verbalized confidence and methods to produce patient-specific reliability estimates.

What did the authors test and what were the headline findings?

The paper compares Qwen 2.5 7B against XGBoost on a clinical tabular prediction task and measures attribution divergence and calibration. The authors report that the LLM’s verbalized confidence is near-constant, ranging from 0.856 to 0.937, even as accuracy varies between 49% and 75.3%. They also show an inverse difficulty effect: the LLM’s accuracy drops to 64.8% on cases where XGBoost is 99% correct, yet the LLM matches XGBoost (73.8% vs. 73.1%) on cases where XGBoost is moderately uncertain.
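To see why near-constant confidence is a calibration problem, here is a minimal sketch of expected calibration error (ECE) using the common equal-width binning formulation. The brief does not say which ECE variant or bin count the authors used, so the function name, the ten-bin default and the synthetic inputs are all assumptions for illustration, not the paper's setup or data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Synthetic illustration only (not the paper's data).
# Case 1: confidence stuck at 0.9 while only half of the answers are right.
ece_flat = expected_calibration_error([0.9] * 100, [1] * 50 + [0] * 50)

# Case 2: confidence of 0.75 that matches a 75% hit rate.
ece_matched = expected_calibration_error([0.75] * 100, [1] * 75 + [0] * 25)

print(round(ece_flat, 3), round(ece_matched, 3))
```

In the first synthetic case the gap between stated confidence and observed accuracy is about 0.4. In the second it is essentially zero. A confidence value that barely moves can only be well calibrated if accuracy barely moves too.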

The paper frames these results as a cold start problem for LLMs on structured data and presents four primary findings: vacuous verbalized confidence, the inverse difficulty effect, the impact of few-shot and SHAP-derived feature evidence, and a cross-model calibrator that maps attribution divergence to reliability.

How did interventions change LLM behaviour and metrics?

Few-shot examples combined with SHAP-derived feature evidence reduced attribution disagreement and improved accuracy without retraining. Specifically, the Attribution Disagreement Score (ADS) fell from 1.54 to 0.38, while accuracy rose from 49% to 75.3%; the authors describe the two interventions as super-additive, meaning that together they deliver more than the sum of their separate gains. The authors further introduce a cross-model calibrator that uses attribution divergence signals to estimate LLM reliability. This calibrator reduced expected calibration error (ECE) from 0.254 to 0.080, replacing the uninformative verbalized confidence with patient-specific reliability estimates.
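The brief does not describe the functional form of the cross-model calibrator. As a rough sketch of the general idea (map a per-patient divergence score to an estimated probability that the LLM is correct), the example below uses plain histogram binning as a stand-in technique. The bin edges, function names and calibration data are hypothetical, and the pattern of lower accuracy at higher divergence is built into the made-up data rather than taken from the paper.

```python
from bisect import bisect_right


def fit_binned_calibrator(divergences, llm_correct, edges):
    """Estimate P(LLM correct) inside each divergence bin from held-out cases."""
    hits = [0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for d, ok in zip(divergences, llm_correct):
        b = bisect_right(edges, d)  # index of the bin that contains d
        hits[b] += ok
        counts[b] += 1
    overall = sum(llm_correct) / len(llm_correct)
    # Bins with no calibration cases fall back to the overall accuracy.
    return [h / c if c else overall for h, c in zip(hits, counts)]


def predict_reliability(divergence, edges, bin_rates):
    """Look up the reliability estimate for one patient's divergence score."""
    return bin_rates[bisect_right(edges, divergence)]


# Hypothetical calibration set: divergence score and whether the LLM was right.
edges = [0.5, 1.0]
divs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.2, 1.4, 1.6, 1.8]
ok = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

rates = fit_binned_calibrator(divs, ok, edges)
print(rates)                                     # per-bin reliability estimates
print(predict_reliability(0.35, edges, rates))   # low-divergence patient
print(predict_reliability(1.5, edges, rates))    # high-divergence patient
```

The design point this illustrates is that the reliability estimate depends only on a score computed from the two models' outputs for that patient, so it needs neither the LLM's internals nor repeated sampling.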

In short, adding model-agnostic feature evidence and few-shot context brought the LLM substantially closer to XGBoost, reducing attribution disagreement and improving predictive performance on the task the authors studied.
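The brief does not give the formula for ADS, so the following is not the authors' definition. It is one simple, assumed way to quantify how far two models' feature attributions diverge for a single patient: normalize the absolute attributions of each model and take the L1 distance between the two profiles. The feature names, the values and the idea that the LLM's attributions arrive as per-feature weights are all invented for illustration.

```python
def normalize_abs(attributions):
    """Scale absolute attributions so they sum to 1 (all-zero input stays zero)."""
    total = sum(abs(v) for v in attributions.values())
    return {k: (abs(v) / total if total else 0.0) for k, v in attributions.items()}


def attribution_l1_divergence(attr_a, attr_b):
    """L1 distance between two normalized attribution profiles, between 0 and 2."""
    pa, pb = normalize_abs(attr_a), normalize_abs(attr_b)
    features = set(pa) | set(pb)  # a feature missing from one model counts as 0
    return sum(abs(pa.get(f, 0.0) - pb.get(f, 0.0)) for f in features)


# Hypothetical per-patient attributions (names and numbers are made up).
xgb_shap = {"age": 0.6, "creatinine": 0.3, "heart_rate": 0.1}
llm_weights = {"age": 0.1, "creatinine": 0.1, "heart_rate": 0.8}

disagree = attribution_l1_divergence(xgb_shap, llm_weights)
agree = attribution_l1_divergence(xgb_shap, xgb_shap)
print(round(disagree, 2), agree)
```

A score of 0 means the two models weight the features identically, and larger values mean they rely on different evidence for the same patient.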

Why it matters

The paper exposes a concrete failure mode: "LLM verbalized confidence is epistemically vacuous," a short phrase the authors use to describe confidence that tracks prompt format rather than correctness. That matters because clinical settings require per-patient reliability, not a generic confidence signal. Demonstrating that cross-model attribution divergence can produce patient-level reliability estimates without model internals or repeated inference suggests a practical path to reduce epistemic uncertainty when deploying LLMs on structured data.

The findings also show LLMs can underperform even when a conventional model is near-perfect, so treating LLM outputs as self-aware risk assessments would be unsafe without external calibration.

What to watch

Look for follow-up work that evaluates the cross-model calibrator on other LLM architectures and clinical datasets, and for replication of the ADS reductions with different feature-attribution methods. The authors note their approach requires no access to model internals, so confirmation on other LLMs would test how well attribution-divergence signals generalize as reliability predictors.

The paper was accepted at EIML@ICML 2026 and is available on arXiv (submitted 17 Jun 2026), where the full experiments and the metrics summarized above are reported.

Written by The Brieftide · Source: arXiv
