Qwen 2.5 7B vs XGBoost: detecting LLM epistemic blind spots
Paper (submitted 17 Jun 2026, accepted at EIML@ICML 2026) finds Qwen 2.5 7B's verbalized confidence stays 0.856–0.937 while accuracy varies.
TL;DR
- Paper (submitted 17 Jun 2026, accepted at EIML@ICML 2026) finds Qwen 2.5 7B's verbalized confidence stays 0.856–0.937 while accuracy varies.
- Akshat Dasula, Prasanna Desikan and Jaideep Srivastava submitted a paper on 17 Jun 2026 that tests whether a large language model can recognize the limits of its knowledge on structured clinical data.
- The authors compare Qwen 2.5 7B and XGBoost, and report concrete failures of LLM verbalized confidence along with methods for producing patient-specific reliability estimates.
What did the authors test and what were the headline findings?
The paper compares Qwen 2.5 7B against XGBoost on a clinical tabular prediction task and measures attribution divergence and calibration. The authors report that the LLM’s verbalized confidence is near-constant, ranging from 0.856 to 0.937, even as accuracy varies between 49% and 75.3%. They also show an inverse difficulty effect: the LLM’s accuracy drops to 64.8% when XGBoost is 99% correct, yet the LLM matches XGBoost (73.8% vs. 73.1%) when XGBoost is moderately uncertain.
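One way to look for an inverse difficulty effect like the one described above is to group cases by how confident the reference model is and compute the LLM's accuracy within each group. The sketch below is a minimal illustration, not the authors' code: the function name, the bin edges and the toy data are all assumptions, and the numbers are not taken from the paper.

```python
from collections import defaultdict


def llm_accuracy_by_reference_confidence(ref_conf, llm_correct,
                                         edges=(0.5, 0.7, 0.9, 1.0)):
    """Group cases by the reference model's confidence, then report
    the LLM's accuracy inside each group.

    ref_conf    : reference-model confidence in its own prediction (0.5 to 1.0)
    llm_correct : 1 if the LLM was right on that case, else 0
    edges       : hypothetical bin edges; the last bin includes its upper edge
    """
    buckets = defaultdict(list)
    bins = list(zip(edges[:-1], edges[1:]))
    for conf, correct in zip(ref_conf, llm_correct):
        for lo, hi in bins:
            is_last = hi == edges[-1]
            if lo <= conf < hi or (is_last and conf == hi):
                buckets[(lo, hi)].append(correct)
                break
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


# Toy data for illustration only (not the paper's results).
ref_conf = [0.55, 0.62, 0.68, 0.75, 0.82, 0.88, 0.95, 0.97, 0.99, 0.99]
llm_correct = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

by_bin = llm_accuracy_by_reference_confidence(ref_conf, llm_correct)
for (lo, hi), acc in by_bin.items():
    print(f"reference confidence {lo:.1f}-{hi:.1f}: LLM accuracy {acc:.2f}")
```

If LLM accuracy is lowest in the bin where the reference model is most confident, as in this toy data, that is the pattern the brief calls an inverse difficulty effect.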
The paper frames these results as a cold start problem for LLMs on structured data and presents four primary findings: vacuous verbalized confidence, the inverse difficulty effect, the impact of few-shot and SHAP-derived feature evidence, and a cross-model calibrator that maps attribution divergence to reliability.
How did interventions change LLM behaviour and metrics?
Few-shot examples combined with SHAP-derived feature evidence reduced attribution disagreement and improved accuracy without retraining. Specifically, the Attribution Disagreement Score (ADS) fell from 1.54 to 0.38 and accuracy rose from 49% to 75.3%. The authors describe the two interventions as super-additive, meaning the combined gain exceeds the sum of the gains from each applied alone. They further introduce a cross-model calibrator that uses attribution divergence signals to estimate LLM reliability; this calibrator reduced expected calibration error (ECE) from 0.254 to 0.080, replacing the uninformative verbalized confidence with patient-specific reliability estimates.
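For readers unfamiliar with expected calibration error, the sketch below shows the standard binned form of the metric: cases are grouped by stated confidence, and the gap between average confidence and observed accuracy in each group is weighted by the group's share of cases. The brief does not say which binning the authors used, so the ten equal-width bins here are an assumption, and the toy inputs are illustrative rather than the paper's data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences : stated probabilities in [0, 1]
    correct     : 1 if the prediction was right, else 0
    n_bins      : number of equal-width bins (assumed, not from the paper)
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Toy case 1: a near-constant confidence of 0.9 with only 6 of 10 correct.
flat_ece = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)

# Toy case 2: two confidence levels that each match their observed accuracy.
matched_ece = expected_calibration_error(
    [0.8] * 5 + [0.4] * 5,
    [1, 1, 1, 1, 0] + [1, 1, 0, 0, 0],
)

print(f"flat confidence ECE:    {flat_ece:.3f}")
print(f"matched confidence ECE: {matched_ece:.3f}")
```

When confidence barely moves, every case lands in one bin and ECE is simply the gap between that confidence and overall accuracy, which is why a confidence signal that ignores correctness scores poorly on this metric.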
In short, adding model-agnostic, SHAP-derived feature evidence and few-shot context brought the LLM substantially closer to XGBoost, reducing attribution disagreement and improving predictive performance on the task the authors studied.
Why it matters
The paper exposes a concrete failure mode: "LLM verbalized confidence is epistemically vacuous," a short phrase the authors use to describe confidence that tracks prompt format rather than correctness. That matters because clinical settings require per-patient reliability, not a generic confidence signal. Demonstrating that cross-model attribution divergence can produce patient-level reliability estimates without model internals or repeated inference suggests a practical path to reduce epistemic uncertainty when deploying LLMs on structured data.
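To make the idea of mapping a divergence signal to a per-case reliability estimate concrete, here is a minimal sketch that fits a one-feature logistic model of "LLM was correct" against a divergence score. The brief does not describe the form of the authors' calibrator, so this is a hypothetical stand-in: the function name, the single `divergence` feature, the learning rate and the toy data are all assumptions.

```python
import math


def fit_reliability_calibrator(divergence, llm_correct, lr=0.5, steps=2000):
    """Fit P(LLM correct | divergence) = sigmoid(w * divergence + b)
    with plain batch gradient descent on the logistic loss.

    divergence  : hypothetical per-case attribution divergence score
    llm_correct : 1 if the LLM was right on that case, else 0
    Returns a function mapping a divergence score to a reliability in (0, 1).
    """
    w, b = 0.0, 0.0
    n = len(divergence)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for d, y in zip(divergence, llm_correct):
            p = 1.0 / (1.0 + math.exp(-(w * d + b)))
            grad_w += (p - y) * d
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n

    def reliability(d):
        return 1.0 / (1.0 + math.exp(-(w * d + b)))

    return reliability


# Toy data for illustration only: low divergence tends to go with correct answers.
divergence = [0.1, 0.2, 0.3, 0.4, 0.8, 1.0, 1.2, 1.5, 1.6, 1.8]
llm_correct = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

reliability = fit_reliability_calibrator(divergence, llm_correct)
low_div = reliability(0.2)
high_div = reliability(1.6)
print(f"reliability at divergence 0.2: {low_div:.2f}")
print(f"reliability at divergence 1.6: {high_div:.2f}")
```

Unlike a near-constant verbalized confidence, the output here changes from case to case because it depends on each case's divergence score, and it needs only the two models' outputs and attributions rather than access to LLM internals.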
The findings also show LLMs can underperform even when a conventional model is near-perfect, so treating LLM outputs as self-aware risk assessments would be unsafe without external calibration.
What to watch
Look for follow-up work evaluating the cross-model calibrator across other LLM architectures and clinical datasets and for replication of the ADS reductions when different feature-attribution methods are used. The authors note their approach requires no access to model internals, so confirmation on other LLMs would test how general attribution-divergence signals are as reliability predictors.
The paper was accepted at EIML@ICML 2026 and is available on arXiv (submitted 17 Jun 2026), providing the detailed experiments and metrics above.
Written by The Brieftide · Source: arXiv