Estimating Uncertainty in Classifier Performance: LLMs
Kylie Anglin finds Wald and basic percentile bootstrap often under-cover; recommends Agresti-Coull, Wilson.
TL;DR
- Kylie Anglin finds Wald and basic percentile bootstrap often under-cover; recommends Agresti-Coull, Wilson.
- Kylie Anglin submitted a methods paper to arXiv on 24 Jun 2026 that evaluates how researchers should estimate uncertainty for classifier performance metrics such as recall, precision and F1.
- Agresti-Coull, Wilson, Clopper-Pearson and a novel pseudo-count regularized bootstrap generally improve accuracy compared with default approaches.
Kylie Anglin submitted a methods paper to arXiv on 24 Jun 2026 that evaluates how researchers should estimate uncertainty for classifier performance metrics such as recall, precision and F1. The paper, arXiv:2606.26422, tests interval methods under conditions common in social science text classification: small to moderate sample sizes, infrequent constructs, and texts nested within individuals.
Which confidence-interval methods perform best for classifier metrics?
Agresti-Coull, Wilson, Clopper-Pearson and a novel pseudo-count regularized bootstrap generally improve accuracy compared with default approaches. Anglin shows that default methods such as the Wald interval and the basic percentile bootstrap are the least accurate, with coverage sometimes far below the nominal 95% level. The pseudo-count regularized bootstrap is highlighted as particularly relevant to calculating F1 scores.
The paper frames these comparisons through simulations that mimic real-world social science settings: small or moderate labeled datasets and low-prevalence constructs. Under those conditions Anglin finds the common analytic and bootstrap defaults underperform, while the named alternatives produce intervals with better coverage properties.
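The idea of coverage can be made concrete without any simulation. The sketch below computes, exactly, how often a nominal 95% interval contains the true proportion when the count is binomial. The scenario (n = 30, true proportion 0.95) is an assumption chosen for illustration and is not taken from the paper, which also covers F1 and nested data that this calculation does not address.

```python
from math import comb, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value


def wald(x, n):
    p = x / n
    half = Z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson(x, n):
    p = x / n
    denom = 1 + Z**2 / n
    centre = (p + Z**2 / (2 * n)) / denom
    half = Z * sqrt(p * (1 - p) / n + Z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def exact_coverage(interval, n, p):
    """Probability that interval(x, n) contains p when x ~ Binomial(n, p)."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


# Assumed scenario: 30 labeled cases, true metric value 0.95.
print("Wald  :", round(exact_coverage(wald, 30, 0.95), 3))
print("Wilson:", round(exact_coverage(wilson, 30, 0.95), 3))
```

In this assumed scenario the Wald interval covers the truth roughly 78% of the time against roughly 94% for Wilson. Much of the Wald shortfall comes from samples where every case is classified correctly: the estimated variance is then zero and the interval collapses to the single point 1.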
How should nested text data (texts within individuals) be handled?
Adjustment for both effective N and the appropriate degrees of freedom is necessary to produce accurate analytic intervals when texts are nested within individuals. Anglin evaluates bootstrap-based alternatives for clustered data and finds trade-offs: the hierarchical bootstrap is more accurate than the cluster bootstrap when individuals produce a moderate number of texts, but the hierarchical approach becomes overly conservative when individuals produce only a few texts.
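The source does not give the paper's formulas for these adjustments, so the following is only a sketch of the general idea using the standard Kish design effect, 1 + (m - 1) x ICC, as a stand-in. The intraclass correlation (`icc`), the assumption of roughly equal texts per person, and the choice of degrees of freedom (number of individuals minus one) are all assumptions made for illustration. The critical value is passed in by the caller so that a t quantile can replace the normal one.

```python
from math import sqrt


def effective_n(n_texts, n_people, icc):
    """Kish-style effective sample size, assuming similar texts per person."""
    mean_cluster_size = n_texts / n_people
    design_effect = 1 + (mean_cluster_size - 1) * icc
    return n_texts / design_effect


def wilson_form(p_hat, n, crit):
    """Wilson-form interval with a caller-supplied n and critical value."""
    denom = 1 + crit**2 / n
    centre = (p_hat + crit**2 / (2 * n)) / denom
    half = crit * sqrt(p_hat * (1 - p_hat) / n + crit**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical example: 200 texts written by 10 people, assumed ICC of 0.2.
n_eff = effective_n(200, 10, 0.2)          # 200 / 4.8, about 41.7
naive = wilson_form(0.80, 200, 1.96)       # ignores nesting
adjusted = wilson_form(0.80, n_eff, 2.262)  # t critical value for 9 df

print("effective n:", round(n_eff, 1))
print("naive      :", tuple(round(v, 3) for v in naive))
print("adjusted   :", tuple(round(v, 3) for v in adjusted))
```

The adjusted interval is noticeably wider than the naive one, which is the point: treating 200 nested texts as 200 independent observations overstates how much information the validation sample carries.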
The simulation results therefore support an explicit correction for the reduced effective sample size that nesting induces, and a choice of bootstrap technique that depends on how many texts each individual contributes.
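The difference between the two resampling schemes is easiest to see in code. In this sketch the cluster bootstrap resamples individuals and keeps all of their texts, while the hierarchical bootstrap resamples individuals and then resamples texts within each selected individual. The data layout, function names and toy data are invented for illustration, and the interval is a plain percentile interval used only to contrast the two schemes; it is not the paper's pseudo-count regularized bootstrap, whose construction the source does not describe.

```python
import random


def f1(pairs):
    """F1 from (true_label, predicted_label) pairs; None if undefined."""
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    denom = 2 * tp + fp + fn
    return None if denom == 0 else 2 * tp / denom


def bootstrap_f1(data, hierarchical, n_boot=2000, level=0.95, seed=0):
    """Percentile interval for F1 under cluster or hierarchical resampling.

    data maps each individual to a list of (true_label, predicted_label).
    """
    rng = random.Random(seed)
    people = list(data)
    stats = []
    for _ in range(n_boot):
        sample = []
        # Stage 1: resample individuals with replacement.
        for person in rng.choices(people, k=len(people)):
            texts = data[person]
            if hierarchical:
                # Stage 2: resample texts within the selected individual.
                texts = rng.choices(texts, k=len(texts))
            sample.extend(texts)
        value = f1(sample)
        if value is not None:  # skip replicates where F1 is undefined
            stats.append(value)
    stats.sort()
    alpha = 1 - level
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Toy data, invented for illustration: 10 people with 4 or 6 texts each.
toy = {
    f"person_{i}": (
        [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)] if i % 2
        else [(1, 1), (0, 0), (0, 0), (0, 0)]
    )
    for i in range(10)
}

print("cluster     :", bootstrap_f1(toy, hierarchical=False))
print("hierarchical:", bootstrap_f1(toy, hierarchical=True))
```

The second resampling stage adds within-person variability to every replicate, which is why the hierarchical scheme tends to give wider intervals. That is consistent with the finding above that it can become overly conservative when each person contributes only a few texts.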
Why it matters
Performance metrics used as evidence of validity are point estimates subject to sampling variation; failing to report or misestimating their uncertainty can mislead conclusions about a model's validity. Anglin's findings matter because many current studies rely on small labeled samples or measure infrequent constructs, situations where common intervals can miss their stated coverage (the paper cites instances well below the nominal 95% level). Better interval methods and correct handling of nesting reduce false confidence and encourage researchers to plan validation sample sizes deliberately.
What to watch
Check the arXiv entry for the paper (arXiv:2606.26422) for linked code and data in the "Code, Data and Media Associated with this Article" section; those assets will determine how easily the recommended methods can be applied. Also watch whether future classification papers in social science switch from Wald and basic percentile bootstrap intervals to the Agresti-Coull, Wilson, Clopper-Pearson or pseudo-count bootstrap approaches Anglin recommends.
References: Kylie Anglin, "Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data," arXiv:2606.26422, submitted 24 Jun 2026.
| Method | Relative accuracy | Guidance |
|---|---|---|
| Wald interval | least accurate | Avoid when sample small or performance high; coverage can be far below the nominal 95% level |
| Basic percentile bootstrap | least accurate | Avoid in small/moderate samples |
| Agresti-Coull | improved accuracy | Recommended alternative to Wald |
| Wilson | improved accuracy | Recommended alternative to Wald |
| Clopper-Pearson | improved accuracy | Recommended for analytic intervals |
| Pseudo-count regularized bootstrap | improved accuracy (esp. F1) | Recommended for F1 calculation |
| Hierarchical bootstrap | more accurate (moderate texts per individual); overly conservative when few texts | Use for nested data with a moderate number of texts per individual |
| Cluster bootstrap | less accurate than hierarchical in moderate-text cases | Not preferred when individuals produce a moderate number of texts |
Written by The Brieftide · Source: arXiv