June 26, 2026 · 4 min read

Estimating Uncertainty in Classifier Performance: LLMs

Kylie Anglin finds Wald and basic percentile bootstrap often under-cover; recommends Agresti-Coull, Wilson.

The Brieftide · June 26, 2026

TL;DR

01 Kylie Anglin finds Wald and basic percentile bootstrap often under-cover; recommends Agresti-Coull, Wilson.
02 Kylie Anglin submitted a methods paper to arXiv on 24 Jun 2026 that evaluates how researchers should estimate uncertainty for classifier performance metrics such as recall, precision and F1.
03 Agresti-Coull, Wilson, Clopper-Pearson and a novel pseudo-count regularized bootstrap generally improve accuracy compared with default approaches.

Kylie Anglin submitted a methods paper to arXiv on 24 Jun 2026 that evaluates how researchers should estimate uncertainty for classifier performance metrics such as recall, precision and F1. The paper, arXiv:2606.26422, tests interval methods under conditions common in social science text classification: small to moderate sample sizes, infrequent constructs, and texts nested within individuals.

Which confidence-interval methods perform best for classifier metrics?

Agresti-Coull, Wilson, Clopper-Pearson and a novel pseudo-count regularized bootstrap generally produce more accurate intervals than the default approaches. Anglin shows that default methods such as the Wald interval and the basic percentile bootstrap are the least accurate, with coverage sometimes far below the nominal 95% level. Anglin highlights the pseudo-count regularized bootstrap as particularly relevant for F1 scores.
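To make the contrast concrete, here is a minimal sketch of three of the analytic intervals named above, using the standard textbook formulas and applied to recall as a simple proportion. The counts (19 of 20 positives recovered) are hypothetical, and the function names are illustrative rather than taken from the paper or any released code.

```python
from math import sqrt
from statistics import NormalDist


def z_value(level=0.95):
    """Two-sided normal critical value, about 1.96 for a 95% interval."""
    return NormalDist().inv_cdf(0.5 + level / 2)


def wald(k, n, level=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = z_value(level)
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson(k, n, level=0.95):
    """Wilson score interval."""
    z = z_value(level)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def agresti_coull(k, n, level=0.95):
    """Agresti-Coull: a Wald-style interval around an adjusted proportion."""
    z = z_value(level)
    n_adj = n + z * z
    p_adj = (k + z * z / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


# Hypothetical validation sample: the classifier finds 19 of 20 true positives.
tp, positives = 19, 20
for name, fn in [("Wald", wald), ("Wilson", wilson), ("Agresti-Coull", agresti_coull)]:
    lo, hi = fn(tp, positives)
    print(f"{name:14s} recall = {tp / positives:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

One failure mode is easy to see: when every positive is recovered, the Wald interval collapses to a single point at 1.0, while the Wilson and Agresti-Coull intervals still have width.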

The paper frames these comparisons through simulations that mimic real-world social science settings: small or moderate labeled datasets and low-prevalence constructs. Under those conditions Anglin finds the common analytic and bootstrap defaults underperform, while the named alternatives produce intervals with better coverage properties.
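Coverage can be checked directly for a simple proportion. The sketch below is not the paper's simulation design: it swaps simulation for exact enumeration over every possible count, and the setting (30 labeled positives, true recall of 0.95) is a hypothetical chosen for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # critical value for a nominal 95% interval


def wald(k, n):
    p = k / n
    half = Z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson(k, n):
    p = k / n
    denom = 1 + Z * Z / n
    center = (p + Z * Z / (2 * n)) / denom
    half = Z * sqrt(p * (1 - p) / n + Z * Z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def exact_coverage(interval, n, p):
    """Probability that interval(k, n) contains p when k ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p <= hi:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


# Hypothetical setting: 30 labeled positives, true recall of 0.95.
n, true_recall = 30, 0.95
wald_cov = exact_coverage(wald, n, true_recall)
wilson_cov = exact_coverage(wilson, n, true_recall)
print(f"Wald coverage:   {wald_cov:.3f}")
print(f"Wilson coverage: {wilson_cov:.3f}")
```

In this setting the Wald interval misses the true value whenever all 30 positives are recovered, which alone happens more than a fifth of the time, so its coverage cannot reach the nominal level.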

How should nested text data (texts within individuals) be handled?

When texts are nested within individuals, analytic intervals are accurate only if they are adjusted for both the effective sample size (effective N) and the appropriate degrees of freedom. Anglin evaluates bootstrap-based alternatives for clustered data and finds trade-offs: the hierarchical bootstrap is more accurate than the cluster bootstrap when individuals produce a moderate number of texts, but the hierarchical approach becomes overly conservative when individuals produce only a few texts.
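This brief does not give the paper's formulas, so the following is only a sketch of the effective-N idea under stated assumptions: a Kish-style design effect with roughly equal numbers of texts per individual, and an intraclass correlation (ICC) supplied by the analyst. It does not implement the degrees-of-freedom correction, and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def design_effect(mean_texts_per_person, icc):
    """Kish-style design effect, assuming roughly equal cluster sizes."""
    return 1 + (mean_texts_per_person - 1) * icc


def effective_n(n_texts, n_individuals, icc):
    """Number of independent texts the nested sample is 'worth'."""
    mean_size = n_texts / n_individuals
    return n_texts / design_effect(mean_size, icc)


def wilson(p_hat, n, level=0.95):
    """Wilson interval that accepts a fractional (effective) sample size."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical study: 200 texts written by 40 individuals, ICC of 0.3,
# and an observed proportion correct of 0.85.
n_texts, n_individuals, icc = 200, 40, 0.3
accuracy = 0.85

n_eff = effective_n(n_texts, n_individuals, icc)
naive = wilson(accuracy, n_texts)   # treats all texts as independent
adjusted = wilson(accuracy, n_eff)  # uses the reduced effective N

print(f"effective N: {n_eff:.1f} of {n_texts} texts")
print(f"naive 95% CI:    [{naive[0]:.3f}, {naive[1]:.3f}]")
print(f"adjusted 95% CI: [{adjusted[0]:.3f}, {adjusted[1]:.3f}]")
```

The adjusted interval is wider because correlated texts from the same person carry less information than the same number of independent texts.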

The simulation results therefore support two practices: correcting explicitly for the reduced effective sample size that nesting induces, and choosing the bootstrap technique according to how many texts each individual contributes.
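The two resampling schemes differ in a single step, which a short sketch can show. This uses the usual definitions (a cluster bootstrap resamples individuals and keeps their texts intact; a hierarchical bootstrap also resamples texts within each sampled individual), not code from the paper. The toy data are randomly generated, and for brevity the sketch uses the plain percentile interval, which the paper reports as unreliable in small samples.

```python
import random


def recall(pairs):
    """Recall over (true_label, predicted_label) pairs; None if no positives."""
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    pos = sum(1 for y, _ in pairs if y == 1)
    return tp / pos if pos else None


def percentile_interval(stats, level=0.95):
    """Approximate percentile interval from sorted bootstrap statistics."""
    stats = sorted(stats)
    alpha = (1 - level) / 2
    last = len(stats) - 1
    return stats[int(alpha * last)], stats[int((1 - alpha) * last)]


def bootstrap_recall(clusters, hierarchical, reps=1000, seed=0):
    """clusters: one list of (true, predicted) pairs per individual."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        # Stage 1 (both schemes): resample individuals with replacement.
        sampled = rng.choices(clusters, k=len(clusters))
        pairs = []
        for texts in sampled:
            if hierarchical:
                # Stage 2 (hierarchical only): resample texts within a person.
                texts = rng.choices(texts, k=len(texts))
            pairs.extend(texts)
        value = recall(pairs)
        if value is not None:  # skip resamples that contain no positives
            stats.append(value)
    return percentile_interval(stats)


# Hypothetical toy data: 30 individuals with 6 texts each, where each
# individual has their own chance of being classified correctly.
data_rng = random.Random(1)
clusters = []
for _ in range(30):
    hit_rate = data_rng.uniform(0.6, 0.95)
    texts = []
    for _ in range(6):
        y = 1 if data_rng.random() < 0.4 else 0
        correct = data_rng.random() < hit_rate
        texts.append((y, y if correct else 1 - y))
    clusters.append(texts)

cluster_ci = bootstrap_recall(clusters, hierarchical=False)
hier_ci = bootstrap_recall(clusters, hierarchical=True)
print(f"cluster bootstrap 95% CI:      [{cluster_ci[0]:.3f}, {cluster_ci[1]:.3f}]")
print(f"hierarchical bootstrap 95% CI: [{hier_ci[0]:.3f}, {hier_ci[1]:.3f}]")
```

The second resampling stage adds variability to each replicate, which is one way to see how the hierarchical scheme could become overly conservative when individuals contribute only a few texts.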

Why it matters

Performance metrics used as evidence of validity are point estimates subject to sampling variation; failing to report their uncertainty, or misestimating it, can mislead conclusions about a model's validity. Anglin's findings matter because many current studies rely on small labeled samples or measure infrequent constructs, situations where common intervals can miss their stated coverage (the paper cites instances well below the nominal 95% level). Better interval methods and correct handling of nesting reduce false confidence and encourage researchers to plan validation sample sizes deliberately.

What to watch

Check the arXiv entry for the paper (arXiv:2606.26422) for linked code and data in the "Code, Data and Media Associated with this Article" section; the availability of those assets will determine how readily the recommended methods can be applied. Also watch whether future classification papers in social science switch from Wald and basic percentile bootstrap intervals to the Agresti-Coull, Wilson, Clopper-Pearson or pseudo-count bootstrap approaches Anglin recommends.

References: Kylie Anglin, "Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data," arXiv:2606.26422, submitted 24 Jun 2026.

Performance of interval methods described in the paper

Method	Accuracy	Guidance
Wald interval	least accurate	Avoid when the sample is small or performance is high; coverage can be far below the nominal 95% level
Basic percentile bootstrap	least accurate	Avoid in small/moderate samples
Agresti-Coull	improved accuracy	Recommended alternative to Wald
Wilson	improved accuracy	Recommended alternative to Wald
Clopper-Pearson	improved accuracy	Recommended for analytic intervals
Pseudo-count regularized bootstrap	improved accuracy (esp. F1)	Recommended for F1 calculation
Hierarchical bootstrap	more accurate (moderate texts per individual); overly conservative when few texts	Use for nested data with a moderate number of texts per individual
Cluster bootstrap	less accurate than hierarchical in moderate-text cases	Not preferred when individuals produce a moderate number of texts

Written by The Brieftide · Source: arXiv
