Estimating Uncertainty in Classifier Performance: LLMs

Kylie Anglin finds that Wald and basic percentile bootstrap intervals often fall short of their nominal coverage, and recommends Agresti-Coull and Wilson intervals instead.

The Brieftide

TL;DR

  • Kylie Anglin finds that Wald and basic percentile bootstrap intervals often fall short of their nominal coverage, and recommends Agresti-Coull and Wilson intervals instead.
  • Kylie Anglin submitted a methods paper to arXiv on 24 Jun 2026 that evaluates how researchers should estimate uncertainty for classifier performance metrics such as recall, precision and F1.
  • Agresti-Coull, Wilson, Clopper-Pearson and a novel pseudo-count regularized bootstrap generally improve accuracy compared with default approaches.

Kylie Anglin submitted a methods paper to arXiv on 24 Jun 2026 that evaluates how researchers should estimate uncertainty for classifier performance metrics such as recall, precision and F1. The paper, arXiv:2606.26422, tests interval methods under conditions common in social science text classification: small to moderate sample sizes, infrequent constructs, and texts nested within individuals.

Which confidence-interval methods perform best for classifier metrics?

Agresti-Coull, Wilson, Clopper-Pearson and a novel pseudo-count regularized bootstrap generally improve accuracy compared with default approaches. Anglin shows that default methods such as the Wald interval and the basic percentile bootstrap are the least accurate, with coverage sometimes far below the nominal 95% level. The pseudo-count regularized bootstrap is highlighted as particularly relevant to calculating F1 scores.
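To make the contrast between these analytic intervals concrete, here is a minimal sketch; it is not code from the paper. It implements the textbook Wald, Wilson and Agresti-Coull formulas for a binomial proportion such as recall, and computes each interval's exact coverage by summing binomial probabilities. The sample size of 40, the count of 37 and the true recall of 0.90 are illustrative assumptions. Clopper-Pearson is left out only because it needs a beta quantile function that the Python standard library does not provide.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval


def wald(k, n, z=Z95):
    """Wald interval: the estimate plus or minus z standard errors."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson(k, n, z=Z95):
    """Wilson score interval."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, centre - half), min(1.0, centre + half)


def agresti_coull(k, n, z=Z95):
    """Agresti-Coull interval: Wald applied after adding z^2 pseudo-observations."""
    n_adj = n + z**2
    p_adj = (k + z**2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def exact_coverage(interval, n, p_true):
    """Probability that the interval contains p_true, summed over every outcome k."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p_true <= hi:
            total += math.comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
    return total


if __name__ == "__main__":
    # Hypothetical validation sample: 40 true positives, 37 of them recovered.
    methods = [("Wald", wald), ("Wilson", wilson), ("Agresti-Coull", agresti_coull)]
    for name, fn in methods:
        lo, hi = fn(37, 40)
        cov = exact_coverage(fn, 40, 0.9)
        print(f"{name:14s} recall CI [{lo:.3f}, {hi:.3f}]  coverage at 0.90: {cov:.3f}")
```

In this setup the Wald interval collapses to zero width whenever every positive case is recovered, which is one reason its coverage suffers when performance is high.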

The paper frames these comparisons through simulations that mimic real-world social science settings: small or moderate labeled datasets and low-prevalence constructs. Under those conditions Anglin finds the common analytic and bootstrap defaults underperform, while the named alternatives produce intervals with better coverage properties.
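For readers unfamiliar with the bootstrap default mentioned above, the following sketch shows what a basic percentile bootstrap for F1 typically looks like. It illustrates the default approach only; it does not implement the paper's pseudo-count regularized variant, whose details are not given here. The labeled sample (100 texts, 8 true positives) and the function names are hypothetical.

```python
import random


def f1_score(pairs):
    """F1 from (truth, prediction) pairs; None if there are no true or predicted positives."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if 2 * tp + fp + fn == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)


def percentile_bootstrap_f1(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Resample texts with replacement and take empirical percentiles of F1."""
    rng = random.Random(seed)
    stats, undefined = [], 0
    for _ in range(n_boot):
        sample = rng.choices(pairs, k=len(pairs))
        value = f1_score(sample)
        if value is None:
            undefined += 1  # resample contained no positives at all
        else:
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi, undefined


if __name__ == "__main__":
    # Hypothetical low-prevalence sample: 6 hits, 2 misses, 3 false alarms, 89 true negatives.
    pairs = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 3 + [(0, 0)] * 89
    lo, hi, undefined = percentile_bootstrap_f1(pairs)
    print(f"F1 = {f1_score(pairs):.3f}, percentile CI [{lo:.3f}, {hi:.3f}], "
          f"undefined resamples: {undefined}")
```

With so few positive cases, each resample's F1 rests on a handful of texts, which gives some intuition for why this default can struggle in small, low-prevalence samples.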

How should nested text data (texts within individuals) be handled?

Adjustment for both effective N and the appropriate degrees of freedom is necessary to produce accurate analytic intervals when texts are nested within individuals. Anglin evaluates bootstrap-based alternatives for clustered data and finds trade-offs: the hierarchical bootstrap is more accurate than the cluster bootstrap when individuals produce a moderate number of texts, but the hierarchical approach becomes overly conservative when individuals produce only a few texts.
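The two resampling schemes can be sketched as follows. This assumes the common definitions: a cluster bootstrap resamples individuals and keeps each individual's texts intact, while a hierarchical (two-stage) bootstrap additionally resamples texts within each drawn individual. The paper's exact procedures may differ, and the data layout (a list of individuals, each a list of (truth, prediction) pairs) is hypothetical.

```python
import random


def cluster_resample(clusters, rng):
    """Cluster bootstrap: draw individuals with replacement, keep their texts as they are."""
    drawn = rng.choices(clusters, k=len(clusters))
    return [text for person in drawn for text in person]


def hierarchical_resample(clusters, rng):
    """Two-stage bootstrap: draw individuals, then redraw texts within each drawn individual."""
    drawn = rng.choices(clusters, k=len(clusters))
    return [text for person in drawn for text in rng.choices(person, k=len(person))]


def recall(pairs):
    """Recall from (truth, prediction) pairs; None if the sample has no true positives."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else None


def bootstrap_interval(clusters, resample, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for recall under the chosen resampling scheme."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        value = recall(resample(clusters, rng))
        if value is not None:
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical nested sample: 12 individuals with 4 to 7 texts each.
    clusters = [
        [(1, 1)] * (i % 3 + 1) + [(1, 0)] * (i % 2) + [(0, 0)] * 3
        for i in range(12)
    ]
    for name, fn in [("cluster", cluster_resample), ("hierarchical", hierarchical_resample)]:
        lo, hi = bootstrap_interval(clusters, fn)
        print(f"{name:12s} bootstrap recall CI [{lo:.3f}, {hi:.3f}]")
```

The second stage is the only difference between the two functions. When individuals contribute only a few texts, that stage resamples from very short lists.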

The simulations therefore recommend explicit correction for the reduced effective sample size that nesting induces, and careful choice of bootstrap technique based on how many texts each individual contributes.
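As a rough illustration of what an effective-sample-size correction can look like, one widely used approximation is the Kish design effect. This is an assumption for illustration; the exact adjustment used in the paper is not described here. With $n$ texts in total, an average of $\bar{m}$ texts per individual, and an intraclass correlation $\rho$ among texts from the same individual:

```latex
\[
\mathrm{DEFF} = 1 + (\bar{m} - 1)\,\rho,
\qquad
n_{\mathrm{eff}} = \frac{n}{\mathrm{DEFF}}
\]
% Illustrative numbers (hypothetical): n = 300 texts from 60 individuals, so \bar{m} = 5, with \rho = 0.2
\[
\mathrm{DEFF} = 1 + (5 - 1)(0.2) = 1.8,
\qquad
n_{\mathrm{eff}} = \frac{300}{1.8} \approx 167
\]
```

Under these hypothetical numbers, 300 nested texts carry roughly the information of 167 independent ones, so an interval computed with $n = 300$ would be too narrow. A common convention, again an assumption here, is to base the degrees of freedom on the number of individuals rather than the number of texts.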

Why it matters

Performance metrics used as evidence of validity are point estimates subject to sampling variation; failing to report their uncertainty, or misestimating it, can lead to misleading conclusions about a model's validity. Anglin's findings matter because many current studies rely on small labeled samples or measure infrequent constructs, situations where common intervals can miss their stated coverage (the paper cites instances well below the nominal 95% level). Better interval methods and correct handling of nesting reduce false confidence and encourage researchers to plan validation sample sizes deliberately.

What to watch

Check the arXiv entry for the paper (arXiv:2606.26422) for linked code and data in the "Code, Data and Media Associated with this Article" section; the availability of those assets will determine how easily the recommended methods can be applied. Also watch whether future classification papers in social science switch from Wald and basic percentile bootstrap intervals to the Agresti-Coull, Wilson, Clopper-Pearson or pseudo-count bootstrap approaches Anglin recommends.

References: Kylie Anglin, "Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data," arXiv:2606.26422, submitted 24 Jun 2026.

Performance of interval methods described in the paper

| Method | Reported accuracy | Guidance |
| --- | --- | --- |
| Wald interval | Least accurate | Avoid when the sample is small or performance is high; coverage can be far below the nominal 95% level |
| Basic percentile bootstrap | Least accurate | Avoid in small to moderate samples |
| Agresti-Coull | Improved accuracy | Recommended alternative to Wald |
| Wilson | Improved accuracy | Recommended alternative to Wald |
| Clopper-Pearson | Improved accuracy | Recommended for analytic intervals |
| Pseudo-count regularized bootstrap | Improved accuracy (especially for F1) | Recommended for F1 calculation |
| Hierarchical bootstrap | More accurate with a moderate number of texts per individual; overly conservative with few texts | Use for nested data with a moderate number of texts per individual |
| Cluster bootstrap | Less accurate than hierarchical in moderate-text cases | Not preferred when individuals produce a moderate number of texts |

Written by The Brieftide · Source: arXiv