LLMs identify CIUs in aphasic discourse: few-shot F1 0.776-0.817
Few-shot prompting let Llama-3.1-8B, Qwen2.5-7B and Mistral-7B reach mean F1s from 0.776 to 0.817 on 16 Cat Rescue transcripts.
TL;DR
- Few-shot prompting let Llama-3.1-8B, Qwen2.5-7B and Mistral-7B reach mean F1s from 0.776 to 0.817 on 16 Cat Rescue transcripts.
- The study, submitted 10 Apr 2026, benchmarked four public instruction-tuned LLMs on sixteen Cat Rescue transcripts spanning control, mild, moderate and severe aphasia.
- The authors measured accuracy, precision, recall, F1 and Cohen's kappa against consensus human labels, and reported results with five tables and four figures.
Instruction-tuned large language models can identify Correct Information Units, or CIUs, in aphasic picture-description transcripts when given few-shot examples, achieving mean few-shot F1 scores between 0.776 and 0.817 across three viable models. The study, submitted 10 Apr 2026, benchmarked four public instruction-tuned LLMs on sixteen Cat Rescue transcripts spanning control, mild, moderate and severe aphasia.
What did the researchers test and how?
The paper tested token-level CIU classification on sixteen picture-description transcripts annotated to the Nicholas and Brookshire (1993) CIU standard, comparing four instruction-tuned LLMs under zero-shot and two few-shot prompting conditions across five stratified random seeds. The authors measured accuracy, precision, recall, F1 and Cohen's kappa against consensus human labels, and reported results with five tables and four figures.
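For readers who want the metrics spelled out, here is a minimal sketch of how the five scores can be computed for one transcript from binary token labels (1 = CIU, 0 = non-CIU). The function name `token_metrics` and the toy label sequences are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Dict, Sequence


def token_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Score predicted CIU labels against human labels (1 = CIU, 0 = non-CIU)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    # Confusion-matrix counts over tokens.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    n = len(gold)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Toy example (made-up labels): the model marks two extra tokens as CIUs.
gold_labels = [1, 1, 1, 0, 0, 0, 1, 0]
pred_labels = [1, 1, 1, 1, 1, 0, 1, 0]
print(token_metrics(gold_labels, pred_labels))
```

On this toy input, recall is 1.0 while precision is about 0.667, giving an F1 of 0.8 and a kappa of 0.5. That is the same shape of result, high recall with lower precision, that the paper reports for the viable models.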
Researchers used the Cat Rescue stimulus to elicit discourse and sampled speakers across four severity strata: control, mild, moderate and severe aphasia. The experimental design contrasted zero-shot prompting, which the paper calls insufficient, with few-shot prompting that used either fixed global or per-chunk local example selection; the paper found no significant difference between those two example-selection strategies.
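The difference between the two example-selection strategies can be sketched as follows. This is an illustration under stated assumptions: the prompt wording, the helper names (`select_global`, `select_local`, `build_prompt`), the word-overlap similarity used for local selection and the toy examples are all hypothetical, and the paper's actual prompts and selection rule may differ.

```python
from typing import List, Tuple

# An example pairs a transcript chunk with its space-separated token labels.
Example = Tuple[str, str]


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two chunks (assumed similarity)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_global(pool: List[Example], k: int) -> List[Example]:
    """Fixed global selection: the same k examples for every chunk."""
    return pool[:k]


def select_local(pool: List[Example], chunk: str, k: int) -> List[Example]:
    """Per-chunk local selection: the k examples most similar to this chunk."""
    ranked = sorted(pool, key=lambda ex: word_overlap(ex[0], chunk), reverse=True)
    return ranked[:k]


def build_prompt(examples: List[Example], chunk: str) -> str:
    """Assemble a few-shot prompt; the instruction text is a placeholder."""
    parts = ["Label each token as CIU (1) or non-CIU (0)."]
    for text, labels in examples:
        parts.append(f"Transcript: {text}\nLabels: {labels}")
    parts.append(f"Transcript: {chunk}\nLabels:")
    return "\n\n".join(parts)


# Made-up example pool and target chunk, for illustration only.
pool = [
    ("the cat is in the tree", "1 1 1 1 1 1"),
    ("um the uh dog barks", "0 1 0 1 1"),
    ("the man climbs the ladder", "1 1 1 1 1"),
]
chunk = "the dog um barks at the cat"

global_prompt = build_prompt(select_global(pool, 1), chunk)
local_prompt = build_prompt(select_local(pool, chunk, 1), chunk)
print(local_prompt)
```

With this toy pool, global selection always shows the first example, while local selection picks the dog example because it shares the most words with the target chunk. The paper found no significant difference between the two strategies.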
How did the models perform?
Three models produced competitive few-shot results, with mean few-shot F1 scores ranging from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B and Mistral-7B, while Phi-3-mini was unstable and unreliable. Viable models showed high recall but lower precision, which the authors interpret as systematic over-classification of tokens as CIUs.
Zero-shot prompting failed to reach acceptable performance across models in this task, while few-shot prompting yielded substantial gains. The evaluation used five stratified random seeds to test robustness. Performance varied by discourse severity: the weakest results occurred with more severe aphasia, indicating the models struggle more as communicative impairment increases.
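The summary does not spell out what "stratified random seeds" means in practice. One plausible reading is that, for each seed, few-shot examples are drawn so that every severity stratum is represented. The sketch below assumes that reading; the transcript IDs, the even four-per-stratum split and the function name `stratified_draw` are hypothetical.

```python
import random
from typing import Dict, List


def stratified_draw(by_severity: Dict[str, List[str]], seed: int,
                    per_stratum: int = 1) -> List[str]:
    """Draw per_stratum transcripts from each severity stratum, reproducibly."""
    rng = random.Random(seed)  # local generator, so each seed is repeatable
    chosen: List[str] = []
    for severity in sorted(by_severity):  # fixed order keeps draws deterministic
        chosen.extend(rng.sample(by_severity[severity], per_stratum))
    return chosen


# Hypothetical IDs; an even split of the sixteen transcripts is an assumption.
transcripts = {
    "control": ["c1", "c2", "c3", "c4"],
    "mild": ["mi1", "mi2", "mi3", "mi4"],
    "moderate": ["mo1", "mo2", "mo3", "mo4"],
    "severe": ["s1", "s2", "s3", "s4"],
}

# One draw of example transcripts per seed, five seeds in total.
draws = {seed: stratified_draw(transcripts, seed) for seed in range(5)}
for seed, picked in draws.items():
    print(seed, picked)
```

Repeating the evaluation over several such draws shows whether a model's score depends on which examples it happened to see, which is the kind of robustness check the five seeds are meant to provide.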
Why it matters
Automating CIU identification addresses a practical bottleneck: CIU scoring is time-intensive and requires trained human raters. The paper shows that few-shot prompting can produce competitive token-level CIU classification without gradient-based task training, suggesting a viable human-in-the-loop workflow for discourse assessment systems. High recall paired with lower precision means models could help surface candidate CIUs for human review, but their agreement with human annotation is still insufficient for fully autonomous scoring.
What to watch
Look for follow-up work that reports per-model numeric breakdowns by severity and tests calibration to reduce over-classification. Also look for studies that evaluate whether additional few-shot examples or graded fine-tuning close the gap with human raters. A concrete next milestone would be public release of per-model F1, precision and recall by severity stratum; the current paper gives the aggregate few-shot F1 range and flags the instability of Phi-3-mini.
| Item | Llama-3.1-8B | Qwen2.5-7B | Mistral-7B | Phi-3-mini |
|---|---|---|---|---|
| Mean few-shot F1 | within 0.776–0.817 | within 0.776–0.817 | within 0.776–0.817 | unstable / not reliable |
| Zero-shot performance | insufficient | insufficient | insufficient | insufficient |
| Tendency (recall vs precision) | high recall, lower precision | high recall, lower precision | high recall, lower precision | unstable |
| Notes | viable in few-shot | viable in few-shot | viable in few-shot | did not yield reliable performance |
Written by The Brieftide · Source: arXiv