Multimodal AI · 5 min read

LLMs identify CIUs in aphasic discourse: few-shot F1 0.776-0.817

Few-shot prompting let Llama-3.1-8B, Qwen2.5-7B and Mistral-7B reach mean F1s from 0.776 to 0.817 on 16 Cat Rescue transcripts.

The Brieftide

TL;DR

  • Few-shot prompting let Llama-3.1-8B, Qwen2.5-7B and Mistral-7B reach mean F1s from 0.776 to 0.817 on 16 Cat Rescue transcripts.
  • The study, submitted 10 Apr 2026, benchmarked four public instruction-tuned LLMs on sixteen Cat Rescue transcripts spanning control, mild, moderate and severe aphasia.
  • The authors measured accuracy, precision, recall, F1 and Cohen's kappa against consensus human labels, and reported results with five tables and four figures.

Instruction-tuned large language models can identify Correct Information Units, or CIUs, in aphasic picture-description transcripts when given few-shot examples, achieving mean few-shot F1 scores between 0.776 and 0.817 across three viable models. The study, submitted 10 Apr 2026, benchmarked four public instruction-tuned LLMs on sixteen Cat Rescue transcripts spanning control, mild, moderate and severe aphasia.

What did the researchers test and how?

The paper tested token-level CIU classification on sixteen picture-description transcripts annotated to the Nicholas and Brookshire (1993) CIU standard, comparing four instruction-tuned LLMs under zero-shot and two few-shot prompting conditions across five stratified random seeds. The authors measured accuracy, precision, recall, F1 and Cohen's kappa against consensus human labels, and reported results with five tables and four figures.
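The five metrics named above can be computed from token-level labels with a short function. What follows is a minimal sketch, not the authors' evaluation code: it assumes each token carries a binary label (1 = CIU, 0 = not a CIU), and the function name `token_metrics` and the toy labels are invented for illustration.

```python
from collections import Counter


def token_metrics(gold, pred):
    """Score predicted 0/1 token labels against gold 0/1 labels (1 = CIU)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    counts = Counter(zip(gold, pred))
    tp, fp = counts[(1, 1)], counts[(0, 1)]
    fn, tn = counts[(1, 0)], counts[(0, 0)]
    n = tp + fp + fn + tn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for the agreement expected
    # by chance from each rater's marginal label frequencies.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Toy labels (invented): the "model" marks two non-CIU tokens as CIUs.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 1, 1, 0, 1, 1, 1]
scores = token_metrics(gold, pred)
```

On the toy labels, recall is 1.0 while precision is 5/7, and kappa is well below accuracy because both label sequences are dominated by the CIU class.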

Researchers used the Cat Rescue stimulus to elicit discourse and sampled speakers across four severity strata: control, mild, moderate and severe aphasia. The experimental design contrasted zero-shot prompting, which the paper calls insufficient, with few-shot prompting that used either fixed global or per-chunk local example selection; the paper found no significant difference between those two example-selection strategies.
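The difference between the two example-selection strategies can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's procedure: the brief does not say how local examples are chosen, so Jaccard word overlap is used here as a stand-in, and the prompt wording, the function names and the toy example pool are all hypothetical.

```python
def jaccard(a, b):
    """Word-set overlap between two token lists (ASSUMED similarity measure)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def select_global(pool, k):
    """Fixed global selection: every chunk sees the same first k examples."""
    return pool[:k]


def select_local(pool, chunk_tokens, k):
    """Per-chunk local selection: the k pool examples most similar to the chunk."""
    ranked = sorted(pool, key=lambda ex: jaccard(ex["tokens"], chunk_tokens),
                    reverse=True)
    return ranked[:k]


def build_prompt(examples, chunk_tokens):
    """Assemble a few-shot prompt (hypothetical wording) ending at the open slot."""
    lines = ["Label each token 1 if it is a Correct Information Unit, else 0."]
    for ex in examples:
        lines.append("Tokens: " + " ".join(ex["tokens"]))
        lines.append("Labels: " + " ".join(str(label) for label in ex["labels"]))
    lines.append("Tokens: " + " ".join(chunk_tokens))
    lines.append("Labels:")
    return "\n".join(lines)


# Toy annotated pool (invented tokens and labels, not from the study).
pool = [
    {"tokens": ["the", "cat", "is", "in", "the", "tree"], "labels": [1, 1, 1, 1, 1, 1]},
    {"tokens": ["um", "the", "the", "dog", "barks"], "labels": [0, 1, 0, 1, 1]},
    {"tokens": ["a", "man", "um", "climbs", "a", "ladder"], "labels": [1, 1, 0, 1, 1, 1]},
]
chunk = ["the", "man", "is", "on", "the", "ladder"]

global_prompt = build_prompt(select_global(pool, 2), chunk)
local_prompt = build_prompt(select_local(pool, chunk, 2), chunk)
```

With this pool, the global strategy always shows the first two examples, while the local strategy swaps in the ladder example because it shares more words with the chunk being labeled.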

How did the models perform?

Three models produced competitive few-shot results, with mean few-shot F1 scores ranging from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B and Mistral-7B, while Phi-3-mini was unstable and unreliable. Viable models showed high recall but lower precision, which the authors interpret as systematic over-classification of tokens as CIUs.
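The over-classification reading follows directly from the standard definitions of the metrics. In the sketch below, TP, FP and FN are counts of true-positive, false-positive and false-negative tokens; the notation is conventional and not taken from the paper.

```latex
P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
```

For any model with at least one true positive, R > P holds exactly when FN < FP. High recall with lower precision therefore means the model labels non-CIU tokens as CIUs more often than it misses genuine CIUs.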

Zero-shot prompting failed to reach acceptable performance across models in this task, while few-shot prompting yielded substantial gains. The evaluation used five stratified random seeds to test robustness. Performance varied by discourse severity: the weakest results occurred with more severe aphasia, indicating the models struggle more as communicative impairment increases.
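The brief does not spell out what the stratification covers. One plausible reading is that each seed draws its few-shot example transcripts so that every severity stratum is represented. The sketch below illustrates that reading only; the function name, the transcript IDs and the even four-per-stratum split are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_example_draws(severity_by_id, seeds, per_stratum=1):
    """For each seed, pick `per_stratum` transcript IDs from every severity stratum."""
    by_stratum = defaultdict(list)
    for transcript_id, severity in sorted(severity_by_id.items()):
        by_stratum[severity].append(transcript_id)

    draws = {}
    for seed in seeds:
        rng = random.Random(seed)  # independent, reproducible generator per seed
        chosen = []
        for severity in sorted(by_stratum):
            chosen.extend(rng.sample(by_stratum[severity], per_stratum))
        draws[seed] = chosen
    return draws


# Hypothetical IDs; an even split of the sixteen transcripts is ASSUMED here.
strata = ["control"] * 4 + ["mild"] * 4 + ["moderate"] * 4 + ["severe"] * 4
severity_by_id = {f"t{i:02d}": s for i, s in enumerate(strata)}

draws = stratified_example_draws(severity_by_id, seeds=[0, 1, 2, 3, 4])
```

Each seed then yields one example transcript per stratum, and the remaining transcripts can be scored as held-out data. Averaging metrics over the five draws shows how sensitive a model is to the particular examples it was given.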

Why it matters

Automating CIU identification addresses a practical bottleneck: CIU scoring is time-intensive and requires trained human raters. The paper shows that few-shot prompting can produce competitive token-level CIU classification without gradient-based task training, suggesting a viable human-in-the-loop workflow for discourse assessment systems. High recall paired with lower precision means models could help surface candidate CIUs for human review, but their agreement with human annotation is still insufficient for fully autonomous scoring.

What to watch

Look for follow-up work that reports per-model numeric breakdowns by severity and tests calibration to reduce over-classification. Also watch for studies that evaluate whether additional few-shot examples or graded fine-tuning close the gap with human raters. A concrete next milestone would be public release of per-model F1, precision and recall by severity stratum; the current paper gives the aggregate few-shot F1 range and flags the instability of Phi-3-mini.

Few-shot vs zero-shot performance by model

| Item | Llama-3.1-8B | Qwen2.5-7B | Mistral-7B | Phi-3-mini |
| --- | --- | --- | --- | --- |
| Mean few-shot F1 | within 0.776–0.817 | within 0.776–0.817 | within 0.776–0.817 | unstable / not reliable |
| Zero-shot performance | insufficient | insufficient | insufficient | insufficient |
| Tendency (recall vs precision) | high recall, lower precision | high recall, lower precision | high recall, lower precision | unstable |
| Notes | viable in few-shot | viable in few-shot | viable in few-shot | did not yield reliable performance |

Written by The Brieftide · Source: arXiv
