Multimodal AI · June 16, 2026 · 5 min read

LLMs identify CIUs in aphasic discourse: few-shot F1 0.776-0.817

Few-shot prompting let Llama-3.1-8B, Qwen2.5-7B and Mistral-7B reach mean F1s from 0.776 to 0.817 on 16 Cat Rescue transcripts.

The Brieftide · June 16, 2026

TL;DR

01 Few-shot prompting let Llama-3.1-8B, Qwen2.5-7B and Mistral-7B reach mean F1s from 0.776 to 0.817 on 16 Cat Rescue transcripts.
02 The study, submitted 10 Apr 2026, benchmarked four public instruction-tuned LLMs on sixteen Cat Rescue transcripts spanning control, mild, moderate and severe aphasia.
03 The authors measured accuracy, precision, recall, F1 and Cohen's kappa against consensus human labels, and reported results with five tables and four figures.

Instruction-tuned large language models can identify Correct Information Units, or CIUs, in aphasic picture-description transcripts when given few-shot examples, achieving mean few-shot F1 scores between 0.776 and 0.817 across three viable models. The study, submitted 10 Apr 2026, benchmarked four public instruction-tuned LLMs on sixteen Cat Rescue transcripts spanning control, mild, moderate and severe aphasia.

What did the researchers test and how?

The paper tested token-level CIU classification on sixteen picture-description transcripts annotated to the Nicholas and Brookshire (1993) CIU standard, comparing four instruction-tuned LLMs under zero-shot and two few-shot prompting conditions across five stratified random seeds. The authors measured accuracy, precision, recall, F1 and Cohen's kappa against consensus human labels, and reported results with five tables and four figures.
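
The five reported metrics can all be derived from one confusion matrix over binary token labels. The following is a minimal sketch of that arithmetic, not the authors' evaluation code: the function name `token_metrics` and the toy label vectors are assumptions made for illustration, with 1 meaning "CIU" and 0 meaning "not a CIU".

```python
from typing import Dict, Sequence


def token_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and Cohen's kappa for binary token labels.

    gold and pred are equal-length sequences of 0/1 labels (1 = CIU).
    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and the same length")

    # Confusion-matrix counts.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    n = len(gold)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = (((tp + fp) / n) * ((tp + fn) / n)
                + ((tn + fn) / n) * ((tn + fp) / n))
    kappa = ((p_observed - p_chance) / (1 - p_chance)
             if p_chance != 1 else 0.0)

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Toy labels (invented): the prediction marks two non-CIU tokens as CIUs.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 1, 1, 0, 1, 1, 1]
metrics = token_metrics(gold, pred)
```

In this toy case recall is perfect while precision drops, and kappa sits well below F1 because it discounts agreement expected by chance. That is one reason a study might report both.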

Researchers used the Cat Rescue stimulus to elicit discourse and sampled speakers across four severity strata: control, mild, moderate and severe aphasia. The experimental design contrasted zero-shot prompting, which the paper calls insufficient, with few-shot prompting that used either fixed global or per-chunk local example selection; the paper found no significant difference between those two example-selection strategies.
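
To make the two example-selection strategies concrete, here is a rough sketch of how a few-shot prompt might be assembled under each. Everything in it is an assumption: the instruction wording, the example format, the function names, and the use of random sampling for the per-chunk variant. This brief does not describe the paper's actual prompts or its criterion for local selection, and the example utterances are invented, not drawn from the Cat Rescue data.

```python
import random
from typing import List, Tuple

# (transcript chunk, space-separated 0/1 labels); hypothetical example format.
Example = Tuple[str, str]

# Assumed instruction text, not the paper's prompt.
INSTRUCTION = (
    "Label each token 1 if it is a Correct Information Unit (CIU), else 0. "
    "Return one label per token, separated by spaces."
)


def select_global(pool: List[Example], k: int, seed: int) -> List[Example]:
    """Fixed global selection: draw k examples once per seed, reuse for all chunks."""
    return random.Random(seed).sample(pool, k)


def select_local(pool: List[Example], k: int, seed: int,
                 chunk_index: int) -> List[Example]:
    """Per-chunk local selection: draw a fresh set of k examples for each chunk.

    Random re-sampling is an assumed stand-in for whatever criterion the paper used.
    """
    return random.Random(f"{seed}-{chunk_index}").sample(pool, k)


def build_prompt(examples: List[Example], chunk: str) -> str:
    """Instruction, then labeled examples, then the unlabeled target chunk."""
    parts = [INSTRUCTION, ""]
    for text, labels in examples:
        parts += [f"Tokens: {text}", f"Labels: {labels}", ""]
    parts += [f"Tokens: {chunk}", "Labels:"]
    return "\n".join(parts)


# Invented toy pool for illustration only.
pool: List[Example] = [
    ("the cat is in the tree", "1 1 1 1 1 1"),
    ("um the girl is crying", "0 1 1 1 1"),
    ("the man climbed the ladder", "1 1 1 1 1"),
    ("well you know the dog barked", "0 0 0 1 1 1"),
]

shared = select_global(pool, k=2, seed=0)
prompt_global = build_prompt(shared, "the fireman came")
prompt_local = build_prompt(select_local(pool, 2, seed=0, chunk_index=3),
                            "the fireman came")
```

Under the global strategy every chunk in a run sees the same examples, so prompt differences come only from the target chunk; under the local strategy the examples change with each chunk.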

How did the models perform?

Three models produced competitive few-shot results, with mean few-shot F1 scores ranging from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B and Mistral-7B, while Phi-3-mini was unstable and unreliable. Viable models showed high recall but lower precision, which the authors interpret as systematic over-classification of tokens as CIUs.

Zero-shot prompting failed to reach acceptable performance across models in this task, while few-shot prompting yielded substantial gains. The evaluation used five stratified random seeds to test robustness. Performance varied by discourse severity: the weakest results occurred with more severe aphasia, indicating the models struggle more as communicative impairment increases.

Why it matters

Automating CIU identification addresses a practical bottleneck: CIU scoring is time-intensive and requires trained human raters. The paper shows that few-shot prompting can produce competitive token-level CIU classification without gradient-based task training, suggesting a viable human-in-the-loop workflow for discourse assessment systems. High recall paired with lower precision means models could help surface candidate CIUs for human review, but their agreement with human annotation is still insufficient for fully autonomous scoring.

What to watch

Look for follow-up work that reports per-model numeric breakdowns by severity and that tests calibration to reduce over-classification, and for studies that evaluate whether additional few-shot examples or graded fine-tuning closes the gap with human raters. A concrete next milestone would be public release of per-model F1, precision and recall by severity stratum; the current paper gives the aggregate few-shot F1 range and flags the instability of Phi-3-mini.

Few-shot vs zero-shot performance by model

Item	Llama-3.1-8B	Qwen2.5-7B	Mistral-7B	Phi-3-mini
Mean few-shot F1	within 0.776–0.817	within 0.776–0.817	within 0.776–0.817	unstable / not reliable
Zero-shot performance	insufficient	insufficient	insufficient	insufficient
Tendency (recall vs precision)	high recall, lower precision	high recall, lower precision	high recall, lower precision	unstable
Notes	viable in few-shot	viable in few-shot	viable in few-shot	did not yield reliable performance

Written by The Brieftide · Source: arXiv
