Multimodal AI · July 3, 2026 · 4 min read

DiffusionGemma-26B vs Gemma-4-26B: Radiology report benchmark

DiffusionGemma-26B matched or beat Gemma-4-26B on medical VQA, decoded 3.5–4.4x faster, and enables any-order infill for report drafting.

The Brieftide · July 3, 2026

TL;DR

01 DiffusionGemma-26B matched or beat Gemma-4-26B on medical VQA, decoded 3.5–4.4x faster, and enables any-order infill for report drafting.
02 The paper reports a finetuned model with 3.8B active parameters, with both models finetuned under an identical LoRA recipe.
03 Outputs were scored by a verbosity-robust LLM judge, which the authors highlight when comparing diffusion and autoregressive outputs.

Max Van Puyvelde, Halil Ibrahim Gulluk, Wim Van Criekinge and Olivier Gevaert submitted a paper on 1 Jul 2026 describing Discrete Diffusion Language Models applied to interactive radiology report drafting. They adapted a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmarked it against the same-size autoregressive sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets.

What did the authors build and test?

The authors adapted DiffusionGemma-26B, a mixture-of-experts diffusion language model, and compared it with Gemma-4-26B. Both models were finetuned with an identical LoRA recipe and evaluated on medical visual question answering datasets. A verbosity-robust LLM judge scored the outputs, and the paper reports a finetuned model with 3.8B active parameters.

The paper frames the experiment against the status quo that "medical foundation models, however, remain almost entirely autoregressive." The diffusion variant operates by denoising a token canvas bidirectionally rather than emitting tokens left to right, which changes both decoding dynamics and user interaction patterns.

How did diffusion compare to autoregression?

DiffusionGemma-26B matched or exceeded Gemma-4-26B on all medical visual question answering datasets the authors evaluated; the finetuned model (3.8B active) is described as competitive with frontier vision-language models, and its decoding runs 3.5–4.4x faster.

Beyond raw VQA parity, the diffusion approach provides any-order infill: because the token canvas is denoised bidirectionally, a clinician can fix fragments of a report and have the model fill the text between them. The paper contrasts this with autoregressive decoding, which the authors say is subpar at arbitrary inpainting and therefore less suited to the terse, inconsistent reports often seen across clinicians and institutions.
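The any-order infill idea can be illustrated with a toy example. The sketch below is not the paper's sampler, which this brief does not describe. It assumes a mask-and-unmask style of decoding, where clinician-fixed tokens stay untouched and masked slots are committed a few at a time, most confident first. The names `MASK`, `toy_denoiser`, and `infill` are hypothetical, and the random denoiser only stands in for a real model that would read the whole canvas in both directions.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a not-yet-denoised slot


def toy_denoiser(canvas, vocab, rng):
    """Stand-in for a diffusion LM (assumption): propose a token and a
    confidence for every masked position. A real model would condition on
    the tokens to the left AND right of each slot."""
    proposals = {}
    for pos, tok in enumerate(canvas):
        if tok == MASK:
            proposals[pos] = (rng.choice(vocab), rng.random())
    return proposals


def infill(canvas, vocab, steps=4, seed=0):
    """Fill masked slots over a fixed number of steps, leaving anchors as-is."""
    rng = random.Random(seed)
    canvas = list(canvas)  # work on a copy so the caller's draft is unchanged
    for step in range(steps):
        proposals = toy_denoiser(canvas, vocab, rng)
        if not proposals:
            break  # nothing left to denoise
        remaining_steps = steps - step
        # Ceiling division: spread the remaining masks over the remaining steps,
        # so the final step commits everything that is still masked.
        k = max(1, -(-len(proposals) // remaining_steps))
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _confidence) in ranked[:k]:
            canvas[pos] = tok
    return canvas


# A clinician fixes the first and last sentences; the middle is left open.
draft = ["Heart", "size", "is", "normal", ".",
         MASK, MASK, MASK, MASK, ".",
         "No", "pleural", "effusion", "."]
vocab = ["lungs", "are", "clear", "bilaterally"]  # toy vocabulary
result = infill(draft, vocab)
print(" ".join(result))
```

The point of the sketch is the control flow: anchored tokens are never rewritten, and the open span can be filled in any order rather than strictly left to right. The filled text itself is meaningless here because the denoiser is random.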

Scoring relied on a verbosity-robust LLM judge, which the authors highlight when comparing diffusion and autoregressive outputs. Because both models were finetuned with the identical LoRA recipe, the comparison holds the finetuning setup constant.

Why it matters

Diffusion models that offer any-order infill change how clinicians could interact with report drafting: radiologists can anchor parts of a report and let the model generate context-aware text between the anchors. Faster decoding, cited as a 3.5–4.4x speedup, lowers latency during interactive editing. Together, those two properties address practical issues in clinical reporting workflows, where partial edits and terse phrasing are common.
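To make the latency claim concrete, here is a small worked calculation. The 3.5–4.4x range comes from the paper as reported above; the 10-second baseline is purely hypothetical, and the sketch assumes the wait is dominated by decoding time.

```python
def sped_up_latency(baseline_seconds, speedup):
    """Latency after applying a decoding speedup factor."""
    return baseline_seconds / speedup


baseline = 10.0               # hypothetical autoregressive decode time, in seconds
low_speedup, high_speedup = 3.5, 4.4  # range reported in the paper

slowest = sped_up_latency(baseline, low_speedup)
fastest = sped_up_latency(baseline, high_speedup)
print(f"{fastest:.2f}s to {slowest:.2f}s")  # prints "2.27s to 2.86s"
```

Under those assumptions, a draft that took 10 seconds to decode would take roughly 2.3 to 2.9 seconds.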

The result also challenges the assumption that medical foundation models must remain autoregressive. If diffusion variants match or exceed AR performance on medical VQA while adding interactive drafting features, they create a different design space for medical language and vision-language models.

What to watch

Look for replication and wider evaluations: whether the any-order infill and the reported 3.5–4.4x decoding advantage hold across more clinical datasets, and whether toolmakers embed diffusion-based drafting into radiology workflows. The paper positions a finetuned 3.8B-active model as a near-term benchmark to follow.

Diffusion vs Autoregressive models in the paper

Model	Decoding approach	Total parameters	Active parameters	Decoding speed	Medical VQA result
DiffusionGemma-26B	Diffusion	26B	3.8B (finetuned)	3.5–4.4x faster	Matches or exceeds Gemma-4-26B
Gemma-4-26B	Autoregressive	26B	3.8B (LoRA, identical recipe)	Baseline	Matched or outperformed by DiffusionGemma-26B

Written by The Brieftide · Source: arXiv
