DiffusionGemma-26B vs Gemma-4-26B: Radiology report benchmark
DiffusionGemma-26B matched or beat Gemma-4-26B on medical VQA, decoded 3.5–4.4x faster, and enables any-order infill for report drafting.
TL;DR
- DiffusionGemma-26B matched or beat Gemma-4-26B on medical VQA, decoded 3.5–4.4x faster, and enables any-order infill for report drafting.
- Both models were finetuned under an identical LoRA recipe; the paper reports a finetuned model with 3.8B active parameters.
- Outputs were scored by a verbosity-robust LLM judge, which the authors highlight when comparing diffusion and autoregressive outputs.
Max Van Puyvelde, Halil Ibrahim Gulluk, Wim Van Criekinge and Olivier Gevaert submitted a paper on 1 Jul 2026 describing Discrete Diffusion Language Models applied to interactive radiology report drafting. They adapted a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmarked it against the same-size autoregressive sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets.
What did the authors build and test?
The team adapted DiffusionGemma-26B, a mixture-of-experts diffusion language model, and compared it with Gemma-4-26B. Both models were finetuned with an identical LoRA recipe and evaluated on medical visual question answering datasets. Outputs were scored by a verbosity-robust LLM judge, and the paper reports a finetuned model with 3.8B active parameters.
The paper frames the experiment against the status quo that "medical foundation models, however, remain almost entirely autoregressive." The diffusion variant operates by denoising a token canvas bidirectionally rather than emitting tokens left to right, which changes both decoding dynamics and user interaction patterns.
How did diffusion compare to autoregression?
DiffusionGemma-26B matched or exceeded Gemma-4-26B on all medical visual question answering datasets the authors evaluated; the finetuned model (3.8B active) is described as competitive with frontier vision-language models, and its decoding runs 3.5–4.4x faster.
Beyond raw VQA parity, the diffusion approach provides any-order infill: because the token canvas is denoised bidirectionally, a clinician can fix fragments of a report and have the model fill the text between them. The paper contrasts this with autoregressive decoding, which the authors say is subpar at arbitrary inpainting and therefore less suited to the terse, inconsistent reports often seen across clinicians and institutions.
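The any-order infill idea can be illustrated with a toy sketch. This is not the authors' implementation or DiffusionGemma's decoder: the `infill` loop, the `toy_propose` function, and the `REFERENCE` sentence are hypothetical stand-ins, and the "confidence" score is simply the number of already-committed neighbours. The sketch only shows the interaction pattern: anchored tokens stay fixed, and open positions are committed in confidence order using context from both sides rather than strictly left to right.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # sentinel for a canvas position that is still undecided

Canvas = List[Optional[str]]
Proposer = Callable[[Canvas, int], Tuple[str, float]]


def infill(canvas: Canvas, propose: Proposer, per_step: int = 2):
    """Fill masked slots in confidence order; anchors are never overwritten.

    Returns the completed canvas and the positions committed at each step.
    """
    canvas = list(canvas)
    order: List[List[int]] = []
    while any(tok is MASK for tok in canvas):
        # Score every open position against the whole canvas (left and right context).
        scored = [(propose(canvas, i), i) for i, tok in enumerate(canvas) if tok is MASK]
        # Stable sort: highest confidence first, ties keep index order.
        scored.sort(key=lambda item: item[0][1], reverse=True)
        chosen = scored[:per_step]
        for (token, _conf), i in chosen:
            canvas[i] = token
        order.append([i for _, i in chosen])
    return canvas, order


# Hypothetical stand-in for a trained denoiser: it reads the answer from a fixed
# sentence and is more confident when more neighbours are already committed.
REFERENCE = ["No", "focal", "consolidation", "or", "pleural", "effusion", "."]


def toy_propose(canvas: Canvas, i: int) -> Tuple[str, float]:
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
    confidence = sum(canvas[j] is not MASK for j in neighbours)
    return REFERENCE[i], float(confidence)


# The clinician fixes the opening word and the ending; the middle is left open.
draft = ["No", MASK, MASK, MASK, MASK, "effusion", "."]
filled, order = infill(draft, toy_propose)
print(" ".join(filled))
print(order)
```

In this toy run the positions next to the anchors (indices 1 and 4) are committed first and the middle (indices 2 and 3) second, so the gap is closed from both ends in two steps. A left-to-right decoder has no equivalent move, since it cannot condition on the anchored ending while writing the middle.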
The experiments used a verbosity-robust LLM judge for scoring, which the authors highlight when comparing diffusion and autoregressive outputs. The identical LoRA recipe kept the comparison controlled, since both models were finetuned in the same way.
Why it matters
Diffusion models that offer any-order infill change how clinicians could interact with report drafting: radiologists can anchor parts of a report and let the model generate context-aware text between anchors. Faster decoding, reported as a 3.5–4.4x speedup, lowers latency during interactive editing. Together, these two properties address practical issues in clinical reporting workflows, where partial edits and terse phrasing are common.
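To make the latency claim concrete, here is a back-of-the-envelope calculation. Only the 3.5–4.4x range comes from the paper as reported above. The 10-second baseline and the `latency_range` helper are made up for illustration, and the sketch assumes the speedup applies uniformly to the whole decoding time.

```python
def latency_range(baseline_s: float, speedups=(3.5, 4.4)):
    """Return (best, worst) latency in seconds for a range of speedup factors."""
    slowest, fastest = min(speedups), max(speedups)
    return baseline_s / fastest, baseline_s / slowest


# 10 s is a hypothetical autoregressive decoding time, not a figure from the paper.
best, worst = latency_range(10.0)
print(f"{best:.2f}s to {worst:.2f}s")  # prints "2.27s to 2.86s"
```

Under those assumptions, a draft that takes 10 seconds to decode autoregressively would take roughly 2.3 to 2.9 seconds, or about 23% to 29% of the baseline.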
The result also challenges the assumption that medical foundation models must remain autoregressive. If diffusion variants match or exceed AR performance on medical VQA while adding interactive drafting features, they create a different design space for medical language and vision-language models.
What to watch
Look for replication and wider evaluations: whether the any-order infill and the reported 3.5–4.4x decoding advantage hold across more clinical datasets, and whether toolmakers embed diffusion-based drafting into radiology workflows. The paper positions a finetuned 3.8B-active model as a near-term benchmark to follow.
| Model | Approach | Parameters | Active parameters | Decoding speed | Medical VQA result |
|---|---|---|---|---|---|
| DiffusionGemma-26B | Diffusion | 26B | 3.8B (finetuned) | 3.5–4.4x faster | Matches or exceeds Gemma-4-26B |
| Gemma-4-26B | Autoregressive | 26B | 3.8B (LoRA, identical recipe) | Baseline | Matched or outperformed by DiffusionGemma-26B |
Written by The Brieftide · Source: arXiv