June 27, 2026 · 4 min read

ByteDance iLLaDA vs Qwen2.5: diffusion 8B matches base benchmarks

An 8B diffusion model pretrained on 12T tokens, iLLaDA averages 63.9 and narrowly tops Qwen2.5 7B.

The Brieftide · June 27, 2026

TL;DR

01 An 8B diffusion model pretrained on 12T tokens, iLLaDA averages 63.9 and narrowly tops Qwen2.5 7B.
02 ByteDance and researchers at Renmin University released iLLaDA, an 8-billion-parameter diffusion language model pretrained on 12 trillion tokens, on June 27, 2026.
03 Autoregressive models such as GPT and Qwen produce tokens left to right, with each new token conditioned only on previous ones.

ByteDance and researchers at Renmin University released iLLaDA, an 8-billion-parameter diffusion language model pretrained on 12 trillion tokens, on June 27, 2026. At the base-model level, iLLaDA posts an average score of 63.9, edging out the autoregressive Qwen2.5 7B, which scores 63.3.

How does iLLaDA work compared with autoregressive models?

iLLaDA generates text via diffusion rather than autoregressive sampling: it starts from masked tokens and iteratively refines all positions in parallel, allowing bidirectional attention across the sequence. Autoregressive models such as GPT and Qwen produce tokens left to right, with each new token conditioned only on previous ones. The paper positions iLLaDA as a dense 8B model trained from scratch for quality, unlike some diffusion variants that reuse autoregressive backbones.
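The difference in conditioning can be pictured as two attention masks. The snippet below is a minimal illustration, not code from the paper: `causal_mask`, `bidirectional_mask`, and `show` are hypothetical helper names, and real models apply such masks inside their attention layers rather than as Python lists.

```python
def causal_mask(n):
    """Autoregressive pattern: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion-style pattern: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as text: 'x' where attention is allowed, '.' where it is blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print("causal (left to right):")
    print(show(causal_mask(4)))
    print("bidirectional:")
    print(show(bidirectional_mask(4)))
```

The causal mask is lower-triangular, so each token sees only its past; the bidirectional mask is fully filled, which is what lets a diffusion model revise any position using context from both sides.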

Diffusion approaches let every position attend to every other position simultaneously and refine placeholders over multiple passes, a process the authors compare to how image diffusion models shape an image from noise. The team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs.
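The refine-over-multiple-passes idea can be sketched as a toy decoding loop. This is an illustration under loud assumptions, not iLLaDA's actual sampler: `toy_predict` is a hypothetical stand-in for a trained denoiser (it simply looks up the right answer and attaches a random confidence), and keeping the most confident predictions at each step is an assumed schedule that the article does not confirm.

```python
import random

MASK = "<mask>"


def toy_predict(tokens, target, rng):
    """Hypothetical denoiser: for each masked position, return (token, confidence).

    A real model would predict tokens from context; this stand-in cheats by
    reading them from `target` and drawing a random confidence.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(tokens) if tok == MASK}


def diffusion_decode(target, steps=4, seed=0):
    """Start fully masked and unmask a share of the positions on every pass."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(steps):
        preds = toy_predict(tokens, target, rng)
        if not preds:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        k = -(-len(preds) // (steps - step))  # ceiling division
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    _, history = diffusion_decode("the cat sat on the mat".split(), steps=3)
    for state in history:
        print(" ".join(state))
```

Each printed line shows the whole sequence at one pass: positions are filled in any order rather than left to right, which is the contrast with autoregressive sampling that the paragraph above describes.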

How does iLLaDA perform against Qwen2.5 and other models?

iLLaDA beats its diffusion predecessor LLaDA and the fine-tuned diffusion competitor Dream 7B on most benchmarks, and at the base level it slightly outperforms Qwen2.5 7B on the average score: 63.9 versus 63.3. The authors report a sharp jump for iLLaDA-Base over LLaDA, including a 21.6-point increase on the BBH reasoning test (71.3 versus 49.7).

The paper’s benchmark table lists specific results by task and model. Selected figures: MMLU 74.8 for iLLaDA versus 71.9 for Qwen2.5 7B; BBH 71.3 versus 63.9; GSM8K 81.9 versus 78.9. On average iLLaDA scores 63.9, LLaDA 51.1, Dream 7B 61.4, and Qwen2.5 7B 63.3.

A gap appears at the instruct level after alignment and fine-tuning. iLLaDA-Instruct scores 67.1 while Qwen2.5 7B Instruct reaches 77.1, a 10-point difference driven mostly by math and code benchmarks. The authors attribute the gap to the extra reinforcement learning alignment applied to Qwen2.5, which iLLaDA lacks. They also note that iLLaDA can get stuck in reasoning loops on harder tasks.

Why does this matter?

Diffusion models trained from scratch can reach parity with same-class autoregressive models on many base benchmarks, showing that the generation method alone does not preclude competitive quality. The instruct-level shortfall highlights that alignment and fine-tuning choices remain decisive: the authors credit Qwen2.5's extra reinforcement learning alignment for its substantial lead on instruction-following, code, and math tasks.

This implies that teams choosing diffusion must match not only pretraining scale but also post-training alignment pipelines to compete in instruction-following and production settings.

What to watch

Watch for diffusion models that incorporate reinforcement learning alignment or additional instruction fine-tuning; that is the clearest path shown in the paper for closing the instruct gap. Also follow comparisons that hold backbone size and benchmark variants identical, since the authors note direct numerical comparisons are difficult when weight classes and benchmark versions differ.

Benchmark comparison table

Below are the model scores pulled from the paper’s table.

Model / Task	iLLaDA 8B	LLaDA 8B	Dream 7B	Qwen2.5 7B
Training tokens	12T	2.3T	18T + 0.6T	18T
MMLU	74.8	65.9	69.5	71.9
BBH	71.3	49.7	57.9	63.9
ARC-C	60.8	45.9	59.8	51.5
Hellaswag	76.6	70.5	73.3	79.0
GSM8K	81.9	70.3	77.2	78.9
Math	38.4	31.4	39.6	41.1
HumanEval	50.0	35.4	57.9	56.7
MBPP	57.8	40.0	56.2	63.6
Average	63.9	51.1	61.4	63.3
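As a quick sanity check, the averages and per-task gaps can be recomputed from the table. The sketch below hard-codes the eight task scores listed above; `SCORES`, `mean_score`, `deltas`, and `wins` are illustrative names, and the recomputed means are expected to differ slightly from the reported averages because the table shows rounded values.

```python
TASKS = ["MMLU", "BBH", "ARC-C", "Hellaswag", "GSM8K", "Math", "HumanEval", "MBPP"]

# Scores copied from the table above, in TASKS order.
SCORES = {
    "iLLaDA 8B": [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B": [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B": [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}

# Averages as reported in the table's last row.
REPORTED_AVERAGE = {"iLLaDA 8B": 63.9, "LLaDA 8B": 51.1, "Dream 7B": 61.4, "Qwen2.5 7B": 63.3}


def mean_score(model):
    """Unweighted mean over the eight listed tasks."""
    values = SCORES[model]
    return sum(values) / len(values)


def deltas(model_a, model_b):
    """Per-task difference (model_a minus model_b), rounded to one decimal."""
    return {
        task: round(a - b, 1)
        for task, a, b in zip(TASKS, SCORES[model_a], SCORES[model_b])
    }


def wins(model_a, model_b):
    """Number of tasks on which model_a scores strictly higher than model_b."""
    return sum(a > b for a, b in zip(SCORES[model_a], SCORES[model_b]))


if __name__ == "__main__":
    for model in SCORES:
        print(f"{model}: recomputed {mean_score(model):.3f}, reported {REPORTED_AVERAGE[model]}")
    print("iLLaDA minus Qwen2.5:", deltas("iLLaDA 8B", "Qwen2.5 7B"))
    print("Tasks won by iLLaDA over Qwen2.5:", wins("iLLaDA 8B", "Qwen2.5 7B"), "of", len(TASKS))
```

Run this way, the narrow average lead turns out to be an even split by task: iLLaDA leads Qwen2.5 7B on MMLU, BBH, ARC-C, and GSM8K, while Qwen2.5 7B leads on Hellaswag, Math, HumanEval, and MBPP.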

Written by The Brieftide · Source: The Decoder
