ByteDance iLLaDA vs Qwen2.5: diffusion 8B matches base benchmarks

An 8B diffusion model pretrained on 12T tokens, iLLaDA averages 63.9 and narrowly tops Qwen2.5 7B.

The Brieftide

TL;DR

  • An 8B diffusion model pretrained on 12T tokens, iLLaDA averages 63.9 and narrowly tops Qwen2.5 7B.
  • ByteDance and researchers at Renmin University released iLLaDA on June 27, 2026, an 8 billion parameter diffusion language model pretrained on 12 trillion tokens.
  • Autoregressive models such as GPT and Qwen produce tokens left to right, with each new token conditioned only on previous ones.

ByteDance and researchers at Renmin University released iLLaDA on June 27, 2026, an 8 billion parameter diffusion language model pretrained on 12 trillion tokens. At the base evaluation level iLLaDA posts an average score of 63.9, edging the autoregressive Qwen2.5 7B, which scores 63.3.

How does iLLaDA work compared with autoregressive models?

iLLaDA generates text via diffusion rather than autoregressive sampling: it starts from masked tokens and iteratively refines all positions in parallel, allowing bidirectional attention across the sequence. Autoregressive models such as GPT and Qwen produce tokens left to right, with each new token conditioned only on previous ones. The paper positions iLLaDA as a dense 8B model trained from scratch for quality, unlike some diffusion variants that reuse autoregressive backbones.
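
The difference in what each token may condition on can be pictured as an attention-visibility mask. The following is a minimal illustration in plain Python, not code from the paper; the function names are hypothetical and the masks only show which positions are visible, not how either model computes attention.

```python
def causal_mask(n):
    """Autoregressive view: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion-style view: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as text: 'x' means visible, '.' means hidden."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print("causal (left to right):")
    print(render(causal_mask(4)))
    print("bidirectional:")
    print(render(bidirectional_mask(4)))
```

The causal mask is lower-triangular, so a token never sees what comes after it; the bidirectional mask is fully filled, which is what lets a diffusion model revise any position using context on both sides.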

Diffusion approaches let every position attend to every other position simultaneously and refine placeholders over multiple passes, a process the authors compare to how image diffusion models shape an image from noise. The team pretrained iLLaDA on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs.
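
The refinement loop described above can be sketched as a toy. This is an assumption-heavy illustration, not iLLaDA's sampler: the "denoiser" is a hypothetical stand-in that peeks at a known target and assigns random confidences, and the rule of committing an equal share of the most confident guesses on each pass is chosen only for clarity. An autoregressive loop is included for contrast.

```python
import random

MASK = "[MASK]"


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a model: guess every masked position at once.

    Returns {position: (guessed_token, confidence)}. It simply reads the
    known target and draws a random confidence for each guess.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(tokens) if tok == MASK}


def diffusion_decode(target, steps=3, seed=0):
    """Start fully masked, then commit the most confident guesses on each pass."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(steps):
        guesses = toy_denoiser(tokens, target, rng)
        if not guesses:
            break
        # Spread the remaining masked positions over the remaining passes
        # (ceiling division), so the final pass fills whatever is left.
        k = -(-len(guesses) // (steps - step))
        keep = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:k]
        for i in keep:
            tokens[i] = guesses[i][0]
        history.append(list(tokens))
    return history


def autoregressive_decode(target):
    """Contrast case: append one token at a time, strictly left to right."""
    tokens = []
    history = [list(tokens)]
    for tok in target:
        tokens.append(tok)
        history.append(list(tokens))
    return history


if __name__ == "__main__":
    target = "the cat sat on the mat".split()
    for state in diffusion_decode(target):
        print(" ".join(state))
```

In this toy, six tokens are filled in three passes in arbitrary positions, while the autoregressive loop needs six steps and always fills the leftmost open slot. A real diffusion model would predict tokens from the partially masked sequence rather than from a known answer.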

How does iLLaDA perform against Qwen2.5 and other models?

iLLaDA beats its diffusion predecessor LLaDA on every listed benchmark and the fine-tuned diffusion competitor Dream 7B on six of eight, and it slightly outperforms Qwen2.5 7B on the base-level average: 63.9 versus 63.3. The authors report a sharp jump for iLLaDA-Base over LLaDA, including a 21.6 point increase on the BBH reasoning test (71.3 versus 49.7).

The paper’s benchmark table lists specific results by task and model. Selected figures: MMLU 74.8 for iLLaDA versus 71.9 for Qwen2.5 7B; BBH 71.3 versus 63.9; GSM8K 81.9 versus 78.9. On average iLLaDA scores 63.9, LLaDA 51.1, Dream 7B 61.4, and Qwen2.5 7B 63.3.

A gap appears at the instruct level after alignment and fine-tuning. iLLaDA-Instruct scores 67.1 while Qwen2.5 7B Instruct hits 77.1, with most of the difference driven by math and code benchmarks. The authors attribute the instruct gap to the extra reinforcement learning alignment applied to Qwen2.5, which iLLaDA lacks. They also note iLLaDA can get stuck in reasoning loops on harder tasks.

Why does this matter?

iLLaDA shows that a diffusion model trained from scratch can reach rough parity with a same-class autoregressive model on base benchmarks, so the generation method alone does not preclude competitive quality. The instruct-level shortfall suggests that alignment and fine-tuning choices remain decisive: the authors attribute Qwen2.5's substantial lead in instruction-following and code/math tasks to extra reinforcement learning alignment that iLLaDA lacks.

This implies that teams choosing diffusion must match not only pretraining scale but also post-training alignment pipelines to compete in instruction-following and production settings.

What to watch

Watch for diffusion models that incorporate reinforcement learning alignment or additional instruction fine-tuning; that is the clearest path shown in the paper for closing the instruct gap. Also follow comparisons that hold backbone size and benchmark variants identical, since the authors note direct numerical comparisons are difficult when weight classes and benchmark versions differ.

Benchmark comparison table

Below are the model scores pulled from the paper’s table.

| Benchmark | iLLaDA 8B | LLaDA 8B | Dream 7B | Qwen2.5 7B |
|---|---|---|---|---|
| Training tokens | 12T | 2.3T | 18T + 0.6T | 18T |
| MMLU | 74.8 | 65.9 | 69.5 | 71.9 |
| BBH | 71.3 | 49.7 | 57.9 | 63.9 |
| ARC-C | 60.8 | 45.9 | 59.8 | 51.5 |
| HellaSwag | 76.6 | 70.5 | 73.3 | 79.0 |
| GSM8K | 81.9 | 70.3 | 77.2 | 78.9 |
| MATH | 38.4 | 31.4 | 39.6 | 41.1 |
| HumanEval | 50.0 | 35.4 | 57.9 | 56.7 |
| MBPP | 57.8 | 40.0 | 56.2 | 63.6 |
| Average | 63.9 | 51.1 | 61.4 | 63.3 |
Benchmark scores for iLLaDA, LLaDA, Dream and Qwen2.5
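
As a quick sanity check on the figures, the reported averages can be compared with a plain unweighted mean of the eight benchmark scores. Treating the average as an unweighted mean is an assumption here, not something the article states, and the helper names are hypothetical; the scores are copied from the table.

```python
# Base-model scores copied from the benchmark table, in this order:
# MMLU, BBH, ARC-C, HellaSwag, GSM8K, MATH, HumanEval, MBPP.
SCORES = {
    "iLLaDA 8B": [74.8, 71.3, 60.8, 76.6, 81.9, 38.4, 50.0, 57.8],
    "LLaDA 8B": [65.9, 49.7, 45.9, 70.5, 70.3, 31.4, 35.4, 40.0],
    "Dream 7B": [69.5, 57.9, 59.8, 73.3, 77.2, 39.6, 57.9, 56.2],
    "Qwen2.5 7B": [71.9, 63.9, 51.5, 79.0, 78.9, 41.1, 56.7, 63.6],
}

# Averages as reported in the table.
REPORTED = {"iLLaDA 8B": 63.9, "LLaDA 8B": 51.1, "Dream 7B": 61.4, "Qwen2.5 7B": 63.3}

BBH_INDEX = 1  # position of BBH in each score list


def unweighted_mean(values):
    """Assumed averaging rule: simple mean over the eight benchmarks."""
    return sum(values) / len(values)


def bbh_gain(new_model, old_model):
    """Point difference on BBH between two models."""
    return SCORES[new_model][BBH_INDEX] - SCORES[old_model][BBH_INDEX]


if __name__ == "__main__":
    for model, values in SCORES.items():
        print(f"{model}: computed {unweighted_mean(values):.2f}, reported {REPORTED[model]}")
    print(f"BBH gain over LLaDA: {bbh_gain('iLLaDA 8B', 'LLaDA 8B'):.1f}")
```

Under this assumption the computed means agree with the reported averages to within rounding, and the BBH difference between iLLaDA and LLaDA comes out to the 21.6 points quoted above. The iLLaDA mean sits almost exactly on a rounding boundary, so the last digit depends on how it is rounded.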

Written by The Brieftide · Source: The Decoder
