Multimodal AI · 4 min read

DigenRL: Disaggregated RL for Diffusion Visual LLMs, 1.56–2.10x

DigenRL disaggregates rollout and training for diffusion-based generative LLMs, boosting throughput 1.56–2.10x versus veRL-Omni and GenRL.

The Brieftide

TL;DR

  • DigenRL disaggregates rollout and training for diffusion-based generative LLMs, boosting throughput 1.56–2.10x versus veRL-Omni and GenRL.
  • DigenRL is a disaggregated reinforcement learning framework for diffusion-based generative large language models, submitted to arXiv on 23 Jun 2026.
  • GAP and TSP change how diffusion models are partitioned for parallel execution so rollout and training can overlap more effectively.

DigenRL is a disaggregated reinforcement learning framework for diffusion-based generative large language models, submitted to arXiv on 23 Jun 2026. The system targets the inefficiencies of colocated execution and, in experiments on three hardware testbeds with 16–32 GPUs, DigenRL achieved 1.56–2.10x throughput improvements over veRL-Omni and GenRL.

How does DigenRL speed diffusion RL?

DigenRL speeds up RL training for diffusion-based generative LLMs by placing rollout and training on separate resources and adding three pipeline optimizations. First, generation-axis pipeline (GAP) and time-step parallelism (TSP) enable finer-grained pipelining between rollout and training. Second, an elastic trainer-assisted generation (TAG) method lets trainer GPUs dynamically assist rollouts. Third, a tightly constrained one-step asynchronous strategy puts the bubble at the tail of the pipeline to use.

GAP and TSP change how diffusion models are partitioned for parallel execution so rollout and training can overlap more effectively. TAG allows trainer-side GPU resources to temporarily execute rollout generations when idle, reducing idle time in a disaggregated deployment. The asynchronous constraint reduces synchronization stalls while keeping correctness for the one-step interactions the authors target.
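To make the scheduling idea concrete, here is a minimal Python sketch of a pipeline with bounded weight staleness. It is a toy timing model, not DigenRL's implementation. The function name `makespan`, the `max_staleness` parameter, the invented durations, and the reading of "one-step constrained" as "a rollout may use weights at most one training step old" are all assumptions made for illustration.

```python
def makespan(rollout_times, train_times, max_staleness):
    """Total time to finish all steps in a two-stage rollout/train pipeline.

    Hypothetical model: rollout k needs the weights produced by training
    step k - 1 - max_staleness. max_staleness=0 is fully synchronous;
    max_staleness=1 lets rollout run one step ahead of the trainer.
    """
    rollout_done = []
    train_done = []
    for k, (r, t) in enumerate(zip(rollout_times, train_times)):
        # The rollout pool handles one batch at a time.
        prev_rollout = rollout_done[k - 1] if k >= 1 else 0.0
        # Wait for the newest weights this rollout is allowed to lag behind.
        dep = k - 1 - max_staleness
        weights_ready = train_done[dep] if dep >= 0 else 0.0
        rollout_done.append(max(prev_rollout, weights_ready) + r)

        # Training step k needs its own rollout and the previous train step.
        prev_train = train_done[k - 1] if k >= 1 else 0.0
        train_done.append(max(rollout_done[k], prev_train) + t)
    return train_done[-1]


rollouts = [3, 3, 3, 3]  # invented per-step rollout durations
trains = [2, 2, 2, 2]    # invented per-step training durations
print(makespan(rollouts, trains, max_staleness=0))  # 20.0
print(makespan(rollouts, trains, max_staleness=1))  # 14.0
```

In this toy setup the synchronous schedule leaves each pool idle while the other works, whereas allowing a single step of lag lets the two stages overlap. The staleness cap is what keeps the rollout data close to the current policy.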

How was DigenRL evaluated and compared?

The authors ran experiments on three hardware testbeds using clusters of 16–32 GPUs and four generative models: HunyuanVideo-13B, Wan2.1-14B, FLUX.1-12B, and QwenImage-20B. Across those setups, DigenRL produced throughput improvements in the range 1.56–2.10x compared with state-of-the-art diffusion RL systems veRL-Omni and GenRL. The paper is 14 pages long and includes 18 figures and 1 table documenting the results.

The evaluation emphasizes heterogeneous GPU support and flexible resource allocation, contrasting with veRL-Omni which, the paper notes, relies on colocated execution that couples rollout and training resources and limits independent scaling. The authors position DigenRL to accommodate heterogeneous GPUs and to facilitate more efficient task scheduling in disaggregated architectures.

Why it matters

Disaggregating rollout and training removes the requirement that the same machines handle both tasks, which can free teams to scale compute pools independently and to mix GPU types. The paper’s techniques directly target the common performance problem in disaggregated RL systems: execution bubbles created by mismatched rates of rollout and training. Allowing trainer GPUs to assist rollouts and applying finer-grained pipelining reduces those bubbles, which explains the measured 1.56–2.10x throughput range. For teams training diffusion-oriented generative LLMs, that translates to better utilization of GPU fleets and a clearer path to heterogeneous deployment.
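The bubble argument can be illustrated with a back-of-the-envelope Python model. This is a sketch under loud assumptions, not the paper's method: work is measured in abstract GPU-seconds and is perfectly divisible, trainer GPUs generate rollouts as fast as rollout GPUs, switching roles is free, and the two pools are already overlapped in steady state. The function names and the numbers are hypothetical.

```python
def step_time_disaggregated(rollout_work, train_work, n_rollout, n_train):
    """Steady-state time per step when each pool only does its own job.

    The slower stage sets the pace; the faster pool idles (the bubble).
    """
    return max(rollout_work / n_rollout, train_work / n_train)


def step_time_with_assist(rollout_work, train_work, n_rollout, n_train):
    """Steady-state time per step when idle trainer GPUs help with rollouts.

    Trainers always finish their own training work first, so the step can
    never be shorter than train_work / n_train. Any spare trainer time is
    spent on rollouts, which spreads the total work over every GPU.
    """
    balanced = (rollout_work + train_work) / (n_rollout + n_train)
    return max(train_work / n_train, balanced)


# Invented workload: rollout is three times heavier than training.
base = step_time_disaggregated(120, 40, n_rollout=8, n_train=8)
assisted = step_time_with_assist(120, 40, n_rollout=8, n_train=8)
print(base, assisted, base / assisted)  # 15.0 10.0 1.5
```

When training is the bottleneck instead (for example 40 units of rollout work and 120 of training), both functions return the same value in this model, because the trainers have no idle time to lend.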

What to watch

Check the paper’s Code, Data and Media section and the external toggles listed on the arXiv page such as Hugging Face, DagsHub, Replicate and Hugging Face Spaces for code or demos linked to the submission. Independent reproductions on public testbeds and per-system breakdowns of the reported 1.56–2.10x gains versus veRL-Omni and GenRL will be the next concrete signals to confirm how broadly the improvements apply.

Key experimental facts from the paper
  • Throughput improvement (x): 1.56–2.10
  • Testbed GPUs: 16–32
  • Models used: HunyuanVideo-13B; Wan2.1-14B; FLUX.1-12B; QwenImage-20B
  • Paper length / figures / tables: 14 pages; 18 figures; 1 table

Written by The Brieftide · Source: arXiv
