Learning Unmasking Policies for dLLMs: RL Sampling Paper July 2026
Apple-affiliated researchers train reinforcement-learning policies that map token confidences to unmasking decisions.
TL;DR
- Apple-affiliated researchers train reinforcement-learning policies that map token confidences to unmasking decisions.
- The experiments show these learned policies match state-of-the-art heuristics in semi-autoregressive (block) generation and outperform them in the full-diffusion setting.
- In this setup the dLLM is treated as the environment; the policy observes token confidence signals and outputs which tokens to reveal at each diffusion step.
Apple-affiliated researchers published "Learning Unmasking Policies for Diffusion Language Models," a paper presented at ICML and published in July 2026. It trains reinforcement-learning policies to choose which tokens to unmask during diffusion decoding. In the reported experiments, the learned policies match state-of-the-art heuristics in semi-autoregressive (block) generation and outperform them in the full-diffusion setting.
What did the paper do and what were the core findings?
The paper formalizes masked diffusion sampling as a Markov decision process with the diffusion large language model, or dLLM, acting as the environment, and trains lightweight policies with reinforcement learning to select tokens to unmask at each diffusion step. The authors report that their RL-trained policies match the performance of state-of-the-art heuristics when used with semi-autoregressive (block) generation, and outperform those heuristics in the full-diffusion setting.
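A minimal sketch of the MDP framing described above, in Python. Everything in it is illustrative rather than taken from the paper: `toy_dllm` is a hypothetical stand-in that emits random tokens and confidences, `MASK` and `rollout` are made-up names, and no reward is computed because the source does not say how the paper defines it. The state is the partially masked sequence, the observation is the confidence at each masked position, and the action is the set of positions to reveal.

```python
import random

MASK = None  # hypothetical marker for a position that is still masked


def toy_dllm(tokens, rng):
    """Stand-in for a dLLM forward pass (assumption, not a real model).

    Returns one (predicted_token, confidence) pair per position.
    """
    return [(rng.randrange(100), rng.random()) for _ in tokens]


def rollout(policy, length=8, seed=0):
    """Run one episode of masked-diffusion sampling as an MDP.

    `policy` maps {masked_position: confidence} to a set of positions
    to unmask. Returns the final tokens and the number of steps taken.
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps = 0
    while any(t is MASK for t in tokens):
        preds = toy_dllm(tokens, rng)  # environment reacts to the state
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        confidences = {i: preds[i][1] for i in masked}  # observation
        action = set(policy(confidences)) & set(masked)  # ignore invalid picks
        if not action:
            # Assumed safeguard: always reveal at least one token per step.
            action = {max(masked, key=confidences.get)}
        for i in action:
            tokens[i] = preds[i][0]
        steps += 1
    return tokens, steps


# A policy that reveals everything finishes in one step; a policy that
# never volunteers a token falls back to one token per step.
_, fast_steps = rollout(lambda conf: set(conf))
_, slow_steps = rollout(lambda conf: set())
print(fast_steps, slow_steps)
```

The step count returned here is one ingredient an RL reward could plausibly trade off against output quality, but that pairing is an assumption, not the paper's stated objective.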
The work positions sampling strategy as a critical design choice for dLLMs, noting that heuristic approaches such as confidence thresholding can improve sample quality and token throughput relative to random unmasking but require manual tuning and see degraded performance with larger block sizes.
How does the learned unmasking policy work?
The method casts masked diffusion sampling as an MDP and optimizes it with reinforcement learning. The policy is lightweight: a single-layer transformer that maps the dLLM's token confidences to unmasking decisions. In this setup the dLLM is treated as the environment; the policy observes token confidence signals and outputs which tokens to reveal at each diffusion step.
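To make the shape of such a policy concrete, here is a rough sketch of one self-attention layer that reads per-token confidences and emits a per-token unmask probability. It is an assumption-heavy illustration, not the paper's architecture: the source says only "single-layer transformer", so the embedding of a scalar confidence, the single attention head, the omitted MLP and layer norm, the sigmoid head, and all names (`init_params`, `unmask_probs`, `sample_action`) are hypothetical. The weights are random and untrained.

```python
import math
import random


def init_params(d=4, seed=0):
    """Random, untrained weights for a toy attention layer of width d."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"embed": mat(1, d)[0], "wq": mat(d, d), "wk": mat(d, d),
            "wv": mat(d, d), "out": mat(1, d)[0], "bias": 0.0}


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def unmask_probs(confidences, params):
    """Map a list of confidences to per-position unmask probabilities."""
    d = len(params["embed"])
    # Embed each scalar confidence as a d-dimensional vector.
    h = [[c * e for e in params["embed"]] for c in confidences]
    q = [matvec(params["wq"], x) for x in h]
    k = [matvec(params["wk"], x) for x in h]
    v = [matvec(params["wv"], x) for x in h]
    probs = []
    for i in range(len(h)):
        # Scaled dot-product attention from position i to all positions.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(d) for kj in k]
        attn = softmax(scores)
        ctx = [sum(w * vj[t] for w, vj in zip(attn, v)) for t in range(d)]
        hidden = [x + c for x, c in zip(h[i], ctx)]  # residual connection
        logit = sum(o * x for o, x in zip(params["out"], hidden)) + params["bias"]
        probs.append(0.5 * (1.0 + math.tanh(logit / 2.0)))  # stable sigmoid
    return probs


def sample_action(confidences, params, rng):
    """Sample an independent unmask/keep-masked decision per position."""
    probs = unmask_probs(confidences, params)
    return [i for i, p in enumerate(probs) if rng.random() < p]


params = init_params()
print(sample_action([0.9, 0.2, 0.6], params, random.Random(1)))
```

Because attention lets each position see every other confidence, a policy of this shape can in principle make joint decisions, which a per-token threshold cannot. Training it would require a policy-gradient method and a reward, neither of which is sketched here.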
The paper contrasts this trained policy with common heuristics. It singles out confidence thresholding, which improves both sample quality and token throughput over random unmasking but requires manual tuning and degrades with larger block sizes. The trained policy removes the need for hand-tuned thresholds by learning a mapping from confidences to actions.
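For comparison, confidence thresholding can be sketched in a few lines. The function name `threshold_unmask`, the default `tau=0.9`, and the fallback that reveals the single most confident token when nothing clears the threshold are all assumptions made for illustration; the source does not give the heuristic's exact form.

```python
def threshold_unmask(confidences, tau=0.9):
    """Pick masked positions whose confidence is at least tau.

    `confidences` maps masked position -> confidence in [0, 1].
    `tau` is the hand-tuned threshold the heuristic depends on.
    """
    chosen = [i for i, c in confidences.items() if c >= tau]
    if not chosen and confidences:
        # Assumed fallback: reveal the most confident token so that
        # decoding always makes progress.
        chosen = [max(confidences, key=confidences.get)]
    return sorted(chosen)


print(threshold_unmask({0: 0.95, 1: 0.40, 2: 0.92}))  # [0, 2]
print(threshold_unmask({0: 0.30, 1: 0.50}))           # [1]
```

The single parameter `tau` is where the manual tuning enters: set it too high and decoding falls back to one token per step, too low and low-confidence tokens are committed in parallel.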
How does this compare to prior dLLM decoding practices?
Prior approaches relied on heuristics such as confidence thresholding and remasking to pick tokens to keep during block-wise generation; those heuristics can save computation but may discard partially decoded tokens and need careful tuning. The authors highlight that remasking in state-of-the-art block-wise dLLMs decodes only the most confident tokens and discards the rest, which wastes computation, motivating learned policies and alternative recycling methods explored in related work.
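The waste described above is easy to see in a toy example. The sketch below assumes a top-k form of remasking, where each step commits the k most confident predictions in a block and throws away the rest; `commit_top_k` and the sample data are hypothetical, and real block-wise dLLMs may select tokens differently.

```python
def commit_top_k(predictions, k):
    """Keep the k most confident predictions; report the rest as discarded.

    `predictions` maps masked position -> (token, confidence).
    """
    ranked = sorted(predictions, key=lambda i: predictions[i][1], reverse=True)
    kept = {i: predictions[i][0] for i in ranked[:k]}
    discarded = ranked[k:]  # these positions are remasked and recomputed later
    return kept, discarded


preds = {0: ("the", 0.97), 1: ("cat", 0.41), 2: ("sat", 0.88), 3: ("on", 0.35)}
kept, discarded = commit_top_k(preds, k=2)
print(kept)       # {0: 'the', 2: 'sat'}
print(discarded)  # [1, 3]
```

Here the model produced four predictions in one forward pass but only two survive; the other two are the computation that the heuristic throws away.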
The paper cites two strands of related research: a July 2, 2026 ICML paper on Residual Context Diffusion Language Models that argues recycling computation from discarded tokens is beneficial, and a January 21, 2026 ICLR paper, DiffuCoder, which explored masked diffusion models for code generation.
Why it matters
Training the sampling policy reframes a brittle, hand-tuned part of the dLLM pipeline as an optimizable component, reducing manual tuning and addressing sensitivity to block size. That matters because diffusion LLMs aim to decode tokens in parallel and promise more efficient inference than autoregressive models; improving token selection can raise sample quality and token throughput while preserving parallelism.
The practical takeaway is that a single-layer transformer policy can learn an effective mapping from token confidences to unmasking actions: in the reported experiments it matches heuristics for semi-autoregressive generation and exceeds them in full-diffusion regimes.
What to watch
Look for follow-up experiments that quantify the computational trade-offs of learned policies versus heuristic remasking, and for code or model releases that show how the single-layer transformer policy integrates into existing dLLM inference stacks. Also watch whether future work measures the policies across larger block sizes and varied generation tasks.
Authors and provenance: the paper, "Learning Unmasking Policies for Diffusion Language Models," lists Metod Jazbec and Theo X. Olausson as equal contributors among other coauthors and appears in the Methods and Algorithms, Speech and Natural Language Processing program at the ICML conference, published July 2026. Some authors note work done while at Apple.
Written by The Brieftide · Source: Apple Machine Learning