ICT improves Qwen2.5 reasoning: top 10% tokens +4.58% pass@4
ICT uses token-level Jensen-Shannon divergence to pick the top 10% unique tokens, raising Qwen2.5 pass@4 by an average of 4.58%.
TL;DR
- ICT uses token-level Jensen-Shannon divergence to pick the top 10% unique tokens, raising Qwen2.5 pass@4 by an average of 4.58%.
- The framework reframes the optimization target from a single scalar uncertainty signal to distributional properties of logits, targeting a small subset of tokens for update.
- The paper grounds the theory in both Shannon entropy and second-order Rényi entropy to explain how selective updates regulate policy concentration and stabilize training.
Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning, submitted to arXiv on 18 Jun 2026 by Xuanzhi Feng and eight coauthors, introduces the Independent Combinatorial Tokens (ICT) framework to address instability in Reinforcement Learning with Verifiable Rewards (RLVR). The paper argues that uniform token updates cause entropy collapse while over-emphasizing Shannon entropy causes entropy explosion, and proposes selective token updates guided by Jensen-Shannon divergence between token-logit distributions.
What is ICT and how does it work?
ICT identifies tokens whose token-logit distributions deviate from the norm using the Jensen-Shannon divergence and focuses updates on those tokens, rather than applying uniform scalar uncertainty objectives. This selective updating treats tokens with distinctive distributional patterns as branching points for exploration, and the authors show theoretically that it simultaneously reduces overall Shannon entropy and controls probability concentration measured by second-order Rényi entropy.
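The selection step can be illustrated with a minimal sketch. This is not the authors' implementation: this brief does not say what each token's distribution is compared against, so the code assumes the reference is the mean distribution over all positions in a sequence, and it ranks positions rather than unique token IDs. The function names and the `top_frac` parameter are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence (in nats) between rows of p and q.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def select_deviating_positions(logits, top_frac=0.10):
    # ASSUMPTION: the reference is the mean distribution over positions;
    # the paper may define the comparison differently.
    probs = softmax(logits)                      # shape (positions, vocab)
    reference = probs.mean(axis=0, keepdims=True)
    scores = js_divergence(probs, reference)     # one score per position
    k = max(1, int(np.ceil(top_frac * len(scores))))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True         # keep the k largest scores
    return mask, scores

# Toy example: 20 positions, vocabulary of 50, one sharply peaked position.
rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 50))
logits[3] += 8.0 * (np.arange(50) == 7)
mask, scores = select_deviating_positions(logits)
```

In a training loop, a mask like this would gate which positions contribute to the policy-gradient loss. How ICT applies the selection inside the loss is not described in this brief.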
How well did ICT perform on Qwen2.5?
The authors tested ICT on Qwen2.5 at three model scales (0.5B, 1.5B, and 7B) across seven benchmarks covering math, commonsense, and Olympiad-level problems. Updating only the top 10% of unique tokens produced an average pass@4 improvement of 4.58% and a maximum gain of 14.9% over three baselines: GRPO, 20-Entropy, and STAPO.
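For readers unfamiliar with the metric, pass@4 is the probability that at least one of four sampled answers to a problem is correct. This brief does not say how the paper computes it. The sketch below uses the unbiased estimator that is common in code-generation evaluation, which estimates pass@k from n samples of which c are correct. The function name is hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimate of pass@k from n samples with c correct:
    # 1 - C(n - c, k) / C(n, k).
    # ASSUMPTION: the paper may compute pass@4 differently.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples for one problem, 2 of them correct.
estimate = pass_at_k(n=8, c=2, k=4)  # 1 - 15/70
```

A benchmark-level pass@4 is then the mean of this per-problem estimate over all problems.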
Why it matters
Selective token updates change the optimization dynamics of RLVR training by preventing two failure modes the paper highlights: convergence to suboptimal, over-concentrated token distributions (entropy collapse) and runaway exploration driven by excessive entropy (entropy explosion). By acting on distributional deviations at the token-logit level, ICT preserves useful exploration while controlling concentration, which addresses a core instability in prior scalar-entropy approaches.
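The two quantities the paper's theory relies on can be made concrete with their standard definitions: Shannon entropy, H(p) = -Σ p_i log p_i, and second-order Rényi entropy, H_2(p) = -log Σ p_i². The sketch below is purely illustrative. The two toy distributions are made up and the function names are hypothetical. It shows how both measures shrink as probability mass concentrates on one token, which is the direction of entropy collapse, and how both reach their maximum for a uniform distribution.

```python
import numpy as np

def shannon_entropy(p):
    # H(p) = -sum_i p_i log p_i, in nats; zero-probability entries contribute 0.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def renyi2_entropy(p):
    # H_2(p) = -log sum_i p_i^2, in nats. The sum of squares is the
    # "collision probability", a direct measure of concentration.
    p = np.asarray(p, dtype=float)
    return float(-np.log(np.sum(p ** 2)))

# Made-up next-token distributions over a 4-token vocabulary.
peaked = np.array([0.97, 0.01, 0.01, 0.01])   # nearly collapsed policy
uniform = np.full(4, 0.25)                    # maximally spread policy

for name, dist in [("peaked", peaked), ("uniform", uniform)]:
    print(name, shannon_entropy(dist), renyi2_entropy(dist))
```

For the uniform case both values equal log(4). For any distribution the second-order Rényi entropy is at most the Shannon entropy, so it reacts more strongly to a single dominant token.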
That shift matters for researchers and engineers who rely on RLVR to improve chain-of-thought or multi-step reasoning, because it offers a principled way to decide where the learning signal should be applied. The reported gains on Qwen2.5 and the theoretical links to Shannon and second-order Rényi entropy suggest the approach can change how policy concentration is managed in reasoning-focused LLM training.
What to watch
The open question is whether the ICT selective-update rule replicates beyond Qwen2.5 and scales to other model families, training regimes, or larger production models. Confirmation will come from independent reproduction on architectures and benchmarks outside the seven tasks reported, and from application to other RLVR workflows that previously suffered from entropy collapse or explosion.
Written by The Brieftide · Source: arXiv