E^3RL: 4B and 8B models beat AIME SOTA by 5.349%/6.514%
E^3RL uses dynamic epistemic entropy and erasable reinforcement learning on DeepMath-103k to raise AIME scores for 4B and 8B models.
TL;DR
- E^3RL uses dynamic epistemic entropy and erasable reinforcement learning on DeepMath-103k to raise AIME scores for 4B and 8B models.
- Ziliang Wang and seven coauthors submitted a paper to arXiv (arXiv:2606.17735) on 16 Jun 2026 proposing dynamic epistemic entropy orchestrated erasable reinforcement learning, abbreviated E^3RL.
- The method was trained on the DeepMath-103k dataset and the authors report that 4B and 8B parameter models exceed prior state-of-the-art on the AIME benchmark by 5.349% and 6.514%, respectively.
Ziliang Wang and seven coauthors submitted a paper to arXiv (arXiv:2606.17735) on 16 Jun 2026 proposing dynamic epistemic entropy orchestrated erasable reinforcement learning, abbreviated E^3RL. The method was trained on the DeepMath-103k dataset and the authors report that 4B and 8B parameter models exceed prior state-of-the-art on the AIME benchmark by 5.349% and 6.514%, respectively.
How does E^3RL work?
E^3RL treats the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty and uses that signal to guide erasable interventions. The approach introduces segment-level adaptive dynamic thresholds and advantage allocation so the model can excise localized logical defects while reusing historical key-value cache streams. That combination gives the reasoning process a self-healing capability, according to the paper, and the authors emphasize the design keeps memory growth linear.
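To make the idea of an uncertainty signal with segment-level dynamic thresholds concrete, here is a minimal sketch. It is not the paper's algorithm: the per-token surprisal, the fixed segment length, the "mean plus k standard deviations of earlier segments" rule, and every function name are assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def token_surprisal(logits, token_id):
    """Negative log-probability of the chosen token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[token_id]


def segment_scores(surprisals, segment_len):
    """Mean surprisal per fixed-length segment (the last one may be shorter)."""
    return [mean(surprisals[i:i + segment_len])
            for i in range(0, len(surprisals), segment_len)]


def flag_segments(scores, k=1.0, warmup=2):
    """Flag segment i when its score exceeds mean + k * std of earlier segments.

    The threshold is recomputed for every segment from the history so far,
    which is one simple (assumed) way to make it adaptive.
    """
    flagged = []
    for i, score in enumerate(scores):
        history = scores[:i]
        if len(history) >= warmup:
            threshold = mean(history) + k * pstdev(history)
            if score > threshold:
                flagged.append(i)
    return flagged


# Toy trace: the third segment is much more "surprising" than the rest.
surprisals = [0.1, 0.2, 0.1, 0.2,
              0.1, 0.2, 0.1, 0.2,
              2.0, 2.5, 1.8, 2.2,
              0.1, 0.2, 0.1, 0.2]
scores = segment_scores(surprisals, segment_len=4)
print(flag_segments(scores))  # indices of segments flagged as uncertain
```

In this toy trace only the high-surprisal segment is flagged. A real system would compute surprisal from the model's own logits during generation rather than from a hand-written list.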
The technical core is a shift away from external supervisory signals: instead of relying on outside correctness signals, E^3RL grounds decisions in the model's own autoregressive cross-entropy. Segment-level thresholds identify where epistemic perturbations occur, advantage allocation determines which segments to modify, and erasable updates remove or replace faulty generation fragments while preserving useful cached state. The paper frames this as a defense against the autoregressive cascade, where small early errors propagate and cause later reasoning collapse.
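The erase-and-reuse loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `next_segment` is a hypothetical stand-in for a model call that returns a segment and its uncertainty score, the `prefix` list stands in for the retained key-value cache, the threshold is fixed rather than adaptive, and advantage allocation is not modeled at all.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, float]  # (segment text, uncertainty score)


def generate_with_erasure(next_segment: Callable[[List[str], int], Segment],
                          n_segments: int,
                          threshold: float,
                          max_retries: int = 3) -> Tuple[List[str], int]:
    """Generate segments, discarding and resampling any that look too uncertain.

    Only the rejected segment is thrown away; the accepted prefix is reused
    unchanged, so stored state grows by one entry per accepted segment.
    """
    prefix: List[str] = []  # stand-in for the retained KV cache
    erased = 0
    while len(prefix) < n_segments:
        for attempt in range(max_retries + 1):
            text, score = next_segment(prefix, attempt)
            if score <= threshold or attempt == max_retries:
                break  # accept (or give up and accept the last attempt)
            erased += 1  # erase this segment; the prefix is untouched
        prefix.append(text)
    return prefix, erased


def toy_next_segment(prefix: List[str], attempt: int) -> Segment:
    """Hypothetical generator: the first try at step 1 is a faulty segment."""
    step = len(prefix)
    if step == 1 and attempt == 0:
        return ("bad step", 2.0)
    return (f"step {step}", 0.2)


result, erased_count = generate_with_erasure(toy_next_segment,
                                             n_segments=3, threshold=1.0)
print(result, erased_count)
```

Here the faulty segment is dropped once and resampled from the same prefix, so the error never reaches later steps. The design choice worth noting is that erasure is local: nothing before the flagged segment is recomputed.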
How was E^3RL evaluated and what were the results?
The team trained E^3RL on DeepMath-103k and evaluated long-sequence mathematical reasoning, reporting concrete gains on AIME: the 4B parameter model improved over previous SOTA by 5.349%, and the 8B parameter model improved by 6.514%. The paper states E^3RL "reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead."
These two figures are the performance claims the paper reports for mathematical reasoning. The authors position DeepMath-103k as the training corpus for the method and single out AIME as a representative long-horizon reasoning test where the autoregressive cascade is especially damaging. The reported improvements apply to the two parameter scales named in the paper: 4B and 8B.
Why it matters
E^3RL directly targets an architectural weakness in autoregressive LLM generation: early epistemic perturbations that cascade through subsequent steps. If the model-internal cross-entropy can reliably flag and localize defects, then erasable interventions could reduce catastrophic drift in long-horizon tasks such as multi-step mathematical proofs. That matters for any application where a single early mistake can invalidate a long chain of reasoning, and it frames memory-efficient self-repair as an alternative to heavier external supervision or larger model scale.
The paper also ties its claims to systems concerns: the method aims to preserve linear memory overhead and improve sample efficiency, which matters for practitioners who must balance compute, storage, and training data budgets while pursuing stronger reasoning behavior.
What to watch
Watch whether the AIME gains reported after training on DeepMath-103k replicate in other long-horizon domains, and whether independent implementations reproduce the 5.349% and 6.514% improvements at the 4B and 8B scales. Another key signal will be whether the approach generalizes beyond mathematical reasoning to tasks where logical defects are less structured but equally consequential.
The paper is available on arXiv as arXiv:2606.17735 (submitted 16 Jun 2026) and lists Ziliang Wang, Kang An, Faqiang Qian, Jialu Cai, Cijun Ouyang, Yuhang Wang, Qibing Ren, and Yichao Wu as authors.
| Model | Parameters | Benchmark | Reported gain over prior SOTA |
|---|---|---|---|
| 4B model | 4B | AIME | 5.349% |
| 8B model | 8B | AIME | 6.514% |
Written by The Brieftide · Source: arXiv