Attribution-guided pruning for MoE: 5.27× memory cut on Qwen3-30B-A3B
A structural pruning framework reallocates prune budgets at the channel level.
TL;DR
- A structural pruning framework reallocates prune budgets at the channel level.
- The paper proposes a structural pruning framework that reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it with an attribution-based approximation.
- The approach departs from prior expert-level compression techniques that remove entire experts or rank experts with coarse importance scores.
On 16 Jun 2026, Yifu Ding and six coauthors submitted a paper titled "Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression." The authors present a structural pruning framework tailored to Mixture-of-Experts (MoE) models and report that it preserves model accuracy under 50% or 25% structured pruning when combined with 4-bit quantization. On Qwen3-30B-A3B, they report a 5.27× reduction in memory footprint.
What is the method?
The paper proposes a structural pruning framework that reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it with an attribution-based approximation. The authors start from the observation that information inside MoE experts concentrates in a small subset of channels, leaving substantial redundancy even inside experts labeled important, and they use that observation to guide fine-grained, channel-level pruning decisions.
The approach departs from prior expert-level compression techniques that remove entire experts or rank experts with coarse importance scores. Instead of treating each expert as the atomic unit, the method computes channel scores and allocates pruning ratios to maximize coverage of those scores across the network. The paper describes an attribution-based approximation used to make that optimization tractable for large MoE models.
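To make the coverage idea concrete, here is a minimal sketch under stated assumptions. It is not the paper's algorithm: the brief does not describe how channel scores are computed or how the attribution-based approximation works, so the scores below are made-up numbers, and a simple global top-k selection stands in for the allocation step. The function names `allocate_by_coverage` and `coverage` are hypothetical.

```python
from typing import Dict, List


def allocate_by_coverage(scores: Dict[str, List[float]],
                         prune_ratio: float) -> Dict[str, List[int]]:
    """Keep the globally highest-scoring channels under one shared budget.

    scores: hypothetical per-expert channel scores (higher = more important).
    prune_ratio: fraction of all channels to remove, summed over experts.
    Returns the sorted indices of the kept channels for each expert.
    """
    # Flatten to (score, expert, channel) so every expert competes
    # for the same budget instead of getting a fixed per-expert ratio.
    flat = [(s, name, i)
            for name, chans in scores.items()
            for i, s in enumerate(chans)]
    n_keep = len(flat) - int(round(prune_ratio * len(flat)))
    # Highest score first; ties broken by name and index for determinism.
    flat.sort(key=lambda t: (-t[0], t[1], t[2]))
    kept: Dict[str, List[int]] = {name: [] for name in scores}
    for _, name, i in flat[:n_keep]:
        kept[name].append(i)
    return {name: sorted(idx) for name, idx in kept.items()}


def coverage(scores: Dict[str, List[float]],
             kept: Dict[str, List[int]]) -> float:
    """Fraction of the total channel score retained by the kept channels."""
    total = sum(sum(chans) for chans in scores.values())
    covered = sum(scores[name][i] for name, idx in kept.items() for i in idx)
    return covered / total


# Illustrative scores only: expert_a concentrates its score in one channel,
# expert_b spreads it more evenly.
scores = {
    "expert_a": [0.9, 0.05, 0.03, 0.02],
    "expert_b": [0.4, 0.3, 0.2, 0.1],
}
kept = allocate_by_coverage(scores, prune_ratio=0.5)
print(kept, coverage(scores, kept))
```

With these toy numbers, the shared budget keeps one channel from `expert_a` and three from `expert_b`, so each expert ends up with a different prune ratio. A uniform 50% cut per expert would keep two channels from each and retain less of the total score, which is the kind of misallocation the channel-level view is meant to avoid.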
How did it perform on benchmarks?
The authors evaluated the method on DeepSeek and Qwen MoE models and report that it preserves model accuracy under 50% or 25% structured pruning when paired with 4-bit quantization. On Qwen3-30B-A3B, the method reduces the memory footprint by 5.27×. According to the abstract, it also consistently outperforms state-of-the-art baselines across diverse benchmarks.
The submission is concise: 9 pages with 5 figures, and it was submitted to ICML 2026. The reported combination of structured pruning and low-bit quantization is the core deployment claim: substantial structured pruning levels (50% and 25%) are compatible with 4-bit quantization while maintaining accuracy in the evaluated MoE settings. The authors contrast their channel-centric allocation against prior coarse expert-wise pruning, arguing the latter can misallocate pruning budgets because it misses intra-expert redundancy.
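As a rough way to reason about how pruning and quantization compound, the following back-of-the-envelope sketch estimates a memory reduction factor. Everything in it is an assumption rather than the paper's accounting: a 16-bit baseline, pruning applied only to some fraction of the weights, and no overhead for quantization metadata such as scales. The function `memory_reduction` and its parameters are hypothetical.

```python
def memory_reduction(prunable_fraction: float, prune_ratio: float,
                     baseline_bits: int = 16, quant_bits: int = 4) -> float:
    """Estimate the memory reduction from pruning plus quantization.

    prunable_fraction: assumed share of weights that pruning can touch.
    prune_ratio: fraction of those prunable weights that is removed.
    baseline_bits / quant_bits: assumed bits per weight before and after.
    Ignores quantization metadata and any non-weight memory.
    """
    # Share of the original weights that survives pruning.
    remaining = 1.0 - prunable_fraction * prune_ratio
    # Surviving weights are stored at quant_bits instead of baseline_bits.
    return baseline_bits / (quant_bits * remaining)


# Quantization alone (nothing pruned) under these assumptions.
print(memory_reduction(prunable_fraction=0.0, prune_ratio=0.0))
# Upper bound if every weight were prunable at a 50% ratio.
print(memory_reduction(prunable_fraction=1.0, prune_ratio=0.5))
```

Under these assumptions, going from 16-bit to 4-bit weights accounts for a 4× reduction by itself, so any factor beyond that has to come from the pruned channels. How the paper measures its 5.27× figure is not stated in this brief.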
Why it matters
Mixture-of-Experts models scale compute efficiently, yet remain expensive to deploy because of their substantial memory footprint and inference overhead. By moving pruning decisions to the channel level and framing allocation as a coverage maximization problem, the paper addresses a specific bottleneck: coarse expert-wise pruning can leave substantial redundancy untouched inside the experts it keeps. The reported 5.27× memory reduction on Qwen3-30B-A3B suggests that this structural focus can yield materially smaller footprints while preserving accuracy under aggressive pruning and low-bit quantization.
This approach changes where pruning effort is spent. Teams optimizing MoE deployments can target channel-level structure inside experts rather than only removing whole experts, which can improve compression efficiency and make high-capacity MoE models easier to deploy under tight memory or latency constraints.
What to watch
The paper was submitted to ICML 2026; watch for the conference proceedings and for the experimental detail and methodology in the full manuscript. The next signals to look for are independent reproduction of the reported results on DeepSeek and Qwen MoE models, and any open-source code or implementation notes that detail the attribution-based approximation and allocation procedure.
Written by The Brieftide · Source: arXiv