Generic TB-Coverage improves MoE pruning for Qwen1.5, DeepSeek
Uses WikiText2 and C4 calibration to boost zero-shot accuracy across six benchmarks at a 25% retention budget.
TL;DR
- Uses WikiText2 and C4 calibration to boost zero-shot accuracy across six benchmarks at a 25% retention budget.
- The authors tested Generic TB-Coverage on two MoE models, Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base, using fixed retention budgets of 25%, 50% and 75%.
- The reported gains are largest under aggressive pruning, specifically at the 25% and 50% retention settings.
Generic TB-Coverage is a coverage-aware expert pruning method that uses only generic text corpora (WikiText2 and C4) for calibration and preserves per-corpus high-utility experts under fixed-budget pruning. The paper, submitted 2 Jul 2026 by Yongqin Zeng and six coauthors, evaluates the method on Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base and reports improved average accuracy on six common zero-shot benchmarks.
What does Generic TB-Coverage do?
Generic TB-Coverage profiles each MoE expert separately on multiple generic corpora and enforces a fixed-budget coverage rule that keeps high-utility experts from each corpus before building the final pruning mask. Instead of collapsing expert utility into a single aggregated importance score, the method measures per-expert utility on WikiText2 and C4 and ensures cross-corpus coverage when selecting which experts to retain.
This approach aims to avoid biasing the retained expert set toward experts favored by dominant calibration patterns, a shortcoming the paper attributes to many existing expert-pruning methods that rely on a single aggregated score.
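The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the function name `coverage_aware_keep`, the `reserve_per_corpus` parameter, the mean-utility fill rule, and the utility numbers are all assumptions made for the example, since the brief does not give the exact coverage rule.

```python
from typing import Dict, List, Set


def coverage_aware_keep(
    utility: Dict[str, List[float]],  # corpus name -> per-expert utility (hypothetical scores)
    budget: int,                      # number of experts to retain in this layer
    reserve_per_corpus: int,          # assumed: slots guaranteed to each corpus
) -> Set[int]:
    """Return the indices of experts to keep under a fixed budget."""
    n_experts = len(next(iter(utility.values())))
    if budget > n_experts:
        raise ValueError("budget exceeds the number of experts")
    if reserve_per_corpus * len(utility) > budget:
        raise ValueError("reserved slots exceed the budget")

    keep: Set[int] = set()

    # Step 1: protect the top experts of each corpus separately.
    for scores in utility.values():
        ranked = sorted(range(n_experts), key=lambda e: scores[e], reverse=True)
        keep.update(ranked[:reserve_per_corpus])

    # Step 2 (assumed fill rule): spend the remaining budget on the
    # experts with the highest mean utility across corpora.
    mean = [
        sum(scores[e] for scores in utility.values()) / len(utility)
        for e in range(n_experts)
    ]
    for e in sorted(range(n_experts), key=lambda e: mean[e], reverse=True):
        if len(keep) >= budget:
            break
        keep.add(e)

    return keep


# Toy utilities for a 4-expert layer (made-up numbers).
toy_utility = {
    "wikitext2": [0.9, 0.1, 0.2, 0.0],
    "c4": [0.1, 0.2, 0.15, 0.8],
}
kept = coverage_aware_keep(toy_utility, budget=3, reserve_per_corpus=1)
print(sorted(kept))  # [0, 2, 3]
```

Experts 0 and 3 are protected because each is the top expert on one corpus; expert 2 takes the remaining slot on mean utility.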
How was the method tested and what changed?
The authors tested Generic TB-Coverage on two MoE models: Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base, using fixed retention budgets of 25%, 50% and 75%. Across those retention budgets, the paper states the method improves average accuracy on six common zero-shot benchmarks compared with random pruning, REAP, and ExpertSparsity, while also reducing perplexity degradation on the calibration corpora WikiText2 and C4.
The reported gains are largest under aggressive pruning, specifically at the 25% and 50% retention settings. The authors emphasize that these improvements hold with fixed pruning budgets and without any downstream calibration data.
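To make the budgets concrete, the sketch below converts a retention ratio into a per-layer expert count. It reads "retention budget" as the fraction of experts kept in each layer, which is an interpretation of the brief; the 64-expert layer, the floor rounding, and the helper name `experts_retained` are likewise assumptions, not figures from the paper.

```python
def experts_retained(n_experts: int, retention: float) -> int:
    """Number of experts kept in a layer for a given retention ratio.

    Rounds down and keeps at least one expert (assumed rounding rule).
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    return max(1, int(n_experts * retention))


# Hypothetical layer with 64 routed experts.
counts = {r: experts_retained(64, r) for r in (0.25, 0.50, 0.75)}
print(counts)  # {0.25: 16, 0.5: 32, 0.75: 48}
```

Under these assumptions, the 25% setting removes three of every four experts, which is why it counts as the aggressive end of the range.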
Why it matters
Preserving cross-corpus expert coverage offers a practical generic-data prior for MoE pruning, the paper argues. If per-expert utility varies by calibration corpus, aggregating those utilities into a single score can drop experts that matter for some evaluation distributions. Generic TB-Coverage counters that by explicitly protecting experts that show high utility on each corpus, which the authors show yields better zero-shot accuracy and less perplexity damage, especially when pruning aggressively.
That pattern matters because MoE models are commonly pruned to save compute or memory, and pruning decisions that overfit a single calibration distribution can erode out-of-distribution or zero-shot performance. The paper demonstrates a concrete strategy to make pruning more robust using only generic text corpora.
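A toy comparison illustrates the failure mode. All scores here are invented, and both selection rules are simplified stand-ins rather than the paper's or the baselines' actual procedures: when one corpus produces much larger utility values, a summed score keeps only that corpus's favorites, while reserving one slot per corpus retains the other corpus's best expert.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return set(ranked[:k])


# Made-up per-expert utilities for a 6-expert layer.
wikitext2 = [90, 80, 70, 10, 10, 10]  # dominant calibration pattern
c4 = [10, 10, 10, 30, 20, 10]         # weaker signal, different experts
budget = 3

# Single aggregated score: sum utilities, then keep the top experts.
aggregated = [w + c for w, c in zip(wikitext2, c4)]
aggregated_keep = top_k(aggregated, budget)

# Coverage-style rule (assumed form): protect each corpus's best expert,
# then fill the remaining slots by aggregated score.
coverage_keep = top_k(wikitext2, 1) | top_k(c4, 1)
for e in sorted(range(len(aggregated)), key=lambda e: aggregated[e], reverse=True):
    if len(coverage_keep) >= budget:
        break
    coverage_keep.add(e)

print(sorted(aggregated_keep))  # [0, 1, 2]
print(sorted(coverage_keep))    # [0, 1, 3]
```

Expert 3, the most useful expert on the C4 scores, survives only under the coverage-style rule.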
What to watch
Check the arXiv entry for the paper (submitted 2 Jul 2026) for full results and artifacts: the page includes PDF and TeX Source links and toggles for code and data. The next signals to follow are whether the authors publish pruning masks or code via those links and whether the same coverage-aware pruning gains replicate on other MoE families or with different calibration corpora.
| Method | Models | Retention budgets | Average zero-shot accuracy | Perplexity on WikiText2 and C4 |
|---|---|---|---|---|
| Generic TB-Coverage | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Improves average accuracy over random pruning, REAP, ExpertSparsity | Reduces perplexity degradation |
| Random pruning | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Baseline (worse than Generic TB-Coverage) | Higher perplexity degradation |
| REAP | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Worse than Generic TB-Coverage | Higher perplexity degradation |
| ExpertSparsity | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Worse than Generic TB-Coverage | Higher perplexity degradation |
Written by The Brieftide · Source: arXiv