Model Compression · 4 min read

Generic TB-Coverage improves MoE pruning for Qwen1.5, DeepSeek

Uses WikiText2 and C4 calibration to improve average zero-shot accuracy across six benchmarks, with the largest gains at 25% and 50% expert retention.

The Brieftide

TL;DR

  • Generic TB-Coverage calibrates only on WikiText2 and C4 and improves average zero-shot accuracy across six benchmarks.
  • The authors tested Generic TB-Coverage on two MoE models, Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base, using fixed retention budgets of 25%, 50% and 75%.
  • The reported gains are largest under aggressive pruning, specifically at the 25% and 50% retention settings.

Generic TB-Coverage is a coverage-aware expert pruning method that uses only generic text corpora (WikiText2 and C4) for calibration and preserves per-corpus high-utility experts under fixed-budget pruning. The paper, submitted 2 Jul 2026 by Yongqin Zeng and six coauthors, evaluates the method on Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base and reports improved average accuracy on six common zero-shot benchmarks.

What does Generic TB-Coverage do?

Generic TB-Coverage profiles each MoE expert separately on multiple generic corpora and enforces a fixed-budget coverage rule that keeps high-utility experts from each corpus before building the final pruning mask. Instead of collapsing expert utility into a single aggregated importance score, the method measures per-expert utility on WikiText2 and C4 and ensures cross-corpus coverage when selecting which experts to retain.
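The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function name `coverage_mask`, the `reserve_frac` knob, and the rule of filling the leftover budget by mean utility are all hypothetical, and how per-expert utility is actually scored is not specified here.

```python
import numpy as np

def coverage_mask(utilities, budget, reserve_frac=0.5):
    """Build a boolean keep-mask over experts under a fixed budget.

    utilities:    dict mapping corpus name -> per-expert utility scores.
    budget:       number of experts to keep.
    reserve_frac: hypothetical knob, the share of the budget reserved
                  for per-corpus coverage before the fill step.
    """
    scores = np.stack([np.asarray(u, dtype=float) for u in utilities.values()])
    n_corpora, n_experts = scores.shape
    if not 0 < budget <= n_experts:
        raise ValueError("budget must be between 1 and the number of experts")

    # Coverage step: protect the top experts of each corpus separately.
    per_corpus = int(budget * reserve_frac) // n_corpora
    keep = set()
    for row in scores:
        top = np.argsort(-row, kind="stable")[:per_corpus]
        keep.update(int(i) for i in top)

    # Fill step (assumption): spend what is left on the best mean utility.
    mean_utility = scores.mean(axis=0)
    for i in np.argsort(-mean_utility, kind="stable"):
        if len(keep) >= budget:
            break
        keep.add(int(i))

    mask = np.zeros(n_experts, dtype=bool)
    mask[sorted(keep)] = True
    return mask

# Toy utilities for 8 experts (made-up numbers).
utilities = {
    "wikitext2": [0.9, 0.8, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1],
    "c4":        [0.1, 0.2, 0.1, 0.1, 0.1, 0.9, 0.8, 0.1],
}
mask = coverage_mask(utilities, budget=4)
```

In this toy case the top WikiText2 expert (index 0) and the top C4 expert (index 5) are both guaranteed a slot before any pooled ranking is consulted, and the mask always keeps exactly `budget` experts.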

This approach aims to avoid biasing the retained expert set toward experts favored by dominant calibration patterns, a shortcoming the paper attributes to many existing expert-pruning methods that rely on a single aggregated score.

How was the method tested and what changed?

The authors tested Generic TB-Coverage on two MoE models: Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base, using fixed retention budgets of 25%, 50% and 75%. Across those retention budgets, the paper states the method improves average accuracy on six common zero-shot benchmarks compared with random pruning, REAP, and ExpertSparsity, while also reducing perplexity degradation on the calibration corpora WikiText2 and C4.
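To make the evaluation terms concrete, here is a small sketch of how a retention budget translates into a per-layer expert count and how perplexity degradation could be measured. The expert count of 64, the rounding rule, and the choice to express degradation as a relative increase are illustrative assumptions, not details taken from the paper.

```python
import math

def experts_kept(n_experts, retention):
    """Experts kept per layer at a retention fraction (rounding is an assumption)."""
    return max(1, round(n_experts * retention))

def perplexity(token_nlls):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_degradation(ppl_pruned, ppl_dense):
    """Relative perplexity increase of the pruned model over the dense one."""
    return ppl_pruned / ppl_dense - 1.0

# Hypothetical layer with 64 routed experts.
for retention in (0.25, 0.50, 0.75):
    print(retention, experts_kept(64, retention))
```

Under these assumptions the three budgets keep 16, 32 and 48 experts per layer, which is why the 25% setting is the most aggressive of the three.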

The reported gains are largest under aggressive pruning, specifically at the 25% and 50% retention settings. The authors emphasize that these improvements hold with fixed pruning budgets and without any downstream calibration data.

Why it matters

Preserving cross-corpus expert coverage offers a practical generic-data prior for MoE pruning, the paper argues. If per-expert utility varies by calibration corpus, aggregating those utilities into a single score can drop experts that matter for some evaluation distributions. Generic TB-Coverage counters that by explicitly protecting experts that show high utility on each corpus, which the authors report yields better zero-shot accuracy and less perplexity damage, especially when pruning aggressively.
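A toy example shows the failure mode the paper attributes to aggregated scoring. The numbers below are invented for illustration, and summing utilities is only one possible way to aggregate; the point is that when one corpus produces larger scores, a pooled top-k can discard the best expert for the other corpus entirely.

```python
import numpy as np

# Hypothetical utilities for 6 experts on two corpora. Corpus A produces
# larger scores overall, so it dominates any pooled ranking.
corpus_a = np.array([9.0, 8.0, 7.0, 6.0, 0.5, 0.4])
corpus_b = np.array([0.1, 0.2, 0.1, 0.2, 3.0, 2.5])
budget = 3

# Aggregated baseline: sum the utilities and keep the top `budget` experts.
pooled = corpus_a + corpus_b
pooled_keep = set(np.argsort(-pooled)[:budget].tolist())

# The single most useful expert for corpus B, taken on its own.
best_b = int(np.argmax(corpus_b))

print(sorted(pooled_keep))    # [0, 1, 2]
print(best_b in pooled_keep)  # False
```

Expert 4 is the strongest expert for corpus B, yet the pooled ranking never selects it at this budget. A per-corpus coverage rule reserves a slot for it before the pooled scores are consulted.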

That pattern matters because MoE models are commonly pruned to save compute or memory, and pruning decisions that overfit a single calibration distribution can erode out-of-distribution or zero-shot performance. The paper demonstrates a concrete strategy to make pruning more robust using only generic text corpora.

What to watch

Check the arXiv entry for the paper (submitted 2 Jul 2026) for full results and artifacts: the page includes PDF and TeX Source links and toggles for code and data. The next signals to follow are whether the authors publish pruning masks or code via those links and whether the same coverage-aware pruning gains replicate on other MoE families or with different calibration corpora.

Summary: Generic TB-Coverage vs other expert-pruning methods
| Method | Models | Retention budgets | Zero-shot accuracy | Perplexity (WikiText2, C4) |
|---|---|---|---|---|
| Generic TB-Coverage | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Improves average accuracy over random pruning, REAP, ExpertSparsity | Reduces perplexity degradation |
| Random pruning | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Baseline (worse than Generic TB-Coverage) | Higher perplexity degradation |
| REAP | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Worse than Generic TB-Coverage | Higher perplexity degradation |
| ExpertSparsity | Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base | 25%, 50%, 75% | Worse than Generic TB-Coverage | Higher perplexity degradation |

Written by The Brieftide · Source: arXiv
