Model Compression · July 3, 2026 · 4 min read

Generic TB-Coverage improves MoE pruning for Qwen1.5, DeepSeek

Uses WikiText2 and C4 calibration to boost zero-shot accuracy across six benchmarks at retention budgets as low as 25%.

The Brieftide · July 3, 2026

TL;DR

01 Generic TB-Coverage calibrates on WikiText2 and C4 alone and improves average zero-shot accuracy across six benchmarks at retention budgets as low as 25%.
02 The authors tested it on two MoE models, Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base, at fixed retention budgets of 25%, 50% and 75%.
03 The reported gains are largest under aggressive pruning, specifically at the 25% and 50% retention settings.

Generic TB-Coverage is a coverage-aware expert pruning method that uses only generic text corpora (WikiText2 and C4) for calibration and preserves per-corpus high-utility experts under fixed-budget pruning. The paper, submitted 2 Jul 2026 by Yongqin Zeng and six coauthors, evaluates the method on Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base and reports improved average accuracy on six common zero-shot benchmarks.

What does Generic TB-Coverage do?

Generic TB-Coverage profiles each MoE expert separately on multiple generic corpora and enforces a fixed-budget coverage rule that keeps high-utility experts from each corpus before building the final pruning mask. Instead of collapsing expert utility into a single aggregated importance score, the method measures per-expert utility on WikiText2 and C4 and ensures cross-corpus coverage when selecting which experts to retain.
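The selection rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `coverage_aware_mask`, the `protect_frac` parameter, the even split of protected slots across corpora, and the mean-utility fill step are all hypothetical choices made here, and the utility scores are invented toy numbers for a single MoE layer.

```python
import numpy as np

def coverage_aware_mask(utility, retain_frac, protect_frac=0.5):
    """Return a boolean keep-mask over the experts of one layer.

    utility: array of shape (n_corpora, n_experts), one utility row per corpus.
    retain_frac: fraction of experts to keep (the fixed budget).
    protect_frac: hypothetical knob, the share of the budget reserved for
        each corpus's own top experts before any cross-corpus fill.
    """
    utility = np.asarray(utility, dtype=float)
    n_corpora, n_experts = utility.shape
    budget = max(1, int(round(retain_frac * n_experts)))

    # Reserve an equal number of slots for every corpus (assumption).
    per_corpus = int(protect_frac * budget) // n_corpora
    keep = np.zeros(n_experts, dtype=bool)
    for c in range(n_corpora):
        top = np.argsort(-utility[c])[:per_corpus]
        keep[top] = True

    # Fill the rest of the budget by mean utility over corpora,
    # skipping experts that are already protected.
    remaining = budget - int(keep.sum())
    mean_u = np.where(keep, -np.inf, utility.mean(axis=0))
    keep[np.argsort(-mean_u)[:remaining]] = True
    return keep

# Toy scores: row 0 stands in for WikiText2, row 1 for C4 (made-up values).
utility = np.array([
    [0.90, 0.80, 0.70, 0.60, 0.05, 0.04, 0.03, 0.02],
    [0.10, 0.12, 0.11, 0.13, 0.02, 0.04, 0.50, 0.01],
])
mask = coverage_aware_mask(utility, retain_frac=0.5)
print(np.flatnonzero(mask))  # [0 1 2 6]
```

In this toy case expert 6 survives because it is the top expert on the second corpus, even though its mean utility ranks below experts 0 through 3. The mask always contains exactly the budgeted number of experts, since protected slots never exceed the budget and the fill step covers the remainder.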

This approach aims to avoid biasing the retained expert set toward experts favored by dominant calibration patterns, a shortcoming the paper attributes to many existing expert-pruning methods that rely on a single aggregated score.

How was the method tested and what changed?

The authors tested Generic TB-Coverage on two MoE models: Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base, using fixed retention budgets of 25%, 50% and 75%. Across those retention budgets, the paper states the method improves average accuracy on six common zero-shot benchmarks compared with random pruning, REAP, and ExpertSparsity, while also reducing perplexity degradation on the calibration corpora WikiText2 and C4.

The reported gains are largest under aggressive pruning, specifically at the 25% and 50% retention settings. The authors emphasize that these improvements hold with fixed pruning budgets and without any downstream calibration data.

Why it matters

Preserving cross-corpus expert coverage offers a practical generic-data prior for MoE pruning, the paper argues. If per-expert utility varies by calibration corpus, aggregating those utilities into a single score can drop experts that matter for some evaluation distributions. Generic TB-Coverage counters that by explicitly protecting experts that show high utility on each corpus, which the authors show yields better zero-shot accuracy and less perplexity damage, especially when pruning aggressively.
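The failure mode in this argument can be made concrete with a toy check. The sketch below is illustrative only: `aggregated_topk` and `corpus_recall` are hypothetical helper names, the mean is just one possible aggregation, and the scores are invented so that one corpus produces larger utility values than the other.

```python
import numpy as np

def aggregated_topk(utility, budget):
    """Keep the `budget` experts with the highest mean utility across corpora."""
    mean_u = np.asarray(utility, dtype=float).mean(axis=0)
    return set(np.argsort(-mean_u)[:budget].tolist())

def corpus_recall(utility, kept, k):
    """For each corpus, the fraction of its own top-k experts that were kept."""
    recalls = []
    for row in np.asarray(utility, dtype=float):
        top = set(np.argsort(-row)[:k].tolist())
        recalls.append(len(top & kept) / k)
    return recalls

# Made-up scores: corpus A (row 0) dominates in scale, while corpus B
# (row 1) relies mostly on experts 6 and 7.
utility = np.array([
    [0.90, 0.80, 0.70, 0.60, 0.05, 0.04, 0.03, 0.02],
    [0.10, 0.12, 0.11, 0.13, 0.02, 0.04, 0.50, 0.40],
])
kept = aggregated_topk(utility, budget=4)
recalls = corpus_recall(utility, kept, k=2)
print(sorted(kept), recalls)  # [0, 1, 2, 3] [1.0, 0.0]
```

With a single aggregated score, both of corpus B's top experts are pruned while corpus A's are fully retained, which is the kind of bias a per-corpus coverage rule is meant to prevent.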

That pattern matters because MoE models are commonly pruned to save compute or memory, and pruning decisions that overfit a single calibration distribution can erode out-of-distribution or zero-shot performance. The paper demonstrates a concrete strategy to make pruning more robust using only generic text corpora.

What to watch

Check the arXiv entry for the paper (submitted 2 Jul 2026) for full results and artifacts: the page includes PDF and TeX Source links and toggles for code and data. The next signals to follow are whether the authors publish pruning masks or code via those links and whether the same coverage-aware pruning gains replicate on other MoE families or with different calibration corpora.

Summary: Generic TB-Coverage vs other expert-pruning methods

Method	Models	Retention budgets	Average zero-shot accuracy (six benchmarks)	Perplexity on WikiText2 and C4
Generic TB-Coverage	Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base	25%, 50%, 75%	Improves over random pruning, REAP, and ExpertSparsity	Reduces perplexity degradation
Random pruning	Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base	25%, 50%, 75%	Baseline (worse than Generic TB-Coverage)	Higher perplexity degradation
REAP	Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base	25%, 50%, 75%	Worse than Generic TB-Coverage	Higher perplexity degradation
ExpertSparsity	Qwen1.5-MoE-A2.7B; DeepSeek-MoE-16B-Base	25%, 50%, 75%	Worse than Generic TB-Coverage	Higher perplexity degradation

Written by The Brieftide · Source: arXiv

