Benchmarks & Evals · 4 min read

CompressKV: KV-cache compression keeps 97% of performance with 3% of the cache

Semantic-retrieval-guided framework CompressKV preserves over 97% of full-cache performance on LongBench using 3% of KV storage.

The Brieftide

TL;DR

  • Semantic-retrieval-guided framework CompressKV preserves over 97% of full-cache performance on LongBench using 3% of KV storage.
  • CompressKV, a semantic-retrieval-guided KV-cache compression framework, was submitted to arXiv on 23 Jun 2026 by Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li and Grace Li Zhang.
  • CompressKV consistently outperforms existing KV-cache eviction methods across memory budgets, according to experiments reported in the paper.

CompressKV, a semantic-retrieval-guided KV-cache compression framework, was submitted to arXiv on 23 Jun 2026 by Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li and Grace Li Zhang. The paper reports that CompressKV preserves over 97% of full-cache performance using only 3% of the KV cache on LongBench question-answering tasks and achieves 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack.

How does CompressKV work?

CompressKV identifies a small subset of attention heads, which the authors call Semantic Retrieval Heads (SRHs), then retains KV pairs for the tokens those heads select and allocates cache budgets across layers based on offline estimates of layer-wise eviction error. The framework departs from prior heuristics that aggregate attention scores across all heads in LLMs built on grouped-query attention (GQA); the authors argue those heuristics ignore the different functions individual heads serve and can evict critical tokens. Because SRHs capture the initial and final tokens of a prompt as well as semantically important mid-context evidence, CompressKV uses them to choose which token KV pairs to keep, and it distributes the limited cache space per layer to reduce eviction error.
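The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `select_tokens` and `allocate_budgets`, the attention-matrix layout, the rule of summing scores over the chosen heads, and the proportional budgeting rule are all assumptions made for illustration; the brief says only that tokens are selected by SRHs and that budgets follow offline layer-wise eviction-error estimates.

```python
from typing import List, Sequence


def select_tokens(
    attn: Sequence[Sequence[float]],
    retrieval_heads: Sequence[int],
    budget: int,
) -> List[int]:
    """Pick which cached tokens to keep in one layer.

    attn[h][t] is the attention weight head h gives to cached token t.
    Only the heads listed in retrieval_heads vote (assumption: their
    scores are summed); all other heads are ignored.
    """
    num_tokens = len(attn[0])
    scores = [0.0] * num_tokens
    for h in retrieval_heads:
        for t, weight in enumerate(attn[h]):
            scores[t] += weight
    # Keep the highest-scoring tokens, returned in original position order.
    top = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)[:budget]
    return sorted(top)


def allocate_budgets(layer_errors: Sequence[float], total_budget: int) -> List[int]:
    """Split a total token budget across layers.

    Assumption: a layer's share is proportional to its offline eviction-error
    estimate, so layers that suffer more from eviction keep more tokens.
    Leftover slots from rounding go to the largest fractional remainders.
    """
    total_error = sum(layer_errors)
    raw = [total_budget * err / total_error for err in layer_errors]
    budgets = [int(r) for r in raw]
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical toy data: 3 heads, 4 cached tokens, heads 0 and 1 treated as SRHs.
toy_attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.10, 0.70, 0.15],
    [0.25, 0.25, 0.25, 0.25],
]
kept = select_tokens(toy_attn, retrieval_heads=[0, 1], budget=2)
per_layer = allocate_budgets([1.0, 3.0, 6.0], total_budget=10)
```

In this toy setup the uniform third head has no influence on which tokens survive, which is the contrast the authors draw with heuristics that pool scores across every head.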

How well does CompressKV perform compared with existing KV eviction methods?

CompressKV consistently outperforms existing KV-cache eviction methods across memory budgets, according to experiments reported in the paper. The authors highlight two concrete results: on LongBench question-answering tasks, CompressKV preserves over 97% of full-cache performance while using only 3% of the KV cache, and on Needle-in-a-Haystack it reaches 90% accuracy with 0.7% KV storage. The paper frames these outcomes as an improved resource-performance trade-off for long-context LLM inference and states that the experiments on LongBench and Needle-in-a-Haystack support the claim.

Why it matters

Long-context LLM inference is constrained by the memory footprint and decoding cost of KV caches, which limits deployment on resource-constrained hardware. CompressKV targets that bottleneck by reducing KV storage dramatically while keeping performance close to a full cache. The framework focuses on head-level semantics and layer-aware budgeting, not just global heuristics, so it addresses a specific failure mode: evicting tokens that later prove critical. If the reported results hold across more models and tasks, CompressKV could lower the memory barrier for running long-context LLMs in edge or cost-sensitive environments.
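To put the memory footprint in concrete terms, here is a back-of-the-envelope sketch using the standard KV-cache size formula (keys plus values, per layer, per KV head, per token). The model dimensions below are hypothetical and are not taken from the paper, and the sketch assumes the retained fraction applies uniformly to the number of cached tokens.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,   # 2 bytes assumes fp16/bf16 storage
    keep_fraction: float = 1.0,  # share of tokens retained after compression
) -> int:
    """Approximate KV-cache size in bytes for a single sequence."""
    # Factor of 2 covers one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    kept_tokens = int(seq_len * keep_fraction)
    return per_token * kept_tokens


# Hypothetical GQA model: 32 layers, 8 KV heads, head dimension 128, 32,768-token context.
full = kv_cache_bytes(32, 8, 128, 32768)
compressed = kv_cache_bytes(32, 8, 128, 32768, keep_fraction=0.03)
print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"3% retained:  {compressed / 2**20:.1f} MiB")
```

Under these assumed dimensions, a full cache of about 4 GiB per sequence shrinks to roughly 123 MiB when only 3% of tokens are kept, which is the scale of saving that makes edge and cost-sensitive deployment plausible.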

What to watch

The authors made their code publicly available at the URL provided in the paper, so community replication and wider benchmarks are the next milestones. The submission also carries an arXiv admin comment noting substantial text overlap with arXiv:2508.02401, which readers and reviewers may want to weigh when evaluating the novelty and reproducibility of the claims.

CompressKV reported results

Item | KV cache used | Reported result
LongBench (QA) | 3% | Preserves over 97% of full-cache performance
Needle-in-a-Haystack | 0.7% | Achieves 90% accuracy

Written by The Brieftide · Source: arXiv
