CompressKV: KV-cache compression keeps over 97% of performance with 3% of the cache
Semantic-retrieval-guided framework CompressKV preserves over 97% of full-cache performance on LongBench using 3% of KV storage.
TL;DR
- Semantic-retrieval-guided framework CompressKV preserves over 97% of full-cache performance on LongBench using 3% of KV storage.
- CompressKV, a semantic-retrieval-guided KV-cache compression framework, was submitted to arXiv on 23 Jun 2026 by Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li and Grace Li Zhang.
- CompressKV consistently outperforms existing KV-cache eviction methods across memory budgets, according to experiments reported in the paper.
CompressKV, a semantic-retrieval-guided KV-cache compression framework, was submitted to arXiv on 23 Jun 2026 by Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li and Grace Li Zhang. The paper reports that CompressKV preserves over 97% of full-cache performance using only 3% of the KV cache on LongBench question-answering tasks and achieves 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack.
How does CompressKV work?
CompressKV identifies a small subset of attention heads called Semantic Retrieval Heads (SRHs), retains the KV pairs for tokens selected by those heads, and allocates cache budgets across layers based on offline estimates of layer-wise eviction error. The framework departs from prior heuristics that aggregate attention scores across all heads in LLMs built on grouped-query attention (GQA); the authors argue that those heuristics ignore the different functions of individual heads and can evict critical tokens. Because SRHs capture the initial and final tokens of a prompt as well as semantically important mid-context evidence, CompressKV uses them to choose which token KV pairs to keep, and it distributes the limited cache space per layer to reduce eviction error.
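The two steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rule of summing attention scores over the chosen heads, and the proportional split of the budget by eviction error are all assumptions made for illustration, and the paper's actual head-identification and budgeting procedures may differ.

```python
def select_tokens(head_scores, retrieval_heads, budget):
    """Keep the `budget` tokens scored highest by the chosen heads only.

    head_scores[h][t] is the attention score head h gives token t.
    Summing over `retrieval_heads` alone is an assumed scoring rule.
    """
    n_tokens = len(head_scores[0])
    combined = [
        sum(head_scores[h][t] for h in retrieval_heads)
        for t in range(n_tokens)
    ]
    ranked = sorted(range(n_tokens), key=lambda t: combined[t], reverse=True)
    return sorted(ranked[:budget])  # kept token indices, in prompt order


def allocate_layer_budgets(layer_errors, total_budget, min_per_layer=1):
    """Split a total token budget across layers.

    Assumed rule: every layer gets `min_per_layer`, and the remainder is
    shared in proportion to each layer's estimated eviction error.
    """
    n_layers = len(layer_errors)
    spare = total_budget - min_per_layer * n_layers
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_layer")
    total_error = sum(layer_errors)
    if total_error == 0:
        shares = [spare // n_layers] * n_layers
    else:
        shares = [int(spare * e / total_error) for e in layer_errors]
    budgets = [min_per_layer + s for s in shares]
    # Rounding down can leave a few slots unassigned; give them to the
    # layers with the largest estimated error.
    leftover = total_budget - sum(budgets)
    by_error = sorted(range(n_layers), key=lambda i: layer_errors[i], reverse=True)
    for i in by_error[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical scores: 3 heads x 4 tokens; heads 0 and 1 play the role of SRHs.
scores = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.8, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
kept = select_tokens(scores, retrieval_heads=[0, 1], budget=2)
budgets = allocate_layer_budgets([1.0, 3.0], total_budget=10)
print(kept, budgets)
```

In this toy input, head 2 is deliberately left out of the score, so its preferred token (index 3) does not influence which tokens are kept. That mirrors the article's point that selection is driven by a chosen subset of heads and not by an aggregate over all of them.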
How well does CompressKV perform compared with existing KV eviction methods?
CompressKV consistently outperforms existing KV-cache eviction methods across memory budgets, according to experiments reported in the paper. The authors highlight two concrete results: on LongBench question-answering tasks CompressKV preserves over 97% of full-cache performance while using only 3% of the KV cache, and on Needle-in-a-Haystack it reaches 90% accuracy with 0.7% KV storage. The paper frames these outcomes as an improved resource-performance trade-off for long-context LLM inference and states that its experiments on LongBench and Needle-in-a-Haystack support the claim.
Why it matters
Long-context LLM inference is constrained by the memory footprint and decoding cost of KV caches, which limits deployment on resource-constrained hardware. CompressKV targets that bottleneck by reducing KV storage dramatically while keeping performance close to a full cache. The framework focuses on head-level semantics and layer-aware budgeting, not just global heuristics, so it addresses a specific failure mode: evicting tokens that later prove critical. If the reported results hold across more models and tasks, CompressKV could lower the memory barrier for running long-context LLMs in edge or cost-sensitive environments.
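To give a sense of scale, the sketch below applies the standard KV-cache size formula (keys and values for every layer, KV head and token) to a hypothetical model. The layer count, KV-head count, head dimension, context length and 16-bit precision are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # The leading 2 accounts for storing both a key and a value per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration: one sequence of 131,072 tokens in 16-bit precision.
full = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)
compressed = full * 0.03  # the 3% budget cited for LongBench

print(f"full cache: {full / 2**30:.2f} GiB")        # full cache: 16.00 GiB
print(f"3% budget:  {compressed / 2**30:.2f} GiB")  # 3% budget:  0.48 GiB
```

Under these assumptions a single long prompt needs 16 GiB for its cache alone, and a 3% budget brings that to roughly half a gibibyte.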
What to watch
The authors made their code publicly available at the URL provided in the paper, so community replication and wider benchmarks are the next milestones. Also note an arXiv admin comment in the submission pointing to substantial text overlap with arXiv:2508.02401, which readers and reviewers may consider when evaluating novelty and reproducibility of the claims.
| Benchmark | KV cache used | Reported result |
|---|---|---|
| LongBench (QA) | 3% | Preserves over 97% of full-cache performance |
| Needle-in-a-Haystack | 0.7% | Achieves 90% accuracy |
Written by The Brieftide · Source: arXiv