SemHash-LLM: multi-granularity hashing for deduplication
A multi-granularity semantic hashing framework that combines projection hashing, attention-weighted MinHash, contrastive boundary learning and selective LLM adjudication.
TL;DR
- A multi-granularity semantic hashing framework that combines projection hashing, attention-weighted MinHash, contrastive boundary learning and selective LLM adjudication.
- The system first extracts multi-granularity signals: character, token and document features are fused by a gating mechanism.
- Semantic projection hashing maps distilled LLM embeddings to compact binary codes for fast indexing and retrieval.
SemHash-LLM, a paper by Xinyi Fang, Kejian Tong, Jiabei Liu, Tao Ning and Yuhang He (arXiv:2607.01601) submitted to arXiv on 2 Jul 2026, proposes a multi-granularity semantic hashing framework for large-scale document deduplication. The authors unify semantic projection hashing, attention-weighted MinHash, contrastive boundary learning and selective LLM-based adjudication, and report "less than one percent neural verification cost."
How does SemHash-LLM work?
SemHash-LLM combines character-, token- and document-level signals through gated fusion, then applies a cascaded filtering pipeline that reduces candidates before neural verification. The pipeline pairs semantic projection hashing, which learns compact binary codes in a distilled LLM embedding space, with attention-weighted MinHash to suppress boilerplate and emphasize informative content.
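To make the binary-code idea concrete, here is a minimal sketch using sign random projections. It rests on an assumption made purely for illustration: the paper learns its projections in a distilled LLM embedding space, whereas the hyperplanes below are random. The names make_hyperplanes, binary_code and hamming are hypothetical and do not come from the paper.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes.

    Assumption: random planes stand in for the learned projections
    described in the paper.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binary_code(embedding, hyperplanes):
    """Pack projection signs into one integer (bit i is 1 if dot product >= 0)."""
    code = 0
    for i, plane in enumerate(hyperplanes):
        dot = sum(w * x for w, x in zip(plane, embedding))
        if dot >= 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of bit positions where two codes differ."""
    return bin(a ^ b).count("1")
```

Embeddings that point in similar directions agree on most bits, so a small Hamming distance can act as a cheap first-pass filter before any costlier comparison.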
The system first extracts multi-granularity signals: character, token and document features are fused by a gating mechanism. Semantic projection hashing maps distilled LLM embeddings to compact binary codes for fast indexing and retrieval. Attention-weighted MinHash then downweights repetitive or boilerplate regions, improving candidate selection. Contrastive boundary learning sets adaptive decision thresholds and provides uncertainty estimates. These feed a cascaded filter that reduces the number of items reaching the most expensive verification stage, which the authors implement as selective LLM-based adjudication.
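The downweighting step can be illustrated with a simplified weighted MinHash. This sketch assumes each token has a single positive weight supplied by a caller-provided function, which stands in for the attention scores the paper describes; it is not the authors' implementation. The names weighted_minhash, estimated_similarity and the weight argument are hypothetical.

```python
import hashlib
import math


def _unit_hash(token, seed):
    """Deterministically map (token, seed) to a float in (0, 1]."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / (2**64 + 1)


def weighted_minhash(tokens, weight, n_hashes=64):
    """Return one sampled token per hash seed for a non-empty token list.

    Each token gets the key -ln(u) / weight(token); the smallest key wins.
    Low-weight tokens (for example boilerplate) therefore rarely win.
    """
    unique = set(tokens)
    signature = []
    for seed in range(n_hashes):
        winner = min(
            unique,
            key=lambda t: (-math.log(_unit_hash(t, seed)) / weight(t), t),
        )
        signature.append(winner)
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions where both documents picked the same token."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

With per-token weights shared across documents, the match rate approximates the weighted overlap of the two token sets, so two pages that share only low-weight boilerplate tend to score close to zero.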
How well does it perform?
The authors state that experiments show SemHash-LLM achieves strong duplicate detection quality while keeping neural verification cost under one percent. Adaptive decision boundaries and uncertainty estimation are credited with improving robustness against template pollution, short-text perturbation, containment and viral fragments.
The paper emphasizes efficiency at scale: compact binary codes and MinHash-style filtering do the heavy lifting, and only a small fraction of candidates is escalated to LLM adjudication. The exact datasets, numeric accuracy metrics and evaluation splits are presented in the paper itself; the abstract summarizes the result as strong duplicate detection quality with less than one percent neural verification cost.
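The escalation logic can be sketched as a small cascade. Everything here is an assumption for illustration: the thresholds are fixed constants rather than the learned adaptive boundaries the paper describes, the adjudicate callback is a stand-in for the LLM check, and run_cascade and its parameters are hypothetical names.

```python
def run_cascade(candidates, adjudicate, max_hamming=8,
                reject_below=0.3, accept_above=0.8):
    """Decide duplicate / not duplicate for each candidate pair.

    candidates: iterable of (pair_id, hamming_distance, minhash_similarity).
    adjudicate: callable(pair_id) -> bool, used only for uncertain pairs.
    Returns (decisions, escalation_rate).
    """
    decisions = {}
    escalated = 0
    for pair_id, hamming_dist, minhash_sim in candidates:
        if hamming_dist > max_hamming:
            # Stage 1: binary codes are too far apart, reject cheaply.
            decisions[pair_id] = False
        elif minhash_sim >= accept_above:
            # Stage 2: confident duplicate, no expensive check needed.
            decisions[pair_id] = True
        elif minhash_sim < reject_below:
            # Stage 2: confident non-duplicate.
            decisions[pair_id] = False
        else:
            # Uncertain band: escalate to the expensive adjudicator.
            escalated += 1
            decisions[pair_id] = adjudicate(pair_id)
    rate = escalated / len(decisions) if decisions else 0.0
    return decisions, rate
```

The escalation rate returned here is the quantity that corresponds to the verification cost the abstract reports; narrowing the uncertain band lowers that rate at the risk of more confident mistakes.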
Why does it matter?
SemHash-LLM addresses a core trade-off in large-scale deduplication: preserving semantic equivalence typically requires expensive neural comparisons, while hash-and-filter pipelines risk missing subtle semantic duplicates. By projecting distilled LLM embeddings into compact binary codes and combining them with attention-weighted MinHash and adaptive boundaries, the framework aims to retain semantic sensitivity without paying the full cost of neural verification for every candidate.
If the experimental claims hold across real-world corpora, the approach could cut the volume of expensive LLM checks dramatically, which matters for teams that must deduplicate at web or enterprise scale and want to limit inference costs while handling noisy templates, short or viral content and containment cases.
What to watch
Look for the authors' code, data and media links on the paper's arXiv page and for the paper's DOI via DataCite, which the arXiv entry notes is pending registration. Also watch for detailed evaluation tables and release of the distilled embedding models the paper uses, which will determine how portable the compact binary projection is across different corpora.
Authors: Xinyi Fang, Kejian Tong, Jiabei Liu, Tao Ning, Yuhang He. Submitted to arXiv on 2 Jul 2026 as arXiv:2607.01601.
Written by The Brieftide · Source: arXiv