SchemaRAG: Dynamic schema pruning cuts latency 47%, ups micro-F1
A retrieval-augmented generation method, SchemaRAG prunes large output schemas and showed up to 48% token cost savings on healthcare and e-commerce datasets.
TL;DR
- SchemaRAG, a retrieval-augmented generation method, prunes large output schemas and showed up to 48% token cost savings on healthcare and e-commerce datasets.
- It retrieves a reduced subset of the target schema to condition the LLM, rather than passing the entire schema every time, which sidesteps the cost, latency, and context-length problems of full-schema prompts.
- The authors present SchemaRAG as an engineering and algorithmic response to prompt-scale problems.
SchemaRAG, a retrieval-augmented generation framework from Sin Yu Bonnie Ho and five coauthors, was submitted to arXiv on 4 May 2026 and targets structured information extraction when target schemas are large and complex. The approach dynamically prunes the output schema space using schema metadata and few-shot examples; on real-world healthcare and e-commerce datasets it achieved up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs.
What is SchemaRAG and how does it work?
SchemaRAG is a retrieval-augmented generation framework that dynamically prunes the output schema space for schema-conditioned information extraction tasks, leveraging schema metadata and few-shot examples when available. The paper frames the problem around the costs of including a full schema in a prompt: larger schemas increase cost and latency, risk lost-in-the-middle performance degradation, and can exceed context length limits. SchemaRAG addresses those limits by retrieving a reduced subset of the target schema to condition the LLM, rather than passing the entire schema every time.
The authors present SchemaRAG as an engineering and algorithmic response to prompt-scale problems. They position schema metadata and any available few-shot examples as the signals used to select a smaller, focused schema slice for each input, which lets the extraction model operate with a tighter context and fewer tokens.
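To make the pruning idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each schema field carries a short description and optional few-shot example text, and it swaps in a simple bag-of-words cosine score as the retriever; the brief does not say which retriever the paper uses. The names `prune_schema`, `build_prompt`, `SCHEMA`, and every field in it are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two token-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_schema(document, schema, top_k=2):
    # Score every field by how well its metadata and few-shot examples
    # match the input text, then keep only the top_k fields.
    doc_vec = Counter(tokenize(document))
    scored = []
    for name, meta in schema.items():
        field_text = " ".join(
            [name.replace("_", " "), meta["description"], *meta.get("examples", [])]
        )
        scored.append((cosine(doc_vec, Counter(tokenize(field_text))), name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {name: schema[name] for _, name in scored[:top_k]}

def build_prompt(document, pruned):
    # Condition the LLM on the pruned slice instead of the full schema.
    fields = [f"- {name}: {meta['description']}" for name, meta in pruned.items()]
    return "Extract these fields as JSON:\n" + "\n".join(fields) + "\n\nText:\n" + document

# Hypothetical mixed schema; a real one could hold hundreds of fields.
SCHEMA = {
    "medication_name": {
        "description": "name of a prescribed drug or medication",
        "examples": ["patient was prescribed metformin"],
    },
    "dosage": {
        "description": "dose amount and unit of a medication",
        "examples": ["500 mg twice daily"],
    },
    "shipping_address": {
        "description": "delivery address for an order",
        "examples": ["ship to 12 Main Street"],
    },
    "product_sku": {
        "description": "stock keeping unit identifier of a product",
        "examples": ["SKU 4471-B"],
    },
}

doc = "Patient was prescribed metformin 500 mg twice daily."
pruned = prune_schema(doc, SCHEMA, top_k=2)
prompt = build_prompt(doc, pruned)
print(sorted(pruned))
```

In this toy case the two medication fields are kept and the e-commerce fields never reach the prompt. A production system would more plausibly use dense embeddings and a tuned cutoff, since dropping a relevant field costs recall.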
How well does SchemaRAG perform?
On the paper's real-world testbeds, SchemaRAG delivered concrete gains: up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs. The evaluation datasets are described as coming from healthcare and e-commerce, and the authors report these three headline improvements as evidence for the method's effectiveness and efficiency.
Those numbers are presented as upper-bound improvements observed in the experiments. The authors summarize the results as demonstrating the practicality of SchemaRAG for large-schema extraction, emphasizing both quality (micro-F1) and operational metrics (latency and token cost).
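For readers less familiar with the quality metric, micro-F1 pools true positives, false positives, and false negatives across all documents before computing precision and recall. The sketch below assumes exact matching of (field, value) pairs, which may differ from the paper's matching rules; the function name and sample data are hypothetical.

```python
def micro_f1(gold, predicted):
    # gold, predicted: lists of sets of (field, value) pairs, one set per document.
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # pairs extracted correctly
        fp += len(p - g)   # pairs extracted but not in the gold set
        fn += len(g - p)   # gold pairs the system missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical two-document evaluation set.
gold = [
    {("drug", "metformin"), ("dose", "500 mg")},
    {("sku", "4471-B")},
]
predicted = [
    {("drug", "metformin")},                 # misses the dose
    {("sku", "4471-B"), ("color", "red")},   # adds a spurious field
]
score = micro_f1(gold, predicted)
print(round(score, 3))
```

Because counts are pooled, documents with many fields weigh more than documents with few, which is why micro-F1 suits schemas whose size varies by input.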
Why does it matter?
Large output schemas have become a real bottleneck for LLM-driven extraction: they push up token usage and latency, and they can exceed model context windows. SchemaRAG targets all three by sending a smaller schema slice with each request, and the reported results pair higher extraction quality with reductions in the two main operational costs of LLM use, latency and token consumption. That combination matters for teams building extraction pipelines in domains where schemas are deep and variable, such as the healthcare and e-commerce examples used in the paper.
The paper’s framing suggests a path to making schema-conditioned extraction more practical in production settings where prompt length and cost are hard constraints. Reducing token use by up to 48% and lowering latency by up to 47% could directly affect throughput and billable usage for services that rely on large LLM contexts.
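As a back-of-the-envelope illustration of what those reductions could mean, the sketch below applies the reported upper-bound percentages to a made-up baseline. The baseline token count and latency are hypothetical and do not come from the paper, and the throughput figure assumes requests run one after another with latency as the only bottleneck.

```python
def projected(baseline, reduction):
    # Value remaining after a fractional reduction (0.48 means 48% less).
    return baseline * (1 - reduction)

# Hypothetical baseline for one full-schema request (not from the paper).
BASELINE_TOKENS = 10_000
BASELINE_LATENCY_S = 4.0

# Reported upper-bound reductions from the brief.
TOKEN_REDUCTION = 0.48
LATENCY_REDUCTION = 0.47

tokens = projected(BASELINE_TOKENS, TOKEN_REDUCTION)
latency = projected(BASELINE_LATENCY_S, LATENCY_REDUCTION)

# With sequential requests, throughput scales with the inverse of latency.
throughput_gain = BASELINE_LATENCY_S / latency

print(f"tokens per request: {tokens:.0f}")
print(f"latency per request: {latency:.2f} s")
print(f"sequential throughput gain: {throughput_gain:.2f}x")
```

Under these assumptions a 47% latency cut works out to just under twice the sequential throughput. Since the paper's figures are best cases, typical gains would likely be smaller.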
What to watch
The work appears in the Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, Industry Track, pages 1114 to 1127, San Diego, California, USA, July 2026, so the conference presentation will be the next public milestone. The arXiv submission is listed as arXiv:2607.00008 (submitted 4 May 2026) and links to a DOI in the ACL proceedings.
Authors: Sin Yu Bonnie Ho, Arlie Coles, Erik Larsson, Eric Marshall, Nathan Bodenstab, and Paul Vozila. The paper positions SchemaRAG as a retrieval-augmented strategy to make large-schema extraction more efficient while improving micro-F1 and reducing both latency and token cost.
Written by The Brieftide · Source: arXiv