Enterprise AI Adoption · 5 min read

SchemaRAG: Dynamic schema pruning cuts latency 47%, ups micro-F1

A retrieval-augmented generation method, SchemaRAG prunes large output schemas and showed up to 48% token cost savings on healthcare and e-commerce datasets.

The Brieftide

TL;DR

  • 01 A retrieval-augmented generation method, SchemaRAG prunes large output schemas and showed up to 48% token cost savings on healthcare and e-commerce datasets.
  • 02 SchemaRAG retrieves a reduced subset of the target schema to condition the LLM, rather than passing the entire schema every time.
  • 03 The authors present SchemaRAG as an engineering and algorithmic response to prompt-scale problems.

SchemaRAG, a retrieval-augmented generation framework from Sin Yu Bonnie Ho and five coauthors, was submitted to arXiv on 4 May 2026 and targets structured information extraction when target schemas are large and complex. The approach dynamically prunes the output schema space using schema metadata and few-shot examples; on real-world healthcare and e-commerce datasets it achieved up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs.

What is SchemaRAG and how does it work?

SchemaRAG is a retrieval-augmented generation framework that dynamically prunes the output schema space for schema-conditioned information extraction tasks, leveraging schema metadata and few-shot examples when available. The paper frames the problem around the costs of including a full schema in a prompt: larger schemas increase cost and latency, risk lost-in-the-middle performance degradation, and can exceed context length limits. SchemaRAG addresses those limits by retrieving a reduced subset of the target schema to condition the LLM, rather than passing the entire schema every time.

The authors present SchemaRAG as an engineering and algorithmic response to prompt-scale problems. They position schema metadata and any available few-shot examples as the signals used to select a smaller, focused schema slice for each input, which lets the extraction model operate with a tighter context and fewer tokens.
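To make the idea of a "focused schema slice" concrete, here is a minimal Python sketch of per-input schema pruning. It is not the authors' implementation: the paper's retriever is not described here, so this sketch swaps in a plain keyword-overlap score over field metadata. The schema, the field names, and the helpers `tokenize`, `score_field`, and `prune_schema` are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_field(input_tokens, field):
    """Count input tokens that also appear in the field's name or description."""
    field_tokens = set(tokenize(field["name"] + " " + field["description"]))
    return sum(n for tok, n in input_tokens.items() if tok in field_tokens)


def prune_schema(text, schema, top_k=2):
    """Return the top_k schema fields ranked by overlap with the input text.

    Assumption: keyword overlap stands in for whatever retrieval signal
    the paper actually uses over schema metadata and few-shot examples.
    """
    input_tokens = Counter(tokenize(text))
    ranked = sorted(schema, key=lambda f: score_field(input_tokens, f), reverse=True)
    return ranked[:top_k]


# Hypothetical schema metadata, invented for illustration only.
SCHEMA = [
    {"name": "medication_name", "description": "drug or medication prescribed to the patient"},
    {"name": "dosage", "description": "dose amount and unit such as mg or ml"},
    {"name": "shipping_address", "description": "street address where an order is delivered"},
    {"name": "product_color", "description": "color of the product listed for sale"},
]

note = "Patient was prescribed a new medication at a dose of 20 mg daily."
pruned = prune_schema(note, SCHEMA, top_k=2)

# Only the pruned fields would be placed in the extraction prompt.
prompt_schema = [field["name"] for field in pruned]
print(prompt_schema)
```

In this toy run only the two medication-related fields would reach the prompt, while the e-commerce fields are left out. A production retriever would more likely use embeddings or a learned ranker, but the shape of the step is the same: score schema elements against the input, keep a small subset, and condition the LLM on that subset.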

How well does SchemaRAG perform?

On the paper's real-world testbeds, SchemaRAG delivered concrete gains: up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs. The evaluation datasets are described as coming from healthcare and e-commerce, and the authors report these three headline improvements as evidence for the method's effectiveness and efficiency.

All three figures are reported as "up to" values, so they should be read as the best improvements observed in the experiments rather than typical ones. The authors summarize the results as demonstrating the practicality of SchemaRAG for large-schema extraction, emphasizing both quality (micro-F1) and operational metrics (latency and token cost).
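For readers less familiar with the quality metric, micro-F1 pools true positives, false positives, and false negatives across all fields before computing precision and recall, so frequent fields weigh more than rare ones. The sketch below shows the standard calculation; the per-field counts are made up for illustration and do not come from the paper.

```python
def micro_f1(counts):
    """Compute micro-averaged F1 from per-field (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct extractions at all: define F1 as zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts per schema field: (true pos, false pos, false neg).
counts = {
    "medication_name": (8, 2, 1),
    "dosage": (4, 1, 3),
}
score = micro_f1(counts)
print(round(score, 3))
```

Whether the reported 8.8% gain is relative or in absolute points is not stated in this summary, so the snippet illustrates only how the metric itself is computed.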

Why does it matter?

Large output schemas have become a real bottleneck for LLM-driven extraction: they push up token usage and latency, and they can exceed model context windows. SchemaRAG cuts into all three pain points at once, improving extraction quality while reducing the two main operational costs of LLM use, latency and token consumption. That combination matters for teams building extraction pipelines in domains where schemas are deep and variable, such as the healthcare and e-commerce examples used in the paper.

The paper’s framing suggests a path to making schema-conditioned extraction more practical in production settings where prompt length and cost are hard constraints. Reducing token use by up to 48% and lowering latency by up to 47% could directly affect throughput and billable usage for services that rely on large LLM contexts.
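As a rough back-of-the-envelope illustration of what a token reduction could mean for a bill, the Python sketch below applies the paper's best-case 48% figure to an invented workload. The baseline token count, request volume, and price are hypothetical placeholders, not numbers from the paper or from any provider's price list.

```python
def projected_savings(baseline_tokens, requests_per_day, price_per_1k_tokens, reduction=0.48):
    """Estimate daily token spend before and after a fractional token reduction.

    reduction=0.48 mirrors the paper's best-case ("up to") token cost figure;
    real savings would depend on the schema, the inputs, and the model used.
    """
    baseline_cost = baseline_tokens / 1000 * price_per_1k_tokens * requests_per_day
    pruned_cost = baseline_cost * (1 - reduction)
    return baseline_cost, pruned_cost, baseline_cost - pruned_cost


# Hypothetical workload: 10,000 prompt tokens per request, 50,000 requests
# per day, at an assumed price of 0.002 currency units per 1,000 tokens.
before, after, saved = projected_savings(10_000, 50_000, 0.002)
print(f"before={before:.2f} after={after:.2f} saved={saved:.2f}")
```

Under these assumed inputs the daily spend drops from roughly 1,000 to roughly 520 units. The point is only that the saving scales linearly with request volume, which is why the reduction matters most for high-throughput pipelines.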

What to watch

The work appears in the Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, Industry Track, pages 1114 to 1127, San Diego, California, USA, July 2026, so the conference presentation will be the next public milestone. The arXiv submission is listed as arXiv:2607.00008 (submitted 4 May 2026) and links to a DOI in the ACL proceedings.

Authors: Sin Yu Bonnie Ho, Arlie Coles, Erik Larsson, Eric Marshall, Nathan Bodenstab, and Paul Vozila. The paper positions SchemaRAG as a retrieval-augmented strategy to make large-schema extraction more efficient while improving micro-F1 and reducing both latency and token cost.

Written by The Brieftide · Source: arXiv
