Baidu Unlimited OCR: R-SWA enables OCR of dozens of pages in one pass
Baidu's Unlimited OCR uses Reference Sliding Window Attention to keep the KV cache fixed and process dozens of pages in a single inference.
TL;DR
- Baidu's Unlimited OCR uses Reference Sliding Window Attention to keep the KV cache fixed and process dozens of pages in a single inference.
- Baidu researchers built an OCR system that processes dozens of document pages in a single inference pass while keeping memory use and decoding speed constant, according to a paper the team published on Jul 5, 2026.
- The model achieves this by redesigning attention into a Reference Sliding Window Attention (R-SWA) that limits how much previously generated output the decoder can attend to.
How does Unlimited OCR keep memory and speed constant?
Unlimited OCR uses Reference Sliding Window Attention, which lets each generated token attend to all reference tokens and visual tokens but only the last 128 previously generated output tokens, capping the KV cache. Visual tokens are encoded once and remain unchanged, avoiding the gradual blurring that a standard sliding window would impose on image features.
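The visibility rule described above can be written as a small predicate. This is a minimal sketch, not the authors' implementation: the function names, the flat position layout (reference tokens first, generated tokens after) and the choice to let a token also see itself are assumptions made for illustration.

```python
def rswa_can_attend(query_pos: int, key_pos: int, prefix_len: int, window: int = 128) -> bool:
    """Hypothetical R-SWA visibility rule for one (query, key) pair.

    Positions 0 .. prefix_len-1 hold the reference (visual and prompt) tokens;
    positions from prefix_len onward hold generated output tokens.
    """
    if key_pos > query_pos:
        return False          # causal: never look at future positions
    if key_pos < prefix_len:
        return True           # reference tokens stay fully visible
    # Generated tokens: the token itself plus the `window` tokens before it
    # (including the token itself is an assumption of this sketch).
    return query_pos - key_pos <= window


def visible_keys(query_pos: int, prefix_len: int, window: int = 128) -> list[int]:
    """List every key position the query at `query_pos` may attend to."""
    return [k for k in range(query_pos + 1)
            if rswa_can_attend(query_pos, k, prefix_len, window)]


# Toy sizes: a 4-token prefix and a window of 2 generated tokens.
print(visible_keys(query_pos=9, prefix_len=4, window=2))  # [0, 1, 2, 3, 7, 8, 9]
```

With a 256-token prefix and the 128-token window, a query deep into the output sees the same number of keys as one near the start, which is the property the article attributes to R-SWA.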
The system pairs a frozen DeepEncoder with a mixture-of-experts decoder of three billion parameters, of which about 500 million are active during inference. The DeepEncoder compresses a 1024-by-1024-pixel PDF image to 256 tokens. R-SWA runs the KV cache as a fixed-length queue: each new token pushes out the oldest, so memory use equals the fixed sum of prefix length and window size rather than growing with total output length.
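The fixed-length queue behaviour can be illustrated with Python's `collections.deque`. This is a sketch under stated assumptions: the class `FixedWindowKVCache` and its tuple entries are hypothetical stand-ins for real key/value tensors, and the 256-entry prefix simply mirrors the one-page figure quoted above.

```python
from collections import deque


class FixedWindowKVCache:
    """Toy cache: a permanent prefix plus a bounded queue of recent entries."""

    def __init__(self, prefix_entries, window: int = 128):
        self.prefix = tuple(prefix_entries)   # encoded once, never evicted
        self.recent = deque(maxlen=window)    # a full deque drops its oldest item

    def append(self, entry) -> None:
        """Store the entry for a newly generated token."""
        self.recent.append(entry)

    def visible(self) -> list:
        """Everything the next token may attend to."""
        return list(self.prefix) + list(self.recent)

    def __len__(self) -> int:
        return len(self.prefix) + len(self.recent)


# One page compressed to 256 visual tokens, then 1,000 generated tokens.
cache = FixedWindowKVCache([("img", i) for i in range(256)], window=128)
for step in range(1000):
    cache.append(("out", step))

print(len(cache))        # 384
print(cache.recent[0])   # ('out', 872)
```

The cache size settles at prefix length plus window size (256 + 128 = 384 here) however many tokens are generated, matching the memory bound described in the paragraph.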
How well does Unlimited OCR perform on benchmarks and long documents?
Unlimited OCR scores 93 percent overall on the OmniDocBench v1.5 document benchmark, six percentage points above the DeepSeek OCR baseline, and hits 93.92 percent on OmniDocBench v1.6, putting it at the top of the end-to-end system rankings. In long-horizon tests the model keeps its edit distance below 0.11 even past 40 pages, and it records a Distinct-35 score of 97 percent at 40-plus pages.
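The article does not define the two long-horizon metrics, so the following is only a sketch of their common textbook forms: a Levenshtein distance normalized by the longer sequence, and distinct-n as the share of unique n-grams (Distinct-35 would then be `n=35`). Both definitions, and the function names, are assumptions rather than the benchmark's exact procedure.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer length (0.0 means identical)."""
    if not pred and not ref:
        return 0.0
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,               # delete from pred
                            curr[j - 1] + 1,           # insert into pred
                            prev[j - 1] + (p != r)))   # substitute or match
        prev = curr
    return prev[-1] / max(len(pred), len(ref))


def distinct_n(tokens: list, n: int) -> float:
    """Fraction of n-grams that are unique; low values signal repetition loops."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # convention chosen here for inputs shorter than n
    return len(set(ngrams)) / len(ngrams)


print(normalized_edit_distance("kitten", "sitting"))  # 3 edits over 7 characters
print(distinct_n(list("abcabc"), 2))                  # 0.6
```

A high distinct-n at long output lengths indicates the decoder is not falling into repeated spans, which is the failure mode a restricted lookback might be expected to risk.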
Restricting the window to 128 tokens on single pages does not harm accuracy; the authors report a slight improvement, which they attribute to R-SWA forcing tighter focus on the dense OCR task. Speed also benefits: in Base mode Unlimited OCR reaches 5,580 tokens per second versus 4,951 for DeepSeek OCR, a 12.7 percent increase. In a theoretical ideal-parallelism comparison, the authors show a 35 percent lead at around 6,000 output tokens, while the baseline's throughput drops with length.
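The 12.7 percent figure follows directly from the two throughput numbers quoted above. A short check, with `relative_gain` as an illustrative helper name:

```python
def relative_gain(new: float, baseline: float) -> float:
    """Percentage increase of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


gain = relative_gain(5580, 4951)   # tokens per second, Base mode
print(f"{gain:.1f}%")              # 12.7%
```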
The team notes that the remaining errors stem mainly from Base mode's resolution limit when text becomes tiny, not from attention lost to R-SWA. Training used about two million document samples, split nine-to-one between single-page and multi-page data; the multi-page data was built synthetically by stitching single pages into documents of two to 50 pages. All data was packed into sequences of 32,000 tokens, and training ran for 4,000 steps on 8×16 Nvidia A800 GPUs. The DeepEncoder stayed frozen, and only the language model parameters were updated.
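The stitching step might look roughly like the sketch below. The article gives only the page range (two to 50), so everything else here is assumed for illustration: the sample layout with `image` and `text` fields, the `PAGE_BREAK` separator, and sampling pages without repetition.

```python
import random

PAGE_BREAK = "\n<page_break>\n"  # hypothetical separator, not taken from the source


def stitch_document(single_pages: list, rng: random.Random,
                    min_pages: int = 2, max_pages: int = 50) -> dict:
    """Build one synthetic multi-page sample from single-page samples."""
    upper = min(max_pages, len(single_pages))
    n_pages = rng.randint(min_pages, upper)       # inclusive on both ends
    chosen = rng.sample(single_pages, n_pages)    # distinct pages, random order
    return {
        "images": [page["image"] for page in chosen],
        "text": PAGE_BREAK.join(page["text"] for page in chosen),
    }


# Placeholder single-page pool; real samples would pair page images with transcripts.
pool = [{"image": f"page_{i}.png", "text": f"text of page {i}"} for i in range(200)]
doc = stitch_document(pool, random.Random(0))
print(len(doc["images"]), "pages stitched")
```

Passing a seeded `random.Random` keeps the synthetic set reproducible, which helps when comparing training runs.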
Why it matters
R-SWA stops the KV cache from ballooning as output length grows, which removes a practical limit that forced most systems to process pages sequentially and reset state. That lets a single inference pass handle long documents without slowing down, reducing the infrastructure penalty of long-horizon OCR. The approach also keeps visual encodings stable, so image features do not degrade over long decoding runs.
The technique applies beyond OCR: the authors propose R-SWA for other reference-based tasks such as speech recognition and translation. The paper also points out an industry effect: image-based text can use far less compute than its digital equivalent, a trait developers already exploit to cut token costs in other models.
What to watch
Baidu plans 128,000-token training runs and a prefill pool that would let the model fetch relevant KV blocks dynamically, a concrete next milestone that would raise the model's effective document capacity. Also watch whether R-SWA appears in speech and translation systems and whether the 128,000-token models preserve the reported throughput and low error rates.
Code and model weights are available on GitHub and Hugging Face, the model runs on ModelScope and the inference engines vLLM and SGLang, and a demo is hosted on Hugging Face Spaces. The authors frame the decoder's restricted lookback as a kind of "soft forgetting" modeled on how humans copy text, and they see the idea as transferable to other long-reference tasks.
| Item | Unlimited OCR | DeepSeek OCR (baseline) |
|---|---|---|
| OmniDocBench v1.5 overall score | 93 percent | Six percentage points lower than Unlimited OCR |
| OmniDocBench v1.6 overall score | 93.92 percent (top end-to-end ranking) | n/a |
| Long-horizon edit distance (past 40 pages) | Below 0.11 | n/a |
| Distinct-35 at 40+ pages | 97 percent | n/a |
| Throughput, Base mode | 5,580 tokens/s | 4,951 tokens/s |
| Decoder attention window on output | Last 128 generated tokens | Full attention (grows with output) |
Written by The Brieftide · Source: The Decoder