MosaicLeaks benchmark: PA-DR cuts leakage from 34% to 9.9%
MosaicLeaks, published June 18, 2026, uses 1,001 multi-hop chains to show PA-DR keeps chain success near 58.7% while cutting answer/full-information leakage to 9.9%.
TL;DR
- MosaicLeaks, published June 18, 2026, defines a controlled deep-research privacy task and a dataset of 1,001 multi-hop research chains over local enterprise documents and a fixed web corpus.
- The benchmark measures three leakage types and shows that a privacy-aware RL method, PA-DR, holds chain success near 58.7% while cutting answer/full-information leakage to 9.9%.
What does MosaicLeaks measure and how is the dataset built?
MosaicLeaks measures whether an observer can infer private enterprise facts from an agent's outgoing web-query log, using three specific leakage categories: intent leakage, answer leakage, and full-information leakage. The benchmark contains 1,001 chains that interleave local and web sub-questions, with a final split of 559 training chains, 98 validation chains, and 344 held-out-company test chains.
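As a rough illustration of what a chain that interleaves local and web sub-questions might look like in code, here is a minimal Python sketch. The `Hop` and `Chain` classes, their field names, and the example company and questions are hypothetical and are not taken from the benchmark's release. Only the split sizes come from the figures above.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Hop:
    # Hypothetical schema: where the sub-question has to be answered.
    source: Literal["local", "web"]
    question: str
    answer: str


@dataclass
class Chain:
    company: str
    hops: list[Hop]

    def interleaves_sources(self) -> bool:
        """True when the chain contains both local and web hops."""
        return {h.source for h in self.hops} == {"local", "web"}


# Split sizes as reported for the benchmark.
SPLITS = {"train": 559, "validation": 98, "test": 344}

# Invented example chain, for illustration only.
example = Chain(
    company="Acme",
    hops=[
        Hop("local", "What was Acme's Q3 churn?", "4.2%"),
        Hop("web", "What is a typical churn rate for this sector?", "about 5%"),
    ],
)

print(sum(SPLITS.values()))           # 1001
print(example.interleaves_sources())  # True
```

The three splits add up to the 1,001 chains the benchmark reports. Holding out whole companies for the test split means test chains never share an enterprise with training chains.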
Construction happens in three stages: seed private facts from enterprise documents, create bridge documents so each web hop depends on earlier local answers, and validate chains for answerability and retrievability. The dataset intentionally creates tasks likely to induce the "mosaic effect," where individually benign queries together reveal private information.
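The mosaic effect can be made concrete with a toy check. The sketch below is not the benchmark's observer or its leakage metric. It assumes, purely for illustration, that a private fact can be split into text fragments and that a fragment counts as revealed when it appears verbatim in a query. The function names, the fact, and the queries are all invented.

```python
from typing import Iterable


def fragments_revealed(query: str, fragments: Iterable[str]) -> set[str]:
    """Return the private fragments that appear verbatim in one query."""
    lowered = query.lower()
    return {f for f in fragments if f.lower() in lowered}


def classify_log(query_log: list[str], fragments: set[str]) -> str:
    """Label a query log as 'direct', 'mosaic', or 'none' for one private fact."""
    per_query = [fragments_revealed(q, fragments) for q in query_log]
    # Direct leak: a single query exposes every fragment on its own.
    if any(found == fragments for found in per_query):
        return "direct"
    # Mosaic leak: no single query is enough, but the log as a whole is.
    combined = set().union(*per_query) if per_query else set()
    if combined == fragments:
        return "mosaic"
    return "none"


# Hypothetical private fact, split into fragments: "Acme's Q3 churn was 4.2%".
fact = {"acme", "q3 churn", "4.2%"}
log = [
    "acme competitors pricing",
    "industry average q3 churn saas",
    "is 4.2% a high churn rate",
]

print(classify_log(log, fact))      # mosaic
print(classify_log(log[:2], fact))  # none
```

Each query in `log` looks harmless alone, yet together they let an observer reassemble the full fact, which is the pattern the chains are built to provoke.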
How severe was leakage on standard agents and what did training change?
A base Qwen3-4B agent had 48.7% strict chain success and 34.0% answer/full-information leakage on the benchmark. Training for task-only performance raised strict chain success to 59.3% but increased answer/full-information leakage to 51.7%, because models learned to pack more private context into web queries.
PA-DR, Privacy-Aware Deep Research, combines situational task rewards with a learned privacy reward. The situational task reward evaluates each planning, choosing, and reading call against similar calls at the same stage, giving precise credit for correct retrieval behavior. The learned privacy reward uses a Qwen3-4B classifier that estimates whether current queries leak private information directly or create a new mosaic leak when added to the existing log, and penalizes the larger risk. With that dual reward, the method achieves 58.7% strict chain success while reducing answer/full-information leakage to 9.9%.
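One way to read the dual reward is sketched below. This is an interpretation under stated assumptions, not the authors' implementation: the per-call `task_score`, the grouping by stage and hop, the mean baseline, the `privacy_weight` parameter, and the choice to charge the penalty only to planning calls are all assumptions made for illustration. The two probabilities stand in for the outputs of the learned classifier.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    stage: str             # "plan", "choose", or "read"
    hop: int               # position of the call in the chain
    task_score: float      # hypothetical per-call task score in [0, 1]
    p_direct: float = 0.0  # assumed classifier output: query leaks by itself
    p_mosaic: float = 0.0  # assumed classifier output: query completes a leak with the log


def situational_advantages(calls: list[Call]) -> list[float]:
    """Score each call against the mean of calls at the same stage and hop."""
    groups: dict[tuple[str, int], list[float]] = defaultdict(list)
    for c in calls:
        groups[(c.stage, c.hop)].append(c.task_score)
    return [c.task_score - mean(groups[(c.stage, c.hop)]) for c in calls]


def privacy_penalty(call: Call) -> float:
    """Penalize the larger of the two risks; assumed to apply to planning calls only."""
    if call.stage != "plan":
        return 0.0
    return max(call.p_direct, call.p_mosaic)


def combined_rewards(calls: list[Call], privacy_weight: float = 1.0) -> list[float]:
    advantages = situational_advantages(calls)
    return [a - privacy_weight * privacy_penalty(c) for a, c in zip(advantages, calls)]


# Two planning calls that are equally useful for the task but differ in risk.
revealing = Call("plan", hop=1, task_score=1.0, p_direct=0.9, p_mosaic=0.2)
discreet = Call("plan", hop=1, task_score=1.0, p_direct=0.05, p_mosaic=0.1)
rewards = combined_rewards([revealing, discreet])
print(rewards[1] > rewards[0])  # True
```

Because both calls score the same on the task, the comparison within the stage cancels out and only the privacy term separates them, so the less revealing query ends up with the higher reward.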
Training efficiency also improved. MosaicLeaks reports that situational rewards match outcome-only RL task performance with roughly 5 to 6 times fewer generated samples. In the benchmark's reported figures, outcome-reward training used 963k generated samples, while the task plus PA-DR setup needed 183k samples to reach about 55% strict chain success and 706k samples to reach comparable strict success.
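The sample counts can be turned into ratios with a few lines of arithmetic. The variable names are ours, and reading the "5 to 6 times" claim as the comparison between 963k and 183k samples is an assumption, since the text does not say which pair of numbers the ratio refers to.

```python
# Generated-sample counts as reported above.
outcome_only_samples = 963_000
pa_dr_total_samples = 706_000
pa_dr_samples_to_55 = 183_000

# Ratio against the point where the PA-DR run reaches about 55% strict success.
ratio_to_55 = outcome_only_samples / pa_dr_samples_to_55
# Ratio against the full PA-DR run.
ratio_total = outcome_only_samples / pa_dr_total_samples

print(round(ratio_to_55, 2))  # 5.26
print(round(ratio_total, 2))  # 1.36
```

Only the first ratio falls in the 5 to 6 range, which suggests the efficiency claim refers to reaching a given success level early rather than to total training budget.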
How does PA-DR change agent behavior in practice?
PA-DR does not make the agent stop searching. The benchmark shows PA-DR agents issue more web queries than the base model, but those queries omit revealing specifics such as exact numeric metrics or dates. The situational reward encourages choosing and reading documents that directly answer the current hop, while the privacy reward assigns the privacy cost to the planning decision that would make the query log more revealing. The result is similar task performance with much lower leakage.
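To put "similar task performance with much lower leakage" in numbers, the short calculation below compares the reported percentages. It uses only figures quoted in this article. The dictionary layout and key names are ours.

```python
# (strict chain success %, answer/full-information leakage %) as reported.
results = {
    "base": (48.7, 34.0),
    "task_only_rl": (59.3, 51.7),
    "task_plus_pa_dr": (58.7, 9.9),
}

success, leakage = results["task_plus_pa_dr"]

# Success gap to task-only RL, in percentage points.
success_gap = round(results["task_only_rl"][0] - success, 1)
# Relative leakage reduction versus the base agent and versus task-only RL.
cut_vs_base = round((results["base"][1] - leakage) / results["base"][1] * 100, 1)
cut_vs_task = round((results["task_only_rl"][1] - leakage) / results["task_only_rl"][1] * 100, 1)

print(success_gap)  # 0.6
print(cut_vs_base)  # 70.9
print(cut_vs_task)  # 80.9
```

On these figures PA-DR gives up 0.6 points of strict chain success relative to task-only RL while removing about 71% of the base agent's leakage and about 81% of the task-only agent's.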
Why it matters
MosaicLeaks demonstrates a concrete tension: improving retrieval quality can increase privacy leakage because richer queries carry private fragments. The benchmark shows a fix that trains privacy into the agent's decision process instead of relying on prompts. As the paper states plainly, "You can't prompt privacy in. You have to train it in." That matters for enterprises that deploy research agents and must protect internal metrics, dates, and named facts while still getting useful public information.
What to watch
Broader studies that move beyond the benchmark's synthetic enterprise documents and fixed web corpus will be the next test: the authors note that real deployments, other agent designs, and broader tasks still need their own study. Watch for replications on live enterprise corpora and for adaptations of PA-DR to different agent harnesses and classifier models.
| Setup | Strict chain success | Answer/full-information leakage |
|---|---|---|
| Base Qwen3-4B | 48.7% | 34.0% |
| Task-only RL (Task reward) | 59.3% | 51.7% |
| Situational task reward | 59.3% | 51.7% |
| Task + PA-DR reward | 58.7% | 9.9% |
Written by The Brieftide · Source: Hugging Face