AI Safety · 4 min read

SAGE unlearning fix: post-hoc sanitization for LLM retention

SAGE sanitizes final unlearning update vectors to restore retained LLM capabilities without rerunning the original unlearning pipeline.

The Brieftide

TL;DR

  • 01 SAGE sanitizes final unlearning update vectors to restore retained LLM capabilities without rerunning the original unlearning pipeline.
  • 02 SAGE, introduced in an arXiv paper submitted on 16 Jun 2026, is a post-hoc method that sanitizes final unlearning update vectors to recover retained capabilities in large language models.
  • 03 The paper, arXiv:2606.18309, lists nine authors including Jingyuan Zhang and Yucheng Bai and is available via DOI https://doi.org/10.48550/arXiv.2606.18309.

SAGE, introduced in an arXiv paper submitted on 16 Jun 2026, is a post-hoc method that sanitizes final unlearning update vectors to recover retained capabilities in large language models. The paper, arXiv:2606.18309, lists nine authors including Jingyuan Zhang and Yucheng Bai and is available via DOI https://doi.org/10.48550/arXiv.2606.18309.

How does SAGE work?

SAGE is a source-agnostic, post-hoc correction applied to the final update vector produced by any unlearning method. It takes that vector as given and works in three steps: it collects real module inputs from a small retain proxy dataset, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form. The solution suppresses update components that align with high-energy retained directions, which are the ones most likely to damage retention, while preserving the source method's forgetting carrier.
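The three steps above can be illustrated with a minimal NumPy sketch for a single linear module. This is not the paper's formulation: the quadratic objective, the function names `dominant_geometry` and `sanitize_update`, and the parameters `k` and `lam` are all assumptions chosen to show what a closed-form, geometry-aware shrinkage of an update could look like.

```python
import numpy as np

def dominant_geometry(X, k):
    """Top-k directions and energies of retain-proxy module inputs.

    X has shape (d_in, n): one column per module input collected from the proxy.
    """
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    energy = S[:k] ** 2 / X.shape[1]  # mean squared activation along each direction
    return U[:, :k], energy

def sanitize_update(delta_W, U, energy, lam=1.0):
    """Illustrative closed-form sanitizer (assumed objective, not the paper's).

    Minimizes ||D - delta_W||_F^2 + lam * sum_i energy_i * ||D @ u_i||^2 over D.
    The first term anchors D to the source update; the second penalizes output
    shifts along high-energy retained directions u_i (columns of U).
    """
    # Per-direction shrink factor in [0, 1): higher energy means stronger suppression.
    shrink = lam * energy / (1.0 + lam * energy)
    # Components of delta_W outside span(U) pass through unchanged.
    return delta_W - ((delta_W @ U) * shrink) @ U.T
```

Under these assumptions, increasing `lam` pushes the shrink factors toward 1, so the step approaches a hard projection of the update onto the complement of the retained directions; a small `lam` leaves the source update nearly untouched.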

The paper frames this pipeline around what the authors call the "retention activation bias", a quantity that measures retention damage independently of how the unlearning update was produced. Because SAGE operates only on the final vector, it does not require rerunning the original unlearning pipeline, and its closed-form sanitization is agnostic to the unlearning source. The method thereby separates the forgetting carrier, which carries the intended removal, from components that disproportionately harm retained behaviors.
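To make the idea of a source-independent damage measure concrete, here is a small sketch of one possible proxy: the mean squared change an update induces in a linear module's output over retain-proxy inputs. The paper's exact definition of retention activation bias is not reproduced in this brief, so the function name `retain_output_shift` and the formula itself are assumptions for illustration only.

```python
import numpy as np

def retain_output_shift(delta_W, X):
    """Hypothetical proxy for retention damage (not the paper's definition).

    delta_W: (d_out, d_in) update to a linear module's weights.
    X: (d_in, n) retain-proxy inputs, one column per input.
    Returns the mean squared norm of delta_W @ x over the columns of X.
    """
    return float(np.mean(np.sum((delta_W @ X) ** 2, axis=0)))

# Toy inputs that only activate the first input dimension.
X = np.zeros((3, 5))
X[0] = np.arange(1, 6)

# One update writes along the active dimension, the other along an unused one.
aligned = np.zeros((2, 3))
aligned[:, 0] = 1.0
orthogonal = np.zeros((2, 3))
orthogonal[:, 1] = 1.0

print(retain_output_shift(aligned, X))     # 22.0
print(retain_output_shift(orthogonal, X))  # 0.0
```

The measure depends only on the update and the retained inputs, not on which unlearning method produced the update, which is the property the article attributes to the paper's quantity.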

What evidence do the authors present?

The authors report that SAGE was evaluated across multiple unlearning methods, model scales, and benchmarks, and that it "consistently relieves the retain-forget trade-off." The abstract summarizes the method and this empirical claim, and presents post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning. The submission lists the full author set: Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, and Xiaolin Huang. The arXiv record includes links to code, data, and media associated with the article.

Why it matters

Unlearning in large models routinely forces a trade-off: stronger forgetting often reduces retained capabilities. SAGE reframes part of that trade-off as a geometric problem in activation space and offers a lightweight, post-hoc correction step. That matters because, if the results hold, practitioners could apply any existing unlearning method and then recover retention without repeating expensive training or unlearning runs. If the paper's cross-method, cross-scale claims survive community replication, SAGE could become a practical lever for teams balancing data-removal requests and service quality.

What to watch

Look for community replication of the paper's claim that SAGE "consistently relieves the retain-forget trade-off" across the unlearning methods and model scales the authors studied. Also check the paper's arXiv record for linked code and demos that would let practitioners run the closed-form sanitization on their own final update vectors.

References: arXiv:2606.18309, "SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector," submitted 16 Jun 2026, DOI https://doi.org/10.48550/arXiv.2606.18309.

Figure: SAGE post-hoc sanitization data flow. Original LLM model → any unlearning method (produces the final update vector) → final unlearning update vector → SAGE sanitizer (source-anchored closed-form optimization) → sanitized update vector → model after sanitized update. In parallel, a small retain proxy collects real module inputs, from which the dominant activation geometry is extracted and passed to the sanitizer as its anchor geometry.

Written by The Brieftide · Source: arXiv
