HARC: Coupling harmfulness and refusal for LLM safety
HARC pairs harmfulness and refusal directions at prompt and response positions to boost robustness while preserving capability.
TL;DR
- HARC pairs harmfulness and refusal directions at prompt and response positions to boost robustness while preserving capability.
- The authors show the intervention is confined to the harmfulness-refusal subspace, leaves the rest of the residual stream intact, and does not inflate over-refusal.
- HARC pairs the harmfulness and refusal directions across both prompt and response positions via a fine-tuning intervention confined to that subspace, leaving the rest of the residual stream unchanged.
HARC, a fine-tuning method introduced by Shei Pern Chua and Fangzhao Wu in a paper submitted on 1 Jul 2026 (arXiv:2607.00572), couples the model's internal harmfulness and refusal directions across both prompt and response positions to improve safety alignment without harming general capability. The authors show the intervention is confined to the harmfulness-refusal subspace, leaves the rest of the residual stream intact, and does not inflate over-refusal.
What did the authors find about jailbreaks and internal safety representations?
They found that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream at prompt-side token positions, and that jailbreaks succeed by suppressing either the refusal or the harmfulness direction before any token is generated. Prior work had established that harmfulness and refusal appear as separate directions at prompt-side positions; this paper extends that analysis and shows that distinct attack classes occupy separable regions of a two-dimensional harmfulness-refusal plane. The authors also extended the analysis to response-token positions and discovered that the model often recognizes harmful content while it is generating that content, even when it failed to recognize the input as harmful at the prompt side.
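To make the idea of a harmfulness-refusal plane concrete, here is a minimal sketch of how an activation could be placed in such a plane. It assumes each direction is estimated as a difference of group means, a common choice in interpretability work; the brief does not say how the paper derives its directions. The function names and the 4-dimensional toy vectors are hypothetical stand-ins for real residual-stream activations.

```python
from math import sqrt


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def difference_of_means(positive, negative):
    """Unit vector pointing from the mean of `negative` to the mean of `positive`.

    Assumption: difference of means is used here for illustration only;
    it is not confirmed as the paper's estimator.
    """
    diff = [p - n for p, n in zip(mean(positive), mean(negative))]
    norm = sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [x / norm for x in diff]


def plane_coordinates(activation, harm_dir, refusal_dir):
    """Coordinates of an activation along the two directions."""
    return dot(activation, harm_dir), dot(activation, refusal_dir)


# Toy activations (hypothetical): dimension 0 carries harmfulness,
# dimension 1 carries refusal, dimensions 2 and 3 are unrelated features.
harmful = [[2.0, 0.1, 0.0, 0.3], [1.8, -0.1, 0.2, 0.3]]
harmless = [[0.0, 0.1, 0.0, 0.3], [-0.2, -0.1, 0.2, 0.3]]
refused = [[0.1, 1.5, 0.0, 0.3], [0.1, 1.7, 0.2, 0.3]]
complied = [[0.1, -0.5, 0.0, 0.3], [0.1, -0.3, 0.2, 0.3]]

harm_dir = difference_of_means(harmful, harmless)
refusal_dir = difference_of_means(refused, complied)

# A jailbreak-like prompt: still reads as harmful, but refusal is suppressed.
jailbreak = [1.9, -0.4, 0.1, 0.3]
harm_coord, refusal_coord = plane_coordinates(jailbreak, harm_dir, refusal_dir)
print(f"harmfulness={harm_coord:.2f}, refusal={refusal_coord:.2f}")
```

In this toy setup the jailbreak-like activation lands in the high-harmfulness, low-refusal corner of the plane, which is the kind of region an attack that suppresses only the refusal direction would occupy.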
How does HARC work and what does it change?
HARC pairs the harmfulness and refusal directions across both prompt and response positions via a fine-tuning intervention confined to that subspace, leaving the rest of the residual stream unchanged. By coupling those two directions, the method prevents jailbreaks that rely on suppressing only one direction. Because the intervention operates inside the harmfulness-refusal subspace, the authors report that it preserves general model capability and avoids inflating refusal rates. The paper states that, across extensive experiments, HARC achieves a stronger robustness-capability-usability trade-off than any of the six baselines tested, which span the major training-time and inference-time safety methods.
The paper reports that the harmfulness and refusal directions at prompt and response positions transfer across five model families and two scales without architecture-specific tuning, indicating the signals the method targets are not tied to a single architecture. Because the tuning is confined to a low-dimensional subspace rather than broad model weights or output filters, HARC aims to block jailbreak vectors while leaving non-safety behavior intact.
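The claim that a subspace-confined change leaves everything else intact can be illustrated with a little linear algebra. The sketch below is not the authors' fine-tuning procedure: it edits a single toy activation vector directly, and the coupling rule it applies (raise the second in-plane coordinate to match the first) is a hypothetical stand-in. It only shows that when an edit touches coordinates inside the span of two directions, the component orthogonal to that span is unchanged.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def scale(v, s):
    return [x * s for x in v]


def orthonormal_basis(d1, d2):
    """Gram-Schmidt: orthonormal basis (e1, e2) for the plane spanned by d1 and d2."""
    e1 = scale(d1, 1.0 / sqrt(dot(d1, d1)))
    residual = sub(d2, scale(e1, dot(d2, e1)))
    norm = sqrt(dot(residual, residual))
    if norm < 1e-12:
        raise ValueError("directions are collinear and do not span a plane")
    return e1, scale(residual, 1.0 / norm)


def split(activation, e1, e2):
    """Return the in-plane coordinates and the component orthogonal to the plane."""
    c1, c2 = dot(activation, e1), dot(activation, e2)
    in_plane = add(scale(e1, c1), scale(e2, c2))
    return (c1, c2), sub(activation, in_plane)


def edit_in_subspace(activation, e1, e2, new_coords):
    """Replace only the in-plane coordinates; keep the orthogonal remainder."""
    _, rest = split(activation, e1, e2)
    return add(rest, add(scale(e1, new_coords[0]), scale(e2, new_coords[1])))


# Hypothetical, deliberately non-orthogonal directions in a 4-d toy space.
harm_dir = [1.0, 0.0, 0.0, 0.0]
refusal_dir = [1.0, 1.0, 0.0, 0.0]
e1, e2 = orthonormal_basis(harm_dir, refusal_dir)

activation = [1.9, -0.4, 0.7, -1.2]
(c1, c2), rest_before = split(activation, e1, e2)

# Toy coupling rule (an assumption, not the paper's objective):
# lift the second in-plane coordinate so it is at least as large as the first.
edited = edit_in_subspace(activation, e1, e2, (c1, max(c1, c2)))
coords_after, rest_after = split(edited, e1, e2)

print("in-plane before:", (c1, c2), "after:", coords_after)
print("orthogonal part unchanged:",
      all(abs(a - b) < 1e-9 for a, b in zip(rest_before, rest_after)))
```

A fine-tuning method works on weights rather than on a single vector, so this is only an analogy for the property the authors report, not a reproduction of how they achieve it.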
Why it matters
Targeting the harmfulness-refusal subspace reframes safety interventions from coarse, model-wide changes to a focused, representation-level approach. That focus explains why HARC can both block classes of jailbreaks and maintain capability: it alters the internal safety signals without disrupting unrelated representations. If the reported transfer across five model families and two scales holds up under broader testing, HARC could influence how teams design training-time and inference-time safety pipelines by shifting effort into identifying and coupling internal safety directions rather than applying blanket filtering or aggressive refusal policies.
What to watch
Look for replication of the paper's main claims: whether HARC's subspace intervention reproduces the strongest robustness-capability-usability trade-off across independent benchmarks and additional model families. Also watch for public releases of code and data associated with the submission that would let other researchers verify the separable attack regions in the harmfulness-refusal plane and test the method on larger or different-scale models.
Sources: HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment, Shei Pern Chua and Fangzhao Wu, arXiv:2607.00572, submitted 1 Jul 2026.
| Item | Description | Reported result |
|---|---|---|
| HARC | Pairs harmfulness and refusal across prompt and response positions | Achieves the strongest robustness-capability-usability trade-off among six baselines |
| Six baselines (training-time and inference-time methods) | Major existing training-time and inference-time safety approaches | Weaker trade-offs compared with HARC as reported in the paper |
Written by The Brieftide · Source: arXiv