Refusal in Qwen2.5 and Llama-3.1: persona gates responses

Viola Zhong and Qirui Li show that steering toward a compliant persona suppresses refusal in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct.

The Brieftide

TL;DR

  • 01 Viola Zhong and Qirui Li show that steering toward a compliant persona suppresses refusal in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct.
  • 02 Viola Zhong and Qirui Li submitted a paper to arXiv on 24 Jun 2026 showing that persona steering can suppress refusal behavior in instruction-tuned chat models.
  • 03 They ran interventions in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct and found that a compliant persona direction can cut Llama's refusal rate from 97% to 2%.

Viola Zhong and Qirui Li submitted a paper to arXiv on 24 Jun 2026 showing that persona steering can suppress refusal behavior in instruction-tuned chat models. They ran interventions in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct and found that a compliant persona direction can cut Llama's refusal rate from 97% to 2%.

What did the paper find?

The paper finds that refusal and persona are not independent linear effects: a compliant persona gates refusal at the late-layer expression stage. The authors extract a compliant model-persona direction and a refusal direction, then intervene on both. Steering along the compliant persona direction suppresses refusal. Reintroducing the refusal direction only partially restores refusal when applied at late layers, and fails to restore it when applied at early layers. Projecting out the persona direction in a late-layer window restores refusal to baseline, while projecting out a random direction does not.
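To make "projecting out a direction" concrete, here is a minimal sketch on a toy activation vector. It is not the authors' code: the 4-dimensional vectors, the names `persona_dir` and `project_out`, and the choice of a Gaussian random control direction are all assumptions made for illustration. In a real model the same operation would be applied to hidden states at the chosen layers.

```python
import math
import random

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v rescaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def project_out(h, direction):
    """Remove the component of activation h that lies along `direction`."""
    d = unit(direction)
    coeff = dot(h, d)
    return [x - coeff * y for x, y in zip(h, d)]

# Toy activation and a hypothetical persona direction (illustrative values only).
persona_dir = [1.0, 0.0, 0.0, 0.0]
h = [3.0, 1.0, -2.0, 0.5]

# Ablating the persona direction zeroes h's component along it.
h_ablated = project_out(h, persona_dir)

# Control condition: ablate a random direction of the same dimension instead.
rng = random.Random(0)
random_dir = [rng.gauss(0.0, 1.0) for _ in h]
h_control = project_out(h, random_dir)
```

The control matters because any projection perturbs the activation. If only the persona projection restores refusal, the effect can be attributed to that specific direction rather than to the perturbation itself.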

How did the authors test persona and refusal interactions?

They intervened on activation-space directions in two instruction-tuned chat models, Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, by extracting a compliant persona direction and a refusal direction and manipulating them. Interventions included steering the persona direction, reintroducing the refusal direction at different layer depths, and projecting out the persona direction in a late-layer window. Outcomes were model-specific; the concrete figure the paper reports is for Llama, where the refusal rate drops from 97% to 2% under persona steering.
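The interventions described above can be sketched on toy data. This brief does not say how the authors extract their directions, so the sketch assumes a difference-of-means estimate, which is one common choice and may not be theirs. The function names, the 2-dimensional activations, the scaling factor `alpha`, and the layer windows are all hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means(pos, neg):
    """Assumed extraction method: mean(pos activations) - mean(neg activations)."""
    mp, mn = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mp, mn)]

def steer(layer_acts, direction, alpha, window):
    """Add alpha * direction to activations at layers in [start, end)."""
    start, end = window
    steered = []
    for layer, h in enumerate(layer_acts):
        if start <= layer < end:
            steered.append([x + alpha * d for x, d in zip(h, direction)])
        else:
            steered.append(list(h))  # layers outside the window are left unchanged
    return steered

# Toy activations collected under a compliant persona vs. a neutral prompt.
compliant_acts = [[2.0, 0.0], [4.0, 0.0]]
neutral_acts = [[0.0, 0.0], [2.0, 0.0]]
persona_dir = diff_of_means(compliant_acts, neutral_acts)

# Four toy "layers", each holding one activation vector.
layer_acts = [[0.0, 1.0] for _ in range(4)]

# Apply the same intervention at early layers and at late layers.
early = steer(layer_acts, persona_dir, alpha=0.5, window=(0, 2))
late = steer(layer_acts, persona_dir, alpha=0.5, window=(2, 4))
```

Restricting the edit to a layer window is what lets an experiment separate early-layer from late-layer effects: the same direction and the same `alpha` are applied, and only the depth changes.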

Why it matters

The result shows refusal is computed upstream but expressed downstream, so treating refusal as a single isolated direction misses its dependence on persona. That matters for mechanistic interpretability and for safety work that attempts to locate and remove refusal behavior by editing a single direction. If refusal is gated by persona at late layers, interventions need to consider persona representations and layer-specific expression, not only the upstream computation of a refusal signal.

What to watch

Look for replication across more models and for work that maps exactly which layers compute refusal versus which layers express it. The paper has been accepted to the ICML 2026 Mechanistic Interpretability workshop, so follow that venue for follow-up presentations and any code or demos the authors may release.

Details and source note: the paper is arXiv:2606.26161, submitted 24 Jun 2026, authored by Viola Zhong and Qirui Li, and carries the comment "Accepted to the ICML 2026 Mechanistic Interpretability workshop." The authors name the two models tested as Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct and report that Llama's refusal rate falls from 97% to 2% under persona steering.

Refusal rate at baseline vs. after compliant persona steering (from paper)

Model                    Baseline        After persona steering
Llama-3.1-8B-Instruct    97%             2%
Qwen2.5-7B-Instruct      not reported    not reported

Written by The Brieftide · Source: arXiv
