Refusal in Qwen2.5 and Llama-3.1: persona gates responses
Viola Zhong and Qirui Li show that steering toward a compliant persona suppresses refusal in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct.
TL;DR
- Viola Zhong and Qirui Li show that steering toward a compliant persona suppresses refusal in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct.
- Viola Zhong and Qirui Li submitted a paper to arXiv on 24 Jun 2026 showing that persona steering can suppress refusal behavior in instruction-tuned chat models.
- They ran interventions in Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct and found that a compliant persona direction can collapse refusals in Llama from 97% to 2%.
What did the paper find?
The paper finds that refusal and persona are not independent linear effects: a compliant persona gates refusal at the late-layer expression stage. The authors extract a compliant model-persona direction and a refusal direction, then intervene on both. Compliant persona steering suppresses refusal. Reintroducing the refusal direction only partially restores refusal when applied at late layers, and fails to restore it at early layers. Projecting out the persona direction in a late-layer window restores refusal to baseline, while projecting out a random direction does not.
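To make the two operations named above concrete, here is a minimal NumPy sketch of steering along a direction and projecting a direction out of a batch of activations. It is an illustration under stated assumptions, not the authors' code: the function names `unit`, `steer`, and `project_out`, the steering strength `alpha`, and the random toy vectors are all hypothetical, and the brief does not say how the paper scales or normalizes its directions.

```python
import numpy as np

def unit(d):
    """Return d scaled to unit length."""
    d = np.asarray(d, dtype=float)
    return d / np.linalg.norm(d)

def steer(h, d, alpha):
    """Add alpha times the unit direction d to every activation row in h."""
    return np.asarray(h, dtype=float) + alpha * unit(d)

def project_out(h, d):
    """Remove the component of every activation row in h that lies along d."""
    h = np.asarray(h, dtype=float)
    u = unit(d)
    return h - np.outer(h @ u, u)

# Toy stand-ins (assumptions): 4 token positions, hidden size 16.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))
persona = rng.normal(size=16)  # placeholder for a compliant persona direction

steered = steer(h, persona, alpha=5.0)   # push activations toward the persona
ablated = project_out(steered, persona)  # remove the persona component entirely
```

Projecting out a direction removes whatever was added by steering along it, so in this toy setup `ablated` matches what you would get by projecting the persona direction out of the unsteered activations. A random-direction control, as described in the paper, would call `project_out` with an unrelated vector instead.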
How did the authors test persona and refusal interactions?
They intervened on activation-space directions in two instruction-tuned chat models, Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, by extracting a compliant persona direction and a refusal direction and manipulating them. Interventions included steering the persona direction, reintroducing the refusal direction at different layer depths, and projecting out the persona direction in a late-layer window. Outcomes varied by model. The concrete figure the paper reports is for Llama, where the refusal rate drops from 97% to 2% under persona steering.
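The layer-window idea can be sketched on toy data as follows. Two things here are assumptions rather than details from the paper: the brief does not say how the directions were extracted, so the sketch uses a difference of means between two sets of activations, which is one common approach; and the helper names, the 8-layer toy stack, and the window `[6, 8)` are hypothetical. The sketch also treats each layer's activations as independent, whereas in a real model an edit at one layer propagates to later layers.

```python
import numpy as np

def difference_of_means_direction(pos, neg):
    """Unit vector pointing from the mean of neg toward the mean of pos."""
    d = np.mean(pos, axis=0) - np.mean(neg, axis=0)
    return d / np.linalg.norm(d)

def project_out(acts, direction):
    """Remove the component of each row of acts along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

def intervene_in_window(layer_acts, direction, start, stop, fn):
    """Apply fn(acts, direction) to layers in [start, stop); copy the rest."""
    return [fn(a, direction) if start <= i < stop else a.copy()
            for i, a in enumerate(layer_acts)]

rng = np.random.default_rng(0)
n_layers, hidden = 8, 16

# Toy activation sets standing in for "compliant persona" vs. "neutral" prompts.
compliant = rng.normal(loc=1.0, size=(32, hidden))
neutral = rng.normal(loc=0.0, size=(32, hidden))
persona_dir = difference_of_means_direction(compliant, neutral)

# One (tokens x hidden) activation matrix per layer, edited only in late layers.
layer_acts = [rng.normal(size=(4, hidden)) for _ in range(n_layers)]
edited = intervene_in_window(layer_acts, persona_dir, start=6, stop=8,
                             fn=project_out)
```

Varying `start` and `stop` is the knob that distinguishes an early-layer intervention from a late-layer one in this toy version.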
Why it matters
The result shows refusal is computed upstream but expressed downstream, so treating refusal as a single isolated direction misses its dependence on persona. That matters for mechanistic interpretability and for safety work that attempts to locate and remove refusal behavior by editing a single direction. If refusal is gated by persona at late layers, interventions need to consider persona representations and layer-specific expression, not only the upstream computation of a refusal signal.
What to watch
Look for replication across more models and for work that maps exactly which layers compute refusal versus which layers express it. The paper has been accepted to the ICML 2026 Mechanistic Interpretability workshop, so follow that venue for follow-up presentations and any code or demos the authors may release.
Details and source note: the paper is arXiv:2606.26161, submitted 24 Jun 2026, authored by Viola Zhong and Qirui Li, and carries the comment "Accepted to the ICML 2026 Mechanistic Interpretability workshop." The authors name the two models tested as Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct and report the Llama numeric change in refusal rate from 97% to 2% under persona steering.
| Model | Refusal rate before persona steering (%) | Refusal rate under persona steering (%) |
|---|---|---|
| Llama-3.1-8B-Instruct | 97 | 2 |
| Qwen2.5-7B-Instruct | not reported | not reported |
Written by The Brieftide · Source: arXiv