STEER attack: multilingual jailbreaks hit 93% on open 8B models
STEER converts refusal-trigger words into low-resource languages to bypass safety.
STEER, a technique introduced in a July 2, 2026 arXiv paper by Joshua Adrian Cahyono, is a gradient-guided attack that manipulates multilingual phrasing to bypass LLM safety refusals. The paper shows STEER converts words that drive refusal into low-resource languages iteratively, suppressing refusal while keeping the harmful intent of the prompt.
What is STEER and how does it work?
STEER stands for Safety Targeted Embedding Exploit via Refinement. It is a gradient-guided method: it uses gradients to score how strongly each word contributes to the model's refusal behavior, then iteratively replaces the highest-scoring words with translations into low-resource languages, lowering the model's refusal signal while preserving the prompt's intent.
The paper frames the technique as a probe of an "epistemic gap" created when safety training is concentrated in English. By moving parts of a prompt out of the English training distribution, STEER suppresses the model's refusal mechanism without needing direct access to the target model's safety policy text.
How effective is STEER across models and benchmarks?
STEER achieves high attack success rates on multiple benchmarks: up to 93.0% on JailbreakBench and 96.7% on AdvBench across six open-source 8B-parameter models. The paper also reports that STEER outperforms both random code-switching and Greedy Coordinate Gradient (GCG) in these tests.
Beyond the six open-source models, STEER-crafted prompts transferred to GPT-4o-mini, reaching a 35.5% attack success rate without any access to that target model. The paper uses this result to argue that the weakness is not architecture-specific.
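For readers unfamiliar with the metric, attack success rate is simply the share of benchmark prompts for which the attack was judged to succeed. The sketch below is a minimal illustration, assuming each prompt has already been given a boolean success label. How the paper's judge assigns those labels is not described here, and the function name and the toy labels are hypothetical.

```python
def attack_success_rate(judgments):
    """Return the percentage of prompts judged as successful attacks.

    `judgments` is a sequence of booleans, one per benchmark prompt.
    The labelling procedure itself is an assumption of this sketch.
    """
    if not judgments:
        raise ValueError("judgments must not be empty")
    successes = sum(1 for judged_success in judgments if judged_success)
    return 100.0 * successes / len(judgments)


# Toy labels for illustration only, not data from the paper.
toy_judgments = [True, False, True, True]
print(f"{attack_success_rate(toy_judgments):.1f}%")  # prints "75.0%"
```

A figure such as 93.0% on JailbreakBench is this ratio computed over that benchmark's prompts.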
Why does this expose a multilingual safety problem?
The paper argues that English-dominated safety alignment leaves it unclear how refusal mechanisms generalize to low-resource languages and mixed-language code-switching, and that this gap can be exploited. Because STEER reduces refusal by moving refusal-triggering tokens into languages underrepresented in safety training, it demonstrates that English-focused alignment cannot be assumed to cover multilingual or code-switched inputs.
Cahyono frames the issue as an epistemic gap: models appear to generate harmful outputs confidently when faced with inputs that fall outside the distribution of their safety training. The observed transfer to GPT-4o-mini, at 35.5% success without model access, supports the claim that the vulnerability spans architectures rather than being limited to one implementation.
What should defenders do differently?
The paper recommends two remedies: broaden coverage during alignment to include low-resource languages and code-switched inputs, and deploy mechanisms that explicitly detect and abstain on out-of-distribution inputs. Those steps aim to close the "epistemic gap" by preventing safety classifiers from failing silently when prompts stray from English-language distributions.
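As a rough illustration of the second remedy, the sketch below flags inputs that look code-switched so that a system can abstain or route them for extra review. It is not the paper's method. It is a simple heuristic under stated assumptions: the function names, the tiny `EXAMPLE_VOCAB` word list, and the 0.3 threshold are all hypothetical placeholders, and a production system would more plausibly rely on a trained language-identification model.

```python
import unicodedata


def scripts_used(text):
    """Approximate the writing scripts in `text` by taking the first word
    of each alphabetic character's Unicode name (e.g. LATIN, CYRILLIC)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def should_abstain(text, known_vocab, max_unknown_ratio=0.3):
    """Heuristic out-of-distribution check (assumption, not from the paper).

    Abstain when the prompt mixes scripts, or when too many words fall
    outside the vocabulary the safety layer is assumed to cover.
    """
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    if len(scripts_used(text)) > 1:
        return True
    unknown = sum(1 for w in words if w not in known_vocab)
    return unknown / len(words) > max_unknown_ratio


# Hypothetical vocabulary for the demo; far too small for real use.
EXAMPLE_VOCAB = {"please", "explain", "how", "to", "bake", "bread"}

print(should_abstain("Please explain how to bake bread", EXAMPLE_VOCAB))   # False
print(should_abstain("Please explain how to bake хлеб", EXAMPLE_VOCAB))    # True (mixed scripts)
print(should_abstain("Tolong jelaskan how to bake roti", EXAMPLE_VOCAB))   # True (3 of 6 words unknown)
```

The script check alone would miss low-resource languages written in Latin script, which is why the sketch also counts out-of-vocabulary words. Both signals are crude and would produce false positives on ordinary multilingual text, so they are better suited to triggering extra review than to blocking outright.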
What to watch
Watch for follow-up work that measures STEER on models at other parameter scales and for evaluations that test explicit out-of-distribution detectors on code-switched inputs. A concrete signal will be whether teams extend safety alignment datasets to include the low-resource languages and code-switching patterns STEER exploits, and whether that reduces the reported 93.0% and 96.7% success rates on JailbreakBench and AdvBench.
Author and submission details: the paper, "Safety Targeted Embedding Exploit via Refinement," was submitted to arXiv on July 2, 2026 by Joshua Adrian Cahyono.
| Item | Attack success rate (%) | Notes |
|---|---|---|
| JailbreakBench | 93.0 | Measured across six open-source 8B-parameter models |
| AdvBench | 96.7 | Measured across six open-source 8B-parameter models |
| GPT-4o-mini (transfer) | 35.5 | Cross-model transfer success without access to target |
Written by The Brieftide · Source: arXiv