STEER attack: multilingual jailbreaks hit 93% on open 8B models

STEER converts refusal-trigger words into low-resource languages to bypass safety.

The Brieftide

TL;DR

  • STEER converts refusal-trigger words into low-resource languages to bypass safety.
  • STEER, a technique introduced in a July 2, 2026 arXiv paper by Joshua Adrian Cahyono, is a gradient-guided attack that manipulates multilingual phrasing to bypass LLM safety refusals.
  • The paper shows STEER converts words that drive refusal into low-resource languages iteratively, suppressing refusal while keeping the harmful intent of the prompt.

STEER, a technique introduced in a July 2, 2026 arXiv paper by Joshua Adrian Cahyono, is a gradient-guided attack that manipulates multilingual phrasing to bypass LLM safety refusals. The paper shows STEER converts words that drive refusal into low-resource languages iteratively, suppressing refusal while keeping the harmful intent of the prompt.

What is STEER and how does it work?

STEER stands for Safety Targeted Embedding Exploit via Refinement, and it is a gradient-guided method that identifies words contributing most strongly to a model's refusal behavior and translates them into low-resource languages to reduce refusal. The technique scores refusal-contributing words using gradients, then iteratively replaces them with translations to preserve the prompt's intent while lowering the model's refusal signal.
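
The scoring step can be illustrated on a toy model. The following is a minimal sketch of gradient-times-input attribution, not the paper's implementation: the two-dimensional embeddings, the linear "refusal probe" (`W`, `B`), and the function names are all hypothetical, and a real audit would take gradients through the full LLM rather than a hand-written probe.

```python
import math

# Hypothetical 2-d word embeddings and a linear "refusal probe". All values are
# made up for illustration and do not come from the paper.
EMBED = {
    "please":  [0.1, 0.0],
    "explain": [0.2, 0.1],
    "danger":  [1.5, 0.9],
    "recipe":  [0.3, 0.2],
}
W = [1.2, 0.8]   # probe weights (assumed)
B = -1.0         # probe bias (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def refusal_score(words):
    """Toy refusal probability: sigmoid of a linear probe on the mean embedding."""
    n = len(words)
    mean = [sum(EMBED[w][d] for w in words) / n for d in range(len(W))]
    return sigmoid(sum(wd * md for wd, md in zip(W, mean)) + B)

def word_saliency(words):
    """Gradient-times-input score for each word.

    For s = sigmoid(W . mean(e) + B), ds/de_i = s * (1 - s) * W / n,
    so the score of word i is the dot product of that gradient with e_i.
    """
    s = refusal_score(words)
    scale = s * (1.0 - s) / len(words)
    return {w: scale * sum(wd * ed for wd, ed in zip(W, EMBED[w])) for w in words}

prompt = ["please", "explain", "danger", "recipe"]
scores = word_saliency(prompt)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # the word contributing most to the toy refusal score
```

In the method as described, the highest-scoring words would be the candidates for translation. That replacement step is deliberately left out of this sketch.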

The paper frames the technique as a probe of an "epistemic gap" created when safety training is concentrated in English. By moving parts of a prompt out of the English training distribution, STEER suppresses the model's refusal mechanism without needing direct access to the target model's safety policy text.

How effective is STEER across models and benchmarks?

STEER achieves high attack success rates on multiple benchmarks: up to 93.0% on JailbreakBench and 96.7% on AdvBench across six open-source 8B-parameter models. The paper also reports that STEER outperforms both random code-switching and Greedy Coordinate Gradient (GCG) in these tests.

The evaluation covered six open-source models at the 8B-parameter scale on JailbreakBench and AdvBench. STEER-crafted prompts also transferred to GPT-4o-mini, reaching a 35.5% attack success rate without any access to that target model; the paper uses this result to argue that the weakness is not architecture-specific.
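
Attack success rate is the share of prompts for which the attack is judged successful. A minimal sketch of that arithmetic follows. The model names and judge labels are hypothetical placeholders, not data from the paper, and taking the maximum across models is only an assumption about how a figure reported as "up to" might be derived.

```python
def attack_success_rate(judgments):
    """Percentage of prompts for which the judge marked the attack successful."""
    if not judgments:
        raise ValueError("no judgments to score")
    return 100.0 * sum(judgments) / len(judgments)

# Hypothetical judge labels (True = attack judged successful); not paper data.
results = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

per_model = {name: attack_success_rate(j) for name, j in results.items()}
best = max(per_model.values())  # assumed reading of an "up to" figure
print(per_model, best)
```

Reporting only the best model hides the spread across models, so per-model rates are worth keeping alongside any headline number.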

Why does this expose a multilingual safety problem?

The paper argues that when safety alignment is dominated by English, it remains uncertain how refusal mechanisms generalize to low-resource languages and mixed-language code-switching, and that this gap can be exploited. Because STEER reduces refusal by moving refusal-triggering tokens into languages underrepresented in safety training, the technique demonstrates that English-focused alignment cannot be assumed to cover multilingual or code-switched inputs.

Cahyono frames the issue as an epistemic gap: models appear to generate harmful outputs confidently when inputs fall outside the distribution of their safety training. The 35.5% transfer success on GPT-4o-mini, achieved without model access, supports the claim that the vulnerability spans architectures rather than being limited to one implementation.

What should defenders do differently?

The paper recommends two remedies: broaden coverage during alignment to include low-resource languages and code-switched inputs, and deploy mechanisms that explicitly detect and abstain on out-of-distribution inputs. Those steps aim to close the "epistemic gap" by preventing safety classifiers from failing silently when prompts stray from English-language distributions.
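
As one illustration of the second remedy, the sketch below flags inputs that mix writing systems so a pipeline can abstain or route them for review. It is a crude heuristic assumed for this example, not a mechanism from the paper: `scripts_in` and `should_abstain` are hypothetical names, and the approach misses low-resource languages written in Latin script entirely.

```python
import unicodedata

def scripts_in(text):
    """Return the set of script names used by alphabetic characters in text."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # Unicode character names usually begin with the script,
            # e.g. "CYRILLIC SMALL LETTER PE" or "DEVANAGARI LETTER KA".
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

def should_abstain(text, expected=frozenset({"LATIN"})):
    """Flag input for abstention or review when it uses scripts outside `expected`."""
    return not scripts_in(text) <= expected

print(should_abstain("summarize this article"))   # False
print(should_abstain("summarize эту статью"))     # True
```

A production detector would more plausibly combine language identification with a model-based out-of-distribution signal; the script check only shows where an abstention hook could sit.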

What to watch

Watch for follow-up work that measures STEER on models at other parameter scales and for evaluations that test explicit out-of-distribution detectors on code-switched inputs. A concrete signal will be whether teams extend safety alignment datasets to include the low-resource languages and code-switching patterns STEER exploits, and whether that reduces the reported 93.0% and 96.7% success rates on JailbreakBench and AdvBench.

Author and submission details: the paper, "Safety Targeted Embedding Exploit via Refinement," was submitted to arXiv on July 2, 2026 by Joshua Adrian Cahyono.

STEER attack success rates across targets

Target | Attack success rate (%) | Notes
JailbreakBench | 93.0 | Measured across six open-source 8B-parameter models
AdvBench | 96.7 | Measured across six open-source 8B-parameter models
GPT-4o-mini (transfer) | 35.5 | Cross-model transfer success without access to target

Written by The Brieftide · Source: arXiv
