
Detecting and Controlling Sycophancy, Cascading Linear Features

arXiv:2606.26155 describes an iterative pipeline that isolates linearly separable features to detect and steer sycophancy in language models.

The Brieftide

TL;DR

  • arXiv:2606.26155 describes an iterative pipeline that isolates linearly separable features to detect and steer sycophancy in language models.
  • The paper’s DOI is https://doi.org/10.48550/arXiv.2606.26155 and the authors link code and data alongside the submission.
  • In short, the pipeline produces samples with graded amounts of the target feature, then searches model activations for components that vary linearly with the graded behavior.

Maty Bohacek and five coauthors (Rishub Jain, Nicholas Dufour, Thomas Leung, Chris Bregler, Roma Patel) submitted arXiv:2606.26155 on 23 Jun 2026, introducing an iterative data generation pipeline that isolates cascading linear features to detect and steer sycophancy in language models. The paper’s DOI is https://doi.org/10.48550/arXiv.2606.26155 and the authors link code and data alongside the submission.

What is the pipeline and how does it find features?

The paper answers this directly: it generates contrastive samples not as simple binary pairs but as cascades showing degrees of a feature that scale linearly with behavior, and it uses those samples to isolate cascading linear features responsible for that behavior. In short, the pipeline produces samples with graded amounts of the target feature, then searches model activations for components that vary linearly with the graded behavior. The authors say these discovered sycophancy features form "linearly separable subspaces," which the pipeline uses to select activations tied to the behavior more clearly than baseline methods.
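The search step described above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the activations are synthetic, and the names `make_cascade`, `linear_components`, and the 0.9 correlation threshold are assumptions made here for illustration. It builds samples at five graded levels, then keeps the activation dimensions whose values correlate strongly with the grade.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def make_cascade(grades, dim, signal_dims, rng, noise=0.1):
    """Synthetic stand-in for model activations (hypothetical data).

    Dimensions listed in signal_dims grow with the grade; all others are noise.
    """
    rows = []
    for g in grades:
        row = [rng.gauss(0.0, noise) for _ in range(dim)]
        for d in signal_dims:
            row[d] += g
        rows.append(row)
    return rows

def linear_components(activations, grades, threshold=0.9):
    """Indices of activation dimensions that track the grade linearly."""
    keep = []
    for d in range(len(activations[0])):
        column = [row[d] for row in activations]
        if abs(pearson(column, grades)) >= threshold:
            keep.append(d)
    return keep

rng = random.Random(0)
grades = [0, 1, 2, 3, 4] * 20          # five graded levels, 20 samples each
acts = make_cascade(grades, dim=8, signal_dims={2, 5}, rng=rng)
selected = linear_components(acts, grades)
print(selected)
```

In this toy setup only the two planted dimensions pass the threshold. A real model would need activations recorded from actual graded prompts, and the paper may use a different selection criterion than per-dimension correlation.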

The method sits in the activation-steering family: the authors frame the work around interpretability and control via activation steering, arguing that graded, cascading samples allow better disentanglement than binary contrastive pairs.
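One way to see why graded samples could carry more information than binary pairs is the toy comparison below. It is an illustrative sketch under assumptions made here, not a result from the paper: `linear_dim` and `step_dim` are invented activation values, and `r_squared` is a helper introduced for this example. Both dimensions look identical when only the two endpoint grades are compared, but a straight-line fit over all grades separates the one that scales from the one that merely switches on.

```python
def mean(xs):
    return sum(xs) / len(xs)

def r_squared(xs, ys):
    """Fraction of variance in ys explained by a straight-line fit on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

grades = [0, 1, 2, 3, 4]
linear_dim = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical: scales with the grade
step_dim = [0.0, 4.0, 4.0, 4.0, 4.0]    # hypothetical: switches on, then saturates

# A binary contrast sees only the two endpoints, so both dimensions look alike.
binary_gap_linear = linear_dim[-1] - linear_dim[0]
binary_gap_step = step_dim[-1] - step_dim[0]

# The graded cascade exposes the difference in how well a line fits.
r2_linear = r_squared(grades, linear_dim)
r2_step = r_squared(grades, step_dim)

print(binary_gap_linear, binary_gap_step)
print(r2_linear, r2_step)
```

Here the endpoint gaps are equal, while the linear fit explains all of the variance in the first dimension and only half of it in the second.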

What problem does this target and what is sycophancy?

The paper targets sycophancy, defined by the authors as "the tendency of language models to prioritize user validation." The pipeline is designed both to detect that tendency with deterministic scoring and to steer models away from it by selecting and manipulating the activation subspaces tied to the behavior. The abstract emphasizes that cascading samples reveal features that scale linearly with sycophantic behavior, enabling clearer mapping from activation components to the undesired output.
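Deterministic scoring and steering along a linear feature can be sketched generically as follows. This is a standard projection-and-shift illustration, not the paper's specific procedure: the `direction` vector, the `activation` values, and the functions `score` and `steer` are hypothetical. The score is the projection of an activation onto a unit feature direction, so the same input always yields the same number, and steering shifts the activation along that direction.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = dot(v, v) ** 0.5
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]

def score(activation, direction):
    """Deterministic score: projection of the activation onto the unit direction."""
    return dot(activation, normalize(direction))

def steer(activation, direction, alpha):
    """Shift the activation by alpha along the unit direction.

    A negative alpha reduces the feature; a positive alpha amplifies it.
    """
    unit = normalize(direction)
    return [a + alpha * u for a, u in zip(activation, unit)]

direction = [0.0, 3.0, 4.0]    # hypothetical feature direction
activation = [1.0, 2.0, 1.0]   # hypothetical activation vector

before = score(activation, direction)
steered = steer(activation, direction, alpha=-before)  # remove the component
after = score(steered, direction)

print(before, after)
```

Setting `alpha` to the negative of the current score removes the component along the direction entirely; in practice a smaller shift may be preferable, and the right strength is an empirical question this sketch does not address.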

How does it compare to baselines and what claims do the authors make?

The authors claim the cascading linear features approach matches or outperforms the two baselines named in the abstract: LLM-as-a-judge and system prompting. They add that the approach has lower computational demand and offers stronger interpretability guarantees than those baselines. The submission frames three concrete capabilities enabled by the method: detection, deterministic scoring, and robust steering, all based on features discovered from graded sample cascades rather than simple binary contrasts.

Why it matters

If cascading samples reliably expose linearly separable activation subspaces for behaviors like sycophancy, researchers gain a more deterministic way to score and manipulate model behavior with less reliance on external LLM judgment or heavy prompt engineering. That could make targeted steering interventions cheaper to evaluate and easier to inspect, because the paper positions its features as explicit, linearly separable components in activation space.

What to watch

Look for the linked code and data in the paper’s submission materials to reproduce the iterative sample-generation pipeline and the feature isolation steps; the arXiv entry (arXiv:2606.26155) is the canonical pointer. Also watch for follow-up evaluations that apply the same cascading-sample approach to behaviors beyond sycophancy or that quantify the promised reductions in computational demand versus LLM-as-a-judge and system prompting baselines.

Submission details: authors Maty Bohacek, Rishub Jain, Nicholas Dufour, Thomas Leung, Chris Bregler, Roma Patel; submission date 23 Jun 2026; arXiv identifier arXiv:2606.26155; DOI https://doi.org/10.48550/arXiv.2606.26155.

Figure: Cascading linear features pipeline

  • Iterative data generation: produce graded, cascading samples
  • Cascading samples: samples showing degrees of the feature
  • Activation collection: record activations across samples
  • Feature isolation: identify components scaling linearly with behavior
  • Linearly separable subspace: subspace tied to sycophantic behavior
  • Detection, scoring, steering: deterministic scoring and activation steering

The diagram labels the transitions between stages, in order, as generate, evaluate, analyze, form, and apply.

Written by The Brieftide · Source: arXiv
