Linear-attention revival: Qwen3-Next, MiniMax-M1 and Kimi Linear

A string of 2025 releases and reversals has reignited linear-attention hybrids.

The Brieftide

TL;DR

  • A string of 2025 releases and reversals has reignited linear-attention hybrids.
  • MiniMax-M1, Qwen3-Next, DeepSeek V3.2 and Kimi Linear have driven a visible revival of linear-attention hybrids across large models in 2025.
  • MiniMax-M1 is a 456B-parameter mixture-of-experts model with 46B active parameters that came out in June.

MiniMax-M1, Qwen3-Next, DeepSeek V3.2 and Kimi Linear have driven a visible revival of linear-attention hybrids across large models in 2025. MiniMax-M1, a 456B-parameter mixture-of-experts model with 46B active parameters, came out in June. Qwen3-Next followed in August, DeepSeek V3.2 was announced in September, and Kimi Linear arrived in October. In a recent reversal, the MiniMax team released a new 230B-parameter M2 model without linear attention, saying linear attention proved tricky in production for reasoning and multi-turn tasks.

What changed this year

The classic scaled dot-product attention from Attention Is All You Need remains the dominant mechanism, and its cost grows quadratically with sequence length. Early linear-attention work replaced the softmax(QK^T) term with kernel feature maps such as ϕ(x)=elu(x)+1. Because ϕ(Q)(ϕ(K)^T V) gives the same result as (ϕ(Q)ϕ(K)^T)V, the key-value summary can be computed once and reused for every query, which reduces the cost in sequence length n from O(n^2) to O(n). Those earlier variants largely failed to gain traction because they degraded accuracy.
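The reordering behind that complexity reduction can be sketched in a few lines of plain Python. This is a minimal, non-causal illustration under stated assumptions: the function names and the list-of-lists layout are hypothetical and not taken from any of the models discussed, and production implementations use batched tensor kernels and a causal running sum instead.

```python
import math

def phi(x):
    # Feature map elu(x) + 1: x + 1 for x > 0, exp(x) otherwise (always positive).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(Q, K, V):
    """Kernelized attention in O(n): Q, K are n x d lists, V is n x d_v."""
    d, d_v = len(K[0]), len(V[0])
    # One pass over keys/values builds S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j).
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    # Each query reuses S and z, so no n x n score matrix is ever formed.
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out

def quadratic_reference(Q, K, V):
    """The same kernel evaluated pairwise in O(n^2), for comparison only."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values up to floating-point rounding; only the order of the multiplications differs. Note that this kernel is not a softmax, which is one way to see why accuracy can differ from standard attention.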

This year saw renewed interest in efficient attention as several teams integrated linear or subquadratic attention into large models. MiniMax-M1, Qwen3-Next and DeepSeek V3.2 all replace traditional quadratic attention in most or all layers with more efficient variants. DeepSeek V3.2's sparse attention mechanism is not strictly linear, but its computational cost is subquadratic, so it is grouped with the other efficient-attention designs.

Qwen3-Next implements a hybrid that mixes Gated DeltaNet blocks with Gated Attention blocks in a 3:1 ratio across 48 transformer blocks, for example: Layer 1: Linear attention → MoE; Layer 2: Linear attention → MoE; Layer 3: Linear attention → MoE; Layer 4: Full attention → MoE, and so on. The hybrid makes a native 262k-token context length feasible in terms of memory usage; by contrast, the previous Qwen3 235B-A22B model supported 32k tokens natively and 131k with YaRN scaling.
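The 3:1 layout described above can be written out as a short sketch. The function name, its parameters and the string labels are illustrative assumptions; this only reproduces the layer ordering, not any actual model configuration file.

```python
def layer_pattern(num_blocks=48, linear_per_full=3):
    """Return the attention type of each block for a repeating N:1 hybrid."""
    period = linear_per_full + 1
    # Every `period`-th block (4th, 8th, ...) uses full (gated) attention;
    # all other blocks use the linear-attention layer.
    return [
        "Gated Attention" if (i + 1) % period == 0 else "Gated DeltaNet"
        for i in range(num_blocks)
    ]

pattern = layer_pattern()
for index, kind in enumerate(pattern[:8], start=1):
    print(f"Layer {index}: {kind} -> MoE")
```

With the defaults, 36 of the 48 blocks use Gated DeltaNet and 12 use Gated Attention.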

Gated attention is essentially regular full attention with an additional sigmoid gate that multiplies the attention output. The Qwen3-Next developers say, "[...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model." Gated DeltaNet is the linear-attention layer adopted from the Gated Delta Networks work; it combines the gated decay mechanism of Mamba2 with a delta rule that updates a hidden memory state using prediction errors.
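Both mechanisms can be illustrated with a small single-head sketch in plain Python. It is a simplified reading of the description above, not the Qwen3-Next implementation: the function names are hypothetical, and the gate logits, the decay `alpha` and the write strength `beta` are passed in as plain numbers here, whereas in a real model they would be computed from the input by learned projections.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_attention_output(attn_out, gate_logits):
    """Gated attention: scale each output channel by a sigmoid gate in (0, 1)."""
    return [sigmoid(g) * o for o, g in zip(attn_out, gate_logits)]

def gated_delta_step(S, q, k, v, alpha, beta):
    """One token of a gated delta rule; S is a d_k x d_v memory matrix."""
    d_k, d_v = len(k), len(v)
    # 1) Gated decay (Mamba2-style): shrink the old memory by alpha in [0, 1].
    S = [[alpha * s for s in row] for row in S]
    # 2) Ask the memory what value it currently stores for key k.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Prediction error, scaled by the write strength beta in [0, 1].
    delta = [beta * (v[j] - pred[j]) for j in range(d_v)]
    # 4) Delta-rule update: correct the memory along the direction of k.
    S = [[S[i][j] + k[i] * delta[j] for j in range(d_v)] for i in range(d_k)]
    # 5) Read out with the query.
    y = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, y

# Writing a new value under the same key replaces the old one instead of adding to it.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S, y1 = gated_delta_step(S, key, key, [2.0, 3.0], alpha=1.0, beta=1.0)  # y1 == [2.0, 3.0]
S, y2 = gated_delta_step(S, key, key, [5.0, 7.0], alpha=1.0, beta=1.0)  # y2 == [5.0, 7.0]
```

The state `S` has a fixed size regardless of how many tokens have been processed, which is where the memory savings for long contexts come from. The overwrite behavior in the usage lines relies on a unit-length key, which is an assumption of this example.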

How teams reacted

The MiniMax team’s move back to a 230B M2 model without linear attention was explicit: the team stated that linear attention "seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks." That reversal shows the tension in production settings between the efficiency gains of linear attention and the model behaviors required for reasoning and persistent multi-turn state.

At the same time, Qwen3-Next and Kimi Linear continued to ship linear-attention hybrids, adopting elements like Gated DeltaNet and gated attention to push context length and memory efficiency. DeepSeek’s V3.2 is included in the same category because its sparse attention is at least subquadratic.

Why it matters

Linear-attention hybrids directly target the fundamental O(n^2) cost of standard attention, promising much lower memory and compute for long contexts and enabling native multi-hundred-thousand-token windows such as Qwen3-Next’s 262k. The trade-off is clear in practice: earlier linear approximations hurt accuracy, and MiniMax’s rollback underscores real-world risks for reasoning and multi-turn agentic applications. The field is converging on hybrid designs that try to regain accuracy through gating, delta-rule memory updates and selective full-attention layers, so the next generation of models will test whether those fixes are sufficient.

What to watch

Look for benchmark results on reasoning and multi-turn tasks for Qwen3-Next, Kimi Linear and DeepSeek V3.2, and for any follow-up evaluations from the MiniMax team explaining why M2 dropped linear attention. The author of the source article also mentions a PyTorch Conference 2025 talk that is expected to be uploaded to the official PyTorch YouTube channel, which may add implementation detail and empirical insights.

2025 chronology of notable linear-attention model releases
  1. June 2025
    MiniMax-M1

    456B parameter MoE, 46B active parameters; replaced many layers with linear attention

  2. August 2025
    Qwen3-Next

    Hybrid Gated DeltaNet + Gated Attention, native 262k token context length; 48 transformer blocks alternating 3:1

  3. September 2025
    DeepSeek V3.2

    Sparse attention mechanism, subquadratic computational costs

  4. October 2025
    Kimi Linear

    Released with linear attention

  5. After June 2025
    MiniMax-M2

    New 230B parameter model released without linear attention due to poor accuracy in reasoning and multi-turn tasks

Written by The Brieftide · Source: Ahead of AI