Linear-attention revival: Qwen3-Next, MiniMax-M1 and Kimi Linear
A string of 2025 releases and reversals has reignited linear-attention hybrids.
TL;DR
- A string of 2025 releases and reversals has reignited linear-attention hybrids.
- MiniMax-M1, Qwen3-Next, DeepSeek V3.2 and Kimi Linear have driven a visible revival of linear-attention hybrids across large models in 2025.
- MiniMax-M1 is a 456B-parameter mixture-of-experts model with 46B active parameters that came out in June.
MiniMax-M1, Qwen3-Next, DeepSeek V3.2 and Kimi Linear have driven a visible revival of linear-attention hybrids across large models in 2025. MiniMax-M1, a 456B-parameter mixture-of-experts model with 46B active parameters, came out in June. Qwen3-Next followed in August, DeepSeek V3.2 was announced in September, and Kimi Linear arrived in October. The MiniMax team then reversed course, releasing a new 230B-parameter M2 model without linear attention and saying that linear attention proved tricky in production for reasoning and multi-turn tasks.
What changed this year
The classic scaled-dot-product attention from Attention Is All You Need remains the dominant mechanism, and its cost grows quadratically with sequence length. Early linear-attention work replaced the softmax QK^T term with kernel feature approximations such as ϕ(x)=elu(x)+1 to reduce complexity from O(n^2) to O(n). Those earlier variants largely failed to gain traction because they degraded accuracy.
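To make the kernel trick concrete, here is a minimal NumPy sketch of the ϕ(x)=elu(x)+1 feature map. It is a non-causal, single-head illustration, not any particular model's implementation; the function names and toy shapes are assumptions chosen for this example. The point is the reordering: computing ϕ(K)^T V first gives a small d × d_v summary, so the n × n score matrix is never formed.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise,
    # so every feature is strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    # O(n) in sequence length: summarize keys/values once, then query the summary.
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    kv = phi_k.T @ V              # (d, d_v) summary of all key/value pairs
    z = phi_k.sum(axis=0)         # (d,) normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def linear_attention_quadratic(Q, K, V):
    # Same math, computed the O(n^2) way by building the full n x n matrix.
    # Included only to check that the reordering above is exact.
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    scores = phi_q @ phi_k.T      # (n, n)
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V

# Hypothetical toy sizes: 6 tokens, 4-dim queries/keys, 3-dim values.
rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
out_fast = linear_attention(Q, K, V)
out_slow = linear_attention_quadratic(Q, K, V)
```

The two functions agree up to floating-point error, but the result is not the same as softmax attention: the feature map is an approximation of the softmax kernel, which is where the accuracy loss in early variants came from.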
This year saw renewed interest in efficient attention as several teams integrated linear or other subquadratic attention into large models. MiniMax-M1, Qwen3-Next and DeepSeek V3.2 all replace traditional quadratic attention in most or all layers with more efficient variants. DeepSeek V3.2's sparse attention mechanism is not strictly linear, but its computational cost is subquadratic, so it is grouped with the other efficient-attention designs.
Qwen3-Next implements a hybrid that mixes Gated DeltaNet (linear attention) blocks with Gated Attention (full attention) blocks in a 3:1 ratio across 48 transformer blocks. The pattern is: Layer 1: Linear attention → MoE; Layer 2: Linear attention → MoE; Layer 3: Linear attention → MoE; Layer 4: Full attention → MoE, and so on. The hybrid makes a native 262k-token context length feasible in terms of memory usage; by contrast, the earlier Qwen3 235B-A22B model supported 32k natively and 131k with YaRN scaling.
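The 3:1 layout is easy to spell out in code. The sketch below only generates layer labels for the pattern described above; the function name and the label strings are made up for illustration and are not taken from the Qwen3-Next codebase.

```python
def hybrid_layer_schedule(num_blocks=48, linear_per_full=3):
    # Every (linear_per_full + 1)-th block uses full attention;
    # all other blocks use linear attention. Each block is followed by an MoE layer.
    period = linear_per_full + 1
    return [
        "full_attention" if (i + 1) % period == 0 else "linear_attention"
        for i in range(num_blocks)
    ]

schedule = hybrid_layer_schedule()
num_full = schedule.count("full_attention")
num_linear = schedule.count("linear_attention")
```

With 48 blocks and a 3:1 ratio, this yields 36 linear-attention blocks and 12 full-attention blocks, so only a quarter of the blocks keep a cache that grows with context length.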
Gated attention is essentially regular full attention with an additional sigmoid gate that multiplies the attention output. The Qwen3-Next developers say, "[...] the attention output gating mechanism helps eliminate issues like Attention Sink and Massive Activation, ensuring numerical stability across the model." Gated DeltaNet is the linear-attention layer adopted from the Gated Delta Networks work; it combines the gated decay mechanism of Mamba2 with a delta rule that updates a hidden memory state using prediction errors.
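Both ideas fit in a few lines of NumPy. The following is a simplified single-head sketch, not the Qwen3-Next or Gated Delta Networks implementation: it assumes the output gate is a sigmoid of a linear projection of the layer input (W_gate is a hypothetical weight matrix), and it treats the decay alpha and the write strength beta as given scalars, whereas real models compute them from the input and add normalization and convolution steps omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_output(attn_out, x, W_gate):
    # Output gating: scale the attention output elementwise by a sigmoid gate.
    # Assumption: the gate is computed from the layer input x via W_gate.
    return attn_out * sigmoid(x @ W_gate)

def gated_delta_step(S, q, k, v, alpha, beta):
    # S is the hidden memory state, shape (d_k, d_v), fixed-size regardless of context length.
    S = alpha * S                        # gated decay (Mamba2-style forgetting)
    pred = S.T @ k                       # what the memory currently returns for key k
    error = v - pred                     # prediction error
    S = S + np.outer(k, beta * error)    # delta rule: correct the memory by the error
    y = S.T @ q                          # read out with the query
    return S, y

# Toy check: write a value under a unit-norm key, then overwrite it.
d_k, d_v = 4, 3
k = np.array([1.0, 0.0, 0.0, 0.0])
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 4.0])
S = np.zeros((d_k, d_v))
S, y1 = gated_delta_step(S, q=k, k=k, v=v1, alpha=1.0, beta=1.0)
S, y2 = gated_delta_step(S, q=k, k=k, v=v2, alpha=1.0, beta=1.0)
```

In this toy case, with no decay and full write strength, querying with the same key returns v1 after the first step and v2 after the second. The delta rule replaces the old association instead of adding to it, which is the behavior that purely additive linear-attention states lack.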
How teams reacted
The MiniMax team’s move back to a 230B M2 model without linear attention was explicit: the team stated that linear attention "seemed to work fine with regular prompts, but it had poor accuracy in reasoning and multi-turn tasks." That reversal shows the tension in production settings between the efficiency gains of linear attention and the model behaviors required for reasoning and persistent multi-turn state.
At the same time, Qwen3-Next and Kimi Linear continued to ship linear-attention hybrids, adopting elements like Gated DeltaNet and gated attention to push context length and memory efficiency. DeepSeek’s V3.2 is included in the same category because its sparse attention is at least subquadratic.
Why it matters
Linear-attention hybrids directly target the fundamental O(n^2) cost of standard attention, promising much lower memory and compute for long contexts and enabling native multi-hundred-thousand-token windows such as Qwen3-Next’s 262k. The trade-off is clear in practice: earlier linear approximations hurt accuracy, and MiniMax’s rollback underscores real-world risks for reasoning and multi-turn agentic applications. The field is converging on hybrid designs that try to regain accuracy through gating, delta-rule memory updates and selective full-attention layers, so the next generation of models will test whether those fixes are sufficient.
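A back-of-the-envelope calculation shows where the memory savings come from. Every dimension in the sketch below (number of KV heads, head size, state size, 2-byte elements) is a hypothetical placeholder, not a published Qwen3-Next figure; only the 48-block, 3:1 split and the 262k context come from the description above.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Full attention stores one key and one value vector per token, per layer, per KV head,
    # so this grows linearly with context length.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

def linear_state_bytes(n_layers, n_heads=16, d_k=128, d_v=128, bytes_per_elem=2):
    # A linear-attention layer keeps one fixed (d_k x d_v) state per head,
    # independent of how many tokens have been processed.
    return n_layers * n_heads * d_k * d_v * bytes_per_elem

def hybrid_bytes(n_tokens, n_full=12, n_linear=36):
    # 3:1 hybrid: only the full-attention layers keep a growing KV cache.
    return kv_cache_bytes(n_tokens, n_full) + linear_state_bytes(n_linear)

context = 262_144  # roughly the 262k-token window
full_model = kv_cache_bytes(context, n_layers=48)
hybrid_model = hybrid_bytes(context)
savings_ratio = full_model / hybrid_model
```

Under these assumed dimensions, the hybrid's cache at long context is dominated by its 12 full-attention layers, so it approaches a quarter of the all-full-attention footprint. This sketch counts cache memory only, not attention compute.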
What to watch
Look for benchmark results on reasoning and multi-turn tasks for Qwen3-Next, Kimi Linear and DeepSeek V3.2, and for any follow-up evaluations from the MiniMax team on why M2 dropped linear attention. Also watch for the PyTorch Conference 2025 talk mentioned by the source's author, which is expected to be uploaded to the official PyTorch YouTube channel and may add implementation detail and empirical insights.
- June 2025: MiniMax-M1
456B-parameter MoE, 46B active parameters; replaced many layers with linear attention
- August 2025: Qwen3-Next
Hybrid Gated DeltaNet + Gated Attention, native 262k-token context length; 48 transformer blocks alternating 3:1
- September 2025: DeepSeek V3.2
Sparse attention mechanism, subquadratic computational cost
- October 2025: Kimi Linear
Released with linear attention
- After June 2025: MiniMax-M2
New 230B-parameter model released without linear attention due to poor accuracy in reasoning and multi-turn tasks
Written by The Brieftide · Source: Ahead of AI