DeepSeek V3: 671B model, MLA and MoE architectural choices, 2025

DeepSeek V3 pairs Multi-Head Latent Attention with a 256-expert MoE to activate just 37 billion of its 671 billion parameters per token.

The Brieftide

TL;DR

  • DeepSeek V3 pairs Multi-Head Latent Attention with a 256-expert MoE to activate just 37 billion of its 671 billion parameters per token.
  • An always-active shared expert sits alongside eight router-selected experts, so nine experts run per token.
  • Multi-Head Latent Attention compresses keys and values before they enter the KV cache, which adds an extra matrix multiplication but reduces cache memory use; queries are compressed only during training, not at inference.

DeepSeek V3, introduced in December 2024 and widely noticed after the DeepSeek R1 release in January 2025, is a 671-billion-parameter model built around two architectural ideas: Multi-Head Latent Attention and a large Mixture-of-Experts design. Combined with a compressed KV cache and a shared expert, those choices leave only about 37 billion parameters active per token.

Multi-Head Latent Attention versus grouped-query attention

DeepSeek V3 uses Multi-Head Latent Attention (MLA), an approach that compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache and projects them back to full size when attention needs them at inference time. The projection back up adds an extra matrix multiplication but reduces KV cache memory use. Queries are also compressed, but only during training, not at inference.
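The compress-then-restore idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not DeepSeek's implementation: the sizes `D_MODEL` and `D_LATENT`, the weight names `W_down`, `W_up_k`, and `W_up_v`, and the helper functions are all hypothetical, and details such as multiple heads and positional encoding are left out.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random values (stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes, not DeepSeek V3's real dimensions.
D_MODEL = 16   # full hidden size of a token
D_LATENT = 4   # size of the compressed representation kept in the cache
rng = random.Random(0)

W_down = random_matrix(D_MODEL, D_LATENT, rng)  # compresses the hidden state
W_up_k = random_matrix(D_LATENT, D_MODEL, rng)  # restores full-size keys
W_up_v = random_matrix(D_LATENT, D_MODEL, rng)  # restores full-size values

kv_cache = []  # holds one small latent vector per token

def append_token(hidden):
    """Compress one token's hidden state and cache only the latent vector."""
    kv_cache.append(matmul([hidden], W_down)[0])

def read_keys_values():
    """Project cached latents back to full size (the extra matrix multiplication)."""
    return matmul(kv_cache, W_up_k), matmul(kv_cache, W_up_v)

for _ in range(3):
    append_token([rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)])

keys, values = read_keys_values()
cached_floats = sum(len(row) for row in kv_cache)
full_floats = sum(len(row) for row in keys) + sum(len(row) for row in values)
print(cached_floats, full_floats)  # 12 96
```

With these toy sizes the cache holds 12 numbers for three tokens, where storing full keys and values would take 96. The saving depends entirely on how small the latent is relative to the full key and value width.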

Grouped-Query Attention (GQA), by contrast, reduces key and value computation by having groups of heads share key-value projections. The source article illustrates GQA with a simple example: if there are two key-value groups and four attention heads, heads 1 and 2 share one set of keys and values while heads 3 and 4 share another. GQA lowers the parameter count and the memory bandwidth spent reading keys and values from the KV cache, and prior ablation studies have shown it performs comparably to standard Multi-Head Attention (MHA) in many settings.
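The four-head, two-group example above amounts to an integer division. The sketch below assumes heads are assigned to groups in contiguous blocks; the function name `kv_group_for_head` is hypothetical.

```python
def kv_group_for_head(head_index, num_heads, num_kv_groups):
    """Return the 0-based key-value group that a 0-based attention head reads from."""
    if num_heads % num_kv_groups != 0:
        raise ValueError("num_heads must be divisible by num_kv_groups")
    heads_per_group = num_heads // num_kv_groups
    return head_index // heads_per_group

num_heads, num_kv_groups = 4, 2
# Report heads and groups 1-based to match the prose example.
mapping = {
    head + 1: kv_group_for_head(head, num_heads, num_kv_groups) + 1
    for head in range(num_heads)
}
print(mapping)  # {1: 1, 2: 1, 3: 2, 4: 2}
```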

DeepSeek-V2 ablations cited in the source indicate MLA offered better modeling performance than GQA, which helps explain why the DeepSeek team chose MLA over GQA for V3. The DeepSeek-V2 paper also supplies comparative tables that show GQA performing worse than MHA while MLA improves over MHA.

Mixture-of-Experts at scale: 256 experts and a shared expert

DeepSeek V3 replaces standard FeedForward blocks with Mixture-of-Experts modules. Each MoE module in V3 contains 256 experts, and the model totals 671 billion parameters. The key efficiency trick is sparsity: a router activates only a small subset of experts per token. In DeepSeek V3, nine experts are active at a time, consisting of one shared expert plus eight experts selected by the router. That yields roughly 37 billion parameters used per inference step rather than the full 671 billion.
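A minimal sketch of that routing pattern follows. It makes several labeled assumptions: the 256 experts are treated as routed experts with the shared expert counted separately, each expert is a hypothetical stand-in that only scales its input (real experts are feed-forward networks), and the selected router scores are normalized with a softmax, which may differ from DeepSeek V3's actual gating function.

```python
import math
import random

NUM_ROUTED_EXPERTS = 256  # from the article; assumed to exclude the shared expert
TOP_K = 8                 # routed experts selected per token, from the article

rng = random.Random(0)

# Hypothetical stand-in experts: each one just scales its input vector.
expert_scales = [rng.uniform(0.5, 1.5) for _ in range(NUM_ROUTED_EXPERTS)]
SHARED_SCALE = 1.0

def expert_forward(scale, token):
    """Apply one toy expert to a token vector."""
    return [scale * x for x in token]

def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts by logit and softmax-normalize their logits into weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = order[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def moe_forward(token, router_logits):
    """Shared expert output plus the weighted sum of the selected routed experts."""
    chosen, weights = route(router_logits)
    out = expert_forward(SHARED_SCALE, token)  # shared expert runs for every token
    for idx, weight in zip(chosen, weights):
        routed = expert_forward(expert_scales[idx], token)
        out = [o + weight * r for o, r in zip(out, routed)]
    return out, chosen

token = [0.1, -0.2, 0.3, 0.4]
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
output, chosen = moe_forward(token, router_logits)
active_experts = 1 + len(chosen)
print(active_experts, f"{37 / 671:.1%}")  # 9 5.5%
```

The second printed value is the share of parameters in use per token, 37 billion out of 671 billion.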

The design also deliberately includes a shared expert that is always active for every token. The shared-expert concept appeared earlier in the 2024 DeepSeekMoE and 2022 DeepSpeed-MoE papers, where experiments showed it can boost overall modeling performance because common, repeated patterns need not be learned redundantly by multiple experts.

OLMo 2 and normalization-focused efficiency

OLMo 2, released in January (before Llama 4, Gemma 3, and Qwen 3), emphasizes transparency: the Allen Institute published training data, code, and detailed technical reports. Architecturally, OLMo 2 largely follows the original GPT layout but swaps LayerNorm for RMSNorm and adds a QK-norm. Unlike DeepSeek V3, OLMo 2 retains traditional Multi-Head Attention.
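Both normalization pieces are small enough to sketch directly. The code below is an illustrative assumption rather than OLMo 2's code: `rms_norm` and `qk_norm_score` are hypothetical names, it handles a single head, and it omits positional encoding and the softmax over scores.

```python
import math

def rms_norm(vector, weight=None, eps=1e-6):
    """Divide a vector by its root mean square, then apply an optional learned scale."""
    mean_square = sum(x * x for x in vector) / len(vector)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    if weight is None:
        weight = [1.0] * len(vector)
    return [w * x * inv_rms for w, x in zip(weight, vector)]

def qk_norm_score(query, key):
    """Scaled dot-product score for one head, with RMSNorm applied to query and key first."""
    q = rms_norm(query)
    k = rms_norm(key)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

query = [3.0, -4.0, 0.0, 0.0]
key = [30.0, -40.0, 0.0, 0.0]  # same direction as the query, ten times the magnitude

normalized = rms_norm(query)
rms_after = math.sqrt(sum(x * x for x in normalized) / len(normalized))

score_small = qk_norm_score(query, key)
score_large = qk_norm_score(query, [10 * x for x in key])
print(round(rms_after, 3), round(score_small, 3), round(score_large, 3))
```

Because the query and key are normalized before the dot product, scaling the key by another factor of ten leaves the score essentially unchanged, which is the property that keeps attention logits from growing with activation magnitude.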

The OLMo 2 paper positions its models near the Pareto frontier of modeling benchmark performance versus pre-training cost measured in FLOPs, indicating that its normalization choices contributed to an efficient compute-to-performance trade-off.

Why it matters

Together these designs show two concurrent trends. One path pushes raw parameter count higher while using sparsity and KV-cache compression to keep inference compute and memory practical. The other pursues tighter compute efficiency through normalization choices and transparent reporting. For model builders, the DeepSeek V3 choices highlight how MoE plus a shared expert and MLA can scale model capacity without forcing full-parameter inference. For infrastructure teams, MLA and sparse experts change which metrics matter: total parameter count no longer maps directly to inference cost.

What to watch

Look for comparative ablations that directly measure "KV Cache per Token" savings between MLA and GQA, a comparison the source notes would have been valuable alongside the DeepSeek-V2 ablations. Also watch whether other open models adopt a shared-expert MoE pattern or retain dense FeedForward blocks while tuning normalization (RMSNorm, QK-norm) as OLMo 2 does.
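As a rough guide to what such a comparison would count, the sketch below estimates cached elements per token for the three attention variants. Every dimension is a hypothetical placeholder, not a figure from DeepSeek V3 or any published ablation, and the MLA estimate assumes only a single latent vector is cached per layer.

```python
def mha_per_layer(num_heads, head_dim):
    """Standard multi-head attention caches keys and values for every head."""
    return 2 * num_heads * head_dim

def gqa_per_layer(num_kv_groups, head_dim):
    """Grouped-query attention caches keys and values once per key-value group."""
    return 2 * num_kv_groups * head_dim

def mla_per_layer(latent_dim):
    """Simplified MLA estimate: one compressed latent per token, rebuilt into K and V later."""
    return latent_dim

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128
NUM_KV_GROUPS = 8
LATENT_DIM = 512

mha_total = NUM_LAYERS * mha_per_layer(NUM_HEADS, HEAD_DIM)
gqa_total = NUM_LAYERS * gqa_per_layer(NUM_KV_GROUPS, HEAD_DIM)
mla_total = NUM_LAYERS * mla_per_layer(LATENT_DIM)

print(mha_total, gqa_total, mla_total)  # 262144 65536 16384
```

Multiplying each count by the bytes per element gives cache memory per token. The ratios here follow from the made-up dimensions and say nothing about modeling quality, which is what an ablation would have to measure.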

Key components in DeepSeek V3 and OLMo 2 architectures

DeepSeek V3
  • Multi-Head Latent Attention (compresses KV for cache)
  • Mixture-of-Experts (256 experts)
  • Shared expert (always active)
  • KV cache (compressed storage)

OLMo 2
  • RMSNorm
  • QK-norm
  • Multi-Head Attention (traditional)

Written by The Brieftide · Source: Ahead of AI
