OpenAI gpt-oss release: gpt-oss-120b and gpt-oss-20b
OpenAI published open-weight gpt-oss-120b and gpt-oss-20b, the first full open weights since GPT-2, with MXFP4 optimizations for local GPUs.
TL;DR
- 01OpenAI published open-weight gpt-oss-120b and gpt-oss-20b, the first full open weights since GPT-2, with MXFP4 optimizations for local GPUs.
- 02OpenAI released open-weight gpt-oss-120b and gpt-oss-20b this week, the first full open-weight models the company has shared since GPT-2 in 2019.
- 03The gpt-oss models use the familiar decoder-only transformer backbone, but they incorporate a set of changes that have become common in modern large language models.
OpenAI released open-weight gpt-oss-120b and gpt-oss-20b this week, the first full open-weight models the company has shared since GPT-2 in 2019. The release includes architecture tweaks and what OpenAI and reviewers describe as optimizations that make local runs possible, notably an MXFP4 optimization that helps fit the models onto single GPUs.
Model architecture changes
The gpt-oss models use the familiar decoder-only transformer backbone, but they incorporate a set of changes that have become common in modern large language models. The release highlights multiple specific shifts away from older GPT design choices:
Dropout is removed. The author notes that most modern LLMs have dropped dropout, and cites a 2025 small-scale LLM paper (Pythia 1.4B) finding that dropout can worsen downstream performance in single-epoch training regimes.
RoPE replaces absolute positional embeddings. Rotary Position Embedding encodes position by rotating query and key vectors rather than adding learned position vectors, and has been widely adopted since Llama in 2023.
GELU is largely replaced by Swish / SiLU, and the feed-forward module moves to gated variants such as SwiGLU or GEGLU. The article includes a param-count illustration: with an embedding dimension of 1024, a traditional feed-forward pair of fc1 and fc2 totals 8,388,608 parameters, while a GLU variant with three 1024 by 1024 layers totals 3,145,728 parameters, yielding fewer parameters and extra multiplicative interaction.
The single feed-forward module is replaced with a Mixture-of-Experts (MoE) design. In MoE setups the router activates only a subset of experts per token, increasing total model capacity while keeping per-step compute sparse; the piece notes that in most MoE models expert weights account for more than 90% of total parameters.
Grouped Query Attention (GQA) is used in place of traditional Multi-Head Attention as a more compute- and parameter-efficient alternative.
The author emphasizes that none of these choices are unique to gpt-oss. Many contemporary models use the same base transformer architecture with similar tweaks, and the author attributes much of the performance gains across the field to data and algorithmic tweaks rather than radical architecture changes.
Deployment and local use
OpenAI published the gpt-oss model files and the author points users to OpenAI's official model hub pages on Hugging Face: https://huggingface.co/openai/gpt-oss-20b and https://huggingface.co/openai/gpt-oss-120b. According to the release and the writeup, the gpt-oss-20b model can run on a consumer GPU with up to 16 GB of RAM. The gpt-oss-120b model can run on a single H100 with 80 GB of RAM or on newer hardware, with MXFP4 cited as an optimization that helps fit models onto single GPUs.
The author also places gpt-oss in historical context, showing a side-by-side with GPT-2 XL 1.5B to illustrate how the same decoder-only transformer backbone has evolved through positional encodings, activation functions, MoE, and attention variants.
Why it matters
OpenAI publishing full open-weight models for the first time since GPT-2 opens the architecture and implementation details to researchers and practitioners, enabling inspection and local experimentation. The MoE approach lets models grow total parameter capacity while keeping per-token inference efficient, which matters for scaling knowledge capacity without linear inference cost inflation. Finally, the ability to run a 20B model on consumer GPUs, and the MXFP4 path toward single-GPU runs, lowers the barrier for local testing and iteration, even as larger checkpoints still require high-memory server GPUs.
What to watch
Watch independent reproductions of the single-GPU runs and community benchmarks that compare gpt-oss models with OpenAI's contemporaneous announcements such as GPT-5, which the author notes was announced days after the gpt-oss release. Also monitor the Hugging Face model pages for usage notes and community-contributed optimizations.
| Item | ||||
|---|---|---|---|---|
| gpt-oss-20b | 20B | Consumer GPU, up to 16 GB RAM | Mixture-of-Experts, RoPE, SwiGLU, Grouped Query Attention, MXFP4 optimizations | |
| gpt-oss-120b | 120B | Single H100 with 80 GB RAM or newer hardware | Mixture-of-Experts, RoPE, SwiGLU, Grouped Query Attention, MXFP4 optimizations | |
| GPT-2 XL | 1.5B | Not specified in this release | Decoder-only transformer, absolute positional embeddings, GELU, single feed-forward module, dropout (historical) |
Written by The Brieftide · Source: Ahead of AI
The Brieftide Daily · 06:00
Briefs like this one, in your inbox every morning.
Continue reading
More in Open Source AIOpenAI backs EU AI content transparency code
OpenAI pledged to support the European Code of Practice on AI content transparency.
PRC-linked AI influence campaigns target US tech policy debates
OpenAI says PRC-linked actors used AI-generated content and coordinated accounts to push narratives about data centers and tariffs.
LSEG adopts OpenAI to scale trusted AI across global teams
London Stock Exchange Group embedded OpenAI models across global teams, accelerating insights and shortening release cycles.
OpenAI people-first AI industrial policy and workforce plan
OpenAI proposes workforce programs, public investment, corporate governance rules and international coordination to expand AI opportunity.