Open Source AI6 min read

OpenAI gpt-oss release: gpt-oss-120b and gpt-oss-20b

OpenAI published open-weight gpt-oss-120b and gpt-oss-20b, the first full open weights since GPT-2, with MXFP4 optimizations for local GPUs.

The Brieftide

TL;DR

  • 01OpenAI published open-weight gpt-oss-120b and gpt-oss-20b, the first full open weights since GPT-2, with MXFP4 optimizations for local GPUs.
  • 02OpenAI released open-weight gpt-oss-120b and gpt-oss-20b this week, the first full open-weight models the company has shared since GPT-2 in 2019.
  • 03The gpt-oss models use the familiar decoder-only transformer backbone, but they incorporate a set of changes that have become common in modern large language models.

OpenAI released open-weight gpt-oss-120b and gpt-oss-20b this week, the first full open-weight models the company has shared since GPT-2 in 2019. The release includes architecture tweaks and what OpenAI and reviewers describe as optimizations that make local runs possible, notably an MXFP4 optimization that helps fit the models onto single GPUs.

Model architecture changes

The gpt-oss models use the familiar decoder-only transformer backbone, but they incorporate a set of changes that have become common in modern large language models. The release highlights multiple specific shifts away from older GPT design choices:

  • Dropout is removed. The author notes that most modern LLMs have dropped dropout, and cites a 2025 small-scale LLM paper (Pythia 1.4B) finding that dropout can worsen downstream performance in single-epoch training regimes.

  • RoPE replaces absolute positional embeddings. Rotary Position Embedding encodes position by rotating query and key vectors rather than adding learned position vectors, and has been widely adopted since Llama in 2023.

  • GELU is largely replaced by Swish / SiLU, and the feed-forward module moves to gated variants such as SwiGLU or GEGLU. The article includes a param-count illustration: with an embedding dimension of 1024, a traditional feed-forward pair of fc1 and fc2 totals 8,388,608 parameters, while a GLU variant with three 1024 by 1024 layers totals 3,145,728 parameters, yielding fewer parameters and extra multiplicative interaction.

  • The single feed-forward module is replaced with a Mixture-of-Experts (MoE) design. In MoE setups the router activates only a subset of experts per token, increasing total model capacity while keeping per-step compute sparse; the piece notes that in most MoE models expert weights account for more than 90% of total parameters.

  • Grouped Query Attention (GQA) is used in place of traditional Multi-Head Attention as a more compute- and parameter-efficient alternative.

The author emphasizes that none of these choices are unique to gpt-oss. Many contemporary models use the same base transformer architecture with similar tweaks, and the author attributes much of the performance gains across the field to data and algorithmic tweaks rather than radical architecture changes.

Deployment and local use

OpenAI published the gpt-oss model files and the author points users to OpenAI's official model hub pages on Hugging Face: https://huggingface.co/openai/gpt-oss-20b and https://huggingface.co/openai/gpt-oss-120b. According to the release and the writeup, the gpt-oss-20b model can run on a consumer GPU with up to 16 GB of RAM. The gpt-oss-120b model can run on a single H100 with 80 GB of RAM or on newer hardware, with MXFP4 cited as an optimization that helps fit models onto single GPUs.

The author also places gpt-oss in historical context, showing a side-by-side with GPT-2 XL 1.5B to illustrate how the same decoder-only transformer backbone has evolved through positional encodings, activation functions, MoE, and attention variants.

Why it matters

OpenAI publishing full open-weight models for the first time since GPT-2 opens the architecture and implementation details to researchers and practitioners, enabling inspection and local experimentation. The MoE approach lets models grow total parameter capacity while keeping per-token inference efficient, which matters for scaling knowledge capacity without linear inference cost inflation. Finally, the ability to run a 20B model on consumer GPUs, and the MXFP4 path toward single-GPU runs, lowers the barrier for local testing and iteration, even as larger checkpoints still require high-memory server GPUs.

What to watch

Watch independent reproductions of the single-GPU runs and community benchmarks that compare gpt-oss models with OpenAI's contemporaneous announcements such as GPT-5, which the author notes was announced days after the gpt-oss release. Also monitor the Hugging Face model pages for usage notes and community-contributed optimizations.

gpt-oss models and GPT-2 comparison
Item
gpt-oss-20b20BConsumer GPU, up to 16 GB RAMMixture-of-Experts, RoPE, SwiGLU, Grouped Query Attention, MXFP4 optimizations
gpt-oss-120b120BSingle H100 with 80 GB RAM or newer hardwareMixture-of-Experts, RoPE, SwiGLU, Grouped Query Attention, MXFP4 optimizations
GPT-2 XL1.5BNot specified in this releaseDecoder-only transformer, absolute positional embeddings, GELU, single feed-forward module, dropout (historical)
Advertisement

Written by The Brieftide · Source: Ahead of AI

The Brieftide Daily · 06:00

Briefs like this one, in your inbox every morning.

 

FreeOne email a dayEvery claim sourcedUnsubscribe in one click
Advertisement