Open Source AI · 5 min read

OpenAI o3 and RL for LLM reasoning: 10× compute, PPO, GRPO

OpenAI's o3 used 10× more training compute than o1; xAI and Anthropic add "thinking" buttons while GPT-4.5 and Llama 4 remain non-reasoning.

The Brieftide

TL;DR

  • OpenAI's o3 used 10× more training compute than o1; xAI and Anthropic have added "thinking" buttons, while GPT-4.5 and Llama 4 remain non-reasoning models.
  • OpenAI staff said during an April 16, 2025 livestream that the o3 reasoning model used 10× more training compute than o1.
  • The release follows GPT-4.5 and Llama 4, which remained conventional models trained without explicit reinforcement learning for reasoning.

OpenAI released the o3 reasoning model, and OpenAI staff said during an April 16, 2025 livestream that o3 used 10× more training compute than o1. The release follows GPT-4.5 and Llama 4, two recent launches that remained conventional models trained without explicit reinforcement learning for reasoning.

What vendors have done and how the market is shifting

Reactions to GPT-4.5 and Llama 4 were relatively muted, the author notes, in part because those models stayed "conventional," meaning they were trained without explicit reinforcement learning for reasoning. By contrast, competitors have moved to add reasoning capabilities and give users control over them. The article points out that both the xAI Grok and Anthropic Claude interfaces now include a "thinking" (or "extended thinking") button for certain models that explicitly toggles reasoning.

OpenAI's o3 is presented as an example of investing compute specifically in reasoning. The article highlights the company's claim that o3 used 10× more training compute than o1, implying that targeted compute can yield measurable gains in reasoning performance. The piece also suggests that the muted response to some recent releases signals we may be approaching the limits of what scaling model size and data alone can achieve.

How reasoning-focused training differs: RLHF, PPO and reward models

The reinforcement learning methods used to train reasoning models are closely related to reinforcement learning from human feedback (RLHF), which has been part of the standard recipe since the InstructGPT paper. Conventional LLM development commonly follows three stages: pre-training, supervised fine-tuning, and alignment, typically via RLHF.

The RLHF pipeline is described in three practical steps. First, supervised fine-tuning (SFT) of the pre-trained model produces the starting point for further alignment. Second, a reward model is created by sampling prompts, generating multiple responses per prompt, and having human annotators rank those responses. The article notes that for each prompt the SFT model can generate, for example, four responses, which humans rank; the trained reward model then automates that ranking. To turn the SFT model into a reward model, its output layer is replaced with a regression layer that has a single output node.
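The reward-model step can be sketched in a few lines. This is a minimal illustration, not the article's or any vendor's implementation: the toy two-dimensional vectors stand in for the model's final hidden state, the function names `reward_head` and `pairwise_ranking_loss` are hypothetical, and the pairwise logistic loss is an assumption (a common choice for training on rankings, not something the brief specifies).

```python
import math
from itertools import combinations


def reward_head(hidden, weights, bias):
    """Regression head with a single output node: hidden vector -> scalar reward."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def pairwise_ranking_loss(rewards_best_to_worst):
    """Mean of -log(sigmoid(r_better - r_worse)) over every pair in one ranking."""
    # combinations() keeps input order, so the first item of each pair is ranked higher.
    pairs = list(combinations(rewards_best_to_worst, 2))
    losses = [math.log1p(math.exp(-(better - worse))) for better, worse in pairs]
    return sum(losses) / len(losses)


# Toy hidden states for four responses to one prompt, ordered from the
# human-ranked best to worst (illustrative numbers only).
hidden_states = [[1.0, 0.5], [0.8, 0.1], [0.2, 0.3], [-0.5, 0.0]]
weights, bias = [1.0, 0.5], 0.0  # assumed head parameters

rewards = [reward_head(h, weights, bias) for h in hidden_states]
loss_agree = pairwise_ranking_loss(rewards)           # scores match the human ranking
loss_disagree = pairwise_ranking_loss(rewards[::-1])  # scores contradict the ranking
print(loss_agree, loss_disagree)
```

Four ranked responses yield six comparison pairs. The loss is lower when the scalar rewards agree with the human ordering, which is the signal that lets the reward model stand in for annotators in the next step.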

Third, the SFT model is fine-tuned with a reinforcement learning algorithm such as proximal policy optimization (PPO). PPO limits how much the policy is allowed to change during each update through a clipped loss function. In the RLHF setup described, it is paired with a KL divergence penalty that compares the current policy to the original SFT model, and with an entropy bonus that encourages exploration. The article also references other RL variants, for example group relative policy optimization (GRPO), as part of the progression from PPO to newer algorithms.
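The three PPO ingredients named above can be combined into one per-sample loss. The sketch below is an assumption-laden illustration rather than a reference implementation: the function names, the coefficient values (`clip_eps`, `kl_coef`, `entropy_coef`), and the toy log-probabilities are all hypothetical, and the KL term is a simple per-sample estimate added to the loss (some implementations fold it into the reward instead).

```python
import math


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (higher is better)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_rlhf_loss(logp_new, logp_old, logp_ref, advantages, entropies,
                  clip_eps=0.2, kl_coef=0.1, entropy_coef=0.01):
    """Mean loss: -clipped surrogate + KL penalty vs. the SFT model - entropy bonus."""
    total = 0.0
    for lp_new, lp_old, lp_ref, adv, ent in zip(
            logp_new, logp_old, logp_ref, advantages, entropies):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        surrogate = clipped_surrogate(ratio, adv, clip_eps)
        kl_estimate = lp_new - lp_ref      # rough per-sample estimate of KL(new || SFT)
        total += -surrogate + kl_coef * kl_estimate - entropy_coef * ent
    return total / len(advantages)


# Toy per-token values (illustrative only).
loss = ppo_rlhf_loss(
    logp_new=[-1.0, -0.5], logp_old=[-1.2, -0.4], logp_ref=[-1.1, -0.6],
    advantages=[0.8, -0.3], entropies=[1.5, 1.2],
)
print(loss)
```

With a positive advantage, raising the probability ratio beyond `1 + clip_eps` no longer improves the objective, which is how the clipping keeps each update close to the previous policy.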

Why it matters

Reasoning-focused post-training appears to yield consistent accuracy and problem-solving improvements, the author observes. That pattern makes reasoning not merely a feature but a likely future standard in LLM pipelines, especially if targeted compute investments like OpenAI's 10× claim for o3 continue to produce gains. The contrast between conventional releases and vendors exposing explicit reasoning toggles suggests the next wave of differentiation will be about training methods and runtime control, not just scale.

What to watch

Track whether more vendors adopt reasoning-specific post-training steps or expose reasoning toggles in interfaces beyond the currently noted xAI and Anthropic implementations. Also watch follow-up disclosures about o3 and comparable models that confirm whether the 10× training compute investment translates into repeatable, measurable reasoning gains across benchmarks.

Written by The Brieftide · Source: Ahead of AI
