OpenAI o3 and RL for LLM reasoning: 10× compute, PPO, GRPO
OpenAI's o3 used 10× more training compute than o1; xAI and Anthropic add "thinking" buttons while GPT-4.5 and Llama 4 remain non-reasoning.
TL;DR
- OpenAI says o3 used 10× more training compute than o1; xAI and Anthropic add "thinking" buttons while GPT-4.5 and Llama 4 remain non-reasoning.
- OpenAI released the o3 reasoning model, and OpenAI staff said during an April 16, 2025 livestream that o3 used 10× more training compute than o1.
- The release follows GPT-4.5 and Llama 4, recent launches that remained conventional models trained without explicit reinforcement learning for reasoning.
OpenAI released the o3 reasoning model, and OpenAI staff said during an April 16, 2025 livestream that o3 used 10× more training compute than o1. The release follows GPT-4.5 and Llama 4, recent launches that remained conventional models trained without explicit reinforcement learning for reasoning.
What vendors have done and how the market is shifting
Reactions to GPT-4.5 and Llama 4 were relatively muted, the author notes, in part because those models stayed "conventional," meaning they were trained without explicit reinforcement learning for reasoning. By contrast, competitors have moved to add reasoning capabilities and user control over them. The article points out that both the xAI Grok and Anthropic Claude interfaces now include a "thinking" (or "extended thinking") button for certain models that explicitly toggles reasoning.
OpenAI's o3 is presented as an example of investing compute specifically in reasoning. The article highlights the company's claim that o3 used 10× more training compute than o1, implying that targeted compute can yield measurable gains in reasoning performance. The piece also suggests that the muted response to some recent releases signals we may be approaching the limits of what scaling model size and data alone can achieve.
How reasoning-focused training differs: RLHF, PPO and reward models
The reinforcement learning methods used to train reasoning models are closely related to reinforcement learning from human feedback (RLHF), which has been part of the standard recipe since the InstructGPT paper. Conventional LLM development commonly follows three stages: pre-training, supervised fine-tuning, and alignment, typically via RLHF.
The RLHF pipeline is described in three practical steps. First, supervised fine-tuning (SFT) of the pre-trained model creates the starting point for further alignment. Second, a reward model is built by sampling prompts, generating multiple responses per prompt, and having human annotators rank those responses. The article notes that, for each prompt, the SFT model can generate four responses for humans to rank; the trained reward model then automates that ranking. To turn the SFT model into a reward model, its output layer is replaced with a regression layer that has a single output node.
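To make the reward-model step concrete, here is a minimal sketch of how human rankings might be turned into a training signal. The article does not state the loss function; the pairwise form below, -log(sigmoid(r_better - r_worse)) averaged over all ranked pairs, is the one commonly associated with InstructGPT-style reward models and is an assumption here. The function name `pairwise_ranking_loss` and the example scores are hypothetical.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores_in_rank_order):
    """Average -log(sigmoid(r_better - r_worse)) over all ranked pairs.

    scores_in_rank_order: scalar outputs of the reward model's single
    regression node, ordered from the response humans ranked best to the
    response they ranked worst. (Assumed loss form, not from the article.)
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the higher-ranked response.
    pairs = list(combinations(scores_in_rank_order, 2))
    total = 0.0
    for better, worse in pairs:
        margin = better - worse
        # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(pairs)


# Four responses to one prompt, as in the article's example.
agrees_with_humans = pairwise_ranking_loss([2.0, 1.0, 0.0, -1.0])
disagrees_with_humans = pairwise_ranking_loss([-1.0, 0.0, 1.0, 2.0])
print(agrees_with_humans < disagrees_with_humans)
```

Four ranked responses yield six pairs per prompt. The loss falls as the reward model scores higher-ranked responses above lower-ranked ones, which is what lets it stand in for the human annotators later.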
Third, the SFT model is fine-tuned with a reinforcement learning algorithm such as proximal policy optimization (PPO). PPO is described as limiting how much the policy may change in each update via a clipped loss function; it also includes a KL divergence penalty that compares the current policy to the original SFT model, and an entropy bonus that encourages exploration. The article also references newer RL variants, for example GRPO, as part of the progression beyond PPO.
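The three PPO ingredients named above (clipping, a KL penalty against the SFT model, and an entropy bonus) can be sketched in a few lines of plain Python. This is an illustrative per-token version under stated assumptions, not the article's or any library's implementation: the function names, the coefficient defaults (`clip_eps=0.2`, `kl_coef=0.1`, `entropy_coef=0.01`), and the simple log-ratio KL estimate are all hypothetical choices. The article only names GRPO; the group-normalized advantage in `group_relative_advantages` follows its published description and is likewise an assumption here.

```python
import math
from statistics import mean, pstdev


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one token (a quantity to maximise)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new / pi_old
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # outside the [1 - eps, 1 + eps] band.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(logp_new, logp_old, logp_ref, advantages, entropies,
                  clip_eps=0.2, kl_coef=0.1, entropy_coef=0.01):
    """Clipped surrogate, minus a KL penalty, plus an entropy bonus."""
    surrogate = mean([ppo_clipped_term(n, o, a, clip_eps)
                      for n, o, a in zip(logp_new, logp_old, advantages)])
    # Simple sample estimate of KL(current policy || SFT reference).
    kl_estimate = mean([n - r for n, r in zip(logp_new, logp_ref)])
    return surrogate - kl_coef * kl_estimate + entropy_coef * mean(entropies)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalise rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A ratio of exp(0.5), about 1.65, is clipped to 1.2 when the advantage
# is positive.
print(ppo_clipped_term(logp_new=0.5, logp_old=0.0, advantage=1.0))
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

With a negative advantage the unclipped term is the smaller one and is kept, so the policy is still penalized in full for making a bad action more likely. The group-relative function hints at why GRPO is seen as a simplification: advantages come from comparing sampled responses to each other rather than from a separately learned value estimate.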
Why it matters
Reasoning-focused post-training appears to yield consistent accuracy and problem-solving improvements, the author observes. That pattern makes reasoning not merely a feature but a likely future standard in LLM pipelines, especially if targeted compute investments like OpenAI's 10× claim for o3 continue to produce gains. The contrast between conventional releases and vendors exposing explicit reasoning toggles suggests the next wave of differentiation will be about training methods and runtime control, not just scale.
What to watch
Track whether more vendors adopt reasoning-specific post-training steps or expose reasoning toggles in interfaces beyond the currently noted xAI and Anthropic implementations. Also watch follow-up disclosures about o3 and comparable models that confirm whether the 10× training compute investment translates into repeatable, measurable reasoning gains across benchmarks.
Written by The Brieftide · Source: Ahead of AI