CEO-Bench benchmark: 3 AI models finished above capital
Princeton's CEO-Bench ran 500-day simulations of a fictional startup; only Claude Fable 5, Claude Opus 4.8 and GPT-5.5 finished their best run above the starting capital.
TL;DR
- Princeton researchers built CEO-Bench, a 500-day simulation that has AI agents run a fictional subscription software company called NovaMind, starting with $1,000,000 in cash.
- Of fourteen models tested, only three finished their best run above the starting capital: Claude Fable 5 ($47.15 million), Claude Opus 4.8 ($27.8 million) and GPT-5.5 ($21.3 million).
What is CEO-Bench and how does it work?
CEO-Bench has an AI agent run a startup for 500 simulated days. Performance is measured by the cash remaining at the end, and a run ends immediately if the balance drops below zero even once. The agent controls NovaMind through a Python API with 34 tools and a database of 19 tables, and it can write code, run SQL queries and compose workflows rather than issuing single-shot commands.
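The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's code: the function name `score_run` and the `daily_net_cash` input (one net cash change per simulated day) are assumptions made for the example, while the $1,000,000 start, the 500-day horizon and the stop-on-negative-balance rule come from the article.

```python
from typing import Iterable, Optional, Tuple

STARTING_CASH = 1_000_000  # starting capital reported in the article
HORIZON_DAYS = 500         # simulation length reported in the article


def score_run(
    daily_net_cash: Iterable[float],
    starting_cash: float = STARTING_CASH,
    horizon: int = HORIZON_DAYS,
) -> Tuple[float, Optional[int]]:
    """Return (final_cash, bankrupt_day); bankrupt_day is None if the run survived.

    daily_net_cash is a hypothetical input: one net cash change per simulated day.
    """
    cash = starting_cash
    for day, delta in enumerate(daily_net_cash, start=1):
        if day > horizon:
            break
        cash += delta
        if cash < 0:
            # The run ends the first time the balance drops below zero.
            return cash, day
    return cash, None


# A run that earns $100 a day survives all 500 days.
print(score_run([100] * 500))
# A run that overspends early ends on day 2, before later revenue arrives.
print(score_run([-600_000, -600_000, 5_000_000]))
```

The second call shows why the rule is harsh: revenue that would have arrived on day 3 never counts, because the balance went negative on day 2.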
The benchmark forces decisions that mirror real business trade-offs: pricing and tiers, advertising across channels, product quality and R&D, infrastructure capacity and customer support, plus multi-round enterprise negotiations. Feedback is delayed and noisy: revenue hits at billing dates, R&D takes days to weeks to pay off, and many state variables (customer satisfaction, willingness to pay) remain hidden and must be inferred from signals like cancellations, support tickets and social media posts.
How did models perform in the 500-day test?
Most tested agents go bankrupt before the simulation ends; only three models' best runs finish above the $1,000,000 start. Claude Fable 5 reached $47.15 million, Claude Opus 4.8 reached $27.8 million and GPT-5.5 reached $21.3 million. A simple rule-based heuristic that never calls a language model reached $15.76 million, beating every model except the three listed above.
The study tested 14 models in total and also estimated an approximate upper bound for achievable final cash at about $2.2 billion, a level far above the best agents. The authors flag caveats: one Fable 5 run aborted because the model refused to continue, and in the other two Fable 5 runs some requests fell back to Opus 4.8. GPT-5.5 itself went bankrupt in two of its three runs.
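To put the gap in numbers, the short Python sketch below converts the reported best-run figures into multiples of the starting capital and shares of the estimated upper bound. All dollar amounts are the ones quoted in the article; the helper name `summarize` is just a label for this example, and the $2.2 billion bound is itself described as approximate.

```python
STARTING_CASH = 1_000_000
UPPER_BOUND = 2_200_000_000  # "about $2.2 billion", an approximate estimate

# Best-run final cash as reported in the article.
best_runs = {
    "Claude Fable 5": 47_150_000,
    "Claude Opus 4.8": 27_800_000,
    "GPT-5.5": 21_300_000,
    "Rule-based heuristic": 15_760_000,
}


def summarize(final_cash: float) -> tuple[float, float]:
    """Return (multiple of starting capital, share of the estimated upper bound)."""
    return final_cash / STARTING_CASH, final_cash / UPPER_BOUND


for name, cash in best_runs.items():
    multiple, share = summarize(cash)
    print(f"{name}: {multiple:.2f}x starting capital, {share:.2%} of upper bound")
```

On these figures, even the best run lands at roughly 2% of the estimated upper bound.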
The researchers measured four capabilities that correlate with success: uncovering hidden information, predicting the future (error in four-week cash forecasts), adapting quickly to change (how fast a model notices a competitor move), and planning ahead (how often if-then scenarios appear in notes). Claude Opus 4.8 and GPT-5.5 score above the average of the other models on all four measures. Behaviorally, Opus 4.8 and GPT-5.5 explore new strategies as conditions change, while Opus 4.7 tends to cut costs and preserve cash, surviving but failing to profit.
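One of these measures, the error in four-week cash forecasts, can be illustrated with a small Python sketch. The article does not say how the researchers compute the error, so the mean absolute percentage error used here is an assumption chosen for illustration, and the function name and sample figures are hypothetical.

```python
from typing import Sequence


def mean_abs_pct_error(forecasts: Sequence[float], actuals: Sequence[float]) -> float:
    """Mean absolute percentage error between forecast and realized cash.

    Assumption: each forecast was made four weeks before the matching actual
    value was observed. The benchmark's real metric may differ.
    """
    if len(forecasts) != len(actuals) or len(forecasts) == 0:
        raise ValueError("forecasts and actuals must be non-empty and equal in length")
    if any(a == 0 for a in actuals):
        raise ValueError("actual cash values must be non-zero for a percentage error")
    errors = [abs(f - a) / abs(a) for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


# Hypothetical example: two forecasts, each off by $100,000 on $1,000,000 of actual cash.
error = mean_abs_pct_error([1_100_000, 900_000], [1_000_000, 1_000_000])
print(f"{error:.1%}")
```

A lower value means the agent's notes anticipated its cash position more closely, which is the kind of signal the forecasting measure is meant to capture.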
The software environment also matters. When the team paired Opus 4.7 with Claude Code and GPT-5.5 with Codex, both agents acted far less often and performed worse, a result the researchers attribute to system prompts tuned for software development. Compressing the simulation to 50 days did not solve the core problem: only GPT-5.5 managed to finish the shorter run with a profit.
Why it matters
CEO-Bench exposes a gap between models' local tool competence and the ability to connect actions into long-horizon strategy. Simple heuristics outperform most language models, and even the best agents fall far short of the study's rough upper bound. That gap matters for any real-world task that requires sequential prioritization, resource allocation under uncertainty and reading noisy, delayed signals over months.
What to watch
Watch for models that consistently finish above the $1,000,000 start across multiple runs without fallbacks or aborted sessions, and for results that narrow the gap to the researchers' estimated upper bound of roughly $2.2 billion. Also track whether tool-specific system prompts are adjusted so that coding assistants no longer reduce how often an agent acts or how well it performs.
| Item | Final cash (USD) | Notes |
|---|---|---|
| Starting capital | $1,000,000 | Simulation start |
| Claude Fable 5 (best run) | $47,150,000 | Best run; one run aborted and some requests fell back to Opus 4.8 |
| Claude Opus 4.8 (best run) | $27,800,000 | Reached $27.8M in best run |
| GPT-5.5 (best run) | $21,300,000 | Went bankrupt in two of three runs |
| Rule-based heuristic | $15,760,000 | No language model; reached $15.76M |
| Estimated upper bound | $2,200,000,000 | Authors' rough estimate of achievable final cash |
Written by The Brieftide · Source: The Decoder