Enterprise AI Adoption · 5 min read

CEO-Bench benchmark: 3 AI models finished above starting capital

Princeton's CEO-Bench ran 500-day simulations of a fictional startup; only Claude Fable 5.

The Brieftide

TL;DR

  • 01 Princeton researchers built CEO-Bench, a 500-day simulation that has AI agents run a fictional subscription software company called NovaMind, starting with $1,000,000 in cash.
  • 02 Of fourteen models tested, only three finished their best run above the starting capital: Claude Fable 5 ($47.15 million), Claude Opus 4.8 ($27.8 million) and GPT-5.5 ($21.3 million).

Princeton researchers built CEO-Bench, a 500-day simulation that has AI agents run a fictional subscription software company called NovaMind, starting with $1,000,000 in cash. Of fourteen models tested, only three finished their best run above the starting capital: Claude Fable 5 ($47.15 million), Claude Opus 4.8 ($27.8 million) and GPT-5.5 ($21.3 million).

What is CEO-Bench and how does it work?

CEO-Bench has an agent run a startup for 500 simulated days. Performance is measured by the cash remaining at the end, and a run ends immediately if the balance drops below zero even once. The agent controls NovaMind through a Python API with 34 tools and a database of 19 tables, and it can write code, run SQL queries and compose workflows rather than issuing single-shot commands.
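The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: the function name `score_run`, the idea of feeding it a list of daily net cash flows, and the choice to report the negative balance as the final cash of a bankrupt run are all assumptions made for the example. Only the $1,000,000 start, the 500-day horizon and the stop-on-negative-balance rule come from the article.

```python
from typing import Iterable, Tuple

STARTING_CASH = 1_000_000  # starting capital stated in the article
HORIZON_DAYS = 500         # simulation length stated in the article


def score_run(daily_net_cash_flows: Iterable[float],
              starting_cash: float = STARTING_CASH,
              horizon_days: int = HORIZON_DAYS) -> Tuple[float, int, bool]:
    """Return (final_cash, days_survived, bankrupt) for one hypothetical run.

    The input format is an assumption for illustration; the real benchmark
    derives cash from a full business simulation.
    """
    cash = starting_cash
    days = 0
    for day, flow in enumerate(daily_net_cash_flows, start=1):
        if day > horizon_days:
            break  # ignore anything past the simulated horizon
        cash += flow
        days = day
        if cash < 0:
            # The run ends the first time the balance drops below zero.
            return cash, days, True
    return cash, days, False


if __name__ == "__main__":
    # A run that earns a steady 100 per day survives all 500 days.
    print(score_run([100] * 500))
    # A run that overspends early never sees the later windfall.
    print(score_run([-600_000, -600_000, 5_000_000]))
```

The second call shows why the rule is harsh: a large payoff on day three does not count if the balance has already gone negative on day two.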

The benchmark forces decisions that mirror real business trade-offs: pricing and tiers, advertising across channels, product quality and R&D, infrastructure capacity and customer support, plus multi-round enterprise negotiations. Feedback is delayed and noisy: revenue hits at billing dates, R&D takes days to weeks to pay off, and many state variables (customer satisfaction, willingness to pay) remain hidden and must be inferred from signals like cancellations, support tickets and social media posts.

How did models perform in the 500-day test?

Most tested agents go bankrupt before the simulation ends; only three models' best runs finish above the $1,000,000 start. Claude Fable 5 reached $47.15 million, Claude Opus 4.8 reached $27.8 million and GPT-5.5 reached $21.3 million. A simple rule-based heuristic that never calls a language model reached $15.76 million, beating every model except the three listed above.

The study tested 14 models in total and estimated a rough upper bound on achievable final cash of about $2.2 billion, far above what the best agents reached. The authors flag caveats: one Fable 5 run aborted because the model refused to continue, and in the other two Fable 5 runs some requests fell back to Opus 4.8. GPT-5.5 itself went bankrupt in two of its three runs.
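To put the reported figures on one scale, the short script below expresses each best-run result as a multiple of the starting capital and as a share of the estimated upper bound. The dollar amounts are the ones reported in the article; the helper names `multiple_of_start` and `share_of_upper_bound` are made up for this sketch.

```python
STARTING_CASH = 1_000_000        # starting capital reported in the article
UPPER_BOUND = 2_200_000_000      # authors' rough upper-bound estimate

# Best-run final cash as reported in the article.
BEST_RUNS = {
    "Claude Fable 5": 47_150_000,
    "Claude Opus 4.8": 27_800_000,
    "GPT-5.5": 21_300_000,
    "Rule-based heuristic": 15_760_000,
}


def multiple_of_start(final_cash: float) -> float:
    """Final cash as a multiple of the starting capital."""
    return final_cash / STARTING_CASH


def share_of_upper_bound(final_cash: float) -> float:
    """Final cash as a fraction of the estimated upper bound."""
    return final_cash / UPPER_BOUND


if __name__ == "__main__":
    for name, cash in BEST_RUNS.items():
        print(f"{name}: {multiple_of_start(cash):.2f}x start, "
              f"{share_of_upper_bound(cash):.2%} of upper bound")
```

On these numbers, even the top result sits at roughly 2% of the estimated ceiling, which is the gap the authors point to.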

The researchers measured four capabilities that correlate with success: uncovering hidden information, predicting the future (error in four-week cash forecasts), adapting quickly to change (how fast a model notices a competitor move), and planning ahead (how often if-then scenarios appear in notes). Claude Opus 4.8 and GPT-5.5 score above the average of the other models on all four measures. Behaviorally, Opus 4.8 and GPT-5.5 explore new strategies as conditions change, while Opus 4.7 tends to cut costs and preserve cash, surviving but failing to profit.
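The article does not say exactly how the four-week forecast error is computed. One plausible reading, shown here purely as an assumption, is a mean absolute percentage error between the cash balance an agent predicted for 28 days ahead and the balance actually observed on that day. The function name, the input layout and the 28-day horizon constant are all hypothetical.

```python
from typing import Dict, Optional, Sequence

FORECAST_HORIZON_DAYS = 28  # "four weeks", interpreted as 28 days (assumption)


def four_week_forecast_error(forecasts: Dict[int, float],
                             actual_cash: Sequence[float],
                             horizon: int = FORECAST_HORIZON_DAYS) -> Optional[float]:
    """Mean absolute percentage error of cash forecasts made `horizon` days ahead.

    forecasts maps the day a forecast was made to the predicted cash balance
    on day + horizon. actual_cash[d] is the observed balance on day d.
    Forecasts whose target day lies beyond the observed data are skipped.
    """
    errors = []
    for day, predicted in forecasts.items():
        target = day + horizon
        if target >= len(actual_cash):
            continue  # the run ended before this forecast could be checked
        actual = actual_cash[target]
        if actual == 0:
            continue  # percentage error is undefined at zero
        errors.append(abs(predicted - actual) / abs(actual))
    return sum(errors) / len(errors) if errors else None


if __name__ == "__main__":
    # Toy data: cash grows by 10,000 per day over 60 days.
    actual = [1_000_000 + 10_000 * d for d in range(60)]
    forecasts = {0: 1_400_000, 10: 1_380_000, 40: 2_000_000}
    print(four_week_forecast_error(forecasts, actual))
```

A lower value means the agent's picture of its own finances four weeks out was closer to what actually happened.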

The software environment also matters. When the team paired Opus 4.7 with Claude Code and GPT-5.5 with Codex, both agents acted far less often and performed worse, a result the researchers attribute to system prompts tuned for software development. Compressing the simulation to 50 days did not solve the core problem: only GPT-5.5 managed to finish the shorter run with a profit.

Why it matters

CEO-Bench exposes a gap between models' local tool competence and the ability to connect actions into long-horizon strategy. Simple heuristics outperform most language models, and even the best agents fall far short of the study's rough upper bound. That gap matters for any real-world task that requires sequential prioritization, resource allocation under uncertainty and reading noisy, delayed signals over months.

What to watch

Watch for models that consistently finish above the $1,000,000 start across multiple runs without fallbacks or aborted sessions, and for results that narrow the gap toward the researchers' upper bound of roughly $2.2 billion. Also track whether tool-specific system prompts are adjusted so coding assistants no longer reduce an agent's frequency of action and overall performance.

Final cash after 500-day CEO-Bench runs
Item | Final cash (USD) | Note
Starting capital | $1,000,000 | Simulation start
Claude Fable 5 (best run) | $47,150,000 | Best run; one run aborted and some requests fell back to Opus 4.8
Claude Opus 4.8 (best run) | $27,800,000 | Reached $27.8M in best run
GPT-5.5 (best run) | $21,300,000 | Went bankrupt in two of three runs
Rule-based heuristic | $15,760,000 | No language model; reached $15.76M
Estimated upper bound | $2,200,000,000 | Authors' rough estimate of achievable final cash

Written by The Brieftide · Source: The Decoder
