Enterprise AI Adoption · June 28, 2026 · 5 min read

CEO-Bench benchmark: only 3 AI models finished above starting capital

Princeton's CEO-Bench ran 500-day simulations of a fictional startup; only Claude Fable 5, Claude Opus 4.8 and GPT-5.5 finished their best run above the starting capital.

The Brieftide · June 28, 2026

TL;DR

01 Princeton's CEO-Bench ran 500-day simulations of a fictional startup; only Claude Fable 5, Claude Opus 4.8 and GPT-5.5 finished their best run above the starting capital.
02 Princeton researchers built CEO-Bench, a 500-day simulation that has AI agents run a fictional subscription software company called NovaMind, starting with $1,000,000 in cash.
03 Of fourteen models tested, only three finished their best run above the starting capital: Claude Fable 5 ($47.15 million), Claude Opus 4.8 ($27.8 million) and GPT-5.5 ($21.3 million).

Princeton researchers built CEO-Bench, a 500-day simulation that has AI agents run a fictional subscription software company called NovaMind, starting with $1,000,000 in cash. Of fourteen models tested, only three finished their best run above the starting capital: Claude Fable 5 ($47.15 million), Claude Opus 4.8 ($27.8 million) and GPT-5.5 ($21.3 million).

What is CEO-Bench and how does it work?

CEO-Bench has an AI agent run a startup for 500 simulated days. Performance is measured by the cash remaining at the end, and a run ends immediately if the balance drops below zero even once. The agent controls NovaMind through a Python API with 34 tools and a database of 19 tables; it can write code, run SQL queries and compose workflows rather than issue single-shot commands.
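The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the function name `score_run` and the idea of feeding it a list of daily net cash flows are assumptions made for the example, and reporting the negative balance for a bankrupt run is likewise an assumption. Only the $1,000,000 start, the 500-day horizon and the "ends the first time cash goes below zero" rule come from the article.

```python
from typing import Iterable, Optional, Tuple

STARTING_CASH = 1_000_000  # starting capital stated in the article
SIM_DAYS = 500             # simulation length stated in the article


def score_run(daily_net_cash_flows: Iterable[float],
              starting_cash: float = STARTING_CASH,
              max_days: int = SIM_DAYS) -> Tuple[float, Optional[int]]:
    """Hypothetical scorer: return (final_cash, bankrupt_day).

    bankrupt_day is None if the run survived every day, otherwise the
    1-based day on which the balance first dropped below zero.
    """
    cash = starting_cash
    for day, flow in enumerate(daily_net_cash_flows, start=1):
        if day > max_days:
            break
        cash += flow
        if cash < 0:
            # The run ends the first time the balance goes negative,
            # even if later cash flows would have recovered it.
            return cash, day
    return cash, None


# A run that earns a steady $2,000 per day survives all 500 days.
survivor = score_run([2_000] * 500)

# A run that burns cash too fast ends on day 3; the large inflow
# scheduled for day 4 is never received.
bankrupt = score_run([-400_000, -400_000, -400_000, 5_000_000])
```

The second call shows why the rule is harsh: a strategy that would pay off later still scores as a failure if it runs out of cash first.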

The benchmark forces decisions that mirror real business trade-offs: pricing and tiers, advertising across channels, product quality and R&D, infrastructure capacity and customer support, plus multi-round enterprise negotiations. Feedback is delayed and noisy: revenue hits at billing dates, R&D takes days to weeks to pay off, and many state variables (customer satisfaction, willingness to pay) remain hidden and must be inferred from signals like cancellations, support tickets and social media posts.
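One simple way to read a noisy signal such as daily cancellations is to smooth it before acting on it. The sketch below uses an exponential moving average; this is a generic technique chosen for illustration, not something the article attributes to the benchmark or to any tested model, and the cancellation counts are made up.

```python
from typing import List, Sequence


def exponential_moving_average(values: Sequence[float],
                               alpha: float = 0.2) -> List[float]:
    """Smooth a noisy series; higher alpha reacts faster to new data."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for value in values:
        if not smoothed:
            # Seed the average with the first observation.
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up daily cancellation counts: noisy at first, then clearly rising.
daily_cancellations = [3, 5, 2, 4, 9, 11, 10, 12]
trend = exponential_moving_average(daily_cancellations, alpha=0.5)
```

A rising smoothed trend would be one hint that a hidden variable such as customer satisfaction is deteriorating, though the delay built into smoothing also means the agent notices later.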

How did models perform in the 500-day test?

Most of the tested agents went bankrupt before the simulation ended; only three models' best runs finished above the $1,000,000 start. Claude Fable 5 reached $47.15 million, Claude Opus 4.8 reached $27.8 million and GPT-5.5 reached $21.3 million. A simple rule-based heuristic that never calls a language model reached $15.76 million, beating every model except those three.
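To put those figures on a common scale, the short calculation below expresses each best-run result as a multiple of the starting capital. The dollar amounts are the ones reported above; the helper name `multiple_of_start` is just a label for this example.

```python
STARTING_CASH = 1_000_000  # starting capital stated in the article

# Best-run final cash as reported in the article.
best_run_final_cash = {
    "Claude Fable 5": 47_150_000,
    "Claude Opus 4.8": 27_800_000,
    "GPT-5.5": 21_300_000,
    "Rule-based heuristic": 15_760_000,
}


def multiple_of_start(final_cash: float,
                      starting_cash: float = STARTING_CASH) -> float:
    """Final cash divided by starting cash."""
    return final_cash / starting_cash


multiples = {name: multiple_of_start(cash)
             for name, cash in best_run_final_cash.items()}

for name, multiple in sorted(multiples.items(),
                             key=lambda item: item[1], reverse=True):
    print(f"{name}: {multiple:.2f}x starting capital")
```

These are best runs only, so they say nothing about how reliably a model reaches that level.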

The study tested 14 models in total. The authors also put a rough upper bound on achievable final cash at about $2.2 billion, far above what the best agents reached. They flag caveats: one Fable 5 run was aborted because the model refused to continue, and in the other two Fable 5 runs some requests fell back to Opus 4.8. GPT-5.5 went bankrupt in two of its three runs.
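The size of the gap to the upper bound, and GPT-5.5's survival rate, follow directly from the reported numbers. The variable names below are chosen for this example only.

```python
UPPER_BOUND = 2_200_000_000   # authors' rough upper bound, as reported
BEST_AGENT_CASH = 47_150_000  # Claude Fable 5, best run, as reported

share_of_bound = BEST_AGENT_CASH / UPPER_BOUND  # fraction of the bound reached
headroom = UPPER_BOUND / BEST_AGENT_CASH        # how many times larger the bound is

# GPT-5.5: three runs, two of which ended in bankruptcy.
gpt55_runs = 3
gpt55_bankruptcies = 2
gpt55_survival_rate = (gpt55_runs - gpt55_bankruptcies) / gpt55_runs

print(f"Best agent reached {share_of_bound:.1%} of the estimated bound")
print(f"The bound is about {headroom:.0f}x the best run")
print(f"GPT-5.5 survived {gpt55_survival_rate:.0%} of its runs")
```

On these figures the best run captured roughly 2% of the estimated ceiling, and GPT-5.5 survived one run in three.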

The researchers measured four capabilities that correlate with success: uncovering hidden information, predicting the future (error in four-week cash forecasts), adapting quickly to change (how fast a model notices a competitor move), and planning ahead (how often if-then scenarios appear in notes). Claude Opus 4.8 and GPT-5.5 score above the average of the other models on all four measures. Behaviorally, Opus 4.8 and GPT-5.5 explore new strategies as conditions change, while Opus 4.7 tends to cut costs and preserve cash, surviving but failing to profit.
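Two of these measures lend themselves to a quick sketch. The article does not say how the researchers computed them, so the choices here are assumptions: forecast error is taken as mean absolute percentage error, and "planning ahead" as the share of notes containing an if ... then pattern. The function names, the regular expression and the sample data are all hypothetical.

```python
import re
from typing import Sequence


def mean_abs_pct_error(forecasts: Sequence[float],
                       actuals: Sequence[float]) -> float:
    """Average of |forecast - actual| / |actual| (assumed error metric).

    Raises ZeroDivisionError if any actual value is zero.
    """
    if len(forecasts) != len(actuals) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(f - a) / abs(a) for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


# Matches "if ... then" within one sentence, case-insensitively.
IF_THEN = re.compile(r"\bif\b[^.\n]*\bthen\b", re.IGNORECASE)


def if_then_rate(notes: Sequence[str]) -> float:
    """Share of notes that contain at least one if-then scenario."""
    if not notes:
        return 0.0
    hits = sum(1 for note in notes if IF_THEN.search(note))
    return hits / len(notes)


# Made-up four-week cash forecasts versus the cash actually observed.
forecast_error = mean_abs_pct_error([1_100_000, 900_000],
                                    [1_000_000, 1_000_000])

# Made-up agent notes.
notes = [
    "If churn rises above 5% then cut ad spend.",
    "Raised price on the Pro tier.",
    "Hired two support agents.",
]
planning_rate = if_then_rate(notes)
```

A lower forecast error and a higher if-then rate would both point in the direction the researchers associate with success, but a keyword pattern like this one is a crude proxy for planning.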

The software environment also matters. When the team paired Opus 4.7 with Claude Code and GPT-5.5 with Codex, both agents acted far less often and performed worse, a result the researchers attribute to system prompts tuned for software development. Compressing the simulation to 50 days did not solve the core problem: only GPT-5.5 managed to finish the shorter run with a profit.

Why it matters

CEO-Bench exposes a gap between models' local tool competence and the ability to connect actions into long-horizon strategy. Simple heuristics outperform most language models, and even the best agents fall far short of the study's rough upper bound. That gap matters for any real-world task that requires sequential prioritization, resource allocation under uncertainty and reading noisy, delayed signals over months.

What to watch

Watch for models that consistently finish above the $1,000,000 start across multiple runs without fallbacks or aborted sessions, and for results that narrow the gap to the researchers' upper bound of roughly $2.2 billion. Also track whether tool-specific system prompts are adjusted so that coding assistants no longer reduce how often an agent acts and how well it performs.

Final cash after 500-day CEO-Bench runs

Item	Final cash (USD)	Notes
Starting capital	1,000,000	Simulation start
Claude Fable 5 (best run)	47,150,000	Best run; one run aborted and some requests fell back to Opus 4.8
Claude Opus 4.8 (best run)	27,800,000	Reached $27.8M in best run
GPT-5.5 (best run)	21,300,000	Went bankrupt in two of three runs
Rule-based heuristic	15,760,000	No language model; reached $15.76M
Estimated upper bound	2,200,000,000	Authors' rough estimate of achievable final cash

Written by The Brieftide · Source: The Decoder
