CEO-Bench: Agents in a 500-day startup simulation benchmark
arXiv paper CEO-Bench evaluates agents managing pricing, marketing and budgeting across a 500-day simulated startup.
TL;DR
- arXiv paper CEO-Bench evaluates agents managing pricing, marketing and budgeting across a 500-day simulated startup.
- The arXiv paper (arXiv:2606.18543) was submitted on 16 Jun 2026 by Haozhe Chen, Karthik Narasimhan and Zhuang Liu.
- The benchmark gives agents the same interface and challenges a human CEO would face, asking them to translate noisy signals into strategy and to use programming when appropriate.
CEO-Bench is a new evaluation that places language model agents in a 500-day simulated startup, testing whether agents can manage pricing, marketing, budgeting and other interdependent business decisions over long horizons. The arXiv paper (arXiv:2606.18543) was submitted on 16 Jun 2026 by Haozhe Chen, Karthik Narasimhan and Zhuang Liu.
What is CEO-Bench?
CEO-Bench is a programmable, Python-driven environment that simulates "operating a startup for 500 days". Agents must analyze noisy, interconnected business databases, acquire information in noisy environments, adapt to a changing world and coordinate many moving parts toward a coherent goal. The benchmark gives agents the same interface and challenges a human CEO would face, asking them to translate noisy signals into strategy and to use programming when appropriate.
The task suite includes managing pricing, marketing and budgeting, among other operational decisions. The authors note that the strongest agents write sophisticated code inside the environment, for example simulating customer cohorts to forecast cash and mining negotiation history for hidden customer preferences.
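To make "simulating customer cohorts to forecast cash" concrete, here is a minimal Python sketch of the general technique. It is not code from the paper or from any tested agent: the function name `forecast_cash`, the `Cohort` structure, and every parameter and number in it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """Customers acquired on the same day (hypothetical structure)."""
    size: float           # expected number of still-active customers
    daily_revenue: float  # revenue per active customer per day


def forecast_cash(
    starting_cash: float,
    days: int,
    new_customers_per_day: float,
    daily_revenue_per_customer: float,
    daily_churn: float,
    daily_fixed_cost: float,
) -> list[float]:
    """Return the projected cash balance at the end of each simulated day."""
    cohorts: list[Cohort] = []
    cash = starting_cash
    balances: list[float] = []
    for _ in range(days):
        # Existing cohorts shrink by the churn rate before today's billing.
        for cohort in cohorts:
            cohort.size *= 1.0 - daily_churn
        # A new cohort joins today and is billed from its first day.
        cohorts.append(Cohort(new_customers_per_day, daily_revenue_per_customer))
        revenue = sum(c.size * c.daily_revenue for c in cohorts)
        cash += revenue - daily_fixed_cost
        balances.append(cash)
    return balances


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from CEO-Bench.
    projection = forecast_cash(
        starting_cash=1_000_000.0,
        days=500,
        new_customers_per_day=5.0,
        daily_revenue_per_customer=3.0,
        daily_churn=0.01,
        daily_fixed_cost=2_000.0,
    )
    print(f"Projected balance on day 500: {projection[-1]:,.0f}")
```

Tracking each cohort separately, rather than a single customer count, lets a forecast of this kind later assign different churn or revenue to customers acquired at different prices or through different marketing channels.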
How do state-of-the-art agents perform in CEO-Bench?
Only two of the tested models, Claude Opus 4.8 and GPT-5.5, finished above the $1M starting balance, and the paper states that neither consistently turns a profit. The authors report that most state-of-the-art models struggle in this environment.
Those outcomes reflect the benchmark's emphasis on long-horizon planning under uncertainty. Agents that succeed at specific short-horizon tasks do not automatically sustain adaptive progress over hundreds of simulated days. The advanced behaviors the paper highlights, such as writing code to simulate customer cohorts for cash forecasting and mining negotiation histories to uncover customer preferences, did not guarantee steady profitability across runs.
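As a rough illustration of what mining negotiation histories for customer preferences could look like, the Python sketch below brackets each customer segment's acceptable price between the highest offer it accepted and the lowest offer it rejected. The record layout `(segment, offered_price, accepted)` and the function name `price_bounds` are assumptions made for this example; the paper's actual data schema and the agents' actual code are not described here.

```python
from collections import defaultdict


def price_bounds(history):
    """Bracket each segment's acceptable price from past offers.

    `history` is an iterable of (segment, offered_price, accepted) tuples,
    a hypothetical record format. Returns a dict mapping each segment to
    (highest_accepted, lowest_rejected), using None where no such offer
    was observed.
    """
    accepted = defaultdict(list)
    rejected = defaultdict(list)
    for segment, price, was_accepted in history:
        bucket = accepted if was_accepted else rejected
        bucket[segment].append(price)

    segments = sorted(set(accepted) | set(rejected))
    return {
        s: (
            max(accepted[s]) if accepted[s] else None,
            min(rejected[s]) if rejected[s] else None,
        )
        for s in segments
    }


if __name__ == "__main__":
    # Made-up negotiation records for illustration.
    sample = [
        ("smb", 40, True),
        ("smb", 60, False),
        ("smb", 50, True),
        ("enterprise", 200, True),
    ]
    print(price_bounds(sample))
```

In this toy data the "smb" segment accepted offers up to 50 and rejected 60, so a next offer between those two values would narrow the bracket. Real negotiation logs would be noisier and would likely need a statistical model rather than simple maxima and minima.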
Why it matters
CEO-Bench combines multiple real-world capabilities that are rarely tested together: navigating long horizons amid uncertainty; acquiring information in noisy environments; adapting to a changing world; and orchestrating many moving parts toward a coherent goal. Measuring these together exposes gaps between short-horizon task performance and sustained, adaptive decision making over time.
That gap matters because real-world roles like product management, operations and executive decision making require sustained coordination across noisy signals and interacting subsystems. The paper shows that current agents can produce impressive, targeted analyses and programmatic strategies, yet fail to reliably steer a simulated company to consistent profit across a 500-day window.
What to watch
Watch whether future agent designs close the gap on sustained profitability and whether benchmark variants add clearer causal signals for long-term learning. The authors position CEO-Bench as a first step toward measuring the intelligence required to drive sustained, adaptive progress over time; follow-up work that expands the set of tested models or the environment dynamics would be the next concrete milestone.
Bibliographic note: the paper is listed as arXiv:2606.18543 and was submitted on 16 Jun 2026 by Haozhe Chen, Karthik Narasimhan and Zhuang Liu. The authors provide code-driven interactions and examples of agent-written analyses inside the environment, emphasizing the role of programmatic decision making in the benchmark.
| Model | Final balance vs. $1M start | Consistently profitable? | Notes |
|---|---|---|---|
| Claude Opus 4.8 | Finished above $1M | No | One of the models that finished above the $1M starting balance |
| GPT-5.5 | Finished above $1M | No | One of the models that finished above the $1M starting balance |
| Most other state-of-the-art models | Did not finish above $1M | No | Authors state that most state-of-the-art models struggle in this environment |
Written by The Brieftide · Source: arXiv