Epoch AI MirrorCode benchmark: Claude Opus 4.7 tops at 56%
Epoch AI and METR's MirrorCode requires full-program reimplementation; Claude Opus 4.7 solved 56 percent of tasks, while the largest tasks defeated every model.
TL;DR
- Epoch AI and METR's MirrorCode requires full-program reimplementation; Claude Opus 4.7 solved 56 percent of tasks, while the largest tasks defeated every model.
- Epoch AI and METR released MirrorCode, a new benchmark that asks AI models to reimplement complete programs from scratch across 25 targets and multiple domains.
- Claude Opus 4.7 leads the ranking with a 56 percent solve rate, while GPT-5.5 scores 44 percent and Gemini 3.1 Pro Preview scores 32 percent.
Epoch AI and METR released MirrorCode, a new benchmark that asks AI models to reimplement complete programs from scratch across 25 targets and multiple domains. Claude Opus 4.7 leads the ranking with a 56 percent solve rate, while GPT-5.5 scores 44 percent and Gemini 3.1 Pro Preview scores 32 percent.
What is MirrorCode and how does it work?
MirrorCode requires models to recreate entire programs so that their outputs exactly match the original's, including on hidden end-to-end tests the model never sees. The benchmark covers 25 target programs spanning Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression, and it verifies correctness by checking that a reimplementation reproduces the original program's outputs on a set of tests.
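The output-matching idea can be illustrated with a small differential test harness. This is a minimal sketch, not MirrorCode's actual scaffold: the function names `run` and `outputs_match` are hypothetical, and comparing exit code plus standard output is an assumption about what "exactly match" means. The two demo commands are toy stand-ins for an original program and a reimplementation.

```python
import subprocess
import sys
from typing import Sequence


def run(cmd: Sequence[str], stdin_text: str = "") -> tuple[int, str]:
    """Run a command and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=60
    )
    return proc.returncode, proc.stdout


def outputs_match(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    cases: Sequence[tuple[Sequence[str], str]],
) -> list[bool]:
    """For each (args, stdin) case, check whether the candidate reproduces
    the reference program's exit code and stdout exactly."""
    results = []
    for args, stdin_text in cases:
        expected = run([*reference_cmd, *args], stdin_text)
        actual = run([*candidate_cmd, *args], stdin_text)
        results.append(expected == actual)
    return results


# Toy stand-ins (hypothetical): the "original" upper-cases its input,
# the "reimplementation" has a bug that only shows up on the letter z.
reference = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
candidate = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper().replace('Z', 'z'), end='')"]

results = outputs_match(reference, candidate, [([], "abc"), ([], "xyz")])
print(results)
```

In this toy case the first input matches and the second does not, which is the sense in which a hidden test can expose a reimplementation that looks correct on the inputs it was developed against.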
MirrorCode also departs from many existing software engineering benchmarks by removing tight per-task inference caps. Whereas other suites often cap costs at $1 to $10 per task, Epoch AI ran at least one large task that cost $2,600 for a single run and took a single AI system 19 days of continuous, unattended work.
How did current models perform on MirrorCode?
Claude Opus 4.7 leads with a 56 percent overall solve rate, GPT-5.5 follows at 44 percent, and Gemini 3.1 Pro Preview comes in at 32 percent; even failing submissions typically pass 90 percent or more of tests. Small programs such as uuid or parseqsv are reliably reimplemented by all tested models, while the largest tasks stump every model so far.
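The gap between solve rate and test pass rate is easier to see with numbers. The sketch below assumes a task counts as solved only when every test passes, which is how the article's description reads but is not confirmed scoring code; the `TaskResult` type, task names, and test counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    tests_passed: int
    tests_total: int

    @property
    def pass_rate(self) -> float:
        """Fraction of tests whose output matched."""
        return self.tests_passed / self.tests_total

    @property
    def solved(self) -> bool:
        """Assumption: solved means every single test passed."""
        return self.tests_passed == self.tests_total


def solve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that were fully solved."""
    return sum(r.solved for r in results) / len(results)


# Hypothetical results, not MirrorCode data.
demo = [
    TaskResult("task_a", 200, 200),
    TaskResult("task_b", 188, 200),  # 94% of tests pass, still unsolved
    TaskResult("task_c", 50, 50),
    TaskResult("task_d", 95, 100),   # 95% of tests pass, still unsolved
]

print(f"solve rate: {solve_rate(demo):.0%}")
for r in demo:
    print(f"{r.name}: pass rate {r.pass_rate:.0%}, solved={r.solved}")
```

Under that all-or-nothing assumption, a model can pass well over 90 percent of tests on every task it attempts and still post a solve rate far below 90 percent.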
A standout example: Claude Opus 4.7 reimplemented gotree, a bioinformatics toolkit of roughly 16,000 lines of Go and over 40 commands, in 14 hours for $251, a task Epoch AI says would take a human 2 to 17 weeks without AI help. The researchers note that leading models from about a year ago would have scored roughly 30 percent and been limited to simpler programs.
Cost behavior varied by model: Epoch AI reports that GPT-5.5 is three times as expensive to run as GPT-5 on the same tasks, while Claude Opus 4.7 costs about a third as much as Claude Opus 4.1 on the same workload.
Epoch AI has open-sourced the benchmark scaffold and 22 of the 25 target programs, covering 132 task instances across six programming languages; three programs remain private for testing.
Why does this matter?
MirrorCode moves evaluation from unit-level or snippet synthesis to full-program correctness, exposing work that requires sustained planning, multi-file structure, and long-run verification. The benchmark shows that models can already tackle demanding engineering work (Claude Opus 4.7 rebuilt a roughly 16,000-line toolkit in 14 hours), but it also shows a clear ceiling: no model has yet cracked the largest tasks.
That combination matters for teams considering AI for real development: models can accelerate sizable pieces of work, but they currently fail the most complex, long-horizon engineering projects. The open-sourced scaffold and targets create a repeatable testbed for researchers and practitioners to measure progress and cost trade-offs.
What are the limits and caveats?
Epoch AI warns that MirrorCode targets are drawn from open-source projects, so memorization could contribute to results; "the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance," the researchers write. That uncertainty means top-line solve rates may partly reflect training data overlap rather than pure generalization.
The benchmark also exposes unpredictable cost profiles: one large task cost $2,600 for a single unattended run and required 19 days of continuous inference, while other tasks finished in hours and at far lower cost, as with the gotree example that completed in 14 hours for $251.
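A quick back-of-the-envelope comparison puts the two reported runs on the same scale. The per-hour figures below are derived here, not reported by Epoch AI, and they assume "19 days" means 19 × 24 hours of wall-clock time.

```python
def cost_per_hour(total_cost_usd: float, hours: float) -> float:
    """Average spend per wall-clock hour for one run."""
    return total_cost_usd / hours


# Figures from the article: gotree took 14 hours and cost $251;
# the large task took 19 days and cost $2,600.
gotree_rate = cost_per_hour(251, 14)
large_task_rate = cost_per_hour(2600, 19 * 24)  # assumes 24-hour days

print(f"gotree:     ${gotree_rate:.2f} per hour")      # about $17.93
print(f"large task: ${large_task_rate:.2f} per hour")  # about $5.70
```

On these assumptions the 19-day run was cheaper per hour than the gotree run; its total was driven by duration rather than by a higher spend rate.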
What to watch next
Check whether future MirrorCode runs close the gap on the large tasks and whether the three private target programs change rankings. Also watch for independent analyses that measure training-data overlap against the 22 open-sourced targets, and for comparisons that report end-to-end costs across models on the same set of large programs.
| Item | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro Preview |
|---|---|---|---|
| Solve rate (%) | 56 | 44 | 32 |
| Typical test pass rate on failing submissions | 90%+ | 90%+ | 90%+ |
| gotree reimplementation (time, cost) | 14 hours, $251 | N/A | N/A |
| Cracks largest tasks? | No | No | No |
Written by The Brieftide · Source: The Decoder