Foundation ModelsJune 22, 20265 min read

Sakana AI Fugu launch: matches Anthropic Fable and Mythos

Fugu coordinates a swappable pool of LLMs behind one API; Fugu Ultra posts parity with Anthropic's Fable 5 and Mythos Preview in Sakana's.

The BrieftideJune 22, 2026

TL;DR

01Fugu coordinates a swappable pool of LLMs behind one API; Fugu Ultra posts parity with Anthropic's Fable 5 and Mythos Preview in Sakana's.
02Sakana AI is launching Fugu, a multi-LLM orchestrator that coordinates multiple language models from a swappable agent pool while presenting a single OpenAI-compatible API.
03Sakana published benchmark results on June 22, 2026 showing its Fugu Ultra variant performs on par with Anthropic's Fable 5 and Mythos Preview across coding, reasoning, science, and agent tests.

Sakana AI is launching Fugu, a multi-LLM orchestrator that coordinates multiple language models from a swappable agent pool while presenting a single OpenAI-compatible API. Sakana published benchmark results on June 22, 2026 showing its Fugu Ultra variant performs on par with Anthropic's Fable 5 and Mythos Preview across coding, reasoning, science, and agent tests.

What is Fugu and how does it work?

Fugu is a language model trained to call other models from a swappable pool and either handle requests itself or assemble a team of specialist agents, with selection, delegation, checks, and synthesis running internally. Users interact through a single API; the pool can include copies of Fugu and can be configured to exclude agents for privacy or compliance reasons.

Sakana positions Fugu as a single logical model that orchestrates calls to many models. The company built earlier orchestrator work into this approach: its ALE-Agent placed 21st out of 1,000 human experts in a coding competition, and two Sakana papers, Trinity and Conductor, were presented at ICLR 2026.

How did Fugu perform in benchmarks and real tests?

Sakana's published table shows Fugu Ultra matching or beating several top models on a range of benchmarks, with specific scores such as SWE Bench Pro: Fugu 59.0, Fugu Ultra 73.7, Opus 4.8 69.2, Gemini 3.1 Pro 54.2, GPT 5.5 58.6. TerminalBench 2.1 lists Fugu 80.2 and Fugu Ultra 82.1 versus Opus 74.6, Gemini 70.3, and GPT 78.2.

Other sample results from Sakana's table: LiveCodeBench 92.9 (Fugu) and 93.2 (Fugu Ultra); LiveCodeBench Pro 87.8 and 90.8; GPQA-D 95.5 for both Fugu and Fugu Ultra; MRCRv2 86.6 (Fugu) and 93.6 (Fugu Ultra). The company notes neither Anthropic model is part of Fugu's agent pool because they are not publicly available, and says the baseline comparison numbers come from the model providers themselves.

About 500 beta users tested the system in real-world settings. Sakana says Fugu proved strongest on long, multi-step workflows like automated data research, security analysis, and code reviews. One software developer told Sakana, "Where other tools flag about three issues, Fugu surfaced more than twenty."

Both Fugu and Fugu Ultra are live now through a single API on Sakana's product page and console, with subscription plans for daily use and usage-based billing for heavier workloads.

Why it matters

Fugu's orchestration design directly addresses single-provider dependence. Sakana cites recent export controls on Anthropic's Fable and Mythos as an example and writes, "For an organization or a nation, relying on a single company’s APIs for critical infrastructure, finance, or governance is a material vulnerability. This risk is no longer a hypothetical possibility, but a reality." A swappable pool lets an operator route around a provider that goes dark, which changes how organizations can think about operational resilience.

Performance-wise, the results suggest orchestration can lift effective capability above what any single pool member delivers. The trade-offs are explicit in Sakana's announcement: orchestration's real-world value depends on which models are available in the pool, and Sakana does not disclose how orchestration affects token usage or costs.

What to watch

Watch which providers Sakana plugs into public pools and whether Sakana adds Anthropic models if access changes; Sakana itself says including those models would likely push scores higher. Also watch adoption signals from enterprise users handling long, multi-step workflows and any disclosures about token consumption or cost impact from multi-LLM calls.

Sakana AI background: the startup was founded by former Google AI researchers Llion Jones and David Ha; Jones co-authored the 2017 "Attention Is All You Need" paper. The company frames Fugu as an ecosystem approach to AI rather than a single-model product.

Written by The Brieftide · Source: The Decoder

The Brieftide Daily · 06:00

Briefs like this one, in your inbox every morning.

FreeOne email a dayEvery claim sourcedUnsubscribe in one click

Continue reading

LLM scaling: Sam Altman says researchers underestimated it

At Stanford on Jun 21, 2026, Sam Altman argued scaling LLMs has yielded new knowledge and blamed a generation of researchers for.

The BrieftideDAILY BRIEF

BIM-Edit: Benchmarking LLMs for IFC-based BIM Editing

BIM-Edit evaluates LLMs on 324 IFC editing tasks across 11 real models and 36 synthetic scenes; the top model averages 49.5%.

The BrieftideDAILY BRIEF

QMFOL benchmark: QMFOLBench with 2880 logic instances

QMFOL generates monadic first-order logic problems and ships QMFOLBench with 2880 instances to measure LLM deductive reasoning across.

The BrieftideDAILY BRIEF

DeFAb: Defeasible Abduction Benchmark, 372,648+ instances

DeFAb converts four decades of publicly funded knowledge bases into 372.