Coding Agents · July 2, 2026 · 5 min read

Remote Labor Index: Fable 5 hits 16.1% pro freelance automation

Fable 5 now completes 16.1 percent of tested paid freelance projects at client-acceptable quality.

The Brieftide · July 2, 2026

TL;DR

01 Fable 5 now completes 16.1 percent of tested paid freelance projects at client-acceptable quality.
02 Fable 5 now completes 16.1 percent of paid freelance projects at a quality level a client would accept, the Remote Labor Index found, up from a 2.5 percent frontier when the benchmark first launched.
03 The Remote Labor Index measures the share of real, commercially valuable freelance projects where an AI agent's output is rated at least as good as a human's.

Fable 5 now completes 16.1 percent of paid freelance projects at a quality level a client would accept, the Remote Labor Index found, up from a 2.5 percent frontier when the benchmark first launched.

What did the Remote Labor Index measure and find?

The Remote Labor Index measures the share of real, commercially valuable freelance projects where an AI agent's output is rated at least as good as a human's. The benchmark covers 240 projects worth a combined $144,000, sourced from 358 verified freelancers. Human evaluators at the Center for AI Safety score each deliverable against a gold standard created by a paid professional.

Key numbers from the latest run: Fable 5 topped the leaderboard at 16.1 percent automation, Opus 4.8 scored 8.3 percent, GPT-5.5 scored 6.3 percent, the previous leader Opus 4.6 stood at 4.17 percent, and Gemini 3 Pro landed near the bottom at 1.25 percent. The authors note the frontier has more than quadrupled in under eight months. Fable 5 could be evaluated on only 218 of the 240 projects before U.S. government restrictions limited access; even if it had failed every one of the 22 missing projects, its rate would still be 14.6 percent.
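The worst-case figure can be checked with a few lines of arithmetic. The sketch below assumes the reported 16.1 percent is simply accepted projects divided by evaluated projects, rounded to one decimal place; the count of accepted projects is inferred from that assumption and is not stated in the source.

```python
# Figures reported in the article.
TOTAL_PROJECTS = 240
EVALUATED = 218
REPORTED_RATE = 0.161   # Fable 5 on the projects it was evaluated on
LAUNCH_FRONTIER = 0.025

# Assumption: rate = accepted / evaluated, so the accepted count is the
# nearest whole number of projects consistent with the reported rate.
accepted = round(REPORTED_RATE * EVALUATED)

observed_rate = accepted / EVALUATED
# Worst case: every one of the unevaluated projects counts as a failure.
worst_case_rate = accepted / TOTAL_PROJECTS
# Growth of the frontier since the benchmark launched.
growth_multiple = REPORTED_RATE / LAUNCH_FRONTIER

print(f"accepted projects (inferred): {accepted}")
print(f"observed rate: {observed_rate:.1%}")
print(f"worst-case rate: {worst_case_rate:.1%}")
print(f"growth since launch: {growth_multiple:.1f}x")
```

Under that assumption, 35 accepted projects reproduce both the 16.1 percent headline rate and the 14.6 percent lower bound.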

How were the models tested and scored?

The benchmark runs agents inside a virtual Linux machine loaded with more than 30 professional apps, including Blender, GIMP, and Audacity, so the models can operate the graphical programs and developer tools that freelancers use day to day. Each project got up to 24 hours of compute time, and the setup used a critic loop in which a second agent reviews the first agent's work and forces revisions.
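The critic loop can be pictured as a simple control flow. The following is a minimal sketch of the general pattern only, not the benchmark's actual harness: the function names, the `max_rounds` cap, and the toy worker and critic are all hypothetical stand-ins for illustration.

```python
from typing import Callable, Optional

# Hypothetical signatures: a worker turns a task (plus optional feedback)
# into a draft; a critic returns feedback, or None when it accepts the draft.
Worker = Callable[[str, Optional[str]], str]
Critic = Callable[[str, str], Optional[str]]


def run_with_critic(task: str, worker: Worker, critic: Critic,
                    max_rounds: int = 3) -> str:
    """Have the worker draft, then revise until the critic accepts
    or the round budget (an assumed cap) runs out."""
    draft = worker(task, None)
    for _ in range(max_rounds):
        feedback = critic(task, draft)
        if feedback is None:      # critic has no further objections
            break
        draft = worker(task, feedback)
    return draft


# Toy stand-ins so the sketch runs without any model behind it.
def toy_worker(task: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return f"{task}: draft"
    return f"{task}: draft, revised for '{feedback}'"


def toy_critic(task: str, draft: str) -> Optional[str]:
    return None if "revised" in draft else "missing dimensions"


result = run_with_critic("ring model", toy_worker, toy_critic)
print(result)
```

In a real harness both callables would wrap agent calls, and the stopping rule would more likely be tied to the compute budget than to a fixed round count.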

The RLI spans disciplines such as 3D and CAD, architecture, graphic design, video and animation, audio, data analysis, and web apps. Human evaluators open files in professional software, operate the tools, and judge the work as a paying client would. The authors tested whether AI judges could replace those humans and found they could not: AI evaluators rated the new models far too generously. For GPT-5.5 the AI evaluator's score was almost three times too high, and for Opus 4.8 about two and a half times too high. The automated judge reproduced the ranking order but did not produce reliable absolute scores.

The study also surfaced qualitative failure modes. On a ring design task Fable 5 produced better results than earlier systems but still fell short on close inspection. On an architecture project GPT-5.5 created an appealing render while its underlying 3D model remained flawed, an error that requires opening the model geometry in professional software to catch.

Why does this shift matter?

A jump from a 2.5 percent frontier to 16.1 percent in under eight months signals rapid progress in AI agents' practical utility for real paid work. The result narrows the gap between prototype capabilities and commercially acceptable deliverables in disciplines where freelancers currently compete. At the same time, the study shows that human expertise in operating tools and judging output remains essential, because current AI systems both produce subtle defects and overrate each other when left to assess one another's work.

What to watch

Watch whether the remaining 22 projects are evaluated for Fable 5 once access restrictions ease, and whether that changes the 16.1 percent figure. Also track whether AI judges improve enough to reliably inspect native project files, and whether newer releases correct the odd ordering on the Scale Labs leaderboard, where a newer system like Gemini 3 Pro currently scores just 1.25 percent.

Remote Labor Index: automation rates by model

Model	Automation rate	Notes
Fable 5	16.1%	Evaluated on 218/240 projects; worst case 14.6% if all remaining projects fail
Opus 4.8	8.3%
GPT-5.5	6.3%	AI evaluator scored it almost three times too high
Opus 4.6	4.17%	Prior leaderboard leader
Gemini 3 Pro	1.25%	Newer release placed near the bottom
Initial best at launch	2.5%	Benchmark's starting frontier

Written by The Brieftide · Source: The Decoder
