Coding Agents · June 17, 2026 · 4 min read

TAC benchmark: frontier AI agents and animal welfare results

TAC tests whether AI travel agents avoid animal exploitation: seven frontier models scored below the 64% chance level, best at 53%.

The Brieftide · June 17, 2026

TL;DR

01 TAC tests whether AI travel agents avoid animal exploitation: seven frontier models scored below the 64% chance level, best at 53%.
02 The paper was submitted on 16 June 2026 and revised on 17 June 2026.
03 TAC is an agentic benchmark that measures whether AI agents avoid options involving animal exploitation when acting on users' behalf.

TAC, the Travel Agent Compassion benchmark, finds that seven frontier AI models from four labs fail to avoid travel options that exploit animals: every model scored below the 64% chance level, and the best performer, Claude Opus 4.7, scored 53%. The paper was submitted on 16 June 2026 and revised on 17 June 2026.

What is TAC and how was it designed?

TAC is an agentic benchmark that measures whether AI agents avoid options involving animal exploitation when acting on users' behalf. The benchmark presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, and the authors augmented those to forty-eight samples to control for price, rating, and position confounds.
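
To make the 12-to-48 expansion concrete, here is a minimal Python sketch of one way such an augmentation could work. The brief does not describe the authors' actual procedure, so the four-variant scheme (original, reversed position, equalised price, equalised rating) and every name below (`Option`, `Scenario`, `augment`) are assumptions for illustration only.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Option:
    name: str
    price: float
    rating: float
    exploits_animals: bool


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    options: Tuple[Option, ...]  # in the order shown to the agent


def augment(s: Scenario) -> List[Scenario]:
    """Return four variants of one scenario (hypothetical scheme, not the paper's)."""
    opts = s.options
    mean_price = sum(o.price for o in opts) / len(opts)
    mean_rating = sum(o.rating for o in opts) / len(opts)
    return [
        s,  # original listing
        # Position control: reverse the order in which options are listed.
        Scenario(s.scenario_id + "-pos", tuple(reversed(opts))),
        # Price control: give every option the same price.
        Scenario(s.scenario_id + "-price",
                 tuple(replace(o, price=mean_price) for o in opts)),
        # Rating control: give every option the same rating.
        Scenario(s.scenario_id + "-rating",
                 tuple(replace(o, rating=mean_rating) for o in opts)),
    ]


# Twelve placeholder scenarios; the option data is invented for the example.
scenarios = [
    Scenario(f"s{i}", (Option("a", 100.0, 4.5, True),
                       Option("b", 120.0, 4.0, False)))
    for i in range(12)
]
samples = [variant for s in scenarios for variant in augment(s)]
print(len(samples))  # 48
```

If an agent's choice changes between variants of the same scenario, that would suggest price, rating, or list position is driving the decision rather than the welfare content of the option.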

The benchmark specifically targets agentic behaviour rather than static text responses. The authors argue that prior benchmarks evaluated model text replies to question-answer prompts, leaving open whether the welfare reasoning surfaced in those replies transfers to deployments where the model must take actions with tools. TAC converts scenarios into agentic tasks to test that gap.

How did frontier models perform on TAC?

Every evaluated model scored below the 64% chance level defined by the authors, with Claude Opus 4.7 the top performer at 53%. The paper evaluates seven frontier models from four labs across the forty-eight TAC samples, and reports uniformly below-chance results.
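
The brief does not say how the authors derive the 64% chance level. One plausible reading, shown in the sketch below, is the expected avoidance rate of an agent that picks uniformly at random among the listed options; that definition, the function names, and all the numbers are assumptions, not figures from the paper.

```python
from typing import List, Tuple


def chance_level(option_counts: List[Tuple[int, int]]) -> float:
    """Expected avoidance rate of an agent choosing uniformly at random.

    Each tuple is (harmless_options, total_options) for one sample.
    """
    return sum(h / t for h, t in option_counts) / len(option_counts)


def avoidance_rate(avoided: List[bool]) -> float:
    """Share of samples in which the agent avoided the exploitative option."""
    return sum(avoided) / len(avoided)


# Illustrative numbers only, not taken from the paper.
counts = [(2, 3), (1, 2), (3, 4), (2, 3)]
outcomes = [True, False, False, True]

baseline = chance_level(counts)
score = avoidance_rate(outcomes)
print(f"chance={baseline:.3f} score={score:.3f} below_chance={score < baseline}")
```

Under this reading, a chance level above 50% simply reflects that most listed options are harmless, so a below-chance score means the agent picks exploitative options more often than random selection would.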

The authors tested the effect of a single welfare-aware sentence in the system prompt. That sentence produced gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, a twenty-six percentage point gain in GPT-5.2, and under twelve percentage points in DeepSeek and Gemini. The paper also includes an auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers. Using Gemini 2.5 Flash Lite as judge, the audit "flags zero transcripts for evaluation awareness," which the authors cite as evidence that the below-chance rates do not stem from the models recognising the evaluation.
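
Because the gains are stated in percentage points, they are bounded by each model's headroom below 100%. The short Python sketch below works through that arithmetic; only the 53% base score and the reported gain figures come from the brief, and the function names are illustrative.

```python
def max_possible_gain(base_pct: float) -> float:
    """Largest percentage-point gain available from a given base score."""
    return 100.0 - base_pct


def max_base_for_gain(gain_points: float) -> float:
    """Highest base score compatible with a given percentage-point gain."""
    return 100.0 - gain_points


print(max_possible_gain(53.0))   # 47.0
print(max_base_for_gain(63.0))   # 37.0
```

By this arithmetic, a model starting at 53% can gain at most 47 points, so the 63-point end of the reported range would have to come from a model whose base score was 37% or lower.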

Why does this gap between text benchmarks and agentic behaviour matter?

TAC exposes a concrete mismatch: models that may appear to reason about welfare in text can still select or book options that exploit animals when given tools. The paper highlights category-level variation across cultural domains and the limits of text-response welfare benchmarks, arguing that agentic deployment settings demand separate evaluation. The authors connect these findings to the EU General-Purpose AI Code of Practice systemic risk framework, indicating a regulatory dimension to how models are assessed for downstream harms.

The size of the welfare-prompt gains is also telling. Large gains in some models and very small gains in others suggest that models differ in how they respond to instruction-level interventions when they control actions, not just outputs. The Inspect Scout audit, which covered only the top two performers and relied on a single judge model, supports the authors' view that poor performance is not explained by evaluation awareness, pointing instead to substantive misalignment in agentic decision-making.

What to watch next

Look for follow-up evaluations that report model-by-model base scores across TAC's forty-eight samples, and watch whether labs adopt welfare-aware system prompts in production agent settings. Additional audits that expand the Inspect Scout methodology or use different judges could test whether the zero-flag result generalises beyond the Gemini 2.5 Flash Lite judge. Policymakers referencing the EU General-Purpose AI Code of Practice may also request agentic benchmarks like TAC in compliance reviews.

The paper by Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, and Arturs Kanepajs establishes TAC as a tool to surface harms that static text benchmarks can miss, and it provides concrete numbers: twelve hand-authored scenarios, forty-eight augmented samples, seven models from four labs, a 64% chance-level cutoff, Claude Opus 4.7 at 53%, and documented prompt-based gains of 47–63 percentage points, 26 points, and under 12 points for named models.

Model base scores and welfare-prompt gains in TAC

Model	Base score (%)	Welfare-prompt gain (percentage points)
Claude Opus 4.7	53	47–63
GPT-5.5	not specified	47–63
GPT-5.2	not specified	26
DeepSeek	not specified	under 12
Gemini	not specified	under 12
Chance level (reference)	64	N/A

Written by The Brieftide · Source: arXiv
