Dissecting model behavior: SSA paper, 138k trajectories
Gaurav Gupta et al. formalize an "intent-execution" gap and use the Simple Strands Agent (SSA) to analyze 138k trajectories across agentic benchmarks.
TL;DR
- Gaurav Gupta et al. formalize an "intent-execution" gap and use the Simple Strands Agent to analyze 138k trajectories across agentic benchmarks.
- SSA was used to reproduce or improve on the pass@1 performance reported by diverse model-provider families, and the study analyzed 138k agent trajectories produced by that harness.
- SSA is presented as intentionally simple and customizable, designed to reveal harness-model alignment issues rather than to be a maximal-performance orchestrator.
Dissecting model behavior through agent trajectories, by Gaurav Gupta, Vatshank Chaturvedi, Jun Huan and Anoop Deoras, was submitted to arXiv on 16 June 2026 and argues that agent harness design creates an "intent-execution" gap between what a model intends and what the harness executes. The paper introduces a simple, customizable harness called Simple Strands Agent (SSA), and analyzes 138k trajectories generated by SSA while reproducing or improving pass@1 results on SWE-Pro, SWE-Verified and Terminal-Bench-2.
What did the authors build and test?
The authors built Simple Strands Agent (SSA), a lightweight agent harness intended to capture common patterns across model families and expose model-specific preferences, then tested it on three agentic benchmarks: SWE-Pro, SWE-Verified and Terminal-Bench-2. SSA was used to reproduce or improve on the pass@1 performance reported by diverse model-provider families, and the study analyzed 138k agent trajectories produced by that harness.
SSA is presented as intentionally simple and customizable, designed to reveal harness-model alignment issues rather than to be a maximal-performance orchestrator. The paper spans 106 pages, includes 50 figures and 16 tables, and frames its contribution as both benchmarking (pass@1 reproduction) and behavioral analysis via trajectory data.
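For readers unfamiliar with the metric, pass@1 is commonly computed as the share of tasks solved on a single attempt, averaged over attempts when a task is run more than once. The sketch below illustrates that common definition only; the function name, task IDs and outcomes are hypothetical and are not taken from the paper or from SSA.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Average, over tasks, of the fraction of attempts that passed.

    `results` maps a task ID to the pass/fail outcome of each independent
    attempt. With one attempt per task this is simply the solve rate.
    """
    # Skip tasks with no recorded attempts to avoid dividing by zero.
    per_task = [sum(attempts) / len(attempts)
                for attempts in results.values() if attempts]
    if not per_task:
        return 0.0
    return sum(per_task) / len(per_task)


# Hypothetical outcomes for four tasks, one attempt each.
example_results = {
    "task-a": [True],
    "task-b": [False],
    "task-c": [True],
    "task-d": [True],
}
score = pass_at_1(example_results)
print(f"pass@1 = {score:.2f}")
```

A single number like this says nothing about how a run reached its outcome, which is the gap the trajectory analysis is meant to fill.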
How do agent trajectories reveal model-level differences?
Representing agent trajectories in code state-spaces lets the authors measure fine-grained behaviors such as edit frequency, testing activity and phase transitions, and thereby surface differences between model families. By mapping actions and code states across a run, SSA-produced trajectories make it possible to quantify how models allocate effort across stages of autonomous problem solving.
The paper applies this representation to trajectories drawn from models across multiple provider families named in the study: Claude, Gemini, GPT, Grok and Qwen. Rather than relying only on pass@1 scores, the authors extract metrics like how often models edit code, how intensively they run tests, and where runs shift from exploration to refinement. Those metrics produce behavioral fingerprints that diverge even when pass@1 numbers are relatively similar across frontier models.
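As a rough illustration of what such metrics could look like, the sketch below computes edit frequency, test frequency and a crude exploration-to-refinement marker from a list of labeled steps. Everything here is an assumption made for illustration: the `Step` type, the action labels `"read"`, `"edit"` and `"test"`, and the choice of "first edit" as the phase boundary are hypothetical and do not reproduce the paper's code state-space representation or its metric definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    # Hypothetical action label for one agent step: "read", "edit" or "test".
    action: str


def action_frequency(steps: List[Step], action: str) -> float:
    """Fraction of steps in the trajectory that carry the given action label."""
    if not steps:
        return 0.0
    return sum(step.action == action for step in steps) / len(steps)


def first_edit_index(steps: List[Step]) -> Optional[int]:
    """Index of the first edit step, used here as a crude proxy for the
    shift from exploration to refinement. Returns None if no edit occurs."""
    for index, step in enumerate(steps):
        if step.action == "edit":
            return index
    return None


# Hypothetical six-step trajectory: explore, then alternate edits and tests.
trajectory = [Step("read"), Step("read"), Step("edit"),
              Step("test"), Step("edit"), Step("test")]

fingerprint = {
    "edit_frequency": action_frequency(trajectory, "edit"),
    "test_frequency": action_frequency(trajectory, "test"),
    "first_edit_index": first_edit_index(trajectory),
}
print(fingerprint)
```

Two runs with the same final outcome can produce very different values here, for example one that edits early and tests often versus one that reads for most of the run, which is the sense in which such summaries act as a behavioral fingerprint.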
Why it matters
The study reframes agent performance as a systems problem: strong model capabilities do not automatically translate into strong agent behavior if the harness misinterprets model intent. That argument makes harness design an explicit point of failure and an intervention opportunity. If harnesses amplify or suppress specific model behaviors, then the community cannot treat pass@1 alone as a full measure of agent quality; behavioral metrics derived from trajectories expose where two models with similar pass@1 differ in reliability, efficiency and approach.
What to watch
Whether other researchers reproduce SSA's pass@1 outcomes on SWE-Pro, SWE-Verified and Terminal-Bench-2 will be a key signal; the paper claims reproduction or improvement on those benchmarks. Also watch for follow-up work that publishes SSA code or applies the paper's trajectory representation to additional model families or tasks, which would test the generality of the authors' behavioral metrics.
References and key facts drawn from the submission: the paper title "Dissecting model behavior through agent trajectories," authors Gaurav Gupta, Vatshank Chaturvedi, Jun Huan and Anoop Deoras, submission date 16 June 2026, an analysis of 138k trajectories, and the claim that SSA reproduces or improves pass@1 results on SWE-Pro, SWE-Verified and Terminal-Bench-2. The manuscript is 106 pages long and includes 50 figures and 16 tables.
| Benchmark | Claimed result | Notes |
|---|---|---|
| SWE-Pro | reproduce or improve on pass@1 | tested across diverse model-provider families (Claude, Gemini, GPT, Grok, Qwen) |
| SWE-Verified | reproduce or improve on pass@1 | behavioral metrics derived from trajectories supplement pass@1 |
| Terminal-Bench-2 | reproduce or improve on pass@1 | trajectory analysis of 138k runs used to compare models |
Written by The Brieftide · Source: arXiv