Coding Agents · 4 min read

SWE-Explore benchmark: AI coding agents find files but miss lines

SWE-Explore shows Claude Code, Codex and peers usually locate the correct source file but fail to return the exact lines needed for fixes.

The Brieftide

TL;DR

  • SWE-Explore shows Claude Code, Codex and peers usually locate the correct source file but fail to return the exact lines needed for fixes.
  • The benchmark measures both file-level retrieval and line-level accuracy on developer-style queries and reports a consistent gap: high file hit rates paired with low precise-line hit rates.
  • SWE-Explore frames evaluation tasks to mimic real developer workflows: locate the file where a bug or behavior lives, then extract the exact lines required to patch, test, or understand the code.

SWE-Explore, a new software-engineering benchmark published this week, finds that AI coding agents such as Claude Code and Codex reliably locate the right source file but often fail to return the exact lines developers need. The benchmark measures both file-level retrieval and line-level accuracy on developer-style queries and reports a consistent gap: high file hit rates paired with low precise-line hit rates.

Benchmark setup and key findings

SWE-Explore frames evaluation tasks to mimic real developer workflows: locate the file where a bug or behavior lives, then extract the exact lines required to patch, test, or understand the code. The benchmark includes queries that require pinpointing a small code fragment inside potentially large files and scores agents on two axes: whether they return the correct file and whether they return the critical lines within that file.
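The two scoring axes can be made concrete with a small sketch. This is not SWE-Explore's published scoring code; it is an illustration that assumes each answer and each reference is a file path plus an inclusive line range. The `Span` type, the `score` function, and the path `src/parser.py` are hypothetical names chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical location record: a file path plus an inclusive line range."""
    path: str
    start: int  # 1-indexed, inclusive
    end: int    # 1-indexed, inclusive

    def lines(self) -> set[int]:
        return set(range(self.start, self.end + 1))


def score(predicted: Span, gold: Span) -> dict[str, float]:
    """Score one prediction on two axes: file match and line overlap."""
    if predicted.path != gold.path:
        # Wrong file: no credit on either axis.
        return {"file_hit": 0.0, "line_precision": 0.0, "line_recall": 0.0}
    overlap = predicted.lines() & gold.lines()
    return {
        "file_hit": 1.0,
        # Share of returned lines that are actually needed.
        "line_precision": len(overlap) / len(predicted.lines()),
        # Share of needed lines that were returned.
        "line_recall": len(overlap) / len(gold.lines()),
    }


# Assumed example: the fix needs lines 120-122, the agent returns the whole function.
gold = Span("src/parser.py", 120, 122)
whole_function = Span("src/parser.py", 100, 159)
print(score(whole_function, gold))
```

In this assumed case the agent gets full credit for the file and full line recall, but line precision is only 3 out of 60 returned lines. That is the kind of gap the article describes: the right file, with too much surrounding code for an exact edit.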

Across tested agents, the benchmark shows a clear pattern. File-level retrieval is strong: agents return the correct file in a majority of cases, often 70% or more for Claude Code and Codex. By contrast, line-level accuracy is substantially lower, roughly 15–30% for the same two agents, with agents often returning the surrounding context, an approximate function, or the wrong slice of code rather than the precise lines needed for a fix or a unit test. SWE-Explore characterizes that gap as the core failure mode for current coding agents.

The benchmark also measures downstream task impact. When line-level hits are required to generate a correct patch or a passing test, overall task success drops sharply. Tasks that can tolerate file-level results combined with developer inspection remain workable. Tasks expecting automated edits or exact test scaffolding perform poorly when line-level accuracy is low.

Model behaviors and error modes

SWE-Explore highlights several recurring behaviors. Agents frequently return a larger block of code around the relevant area instead of the minimal lines, which helps with context but not when an exact replacement is required. Some agents return snippets that are syntactically valid and look plausible but do not address the specific bug. Off-by-one line offsets and mismatches between the requested function name and the returned snippet also appear repeatedly.
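The range-based error modes above can be told apart mechanically once a predicted range and a reference range are known. The sketch below is an illustrative taxonomy, not the benchmark's own classifier; the function name `classify` and the label strings are assumptions made for the example.

```python
def classify(pred_start: int, pred_end: int, gold_start: int, gold_end: int) -> str:
    """Label how a predicted inclusive line range relates to the reference range."""
    if (pred_start, pred_end) == (gold_start, gold_end):
        return "exact"
    # Prediction covers every needed line plus extra surrounding code.
    if pred_start <= gold_start and pred_end >= gold_end:
        return "context_block"
    # Same length as the reference, shifted up or down by exactly one line.
    shift = pred_start - gold_start
    if shift == pred_end - gold_end and abs(shift) == 1:
        return "off_by_one"
    # No shared lines at all.
    if pred_end < gold_start or pred_start > gold_end:
        return "wrong_slice"
    return "partial_overlap"


# Assumed reference: the fix lives on lines 120-122.
for pred in [(120, 122), (100, 159), (121, 123), (300, 310), (121, 130)]:
    print(pred, classify(*pred, 120, 122))
```

The off-by-one check runs before the disjoint check on purpose: for a single-line reference, a prediction shifted by one line shares no lines with it, yet is better described as an offset error than as a wrong slice.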

The benchmark tests multiple agent types and prompt styles. Retrieval-first pipelines that explicitly search a codebase and then run a code model over retrieved results tend to get the file right more often, but still struggle to extract the exact lines. End-to-end generative agents sometimes hallucinate line numbers or produce code that does not match any file in the repository. SWE-Explore notes that adding retrieval verification steps and stronger alignment between the retrieval and generation stages improves results but does not close the line-level gap.
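One form a retrieval verification step could take is checking an agent's claimed snippet against the actual file before trusting its line numbers. The article does not describe how the tested pipelines implement verification, so the following is only a sketch under stated assumptions: the file is available as text, the agent reports a snippet plus a 1-indexed start line, and `verify_snippet` is a hypothetical helper.

```python
from typing import Optional, Tuple


def verify_snippet(file_text: str, snippet: str, claimed_start: int) -> Tuple[str, Optional[int]]:
    """Check a claimed snippet against the file.

    Returns ("verified", line) if the snippet sits at the claimed line,
    ("relocated", line) if it exists elsewhere in the file, and
    ("not_found", None) if no matching lines exist (a likely hallucination).
    """
    # Ignore trailing whitespace, but keep indentation significant.
    file_lines = [ln.rstrip() for ln in file_text.splitlines()]
    snip_lines = [ln.rstrip() for ln in snippet.splitlines()]
    n = len(snip_lines)
    if n == 0:
        return ("not_found", None)

    def matches_at(start: int) -> bool:
        i = start - 1  # convert 1-indexed line number to list index
        return i >= 0 and file_lines[i:i + n] == snip_lines

    if matches_at(claimed_start):
        return ("verified", claimed_start)
    # Fall back to scanning the whole file for the first exact match.
    for start in range(1, len(file_lines) - n + 2):
        if matches_at(start):
            return ("relocated", start)
    return ("not_found", None)


source = "def f(x):\n    if x > 0:\n        return x\n    return 0\n"
print(verify_snippet(source, "    if x > 0:", 2))
print(verify_snippet(source, "    if x > 0:", 3))
print(verify_snippet(source, "    if x >= 0:", 2))
```

A check like this catches hallucinated snippets and corrects shifted line numbers, but it cannot tell whether the verified lines are the ones the fix actually needs. That limit is consistent with the reported finding that verification improves results without closing the line-level gap.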

SWE-Explore further observes that failures concentrate on tasks requiring small, precise edits: off-by-one fixes, single-line condition changes, and tightly scoped test scaffolding. Tasks that require broader refactors or high-level suggestions are less sensitive to exact-line accuracy.

Why it matters

The gulf between file-level and line-level accuracy limits how much coding agents can safely automate. Navigation and suggestion tools benefit from strong file retrieval, but automated patching, precise test synthesis, and security fixes depend on accurate line selection. Toolmakers, enterprises, and security teams should treat current agents as navigation and drafting aids rather than fully reliable automated patch engines.

Snapshot of SWE-Explore findings
Agent | File-level hit rate | Line-level hit rate | Typical error mode
Claude Code | High (often ≥70%) | Low (≈20–30%) | Returns function or context, not exact lines
Codex | High (often ≥70%) | Low (≈15–30%) | Context blocks, off-by-one line offsets
Other agents (average) | Moderate to high | Low | Hallucinated snippets or approximate slices

Primary source

The Decoder

the-decoder.com