SWE-Explore benchmark: AI coding agents find files but miss lines
SWE-Explore shows Claude Code, Codex and peers usually locate the correct source file but fail to return the exact lines needed for fixes.
TL;DR
- SWE-Explore shows Claude Code, Codex and peers usually locate the correct source file but fail to return the exact lines needed for fixes.
- The benchmark measures both file-level retrieval and line-level accuracy on developer-style queries and reports a consistent gap: high file hit rates paired with low precise-line hit rates.
- SWE-Explore frames evaluation tasks to mimic real developer workflows: locate the file where a bug or behavior lives, then extract the exact lines required to patch, test, or understand the code.
SWE-Explore, a new software-engineering benchmark published this week, finds that AI coding agents such as Claude Code and Codex reliably locate the right source file but often fail to return the exact lines developers need. The benchmark measures both file-level retrieval and line-level accuracy on developer-style queries and reports a consistent gap: high file hit rates paired with low precise-line hit rates.
Benchmark setup and key findings
SWE-Explore frames evaluation tasks to mimic real developer workflows: locate the file where a bug or behavior lives, then extract the exact lines required to patch, test, or understand the code. The benchmark includes queries that require pinpointing a small code fragment inside potentially large files and scores agents on two axes: whether they return the correct file and whether they return the critical lines within that file.
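The two scoring axes can be illustrated with a minimal sketch. The article does not describe SWE-Explore's actual metric definitions, so everything here is an assumption made for illustration: the `Span` type, the `score` function, the use of line-set precision and recall, and the example path `src/parser.py` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A file path plus an inclusive, 1-based line range (hypothetical format)."""
    path: str
    start: int
    end: int

    def lines(self) -> set[int]:
        return set(range(self.start, self.end + 1))


def score(gold: Span, pred: Span) -> dict[str, float]:
    """Score one prediction on a file axis and a line axis (assumed metrics)."""
    pred_lines = pred.lines()
    gold_lines = gold.lines()
    # A wrong file or an empty range earns nothing on the line axis.
    if gold.path != pred.path or not pred_lines or not gold_lines:
        file_hit = 1.0 if gold.path == pred.path else 0.0
        return {"file_hit": file_hit, "line_precision": 0.0, "line_recall": 0.0}
    overlap = len(gold_lines & pred_lines)
    return {
        "file_hit": 1.0,
        "line_precision": overlap / len(pred_lines),
        "line_recall": overlap / len(gold_lines),
    }


# Right file, but a whole 40-line function is returned for a 2-line fix.
result = score(Span("src/parser.py", 120, 121), Span("src/parser.py", 100, 139))
print(result)
```

In this made-up case the prediction scores a full file hit and full line recall, yet only 2 of its 40 returned lines are the ones needed. That is one way the "right file, wrong lines" pattern can show up in numbers.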
Across the tested agents, the benchmark shows a clear pattern. File-level retrieval is strong: agents return the correct file in a majority of cases. By contrast, line-level accuracy is substantially lower: agents often return the surrounding context, an approximate function, or the wrong slice of code rather than the precise lines needed for a fix or a unit test. SWE-Explore characterizes that gap as the core failure mode for current coding agents.
The benchmark also measures downstream task impact. When line-level hits are required to generate a correct patch or a passing test, overall task success drops sharply. Tasks that can tolerate file-level results combined with developer inspection remain workable. Tasks expecting automated edits or exact test scaffolding perform poorly when line-level accuracy is low.
Model behaviors and error modes
SWE-Explore highlights several recurring behaviors. Agents frequently return a larger block of code around the relevant area instead of the minimal lines, which helps with context but not when an exact replacement is required. Some agents return snippets that are syntactically valid and look plausible but do not address the specific bug. Off-by-one line offsets and mismatches between the requested function name and the returned snippet also appear repeatedly.
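To make these error modes concrete, here is a hedged sketch that labels a predicted line range relative to the expected one. The category names, the `classify` function, and the `max_shift` threshold are illustrative assumptions, not part of the benchmark as reported.

```python
def classify(gold: tuple[int, int], pred: tuple[int, int], max_shift: int = 2) -> str:
    """Label a predicted inclusive line range against the gold range.

    Categories are hypothetical: exact, shifted, over-broad, miss, partial.
    """
    g_start, g_end = gold
    p_start, p_end = pred
    if pred == gold:
        return "exact"
    # Same length, both ends offset by the same small amount (e.g. off-by-one).
    shift = p_start - g_start
    if shift == p_end - g_end and abs(shift) <= max_shift:
        return "shifted"
    # Prediction fully contains the target: a context block, not minimal lines.
    if p_start <= g_start and p_end >= g_end:
        return "over-broad"
    # No overlap at all.
    if p_end < g_start or p_start > g_end:
        return "miss"
    return "partial"


print(classify((10, 10), (11, 11)))      # off-by-one on a single line
print(classify((120, 121), (100, 139)))  # whole function returned
```

A labeling like this separates near misses, which a developer can correct at a glance, from answers that point somewhere else in the file entirely.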
The benchmark tests multiple agent types and prompt styles. Retrieval-first pipelines that explicitly search a codebase and then run a code model over retrieved results tend to get the file right more often, but still struggle to extract the exact lines. End-to-end generative agents sometimes hallucinate line numbers or produce code that does not match any file in the repository. SWE-Explore notes that adding retrieval verification steps and stronger alignment between the retrieval and generation stages improves results but does not close the line-level gap.
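One possible form of a retrieval verification step is to check an agent's claimed snippet against the real file before trusting its line numbers. The sketch below is an assumption about what such a check could look like, not the method SWE-Explore or any specific agent uses; `verify_snippet` and its status labels are hypothetical.

```python
def verify_snippet(file_text: str, snippet: str, claimed_start: int) -> dict:
    """Check a claimed snippet against actual file contents.

    claimed_start is 1-based. Only trailing whitespace is ignored.
    """
    file_lines = [ln.rstrip() for ln in file_text.splitlines()]
    snip_lines = [ln.rstrip() for ln in snippet.splitlines()]
    n = len(snip_lines)
    if n == 0:
        return {"status": "empty", "actual_start": None}

    # Case 1: the snippet sits exactly where the agent said it does.
    i = claimed_start - 1
    if i >= 0 and file_lines[i:i + n] == snip_lines:
        return {"status": "verified", "actual_start": claimed_start}

    # Case 2: the snippet exists, but at a different line (wrong offset).
    for j in range(len(file_lines) - n + 1):
        if file_lines[j:j + n] == snip_lines:
            return {"status": "relocated", "actual_start": j + 1}

    # Case 3: the snippet matches nothing in the file (possible hallucination).
    return {"status": "not_found", "actual_start": None}


source = "def f(x):\n    if x > 0:\n        return x\n    return 0\n"
print(verify_snippet(source, "    if x > 0:", 3))
print(verify_snippet(source, "    if x >= 0:", 2))
```

A check like this can catch hallucinated snippets and wrong line numbers, but it cannot tell whether the verified lines are the ones the fix actually needs, which is consistent with the reported finding that verification helps without closing the line-level gap.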
SWE-Explore further observes that failures concentrate on tasks requiring small, precise edits: off-by-one fixes, single-line condition changes, and tightly scoped test scaffolding. Tasks that require broader refactors or high-level suggestions are less sensitive to exact-line accuracy.
Why it matters
The gulf between file-level and line-level accuracy limits how much coding agents can safely automate. Navigation and suggestion tools benefit from strong file retrieval, but automated patching, precise test synthesis, and security fixes depend on accurate line selection. Toolmakers, enterprises, and security teams should treat current agents as navigation and drafting aids rather than fully reliable automated patch engines.
| Agent | File-level retrieval | Line-level accuracy | Typical error mode |
|---|---|---|---|
| Claude Code | High (often ≥70%) | Low (≈20–30%) | Returns function or context, not exact lines |
| Codex | High (often ≥70%) | Low (≈15–30%) | Context blocks, off-by-one line offsets |
| Other agents (average) | Moderate to high | Low | Hallucinated snippets or approximate slices |
Primary source
The Decoder
the-decoder.com