Benchmarks & Evals · June 25, 2026 · 4 min read

LemonHarness: runtime framework reaches 86.52% on Terminal-Bench

LemonHarness constrains workspace state and adds reusable rule knowledge and time-aware execution; with GPT-5.5 it reached 86.52% on Terminal-Bench 2.0.

The Brieftide · June 25, 2026

TL;DR

01 LemonHarness constrains workspace state and adds reusable rule knowledge and time-aware execution; with GPT-5.5 it reached 86.52% on Terminal-Bench 2.0.
02 The paper also describes a reusable rule knowledge base that encodes recurring execution rules and acceptance criteria as runtime knowledge the agent can call.
03 Together, these pieces aim to prevent the scattered, weakly constrained changes that accumulate during long multi-step tasks and make modified files or artifacts hard to track.

LemonHarness, presented in an arXiv technical report submitted on 23 Jun 2026 (arXiv:2606.24311), is an integrated execution framework for long-horizon language-model agents that constrains state-changing operations inside an explicit workspace boundary. The paper, titled "LemonHarness Technical Report" and authored by Kailong Ren and 20 other authors, evaluates the system on Terminal-Bench 2.0 and reports accuracy gains when paired with modern backbones.

What is LemonHarness and how does it work?

LemonHarness is a runtime that enforces a single, explicit execution boundary so that file writes, dependency installations and temporary artifacts stay contained within a defined workspace; model calls, tool execution and rule knowledge operate inside that boundary. The system routes state-changing operations through structured tool interfaces, records execution feedback as observations for later decisions and exposes elapsed and remaining budget to the model so it can rebalance exploration, implementation and validation under time pressure.
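The boundary idea can be illustrated with a minimal sketch. This is not the authors' implementation, which the brief does not reproduce; the names `resolve_in_workspace`, `write_file` and `WorkspaceBoundaryError` are hypothetical, and the sketch assumes the boundary is a single directory and that every write goes through one tool function that returns its outcome as an observation.

```python
import tempfile
from pathlib import Path


class WorkspaceBoundaryError(Exception):
    """Raised when a path would land outside the workspace root."""


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Return the absolute target path, or raise if it escapes the workspace."""
    root = workspace.resolve()
    # Joining an absolute `requested` replaces the root, so it is caught below too.
    target = (root / requested).resolve()
    if root not in target.parents:
        raise WorkspaceBoundaryError(f"{requested!r} is outside the workspace")
    return target


def write_file(workspace: Path, requested: str, content: str) -> dict:
    """Hypothetical structured tool: write inside the workspace, return an observation."""
    try:
        target = resolve_in_workspace(workspace, requested)
    except WorkspaceBoundaryError as exc:
        # The refusal is itself recorded as feedback for the next decision.
        return {"tool": "write_file", "ok": False, "error": str(exc)}
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    relative = target.relative_to(workspace.resolve()).as_posix()
    return {"tool": "write_file", "ok": True, "path": relative}


def demo() -> tuple:
    """Run one allowed and one refused write in a throwaway workspace."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        allowed = write_file(workspace, "src/main.py", "print('hi')\n")
        refused = write_file(workspace, "../escape.txt", "oops")
        return allowed, refused
```

In this sketch a refused write is returned as data rather than raised to the caller, which mirrors the brief's description of execution feedback being recorded as observations for later decisions.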

The paper also describes a reusable rule knowledge base that encodes recurring execution rules and acceptance criteria as runtime knowledge the agent can call. These pieces together aim to prevent scattered, weakly constrained changes that accumulate during long multi-step tasks and make modified files or artifacts hard to track.
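One way to picture a callable rule knowledge base is a small registry of named acceptance checks that the agent can query at runtime. The sketch below is an illustration under assumptions, not the format used in the report: the `Rule` and `RuleBase` classes, the two example rules and the shape of the `state` dictionary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Rule:
    """A named acceptance criterion with a machine-checkable predicate."""
    name: str
    description: str
    check: Callable[[dict], bool]


class RuleBase:
    """Hypothetical registry the agent can call instead of re-deriving rules."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def register(self, rule: Rule) -> None:
        self._rules[rule.name] = rule

    def describe(self, name: str) -> str:
        """Return the human-readable text of one rule."""
        return self._rules[name].description

    def evaluate(self, state: dict) -> Dict[str, bool]:
        """Check every registered rule against the current task state."""
        return {name: rule.check(state) for name, rule in self._rules.items()}


def build_example_rules() -> RuleBase:
    """Two illustrative rules; the real rule set is not given in the brief."""
    rules = RuleBase()
    rules.register(Rule(
        "tests_pass",
        "The test command must exit with status 0.",
        lambda s: s.get("test_exit_code") == 0,
    ))
    rules.register(Rule(
        "relative_paths_only",
        "Every modified path must be workspace-relative.",
        lambda s: all(not p.startswith("/") for p in s.get("modified_paths", [])),
    ))
    return rules
```

Calling `build_example_rules().evaluate(state)` returns a pass or fail flag per rule, so the acceptance criteria stay explicit from one iteration to the next.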

How did LemonHarness perform on Terminal-Bench 2.0?

On Terminal-Bench 2.0, the authors report two concrete results: LemonHarness_GPT-5.3-CodeX reached 84.49% accuracy over 445 trials, and the same framework paired with a stronger GPT-5.5 backbone raised the average accuracy to 86.52% across five jobs. Those figures anchor the paper's claim that a unified runtime boundary, callable rule knowledge and time-aware execution improve stability for long-horizon agent runs.

The experiment details supplied in the abstract show both a high-trial evaluation (445 trials with GPT-5.3-CodeX) and a smaller-scale evaluation (five jobs with GPT-5.5), highlighting gains when the framework uses a stronger backbone.
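For a sense of scale, the 445-trial figure can be converted into an approximate success count and a rough uncertainty. The arithmetic below is a reader-side sketch, not a calculation from the report, and it assumes trials behave as independent pass or fail draws, which repeated runs on the same task generally do not. The brief does not state a trial count for the GPT-5.5 runs, so the same calculation is not attempted for them.

```python
import math


def implied_successes(accuracy: float, trials: int) -> int:
    """Nearest whole number of passing trials for a reported accuracy."""
    return round(accuracy * trials)


def binomial_standard_error(accuracy: float, trials: int) -> float:
    """Standard error of a proportion, assuming independent trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / trials)


# Figures reported for LemonHarness_GPT-5.3-CodeX.
codex_accuracy, codex_trials = 0.8449, 445
successes = implied_successes(codex_accuracy, codex_trials)
std_error = binomial_standard_error(codex_accuracy, codex_trials)
```

Under that independence assumption, 84.49% of 445 trials corresponds to about 376 passes with a standard error of roughly 1.7 percentage points, which is useful context for the 2.03-point gap between the two reported accuracies.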

Why do the workspace boundary and time-aware execution matter?

Confining state changes to a controlled workspace reduces the risk that agents leave state scattered across file-system paths, which the report cites as a practical cause of instability in long tasks. Exposing elapsed and remaining time lets the model trade off exploration, implementation and verification as the budget tightens, which addresses failures caused by long waits or excessive verification leading to timeouts. Turning recurring rules into a callable knowledge base makes acceptance criteria explicit at runtime, reducing ad-hoc decision drift across iterations.
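The time-aware part can be sketched as a budget object whose elapsed and remaining time are handed to the model as an observation. Everything here is an assumption made for illustration: the `TimeBudget` class, the `suggest_phase` helper and the 60% and 20% thresholds are hypothetical, and in the report it is the model, not a fixed rule, that decides how to rebalance.

```python
import time
from typing import Callable


class TimeBudget:
    """Tracks elapsed and remaining seconds for one task run."""

    def __init__(self, total_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.total_seconds = total_seconds
        self._clock = clock
        self._started_at = clock()

    def elapsed(self) -> float:
        return self._clock() - self._started_at

    def remaining(self) -> float:
        # Clamp at zero so an overrun never reports negative time.
        return max(0.0, self.total_seconds - self.elapsed())

    def as_observation(self) -> dict:
        """Budget snapshot in a form that could be shown to the model."""
        return {"elapsed_s": self.elapsed(), "remaining_s": self.remaining()}


def suggest_phase(budget: TimeBudget, validation_reserve: float = 0.2) -> str:
    """Illustrative heuristic only: map the remaining fraction to a phase."""
    fraction_left = budget.remaining() / budget.total_seconds
    if fraction_left <= validation_reserve:
        return "validate"
    if fraction_left <= 0.6:
        return "implement"
    return "explore"
```

Passing the clock in as a parameter keeps the sketch testable with a fake clock, and reserving a final slice of the budget for validation is one simple way to avoid the timeout pattern the report describes.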

Those mechanisms tackle operational failure modes rather than raw model capability. The paper positions the contribution as stabilising multi-round agent execution by changing the runtime assumptions agents operate under: from loosely observed tool outputs and log fragments to a single controlled execution context with recorded observations and executable rules.

What to watch next

Follow-up signals that would validate LemonHarness's promise include wider-scale evaluations across more jobs and backbones beyond the two reported setups, and publication of the implementation details or code referenced in the report. The abstract gives two accuracy points to compare against, 84.49% over 445 trials and 86.52% across five jobs; independent runs on identical workloads would show how the framework and backbone choices interact.

The paper is archived as arXiv:2606.24311 and lists Kailong Ren and 20 other authors, so further versions or an expanded manuscript may add experimental detail and reproducibility artifacts.

Terminal-Bench 2.0 results reported in the LemonHarness paper

Configuration	Accuracy	Evaluation scale
LemonHarness_GPT-5.3-CodeX	84.49%	445 trials
LemonHarness with GPT-5.5	86.52%	five jobs

Written by The Brieftide · Source: arXiv
