
CoTE-SQL: New fine-tuning method sets open-source SOTA on Bird, posts strong Spider results

CoTE-SQL combines self-enhanced reasoning traces, structured CoT prompting and execution-aware revision to boost text-to-SQL results.

The Brieftide

TL;DR

  • CoTE-SQL combines self-enhanced reasoning traces, structured CoT prompting and execution-aware revision to boost text-to-SQL results.
  • The architecture is an LLM-based text-to-SQL pipeline where reasoning traces and example retrieval guide modular decomposition of the NL question into subproblems and the SQL generation process.
  • After producing SQL candidates the system executes them and applies error-aware revision when execution feedback indicates problems, closing the loop between generation and runtime correctness.

CoTE-SQL, a fine-tuning approach for text-to-SQL, achieves new state-of-the-art performance among methods built on open-source LLMs on the Bird benchmark and posts strong results on Spider, according to a paper submitted 14 Jun 2026. The paper, authored by Feng Lyu, Jinfeng Cen, Sijing Duan, Hao Wu, Shucheng Li, Weixu Zhang and Haolun Wu, reports Bird results of 53.39% EX and 59.02 VES and Spider results of 79.60% EX and 77.19 VES.

How does CoTE-SQL work?

CoTE-SQL integrates three concrete innovations: distilled reasoning traces produced without human annotation, structured chain-of-thought (CoT) prompting with modular decomposition and example retrieval, and error-aware revision driven by SQL execution feedback. The authors describe the first innovation as "self-enhanced reasoning traces distilled from LLMs without human annotation," and pair it with structured CoT prompting and an execution-feedback revision loop to improve generation.
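To make the "example retrieval plus structured CoT prompting" idea concrete, here is a minimal sketch. It is not the authors' implementation: the Jaccard word-overlap retriever, the prompt layout, the three-step decomposition wording, and every name below (`retrieve_examples`, `build_prompt`, `POOL`) are assumptions for illustration only.

```python
import re
from typing import Dict, List, Set

def tokenize(text: str) -> Set[str]:
    """Lowercased word tokens; a crude stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_examples(question: str, pool: List[Dict[str, str]], k: int = 2) -> List[Dict[str, str]]:
    """Return the k pool entries whose questions overlap most with `question`."""
    q = tokenize(question)
    ranked = sorted(pool, key=lambda ex: jaccard(q, tokenize(ex["question"])), reverse=True)
    return ranked[:k]

def build_prompt(question: str, schema: str, examples: List[Dict[str, str]]) -> str:
    """Assemble a structured prompt: schema, retrieved examples, then decomposition steps."""
    parts = ["### Schema", schema, ""]
    for i, ex in enumerate(examples, 1):
        parts += [
            f"### Example {i}",
            f"Question: {ex['question']}",
            f"Reasoning: {ex['reasoning']}",
            f"SQL: {ex['sql']}",
            "",
        ]
    parts += [
        "### Task",
        f"Question: {question}",
        "Step 1: List the tables and columns needed.",
        "Step 2: Break the question into sub-questions.",
        "Step 3: Write SQL for each sub-question, then combine them.",
        "SQL:",
    ]
    return "\n".join(parts)

# Hypothetical example pool; in the paper the reasoning traces are distilled from LLMs.
POOL = [
    {"question": "How many singers are there?",
     "reasoning": "Count the rows in singer.",
     "sql": "SELECT COUNT(*) FROM singer"},
    {"question": "List the names of all stadiums.",
     "reasoning": "Project the name column of stadium.",
     "sql": "SELECT name FROM stadium"},
    {"question": "What is the average age of singers from France?",
     "reasoning": "Filter singer by country, then average age.",
     "sql": "SELECT AVG(age) FROM singer WHERE country = 'France'"},
]
```

Calling `retrieve_examples("How many singers are from France?", POOL)` ranks the two singer questions ahead of the stadium one, and `build_prompt` then places them before the task so the model sees worked reasoning before it decomposes the new question.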

The architecture is an LLM-based text-to-SQL pipeline where reasoning traces and example retrieval guide modular decomposition of the NL question into subproblems and the SQL generation process. After producing SQL candidates the system executes them and applies error-aware revision when execution feedback indicates problems, closing the loop between generation and runtime correctness.
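The generate, execute, revise loop described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it uses SQLite, treats only database errors as feedback, and takes the reviser as a plain callable (`revise`), which in a real system would be an LLM call. `toy_revise` and the `singer` table are hypothetical.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

def execute(conn: sqlite3.Connection, sql: str) -> Tuple[Optional[List[tuple]], Optional[str]]:
    """Run `sql`; return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)

def generate_with_revision(
    conn: sqlite3.Connection,
    candidate: str,
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[tuple]]:
    """Execute a candidate query; on error, pass the message to `revise` and retry."""
    sql = candidate
    error: Optional[str] = None
    for attempt in range(max_rounds + 1):
        rows, error = execute(conn, sql)
        if error is None:
            return sql, rows
        if attempt < max_rounds:
            sql = revise(sql, error)  # feedback closes the loop
    raise RuntimeError(f"no executable SQL after {max_rounds} revisions: {error}")

def toy_revise(sql: str, error: str) -> str:
    """Hypothetical stand-in for an LLM reviser: fixes one known column typo."""
    if "no such column: nme" in error:
        return sql.replace("nme", "name")
    return sql

def make_demo_db() -> sqlite3.Connection:
    """Small in-memory database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 30), ("Bo", 25)])
    return conn
```

With `make_demo_db()`, the misspelled candidate `SELECT nme FROM singer ORDER BY nme` fails on its first execution, the error message is handed to `toy_revise`, and the corrected query succeeds on the second attempt. Capping the number of rounds keeps a reviser that never converges from looping forever.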

How does CoTE-SQL perform on benchmarks?

CoTE-SQL achieves 53.39% execution accuracy (EX) and a 59.02% valid efficiency score (VES) on Bird, and 79.60% EX and 77.19% VES on Spider, with the paper noting especially significant gains on complex queries. The authors position the Bird numbers as a new state of the art among methods built on open-source LLMs of comparable model size, and the Spider numbers as strong results.
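For readers unfamiliar with the metric, execution accuracy counts a prediction as correct when running it returns the same result as running the reference query. The sketch below shows the idea on SQLite. It is a simplified illustration, not the official Bird or Spider evaluation script: comparing rows as order-insensitive multisets is an assumption, and the table and query pairs are made up. VES additionally weights correct predictions by their runtime relative to the reference and is not shown here.

```python
import sqlite3
from collections import Counter
from typing import List, Optional, Tuple

def run(conn: sqlite3.Connection, sql: str) -> Optional[Counter]:
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def execution_accuracy(conn: sqlite3.Connection, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose execution results match."""
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    correct = 0
    for predicted, gold in pairs:
        pred_rows = run(conn, predicted)
        gold_rows = run(conn, gold)
        # A prediction that fails to execute never counts as correct.
        if pred_rows is not None and pred_rows == gold_rows:
            correct += 1
    return correct / len(pairs)

def make_eval_db() -> sqlite3.Connection:
    """Hypothetical three-row table used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?, ?)",
        [("Ana", "FR", 30), ("Bo", "US", 25), ("Cy", "FR", 41)],
    )
    return conn

# (predicted, gold) pairs: two matches, one wrong result, one execution error.
PAIRS = [
    ("SELECT name FROM singer WHERE country = 'FR'",
     "SELECT name FROM singer WHERE country = 'FR' ORDER BY name"),
    ("SELECT COUNT(*) FROM singer WHERE age > 30",
     "SELECT COUNT(*) FROM singer WHERE age >= 30"),
    ("SELECT nme FROM singer",
     "SELECT name FROM singer"),
    ("SELECT MAX(age) FROM singer",
     "SELECT age FROM singer ORDER BY age DESC LIMIT 1"),
]
```

On this toy set, `execution_accuracy(make_eval_db(), PAIRS)` gives 0.5: differently written queries still count as correct when their results agree, which is why EX is more forgiving of surface form than string matching.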

The evaluation covers both the Spider and Bird benchmarks, with the reported gains concentrated on complex queries. The paper spans 14 pages and contains 13 figures and 7 tables, which suggests a detailed empirical section behind the metric claims.

Why it matters

CoTE-SQL targets two persistent problems in text-to-SQL: producing logically correct SQL for complex linguistic inputs, and generalizing across database schemas and query patterns. The paper's combination of automated reasoning-trace distillation, structured prompting, and execution-time revision suggests a practical path to reducing dependence on human annotation while improving runtime correctness. If the claimed gains on Bird and Spider hold up under scrutiny, combining LLM-distilled reasoning traces with execution feedback could become the dominant pattern for building reliable open-source text-to-SQL systems.

What to watch

Watch for the authors' code, data and demos to appear in the paper's "Code, Data and Media Associated with this Article" section, which would allow others to reproduce the reported Bird and Spider numbers. Also look for independent comparisons against other open-source LLM-based text-to-SQL methods on complex-query subsets, which would test the paper's claim of especially significant gains there.

References and paper details: the submission was posted on 14 Jun 2026 and lists Feng Lyu, Jinfeng Cen, Sijing Duan, Hao Wu, Shucheng Li, Weixu Zhang and Haolun Wu as authors. The PDF and auxiliary materials are available with the submission.

CoTE-SQL benchmark results (as reported in the paper)

Benchmark    EX        VES
Bird         53.39%    59.02%
Spider       79.60%    77.19%

Written by The Brieftide · Source: arXiv
