CoTE-SQL: New fine-tuning method sets open-source SOTA on Bird, posts strong Spider results
CoTE-SQL combines self-enhanced reasoning traces, structured CoT prompting and execution-aware revision to boost text-to-SQL results.
TL;DR
- CoTE-SQL combines self-enhanced reasoning traces, structured CoT prompting and execution-aware revision to boost text-to-SQL results.
- The architecture is an LLM-based text-to-SQL pipeline in which reasoning traces and example retrieval guide both the modular decomposition of the natural-language question into subproblems and the SQL generation itself.
- After producing SQL candidates, the system executes them and applies error-aware revision when execution feedback indicates problems, closing the loop between generation and runtime correctness.
CoTE-SQL, a fine-tuning approach for text-to-SQL, achieves new state-of-the-art performance among methods built on open-source LLMs on the Bird benchmark and posts strong results on Spider, according to a paper submitted 14 Jun 2026. The paper, authored by Feng Lyu, Jinfeng Cen, Sijing Duan, Hao Wu, Shucheng Li, Weixu Zhang and Haolun Wu, reports Bird results of 53.39% EX and 59.02 VES and Spider results of 79.60% EX and 77.19 VES.
How does CoTE-SQL work?
CoTE-SQL integrates three concrete innovations: reasoning traces distilled without human annotation, structured chain-of-thought prompting with modular decomposition and example retrieval, and error-aware revision driven by SQL execution feedback. The authors summarize the first innovation as "self-enhanced reasoning traces distilled from LLMs without human annotation," and pair it with structured CoT prompting and an execution-feedback revision loop to improve generation.
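To make the idea of example retrieval feeding a structured prompt concrete, here is a minimal sketch. It is not the paper's implementation: the token-overlap similarity, the function names (`retrieve_examples`, `build_prompt`) and the wording of the decomposition steps are all assumptions chosen for illustration, and a real system would more likely use learned embeddings.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a learned embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_examples(question: str, pool: list[dict], k: int = 2) -> list[dict]:
    """Return the k pool entries whose questions overlap most with `question`."""
    q_tokens = tokenize(question)
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(q_tokens, tokenize(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, schema: str, examples: list[dict]) -> str:
    """Assemble a structured chain-of-thought prompt (hypothetical layout)."""
    parts = ["Database schema:", schema, ""]
    for ex in examples:
        parts += [f"Example question: {ex['question']}", f"Example SQL: {ex['sql']}", ""]
    parts += [
        f"Question: {question}",
        "Step 1: List the tables and columns the question needs.",
        "Step 2: Break the question into sub-questions.",
        "Step 3: Write SQL for each sub-question.",
        "Step 4: Combine the pieces into one final SQL query.",
    ]
    return "\n".join(parts)
```

Calling `retrieve_examples` with a question and a small pool of question/SQL pairs, then passing the result to `build_prompt`, yields a single prompt string that could be sent to whichever model is being fine-tuned or queried.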
The architecture is an LLM-based text-to-SQL pipeline in which reasoning traces and example retrieval guide both the modular decomposition of the natural-language question into subproblems and the SQL generation itself. After producing SQL candidates, the system executes them and applies error-aware revision when execution feedback indicates problems, closing the loop between generation and runtime correctness.
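The generate, execute and revise loop described here can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `generate` and `revise` are hypothetical callables standing in for model calls, SQLite is used only because it ships with Python, and a raised database error is the only feedback signal treated as a problem.

```python
import sqlite3
from typing import Callable, Optional


def execute_sql(conn: sqlite3.Connection, sql: str) -> tuple[Optional[list], Optional[str]]:
    """Run `sql`; return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def generate_with_revision(
    question: str,
    conn: sqlite3.Connection,
    generate: Callable[[str], str],          # hypothetical: question -> SQL
    revise: Callable[[str, str, str], str],  # hypothetical: (question, SQL, error) -> SQL
    max_rounds: int = 3,
) -> tuple[str, Optional[list], Optional[str]]:
    """Generate SQL, execute it, and ask for a revision while execution fails."""
    sql = generate(question)
    rows, error = execute_sql(conn, sql)
    rounds = 0
    while error is not None and rounds < max_rounds:
        # Feed the database error message back to the reviser.
        sql = revise(question, sql, error)
        rows, error = execute_sql(conn, sql)
        rounds += 1
    return sql, rows, error
```

The function returns the last SQL string together with its rows and any remaining error, so a caller can tell a repaired query from one that still fails after `max_rounds` attempts.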
How does CoTE-SQL perform on benchmarks?
CoTE-SQL achieves 53.39% execution accuracy (EX) and a valid efficiency score (VES) of 59.02 on Bird, and 79.60% EX and 77.19 VES on Spider, with the paper noting especially significant gains on complex queries. The authors position the Bird numbers as a new state of the art among methods built on open-source LLMs of comparable model size, and the Spider numbers as strong results.
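For readers unfamiliar with the metric, EX on these benchmarks is scored by executing the predicted and reference queries and comparing what they return. The sketch below shows that idea under simplifying assumptions: rows are compared as unordered sets, a prediction that fails to execute counts as a miss, and the function names are illustrative rather than taken from any official evaluation script.

```python
import sqlite3


def results_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries run and return the same set of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute is scored as wrong
    gold = conn.execute(gold_sql).fetchall()
    return set(predicted) == set(gold)


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Percentage of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(results_match(conn, predicted, gold) for predicted, gold in pairs)
    return 100.0 * hits / len(pairs)
```

Because the comparison is on results rather than query text, two differently written queries that return the same rows both count as correct.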
The evaluation appears extensive: the submission runs experiments on the Spider and Bird benchmarks and reports gains concentrated on complex queries. The paper itself spans 14 pages and contains 13 figures and 7 tables, indicating a detailed empirical section backing the metric claims.
Why it matters
CoTE-SQL targets two persistent problems in text-to-SQL: producing logically correct SQL for complex linguistic inputs and generalizing across database schemas and query patterns. The paper's combination of automated reasoning-trace distillation, structured prompting, and execution-time revision suggests a practical path to reduce dependence on human annotation while improving runtime correctness. If the claimed gains on Bird and Spider hold up under scrutiny, method choices that mix internal LLM traces with execution feedback could become the dominant pattern for building reliable open-source text-to-SQL systems.
What to watch
Watch for the authors' code, data and demos linked in the paper's "Code, Data and Media Associated with this Article" section to appear, allowing reproduction of the reported Bird and Spider numbers. Also look for independent comparisons against other open-source LLM-based text-to-SQL methods on complex-query subsets to confirm the paper's claim of especially significant gains there.
References and paper details: the submission was posted on 14 Jun 2026 and lists Feng Lyu, Jinfeng Cen, Sijing Duan, Hao Wu, Shucheng Li, Weixu Zhang and Haolun Wu as authors. The PDF and auxiliary materials are available with the submission.
| Benchmark | EX | VES |
|---|---|---|
| Bird | 53.39% | 59.02 |
| Spider | 79.60% | 77.19 |
Written by The Brieftide · Source: arXiv