Benchmarks & Evals · 3 min read · via The Decoder

Gemini-SQL2 tops BIRD benchmark with 80.04% accuracy

Built on Gemini 3.1 Pro, Gemini-SQL2 converts natural language to executable SQL and posts 80.04 percent on the BIRD text-to-SQL benchmark.

The Brieftide

TL;DR

  • Built on Gemini 3.1 Pro, Gemini-SQL2 converts natural language to executable SQL and posts 80.04 percent on the BIRD text-to-SQL benchmark.
  • Google Research has released Gemini-SQL2, a text-to-SQL model built on Gemini 3.1 Pro, and reported 80.04 percent accuracy on the BIRD text-to-SQL benchmark.
  • The model converts natural language questions into executable SQL queries and posts the highest known BIRD score to date, according to Google Research.

Google Research has released Gemini-SQL2, a text-to-SQL model built on Gemini 3.1 Pro, and reported 80.04 percent accuracy on the BIRD text-to-SQL benchmark. The model converts natural language questions into executable SQL queries and posts the highest known BIRD score to date, according to Google Research.

What Gemini-SQL2 does

Gemini-SQL2 accepts natural language prompts that describe data retrieval needs and produces SQL statements that can run against a relational database. The work targets end-to-end executable correctness, not just syntactic similarity, meaning the output is judged by whether the generated query returns the correct result on benchmark databases.
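To make the distinction between executable correctness and syntactic similarity concrete, here is a minimal sketch of an execution-based check using Python's built-in sqlite3 module. The `orders` schema, the sample queries, and the helper names `run_query` and `execution_match` are hypothetical and exist only for illustration; comparing results as order-insensitive multisets is an assumption of this sketch and may differ from how BIRD's own scorer compares results.

```python
import sqlite3
from collections import Counter


def run_query(conn: sqlite3.Connection, sql: str):
    """Run a query and return its rows, or None if it fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn: sqlite3.Connection, predicted_sql: str, reference_sql: str) -> bool:
    """True when both queries execute and return the same multiset of rows."""
    predicted = run_query(conn, predicted_sql)
    reference = run_query(conn, reference_sql)
    if predicted is None or reference is None:
        return False
    # Assumption: row order is ignored, duplicates are not.
    return Counter(predicted) == Counter(reference)


# Hypothetical toy database, not a BIRD database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ada', 30.0), (2, 'bob', 12.5), (3, 'ada', 7.5);
    """
)

reference = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
# Textually different from the reference, but returns the same rows.
good = (
    "SELECT o.customer, SUM(o.total) AS spend FROM orders AS o "
    "GROUP BY o.customer ORDER BY spend"
)
# Textually close to the reference, but drops the aggregation.
bad = "SELECT customer, total FROM orders"

print(execution_match(conn, good, reference))  # True
print(execution_match(conn, bad, reference))   # False
```

The first query shares little surface text with the reference yet counts as correct, while the second looks similar and fails, which is the behavior an execution-based metric is meant to capture.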

Google Research positioned Gemini-SQL2 as an evolution of its prior SQL-focused models, leveraging the Gemini 3.1 Pro base for language understanding and code generation. The team emphasized improvements in schema grounding, column and table disambiguation, and generation of joins and nested queries that execute correctly across a range of schema complexities.

The model is evaluated on BIRD, a text-to-SQL benchmark that measures executable-match accuracy across diverse databases and natural language prompts. A generated query counts as correct only if it returns the expected results when executed against the test databases, not merely if it looks plausible.

Benchmark results and comparison

On the BIRD benchmark, Gemini-SQL2 achieved 80.04 percent executable-match accuracy. Google Research framed the result as a substantial improvement over previously published baselines, and highlighted examples where the model resolved ambiguous references and produced multi-join queries that matched expected outputs.

Google provided qualitative examples showing improvements in handling nested subqueries, union operations, and schema linking where column names are similar across tables. The team also noted reductions in common failure modes, such as incorrect aggregation placement and misordered join conditions.
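The failure modes named above can be illustrated with a small sketch. The schema and queries below are hypothetical and are not taken from Google's examples; they only show, using Python's built-in sqlite3 module, how a misplaced aggregate filter can run without error yet return the wrong rows, and how similarly named columns across tables can trip up an unqualified reference.

```python
import sqlite3

# Hypothetical schema: both customers and products have a "name" column.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        product_id  INTEGER REFERENCES products(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO products  VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO orders VALUES (1, 1, 1, 30.0), (2, 1, 2, 7.5),
                              (3, 2, 1, 12.5), (4, 2, 2, 12.5);
    """
)

# Intent: customers whose combined order total exceeds 20.
# Intended placement: filter the groups after aggregating.
having_rows = conn.execute(
    """
    SELECT c.name FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.name HAVING SUM(o.total) > 20
    ORDER BY c.name
    """
).fetchall()

# Misplaced filter: executes fine, but filters single orders before grouping,
# so a customer with several small orders is silently dropped.
where_rows = conn.execute(
    """
    SELECT c.name FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.total > 20
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

# Unqualified "name" could refer to customers.name or products.name.
try:
    conn.execute(
        "SELECT name FROM orders AS o "
        "JOIN customers AS c ON c.id = o.customer_id "
        "JOIN products AS p ON p.id = o.product_id"
    )
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

print(having_rows)      # [('ada',), ('bob',)]
print(where_rows)       # [('ada',)]
print(ambiguous_error)  # SQLite reports the column reference as ambiguous
```

The misplaced filter is the harder case for reviewers, because nothing fails at execution time; only a comparison against the expected result reveals the error.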

Public details about training data, compute, and fine-tuning methods were limited in the initial release. Google Research indicated that the model builds on the Gemini 3.1 Pro family and incorporates task-specific tuning for text-to-SQL, but full reproducibility details and a comprehensive leaderboard snapshot were not published alongside the announcement.

Comparison at a glance

Model | BIRD accuracy | Notes
Gemini-SQL2 (Gemini 3.1 Pro) | 80.04% | Reported by Google Research as top BIRD score
Prior published baselines | Varies by model | Multiple earlier entries exist with lower published scores

The announcement did not publish a full ranked list of competitors on the BIRD leaderboard, so direct numeric gaps to named prior leaders are not available in the initial material.

Why it matters

A higher executable-match score on BIRD signals progress toward text-to-SQL systems that produce correct, runnable queries in real settings. That matters for analytics teams and BI tools that aim to let nontechnical users query databases. Better schema grounding and fewer execution failures reduce the manual checking burden when models are used to prototype or automate queries.

Wider adoption will depend on transparency around training data, safety mitigations for data-sensitive environments, and how the model performs on private or domain-specific schemas beyond benchmark datasets.
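As one illustration of what a mitigation for data-sensitive environments could look like, the sketch below runs model-generated SQL on a SQLite connection with writes disabled. This is an assumption about how a team might guard such a system, not something described in the announcement; the `orders` table and the `try_query` helper are hypothetical. It relies on SQLite's `PRAGMA query_only`, which makes the engine reject statements that would modify the database.

```python
import sqlite3

# Hypothetical toy database standing in for a production schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ada', 30.0), (2, 'bob', 12.5);
    """
)

# From here on, SQLite rejects any statement that would change the data.
conn.execute("PRAGMA query_only = ON")


def try_query(conn: sqlite3.Connection, sql: str):
    """Run untrusted SQL; return (rows, None) or (None, error_message)."""
    try:
        return conn.execute(sql).fetchall(), None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return None, str(exc)


# A read-only query succeeds.
rows, error = try_query(conn, "SELECT customer, total FROM orders ORDER BY id")
print(rows, error)

# A destructive statement is rejected and the data is left untouched.
rows, error = try_query(conn, "DELETE FROM orders")
print(rows, error is not None)

# A query that references a column missing from the schema fails cleanly.
rows, error = try_query(conn, "SELECT discount FROM orders")
print(rows, error is not None)
```

A guard like this only limits damage from bad queries; it does not address what data a query is allowed to read, which would need separate access controls.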

[Chart: BIRD benchmark snapshot, Gemini-SQL2 versus published baselines]

Primary source

The Decoder

the-decoder.com
