
Best LLM for SQL Generation: Which AI Model Writes the Most Accurate Queries


Dr. Elena Vasquez · AI Research Lead · March 30, 2026 · 7 min read

If you've been following the text-to-SQL space, you'll know the benchmark numbers have climbed steeply over the past two years. GPT-4 hit 85% on Spider. Claude 3 nudged ahead on join-heavy tests. Gemini Ultra claimed improvements on complex subquery generation. But benchmark numbers and real-world accuracy are two very different things.

This article breaks down how the leading large language models actually perform when given real production schemas: ambiguous column names, multi-table joins, and the kind of nuanced business logic that shows up in day-to-day queries rather than in cleaned-up academic datasets.

How Text-to-SQL Works Under the Hood

When you ask an AI "show me revenue by country last 30 days," the model doesn't magically know which tables to query. It needs schema context: which tables exist, what the columns are, their data types, and ideally some sample data or cardinality hints.

The typical pipeline looks like this:

  • Schema extraction: The system introspects your database and pulls table names, column names, types, and foreign key relationships.
  • Prompt construction: The schema is injected into the prompt alongside the user's natural language question.
  • SQL generation: The LLM outputs a SQL query based on the prompt.
  • Validation and execution: The generated SQL is parsed, checked for syntax errors, and run against the actual database.
  • Result formatting: The returned rows are formatted into a table or chart.

    Each step can fail, but the SQL generation step is where LLM choice matters most.
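The first two steps can be sketched with SQLite's built-in catalog introspection. This is a minimal illustration, not a prescribed format: the table, the prompt wording, and the comment-style schema encoding are all assumptions for the example.

```python
import sqlite3

def extract_schema(conn: sqlite3.Connection) -> str:
    """Step 1: introspect table and column metadata from SQLite's catalog."""
    lines = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        col_desc = ", ".join(f"{c[1]} {c[2]}" for c in cols)
        lines.append(f"-- {table}: {col_desc}")
    return "\n".join(lines)

def build_prompt(schema: str, question: str) -> str:
    """Step 2: inject the schema context alongside the user's question."""
    return (
        "Given this database schema:\n"
        f"{schema}\n\n"
        f"Write a SQL query to answer: {question}\n"
        "Return only the SQL."
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, created_at TEXT)")
prompt = build_prompt(extract_schema(conn), "How many orders did we get last month?")
print(prompt)
```

The prompt string produced here is what gets sent to the model in step 3; production systems typically add sample values and foreign-key hints to this context, as discussed below.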

    What "Accurate" Actually Means for SQL Generation

    Benchmarks like Spider and BIRD score models on exact match or execution accuracy against a reference query. In production, you care about a different set of criteria:

  • Does it run without a syntax error? A query that crashes is worse than no query at all.
  • Does it return the right rows? A query that runs but silently returns wrong data is the most dangerous failure.
  • Does it handle ambiguity gracefully? Real schemas have columns like status, type, or created appearing in multiple tables. Can the model figure out the right one?
  • Can it handle multi-table joins? Most interesting business questions require joining at least two tables.
  • Does it respect business logic? If your orders table has a deleted_at soft-delete column, a model that ignores it will inflate your numbers.
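The second criterion, "does it return the right rows," is what benchmarks call execution accuracy: two queries match if they return the same rows, regardless of SQL text. A minimal sketch of that check, using a toy SQLite table with a soft-delete column (the table and queries are illustrative):

```python
import sqlite3

def execution_match(conn, generated_sql, reference_sql):
    """Execution accuracy: queries 'match' if they return the same
    multiset of rows, regardless of row order or SQL phrasing."""
    gen = sorted(conn.execute(generated_sql).fetchall())
    ref = sorted(conn.execute(reference_sql).fetchall())
    return gen == ref

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, deleted_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "active", None), (2, "cancelled", "2026-01-01")],
)

# A query that ignores the soft-delete column runs without error,
# but fails execution accuracy against a reference that filters it.
naive = "SELECT COUNT(*) FROM orders"
reference = "SELECT COUNT(*) FROM orders WHERE deleted_at IS NULL"
print(execution_match(conn, naive, reference))  # False: naive counts deleted rows
```

This is exactly the "runs but silently wrong" failure mode: both queries execute cleanly, and only comparing results exposes the difference.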

    GPT-4, Claude, and Gemini: Practical Differences

    Testing these models on production-style schemas reveals consistent patterns.

    GPT-4 (including GPT-4o) produces clean, valid SQL in most cases. It handles JOINs well and has solid table disambiguation. Its main weakness is aggressive assumption-making: it picks a table when uncertain rather than flagging the ambiguity, which can produce subtly wrong queries. It also handles very large schemas (100+ tables) less gracefully as context fills up.

    Claude 3.5 and Claude 3 Opus show stronger performance on queries requiring careful reasoning about schema relationships. Claude tends to produce more conservative queries: it will add explicit filters like WHERE deleted_at IS NULL even when not asked, which is usually the right behaviour. It also generates better inline comments explaining its reasoning, which helps you catch errors before running.

    Gemini 1.5 Pro and Ultra handle long context better than either GPT-4 or Claude on a per-token basis, which is an advantage with large schemas. However, testing against complex schemas shows a higher rate of syntactically valid but semantically wrong queries in JOIN-heavy scenarios.

    Here's a concrete example. Given this schema fragment:

    -- tables: orders, order_items, products, customers
    -- orders: id, customer_id, created_at, status, deleted_at
    -- order_items: id, order_id, product_id, quantity, unit_price
    -- products: id, name, category, price
    
    -- Question: "What was our total revenue by product category last month?"

    GPT-4 typically generates:

    SELECT p.category, SUM(oi.quantity * oi.unit_price) AS total_revenue
    FROM orders o
    JOIN order_items oi ON o.id = oi.order_id
    JOIN products p ON oi.product_id = p.id
    WHERE o.created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
      AND o.created_at < DATE_TRUNC('month', CURRENT_DATE)
    GROUP BY p.category
    ORDER BY total_revenue DESC;

    Clean and correct, but it misses the deleted_at filter, which means cancelled or deleted orders inflate the revenue numbers.

    Claude typically adds:

    WHERE o.created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
      AND o.created_at < DATE_TRUNC('month', CURRENT_DATE)
      AND o.deleted_at IS NULL
      AND o.status NOT IN ('cancelled', 'refunded')

    Which is almost certainly what you wanted, even if you didn't say it.

    How Schema Context Changes Everything

    The model you choose matters less than how well you feed it schema context. An improperly prompted GPT-4 will underperform a well-prompted smaller model.

    Key schema context elements that improve accuracy:

  • Column descriptions: Instead of just status VARCHAR, include status VARCHAR -- values: active, trial, cancelled, churned
  • Foreign key hints: Make JOIN relationships explicit in the prompt
  • Sample values: Including a few example values for categorical columns dramatically reduces table disambiguation errors
  • Business rules: Document soft-delete patterns, status filters, and other non-obvious logic in column comments

    Tools like AI for Database handle schema enrichment automatically: they introspect your database, cache schema metadata, and inject the right context per query. This is why purpose-built NL-to-SQL tools often outperform copy-pasting a schema into ChatGPT, even when using the same underlying model.
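As a sketch of what an enriched schema block might look like when assembled for the prompt, here is a hypothetical helper; the function name, the comment syntax, and the sample columns are all illustrative, not a specific tool's format:

```python
def enrich_column(table, column, col_type, samples=None, note=None):
    """Render one column as an annotated schema line for the prompt."""
    line = f"{table}.{column} {col_type}"
    if samples:
        line += f" -- values: {', '.join(samples)}"  # sample values cut disambiguation errors
    if note:
        line += f" ({note})"  # business rules travel with the column
    return line

schema_context = "\n".join([
    enrich_column("orders", "status", "VARCHAR",
                  samples=["active", "cancelled", "refunded"]),
    enrich_column("orders", "deleted_at", "TIMESTAMP",
                  note="soft delete: filter with deleted_at IS NULL"),
    # Foreign key hints make JOIN paths explicit for the model
    "-- FK: order_items.order_id -> orders.id",
    "-- FK: order_items.product_id -> products.id",
])
print(schema_context)
```

Compare this with the bare schema fragment earlier in the article: a model reading the enriched version has been told about the soft-delete rule instead of having to guess it.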

    When to Trust the Generated SQL

    Generated SQL should always be treated as a first draft, not a final answer. Before running any query against production data, check:

  • The WHERE clauses: Make sure all expected filters are present, especially soft deletes and status exclusions.
  • The JOINs: Confirm the right columns are being joined and that cardinality won't blow up your result set.
  • DISTINCT vs GROUP BY: Deduplication logic is a common failure point.
  • Implicit type conversions: Type mismatches between join columns cause silent data problems in some databases.

    The most dangerous scenario is a query that runs, returns a plausible-looking number, but is subtly wrong: for example, double-counting due to a missing DISTINCT.
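One cheap pre-execution safeguard is a dry-run compile. In SQLite, prefixing a statement with EXPLAIN compiles it without running it, which surfaces syntax errors and unknown tables or columns before any data is touched (other databases have analogous mechanisms, such as PREPARE). A minimal sketch:

```python
import sqlite3

def validate_sql(conn, sql):
    """Dry-run check: EXPLAIN compiles the statement without executing
    the underlying query, catching syntax and reference errors early."""
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True, None
    except sqlite3.Error as e:
        return False, str(e)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, deleted_at TEXT)")

ok, _ = validate_sql(conn, "SELECT COUNT(*) FROM orders WHERE deleted_at IS NULL")
bad, err = validate_sql(conn, "SELECT * FROM odrers")  # misspelled table name
print(ok, bad)
```

Note this only catches the first failure class from the checklist; a query that compiles can still be semantically wrong, which is why the WHERE/JOIN review above still matters.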

    AI for Database includes a query preview step before execution, letting you review the generated SQL and catch issues before they hit your data.

    The Full Loop: Generation Is Only Half the Problem

    Text-to-SQL accuracy is only one part of getting useful answers from your database. The other parts of the loop, such as database connection management, result formatting, error handling, and iteration, are equally important.

    If the first query fails or returns the wrong thing, a good system should let you refine in plain English rather than requiring SQL edits. "Actually, exclude the test accounts" should produce a corrected query automatically.
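One common way to wire up that refinement loop is to keep the whole exchange as a message list and append each plain-English correction, so the model regenerates SQL with full conversation context. The message format below mirrors common chat APIs but is an assumption here, and `fake_model` is a stand-in for a real LLM call:

```python
def refine_query(call_model, history, followup):
    """Append the user's plain-English correction to the conversation,
    then ask the model to regenerate the SQL with full context."""
    history = history + [{"role": "user", "content": followup}]
    new_sql = call_model(history)
    history = history + [{"role": "assistant", "content": new_sql}]
    return history, new_sql

def fake_model(messages):
    # Stand-in for a real LLM API call; a production system would send
    # `messages` to the model provider and return its SQL response.
    return "SELECT ... FROM customers WHERE email NOT LIKE '%@test.com'"

history = [
    {"role": "user", "content": "Show me revenue by country last 30 days"},
    {"role": "assistant", "content": "SELECT country, SUM(amount) FROM orders ..."},
]
history, sql = refine_query(fake_model, history, "Actually, exclude the test accounts")
print(sql)
```

Because the earlier turns stay in `history`, "exclude the test accounts" can be interpreted as a delta against the previous query rather than a brand-new question.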

    This is where dedicated tools differ most from general-purpose AI assistants. When you ask ChatGPT to write SQL, you're on your own for running it, handling errors, and iterating. A dedicated platform like AI for Database handles the whole loop: it generates, validates, executes, catches errors, reformats, and lets you ask follow-up questions with full conversation context.

    For teams that query databases regularly, that full loop matters more than which underlying model scores highest on Spider.

    The Bottom Line

    There is no single best LLM for SQL generation in all situations. The gap between top models has narrowed significantly, and what separates good text-to-SQL implementations from mediocre ones is schema context quality, error handling, and iteration speednot raw model performance.

    If you want accurate answers from your database without managing any of this complexity yourself, try AI for Database free at aifordatabase.com. It connects to your existing database in minutes, handles schema enrichment automatically, and lets you query in plain English without worrying about which AI model is doing the work.

    Ready to try AI for Database?

    Query your database in plain English. No SQL required. Start free today.