Type "Show me revenue by country last 30 days" into a tool like AI for Database and within a few seconds you get a result table. But what actually happened between your English sentence and the database returning rows?
The answer involves a surprisingly intricate chain: schema introspection, prompt engineering, an LLM generating SQL, query execution, and result formatting. Each step has failure modes. Understanding how the pipeline works helps you use these tools more effectively — and helps engineers decide whether to build one themselves or use an existing tool.
This post walks through the full architecture, with concrete examples at each stage.
What "Text-to-SQL" Actually Means
Text-to-SQL (also called NL2SQL, or natural language to SQL) is the task of taking a natural language question and producing a syntactically valid SQL query that answers it.
It sounds straightforward, but it's genuinely hard. A human writing SQL knows the schema, understands the business context, and can disambiguate vague terms. An LLM has to figure all of this out from limited context — the schema, the question, and sometimes a few examples.
Early text-to-SQL systems from the 2010s used rule-based parsing and template matching. They worked for narrow domains but fell apart on anything outside their training distribution. Modern systems use large language models, which generalize far better — at the cost of being occasionally wrong in unpredictable ways.
Step 1: Schema Introspection
Before an LLM can generate SQL, it needs to know what tables and columns exist in your database. This is called schema introspection — the tool queries the database's system tables to get the structure.
For PostgreSQL, this looks something like:
```sql
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```

For MySQL:
```sql
SELECT
    TABLE_NAME,
    COLUMN_NAME,
    DATA_TYPE,
    IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
ORDER BY TABLE_NAME, ORDINAL_POSITION;
```

The result is a map of every table and its columns. A well-designed tool will also pull foreign key relationships — which columns reference which — so the LLM knows how to JOIN tables correctly.
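Folding those introspected rows into prompt-ready text is straightforward. A minimal sketch in Python (the row shape mirrors the `information_schema.columns` query above; the function name is illustrative):

```python
from collections import defaultdict

def format_schema(rows):
    """Turn (table, column, data_type, is_nullable) rows into prompt text."""
    tables = defaultdict(list)
    for table, column, data_type, nullable in rows:
        null_note = "nullable" if nullable == "YES" else "not null"
        tables[table].append(f"- {column} ({data_type}, {null_note})")
    lines = []
    for table, cols in tables.items():
        lines.append(f"Table: {table}")
        lines.extend(cols)
    return "\n".join(lines)

rows = [
    ("users", "id", "integer", "NO"),
    ("users", "email", "varchar", "NO"),
    ("users", "churned_at", "timestamp", "YES"),
]
print(format_schema(rows))
```

The output is exactly the kind of compact schema text shown next.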
For a mid-size SaaS database with 30–50 tables, the schema representation might look something like this in a prompt:
```
Table: users
- id (integer, not null)
- email (varchar, not null)
- created_at (timestamp, not null)
- plan_id (integer, references plans.id)
- churned_at (timestamp, nullable)

Table: plans
- id (integer, not null)
- name (varchar, not null)
- monthly_price_cents (integer, not null)
- tier (varchar: 'free', 'pro', 'enterprise')
```

The tricky part: large databases have hundreds of tables. Stuffing all of them into an LLM context window is slow and expensive. Good text-to-SQL systems do relevance filtering — given the user's question, they pre-select the tables most likely to be needed and only include those in the prompt.
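A crude version of that relevance filtering can be done with plain word overlap between the question and each table's name and columns. Production systems typically use embeddings instead; this sketch is purely illustrative:

```python
import re

def relevant_tables(question, schema, top_k=2):
    """Score each table by word overlap with the question; keep the top_k."""
    q_words = set(re.findall(r"[a-z]+", question.lower()))
    scores = {}
    for table, columns in schema.items():
        # Words drawn from the table name and its column names.
        text = (table + " " + " ".join(columns)).lower()
        scores[table] = len(q_words & set(re.findall(r"[a-z]+", text)))
    ranked = sorted(schema, key=lambda t: scores[t], reverse=True)
    return ranked[:top_k]

schema = {
    "users": ["id", "email", "created_at", "plan_id", "churned_at"],
    "plans": ["id", "name", "monthly_price_cents", "tier"],
    "audit_log": ["id", "actor", "action", "created_at"],
}
print(relevant_tables("Which plan tier has the most users?", schema))
```

Only the surviving tables' definitions go into the prompt, which keeps token usage bounded even on large schemas.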
Step 2: Building the Prompt
The prompt is where most of the engineering effort lives in a production text-to-SQL system.
A typical prompt includes:
- A system instruction defining the task and the output format
- The schema, including foreign key relationships
- The target SQL dialect, with dialect-specific notes where they matter: PostgreSQL uses EXTRACT(EPOCH FROM ...) for timestamp differences, MySQL uses TIMESTAMPDIFF(), and the LLM needs to know which one to use
- The user's question

A simplified version:
```
You are a SQL query generator. Given the database schema and a question,
write a valid PostgreSQL query that answers the question. Return only the SQL,
no explanation.

Schema:
users(id, email, created_at, plan_id, churned_at)
plans(id, name, monthly_price_cents, tier)

Foreign keys:
users.plan_id → plans.id

Question: How many paying users do we have right now?
SQL:
```

The model completes from there, generating something like:
```sql
SELECT COUNT(*)
FROM users
WHERE plan_id IS NOT NULL
  AND churned_at IS NULL
  AND plan_id IN (
    SELECT id FROM plans WHERE tier != 'free'
  );
```

Whether that's right depends on how "paying" is defined in your schema. Maybe tier != 'free' is the right filter, maybe there's a separate subscriptions table with a status column. The model makes its best guess from available schema information.
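Assembled in code, a prompt like the one above is little more than string concatenation over the pieces discussed earlier. A sketch (the exact wording is whatever your system settles on):

```python
def build_prompt(schema_text, fk_text, question, dialect="PostgreSQL"):
    """Assemble the text-to-SQL prompt from its parts."""
    return (
        f"You are a SQL query generator. Given the database schema and a "
        f"question,\nwrite a valid {dialect} query that answers the question. "
        f"Return only the SQL, no explanation.\n\n"
        f"Schema:\n{schema_text}\n\n"
        f"Foreign keys:\n{fk_text}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt(
    schema_text=("users(id, email, created_at, plan_id, churned_at)\n"
                 "plans(id, name, monthly_price_cents, tier)"),
    fk_text="users.plan_id -> plans.id",
    question="How many paying users do we have right now?",
)
print(prompt)
```

Ending the prompt with `SQL:` nudges the model to complete with a query rather than an explanation.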
Step 3: LLM Inference
The prompt goes to the LLM — GPT-4o, Claude, Gemini, or a fine-tuned open-source model. The LLM generates SQL token by token.
A few important details about this step:
Temperature matters. Text-to-SQL works best with low temperature (0.0–0.2). You want deterministic, consistent output — not creative variations. If you're building your own system, set temperature close to zero.
Context window length is a real constraint. A 500-table enterprise schema might not fit in a 128k token context window. This is why schema filtering matters. The best systems embed schema descriptions and retrieve only the most relevant portions per query.
Fine-tuning vs. prompting. General-purpose LLMs are surprisingly capable at text-to-SQL with good prompting. Fine-tuned models (trained specifically on SQL generation tasks) can outperform them on narrow distributions but may underperform on unusual schemas or edge cases. For most applications, a well-prompted general model is the faster, more maintainable choice.
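In practice, these settings amount to a couple of fields on the inference request. A sketch assuming an OpenAI-style chat completions payload (the model name and token cap are illustrative):

```python
def build_request(prompt, model="gpt-4o"):
    """Build an inference request payload for SQL generation."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Near-zero temperature for deterministic, consistent SQL output.
        "temperature": 0.0,
        # Queries are short; a small cap bounds cost and latency.
        "max_tokens": 512,
    }

req = build_request("...prompt from the previous step...")
print(req["temperature"])
```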
Step 4: Validation and Error Recovery
Raw LLM output isn't always valid SQL. The query might reference a column that doesn't exist, use a function that's wrong for the database dialect, or have a syntax error.
Good text-to-SQL systems validate the generated SQL before executing it. The most common approaches:
Parse validation — run the SQL through a parser (like sqlglot) to catch syntax errors without actually executing the query
Schema validation — check that all referenced tables and columns exist in the schema
Dry-run execution — for databases that support it, add LIMIT 0 or use EXPLAIN to validate the query without returning rows
Retry on failure — if the query fails validation or execution, feed the error back to the LLM with an instruction to fix it. This "self-healing" loop handles most errors in one or two iterations:
```
The query you generated returned an error:
ERROR: column users.is_active does not exist
Please fix the query. Note: active users are identified by churned_at IS NULL.
```

One or two retry iterations resolve the majority of failures without user intervention.
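The whole generate-execute-retry loop fits in a few lines. A sketch with the LLM and the database stubbed out (`generate_sql` and `run_query` stand in for real calls):

```python
def self_healing_query(question, generate_sql, run_query, max_retries=2):
    """Generate SQL, execute it, and feed errors back to the model on failure."""
    prompt = question
    for _ in range(max_retries + 1):
        sql = generate_sql(prompt)
        try:
            return run_query(sql)
        except Exception as err:
            # Append the error so the model can repair its own query.
            prompt = (f"{question}\n\nThe query you generated returned an "
                      f"error:\n{err}\nPlease fix the query.")
    raise RuntimeError("query failed after retries")

# Stubs that fail once, then succeed, mimicking the example above.
attempts = []
def fake_llm(prompt):
    attempts.append(prompt)
    if "error" in prompt:  # second pass: the model has seen the error
        return "SELECT COUNT(*) FROM users WHERE churned_at IS NULL"
    return "SELECT COUNT(*) FROM users WHERE is_active"
def fake_db(sql):
    if "is_active" in sql:
        raise ValueError("column users.is_active does not exist")
    return [(42,)]

print(self_healing_query("count active users", fake_llm, fake_db))
```

Here the first attempt fails, the error is folded into the prompt, and the second attempt succeeds, which is the common case in production.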
Step 5: Result Formatting
Once the SQL executes successfully, the raw result set needs to be formatted for display.
A query returning 10,000 rows needs pagination or summarization. A query returning a single number should display as a metric, not a table. A query with a timestamp column should format dates in the user's local timezone.
AI for Database automatically picks the right visualization — single-number metrics, tables, and charts — based on the shape of the result. A query that groups by a date column and returns a count automatically renders as a line chart. A query returning multiple dimensions renders as a table. This formatting layer is separate from the SQL generation but equally important for usability.
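That shape-based choice can be sketched as a small dispatch function. The rules below are a simplification of what such a tool might do, not its actual logic:

```python
from datetime import date

def pick_visualization(columns, rows):
    """Choose a display type from the shape of a query result."""
    if len(rows) == 1 and len(columns) == 1:
        return "metric"        # single value: show a big-number metric
    if len(columns) == 2 and rows and isinstance(rows[0][0], date):
        return "line_chart"    # date + value: render a time series
    return "table"             # multiple dimensions: plain table

print(pick_visualization(["count"], [(42,)]))
print(pick_visualization(["day", "signups"],
                         [(date(2024, 1, 1), 10), (date(2024, 1, 2), 12)]))
```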
Where Text-to-SQL Gets It Wrong
Understanding the failure modes helps you work around them:
Ambiguous column names. If your orders table has both created_at and order_date, and you ask "orders this week," the model might pick the wrong one. Being specific ("orders placed this week" vs. "orders created this week") helps.
Business logic not in the schema. If "active user" means something specific in your product (e.g., users who have completed a certain action, not just users without churned_at), the model doesn't know that unless you tell it. AI for Database lets you add custom schema descriptions and business definitions that get included in prompts.
Complex multi-table logic. Queries requiring three or more joins with non-obvious conditions are harder. Breaking them into sequential questions often works better: "Show me users on the Pro plan" → then "of those, how many haven't logged in this month?"
Aggregation edge cases. NULL handling in SQL aggregations trips up LLMs occasionally. AVG(column) ignores NULLs, but users sometimes expect NULLs to count as zero. The generated SQL might be technically correct while giving a result the user didn't expect.
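The AVG behavior is easy to demonstrate (SQLite here for convenience, but the NULL semantics are standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (value INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(10,), (20,), (None,)])

# AVG skips NULLs: (10 + 20) / 2 = 15.0
avg_ignoring_nulls = conn.execute("SELECT AVG(value) FROM scores").fetchone()[0]
# Counting NULL as zero requires COALESCE: (10 + 20 + 0) / 3 = 10.0
avg_null_as_zero = conn.execute(
    "SELECT AVG(COALESCE(value, 0)) FROM scores").fetchone()[0]
print(avg_ignoring_nulls, avg_null_as_zero)
```

Both queries are valid SQL; which one the user actually wanted depends on intent the model can only guess at.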
Building Your Own vs. Using a Purpose-Built Tool
If you're a developer evaluating whether to build a text-to-SQL layer internally, here's an honest assessment:
Building your own makes sense if:

- You need deep integration with internal systems or a fully custom interface
- Your schema, dialect, or security requirements are unusual enough that off-the-shelf tools don't fit
- You have the engineering capacity to build and maintain prompts, validation, retry logic, and result formatting

Using a purpose-built tool makes sense if:

- You want natural language querying working without a multi-month build-out
- Non-engineers on your team need self-serve access to the database
- You'd rather not own and operate LLM pipeline infrastructure
AI for Database handles the full pipeline — schema introspection, prompt engineering, LLM calls, validation, retry logic, result formatting, and dashboards — so your team gets natural language database access without building and maintaining the infrastructure.
A Note on Security and Read-Only Access
One concern with any tool that can run SQL against your database: what if it generates destructive queries?
Responsible text-to-SQL systems enforce read-only database connections at the connection level, not just through prompting. A SELECT-only PostgreSQL user physically cannot execute DELETE, UPDATE, or INSERT statements — regardless of what the LLM generates.
When connecting your database to AI for Database, the secure and recommended approach is to use a dedicated read-only role:
```sql
-- Create a read-only role
CREATE ROLE aifd_readonly WITH LOGIN PASSWORD 'your-password';

-- Grant connect and select on all current tables
GRANT CONNECT ON DATABASE your_database TO aifd_readonly;
GRANT USAGE ON SCHEMA public TO aifd_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO aifd_readonly;

-- Grant select on future tables too
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO aifd_readonly;
```

With this setup, even if the LLM somehow generated a malicious query, it would fail at the database permission level.
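The same principle, enforcing read-only at the connection rather than in the prompt, can be demonstrated with SQLite's read-only open mode (Postgres achieves it with the role above):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE users (id INTEGER)")
rw.commit()
rw.close()

# Open the same database read-only; writes fail no matter what SQL is sent.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("DELETE FROM users")
    outcome = "write succeeded"
except sqlite3.OperationalError as err:
    outcome = f"blocked: {err}"
print(outcome)
```

Reads still work on the same connection; only writes are rejected, and the rejection comes from the database engine, not from anything the LLM was told.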