What is Text-to-SQL?

Text-to-SQL is a technique that uses large language models to translate natural language questions into executable SQL queries -- enabling anyone to query databases without writing SQL by hand.

Databases hold answers to most business questions. But getting those answers requires SQL -- a skill that most stakeholders, analysts, and even many developers don't use daily. The gap between "I want to know X" and SELECT ... FROM ... WHERE ... is where text-to-SQL fits in.

Text-to-SQL systems accept a natural language question (e.g., "What were our top 10 customers by revenue last quarter?"), generate the corresponding SQL query, execute it against a database, and return the results. The translation is handled by a large language model (LLM) that has been given enough context about the database schema to produce valid, executable SQL.

This is not a new idea -- natural language interfaces to databases (NLIDBs) date back to the 1970s. What changed is that modern LLMs are good enough at SQL generation to make these systems practical for real workloads.

How Text-to-SQL Works

A text-to-SQL system has four stages: schema context injection, natural language parsing, SQL generation, and execution and result delivery.

Schema Context Injection

Before the LLM can generate SQL, it needs to understand the database structure. This means providing table names, column names, data types, primary and foreign key relationships, and ideally sample values or column descriptions. This context is injected into the LLM prompt alongside the user's question.

The quality of schema context directly determines output quality. An LLM that knows orders.customer_id references customers.id will generate correct joins. Without that context, it may hallucinate column names or produce syntactically valid but semantically wrong queries.

Schema context can be provided statically (a fixed schema description in the prompt) or dynamically (retrieved at query time based on the user's question). Dynamic retrieval is more scalable for large databases with hundreds of tables, because it limits the prompt to only the relevant tables.
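For a small schema, static injection can be as simple as prepending a schema description to the user's question. A minimal Python sketch; the table definitions and the `build_prompt` helper are illustrative, and the returned prompt would be passed to whatever LLM call the system uses:

```python
# Illustrative two-table schema; in practice this comes from the database catalog.
SCHEMA_CONTEXT = """\
Tables:
  customers(id INTEGER PRIMARY KEY, name TEXT, region TEXT)
  orders(id INTEGER PRIMARY KEY,
         customer_id INTEGER REFERENCES customers(id),
         total_amount NUMERIC, created_at TIMESTAMP)
"""

def build_prompt(question: str) -> str:
    """Inject the static schema description ahead of the user's question."""
    return (
        "You are a PostgreSQL expert. Using only the tables below, "
        "answer with a single SQL query.\n\n"
        f"{SCHEMA_CONTEXT}\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt("What were our top 10 customers by revenue last quarter?")
```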

Natural Language Parsing

The LLM interprets the user's intent from their natural language input. This involves:

  • Identifying the entities referenced (tables, columns, metrics)
  • Understanding temporal references ("last quarter," "year-over-year")
  • Resolving ambiguity ("revenue" might mean gross_revenue, net_revenue, or total_amount depending on the schema)
  • Recognizing aggregation intent ("top 10," "average," "total")

This step is where most errors originate. Natural language is inherently ambiguous, and the same question can map to different SQL queries depending on business context.

SQL Generation

The LLM produces a SQL query based on the parsed intent and schema context. A well-constructed prompt might yield:

SELECT c.name, SUM(o.total_amount) AS revenue
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.created_at >= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
  AND o.created_at < DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY c.name
ORDER BY revenue DESC
LIMIT 10

The generated SQL must be syntactically correct for the target database dialect (PostgreSQL, MySQL, DuckDB, etc.), use only columns and tables that actually exist, and correctly express the user's intent.

Execution and Result Delivery

The generated SQL is executed against the database, and results are returned to the user. In production systems, this step includes validation (checking the SQL for syntax errors or disallowed operations before execution), sandboxing (running the query with restricted permissions), and result formatting (converting tabular results into a human-readable response).
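These safeguards can be sketched with SQLite standing in for the target database; the `PRAGMA query_only` and progress-handler mechanisms shown are SQLite-specific, and other engines expose equivalent controls (read-only roles, statement timeouts):

```python
import sqlite3

def run_readonly(conn: sqlite3.Connection, sql: str, max_rows: int = 100):
    """Execute a query under read-only, row-capped, abortable conditions."""
    conn.execute("PRAGMA query_only = ON")         # reject any write statements
    conn.set_progress_handler(lambda: 0, 100_000)  # hook for aborting long queries
    cur = conn.execute(sql)
    return cur.fetchmany(max_rows)                 # enforce a result row limit

# Demo database (illustrative data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE orders(id INTEGER, total_amount REAL);"
    "INSERT INTO orders VALUES (1, 120.0), (2, 80.0);"
)
rows = run_readonly(conn, "SELECT id, total_amount FROM orders "
                          "ORDER BY total_amount DESC")
```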

NSQL: Natural SQL as an Emerging Concept

NSQL (natural SQL) refers to an emerging approach where the boundary between natural language and SQL becomes fluid. Rather than treating text-to-SQL as a strict translation problem -- natural language in, SQL out -- NSQL systems allow users to express queries in a mix of natural language and SQL fragments.

For example, a user might write: "Show me all orders WHERE total > 1000 from last week." The system interprets the natural language portions ("from last week") and passes through the SQL fragments (WHERE total > 1000) directly.

NSQL is still an early concept, but it reflects a practical reality: power users often know parts of the SQL they want and prefer to specify those directly rather than relying entirely on LLM interpretation.

Common Approaches to Text-to-SQL

Prompt Engineering with General-Purpose LLMs

The most common approach uses a general-purpose LLM (GPT-4, Claude, Llama) with a carefully engineered prompt that includes the database schema, query examples, and instructions for SQL generation. This requires no model training and can be deployed immediately.

The prompt typically includes:

  • The database schema (DDL statements or structured descriptions)
  • A few examples of natural language to SQL mappings (few-shot learning)
  • Instructions about the target SQL dialect
  • Business-specific terminology mappings

This approach works well for straightforward queries but can struggle with complex joins, subqueries, and domain-specific logic.
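A minimal sketch of assembling such a prompt; the few-shot pairs, dialect line, and helper name are all illustrative:

```python
# Hypothetical few-shot examples for the prompt (question, SQL) pairs.
FEW_SHOT = [
    ("How many orders were placed in 2024?",
     "SELECT COUNT(*) FROM orders "
     "WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';"),
    ("Average order value by region?",
     "SELECT c.region, AVG(o.total_amount) FROM orders o "
     "JOIN customers c ON o.customer_id = c.id GROUP BY c.region;"),
]

def few_shot_prompt(schema: str, question: str) -> str:
    """Combine dialect instruction, schema, examples, and the new question."""
    examples = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in FEW_SHOT)
    return (f"Dialect: PostgreSQL\n\nSchema:\n{schema}\n\n"
            f"{examples}\n\nQ: {question}\nSQL:")
```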

Fine-Tuned Models

Fine-tuning a base LLM on a dataset of (question, SQL) pairs from a specific database produces a model that is specialized for that schema. Fine-tuned models are typically more accurate for their target database than general-purpose models, but they require training data, compute resources, and retraining when the schema changes.

Open-source models like SQLCoder and NSQL-Llama have been fine-tuned specifically for text-to-SQL tasks and achieve competitive accuracy on benchmarks like Spider and BIRD.

RAG over Schema Metadata

For databases with hundreds or thousands of tables, including the full schema in every prompt is impractical. Retrieval-augmented generation (RAG) addresses this by retrieving only the relevant schema elements at query time.

When a user asks a question, the system:

  1. Embeds the question and searches a vector index of table and column descriptions
  2. Retrieves the most relevant tables and their schemas
  3. Constructs a prompt with only the relevant schema context
  4. Generates SQL using the focused context

This approach scales to large databases and reduces hallucination by limiting the schema surface area the LLM must reason about.
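The retrieval step (steps 1-2 above) can be illustrated with a toy bag-of-words cosine similarity standing in for a real embedding model and vector index; the table descriptions are made up for the example:

```python
import math
import re
from collections import Counter

# Illustrative natural language descriptions of each table.
TABLE_DOCS = {
    "customers": "customers and their name region and signup date",
    "orders": "orders placed by customers with total amount and created date",
    "inventory": "inventory stock levels sku and warehouse location",
}

def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tables(question: str, k: int = 2) -> list[str]:
    """Score every table description against the question; keep the top k."""
    q = bow(question)
    scored = sorted(((cosine(q, bow(doc)), name)
                     for name, doc in TABLE_DOCS.items()), reverse=True)
    return [name for _, name in scored[:k]]
```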

Challenges in Text-to-SQL

Ambiguity in Natural Language

"Show me active users" could mean users who logged in today, users with active subscriptions, or users who have made a purchase in the last 30 days. Without explicit business definitions, the LLM must guess -- and it often guesses wrong.

The most reliable mitigation is to include business glossaries or metric definitions in the prompt context: "active user = a user with at least one login event in the last 30 days."

Complex Joins and Multi-Table Queries

Single-table queries are relatively straightforward. Accuracy degrades significantly for queries that require multiple joins, correlated subqueries, window functions, or CTEs (common table expressions). These queries demand that the LLM correctly trace foreign key relationships across several tables -- a task that increases in difficulty with schema size.

Hallucinated Column and Table Names

LLMs can generate SQL that references columns or tables that don't exist. This is particularly common when the schema context is incomplete or when column names are ambiguous. A model might generate SELECT user_email FROM users when the actual column is email_address.

Schema validation before execution catches these errors, but it doesn't fix them. More advanced systems re-prompt the LLM with the error message, allowing it to self-correct.

SQL Dialect Differences

SQL is not a single language. PostgreSQL, MySQL, BigQuery, DuckDB, and SQL Server each have their own syntax for date functions, string operations, window functions, and type casting. A text-to-SQL system must generate queries in the correct dialect for the target database.
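As an illustration, here is how "truncate a timestamp to month" differs across dialects. The expressions reflect each engine's documented syntax as best understood; note that `DATETRUNC` requires SQL Server 2022 or later:

```python
# Dialect-specific expressions for truncating created_at to month start.
MONTH_TRUNC = {
    "postgresql": "DATE_TRUNC('month', created_at)",
    "duckdb":     "DATE_TRUNC('month', created_at)",
    "mysql":      "DATE_FORMAT(created_at, '%Y-%m-01')",
    "bigquery":   "DATE_TRUNC(created_at, MONTH)",
    "sqlserver":  "DATETRUNC(month, created_at)",  # SQL Server 2022+
}

def month_trunc(dialect: str) -> str:
    """Look up the month-truncation expression for the target dialect."""
    return MONTH_TRUNC[dialect.lower()]
```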

Evaluating Text-to-SQL Systems

Two primary metrics are used to evaluate text-to-SQL accuracy:

Execution accuracy measures whether the generated SQL, when executed, produces the correct result set. This is the more practical metric -- it doesn't matter if the SQL is different from the reference query as long as the results match.

Exact match accuracy measures whether the generated SQL exactly matches a reference query. This is stricter and less useful in practice, since many different SQL queries can produce the same results.
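Execution accuracy is typically computed by comparing result sets as order-insensitive multisets, since a query without an ORDER BY may return rows in any order. A small sketch:

```python
from collections import Counter

def results_match(generated_rows, reference_rows) -> bool:
    """Execution accuracy check: compare result sets as multisets of rows,
    ignoring row order but preserving duplicates."""
    return (Counter(map(tuple, generated_rows))
            == Counter(map(tuple, reference_rows)))
```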

On the Spider benchmark (a widely used text-to-SQL evaluation dataset), state-of-the-art systems achieve 85-90% execution accuracy on simple queries but drop to 50-70% on complex queries involving multiple joins, nested subqueries, and aggregations.

Production Patterns for Text-to-SQL

SQL Validation and Sandboxing

Never execute LLM-generated SQL without validation. At minimum, production systems should:

  • Parse the SQL and verify that all referenced tables and columns exist
  • Check for disallowed operations (DROP, DELETE, UPDATE in read-only contexts)
  • Execute the query with a read-only database user and strict resource limits (timeouts, row limits)
  • Log every generated query for audit and debugging
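The disallowed-operation check can be sketched as a naive keyword scan; this is illustrative only, since a production validator would inspect a parsed AST rather than raw tokens:

```python
import re

# Statement keywords disallowed in a read-only context.
DISALLOWED = {"DROP", "DELETE", "UPDATE", "INSERT",
              "ALTER", "TRUNCATE", "CREATE", "GRANT"}

def policy_check(sql: str) -> list[str]:
    """Return any disallowed keywords found in the query text.
    Tokens keep underscores, so a column like update_time is not flagged."""
    words = set(re.findall(r"[A-Za-z_]+", sql.upper()))
    return sorted(words & DISALLOWED)
```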

Result Verification

After execution, verify that the results are reasonable. Common checks include:

  • Row count sanity checks (a query asking for "top 10" should return at most 10 rows)
  • Type validation (a "revenue" column should contain numeric values)
  • NULL handling (unexpected NULLs often indicate a wrong join)
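The checks above can be sketched as a small post-execution verifier; the function signature and the column-index convention are assumptions for illustration:

```python
def verify_results(rows, expected_max_rows=None, numeric_columns=()):
    """Run row-count, type, and NULL sanity checks; return issues found."""
    issues = []
    if expected_max_rows is not None and len(rows) > expected_max_rows:
        issues.append(f"expected at most {expected_max_rows} rows, got {len(rows)}")
    for i in numeric_columns:
        for row in rows:
            v = row[i]
            if v is None:
                issues.append(f"NULL in numeric column {i}")  # often a wrong join
                break
            if not isinstance(v, (int, float)):
                issues.append(f"non-numeric value {v!r} in column {i}")
                break
    return issues
```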

Multi-Turn Refinement

When the first query doesn't match the user's intent, a conversational interface allows the user to refine their question. The system can use the previous query, its results, and the user's feedback to generate an improved query. This iterative approach significantly improves practical accuracy.

Text-to-SQL with SQL Federation

Text-to-SQL becomes more powerful when combined with SQL federation. In a federated environment, a single SQL query can access data from PostgreSQL, Databricks, S3, and dozens of other sources simultaneously. Text-to-SQL on top of federation means a user can ask a natural language question that spans multiple data systems -- without knowing where the data lives or how to connect to each source.

For example, "Compare customer satisfaction scores from our CRM with order volumes from the warehouse" becomes a federated query that joins data across two different systems, all triggered by a single natural language question.

The combination also benefits from LLM inference served alongside the data layer. When the LLM that generates SQL and the federation engine that executes it are co-located, the round-trip time from question to answer is minimized. Tool calling patterns enable LLMs to invoke SQL execution as a tool, creating agentic workflows where the model can iteratively query, inspect results, and refine its approach.

Spice provides both SQL federation and LLM inference in a single runtime, making it possible to build text-to-SQL applications that query across 30+ data sources with sub-second latency.

Advanced Topics

Schema-Aware Prompting

The difference between a text-to-SQL system that works in demos and one that works in production often comes down to how schema context is structured in the LLM prompt. Naive approaches dump the full DDL (CREATE TABLE statements) into the prompt and hope the model figures out the relationships. Production systems are more deliberate.

Effective schema-aware prompting includes several layers of context beyond raw DDL:

  • Column descriptions: Natural language annotations explaining what each column contains. orders.status might be an enum with values pending, shipped, delivered, cancelled -- without this context, the LLM cannot correctly filter by status.
  • Foreign key annotations: Explicit statements like "orders.customer_id references customers.id" guide the model toward correct joins. Without these, the model may join on column name similarity, which often produces wrong results.
  • Sample values: Including 3-5 representative values for categorical columns (e.g., region: ['us-east', 'us-west', 'eu-west', 'ap-southeast']) helps the LLM generate correct filter predicates.
  • Business glossary entries: Definitions like "active customer = a customer with at least one order in the last 90 days" resolve ambiguity before the model encounters it.

For large schemas, dynamic schema retrieval is essential. The system embeds the user's question, searches a vector index of table and column descriptions, and includes only the top-k most relevant tables in the prompt. This keeps the prompt focused and reduces hallucination of non-existent columns.

The end-to-end flow: natural language query → schema retrieval → prompt assembly → LLM generates SQL → SQL validation → query execution (if valid) → result formatting, with error feedback to the LLM for regeneration if validation fails.

Query Validation Pipelines

LLM-generated SQL cannot be trusted without validation. A production validation pipeline applies multiple checks before execution:

Syntactic validation parses the SQL using the target dialect's parser. This catches malformed queries, unclosed parentheses, and invalid keywords before they reach the database. Parsing also produces an AST (abstract syntax tree) that subsequent checks can inspect.

Schema validation verifies that every table and column referenced in the query actually exists in the database catalog. This is the most effective defense against hallucinated column names -- the most common failure mode in text-to-SQL systems. Schema validation also checks data types: if the query compares a string column to an integer, the validator flags the type mismatch.

Policy validation enforces security and governance rules. Common policies include: no DDL statements (DROP, ALTER, CREATE), no DML mutations (INSERT, UPDATE, DELETE) in read-only contexts, no queries against restricted tables or columns, and mandatory WHERE clauses on large tables to prevent full table scans. Policy validation inspects the AST to detect disallowed operations.

Cost estimation uses the database's EXPLAIN output to estimate query cost before execution. Queries that exceed a cost threshold (indicating a potential full table scan or cartesian join) are rejected or flagged for human review. This prevents runaway queries that could overload the database.

When validation fails, the most effective recovery strategy is to feed the error message back to the LLM and ask it to regenerate the query. Most validation errors (wrong column name, missing join condition) are easily correctable with one retry.

Multi-Turn SQL Conversations

Single-turn text-to-SQL -- one question, one query -- covers the simplest use cases. Production systems increasingly support multi-turn conversations where users iteratively refine their queries.

In a multi-turn flow, the system maintains a conversation context that includes: the original question, the generated SQL, the query results, and any follow-up questions. When the user says "now filter that to just the US region" or "break that down by month," the system must modify the previous query rather than generating a new one from scratch.

The technical challenge is context window management. Each turn adds the previous SQL and results to the prompt, consuming token budget. Production systems manage this by summarizing previous turns (replacing full result sets with row counts and column summaries), maintaining a running SQL query that accumulates modifications, and capping conversation depth (typically 5-10 turns before resetting context).
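That token-budget management might be sketched as follows; the class, field names, and summary format are illustrative:

```python
MAX_TURNS = 8  # cap conversation depth before context resets

class SqlConversation:
    """Maintains multi-turn context: running SQL plus summarized results."""

    def __init__(self, schema: str):
        self.schema = schema
        self.turns: list[dict] = []

    def add_turn(self, question: str, sql: str, rows: list) -> None:
        # Store a summary instead of the full result set to save tokens.
        summary = f"{len(rows)} rows" + (f", first row: {rows[0]}" if rows else "")
        self.turns.append({"question": question, "sql": sql, "summary": summary})
        self.turns = self.turns[-MAX_TURNS:]  # drop the oldest turns

    def prompt(self, follow_up: str) -> str:
        """Assemble the next prompt from schema, summarized history, follow-up."""
        history = "\n".join(
            f"Q: {t['question']}\nSQL: {t['sql']}\nResult: {t['summary']}"
            for t in self.turns
        )
        return f"Schema:\n{self.schema}\n\n{history}\n\nFollow-up: {follow_up}\nSQL:"
```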

Multi-turn conversations also enable a powerful debugging pattern: when a query returns unexpected results, the user can ask "why are there NULL values in the revenue column?" and the system can inspect the query, identify the likely cause (a LEFT JOIN that produced unmatched rows), and suggest a correction. This iterative refinement loop makes text-to-SQL practical for exploratory analysis where the user doesn't know the exact question upfront.

Text-to-SQL FAQ

What is the difference between text-to-SQL and NSQL?

Text-to-SQL treats the problem as a strict translation from natural language to SQL. NSQL (natural SQL) is an emerging concept that allows users to mix natural language and SQL fragments in a single query, giving power users more control over the generated output while still handling natural language interpretation for ambiguous parts.

How accurate is text-to-SQL in practice?

On standard benchmarks, state-of-the-art systems achieve 85-90% execution accuracy on simple queries (single table, basic filters and aggregations). Accuracy drops to 50-70% for complex queries involving multiple joins, subqueries, and window functions. In production, accuracy depends heavily on schema context quality, prompt engineering, and the complexity of the target database.

What are the security risks of text-to-SQL?

The primary risk is destructive or unauthorized SQL -- an LLM could generate DROP TABLE or DELETE FROM statements, whether through model error or prompt injection, if not properly constrained. Production systems mitigate this by executing generated queries with read-only database credentials, parsing and validating SQL before execution, disallowing DDL and DML operations, and enforcing query timeouts and row limits.

Can text-to-SQL handle complex multi-table queries?

It can, but accuracy decreases with query complexity. Queries requiring multiple joins, correlated subqueries, CTEs, and window functions are significantly harder for LLMs to generate correctly. Providing complete foreign key relationships and example queries in the prompt context improves results. RAG-based schema retrieval also helps by surfacing the most relevant tables.

Is text-to-SQL production-ready?

Text-to-SQL is production-ready for specific, well-scoped use cases -- particularly when paired with SQL validation, sandbox execution, result verification, and human-in-the-loop review for critical queries. It works best as an assistant that generates draft queries for review, rather than a fully autonomous system executing unchecked SQL against production databases.
