How to Do SQL Query Federation

SQL federation is a data access pattern that lets you query multiple, heterogeneous data sources through a single SQL interface -- without copying or moving data into a central store.

See Spice SQL federation

Read the docs

Most organizations store data across many systems: transactional databases like PostgreSQL and MySQL, analytical warehouses like Databricks and Snowflake, object stores like Amazon S3, and streaming platforms like Kafka. When an application or analyst needs to combine data from several of these systems, the traditional approach is to build ETL (extract, transform, load) pipelines that copy everything into a central warehouse.

SQL federation takes a different approach. Instead of moving data, a federation engine connects to each source at query time, translates a single SQL query into source-specific requests, and merges the results. The data stays where it is. The application sees a single, unified SQL interface.

How SQL Federation Works

A federated query goes through three stages: planning, execution, and merging.

Query Planning

When a query arrives, the federation engine parses the SQL and identifies which tables map to which data sources. It then builds an optimized execution plan. The planner determines which operations -- filters, joins, aggregations, sorts -- can be pushed down to each source system versus which must be handled in the federation layer.

This planning step is critical for performance. A well-optimized plan minimizes the amount of data transferred over the network by pushing as much work as possible to the sources.

Predicate and Aggregation Pushdown

Pushdown is the most important optimization in SQL federation. When the engine detects that a filter (e.g., WHERE created_at > '2026-01-01') or aggregation (e.g., COUNT(*), SUM(amount)) can be executed natively by the source database, it pushes that operation down rather than pulling all the raw data into the federation layer.

For example, consider a query that joins customer records from PostgreSQL with order events from Databricks, filtered to the last 30 days:

SELECT c.name, COUNT(o.id) as order_count
FROM postgres.customers c
JOIN databricks.orders o ON c.id = o.customer_id
WHERE o.created_at > NOW() - INTERVAL '30 days'
GROUP BY c.name

A federation engine with good pushdown will:

Push the WHERE o.created_at > ... filter to Databricks, so only recent orders are transferred
Potentially push the COUNT aggregation partially to each source
Pull only the filtered, reduced result sets into the federation layer for the final join

Without pushdown, the engine would pull every row from both tables and filter locally -- a much slower and more expensive operation.

Result Merging

After each source returns its partial results, the federation layer applies any remaining operations: cross-source joins, final sorts, limit clauses, and formatting. The merged result is returned to the application as a single result set, indistinguishable from a query against a single database.

SQL Federation vs. ETL Pipelines

ETL and SQL federation solve the same fundamental problem -- accessing data across systems -- but they make different tradeoffs.

Data movement: ETL copies data from sources into a central warehouse on a schedule. Federation queries data in place at runtime. ETL introduces storage duplication and pipeline maintenance. Federation eliminates both but depends on source availability at query time.

Data freshness: ETL pipelines run on schedules (hourly, daily), so warehouse data is always behind. Federation queries live sources, so results reflect the current state. For AI workloads, real-time dashboards, and operational applications, this freshness difference is significant.

Time to value: ETL requires schema design, transformation logic, and orchestration before data is queryable. Federation makes a new source available as soon as a connector is configured -- often in minutes.

Performance: Raw federation queries are bounded by source performance and network latency. ETL trades freshness for speed by pre-computing and co-locating data. The best systems combine both: federation for real-time access, with local acceleration caching for performance-critical queries.

Maintenance: ETL pipelines break when source schemas change, requiring manual fixes. Federation adapts more gracefully because queries execute against the current schema at runtime.

In practice, many production systems use both patterns. Federated queries handle real-time access and ad-hoc exploration, while acceleration caches -- kept fresh via change data capture -- provide sub-second performance for latency-sensitive workloads. This combined approach -- federation plus CDC-backed local caches, without centralized ETL pipelines -- is what architects describe as a zero-ETL architecture.

Key Benefits of SQL Federation

No ETL Pipelines to Maintain

Every ETL pipeline is a liability: it can break when source schemas change, it introduces data staleness, and it requires engineering time to build and monitor. Federation eliminates these pipelines for many use cases, reducing the operational burden on data teams.

Unified SQL Interface

Application developers write standard SQL against a single endpoint. The federation engine handles connectivity, dialect translation, and schema mapping across PostgreSQL, MySQL, S3, Databricks, and 30+ other sources. Teams don't need to learn each source's query language or manage separate connections.

Real-Time Data Access

Because queries execute against live sources, results reflect the current state of each system. This is critical for operational dashboards, AI workloads, and any application where stale data leads to bad decisions.

Governed, Secure Access

A federation layer provides a single point of access control, audit logging, and policy enforcement. Instead of managing permissions across every source individually, teams define policies once at the federation layer. This simplifies compliance and security, especially in regulated industries like financial services and cybersecurity.

Common SQL Federation Use Cases

Cross-Database Analytics

Join customer data in PostgreSQL with event data in Databricks and product data in Amazon S3 -- all in a single query. Federation eliminates the need to pre-join datasets in a warehouse, making it possible to run ad-hoc analytics across any combination of sources.

AI and Machine Learning Pipelines

AI models and retrieval-augmented generation (RAG) systems need fresh, complete data from multiple sources. Federation provides the real-time, multi-source data access that AI workloads require without building and maintaining separate data pipelines for each model.

Operational Data Lakehouses

Combine transactional databases with analytical stores and object storage into a single queryable layer. Federation bridges the gap between OLTP and OLAP workloads, enabling teams to query operational and analytical data together.

Data Mesh Architectures

In data mesh, each domain team owns its data products. Federation provides governed, cross-domain queries without centralizing data into a monolithic warehouse. Each team maintains autonomy over its data while the federation layer enables organization-wide access.

How to Set Up SQL Federation with Spice

Spice combines SQL federation and local acceleration in a single runtime. Here's how to federate queries across multiple data sources in practice.

Step 1: Define Your Data Sources

A spicepod.yaml file declares each data source and how Spice should connect to it. Each dataset maps a remote source to a local table name:

version: v1
kind: Spicepod
name: federated_analytics

datasets:
  # Transactional data from PostgreSQL
  - from: postgres:public.customers
    name: customers
    params:
      pg_host: db.example.com
      pg_port: '5432'
      pg_db: production
      pg_user: ${secrets:PG_USER}
      pg_pass: ${secrets:PG_PASS}

  # Analytical data from Databricks
  - from: databricks:catalog.schema.orders
    name: orders
    params:
      databricks_host: dbc-example.cloud.databricks.com
      databricks_token: ${secrets:DATABRICKS_TOKEN}

  # Product catalog from S3 Parquet files
  - from: s3://data-lake/products/
    name: products
    params:
      file_format: parquet

Run spice run and Spice connects to all three sources. No ETL pipelines, no data movement.

Step 2: Query Across Sources with Standard SQL

Once the sources are registered, query them with standard SQL as if they were tables in a single database. Spice handles dialect translation, connection management, and predicate pushdown automatically:

-- Join PostgreSQL customers with Databricks orders and S3 product data
SELECT c.name, p.category, COUNT(o.id) AS order_count, SUM(o.amount) AS total_spend
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN products p ON o.product_id = p.id
WHERE o.created_at > NOW() - INTERVAL '30 days'
GROUP BY c.name, p.category
ORDER BY total_spend DESC
LIMIT 20

Spice pushes the WHERE o.created_at > ... filter to Databricks, fetches only matching rows, and merges results locally. The application receives a single result set through Arrow Flight SQL, ODBC, JDBC, or HTTP.

Step 3: Add Acceleration for Performance-Critical Queries

Pure federation depends on source latency. For queries that need sub-second performance, add local acceleration with a few lines of configuration:

datasets:
  - from: postgres:public.customers
    name: customers
    acceleration:
      engine: arrow # In-memory acceleration
      refresh_mode: full
      refresh_check_interval: 10s

  - from: databricks:catalog.schema.orders
    name: orders
    acceleration:
      engine: arrow
      refresh_mode: full
      refresh_check_interval: 10m

With acceleration enabled, Spice caches the dataset locally and serves queries from the cache. For PostgreSQL, change data capture keeps the cache synchronized with the source in real time. For Databricks, periodic full refresh ensures the cache stays current.

This pattern -- federation for connectivity, acceleration for performance -- addresses the main limitation of pure federation while preserving real-time data freshness. Queries that hit the acceleration layer return in single-digit milliseconds, even when the underlying sources are slow or remote.

Step 4: Connect Your Application

Spice exposes federated and accelerated data through standard protocols. Applications connect using whichever client fits their stack:

Arrow Flight SQL for high-throughput columnar data transfer (Python, Go, Rust, Java)
HTTP/JSON for lightweight queries from any language
ODBC/JDBC for BI tools, dashboards, and legacy applications
OpenAI-compatible API for LLM inference and text-to-SQL

No custom SDKs or proprietary protocols. Teams query Spice the same way they query PostgreSQL or any other SQL database.

Advanced Topics

Federation Query Planning Internals

The query planner is the most performance-critical component in a federation engine. When a multi-source SQL query arrives, the planner must decompose it into a set of sub-queries that each target a single source, determine the optimal execution order, and decide which operations to execute locally versus remotely.

Modern federation engines build a logical plan tree from the parsed SQL, then apply a series of optimizer rules. The most impactful rules include join reordering (choosing which source to query first based on estimated selectivity), predicate pushdown (moving filters as close to the source as possible), and projection pruning (requesting only the columns needed by the final result, rather than SELECT * from each source).

The planner also handles type coercion between sources. PostgreSQL's TIMESTAMPTZ, Snowflake's TIMESTAMP_LTZ, and S3 Parquet's TIMESTAMP_MICROS all represent timestamps differently. The planner inserts cast operations to normalize types before cross-source joins.

Cost-Based Optimization

Simple rule-based planners apply optimizations in a fixed order, which works for straightforward queries but misses opportunities in complex ones. Cost-based optimizers (CBOs) estimate the execution cost of multiple candidate plans and select the cheapest one.

Cost estimation in federation is harder than in a single database because the planner must account for network transfer costs, source-specific query latency, and varying source capabilities. A CBO might estimate that pushing a GROUP BY to Databricks (which is optimized for large-scale aggregations) saves more time than pushing it to PostgreSQL, even if both support the operation. The planner assigns cost weights to network transfer, source compute, and local compute, then evaluates candidate plans against these weights.

In practice, federation engines maintain statistics about source performance -- average query latency, throughput capacity, and supported pushdown operations -- and use these to inform planning decisions.

Connection Pooling and Source Management

Each federated query opens connections to one or more source databases. Without connection pooling, a burst of concurrent queries could exhaust source connection limits and cause failures. Federation engines maintain connection pools for each configured source, reusing connections across queries and enforcing concurrency limits.

Connection pools also handle health checks and failover. If a source becomes temporarily unavailable, the pool marks it unhealthy and returns errors immediately rather than hanging on connection timeouts. When the source recovers, the pool resumes routing queries to it. For latency-sensitive workloads, some engines support read replicas or fallback sources -- if the primary source is slow, the query is routed to a replica or an acceleration cache instead.

SQL Federation FAQ

What is the difference between SQL federation and ETL?

ETL copies data from source systems into a central warehouse on a schedule. SQL federation queries data in place at runtime without copying it. ETL introduces latency and pipeline maintenance, while federation provides real-time access but depends on source availability. Many teams combine both: federated queries for real-time access and acceleration caches for performance-critical workloads.

How does SQL federation handle performance across slow data sources?

Federation engines use predicate pushdown to minimize data transfer and query acceleration (local caching) to avoid repeated round-trips to slow sources. Spice combines federation with local acceleration -- frequently accessed data is cached in-memory or on-disk, enabling sub-second queries even when the underlying source is slow or remote.

What types of data sources support SQL federation?

Most federation engines support relational databases (PostgreSQL, MySQL, SQL Server), analytical warehouses (Databricks, Snowflake, BigQuery), object stores (S3, Azure Blob, GCS), and streaming systems. The specific connectors vary by engine. Spice supports 30+ data connectors out of the box.

Is SQL federation suitable for production workloads?

Yes, when paired with query acceleration and proper governance. Raw federation without caching can be slow for latency-sensitive applications. Production-grade federation engines like Spice add local acceleration, connection pooling, and fault tolerance to ensure reliable sub-second performance.

How does SQL federation differ from data virtualization?

Data virtualization is the broader concept of abstracting data access across sources. SQL federation is a specific implementation that uses SQL as the query interface. All SQL federation is data virtualization, but data virtualization can also include REST APIs, GraphQL, or other query paradigms.

Learn more about SQL federation

Guides, blog posts, and documentation to help you get started with federated SQL queries.

Docs

Query Federation Docs

Learn how Spice federates SQL queries across multiple data sources with predicate pushdown and local acceleration.

Blog

Getting Started with Spice.ai SQL Query Federation & Acceleration

Learn how to use Spice.ai to federate and accelerate queries across operational and analytical systems with zero ETL.

Blog

How we use Apache DataFusion at Spice AI

A technical overview of how Spice extends Apache DataFusion with custom table providers, optimizer rules, and UDFs to power federated SQL, search, and AI inference.

Talk to an engineer

See Spice in action

Walk through your use case with an engineer and see how Spice handles federation, acceleration, and AI integration for production workloads.

Talk to an engineer