What is SQL Federation?
SQL federation is a data access pattern that lets you query multiple, heterogeneous data sources through a single SQL interface -- without copying or moving data into a central store.
Most organizations store data across many systems: transactional databases like PostgreSQL and MySQL, analytical warehouses like Databricks and Snowflake, object stores like Amazon S3, and streaming platforms like Kafka. When an application or analyst needs to combine data from several of these systems, the traditional approach is to build ETL (extract, transform, load) pipelines that copy everything into a central warehouse.
SQL federation takes a different approach. Instead of moving data, a federation engine connects to each source at query time, translates a single SQL query into source-specific requests, and merges the results. The data stays where it is. The application sees a single, unified SQL interface.
How SQL Federation Works
A federated query goes through three stages: planning, execution, and merging.
Query Planning
When a query arrives, the federation engine parses the SQL and identifies which tables map to which data sources. It then builds an optimized execution plan. The planner determines which operations -- filters, joins, aggregations, sorts -- can be pushed down to each source system versus which must be handled in the federation layer.
This planning step is critical for performance. A well-optimized plan minimizes the amount of data transferred over the network by pushing as much work as possible to the sources.
Predicate and Aggregation Pushdown
Pushdown is the most important optimization in SQL federation. When the engine detects that a filter (e.g., WHERE created_at > '2026-01-01') or aggregation (e.g., COUNT(*), SUM(amount)) can be executed natively by the source database, it pushes that operation down rather than pulling all the raw data into the federation layer.
For example, consider a query that joins customer records from PostgreSQL with order events from Clickhouse, filtered to the last 30 days:
SELECT c.name, COUNT(o.id) as order_count
FROM postgres.customers c
JOIN clickhouse.orders o ON c.id = o.customer_id
WHERE o.created_at > NOW() - INTERVAL '30 days'
GROUP BY c.nameA federation engine with good pushdown will:
- Push the
WHERE o.created_at > ...filter to Clickhouse, so only recent orders are transferred - Potentially push the
COUNTaggregation partially to each source - Pull only the filtered, reduced result sets into the federation layer for the final join
Without pushdown, the engine would pull every row from both tables and filter locally -- a much slower and more expensive operation.
Result Merging
After each source returns its partial results, the federation layer applies any remaining operations: cross-source joins, final sorts, limit clauses, and formatting. The merged result is returned to the application as a single result set, indistinguishable from a query against a single database.
SQL Federation vs. ETL Pipelines
ETL and SQL federation solve the same fundamental problem -- accessing data across systems -- but they make different tradeoffs.
Data movement: ETL copies data from sources into a central warehouse on a schedule. Federation queries data in place at runtime. ETL introduces storage duplication and pipeline maintenance. Federation eliminates both but depends on source availability at query time.
Data freshness: ETL pipelines run on schedules (hourly, daily), so warehouse data is always behind. Federation queries live sources, so results reflect the current state. For AI workloads, real-time dashboards, and operational applications, this freshness difference is significant.
Time to value: ETL requires schema design, transformation logic, and orchestration before data is queryable. Federation makes a new source available as soon as a connector is configured -- often in minutes.
Performance: Raw federation queries are bounded by source performance and network latency. ETL trades freshness for speed by pre-computing and co-locating data. The best systems combine both: federation for real-time access, with local acceleration caching for performance-critical queries.
Maintenance: ETL pipelines break when source schemas change, requiring manual fixes. Federation adapts more gracefully because queries execute against the current schema at runtime.
In practice, many production systems use both patterns. Federated queries handle real-time access and ad-hoc exploration, while acceleration caches -- kept fresh via change data capture -- provide sub-second performance for latency-sensitive workloads.
Key Benefits of SQL Federation
No ETL Pipelines to Maintain
Every ETL pipeline is a liability: it can break when source schemas change, it introduces data staleness, and it requires engineering time to build and monitor. Federation eliminates these pipelines for many use cases, reducing the operational burden on data teams.
Unified SQL Interface
Application developers write standard SQL against a single endpoint. The federation engine handles connectivity, dialect translation, and schema mapping across PostgreSQL, MySQL, S3, Databricks, and 30+ other sources. Teams don't need to learn each source's query language or manage separate connections.
Real-Time Data Access
Because queries execute against live sources, results reflect the current state of each system. This is critical for operational dashboards, AI workloads, and any application where stale data leads to bad decisions.
Governed, Secure Access
A federation layer provides a single point of access control, audit logging, and policy enforcement. Instead of managing permissions across every source individually, teams define policies once at the federation layer. This simplifies compliance and security, especially in regulated industries like financial services and cybersecurity.
Common SQL Federation Use Cases
Cross-Database Analytics
Join customer data in PostgreSQL with event data in Clickhouse and product data in Amazon S3 -- all in a single query. Federation eliminates the need to pre-join datasets in a warehouse, making it possible to run ad-hoc analytics across any combination of sources.
AI and Machine Learning Pipelines
AI models and retrieval-augmented generation (RAG) systems need fresh, complete data from multiple sources. Federation provides the real-time, multi-source data access that AI workloads require without building and maintaining separate data pipelines for each model.
Operational Data Lakehouses
Combine transactional databases with analytical stores and object storage into a single queryable layer. Federation bridges the gap between OLTP and OLAP workloads, enabling teams to query operational and analytical data together.
Data Mesh Architectures
In data mesh, each domain team owns its data products. Federation provides governed, cross-domain queries without centralizing data into a monolithic warehouse. Each team maintains autonomy over its data while the federation layer enables organization-wide access.
SQL Federation with Spice
Spice combines SQL federation and local acceleration in a single runtime. Queries are federated across 30+ data connectors with predicate pushdown, then frequently accessed data is automatically cached locally for sub-second performance.
This combination addresses the main limitation of pure federation -- source latency -- while preserving the benefits of real-time data access. Change data capture keeps acceleration caches synchronized with source systems, so cached data is always fresh.
Advanced Topics
Federation Query Planning Internals
The query planner is the most performance-critical component in a federation engine. When a multi-source SQL query arrives, the planner must decompose it into a set of sub-queries that each target a single source, determine the optimal execution order, and decide which operations to execute locally versus remotely.
Modern federation engines build a logical plan tree from the parsed SQL, then apply a series of optimizer rules. The most impactful rules include join reordering (choosing which source to query first based on estimated selectivity), predicate pushdown (moving filters as close to the source as possible), and projection pruning (requesting only the columns needed by the final result, rather than SELECT * from each source).
The planner also handles type coercion between sources. PostgreSQL's TIMESTAMPTZ, Clickhouse's DateTime64, and S3 Parquet's TIMESTAMP_MICROS all represent timestamps differently. The planner inserts cast operations to normalize types before cross-source joins.
Cost-Based Optimization
Simple rule-based planners apply optimizations in a fixed order, which works for straightforward queries but misses opportunities in complex ones. Cost-based optimizers (CBOs) estimate the execution cost of multiple candidate plans and select the cheapest one.
Cost estimation in federation is harder than in a single database because the planner must account for network transfer costs, source-specific query latency, and varying source capabilities. A CBO might estimate that pushing a GROUP BY to Clickhouse (which is optimized for aggregations) saves more time than pushing it to PostgreSQL, even if both support the operation. The planner assigns cost weights to network transfer, source compute, and local compute, then evaluates candidate plans against these weights.
In practice, federation engines maintain statistics about source performance -- average query latency, throughput capacity, and supported pushdown operations -- and use these to inform planning decisions.
Connection Pooling and Source Management
Each federated query opens connections to one or more source databases. Without connection pooling, a burst of concurrent queries could exhaust source connection limits and cause failures. Federation engines maintain connection pools for each configured source, reusing connections across queries and enforcing concurrency limits.
Connection pools also handle health checks and failover. If a source becomes temporarily unavailable, the pool marks it unhealthy and returns errors immediately rather than hanging on connection timeouts. When the source recovers, the pool resumes routing queries to it. For latency-sensitive workloads, some engines support read replicas or fallback sources -- if the primary source is slow, the query is routed to a replica or an acceleration cache instead.
SQL Federation FAQ
What is the difference between SQL federation and ETL?
ETL copies data from source systems into a central warehouse on a schedule. SQL federation queries data in place at runtime without copying it. ETL introduces latency and pipeline maintenance, while federation provides real-time access but depends on source availability. Many teams combine both: federated queries for real-time access and acceleration caches for performance-critical workloads.
How does SQL federation handle performance across slow data sources?
Federation engines use predicate pushdown to minimize data transfer and query acceleration (local caching) to avoid repeated round-trips to slow sources. Spice combines federation with local acceleration -- frequently accessed data is cached in-memory or on-disk, enabling sub-second queries even when the underlying source is slow or remote.
What types of data sources support SQL federation?
Most federation engines support relational databases (PostgreSQL, MySQL, SQL Server), analytical warehouses (Databricks, Snowflake, BigQuery), object stores (S3, Azure Blob, GCS), and streaming systems. The specific connectors vary by engine. Spice supports 30+ data connectors out of the box.
Is SQL federation suitable for production workloads?
Yes, when paired with query acceleration and proper governance. Raw federation without caching can be slow for latency-sensitive applications. Production-grade federation engines like Spice add local acceleration, connection pooling, and fault tolerance to ensure reliable sub-second performance.
How does SQL federation differ from data virtualization?
Data virtualization is the broader concept of abstracting data access across sources. SQL federation is a specific implementation that uses SQL as the query interface. All SQL federation is data virtualization, but data virtualization can also include REST APIs, GraphQL, or other query paradigms.
Learn more about SQL federation
Guides, blog posts, and documentation to help you get started with federated SQL queries.
Query Federation Docs
Learn how Spice federates SQL queries across multiple data sources with predicate pushdown and local acceleration.

Getting Started with Spice.ai SQL Query Federation & Acceleration
Learn how to use Spice.ai to federate and accelerate queries across operational and analytical systems with zero ETL.

How we use Apache DataFusion at Spice AI
A technical overview of how Spice extends Apache DataFusion with custom table providers, optimizer rules, and UDFs to power federated SQL, search, and AI inference.

See Spice in action
Get a guided walkthrough of how development teams use Spice to query, accelerate, and integrate AI for mission-critical workloads.
Get a demo