SQL Federation vs ETL
SQL federation and ETL are two approaches to querying data across distributed systems. Federation queries data in place at runtime. ETL copies data into a central store on a schedule. Each approach makes different tradeoffs -- and the best architectures often combine both.
Most organizations store data across many systems: transactional databases like PostgreSQL and MySQL, analytical warehouses like Databricks and Snowflake, object stores like Amazon S3, and SaaS platforms like Salesforce. When applications or analytics need to combine data from several of these systems, teams must choose how to bridge the gap.
The two dominant approaches are ETL (extract, transform, load) and SQL federation. ETL copies data from source systems into a central warehouse on a schedule. SQL federation queries data in place across sources at runtime using a single SQL interface. Neither approach is universally better -- they solve different problems and make different tradeoffs around freshness, performance, complexity, and cost.
This guide explains how each approach works, compares them across the dimensions that matter in production, and provides a framework for choosing the right pattern for your workloads.
How ETL Works
ETL is the traditional approach to centralizing data. A pipeline extracts data from source systems, transforms it into the target schema, and loads it into a central warehouse or data lake.
Extract
The pipeline connects to each source system and reads data -- either a full snapshot or an incremental batch based on timestamps or sequence numbers. Extraction can be scheduled (hourly, daily) or triggered by events.
Transform
Raw data is cleaned, normalized, and reshaped to match the target schema. Transformations may include deduplication, type casting, joining reference tables, computing derived columns, and enforcing data quality rules.
Load
Transformed data is written to the central warehouse, data lake, or lakehouse. The target system becomes the single source of truth for downstream consumers -- BI tools, dashboards, and analytical queries.
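The three stages can be sketched as a minimal pipeline. This is a toy illustration, not a production pattern: sqlite3 stands in for both the source database and the warehouse, and the table names (`orders`, `fact_orders`) and the timestamp watermark are assumptions for the example.

```python
import sqlite3

# Toy ETL pipeline: sqlite3 stands in for both the source and the warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1250, "2024-01-01T10:00:00"), (2, 830, "2024-01-02T09:30:00")],
)
warehouse.execute(
    "CREATE TABLE fact_orders (id INTEGER PRIMARY KEY, amount_usd REAL, updated_at TEXT)"
)

def run_pipeline(watermark: str) -> str:
    """Extract rows newer than the watermark, transform, load; return the new watermark."""
    # Extract: incremental batch based on a timestamp watermark.
    rows = source.execute(
        "SELECT id, amount_cents, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    # Transform: type-cast cents to dollars (a derived column).
    transformed = [(r[0], r[1] / 100.0, r[2]) for r in rows]
    # Load: upsert into the warehouse fact table.
    warehouse.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", transformed)
    warehouse.commit()
    return max((r[2] for r in rows), default=watermark)

watermark = run_pipeline("1970-01-01T00:00:00")
print(warehouse.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # → 2
```

Note how the latency mentioned below falls out of the design: downstream consumers only ever see data as of the last `run_pipeline` call.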
ETL pipelines are well-understood and widely supported by tools like Apache Airflow, dbt, Fivetran, and Apache Spark. They work reliably for batch analytics over historical data, but they introduce inherent latency: data in the warehouse is always at least as old as the last pipeline run.
How SQL Federation Works
SQL federation takes a different approach. Instead of moving data, a federation engine connects to each source at query time, translates a single SQL query into source-specific requests, and merges the results. The data stays where it is.
A federated query goes through three stages:
- Query planning: The engine parses the SQL, identifies which tables map to which sources, and builds an optimized execution plan with predicate and aggregation pushdown.
- Distributed execution: Sub-queries are dispatched to each source in parallel. Filters and aggregations are pushed down to minimize data transfer.
- Result merging: Partial results are joined, sorted, and formatted in the federation layer before being returned to the application.
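The three stages can be sketched with a toy federation layer. Two independent sqlite databases stand in for remote sources (the names `pg` and `s3` are illustrative only); the filter is pushed down to one source, both sub-queries run in parallel, and the join happens in the federation layer.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_source(ddl, insert_sql, rows):
    # check_same_thread=False lets worker threads use the connection.
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute(ddl)
    db.executemany(insert_sql, rows)
    return db

pg = make_source(
    "CREATE TABLE customers (id INTEGER, region TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "EU"), (2, "US"), (3, "EU")],
)
s3 = make_source(
    "CREATE TABLE orders (customer_id INTEGER, total REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (2, 50.0), (3, 25.0)],
)

def fetch(db, sql, params=()):
    # Distributed execution: each sub-query runs against its own source,
    # with the predicate pushed down so only matching rows come back.
    return db.execute(sql, params).fetchall()

with ThreadPoolExecutor() as pool:
    customers_f = pool.submit(fetch, pg, "SELECT id FROM customers WHERE region = ?", ("EU",))
    orders_f = pool.submit(fetch, s3, "SELECT customer_id, total FROM orders")
    eu_ids = {row[0] for row in customers_f.result()}
    # Result merging: join the partial results in the federation layer.
    eu_total = sum(total for cid, total in orders_f.result() if cid in eu_ids)

print(eu_total)  # → 125.0
```

A real engine additionally plans which joins and aggregations can run at the sources; this sketch only shows the dispatch-and-merge shape.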
Federation provides real-time access to data across 30+ source types through a single SQL endpoint. Applications see a unified interface regardless of where data is stored.
Comparison Table
The following table summarizes the key differences between SQL federation and ETL across the dimensions that matter most in production deployments.
| Dimension | SQL Federation | ETL |
|---|---|---|
| Data freshness | Real-time -- queries hit live sources | Batch -- as fresh as the last pipeline run (minutes to hours) |
| Data movement | None -- data stays in source systems | Full copy into a central warehouse or lake |
| Time to first query | Minutes -- configure a connector and query | Days to weeks -- design schemas, build transforms, orchestrate pipelines |
| Schema change handling | Automatic -- queries execute against the current schema | Manual -- pipeline breaks require code changes and redeployment |
| Query performance | Bounded by source latency and network; improved with acceleration | Fast for pre-computed, co-located data |
| Storage cost | No duplication | Duplicate storage in the warehouse |
| Operational overhead | Low -- no pipelines to monitor | High -- pipeline failures, scheduling, orchestration |
| Best for | Real-time access, ad-hoc queries, AI workloads | Batch analytics, historical reporting, compliance archives |
| Source availability dependency | High -- sources must be available at query time | Low -- warehouse is independent after loading |
Neither column is strictly better. The right choice depends on the workload requirements, which the decision framework below addresses.
When ETL Is the Right Choice
ETL remains the right approach for several well-defined scenarios.
Heavy Analytical Workloads on Historical Data
When analysts run complex aggregations, window functions, and multi-table joins over months or years of data, co-locating that data in a warehouse optimized for analytical queries delivers the best performance. Federation would require pulling large volumes of data over the network on every query.
Known, Stable Query Patterns
If the same set of reports and dashboards run daily against the same datasets, ETL's batch model is efficient. The upfront cost of building pipelines is amortized over many query executions, and the warehouse can be tuned for those specific access patterns.
Compliance and Audit Requirements
Some regulatory frameworks require durable, timestamped copies of data in a controlled environment. ETL into a governed warehouse or data lake satisfies these requirements by producing an immutable historical record.
Source Systems with Limited Query Capacity
If a source database cannot handle additional analytical query load -- for example, a production OLTP system under heavy write pressure -- extracting data on a schedule and querying the copy avoids adding load to the source.
When SQL Federation Is the Right Choice
Federation excels in scenarios where freshness, speed-to-value, and cross-source access matter more than pre-computed performance.
Real-Time Operational Applications
Applications that need current data from multiple systems -- operational dashboards, monitoring tools, customer-facing portals -- benefit from federation's real-time access. Stale data from a batch pipeline can lead to incorrect decisions or degraded user experiences.
AI and Machine Learning Workloads
AI models, retrieval-augmented generation (RAG) systems, and inference pipelines require fresh, multi-source data. Federation provides the real-time, cross-source access these workloads demand without building separate data pipelines for each model.
Ad-Hoc Exploration and Prototyping
When data teams need to explore a new data source or prototype a cross-system query, federation eliminates the weeks of pipeline engineering that ETL requires. Configure a connector and start querying in minutes.
Data Mesh Architectures
In data mesh, each domain team owns its data products. Federation enables governed, cross-domain queries without centralizing everything into a monolithic warehouse. Each team maintains autonomy while the organization gets unified access.
Decision Framework
Use the following questions to determine which approach fits each workload. In many organizations, the answer is "both" -- different workloads within the same architecture use different patterns.
1. How fresh does the data need to be?
- Seconds to minutes: Federation, potentially with change data capture for acceleration cache refresh
- Hours to days: ETL is sufficient
- Mixed requirements: Federation for real-time consumers, ETL for batch analytics
2. How predictable are the query patterns?
- Known, repeated queries: ETL can pre-compute and optimize for these patterns
- Ad-hoc, exploratory, or evolving: Federation adapts without pipeline changes
- Both: Accelerate known patterns locally; federate the rest on demand
3. What is the acceptable time-to-value?
- Minutes: Federation -- connect and query immediately
- Weeks are acceptable: ETL with proper schema design and pipeline engineering
- Start fast, optimize later: Federation first, add acceleration and ETL for mature workloads
4. What are the source system constraints?
- Sources can handle additional query load: Federation is straightforward
- Sources are capacity-constrained: extract with ETL during off-peak windows, or use federation with local acceleration so the cache absorbs repeated queries instead of the source
5. What is the data volume?
- Moderate (gigabytes): Federation with acceleration handles this well
- Very large (terabytes+): ETL into a data lake or lakehouse may be more practical for full-scan analytical queries
- Mixed: Federate smaller, real-time datasets; ETL larger, historical datasets
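The five questions above can be condensed into a coarse rule of thumb. The helper below is hypothetical: the thresholds (one hour of acceptable staleness, a terabyte of data) are illustrative defaults, not prescriptions, and real decisions also weigh source capacity and compliance needs.

```python
def recommend(freshness_seconds: int, ad_hoc: bool, data_gb: float) -> str:
    """Coarse per-workload recommendation: 'federation', 'etl', or 'hybrid'."""
    needs_realtime = freshness_seconds < 3600  # under an hour of staleness
    very_large = data_gb >= 1000               # terabyte-scale full scans
    if needs_realtime and very_large:
        # Federate the fresh slices, ETL the historical bulk.
        return "hybrid"
    if needs_realtime or ad_hoc:
        return "federation"
    return "etl"

print(recommend(30, ad_hoc=True, data_gb=5))         # → federation
print(recommend(86400, ad_hoc=False, data_gb=2000))  # → etl
```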
Summary Matrix
| Scenario | Recommended approach |
|---|---|
| Real-time dashboard over 3 databases | Federation |
| Monthly revenue report over 2 years of data | ETL |
| AI model needing fresh features from 5 sources | Federation with acceleration |
| Compliance archive of transactional records | ETL |
| Ad-hoc exploration of a new data source | Federation |
| High-frequency analytics on a data lake | ETL into lakehouse, or federation with acceleration |
Advanced Topics
Hybrid Architectures: Combining Federation and ETL
In practice, the most effective data architectures use both patterns. Federation handles real-time access and cross-source queries, while ETL pipelines populate warehouses and data lakes for heavy analytical workloads. The challenge is managing the boundary between the two.
A common hybrid pattern is federate-first with selective materialization. All data sources are accessible via federation by default. As query patterns mature and performance requirements become clear, specific datasets are materialized -- either through traditional ETL into a warehouse, or through local acceleration caches kept fresh by change data capture. This approach minimizes upfront pipeline engineering while providing an optimization path for production workloads.
The key architectural decision is where to place the boundary between federated and materialized data. Criteria include query frequency (datasets queried hundreds of times per minute should be materialized), latency sensitivity (sub-second requirements demand local acceleration or warehouse co-location), and data volume (very large datasets are expensive to federate repeatedly).
ELT and the Modern Data Stack
The traditional ETL sequence -- extract, transform, load -- has evolved into ELT (extract, load, transform), where raw data is loaded into the warehouse first and transformations happen inside the warehouse using SQL. Tools like dbt popularized this pattern by enabling transformation-as-code within the warehouse.
ELT addresses some of ETL's pain points: transformations are version-controlled, testable, and run inside a powerful SQL engine. But ELT still requires extraction pipelines, still introduces batch latency, and still duplicates data into a central store. For organizations evaluating federation vs. ETL, ELT shares most of ETL's tradeoffs -- the key distinction remains batch movement vs. real-time in-place access.
Federation complements ELT architectures by providing real-time access to data that hasn't yet been extracted and loaded. An application can query the federation layer for the freshest data while the ELT pipeline processes the same data for historical analytics on a schedule.
Federation Performance Optimization
Raw federation performance depends on source latency, network bandwidth, and the federation engine's ability to push computation down to sources. Several techniques close the gap between federation and co-located warehouse queries.
Predicate pushdown is the most impactful optimization. When the federation engine pushes WHERE clauses to the source, only matching rows are transferred -- reducing network transfer by orders of magnitude for selective queries.
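The effect is easy to quantify with a toy example. Here a single sqlite table stands in for a remote source (table and column names are made up for the illustration); compare the rows that would cross the network with and without the filter pushed down.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (id INTEGER, level TEXT)")
src.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, "ERROR" if i % 100 == 0 else "INFO") for i in range(10_000)],
)

# Without pushdown: fetch everything, then filter in the federation layer.
all_rows = src.execute("SELECT id, level FROM events").fetchall()
filtered_locally = [r for r in all_rows if r[1] == "ERROR"]

# With pushdown: the WHERE clause travels to the source; only matches return.
pushed = src.execute(
    "SELECT id, level FROM events WHERE level = ?", ("ERROR",)
).fetchall()

print(len(all_rows), len(pushed))  # → 10000 100
```

Same result set either way; the pushed-down version transfers 100x fewer rows for this selectivity.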
Parallel execution dispatches sub-queries to independent sources concurrently. A query joining data from PostgreSQL, S3, and Databricks issues all three sub-queries simultaneously rather than sequentially.
Result caching stores the results of expensive federated queries for a configurable TTL. Subsequent identical queries are served from cache without hitting the sources. This is particularly effective for dashboard queries that refresh on a fixed interval.
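A minimal TTL cache of this kind might look like the following sketch. The raw SQL string is used as the cache key for simplicity; a real engine would normalize the query or hash the plan, and `run_federated_query` is a stand-in for the expensive fan-out to remote sources.

```python
import time

class ResultCache:
    """Cache federated query results for a fixed TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries = {}  # sql -> (expires_at, rows)

    def get(self, sql):
        entry = self._entries.get(sql)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # miss or expired

    def put(self, sql, rows):
        self._entries[sql] = (time.monotonic() + self.ttl, rows)

calls = 0
def run_federated_query(sql):
    global calls
    calls += 1  # count trips to the sources
    return [("row",)]

cache = ResultCache(ttl_seconds=30)
sql = "SELECT * FROM orders WHERE region = 'EU'"
rows = cache.get(sql)
if rows is None:
    rows = run_federated_query(sql)
    cache.put(sql, rows)

# An identical follow-up query within the TTL never reaches the sources.
repeat = cache.get(sql)
print(calls)  # → 1
```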
Local acceleration goes further than result caching by materializing entire datasets locally. Instead of caching individual query results, the acceleration layer maintains a full, queryable copy of the dataset that is refreshed via CDC or scheduled sync. This enables sub-second performance for any query pattern against the accelerated dataset, not just previously executed queries.
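The distinction from result caching shows up in code: rather than storing query results, the acceleration layer keeps a full local table current by applying change events. In this sketch, plain dicts stand in for a real CDC feed and sqlite for the local store; the event shape is an assumption for the example.

```python
import sqlite3

# Local acceleration table, kept fresh by applying CDC events.
accel = sqlite3.connect(":memory:")
accel.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")

def apply_change(event):
    """Apply one change event (insert/update/delete) to the local copy."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        accel.execute("INSERT OR REPLACE INTO products VALUES (?, ?)",
                      (row["id"], row["price"]))
    elif op == "delete":
        accel.execute("DELETE FROM products WHERE id = ?", (row["id"],))
    accel.commit()

for ev in [
    {"op": "insert", "row": {"id": 1, "price": 9.99}},
    {"op": "insert", "row": {"id": 2, "price": 4.50}},
    {"op": "update", "row": {"id": 1, "price": 7.99}},
    {"op": "delete", "row": {"id": 2, "price": None}},
]:
    apply_change(ev)

# Any query pattern now runs against the local copy, not the source.
print(accel.execute("SELECT id, price FROM products").fetchall())  # → [(1, 7.99)]
```

Because the local copy is a real table, arbitrary filters, joins, and aggregations work against it, which is exactly what result caching alone cannot offer.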
These optimizations can be combined. A federation engine might push predicates to the source, execute sub-queries in parallel, serve frequently accessed datasets from local acceleration, and cache the merged results for identical follow-up queries.
How Spice Bridges Federation and ETL
Spice combines SQL federation and local data acceleration in a single runtime, providing a practical middle ground between pure federation and full ETL.
Queries are federated across 30+ data connectors with automatic predicate pushdown and parallel execution. For datasets that require lower latency than raw federation can deliver, Spice provides local acceleration -- caching data in-memory (Apache Arrow) or on-disk (DuckDB) with change data capture keeping the cache synchronized with source systems.
This hybrid approach gives teams the real-time access and operational simplicity of federation, with the performance characteristics of co-located data -- without building and maintaining traditional ETL pipelines. Data freshness is measured in seconds (via CDC) rather than hours (via batch ETL), and new data sources become queryable in minutes rather than weeks.
For large-scale data lake workloads, Spice provides acceleration engines optimized for high-throughput analytical queries over object storage, bridging the gap between data lake scalability and the sub-second performance that applications require.
The result is an architecture where federation, acceleration, and ETL coexist. Teams start with federation for immediate access, add acceleration for performance-critical datasets, and retain ETL pipelines only for workloads that genuinely require batch materialization into a central warehouse.
SQL Federation vs ETL FAQ
Can SQL federation completely replace ETL?
For many workloads, yes. Federation eliminates the need for ETL pipelines when the goal is real-time, cross-source data access. However, ETL remains valuable for heavy analytical workloads over large historical datasets, compliance archives, and scenarios where source systems cannot handle additional query load. Most production architectures use both patterns for different workloads.
Is SQL federation slower than querying a warehouse?
Raw federation can be slower because queries must travel over the network to source systems. However, federation engines with local acceleration close this gap by caching frequently accessed data in-memory or on-disk. With acceleration, federated queries can match warehouse performance while providing real-time data freshness that ETL cannot match.
How does federation handle source system outages?
If a source system is unavailable, federated queries that depend on it will fail. Production federation engines mitigate this with local acceleration caches that can serve queries even when the source is temporarily offline, connection pooling with health checks, and configurable fallback behavior. ETL avoids this issue because the warehouse is independent after data is loaded.
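The fallback behavior can be sketched as a simple try-then-degrade path. Everything here is hypothetical for illustration: `query_source` simulates an outage, and a dict stands in for the accelerated local copy.

```python
class SourceUnavailable(Exception):
    pass

def query_source(sql):
    # Simulate a source outage for the example.
    raise SourceUnavailable("connection refused")

# Accelerated local copy, keyed by query for this toy example.
accelerated_copy = {"SELECT count(*) FROM orders": [(42,)]}

def query_with_fallback(sql):
    """Try the live source; on failure, serve the (possibly stale) local copy."""
    try:
        return query_source(sql), "source"
    except SourceUnavailable:
        if sql in accelerated_copy:
            return accelerated_copy[sql], "acceleration-cache"
        raise  # no local copy: surface the failure to the caller

rows, served_from = query_with_fallback("SELECT count(*) FROM orders")
print(rows, served_from)  # → [(42,)] acceleration-cache
```

The tradeoff is explicit in the code: the fallback answer may lag the source by the cache's refresh interval, which is why production engines make this behavior configurable per dataset.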
What happens to ETL pipelines when source schemas change?
Schema changes are one of the biggest operational challenges with ETL. When a source adds a column, changes a data type, or renames a table, ETL pipelines break and require manual fixes. Federation handles schema changes more gracefully because queries execute against the current schema at runtime. The query adapts automatically as long as the referenced columns still exist.
Can I use federation and ETL together in the same architecture?
Yes, and this is the recommended approach for most organizations. Use federation for real-time access, ad-hoc queries, and AI workloads. Use ETL for batch analytics, historical reporting, and compliance. Spice supports this hybrid model natively -- federated queries and locally accelerated datasets coexist in a single runtime, so teams choose the right pattern per workload without managing separate infrastructure.
Learn more about SQL federation and data integration
Guides, blog posts, and documentation on federated queries, acceleration, and modern data architectures.
- Query Federation Docs -- Learn how Spice federates SQL queries across multiple data sources with predicate pushdown and local acceleration.
- Getting Started with Spice.ai SQL Query Federation & Acceleration -- Learn how to use Spice.ai to federate and accelerate queries across operational and analytical systems with zero ETL.
- How we use Apache DataFusion at Spice AI -- A technical overview of how Spice extends Apache DataFusion with custom table providers, optimizer rules, and UDFs to power federated SQL, search, and AI inference.