What is Zero-ETL?

Zero-ETL is a data architecture approach that eliminates traditional extract-transform-load pipelines. Instead of copying data into a central warehouse on a schedule, zero-ETL systems query data in place or use change data capture to keep local caches synchronized in real time.

ETL -- extract, transform, load -- has been the default approach to integrating data across systems for decades. Data is extracted from source systems on a schedule, transformed into a target schema, and loaded into a central warehouse where analysts and applications can query it. ETL works, but it introduces a persistent problem: data in the warehouse is always behind the source. Minutes-old or hours-old data leads to stale analytics, incorrect AI outputs, and delayed operational decisions.

Zero-ETL is the architectural response to this limitation. It describes data access patterns that eliminate or minimize the ETL step, either by querying source systems directly at runtime or by using event-driven mechanisms like change data capture (CDC) to keep local copies synchronized continuously. The result is data that reflects the current state of source systems -- not the state as of the last pipeline run.

The term is used in two overlapping ways. Cloud vendors (AWS, Google, Databricks) use it as a marketing term for near-zero-latency replication features -- where the "ETL" is automated but still happens. The architecture community uses it to describe genuinely pipeline-free patterns where data either is not moved at all, or is kept synchronized through CDC without manual pipeline code. This guide focuses on the architectural meaning.

Why ETL Pipelines Create Problems

Before evaluating zero-ETL, it is worth understanding what ETL pipelines actually cost in practice.

Staleness

ETL pipelines run on schedules. A daily pipeline means warehouse data is up to 24 hours old. An hourly pipeline means up to 60 minutes of staleness. For operational use cases -- detecting fraud, answering customer questions, making real-time recommendations -- even a few minutes of staleness can be unacceptable.

Fragility

ETL pipelines break when upstream schemas change. A source team adds a column, renames a field, or changes a data type, and downstream pipelines fail. Someone must diagnose the failure, fix the transformation code, backfill the gap, and redeploy. This maintenance burden compounds as the number of pipelines grows.

Time to value

Building an ETL pipeline requires schema design, transformation logic, orchestration tooling, and testing before any data is queryable. For exploratory analyses or new data sources, this overhead can take days or weeks.

Redundant storage

ETL copies data from sources into the warehouse. For large datasets, this doubles or triples storage costs. The copy in the warehouse is not the source of truth -- it is a snapshot that must be repeatedly refreshed to stay current.

What Zero-ETL Looks Like in Practice

Zero-ETL is not a single technology; it is a set of patterns that share the goal of making data accessible without manual pipeline code.

Pattern 1: SQL Federation

SQL federation queries data in place across multiple sources at runtime. A federation engine connects to PostgreSQL, Databricks, Amazon S3, and other sources, translates a single SQL query into source-specific requests, and merges the results. No data is copied. There are no pipelines to build or maintain.

-- Query across PostgreSQL and Databricks in a single statement
SELECT c.name, SUM(o.amount) AS total_spend
FROM postgres.customers c
JOIN databricks.orders o ON c.id = o.customer_id
WHERE o.created_at > NOW() - INTERVAL '30 days'
GROUP BY c.name
ORDER BY total_spend DESC

The trade-off with pure federation is that query performance is bounded by source latency. A query that joins a slow Snowflake warehouse with a fast PostgreSQL database is limited by Snowflake's response time.

Pattern 2: CDC-Backed Acceleration

Change data capture monitors database transaction logs and streams row-level changes -- inserts, updates, deletes -- to downstream consumers in real time. When paired with a local acceleration cache, CDC provides the best of both worlds: data is stored locally for fast queries, but the local copy is kept synchronized continuously with the source without manual pipeline code.

# spicepod.yaml: CDC-backed local acceleration -- no ETL pipeline required
datasets:
  - from: postgres:public.orders
    name: orders
    acceleration:
      engine: arrow
      refresh_mode: changes  # Log-based CDC
      refresh_check_interval: 1s

The local cache reflects source changes within seconds. No scheduled jobs, no transformation code, no pipeline orchestration.

Pattern 3: Direct Query Pushdown

Some modern data platforms (Snowflake Data Sharing, BigQuery Authorized Views, Databricks Delta Sharing) allow consumers to query data directly from the producer's storage without physically copying it. The query is pushed down to the producer's execution engine and the results are returned. This eliminates data movement while preserving the performance of the source engine.

Zero-ETL vs. Traditional ETL

The following comparison covers the key dimensions that matter when evaluating the approaches.

| Dimension | Zero-ETL (Federation + CDC) | Traditional ETL |
| --- | --- | --- |
| Data freshness | Real-time to near-real-time | Minutes to hours behind source |
| Data movement | None (federation) or CDC increments only | Full copy on each pipeline run |
| Time to first query | Minutes -- configure connectors and query | Days to weeks -- build and test pipelines |
| Maintenance burden | Low -- no pipeline code to maintain | High -- pipelines break on schema changes |
| Storage cost | No duplication (federation) or minimal delta (CDC) | Full duplicate at destination |
| Query performance | Depends on source latency; acceleration layers close the gap | Fast for pre-computed, co-located data |
| Source availability | Federated queries require source availability | Warehouse independent after load |
| Best for | Operational apps, real-time AI, live dashboards | Historical analytics, compliance archives, batch ML |

When Zero-ETL Is the Right Choice

Zero-ETL is not a universal replacement for ETL. It is better suited to some workloads than others.

Zero-ETL fits well when:

Real-time data is required. Any application that needs data reflecting the current state -- fraud detection, live dashboards, AI inference, real-time search -- benefits from zero-ETL. Scheduled ETL cannot serve these use cases without significant lag.

Multiple sources need to be combined. When a single query joins data from PostgreSQL, Databricks, and Amazon S3, SQL federation eliminates the need to pre-join datasets in a warehouse. No pipeline needs to be built for each combination.

Schema evolution is frequent. Federated queries execute against the current schema. There is no transformation code to update when an upstream team adds a column.

Time to query matters. Adding a new data source to a federation engine takes minutes -- configure the connector, then query. Building an ETL pipeline to the same source takes days.

ETL still fits well when:

Long-term historical analytics are needed. Data archives, compliance reporting, and historical trend analysis often operate over years of data. ETL and a well-designed warehouse schema are better suited to these workloads than real-time federation.

Complex multi-step transformations are required. If data must be significantly reshaped, enriched, or quality-checked before it is queryable, ETL's explicit transformation step is the right tool. Zero-ETL does not replace transformation logic -- it eliminates the extraction and loading overhead.

Source availability is unreliable. If a source system has frequent downtime, federation exposes that downtime directly to applications. ETL's decoupled warehouse buffers applications from source failures.

Query performance requires pre-computation. Materialized aggregations over terabytes of historical data are best served from a pre-computed warehouse table, not from a federated query at runtime.

When "Zero-ETL" Still Means ETL

The term zero-ETL is sometimes used to describe managed replication services that automate ETL rather than eliminate it. AWS's Aurora zero-ETL integration with Redshift, for example, replicates changes continuously -- the ETL step is handled by the platform, but it still occurs. This is useful, but it is not the same as eliminating the pipeline architecture.

The practical distinction: true zero-ETL means there is no centralized data copy that can go stale, no pipeline code that can break, and no replication lag from copying full tables on a schedule. CDC-backed acceleration approaches this but does involve local storage. Pure federation is the closest to "zero" data movement.

Advanced Topics

Predicate Pushdown in Federation

The performance of zero-ETL federation depends heavily on how aggressively the engine pushes predicates (filters) down to source systems. Without pushdown, a federated query that filters WHERE created_at > '2026-01-01' would pull all rows from the source and filter locally. With pushdown, the source executes the filter and returns only matching rows.
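As a toy illustration of the difference, the following Python sketch models a federated source table and counts how many rows cross the network with and without pushdown. All names and data are hypothetical; this is a model of the behavior, not a real federation API.

```python
# Toy model of predicate pushdown: the "source" holds rows; a federated
# query filters on created_at. Without pushdown, every row crosses the
# wire and is filtered locally; with pushdown, the source filters first.

SOURCE_ROWS = [
    {"id": i, "created_at": f"2026-01-{day:02d}"}
    for i, day in enumerate([1, 5, 10, 15, 20], start=1)
]

def fetch_without_pushdown(predicate):
    transferred = list(SOURCE_ROWS)          # full table crosses the network
    return [r for r in transferred if predicate(r)], len(transferred)

def fetch_with_pushdown(predicate):
    matching = [r for r in SOURCE_ROWS if predicate(r)]  # source applies filter
    return matching, len(matching)           # only matches cross the network

pred = lambda r: r["created_at"] > "2026-01-10"
rows_a, sent_a = fetch_without_pushdown(pred)
rows_b, sent_b = fetch_with_pushdown(pred)
assert rows_a == rows_b    # identical results either way
print(sent_a, sent_b)      # 5 rows transferred without pushdown vs. 2 with
```

The results are the same in both cases; pushdown changes only how much data moves, which is exactly where federated query latency is won or lost.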

Apache DataFusion -- the query engine underlying Spice -- applies multi-level pushdown: filter predicates, aggregation functions, and projection (column selection) are all pushed to sources when the connector supports it. For a query joining PostgreSQL and S3 Parquet files, the PostgreSQL connector generates a parameterized SQL query with the filter, and the Parquet reader skips row groups whose min/max statistics exclude the filter range. This minimizes data transfer and improves query latency substantially.
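The row-group pruning step can be sketched in a few lines: each group's min/max statistics determine whether it can possibly contain matching rows, and groups that cannot are never read. The structures below are simplified stand-ins, not the actual Parquet metadata format.

```python
# Sketch of min/max-based row-group skipping for a filter of the form
# created_at > lower_bound. A group whose max is at or below the bound
# cannot contain matching rows and is skipped without being read.

row_groups = [
    {"min": "2025-01-01", "max": "2025-06-30", "rows": 1_000_000},
    {"min": "2025-07-01", "max": "2025-12-31", "rows": 1_000_000},
    {"min": "2026-01-01", "max": "2026-06-30", "rows": 1_000_000},
]

def groups_to_scan(filter_lower_bound):
    # Keep only groups whose value range can contain matching rows.
    return [g for g in row_groups if g["max"] > filter_lower_bound]

scanned = groups_to_scan("2026-01-01")
print(len(scanned), "of", len(row_groups), "row groups scanned")
# Only the third group can hold rows with created_at > '2026-01-01',
# so two-thirds of the file is never read.
```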

CDC Exactly-Once Semantics

CDC-backed acceleration requires careful handling to avoid duplicating or missing changes. CDC consumers track their position in the source's transaction log (the Log Sequence Number, or LSN, in PostgreSQL). On restart, the consumer resumes from its last acknowledged LSN, ensuring no events are replayed or skipped.
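A minimal sketch of this checkpointing logic, with the transaction log modeled as an in-memory list. This is illustrative only -- a real consumer reads PostgreSQL's replication stream and persists the checkpoint durably -- but the resume invariant is the same.

```python
# Sketch of CDC position tracking: the consumer records the last
# acknowledged log position (LSN) and, after a restart, resumes from it
# so no change event is replayed or skipped.

log = [(lsn, f"change-{lsn}") for lsn in range(1, 8)]  # (LSN, event)

class CdcConsumer:
    def __init__(self):
        self.acked_lsn = 0      # a durable checkpoint in a real system
        self.applied = []

    def run(self, upto):
        for lsn, event in log:
            if lsn <= self.acked_lsn or lsn > upto:
                continue        # already applied, or not yet produced
            self.applied.append(event)
            self.acked_lsn = lsn

consumer = CdcConsumer()
consumer.run(upto=4)            # process events 1-4, then "crash"
saved = consumer.acked_lsn      # the checkpoint survives the restart

restarted = CdcConsumer()
restarted.acked_lsn = saved     # resume from the last acknowledged LSN
restarted.run(upto=7)
print(restarted.applied)        # events 5-7 only: no replay, no gap
```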

For the local acceleration cache, Spice uses upsert semantics: each change event is applied as an insert-or-update based on the primary key. This makes the consumer idempotent -- applying the same event twice produces the same result as applying it once. This avoids the need for distributed transactions while still achieving exactly-once logical outcomes.
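The idempotency property can be demonstrated in a few lines. The event shape below is hypothetical, not Spice's actual change format; what matters is that the apply step is keyed on the primary key.

```python
# Sketch of upsert semantics in a local acceleration cache: each CDC
# event is applied as insert-or-update keyed on the primary key, so
# applying the same event twice leaves the cache unchanged (idempotent).

cache = {}  # primary key -> row

def apply_event(event):
    if event["op"] == "delete":
        cache.pop(event["id"], None)
    else:  # inserts and updates collapse into one upsert path
        cache[event["id"]] = event["row"]

e1 = {"op": "upsert", "id": 1, "row": {"id": 1, "amount": 10}}
e2 = {"op": "upsert", "id": 1, "row": {"id": 1, "amount": 25}}

apply_event(e1)
apply_event(e2)
apply_event(e2)          # duplicate delivery after a retry
print(cache)             # {1: {'id': 1, 'amount': 25}} -- same as applying once
```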

Combining Federation and CDC Acceleration

The most performant zero-ETL architecture combines both patterns. Frequently accessed, latency-sensitive datasets are accelerated locally with CDC-based refresh. Infrequently accessed or cold datasets are federated on demand without local storage. Applications query through a unified SQL endpoint and are unaware of which tier serves each table.
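A minimal spicepod sketch of this tiering, mirroring the fields of the CDC example earlier: one hot dataset accelerated with CDC-based refresh, one cold dataset left federated with no local copy. Dataset names and paths are illustrative.

```yaml
# spicepod.yaml: hot dataset accelerated via CDC, cold dataset federated
datasets:
  - from: postgres:public.orders      # hot: latency-sensitive, served locally
    name: orders
    acceleration:
      engine: arrow
      refresh_mode: changes           # Log-based CDC keeps the cache live
  - from: s3://archive/orders_2019/   # cold: queried on demand at the source
    name: orders_archive              # no acceleration block -- pure federation
```

Applications query both tables through the same SQL endpoint; which tier serves each one is a configuration detail, not application logic.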

In a hybrid data architecture, application-local sidecars cache the hot working set for sub-millisecond reads, while a centralized cluster handles ingestion and serves cold queries. This extends the zero-ETL pattern across tiers while maintaining a unified query interface.

Zero-ETL with Spice

Spice is built around the zero-ETL principle. The SQL federation and acceleration platform connects to 30+ data sources without requiring ETL pipelines. Each dataset can be queried in-place through federation, accelerated locally through CDC-backed refresh, or both -- using a single declarative YAML configuration.

The real-time CDC feature supports log-based CDC from PostgreSQL, MySQL, and other sources. Changes flow from the source transaction log into the local acceleration cache within seconds, without Kafka, Debezium, or custom pipeline code for the most common use cases.

For teams building AI applications, RAG pipelines, or operational dashboards that need always-fresh data, Spice provides the zero-ETL foundation: query any source with SQL, cached locally for performance, kept live through CDC.

Zero-ETL FAQ

What does zero-ETL actually mean?

Zero-ETL refers to data access architectures that eliminate traditional extract-transform-load pipelines. The two main approaches are SQL federation (querying data in place across sources at runtime) and CDC-backed acceleration (keeping local caches synchronized in real time via change data capture). Zero-ETL data is always current -- not a snapshot from the last pipeline run.

Is zero-ETL the same as CDC?

No, though CDC is one technique used in zero-ETL architectures. Zero-ETL is a broader category: it includes SQL federation (no data movement at all), CDC-backed local caches (incremental sync without batch pipelines), and direct query pushdown patterns. CDC eliminates the batch extraction step by streaming changes in real time.

When should I still use ETL?

ETL remains appropriate for long-term historical archives, complex multi-step transformations, compliance reporting that requires auditable snapshots, and workloads where pre-computed aggregations over terabytes of historical data are needed. Zero-ETL is better suited to operational and real-time workloads where data freshness matters more than batch throughput.

Does zero-ETL mean no data transformation?

Not necessarily. Zero-ETL eliminates the extraction and loading overhead -- the pipeline code that copies data on a schedule. Transformations can still be applied at query time (via SQL views), at acceleration time (via materialized views in the local cache), or via event-driven compute triggered by CDC events. The difference is that transformation logic is not embedded in a fragile batch pipeline.

Can zero-ETL handle large datasets?

Yes. SQL federation handles large datasets through predicate pushdown, which minimizes data transferred from source systems. CDC-backed acceleration handles large datasets by syncing only changed rows rather than reloading the full table. For analytical queries over very large local caches, Spice uses columnar storage engines like DuckDB and Vortex (Cayenne) that deliver fast scans over billions of rows.

See Spice in action

Get a guided walkthrough of how development teams use Spice to query, accelerate, and integrate AI for mission-critical workloads.
