What is Change Data Capture?
Change data capture (CDC) is a data integration pattern that identifies and tracks row-level changes -- inserts, updates, and deletes -- in a database and delivers them to downstream systems in real time.
Keeping data synchronized across systems is one of the hardest problems in distributed architectures. A customer updates their shipping address in the transactional database, but the analytics dashboard still shows the old one. A product price changes, but the search index serves stale results for hours. An AI model generates answers based on yesterday's data because the vector index hasn't been refreshed.
The traditional solution is batch ETL: extract all data on a schedule, transform it, and load it into downstream systems. This works, but it introduces latency (minutes to hours), wastes resources (re-extracting unchanged data), and adds fragile pipeline infrastructure to maintain.
CDC solves this by streaming only the rows that changed, as they change. Instead of periodic bulk extracts, CDC monitors the database's internal change log and delivers a continuous stream of insert, update, and delete events to downstream consumers. The result is near-real-time data synchronization with minimal impact on the source database.
How CDC Works: Three Implementation Patterns
CDC can be implemented at different levels of the database stack. Each approach makes different tradeoffs between reliability, latency, and source impact.
Log-Based CDC
Log-based CDC reads the database's transaction log -- the internal record of every committed change. In PostgreSQL, this is the Write-Ahead Log (WAL). In MySQL, it's the binary log (binlog). In SQL Server, it's the transaction log.
This is the preferred approach for production workloads because:
- Zero application changes: The transaction log already exists. CDC reads it asynchronously without modifying queries, adding triggers, or changing schemas.
- Complete capture: Every committed change is captured, including deletes. Nothing is missed.
- Minimal source impact: Reading the log is an asynchronous, read-only operation. It adds negligible overhead to the source database.
- Ordering guarantees: Changes are read in commit order, preserving transactional consistency.
The main limitation is that transaction log formats are database-specific. Each database requires its own CDC connector, and log retention policies must be configured to keep logs available long enough for the CDC process to read them.
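As a concrete illustration of what a log-based connector produces, the sketch below decodes a wal2json-style message (one of PostgreSQL's logical decoding output formats) into simple change events. The payload is hand-written for illustration, not captured from a real replication slot:

```python
import json

# Illustrative wal2json-style message: one transaction with a single UPDATE.
# Real payloads arrive via a logical replication slot; this one is hand-written.
raw = json.dumps({
    "change": [{
        "kind": "update",
        "schema": "public",
        "table": "customers",
        "columnnames": ["id", "shipping_address"],
        "columnvalues": [42, "1 New Street"],
        "oldkeys": {"keynames": ["id"], "keyvalues": [42]},
    }]
})

def decode_changes(message: str):
    """Turn a wal2json-style message into simple (op, table, row) tuples."""
    events = []
    for change in json.loads(message)["change"]:
        # DELETE events carry only "oldkeys", so the new-row image may be absent.
        row = (dict(zip(change["columnnames"], change["columnvalues"]))
               if "columnnames" in change else None)
        events.append((change["kind"], change["table"], row))
    return events

print(decode_changes(raw))
# -> [('update', 'customers', {'id': 42, 'shipping_address': '1 New Street'})]
```

The consumer never touches the source tables; it only parses what the log already recorded, which is why log-based CDC adds so little load.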
Trigger-Based CDC
Database triggers fire on insert, update, or delete operations and write change records to a separate tracking table. An external process then reads the tracking table and forwards changes to downstream systems.
Trigger-based CDC works on databases that don't expose their transaction logs (some older or proprietary systems), but it has significant drawbacks:
- Write overhead: Every write operation on the source table triggers an additional write to the tracking table, increasing latency and I/O
- Schema maintenance: The tracking table must be maintained alongside the source schema
- Performance impact: Under high write loads, triggers can become a bottleneck
Polling-Based CDC
A process periodically queries the source table using a timestamp column (updated_at) or incrementing sequence number to detect new or changed rows. This is the simplest approach to implement but the least reliable:
- Misses deletes: Without a soft-delete pattern, there's no way to detect that a row was removed
- Latency proportional to poll interval: With a 5-minute polling interval, changes can sit undetected for up to 5 minutes
- Source load: Frequent polling adds query load to the source database
Polling-based CDC is useful for prototyping or for sources that support no other mechanism, but it's generally not suitable for production real-time workloads.
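The polling approach boils down to a watermark query over a timestamp column. A minimal sketch, using an in-memory list as a stand-in for the source table:

```python
from datetime import datetime

# Stand-in for a source table with an updated_at column.
rows = [
    {"id": 1, "name": "widget", "updated_at": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "name": "gadget", "updated_at": datetime(2024, 1, 1, 10, 7)},
]

def poll_changes(table, watermark):
    """Return rows changed since the watermark, plus the new watermark.
    Equivalent to: SELECT * FROM t WHERE updated_at > :watermark.
    Note the core weakness: a row deleted from the table simply
    disappears -- no query against the table can observe the delete."""
    changed = [r for r in table if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

watermark = datetime(2024, 1, 1, 10, 5)
changed, watermark = poll_changes(rows, watermark)
print([r["id"] for r in changed])  # -> [2]
```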
CDC Pipeline Architecture
A production CDC pipeline has three components: the change capture mechanism, a transport layer, and downstream consumers.
Capture
The CDC connector monitors the source database and emits a stream of change events. Each event includes:
- Operation type: INSERT, UPDATE, or DELETE
- Before state: The row values before the change (for updates and deletes)
- After state: The row values after the change (for inserts and updates)
- Metadata: Timestamp, transaction ID, source table, and schema information
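The event shape described above can be modeled roughly as follows; the field names are illustrative, not any particular connector's wire format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    op: str                  # "INSERT", "UPDATE", or "DELETE"
    table: str
    before: Optional[dict]   # row values before the change; None for inserts
    after: Optional[dict]    # row values after the change; None for deletes
    tx_id: int               # transaction metadata
    committed_at: str

event = ChangeEvent(
    op="UPDATE",
    table="customers",
    before={"id": 7, "address": "old"},
    after={"id": 7, "address": "new"},
    tx_id=1001,
    committed_at="2024-01-01T10:00:00Z",
)
print(event.op, event.after["address"])  # -> UPDATE new
```

Carrying both the before and after state is what lets consumers do more than blind overwrites -- for example, detecting which specific fields changed.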
Transport
Change events are typically published to a message broker or streaming platform -- Kafka is the most common choice. The transport layer provides durability (events aren't lost if a consumer is temporarily offline), ordering (events from the same table are delivered in commit order), and fan-out (multiple consumers can independently read the same stream).
For simpler architectures, CDC can also be consumed directly without a message broker. Some systems, like Spice, provide built-in CDC consumption that eliminates the need for a separate streaming layer.
Consumers
Downstream systems consume change events and update their local state. Common consumers include:
- Analytics databases: Keep analytical copies synchronized with transactional sources
- Search indexes: Update Elasticsearch or OpenSearch indexes as data changes
- Cache layers: Invalidate or refresh Redis or Memcached entries when source data changes
- Vector indexes: Re-embed and re-index documents for RAG systems as content is updated
- Acceleration layers: Refresh local query caches used by SQL federation engines
CDC Use Cases
Real-Time Analytics and Dashboards
Stream changes from transactional databases into analytical systems so dashboards and reports reflect the current state of the business. Instead of waiting for the next ETL batch, every change is visible within seconds.
This is particularly important for operational dashboards -- monitoring inventory levels, tracking order fulfillment, or observing system health -- where stale data leads to wrong decisions.
Cache and Search Index Synchronization
Application caches (Redis, Memcached) and search indexes (Elasticsearch, OpenSearch) go stale when source data changes. Without CDC, teams resort to time-based expiration (which causes periodic staleness) or manual invalidation logic (which is error-prone and hard to maintain).
CDC automates this entirely: when a row changes in the source database, the corresponding cache entry or search index document is updated within seconds. No manual invalidation, no stale reads.
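A minimal consumer that mirrors change events into a cache might look like this; the dict stands in for Redis or Memcached, and the event shape is illustrative:

```python
cache = {}  # stand-in for Redis/Memcached, keyed by (table, primary key)

def apply_to_cache(event):
    """Mirror a change event into the cache: upsert on INSERT/UPDATE,
    evict on DELETE. No TTLs, no manual invalidation logic."""
    key = ("customers", (event["after"] or event["before"])["id"])
    if event["op"] == "DELETE":
        cache.pop(key, None)
    else:
        cache[key] = event["after"]

apply_to_cache({"op": "INSERT", "after": {"id": 1, "name": "Ada"}, "before": None})
apply_to_cache({"op": "UPDATE", "after": {"id": 1, "name": "Ada L."},
                "before": {"id": 1, "name": "Ada"}})
print(cache[("customers", 1)]["name"])  # -> Ada L.

apply_to_cache({"op": "DELETE", "after": None, "before": {"id": 1, "name": "Ada L."}})
print(("customers", 1) in cache)  # -> False
```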
AI and RAG Pipeline Freshness
Retrieval-augmented generation systems depend on vector indexes that represent the current state of source data. If the vector index is rebuilt nightly, answers can be based on data that is up to a day old.
CDC enables incremental index updates: when a document changes in the source database, only that document is re-embedded and re-indexed. This keeps RAG retrieval fresh without the cost of full re-indexing.
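Incremental re-indexing reduces to: on each change event, re-embed only the affected document. In the sketch below, embed() is a placeholder for a real embedding model and the dict stands in for a vector store:

```python
vector_index = {}  # doc_id -> (embedding, text); stand-in for a vector store

def embed(text: str) -> list:
    # Placeholder: a real system calls an embedding model here.
    return [float(len(text))]

def on_document_change(event):
    """Re-embed only the changed document instead of rebuilding the index."""
    if event["op"] == "DELETE":
        vector_index.pop(event["before"]["doc_id"], None)
        return
    doc = event["after"]
    vector_index[doc["doc_id"]] = (embed(doc["text"]), doc["text"])

on_document_change({"op": "INSERT", "after": {"doc_id": "a", "text": "hello"}})
on_document_change({"op": "UPDATE",
                    "after": {"doc_id": "a", "text": "hello world"},
                    "before": {"doc_id": "a", "text": "hello"}})
print(vector_index["a"][1])  # -> hello world
```

One document changed, one embedding call made -- versus re-embedding the entire corpus on a nightly rebuild.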
Event-Driven Microservices
CDC turns database changes into a stream of events that microservices can react to. Instead of services polling each other for updates, changes propagate automatically through the event stream. A common refinement is the "outbox pattern": a service writes event records to a dedicated outbox table in the same transaction as its state change, and CDC streams that table to other services -- decoupling them while keeping them synchronized, without exposing the service's internal schema.
Data Lake Ingestion
Continuously stream changes from operational databases into data lakes (S3, GCS, Azure Blob) in formats like Apache Parquet or Apache Iceberg. This replaces batch export jobs and ensures the data lake reflects the current state of operational systems.
CDC Best Practices
Schema Evolution Handling
Source database schemas change over time -- columns are added, data types are modified, tables are renamed. A robust CDC pipeline must handle these changes gracefully. The most common approaches are:
- Schema registries that track schema versions and ensure consumers can handle multiple versions
- Automatic schema migration where the CDC pipeline detects changes and applies them downstream
- Backward-compatible changes enforced by policy (for example, only additive columns), so existing consumers keep working as the schema evolves
Monitoring and Alerting
CDC pipelines should be monitored for:
- Lag: The time between when a change is committed at the source and when it's applied downstream. Increasing lag indicates the pipeline is falling behind.
- Error rates: Failed events that couldn't be applied downstream
- Log retention: If the source database's transaction log is truncated before CDC reads it, changes are lost permanently
Initial Load
When a CDC pipeline is first set up, the downstream system needs a full snapshot of the current source data. This "initial load" or "snapshot" must be coordinated with the CDC stream to avoid duplicates or gaps. Most production CDC tools handle this automatically.
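The snapshot/stream handoff works by recording the stream position at the moment the snapshot is taken, then applying only events committed after that position. A sketch with toy integer positions standing in for LSNs:

```python
def initial_load(snapshot_rows, snapshot_position, stream):
    """Apply a snapshot, then replay only change events committed after it."""
    state = {row["id"]: row for row in snapshot_rows}
    for event in stream:
        if event["position"] <= snapshot_position:
            continue  # already reflected in the snapshot; skip to avoid duplicates
        if event["op"] == "DELETE":
            state.pop(event["row"]["id"], None)
        else:
            state[event["row"]["id"]] = event["row"]
    return state

snapshot = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
stream = [
    {"position": 90, "op": "UPDATE", "row": {"id": 1, "v": "a"}},    # pre-snapshot
    {"position": 110, "op": "UPDATE", "row": {"id": 2, "v": "b2"}},  # post-snapshot
]
state = initial_load(snapshot, snapshot_position=100, stream=stream)
print(state[2]["v"])  # -> b2
```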
CDC with Spice
Spice uses CDC to keep accelerated datasets synchronized with source systems. When source data changes, the CDC pipeline detects the change and updates the local acceleration cache within seconds. This enables federated SQL queries that are both real-time (reflecting the latest source state) and fast (served from local acceleration).
Spice supports CDC from PostgreSQL, MySQL, and other common sources, with built-in change detection that eliminates the need for a separate streaming infrastructure like Kafka for many use cases.
Advanced Topics
WAL Internals and Logical Replication
Log-based CDC in PostgreSQL reads from the Write-Ahead Log (WAL), which is the database's crash-recovery mechanism. Every committed transaction is first written to the WAL before being applied to the actual data files. CDC leverages this by attaching a logical replication slot to the WAL, which tells PostgreSQL to retain log segments until the CDC consumer has acknowledged them.
A logical replication slot decodes the raw WAL bytes into structured change events using an output plugin (e.g., pgoutput or wal2json). The output plugin determines the format of change events -- whether they include full row images, old values for updated columns, or just the changed fields. Configuring REPLICA IDENTITY FULL on a table ensures that UPDATE and DELETE events include the complete before-state of the row, which is critical for consumers that need to maintain materialized views or detect specific field-level changes.
The key operational concern with WAL-based CDC is slot management. If a CDC consumer goes offline for an extended period, the replication slot prevents PostgreSQL from reclaiming WAL segments. This can cause disk usage to grow unbounded, eventually filling the disk and crashing the database. Production CDC deployments must monitor replication slot lag and set maximum retention policies to prevent this failure mode.
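Slot lag can be measured directly from PostgreSQL's pg_replication_slots view (PostgreSQL 10+). The sketch below pairs that query with a simple budget check; the query results shown are hypothetical:

```python
# PostgreSQL query measuring how much WAL each replication slot
# is forcing the server to retain (bytes between the current WAL
# position and the slot's restart point):
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""

def slots_over_budget(slot_rows, max_retained_bytes):
    """Flag slots whose retained WAL exceeds a budget; these risk filling
    the disk if their consumer stays offline."""
    return [name for name, retained in slot_rows if retained > max_retained_bytes]

# Hypothetical query results: (slot_name, retained_bytes).
rows = [("cdc_main", 50_000_000), ("cdc_stale", 12_000_000_000)]
print(slots_over_budget(rows, max_retained_bytes=1_000_000_000))  # -> ['cdc_stale']
```

Alerting on this number, rather than on disk usage alone, catches a stalled consumer before it endangers the database.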
Exactly-Once Delivery Semantics
Distributed systems offer three delivery guarantees: at-most-once (changes may be lost), at-least-once (changes may be duplicated), and exactly-once (each change is applied precisely once). CDC pipelines must handle the gap between at-least-once delivery (which most transport layers provide) and exactly-once semantics (which consumers require).
The standard approach is idempotent consumers. Instead of trying to guarantee that each change event is delivered exactly once (which is impractical in distributed systems), the consumer is designed so that applying the same event multiple times produces the same result. For database targets, this means using UPSERT (INSERT ... ON CONFLICT UPDATE) instead of plain INSERT. For search indexes, it means writing documents with deterministic IDs so that re-applying an update overwrites the previous version.
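The upsert approach can be demonstrated end to end with SQLite (3.24+), which supports the same INSERT ... ON CONFLICT shape as PostgreSQL and stands in for the target database here:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def apply_event(row):
    """Idempotent apply: running this twice for the same event leaves
    the same final state, so duplicate delivery is harmless."""
    db.execute(
        "INSERT INTO customers (id, name) VALUES (:id, :name) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        row,
    )

apply_event({"id": 1, "name": "Ada"})
apply_event({"id": 1, "name": "Ada"})  # redelivered: no error, no duplicate row
print(db.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # -> 1
```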
When idempotency is insufficient -- for example, when the consumer maintains counters or running aggregates -- the consumer must track its position in the change stream (the Log Sequence Number, or LSN, in PostgreSQL terms) and store it transactionally alongside the applied changes. On recovery, the consumer resumes from its last committed LSN, ensuring no events are processed twice.
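Position tracking for a non-idempotent consumer (a running total) can be sketched like this, again with SQLite standing in for the target and toy integer LSNs. The key point is that the total and the stored position move in one transaction:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE totals (key TEXT PRIMARY KEY, value INTEGER);
    CREATE TABLE cdc_position (consumer TEXT PRIMARY KEY, lsn INTEGER);
    INSERT INTO cdc_position VALUES ('orders', 0);
""")

def apply_event(lsn, amount):
    """Apply a running-total update and advance the stored position in the
    same transaction, skipping any LSN that was already applied."""
    with db:  # one transaction: either both writes commit or neither does
        (last,) = db.execute(
            "SELECT lsn FROM cdc_position WHERE consumer = 'orders'").fetchone()
        if lsn <= last:
            return  # already applied before a crash or redelivery; skip
        db.execute(
            "INSERT INTO totals VALUES ('revenue', :a) "
            "ON CONFLICT(key) DO UPDATE SET value = value + :a", {"a": amount})
        db.execute(
            "UPDATE cdc_position SET lsn = :lsn WHERE consumer = 'orders'",
            {"lsn": lsn})

apply_event(1, 100)
apply_event(1, 100)  # redelivered: skipped, total not double-counted
apply_event(2, 50)
print(db.execute("SELECT value FROM totals WHERE key='revenue'").fetchone()[0])
# -> 150
```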
CDC at Scale
Scaling CDC introduces challenges that don't exist in single-source deployments. When dozens or hundreds of tables must be captured simultaneously, the CDC system must manage connection limits, replication slot resources, and downstream throughput.
Partitioning the change stream by table or by key range allows parallel processing. Events for independent tables can be consumed by separate workers without coordination. Events within a single table can be partitioned by primary key, enabling parallel consumers that each handle a subset of rows -- as long as ordering is maintained within each partition.
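Key-range partitioning reduces to routing each event by a hash of its primary key, so all events for one row land in the same partition in commit order. A minimal sketch (integer keys keep Python's hash deterministic; real systems use a stable hash function):

```python
from collections import defaultdict

def partition(events, num_partitions):
    """Route events to partitions by hashing the primary key, so every
    event for a given row lands in one partition and stays in order."""
    partitions = defaultdict(list)
    for event in events:
        partitions[hash(event["pk"]) % num_partitions].append(event)
    return partitions

events = [{"pk": 1, "op": "INSERT"},
          {"pk": 2, "op": "INSERT"},
          {"pk": 1, "op": "UPDATE"}]
parts = partition(events, num_partitions=2)
print([e["op"] for e in parts[1]])  # both pk=1 events, still in commit order
```

Each partition can now be drained by an independent worker without cross-worker coordination.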
Backpressure management is critical at scale. If a downstream consumer slows down (due to indexing lag, network congestion, or resource contention), the CDC pipeline must buffer events without dropping them and without allowing unbounded memory growth. Production systems use bounded buffers with overflow to persistent storage -- writing excess events to disk when in-memory buffers fill, then draining the disk buffer when the consumer catches up.
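A bounded buffer with disk overflow can be sketched as follows; this is a simplified single-threaded illustration, not a production implementation:

```python
import json
import tempfile
from collections import deque

class SpillingBuffer:
    """Bounded in-memory buffer that overflows to a disk file, so a slow
    consumer causes disk growth instead of unbounded memory or dropped events."""

    def __init__(self, max_in_memory):
        self.mem = deque()
        self.max = max_in_memory
        self.disk = tempfile.TemporaryFile(mode="w+")  # JSON lines, append-only
        self.spilled = 0   # events currently on disk
        self.read_pos = 0  # next unread byte offset in the disk file

    def push(self, event):
        # Once anything has spilled, keep appending to disk to preserve order.
        if len(self.mem) < self.max and self.spilled == 0:
            self.mem.append(event)
        else:
            self.disk.write(json.dumps(event) + "\n")
            self.spilled += 1

    def pop(self):
        if self.mem:
            return self.mem.popleft()
        if self.spilled:
            self.disk.seek(self.read_pos)
            line = self.disk.readline()
            self.read_pos = self.disk.tell()
            self.disk.seek(0, 2)  # return to end so future pushes append
            self.spilled -= 1
            return json.loads(line)
        return None  # buffer fully drained

buf = SpillingBuffer(max_in_memory=2)
for i in range(5):
    buf.push({"seq": i})         # events 2-4 spill to disk
print([buf.pop()["seq"] for _ in range(5)])  # -> [0, 1, 2, 3, 4]
```

Memory stays bounded at max_in_memory events; everything beyond that is absorbed by disk and drained in order once the consumer catches up.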
Monitoring at scale requires tracking per-table lag (the delay between source commit and downstream application), throughput (events per second per table), and error rates. Alerting on lag growth is the most important signal, because increasing lag indicates that the pipeline is falling behind and may eventually lose data if WAL retention is exceeded.
CDC FAQ
What is the difference between CDC and ETL?
ETL extracts data in bulk on a schedule, transforms it, and loads it into a target system. CDC captures only the changes (inserts, updates, deletes) as they happen and streams them continuously. ETL is batch-oriented and introduces latency; CDC is event-driven and near-real-time. Many modern architectures use CDC to replace the "extract" step of traditional ETL.
Which databases support log-based CDC?
Most modern relational databases support log-based CDC: PostgreSQL (via logical replication and WAL), MySQL (via binlog), SQL Server (via its built-in CDC feature), Oracle (via LogMiner or GoldenGate), and MongoDB (via change streams). The specific configuration varies by database.
Does CDC add overhead to the source database?
Log-based CDC adds minimal overhead because it reads the transaction log asynchronously -- it does not modify queries or add triggers. The primary cost is slightly increased disk I/O for log retention. Trigger-based and polling-based CDC add more overhead because they execute additional queries or triggers during write operations.
How does CDC handle schema changes?
Schema changes (adding columns, changing data types) are one of the most challenging aspects of CDC. Log-based CDC systems typically detect schema changes in the transaction log and propagate them downstream. The downstream consumer must handle schema evolution -- for example, by using a schema registry or applying migrations automatically.
Can CDC be used with data federation?
Yes. CDC and data federation are complementary patterns. Federation queries data across sources in real time; CDC keeps local acceleration caches synchronized with those sources. Together, they enable sub-second federated queries backed by always-fresh local data. Spice uses this combination to deliver real-time performance across distributed data sources.
Learn more about CDC and data acceleration
Guides and blog posts on real-time data synchronization and acceleration with Spice.
Spice.ai OSS Documentation
Learn how Spice uses CDC to keep accelerated datasets fresh and synchronized with source databases in real time.

Real-Time Control Plane Acceleration with DynamoDB Streams
How to sync DynamoDB data to thousands of nodes with sub-second latency using a two-tier architecture with DynamoDB Streams and Spice acceleration.

Introducing Spice Cayenne: The Next-Generation Data Accelerator Built on Vortex for Performance and Scale
Spice Cayenne is the next-generation Spice.ai data accelerator built for high-scale and low latency data lake workloads.
