What is Change Data Capture?
Change data capture (CDC) is a data integration pattern that identifies and tracks row-level changes -- inserts, updates, and deletes -- in a database and delivers them to downstream systems in real time.
Keeping data synchronized across systems is one of the hardest problems in distributed architectures. A customer updates their shipping address in the transactional database, but the analytics dashboard still shows the old one. A product price changes, but the search index serves stale results for hours. An AI model generates answers based on yesterday's data because the vector index hasn't been refreshed.
The traditional solution is batch ETL: extract all data on a schedule, transform it, and load it into downstream systems. This works, but it introduces latency (minutes to hours), wastes resources (re-extracting unchanged data), and adds fragile pipeline infrastructure to maintain.
CDC solves this by streaming only the rows that changed, as they change. Instead of periodic bulk extracts, CDC monitors the database's internal change log and delivers a continuous stream of insert, update, and delete events to downstream consumers. The result is near-real-time data synchronization with minimal impact on the source database.
How CDC Works: Three Implementation Patterns
CDC can be implemented at different levels of the database stack. Each approach makes different tradeoffs between reliability, latency, and source impact.
Log-Based CDC
Log-based CDC reads the database's transaction log -- the internal record of every committed change. In PostgreSQL, this is the Write-Ahead Log (WAL). In MySQL, it's the binary log (binlog). In SQL Server, it's the transaction log.
This is the preferred approach for production workloads because:
- Zero application changes: The transaction log already exists. CDC reads it asynchronously without modifying queries, adding triggers, or changing schemas.
- Complete capture: Every committed change is captured, including deletes. Nothing is missed.
- Minimal source impact: Reading the log is an asynchronous, read-only operation. It adds negligible overhead to the source database.
- Ordering guarantees: Changes are read in commit order, preserving transactional consistency.
The main limitation is that transaction log formats are database-specific. Each database requires its own CDC connector, and log retention policies must be configured to keep logs available long enough for the CDC process to read them.
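As a concrete illustration of what a log-based connector produces, the sketch below decodes a wal2json-style message (one of PostgreSQL's logical decoding output formats) into simple change events. The payload is hand-written for illustration, not captured from a real replication slot:

```python
import json

# Illustrative wal2json-style message: one transaction with a single UPDATE.
# Real payloads arrive via a logical replication slot; this one is hand-written.
raw = json.dumps({
    "change": [{
        "kind": "update",
        "schema": "public",
        "table": "customers",
        "columnnames": ["id", "shipping_address"],
        "columnvalues": [42, "1 New Street"],
        "oldkeys": {"keynames": ["id"], "keyvalues": [42]},
    }]
})

def decode_changes(message: str):
    """Turn a wal2json-style message into simple (op, table, row) tuples."""
    events = []
    for change in json.loads(message)["change"]:
        # DELETE events carry only "oldkeys", so the new-row image may be absent.
        row = (dict(zip(change["columnnames"], change["columnvalues"]))
               if "columnnames" in change else None)
        events.append((change["kind"], change["table"], row))
    return events

print(decode_changes(raw))
# -> [('update', 'customers', {'id': 42, 'shipping_address': '1 New Street'})]
```

The consumer never touches the source tables; it only parses what the log already recorded, which is why log-based CDC adds so little load.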
Trigger-Based CDC
Database triggers fire on insert, update, or delete operations and write change records to a separate tracking table. An external process then reads the tracking table and forwards changes to downstream systems.
Trigger-based CDC works on databases that don't expose their transaction logs (some older or proprietary systems), but it has significant drawbacks:
- Write overhead: Every write operation on the source table triggers an additional write to the tracking table, increasing latency and I/O
- Schema maintenance: The tracking table must be maintained alongside the source schema
- Performance impact: Under high write loads, triggers can become a bottleneck
Polling-Based CDC
A process periodically queries the source table using a timestamp column (updated_at) or incrementing sequence number to detect new or changed rows. This is the simplest approach to implement but the least reliable:
- Misses deletes: Without a soft-delete pattern, there's no way to detect that a row was removed
- Latency proportional to poll interval: With a 5-minute polling interval, changes can sit undetected for up to 5 minutes
- Source load: Frequent polling adds query load to the source database
Polling-based CDC is useful for prototyping or for sources that support no other mechanism, but it's generally not suitable for production real-time workloads.
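The polling approach boils down to a watermark query over a timestamp column. A minimal sketch, using an in-memory list as a stand-in for the source table:

```python
from datetime import datetime

# Stand-in for a source table with an updated_at column.
rows = [
    {"id": 1, "name": "widget", "updated_at": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "name": "gadget", "updated_at": datetime(2024, 1, 1, 10, 7)},
]

def poll_changes(table, watermark):
    """Return rows changed since the watermark, plus the new watermark.
    Equivalent to: SELECT * FROM t WHERE updated_at > :watermark.
    Note the core weakness: a row deleted from the table simply
    disappears -- no query against the table can observe the delete."""
    changed = [r for r in table if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

watermark = datetime(2024, 1, 1, 10, 5)
changed, watermark = poll_changes(rows, watermark)
print([r["id"] for r in changed])  # -> [2]
```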
CDC Pipeline Architecture
A production CDC pipeline has three components: the change capture mechanism, a transport layer, and downstream consumers.
Capture
The CDC connector monitors the source database and emits a stream of change events. Each event includes:
- Operation type: INSERT, UPDATE, or DELETE
- Before state: The row values before the change (for updates and deletes)
- After state: The row values after the change (for inserts and updates)
- Metadata: Timestamp, transaction ID, source table, and schema information
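The event shape described above can be modeled roughly as follows; the field names are illustrative, not any particular connector's wire format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    op: str                  # "INSERT", "UPDATE", or "DELETE"
    table: str
    before: Optional[dict]   # row values before the change; None for inserts
    after: Optional[dict]    # row values after the change; None for deletes
    tx_id: int               # transaction metadata
    committed_at: str

event = ChangeEvent(
    op="UPDATE",
    table="customers",
    before={"id": 7, "address": "old"},
    after={"id": 7, "address": "new"},
    tx_id=1001,
    committed_at="2024-01-01T10:00:00Z",
)
print(event.op, event.after["address"])  # -> UPDATE new
```

Carrying both the before and after state is what lets consumers do more than blind overwrites -- for example, detecting which specific fields changed.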
Transport
Change events are typically published to a message broker or streaming platform -- Kafka is the most common choice. The transport layer provides durability (events aren't lost if a consumer is temporarily offline), ordering (events from the same table are delivered in commit order), and fan-out (multiple consumers can independently read the same stream).
For simpler architectures, CDC can also be consumed directly without a message broker. Some systems, like Spice, provide built-in CDC consumption that eliminates the need for a separate streaming layer.
Consumers
Downstream systems consume change events and update their local state. Common consumers include:
- Analytics databases: Keep analytical copies synchronized with transactional sources
- Search indexes: Update Elasticsearch or OpenSearch indexes as data changes
- Cache layers: Invalidate or refresh Redis or Memcached entries when source data changes
- Vector indexes: Re-embed and re-index documents for RAG systems as content is updated
- Acceleration layers: Refresh local query caches used by SQL federation engines
CDC Use Cases
Real-Time Analytics and Dashboards
Stream changes from transactional databases into analytical systems so dashboards and reports reflect the current state of the business. Instead of waiting for the next ETL batch, every change is visible within seconds.
This is particularly important for operational dashboards -- monitoring inventory levels, tracking order fulfillment, or observing system health -- where stale data leads to wrong decisions.
Cache and Search Index Synchronization
Application caches (Redis, Memcached) and search indexes (Elasticsearch, OpenSearch) go stale when source data changes. Without CDC, teams resort to time-based expiration (which causes periodic staleness) or manual invalidation logic (which is error-prone and hard to maintain).
CDC automates this entirely: when a row changes in the source database, the corresponding cache entry or search index document is updated within seconds. No manual invalidation, no stale reads.
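A minimal consumer that mirrors change events into a cache might look like this; the dict stands in for Redis or Memcached, and the event shape is illustrative:

```python
cache = {}  # stand-in for Redis/Memcached, keyed by (table, primary key)

def apply_to_cache(event):
    """Mirror a change event into the cache: upsert on INSERT/UPDATE,
    evict on DELETE. No TTLs, no manual invalidation logic."""
    key = ("customers", (event["after"] or event["before"])["id"])
    if event["op"] == "DELETE":
        cache.pop(key, None)
    else:
        cache[key] = event["after"]

apply_to_cache({"op": "INSERT", "after": {"id": 1, "name": "Ada"}, "before": None})
apply_to_cache({"op": "UPDATE", "after": {"id": 1, "name": "Ada L."},
                "before": {"id": 1, "name": "Ada"}})
print(cache[("customers", 1)]["name"])  # -> Ada L.

apply_to_cache({"op": "DELETE", "after": None, "before": {"id": 1, "name": "Ada L."}})
print(("customers", 1) in cache)  # -> False
```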
AI and RAG Pipeline Freshness
Retrieval-augmented generation systems depend on vector indexes that represent the current state of source data. If the vector index is rebuilt nightly, answers can be based on data that is up to a day old.
CDC enables incremental index updates: when a document changes in the source database, only that document is re-embedded and re-indexed. This keeps RAG retrieval fresh without the cost of full re-indexing.
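Incremental re-indexing reduces to: on each change event, re-embed only the affected document. In the sketch below, embed() is a placeholder for a real embedding model and the dict stands in for a vector store:

```python
vector_index = {}  # doc_id -> (embedding, text); stand-in for a vector store

def embed(text: str) -> list:
    # Placeholder: a real system calls an embedding model here.
    return [float(len(text))]

def on_document_change(event):
    """Re-embed only the changed document instead of rebuilding the index."""
    if event["op"] == "DELETE":
        vector_index.pop(event["before"]["doc_id"], None)
        return
    doc = event["after"]
    vector_index[doc["doc_id"]] = (embed(doc["text"]), doc["text"])

on_document_change({"op": "INSERT", "after": {"doc_id": "a", "text": "hello"}})
on_document_change({"op": "UPDATE",
                    "after": {"doc_id": "a", "text": "hello world"},
                    "before": {"doc_id": "a", "text": "hello"}})
print(vector_index["a"][1])  # -> hello world
```

One document changed, one embedding call made -- versus re-embedding the entire corpus on a nightly rebuild.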
Event-Driven Microservices
CDC turns database changes into a stream of events that microservices can react to. Instead of services polling each other for updates, changes propagate automatically through the event stream. A common refinement is the "outbox pattern": a service writes event records to a dedicated outbox table in the same transaction as its state change, and CDC streams that table to other services -- decoupling them while keeping them synchronized, without exposing the service's internal schema.
Data Lake Ingestion
Continuously stream changes from operational databases into data lakes (S3, GCS, Azure Blob) in formats like Apache Parquet or Apache Iceberg. This replaces batch export jobs and ensures the data lake reflects the current state of operational systems.
CDC Best Practices
Schema Evolution Handling
Source database schemas change over time -- columns are added, data types are modified, tables are renamed. A robust CDC pipeline must handle these changes gracefully. The most common approaches are:
- Schema registries that track schema versions and ensure consumers can handle multiple versions
- Automatic schema migration where the CDC pipeline detects changes and applies them downstream
- Backward-compatible changes enforced by policy (for example, only additive columns), so existing consumers keep working as the schema evolves
Monitoring and Alerting
CDC pipelines should be monitored for:
- Lag: The time between when a change is committed at the source and when it's applied downstream. Increasing lag indicates the pipeline is falling behind.
- Error rates: Failed events that couldn't be applied downstream
- Log retention: If the source database's transaction log is truncated before CDC reads it, changes are lost permanently
Initial Load
When a CDC pipeline is first set up, the downstream system needs a full snapshot of the current source data. This "initial load" or "snapshot" must be coordinated with the CDC stream to avoid duplicates or gaps. Most production CDC tools handle this automatically.
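The snapshot/stream handoff works by recording the stream position at the moment the snapshot is taken, then applying only events committed after that position. A sketch with toy integer positions standing in for LSNs:

```python
def initial_load(snapshot_rows, snapshot_position, stream):
    """Apply a snapshot, then replay only change events committed after it."""
    state = {row["id"]: row for row in snapshot_rows}
    for event in stream:
        if event["position"] <= snapshot_position:
            continue  # already reflected in the snapshot; skip to avoid duplicates
        if event["op"] == "DELETE":
            state.pop(event["row"]["id"], None)
        else:
            state[event["row"]["id"]] = event["row"]
    return state

snapshot = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
stream = [
    {"position": 90, "op": "UPDATE", "row": {"id": 1, "v": "a"}},    # pre-snapshot
    {"position": 110, "op": "UPDATE", "row": {"id": 2, "v": "b2"}},  # post-snapshot
]
state = initial_load(snapshot, snapshot_position=100, stream=stream)
print(state[2]["v"])  # -> b2
```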
CDC with Spice
Spice uses CDC to keep accelerated datasets synchronized with source systems. When source data changes, the CDC pipeline detects the change and updates the local acceleration cache within seconds. This enables federated SQL queries that are both real-time (reflecting the latest source state) and fast (served from local acceleration).
Spice supports CDC from PostgreSQL, MySQL, and other common sources, with built-in change detection that eliminates the need for a separate streaming infrastructure like Kafka for many use cases.
Advanced Topics
WAL Internals and Logical Replication
Log-based CDC in PostgreSQL reads from the Write-Ahead Log (WAL), which is the database's crash-recovery mechanism. Every committed transaction is first written to the WAL before being applied to the actual data files. CDC leverages this by attaching a logical replication slot to the WAL, which tells PostgreSQL to retain log segments until the CDC consumer has acknowledged them.
A logical replication slot decodes the raw WAL bytes into structured change events using an output plugin (e.g., pgoutput or wal2json). The output plugin determines the format of change events -- whether they include full row images, old values for updated columns, or just the changed fields. Configuring REPLICA IDENTITY FULL on a table ensures that UPDATE and DELETE events include the complete before-state of the row, which is critical for consumers that need to maintain materialized views or detect specific field-level changes.
The key operational concern with WAL-based CDC is slot management. If a CDC consumer goes offline for an extended period, the replication slot prevents PostgreSQL from reclaiming WAL segments. This can cause disk usage to grow unbounded, eventually filling the disk and crashing the database. Production CDC deployments must monitor replication slot lag and set maximum retention policies to prevent this failure mode.
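Slot lag can be measured directly from PostgreSQL's pg_replication_slots view (PostgreSQL 10+). The sketch below pairs that query with a simple budget check; the query results shown are hypothetical:

```python
# PostgreSQL query measuring how much WAL each replication slot
# is forcing the server to retain (bytes between the current WAL
# position and the slot's restart point):
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""

def slots_over_budget(slot_rows, max_retained_bytes):
    """Flag slots whose retained WAL exceeds a budget; these risk filling
    the disk if their consumer stays offline."""
    return [name for name, retained in slot_rows if retained > max_retained_bytes]

# Hypothetical query results: (slot_name, retained_bytes).
rows = [("cdc_main", 50_000_000), ("cdc_stale", 12_000_000_000)]
print(slots_over_budget(rows, max_retained_bytes=1_000_000_000))  # -> ['cdc_stale']
```

Alerting on this number, rather than on disk usage alone, catches a stalled consumer before it endangers the database.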
Exactly-Once Delivery Semantics
Distributed systems offer three delivery guarantees: at-most-once (changes may be lost), at-least-once (changes may be duplicated), and exactly-once (each change is applied precisely once). CDC pipelines must handle the gap between at-least-once delivery (which most transport layers provide) and exactly-once semantics (which consumers require).
The standard approach is idempotent consumers. Instead of trying to guarantee that each change event is delivered exactly once (which is impractical in distributed systems), the consumer is designed so that applying the same event multiple times produces the same result. For database targets, this means using UPSERT (INSERT ... ON CONFLICT UPDATE) instead of plain INSERT. For search indexes, it means writing documents with deterministic IDs so that re-applying an update overwrites the previous version.
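The upsert approach can be demonstrated end to end with SQLite (3.24+), which supports the same INSERT ... ON CONFLICT shape as PostgreSQL and stands in for the target database here:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def apply_event(row):
    """Idempotent apply: running this twice for the same event leaves
    the same final state, so duplicate delivery is harmless."""
    db.execute(
        "INSERT INTO customers (id, name) VALUES (:id, :name) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        row,
    )

apply_event({"id": 1, "name": "Ada"})
apply_event({"id": 1, "name": "Ada"})  # redelivered: no error, no duplicate row
print(db.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # -> 1
```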
When idempotency is insufficient -- for example, when the consumer maintains counters or running aggregates -- the consumer must track its position in the change stream (the Log Sequence Number, or LSN, in PostgreSQL terms) and store it transactionally alongside the applied changes. On recovery, the consumer resumes from its last committed LSN, ensuring no events are processed twice.
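Position tracking for a non-idempotent consumer (a running total) can be sketched like this, again with SQLite standing in for the target and toy integer LSNs. The key point is that the total and the stored position move in one transaction:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE totals (key TEXT PRIMARY KEY, value INTEGER);
    CREATE TABLE cdc_position (consumer TEXT PRIMARY KEY, lsn INTEGER);
    INSERT INTO cdc_position VALUES ('orders', 0);
""")

def apply_event(lsn, amount):
    """Apply a running-total update and advance the stored position in the
    same transaction, skipping any LSN that was already applied."""
    with db:  # one transaction: either both writes commit or neither does
        (last,) = db.execute(
            "SELECT lsn FROM cdc_position WHERE consumer = 'orders'").fetchone()
        if lsn <= last:
            return  # already applied before a crash or redelivery; skip
        db.execute(
            "INSERT INTO totals VALUES ('revenue', :a) "
            "ON CONFLICT(key) DO UPDATE SET value = value + :a", {"a": amount})
        db.execute(
            "UPDATE cdc_position SET lsn = :lsn WHERE consumer = 'orders'",
            {"lsn": lsn})

apply_event(1, 100)
apply_event(1, 100)  # redelivered: skipped, total not double-counted
apply_event(2, 50)
print(db.execute("SELECT value FROM totals WHERE key='revenue'").fetchone()[0])
# -> 150
```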
CDC at Scale
Scaling CDC introduces challenges that don't exist in single-source deployments. When dozens or hundreds of tables must be captured simultaneously, the CDC system must manage connection limits, replication slot resources, and downstream throughput.
Partitioning the change stream by table or by key range allows parallel processing. Events for independent tables can be consumed by separate workers without coordination. Events within a single table can be partitioned by primary key, enabling parallel consumers that each handle a subset of rows -- as long as ordering is maintained within each partition.
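Key-range partitioning reduces to routing each event by a hash of its primary key, so all events for one row land in the same partition in commit order. A minimal sketch (integer keys keep Python's hash deterministic; real systems use a stable hash function):

```python
from collections import defaultdict

def partition(events, num_partitions):
    """Route events to partitions by hashing the primary key, so every
    event for a given row lands in one partition and stays in order."""
    partitions = defaultdict(list)
    for event in events:
        partitions[hash(event["pk"]) % num_partitions].append(event)
    return partitions

events = [{"pk": 1, "op": "INSERT"},
          {"pk": 2, "op": "INSERT"},
          {"pk": 1, "op": "UPDATE"}]
parts = partition(events, num_partitions=2)
print([e["op"] for e in parts[1]])  # both pk=1 events, still in commit order
```

Each partition can now be drained by an independent worker without cross-worker coordination.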
Backpressure management is critical at scale. If a downstream consumer slows down (due to indexing lag, network congestion, or resource contention), the CDC pipeline must buffer events without dropping them and without allowing unbounded memory growth. Production systems use bounded buffers with overflow to persistent storage -- writing excess events to disk when in-memory buffers fill, then draining the disk buffer when the consumer catches up.
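A bounded buffer with disk overflow can be sketched as follows; this is a simplified single-threaded illustration, not a production implementation:

```python
import json
import tempfile
from collections import deque

class SpillingBuffer:
    """Bounded in-memory buffer that overflows to a disk file, so a slow
    consumer causes disk growth instead of unbounded memory or dropped events."""

    def __init__(self, max_in_memory):
        self.mem = deque()
        self.max = max_in_memory
        self.disk = tempfile.TemporaryFile(mode="w+")  # JSON lines, append-only
        self.spilled = 0   # events currently on disk
        self.read_pos = 0  # next unread byte offset in the disk file

    def push(self, event):
        # Once anything has spilled, keep appending to disk to preserve order.
        if len(self.mem) < self.max and self.spilled == 0:
            self.mem.append(event)
        else:
            self.disk.write(json.dumps(event) + "\n")
            self.spilled += 1

    def pop(self):
        if self.mem:
            return self.mem.popleft()
        if self.spilled:
            self.disk.seek(self.read_pos)
            line = self.disk.readline()
            self.read_pos = self.disk.tell()
            self.disk.seek(0, 2)  # return to end so future pushes append
            self.spilled -= 1
            return json.loads(line)
        return None  # buffer fully drained

buf = SpillingBuffer(max_in_memory=2)
for i in range(5):
    buf.push({"seq": i})         # events 2-4 spill to disk
print([buf.pop()["seq"] for _ in range(5)])  # -> [0, 1, 2, 3, 4]
```

Memory stays bounded at max_in_memory events; everything beyond that is absorbed by disk and drained in order once the consumer catches up.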
Monitoring at scale requires tracking per-table lag (the delay between source commit and downstream application), throughput (events per second per table), and error rates. Alerting on lag growth is the most important signal, because increasing lag indicates that the pipeline is falling behind and may eventually lose data if WAL retention is exceeded.
CDC FAQ
What is the difference between CDC and ETL?
ETL extracts data in bulk on a schedule, transforms it, and loads it into a target system. CDC captures only the changes (inserts, updates, deletes) as they happen and streams them continuously. ETL is batch-oriented and introduces latency; CDC is event-driven and near-real-time. Many modern architectures use CDC to replace the "extract" step of traditional ETL.
Which databases support log-based CDC?
Most modern relational databases support log-based CDC: PostgreSQL (via logical replication and WAL), MySQL (via binlog), SQL Server (via its built-in CDC feature), Oracle (via LogMiner or GoldenGate), and MongoDB (via change streams). The specific configuration varies by database.
Does CDC add overhead to the source database?
Log-based CDC adds minimal overhead because it reads the transaction log asynchronously -- it does not modify queries or add triggers. The primary cost is slightly increased disk I/O for log retention. Trigger-based and polling-based CDC add more overhead because they execute additional queries or triggers during write operations.
How does CDC handle schema changes?
Schema changes (adding columns, changing data types) are one of the most challenging aspects of CDC. Log-based CDC systems typically detect schema changes in the transaction log and propagate them downstream. The downstream consumer must handle schema evolution -- for example, by using a schema registry or applying migrations automatically.
Can CDC be used with data federation?
Yes. CDC and data federation are complementary patterns. Federation queries data across sources in real time; CDC keeps local acceleration caches synchronized with those sources. Together, they enable sub-second federated queries backed by always-fresh local data. Spice uses this combination to deliver real-time performance across distributed data sources.
Learn more about CDC and data acceleration
Guides and blog posts on real-time data synchronization and acceleration with Spice.
Spice.ai OSS Documentation
Learn how Spice uses CDC to keep accelerated datasets fresh and synchronized with source databases in real time.

Real-Time Control Plane Acceleration with DynamoDB Streams
How to sync DynamoDB data to thousands of nodes with sub-second latency using a two-tier architecture with DynamoDB Streams and Spice acceleration.

Introducing Spice Cayenne: The Next-Generation Data Accelerator Built on Vortex for Performance and Scale
Spice Cayenne is the next-generation Spice.ai data accelerator built for high-scale and low latency data lake workloads.
