Change Data Capture
Change Data Capture (CDC) is a data integration technique that identifies and processes data changes in a source system so downstream systems can consume inserts, updates, and deletes in near real time or on a controlled schedule.
Expanded Explanation
1. Technical Function and Core Characteristics
CDC detects data manipulation operations in transactional systems and makes those changes available to other systems without repeatedly extracting full datasets. It uses mechanisms such as database transaction logs, triggers, timestamps, or versioning columns to detect changes. Logs and triggers can observe inserts, updates, and deletes directly, whereas timestamp and versioning columns reveal only inserts and updates and need soft-delete markers or a comparable convention to surface deletes. Implementations often provide ordered change streams, metadata about operations, and guarantees about delivery and consistency that align with the underlying database or messaging infrastructure.
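As an illustration of the timestamp mechanism, the following minimal sketch polls a table for rows modified since a stored high-water mark. The `customers` table, its `updated_at` column, and the `poll_changes` helper are hypothetical names chosen for this example, and the integer timestamps stand in for real clock values. It is a sketch under those assumptions, not a production design.

```python
import sqlite3

def poll_changes(conn, last_seen):
    """Return rows modified after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_seen,),
    ).fetchall()
    # Advance the mark only when something was read.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# Hypothetical source table with an application-maintained updated_at column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 105)],
)

# First poll sees both inserted rows.
first_batch, mark = poll_changes(conn, 0)

# An update bumps updated_at; a hard delete leaves no row behind to query.
conn.execute(
    "UPDATE customers SET name = ?, updated_at = ? WHERE id = ?",
    ("Ada L.", 110, 1),
)
conn.execute("DELETE FROM customers WHERE id = 2")

# Second poll sees the update but not the delete.
second_batch, mark = poll_changes(conn, mark)
print(first_batch)
print(second_batch)
```

The second poll returns the updated row only. The deleted row never appears, which is why query-based capture usually needs soft deletes or a separate reconciliation step. A strict `>` comparison can also miss rows that commit later with an equal or earlier timestamp, so real implementations typically add an overlap window or use a monotonically increasing version column.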
Log-based CDC reads the database transaction log (for example, a write-ahead log, redo log, or binary log) to capture committed transactions with low overhead on the source system. Trigger-based approaches add work to every write transaction, and query-based approaches add periodic query load; both can place more load on the source than log reading, but they remain an option when log access is not available. Implementations need to handle schema changes, transaction boundaries, idempotency, and error recovery to maintain data quality and consistency across systems.
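The idempotency and error-recovery requirements can be sketched as a consumer that records the last applied log position and ignores anything at or below it. The `ChangeEvent` shape, the `position` field, and the `Replica` class are assumptions made for this example and do not reflect the format of any particular CDC tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ChangeEvent:
    position: int               # hypothetical log position, increasing in commit order
    op: str                     # "insert", "update", or "delete"
    key: int
    row: Optional[dict] = None  # new row image; None for deletes

class Replica:
    """In-memory target that applies each change event at most once."""

    def __init__(self):
        self.rows = {}
        self.applied_position = 0  # checkpoint of the last applied event

    def apply(self, event):
        # Events at or below the checkpoint were already applied; skip them.
        if event.position <= self.applied_position:
            return False
        if event.op == "delete":
            self.rows.pop(event.key, None)
        elif event.op in ("insert", "update"):
            # Upsert so that inserts and updates share one code path.
            self.rows[event.key] = dict(event.row)
        else:
            raise ValueError(f"unknown operation: {event.op}")
        self.applied_position = event.position
        return True

events = [
    ChangeEvent(1, "insert", 1, {"name": "Ada"}),
    ChangeEvent(2, "update", 1, {"name": "Ada L."}),
    ChangeEvent(3, "insert", 2, {"name": "Grace"}),
    ChangeEvent(4, "delete", 2),
]

replica = Replica()
first_pass = [replica.apply(e) for e in events]

# Simulate at-least-once redelivery after a restart: replay the last three events.
replay = [replica.apply(e) for e in events[1:]]
print(replica.rows, replica.applied_position)
```

Replaying the tail of the stream leaves the replica unchanged because every redelivered position is already covered by the checkpoint. In a real pipeline the checkpoint would need to be persisted atomically with the applied data, which this in-memory sketch does not attempt.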
2. Enterprise Usage and Architectural Context
Enterprises use CDC to move operational data into data warehouses, data lakes, and analytics platforms while keeping reporting datasets aligned with transactional systems. It supports near-real-time replication, event streaming, and synchronization between heterogeneous databases and applications. Architects include CDC in data pipelines to reduce batch windows, minimize extract workloads, and support microservices patterns where one service publishes changes for others to consume.
CDC often operates within an event-driven or streaming architecture that uses message brokers or streaming platforms to distribute change events. It can also integrate with Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) workflows to maintain slowly changing dimensions, audit trails, and operational reporting stores. Governance teams align CDC processes with data lineage, retention, and security controls in the broader data platform.
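One way a change feed can maintain a slowly changing dimension is the Type 2 pattern, in which each change closes the current version of a row and opens a new one. The sketch below assumes a list-of-dicts dimension, ISO date strings, and a hypothetical `apply_change` helper; passing `None` as the attributes stands in for a delete event.

```python
def apply_change(dimension, key, attributes, changed_at):
    """Apply one change to a Type 2 dimension held as a list of version dicts.

    attributes is the new row image, or None when the source row was deleted.
    """
    # Find the open (current) version for this key, if one exists.
    current = next(
        (v for v in dimension if v["key"] == key and v["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attributes"] == attributes:
            return  # redelivered or no-op change: nothing to record
        current["valid_to"] = changed_at  # close the current version
    if attributes is not None:
        # Open a new version that stays current until the next change.
        dimension.append(
            {
                "key": key,
                "attributes": dict(attributes),
                "valid_from": changed_at,
                "valid_to": None,
            }
        )

dimension = []
apply_change(dimension, 42, {"tier": "silver"}, "2024-01-01")  # insert
apply_change(dimension, 42, {"tier": "gold"}, "2024-03-15")    # update
apply_change(dimension, 42, {"tier": "gold"}, "2024-03-15")    # duplicate delivery
apply_change(dimension, 42, None, "2024-06-30")                # delete

for version in dimension:
    print(version)
```

The dimension ends with two closed versions, one per distinct attribute set, so historical reports can still resolve which tier applied on a given date. A warehouse implementation would express the same logic as a merge statement against a table rather than a Python list.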
3. Related or Adjacent Technologies
CDC relates to database replication, event sourcing, and message queuing but serves a distinct role focused on extracting and propagating committed changes from existing systems. Traditional ETL tools may include CDC capabilities, but ETL focuses on transformation and loading, while CDC focuses on capturing and transporting change events. Streaming platforms and data integration tools often consume CDC feeds as one input among others.
Event sourcing persists application state as a sequence of events by design, whereas CDC derives events from existing databases that may not have been built for event orientation. Database-native replication can use CDC concepts internally, but enterprise CDC solutions typically expose standardized formats and connectors for downstream analytics, monitoring, and operational use cases. Data quality, catalog, and observability tools frequently integrate with CDC streams to monitor schema evolution and data anomalies.
4. Business and Operational Significance
CDC enables business teams to access operational metrics, customer activity, and transactional data with reduced latency compared with traditional batch integration. This supports use cases such as operational dashboards, fraud detection pipelines, inventory monitoring, and regulatory reporting that require current data. By capturing only deltas instead of full extracts, organizations can lower load on production databases and network resources.
From an operational standpoint, CDC affects data governance, security, and compliance because it moves detailed transactional events across systems and regions. Enterprises implement access controls, encryption, retention policies, and monitoring around CDC pipelines to manage exposure of sensitive data and align with regulatory requirements. CDC design decisions also influence recovery point objectives and service-level targets for analytics and downstream applications.