Log Compaction
Log compaction is a data retention and storage optimization process in distributed commit logs that retains only the most recent record for each key while discarding older, superseded entries.
Expanded Explanation
1. Technical Function and Core Characteristics
Log compaction operates on append-only logs, where records for different keys are interleaved in arrival order. It scans the records, identifies the latest record for each key, and rewrites segments so that only that latest value per key remains. Systems use it to bound log size while preserving current state for each key.
Compaction typically runs in the background, often alongside traditional time-based retention. Surviving records keep their relative order, and durability is maintained by fully writing the compacted segments before the originals are discarded.
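The two-pass idea described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not any particular platform's implementation: the `Record` type, the `compact` function, and the sample account keys are all hypothetical names chosen for the example.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical log record: position in the log, key, and value."""
    offset: int
    key: str
    value: str


def compact(log: list[Record]) -> list[Record]:
    """Return a compacted copy of the log, keeping the latest record per key."""
    # Pass 1: remember the offset of the most recent record for each key.
    latest_offset: dict[str, int] = {}
    for record in log:
        latest_offset[record.key] = record.offset

    # Pass 2: keep only records that are the latest for their key.
    # Iterating in log order preserves the relative order of survivors.
    return [r for r in log if latest_offset[r.key] == r.offset]


# Illustrative input: two keys are updated more than once.
log = [
    Record(0, "acct-1", "100"),
    Record(1, "acct-2", "50"),
    Record(2, "acct-1", "120"),
    Record(3, "acct-3", "7"),
    Record(4, "acct-2", "55"),
]

compacted = compact(log)
```

In this sketch the function returns a new list and leaves the input untouched, mirroring the practice of writing compacted segments before discarding the originals.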
2. Enterprise Usage and Architectural Context
Enterprises use log compaction in event streaming platforms, replicated state machines, and Change Data Capture (CDC) pipelines to reconstruct up-to-date state from a log without replaying all historical updates. This shortens recovery after failures, state transfer during rebalancing, and bootstrapping of late-joining consumers.
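As a rough illustration of why a late-joining consumer can start from a compacted log, the sketch below replays both a full log and its compacted form into a dictionary and compares the results. The `rebuild_state` function and the sample records are hypothetical, and the logs are modeled as simple lists of key-value pairs.

```python
def rebuild_state(records):
    """Replay (key, value) records in order; later values overwrite earlier ones."""
    state = {}
    for key, value in records:
        state[key] = value
    return state


# Full history: every update ever written for each key.
full_log = [
    ("acct-1", 100),
    ("acct-2", 50),
    ("acct-1", 120),
    ("acct-2", 55),
]

# Compacted form of the same log: only the latest record per key survives.
compacted_log = [
    ("acct-1", 120),
    ("acct-2", 55),
]

state_from_full = rebuild_state(full_log)
state_from_compacted = rebuild_state(compacted_log)
```

Both replays yield the same final state, but the compacted log requires reading two records instead of four. The saving grows with the number of updates per key.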
Architects deploy compacted topics or logs for datasets that behave like key-value tables, such as account balances, configuration data, or reference entities, while using non-compacted logs for audit trails and analytical events.
3. Related or Adjacent Technologies
Log compaction relates to snapshotting, checkpointing, and garbage collection, which also reduce stored history while retaining data needed for correctness. It appears in distributed logs, replicated databases, and consensus systems.
Vendors and open-source platforms implement compaction with log-structured storage engines, segment merging, and index structures that track the latest offset per key. These mechanisms must coordinate with the system's replication and partitioning strategies.
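A per-key offset index and a segment merge can be sketched as follows. This assumes a simplified model in which each segment is a list of `(offset, key, value)` tuples held in memory; the function names `build_offset_index` and `merge_segments` and the configuration keys are hypothetical, and real systems typically build such an index over on-disk segments.

```python
def build_offset_index(segments):
    """Map each key to the highest offset at which it appears in any segment."""
    index = {}
    for segment in segments:
        for offset, key, _value in segment:
            index[key] = max(offset, index.get(key, -1))
    return index


def merge_segments(segments):
    """Merge segments into one, dropping records superseded by a later offset."""
    index = build_offset_index(segments)
    merged = []
    for segment in segments:
        for offset, key, value in segment:
            # A record survives only if the index points at its own offset.
            if index[key] == offset:
                merged.append((offset, key, value))
    return merged


# Two illustrative segments; "cfg.timeout" is updated in the second one.
segments = [
    [(0, "cfg.timeout", "30"), (1, "cfg.retries", "3")],
    [(2, "cfg.timeout", "45"), (3, "cfg.region", "eu")],
]

index = build_offset_index(segments)
merged = merge_segments(segments)
```

Here the index resolves which copy of a key is current, so the merge step can decide record by record without comparing values.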
4. Business and Operational Significance
For enterprises, log compaction controls storage costs and recovery times for long-lived data streams: log size grows with the number of distinct keys rather than the number of updates, so the log stays bounded while still holding the current representation of each business entity. It supports data retention policies and operational resilience.
Compaction also supports consistent reprocessing and system migration because consumers can rebuild stateful services from compacted logs without accessing external snapshots, which simplifies operations and reduces coupling between systems.