Introduction to Compaction Strategies in Cassandra

Overview

In Apache Cassandra, data is written to disk in immutable files called SSTables. Over time, multiple SSTables accumulate for the same table, leading to redundancy and increased read latency. To manage this efficiently, Cassandra uses a process called compaction.

Compaction merges multiple SSTables into fewer, larger ones, discarding overwritten values and deletion markers that are no longer needed, which improves read performance. The policy that determines when and how Cassandra compacts a table is called a compaction strategy.

What is Compaction in Cassandra?

Compaction is the process of merging SSTables to optimize disk usage and read efficiency. It helps by:

Removing tombstones (markers for deleted data) once they are older than the table's gc_grace_seconds
Merging overlapping SSTables
Reclaiming disk space held by overwritten and deleted data
Improving read performance

Each compaction replaces its input SSTables with fewer, consolidated ones.
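
To make the merge step concrete, here is a minimal Python sketch. It is a toy model, not Cassandra's implementation: each SSTable is a plain dict, timestamps are plain integers, and the names compact and TOMBSTONE are invented for illustration. Real Cassandra applies extra checks before purging a tombstone (for example, whether SSTables outside the compaction still hold older data for that key).

```python
# Toy model: an SSTable is a dict mapping key -> (write_timestamp, value).
# A value of None stands for a tombstone (a deletion marker).
TOMBSTONE = None


def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables into one, keeping only the newest cell per key."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            # Last write wins: keep the cell with the highest timestamp.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)

    result = {}
    for key, (timestamp, value) in merged.items():
        past_grace = now - timestamp > gc_grace_seconds
        if value is TOMBSTONE and past_grace:
            continue  # tombstone is old enough to be dropped
        result[key] = (timestamp, value)
    return result


older = {"alice": (100, "v1"), "bob": (110, "v1")}
newer = {"alice": (200, "v2"), "bob": (210, TOMBSTONE), "carol": (990, TOMBSTONE)}

print(compact([older, newer], now=1000, gc_grace_seconds=500))
# {'alice': (200, 'v2'), 'carol': (990, None)}
```

In this run "alice" keeps only her newest value, "bob" disappears because his tombstone is past the grace period, and the recent tombstone for "carol" is kept.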

How It Works:

Data is first appended to the commit log and written to a memtable (an in-memory structure).
When the memtable reaches its size threshold, it is flushed to disk as an SSTable.
Over time, multiple SSTables accumulate.
Cassandra triggers a compaction process to merge and clean them.
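
The steps above can be sketched as a small Python model. The class TinyLSM and its memtable_limit parameter are hypothetical, and the model leaves out the commit log, bloom filters, and tombstones. It only shows why reads slow down as SSTables pile up and how compaction reverses that.

```python
class TinyLSM:
    """Toy write path: a memtable in memory plus immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}
        self.sstables = []  # oldest first; each one is a read-only tuple

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort by key and freeze the contents, mimicking an immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        """Return (value, number_of_sstables_checked)."""
        if key in self.memtable:
            return self.memtable[key], 0
        checked = 0
        for sstable in reversed(self.sstables):  # newest first
            checked += 1
            for k, v in sstable:
                if k == key:
                    return v, checked
        return None, checked

    def compact(self):
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values win
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]


db = TinyLSM(memtable_limit=2)
for i in range(6):
    db.write(f"k{i}", i)

print(len(db.sstables))  # 3
print(db.read("k0"))     # (0, 3): found only after checking three SSTables
db.compact()
print(db.read("k0"))     # (0, 1): a single SSTable remains
```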

Types of Compaction Strategies in Cassandra

Cassandra provides several compaction strategies, each suited to different workloads and data patterns.

1. SizeTieredCompactionStrategy (STCS)

Default strategy for newly created tables.
Merges SSTables of similar size into a larger SSTable once enough of them (four by default) have accumulated.
Best suited for write-intensive workloads.

Advantages:

Simple and effective for write-heavy systems.
Efficiently merges small SSTables.

Disadvantages:

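Large compactions cause disk I/O spikes and can temporarily need free space roughly equal to the size of the data being compacted.
Read latency rises when a partition's data is spread across many SSTables.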

Example:

ALTER TABLE users WITH compaction = {
  'class': 'SizeTieredCompactionStrategy'
};
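
The grouping of similarly sized SSTables can be sketched in Python as follows. This is a simplified illustration, not Cassandra's code: the function names are invented, sizes are plain numbers in MB, and the default values mirror the documented STCS options bucket_low (0.5), bucket_high (1.5), and min_threshold (4). The min_sstable_size option is ignored here.

```python
def size_tiered_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets whose members are of similar size."""
    buckets = []  # each bucket is a list of sizes
    for size in sorted(sizes_mb):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # A size joins a bucket if it is close to the bucket's average.
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no bucket fits: start a new one
    return buckets


def compaction_candidates(sizes_mb, min_threshold=4):
    """Buckets holding at least min_threshold SSTables are ready to compact."""
    return [b for b in size_tiered_buckets(sizes_mb) if len(b) >= min_threshold]


sizes = [10, 11, 12, 9, 160, 170, 900]
print(size_tiered_buckets(sizes))    # [[9, 10, 11, 12], [160, 170], [900]]
print(compaction_candidates(sizes))  # [[9, 10, 11, 12]]
```

Only the four small SSTables qualify here. The two mid-sized ones wait until more SSTables of that size appear, which is why reads on an STCS table can end up touching several tiers.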

2. LeveledCompactionStrategy (LCS)

Designed for read-heavy workloads.
Organizes SSTables into levels. SSTables have a fixed target size (160 MB by default), and each level holds about ten times as much data as the one before it.
Within each level above L0, SSTables cover non-overlapping token ranges.

Advantages:

Fewer SSTables per read → improved read performance.
Predictable disk usage.

Disadvantages:

Higher write amplification.
Compaction runs more often and consumes more disk I/O and CPU than STCS.

Example:

ALTER TABLE orders WITH compaction = {
  'class': 'LeveledCompactionStrategy'
};
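
As a rough illustration of how levels grow, the Python sketch below computes level capacities, assuming the documented defaults of 160 MB per SSTable and a fanout of 10. The function names are made up for this example, and L0 is left out.

```python
def level_capacities_mb(levels, sstable_size_mb=160, fanout=10):
    """Target capacity of L1..Ln; each level is `fanout` times the previous one."""
    return [sstable_size_mb * fanout ** level for level in range(1, levels + 1)]


def levels_needed(data_mb, sstable_size_mb=160, fanout=10):
    """Smallest number of levels (L1 upward) whose combined capacity holds data_mb."""
    level, total = 0, 0
    while total < data_mb:
        level += 1
        total += sstable_size_mb * fanout ** level
    return level


print(level_capacities_mb(3))  # [1600, 16000, 160000]
print(levels_needed(100_000))  # 3
```

Because SSTables within a level do not overlap, a partition lives in at most one SSTable per level. Under these assumptions, a read against about 100 GB of data on a node checks roughly three SSTables plus whatever is still in L0.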

3. TimeWindowCompactionStrategy (TWCS)

Best for time-series data (e.g., logs, IoT data, metrics).
Groups SSTables into time windows (e.g., hourly, daily) and compacts within each window.
Avoids rewriting old data unnecessarily.

Advantages:

Ideal for data with a TTL (time to live).
Reduces I/O by only compacting recent data.

Disadvantages:

Not ideal for random-access patterns, such as updating or deleting old rows or writing data far out of time order.

Example:

ALTER TABLE metrics WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': 1
};
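
The window assignment can be illustrated with a short Python sketch. It assumes that each SSTable is placed by its newest write timestamp and that timestamps are Unix seconds. The function names and the sample SSTables are hypothetical.

```python
from collections import defaultdict


def window_start(timestamp_s, window_size=1, unit_seconds=86_400):
    """Floor a Unix timestamp (seconds) to the start of its time window."""
    window = window_size * unit_seconds
    return timestamp_s - (timestamp_s % window)


def group_by_window(sstables, window_size=1, unit_seconds=86_400):
    """sstables: iterable of (name, max_timestamp_s) -> {window_start: [names]}."""
    groups = defaultdict(list)
    for name, max_ts in sstables:
        groups[window_start(max_ts, window_size, unit_seconds)].append(name)
    return dict(groups)


# Hypothetical SSTables with the timestamp of their newest write.
sstables = [("a", 10), ("b", 80_000), ("c", 90_000), ("d", 200_000)]
print(group_by_window(sstables))
# {0: ['a', 'b'], 86400: ['c'], 172800: ['d']}
```

With a one-day window, "a" and "b" fall into the same window and can be compacted together, while "c" and "d" are left alone until more SSTables arrive in their windows.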

🧠 Comparison of Compaction Strategies

Strategy	Best For	Read Performance	Write Performance	Disk Usage	Notes
STCS	Write-heavy workloads	Moderate	High	Moderate	Default strategy
LCS	Read-heavy workloads	High	Moderate	Low space overhead, high I/O	Levels control overlap
TWCS	Time-series data	High	Moderate	Efficient	Best for TTL data

⚡ Choosing the Right Compaction Strategy

Use Case	Recommended Strategy
High write volume, minimal reads	SizeTieredCompactionStrategy
Read-heavy, low-latency queries	LeveledCompactionStrategy
Time-series or TTL-based data	TimeWindowCompactionStrategy

Example: Checking Current Compaction Settings

DESCRIBE TABLE user_activity;

Output (abridged to the compaction setting):

compaction = {
  'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
}

To change it:

ALTER TABLE user_activity WITH compaction = {
  'class': 'LeveledCompactionStrategy'
};

🛠️ Best Practices

✅ Choose a strategy based on data pattern (writes, reads, or time-series).
✅ Avoid frequent major compactions; they are I/O intensive.
✅ Monitor compaction metrics using nodetool compactionstats.
✅ Use TWCS for expiring data with TTLs.
✅ Always test compaction changes on a staging cluster first.

Conclusion

Compaction is a vital process in Cassandra that keeps storage compact and reads fast. Matching the compaction strategy to the workload is one of the most direct ways to control read latency, write amplification, and disk usage.

Use STCS for write-heavy tables.
Use LCS for read-intensive workloads.
Use TWCS for time-series or expiring data.

By fine-tuning compaction strategies, you can greatly improve cluster health, query speed, and storage utilization.