Apache Cassandra Troubleshooting Guide
Introduction
Apache Cassandra is one of the most popular NoSQL distributed databases, designed to handle massive data workloads with high availability and scalability. However, due to its distributed nature, administrators often face configuration, performance, and consistency issues.
This Cassandra Troubleshooting Guide (2025) will help you diagnose and fix the most common problems in your Cassandra cluster — from node failures to data inconsistencies — ensuring smooth performance and reliability.
1. Understanding Cassandra Troubleshooting
Before you start fixing issues, it’s crucial to understand Cassandra’s architecture:
- Nodes → Individual database instances.
- Clusters → Groups of nodes working together.
- Keyspaces & Tables → Logical data structures.
- Replication → Ensures fault tolerance across nodes (declared per keyspace, as sketched below).
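For example, a keyspace's replication settings decide how many copies of each row exist. A minimal sketch, assuming hypothetical keyspace, datacenter, and table names:

```sql
-- Keep three replicas of every row in datacenter "dc1"
CREATE KEYSPACE IF NOT EXISTS app_data
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- Tables live inside a keyspace
CREATE TABLE IF NOT EXISTS app_data.users (
  user_id uuid PRIMARY KEY,
  name    text,
  email   text
);
```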
When an issue occurs, Cassandra’s logs, metrics, and nodetool commands are your best friends for troubleshooting.
2. Common Cassandra Problems and Their Solutions
a) UnavailableException: Cannot achieve consistency level
Cause:
One or more nodes required for a specific consistency level are not available.
Solution:
- Check node status: `nodetool status`
- Restart or repair failed nodes: `nodetool repair`
- Reduce the consistency level if the replication factor is low: `CONSISTENCY QUORUM;` (see the sketch below)
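A quick sketch of those checks, run from a shell on any live node (output varies by cluster):

```bash
nodetool status   # nodes marked DN (Down/Normal) cannot serve requests at the required consistency level
nodetool repair   # once a failed node is back up, repair brings its replicas in sync
```

In `cqlsh`, the current session consistency level is shown with `CONSISTENCY;` and changed with `CONSISTENCY QUORUM;` (or a lower level such as `ONE` when the keyspace's replication factor cannot satisfy a quorum).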
b) WriteTimeoutException / ReadTimeoutException
Cause:
Cassandra didn’t receive responses from enough replicas within the timeout period.
Solution:
- Increase timeout values in `cassandra.yaml`: `write_request_timeout_in_ms: 10000`, `read_request_timeout_in_ms: 10000`
- Avoid heavy queries that scan large partitions (see the check below).
- Add nodes or improve hardware performance.
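Oversized partitions are a frequent cause of these timeouts. One way to spot them is `nodetool tablehistograms`, which reports partition-size percentiles per table (the keyspace and table names below are placeholders):

```bash
# Watch the "Partition Size" column; partitions in the hundreds of MB usually call for a schema change
nodetool tablehistograms my_keyspace my_table
```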
c) NoHostAvailableException
Cause:
Client cannot connect to any Cassandra nodes.
Solution:
- Check if the Cassandra service is running: `sudo service cassandra status`
- Ensure the native transport port `9042` is open (a quick check is shown below).
- Verify correct IPs and contact points in your application driver.
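A minimal connectivity check, assuming the default port and a placeholder host name:

```bash
sudo ss -tlnp | grep 9042      # is Cassandra listening on the native transport port?
nc -zv cassandra-node1 9042    # can the application host actually reach it?
```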
d) OutOfMemoryError
Cause:
Heap space is insufficient for the workload.
Solution:
- Edit the heap size in `jvm.options`: `-Xms4G -Xmx4G`
- Avoid large partitions and use pagination.
- Use `nodetool info` to monitor memory usage (example below).
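For instance, heap pressure can be watched over time, and cqlsh can be told to page results rather than load them all at once (the page size here is arbitrary):

```bash
nodetool info | grep -i heap   # current heap and off-heap usage in MB
```

In `cqlsh`, `PAGING 100;` limits how many rows are fetched per page, so a single query does not pull an entire large partition into client memory.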
e) Corrupted SSTables or CommitLogs
Cause:
Power loss or hardware failures cause data file corruption.
Solution:
- Run the scrub tool: `nodetool scrub keyspace_name table_name`
- Back up data regularly using snapshots (see the sketch below).
- Avoid abrupt shutdowns.
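Taking a snapshot before scrubbing is cheap insurance, and if the node cannot start at all, the offline `sstablescrub` tool can be used instead of `nodetool scrub`. A sketch with placeholder names:

```bash
nodetool snapshot -t pre_scrub keyspace_name   # named snapshot of the keyspace
nodetool scrub keyspace_name table_name        # online scrub
sstablescrub keyspace_name table_name          # offline scrub, run while Cassandra is stopped
```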
f) Tombstone Overload
Cause:
Frequent deletions create tombstones, slowing reads.
Solution:
- Use TTL (Time-To-Live) for data expiry, as sketched below.
- Avoid large deletes and updates.
- Compact tables: `nodetool compact`
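A TTL sketch (the keyspace, table, and columns are hypothetical): rows written this way expire on their own instead of being deleted explicitly.

```sql
INSERT INTO app_data.sessions (session_id, user_id)
VALUES (uuid(), 123e4567-e89b-12d3-a456-426614174000)
USING TTL 86400;   -- the row expires automatically after 24 hours

-- Or set a default TTL for the whole table
ALTER TABLE app_data.sessions WITH default_time_to_live = 86400;
```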
g) Disk Full or Disk Failure
Cause:
Disk usage exceeds 90% due to large SSTables or commit logs.
Solution:
- Clear old snapshots: `nodetool clearsnapshot`
- Move commit logs to a separate drive.
- Monitor disk space regularly using `df -h` (see below).
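A disk-hygiene sketch (mount points depend on your installation; on Cassandra 4.x `clearsnapshot` expects `--all` or a tag, while older versions clear everything by default):

```bash
df -h /var/lib/cassandra      # usage of the data directory
nodetool listsnapshots        # which snapshots are holding space
nodetool clearsnapshot --all  # free that space (snapshots share files with live SSTables via hard links)
```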
🔍 3. Key Troubleshooting Tools
| Tool | Purpose |
|---|---|
| `nodetool` | Manage and inspect cluster status |
| `cqlsh` | Execute Cassandra Query Language (CQL) statements |
| `sstableloader` | Bulk-load SSTables into a cluster |
| OpsCenter | Visualize metrics and manage clusters |
| `system.log` | View error messages and warnings |
4. Performance Troubleshooting
Slow Queries
- Use `TRACING ON` in cqlsh to analyze query execution (CQL has no `EXPLAIN PLAN`).
- Avoid `ALLOW FILTERING`.
- Use appropriate partition keys for efficient lookups, as in the sketch below.
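To illustrate the last two points (the schema is hypothetical, reusing the `users` table sketched in section 1):

```sql
-- Anti-pattern: the filter is not on the partition key, so every node must be scanned
SELECT * FROM app_data.users WHERE email = 'a@example.com' ALLOW FILTERING;

-- Better: the partition key routes the read straight to the replicas that own the row
SELECT * FROM app_data.users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```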
High Latency
- Monitor GC (Garbage Collection) activity.
- Increase concurrent reads/writes in `cassandra.yaml`.
- Tune Linux kernel parameters (`vm.swappiness`, `ulimit`), as sketched below.
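Concurrency is controlled by the `concurrent_reads` and `concurrent_writes` settings in `cassandra.yaml`; the OS-level tuning might look like this (values are illustrative, not recommendations):

```bash
sudo sysctl -w vm.swappiness=1   # keep the JVM heap out of swap
ulimit -n 100000                 # raise the open-file limit for the Cassandra process (persist it in limits.conf)
```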
🧠 5. Preventive Maintenance Tips
✅ Regularly run: `nodetool repair`, `nodetool cleanup`, and `nodetool compact` (a scheduling sketch follows below).
✅ Monitor metrics using:
- Prometheus + Grafana
- DataStax OpsCenter
- ELK Stack for log aggregation
✅ Always test schema changes in a staging cluster before production rollout.
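One way to keep repairs regular is a cron entry on each node, staggered across the cluster (the schedule, user, and log path are assumptions):

```bash
# /etc/cron.d/cassandra-repair: weekly primary-range repair at 02:00 on Sundays
0 2 * * 0  cassandra  /usr/bin/nodetool repair -pr >> /var/log/cassandra/scheduled-repair.log 2>&1
```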
🧾 6. Cassandra Cluster Health Checklist
| Check | Frequency | Command |
|---|---|---|
| Node Status | Daily | `nodetool status` |
| Disk Usage | Weekly | `df -h` |
| Repair Process | Weekly | `nodetool repair` |
| Compaction | Monthly | `nodetool compact` |
| Backup Snapshot | Weekly | `nodetool snapshot` |
Conclusion
Effective Cassandra troubleshooting requires a good understanding of cluster behavior, monitoring tools, and common error patterns.
By following this guide and applying preventive maintenance, you can ensure your Cassandra cluster remains healthy, efficient, and ready for high-performance workloads.