Configuring Cassandra Cluster: A Complete Step-by-Step Guide
[Figure: Cassandra cluster configuration architecture diagram]
Introduction
Apache Cassandra is a distributed NoSQL database designed for scalability, fault tolerance, and high availability. To fully utilize its power, you need to configure it as a cluster — a group of interconnected nodes that share data and handle workloads efficiently.
In this guide, we’ll explain how to configure a Cassandra cluster from scratch, including network setup, configuration files, replication, and verification.
A Cassandra cluster is a collection of multiple nodes (servers) working together. Each node stores a part of the data and communicates with the others using the gossip protocol. The key terms used throughout this guide are:
Node: The basic unit in Cassandra that stores data.
Cluster: A collection of nodes working together.
Keyspace: The top-level namespace defining data replication strategy.
Data Center: A logical grouping of nodes for replication and load balancing.
Example Setup:
Node1 → 192.168.1.101
Node2 → 192.168.1.102
Node3 → 192.168.1.103
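If you prefer to address the machines by name, you can optionally map these example IPs to hostnames in /etc/hosts on every node. The node1 through node3 hostnames below are only illustrative; Cassandra itself does not require them.
# Optional /etc/hosts entries (hostnames are illustrative)
192.168.1.101   node1
192.168.1.102   node2
192.168.1.103   node3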
Before configuring the cluster, make sure that:
Cassandra is installed on every node (follow the Cassandra installation guide for Linux or Windows).
All nodes run the same Cassandra version.
All nodes have unique IP addresses.
The firewall allows communication on the following ports (example commands for opening them are shown after the list):
7000 – intra-node communication
7001 – encrypted intra-node communication
7199 – JMX monitoring
9042 – CQL clients
9160 – Thrift clients (legacy; removed in Cassandra 4.0, so only needed on older versions)
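As a rough example, on a Linux node protected by ufw the ports could be opened as shown below. The 192.168.1.0/24 subnet is just the example network from this guide; adapt the commands to your own firewall tool and security policy.
sudo ufw allow from 192.168.1.0/24 to any port 7000 proto tcp   # intra-node gossip
sudo ufw allow from 192.168.1.0/24 to any port 7001 proto tcp   # encrypted intra-node traffic
sudo ufw allow from 192.168.1.0/24 to any port 7199 proto tcp   # JMX monitoring
sudo ufw allow 9042/tcp                                         # CQL clients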
The main configuration file for Cassandra is located at:
/etc/cassandra/cassandra.yaml
You must modify this file on each node.
| Parameter | Description | Example |
|---|---|---|
| cluster_name | Defines the cluster's name | cluster_name: 'MyCassandraCluster' |
| listen_address | Node's local IP for intra-node communication | listen_address: 192.168.1.101 |
| seeds | List of seed nodes for gossip | - seeds: "192.168.1.101,192.168.1.102" |
| rpc_address | IP address clients connect to | rpc_address: 0.0.0.0 |
| endpoint_snitch | Network topology setting | endpoint_snitch: GossipingPropertyFileSnitch |
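Because the example uses GossipingPropertyFileSnitch, each node announces its data center and rack from the cassandra-rackdc.properties file located next to cassandra.yaml. For the single data center and rack shown later in the nodetool output (dc1 / rack1), that file would contain:
# cassandra-rackdc.properties (same values on every node in this example)
dc=dc1
rack=rack1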
Note:
The seed node helps other nodes discover the cluster.
Use at least two seed nodes for fault tolerance.
Select two of your nodes as seed nodes (here, Node1 and Node2). Then, in cassandra.yaml on all nodes, set:
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.101,192.168.1.102"
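Putting the pieces together, the relevant lines of cassandra.yaml on Node1 would look roughly like the sketch below, using the example values from this guide (leave the rest of the file as shipped). Note that when rpc_address is the wildcard 0.0.0.0, Cassandra also requires broadcast_rpc_address to be set to a concrete address.
cluster_name: 'MyCassandraCluster'
listen_address: 192.168.1.101
rpc_address: 0.0.0.0
broadcast_rpc_address: 192.168.1.101    # required because rpc_address is 0.0.0.0
endpoint_snitch: GossipingPropertyFileSnitch
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.101,192.168.1.102"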
Edit cassandra-env.sh (Linux) or cassandra-env.ps1 (Windows) to pass each node's addresses to the JVM; the lines below use the Linux shell syntax, so change the listen address to match each node:
JVM_OPTS="$JVM_OPTS -Dcassandra.listen_address=192.168.1.101"
JVM_OPTS="$JVM_OPTS -Dcassandra.rpc_address=0.0.0.0"
Start Cassandra on each node, beginning with the seed nodes:
sudo systemctl start cassandra
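To have the service come back after a reboot and to watch a node join the ring, the following is a typical approach on a systemd-based package install; the log path shown is the package default and may differ on your system.
sudo systemctl enable cassandra
sudo systemctl status cassandra
sudo tail -f /var/log/cassandra/system.log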
Once every node is running, verify the cluster state from any node:
nodetool status
Example output:
Datacenter: dc1
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.101 200 MB 256 33.3% e91f1c9f-87b6-44c5-b98d-77a6512e53d2 rack1
UN 192.168.1.102 210 MB 256 33.3% a02e1b1b-6a0d-4eab-b3db-8d2a68db6572 rack1
UN 192.168.1.103 220 MB 256 33.3% b21e3c9d-77a1-4f67-9e9a-223a121e623f rack1
If all nodes show UN (Up and Normal) — your cluster is configured successfully!
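As an additional check, nodetool describecluster (run from any node) reports the cluster name, snitch, and schema versions, which should match across all nodes:
nodetool describecluster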
Use CQLSH to define data replication across nodes:
CREATE KEYSPACE company
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
Then:
USE company;
CREATE TABLE employees (id UUID PRIMARY KEY, name text, department text);
Replication ensures that your data remains available even if one node fails.
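You can confirm the replication settings at any time from cqlsh:
DESCRIBE KEYSPACE company;
With a replication factor of 3 on a three-node cluster, every node holds a full copy of the keyspace, so the cluster tolerates the loss of a node without losing data.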
Run from one node:
cqlsh 192.168.1.101
Insert test data:
INSERT INTO company.employees (id, name, department) VALUES (uuid(), 'John', 'IT');
Then connect to another node and query the table:
cqlsh 192.168.1.102
SELECT * FROM company.employees;
If the row appears, replication is working.
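For a stricter check, you can raise the consistency level inside cqlsh before reading, so the query must be acknowledged by a majority of replicas (CONSISTENCY is a built-in cqlsh command):
CONSISTENCY QUORUM
SELECT * FROM company.employees;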
Useful nodetool commands for ongoing maintenance:
| Command | Description |
|---|---|
| nodetool status | Shows cluster health and node state |
| nodetool repair | Synchronizes replicas to fix data inconsistencies |
| nodetool cleanup | Removes data the node no longer owns (e.g., after topology changes) |
| nodetool ring | Displays token distribution across nodes |
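For example, a repair can be limited to a single keyspace, which keeps the operation shorter; run it on each node, ideally during periods of low traffic:
nodetool repair company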
You can also integrate Prometheus + Grafana for advanced monitoring.
You’ve successfully learned how to configure a Cassandra cluster across multiple nodes. By setting up proper seed nodes, replication strategies, and snitches, you ensure that your cluster is scalable, fault-tolerant, and high-performing.
This foundation will help you manage large-scale distributed applications effectively.