Secondary Indexes in Cassandra: A Complete Guide
Cassandra Secondary Index architecture diagram showing index table and base table relationship
Introduction
In Apache Cassandra, queries are usually performed using the primary key (partition key + clustering key). However, in many real-world scenarios, you may need to query data using non-primary key columns. This is where Secondary Indexes come into play.
A Secondary Index allows you to efficiently query columns that are not part of the primary key, expanding the flexibility of data retrieval in Cassandra.
A Secondary Index in Cassandra is a data structure that provides an alternate lookup path for querying a table using non-primary key columns.
Let’s say you have the following table:
CREATE TABLE users (
user_id UUID PRIMARY KEY,
name text,
email text,
city text
);
If you frequently need to query users by city, you can create an index on the city column:
CREATE INDEX ON users (city);
Now, you can run queries like:
SELECT * FROM users WHERE city = 'Delhi';
Without this index, Cassandra would not allow such a query because city is not part of the primary key.
Internally, Cassandra creates a hidden table to store mappings of indexed column values to the primary key of the original table.
When you query an indexed column:
Cassandra looks up the value in the index table.
It retrieves the corresponding primary key(s).
Then, it fetches the full data from the main table.
This lookup mechanism makes Secondary Indexes convenient but not always optimal for large datasets.
Secondary Indexes can be helpful in specific cases, but they should be used with caution.
When the cardinality (number of unique values) of the indexed column is low or moderate.
When queries are not frequent and data volume is relatively small.
When the indexed column is evenly distributed across partitions.
You have a high number of unique values (e.g., indexing a timestamp column).
The table has a very large number of rows.
Queries involve multiple conditions or filtering across different indexed columns.
CREATE TABLE products (
product_id UUID PRIMARY KEY,
name text,
category text,
price double
);
CREATE INDEX category_index ON products (category);
SELECT * FROM products WHERE category = 'Electronics';
This will return all products belonging to the 'Electronics' category.
Cassandra supports multiple indexing mechanisms depending on the version and use case.
Default index type created using CREATE INDEX.
Best for small, evenly distributed datasets.
Introduced in Cassandra 3.x.
Supports text search and range queries.
Example:
CREATE CUSTOM INDEX ON products (name) USING 'org.apache.cassandra.index.sasi.SASIIndex';
Highly optimized indexing engine.
Supports multiple columns, numeric, and range queries.
Example:
CREATE CUSTOM INDEX ON products (price) USING 'StorageAttachedIndex';
DROP INDEX category_index;
DESCRIBE TABLE products;
REBUILD INDEX ON products (category_index);
| Factor | Recommendation |
|---|---|
| Cardinality | Use for low to medium cardinality columns only. |
| Query frequency | Avoid for frequent queries on large datasets. |
| Write-heavy workloads | Indexes can slow down writes because they must update both the base and index tables. |
| Replication factor | Choose replication carefully to avoid overloading specific nodes. |
| Monitoring |
Use nodetool cfstats and nodetool tablestats to analyze index performance. |
If Secondary Indexes are not suitable, consider these alternatives:
Automatically maintain a denormalized version of data for specific queries.
CREATE MATERIALIZED VIEW users_by_city AS
SELECT * FROM users WHERE city IS NOT NULL PRIMARY KEY (city, user_id);
Store the same data in multiple tables based on access patterns.
For complex queries, integrate Cassandra with a search engine like Elasticsearch or Apache Solr.
Avoid indexing columns with high cardinality.
Do not create multiple indexes on a single table.
Use SASI or SAI for range and full-text queries.
Regularly monitor index performance and rebuild if necessary.
Consider query volume and write throughput before adding an index.
Secondary Indexes in Cassandra provide flexibility for querying non-primary key columns, but they must be used judiciously. For large-scale or high-throughput systems, they can become a performance bottleneck if not designed properly.
Always analyze your query patterns, data distribution, and workload before implementing an index. In most cases, denormalization or materialized views offer more scalable alternatives.