

Cassandra is a distributed NoSQL database. Data is spread across partitions for scalability and resiliency. It is a wide-column store (not a column-oriented one): every row belongs to a table, but each row may populate a different subset of that table's columns.

Rows are assigned to partitions based on their partition key. When we define the primary key of a table, its first component is the partition key. Cassandra hashes the partition key to a token, and the token determines which node stores the row. Consistent hashing is used to distribute the data evenly among nodes. Cassandra favors availability over strong consistency: it is eventually consistent by default, and the consistency level can be tuned per query.
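
To make the hashing step concrete, here is a minimal consistent-hashing sketch in Python. It is a toy model, not Cassandra's implementation: the `HashRing` class, the node names and the number of virtual nodes are all made up for illustration, and MD5 is used only because it ships with the standard library (Cassandra's default partitioner uses Murmur3).

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is a stand-in hash here)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring: each node owns several virtual positions."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (token, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Place `vnodes` virtual positions for this node on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (token(f"{node}#{i}"), node))

    def node_for(self, partition_key: str) -> str:
        """Walk clockwise to the first virtual node at or after the key's token."""
        idx = bisect.bisect_left(self._ring, (token(partition_key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # grow the cluster by one node
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

The useful property is that adding a node only moves keys onto the new node. Keys that stay on the old nodes keep their placement, which is why a cluster can grow without reshuffling everything.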

While the partition key decides where the data lives, clustering keys (the remaining components of the primary key) keep the rows inside a partition sorted for easier retrieval. A query should always specify the partition key. Without it, Cassandra has to scan every partition (and rejects such a query unless ALLOW FILTERING is added), which degrades performance.
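
The difference between reading one partition and scanning all of them can be pictured with a small in-memory model. This is only a sketch: `ToyTable`, the sensor data and the `partitions_read` counter are hypothetical names for illustration, and real Cassandra stores partitions on disk across many nodes.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)  # partition key -> [(clustering key, row)]
        self.partitions_read = 0  # how many partitions reads have touched

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        idx = bisect.bisect_left(keys, clustering_key)
        if idx < len(rows) and rows[idx][0] == clustering_key:
            rows[idx] = (clustering_key, row)  # same primary key: overwrite
        else:
            rows.insert(idx, (clustering_key, row))  # keep clustering order

    def read_partition(self, partition_key, start=None, end=None):
        """Read one partition, optionally slicing a clustering-key range."""
        self.partitions_read += 1
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = 0 if start is None else bisect.bisect_left(keys, start)
        hi = len(rows) if end is None else bisect.bisect_right(keys, end)
        return [row for _, row in rows[lo:hi]]

    def scan(self, predicate):
        """Without a partition key, every partition has to be visited."""
        result = []
        for rows in self._partitions.values():
            self.partitions_read += 1
            result.extend(row for _, row in rows if predicate(row))
        return result


table = ToyTable()
table.insert("sensor-1", "2023-01-03", {"temp": 21})
table.insert("sensor-1", "2023-01-01", {"temp": 19})
table.insert("sensor-1", "2023-01-02", {"temp": 20})
table.insert("sensor-2", "2023-01-01", {"temp": 30})

# Partition key known: one partition is read, rows come back in clustering order.
in_range = table.read_partition("sensor-1", start="2023-01-01", end="2023-01-02")

# Partition key unknown: every partition is visited.
hot = table.scan(lambda row: row["temp"] >= 21)
print(in_range, hot, table.partitions_read)
```

With the partition key, the lookup touches a single partition and the range slice is cheap because the rows are already sorted. The scan has to visit every partition, and in a real cluster those partitions are spread over many nodes.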

CREATE TABLE world.people (
    name text PRIMARY KEY,
    age int,
    dob timestamp
);

INSERT INTO world.people
(name, age, dob)
VALUES ('kev', 12, toTimestamp(now()));

INSERT INTO world.people
(name, age)
VALUES ('vek', 21);

Here name is the partition key which will be used to assign the row to a partition.

Conceptually, the data is stored like this:

people {
    row1: {name: kev, age: 12, dob: 20230101}
    row2: {name: vek, age: 21}
}

A few things to note:

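  • Since data is distributed, joins are not supported; data is usually denormalized instead.
  • A row does not need a value for every column.
  • Adding a node automatically rebalances the data across the cluster, with zero downtime.
  • Data is not removed immediately when it is deleted. A delete writes a tombstone marker, each cell carries a write timestamp so that the latest write wins, and the old data is purged later during compaction.
  • The primary key uniquely identifies a row. Inserting a row with an existing primary key overwrites it.
  • General-purpose ACID transactions are not supported; only lightweight (compare-and-set) transactions are available.
  • High write throughput.
  • Reads are fast when the partition key is known.
  • The schema is flexible.
  • Scalable, reliable and available.
  • No ad hoc queries: tables are designed around the queries they must serve.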
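
The note above about cell versions and deletion can be pictured with a small model. This is a sketch under simplifying assumptions: `ToyRow` and `TOMBSTONE` are hypothetical names, timestamps are plain integers, and the handling of equal timestamps is simplified. The idea it shows is that every cell carries a write timestamp, the newest write wins, and a delete is itself a write of a marker that is cleaned up later.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class ToyRow:
    """Toy model of one row: column name -> (write timestamp, value)."""

    def __init__(self):
        self._cells = {}

    def write(self, column, value, timestamp):
        current = self._cells.get(column)
        # Last write wins: keep the cell with the highest timestamp.
        if current is None or timestamp >= current[0]:
            self._cells[column] = (timestamp, value)

    def delete(self, column, timestamp):
        # A delete is just another write, carrying a tombstone.
        self.write(column, TOMBSTONE, timestamp)

    def read(self):
        """Return only live columns; tombstoned cells are hidden."""
        return {c: v for c, (_, v) in self._cells.items() if v is not TOMBSTONE}

    def compact(self):
        """Drop tombstones for good (real compaction waits for a grace period)."""
        self._cells = {c: cell for c, cell in self._cells.items()
                       if cell[1] is not TOMBSTONE}


row = ToyRow()
row.write("age", 12, timestamp=1)
row.write("age", 13, timestamp=3)
row.write("age", 99, timestamp=2)  # arrives late with an older timestamp: ignored
row.write("dob", "2023-01-01", timestamp=1)
row.delete("dob", timestamp=4)

print(row.read())
```

Because writes are reconciled by timestamp rather than by arrival order, replicas that receive the same writes in a different order still converge to the same row.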


