
Apache Cassandra Lunch #70: Basics of Apache Cassandra

In Apache Cassandra Lunch #70: Basics of Apache Cassandra, we discussed why we need Apache Cassandra and looked at some of its features. After covering that background, we went a step further and set up a standalone Apache Cassandra instance, then used the Cassandra Query Language (CQL) to interact with the database. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!

Basics of Apache Cassandra

Recently, Cassandra has gained a large audience all over the world, and many companies are using it to deploy their big data applications. Why are companies moving towards Apache Cassandra? Why have thousands of organizations decided to trust Cassandra with their applications? To answer these questions, we are going to look at the basics of Cassandra and at what makes it so powerful.

In this article, we are going to look at:

  1. What is Apache Cassandra?
  2. Features of Apache Cassandra
  3. Some of the components that make up Cassandra
  4. Data Modelling in Apache Cassandra
  5. Common terminologies used in Apache Cassandra
  6. Installing a standalone Cassandra using Docker
  7. Querying the Cassandra database

What is Apache Cassandra?

Apache Cassandra is a fast, open-source distributed database built for high availability and linear scalability. Scalability is inherent in its design, which is what makes Cassandra a good fit for big data applications. The idea behind Cassandra is that there should be no single point of failure: every node in the cluster has the same role, there is no master or slave relationship, and your data is replicated across your cluster and data centers.

In Cassandra, data is automatically replicated to the data centers and nodes of your choice. Just as in the relational world, Cassandra has a native language that lets you work with its databases easily. With the Cassandra Query Language (CQL) you can write simple queries to manipulate your data (DML) and to perform data definition operations (DDL). As you might already know, Cassandra is a NoSQL database, like MongoDB and Apache HBase; its storage structures are designed for fast writes and lookups by key rather than for relational joins. In the next section, we are going to look at some of the features of Apache Cassandra.

Features of Apache Cassandra

Before we go into the other cool features of Cassandra, we will look at its features in relation to the CAP theorem proposed by Eric Brewer.

The CAP Theorem
Venn diagram of the CAP theorem

The CAP theorem describes a limit on distributed data systems: a system can guarantee only two of the three properties named in Brewer’s theorem. The three properties are consistency, availability, and partition tolerance. Because network partitions cannot be ruled out in a distributed system, what the theorem is saying in practice is that a data store has to trade off consistency against availability when a partition occurs. Cassandra is usually described as favoring availability and partition tolerance, with consistency that you can tune. We are going to look at each of these properties below.

Availability: 

Cassandra performs both its write and read operations in a distributed manner, which is why it can stay available. It can write to a variable number of nodes and read from a variable number of nodes across data centers. Cassandra’s real strength shows when you deploy an architecture that has multiple data centers with multiple nodes. Such a setup allows Cassandra to distribute replicas across data centers, so data remains available even when some nodes are down.

Diagram of reading from a datacenter with a replication factor (RF) of 3

Consistency: 

Consistency in Cassandra is achieved by requiring one node, or a variable number of nodes, in the cluster to acknowledge a read or write operation. A read should return up-to-date data, and a write must be acknowledged by the number of replicas that you specify before it is confirmed. How many copies of the data exist (the replication factor) is configured when you create a keyspace, while the number of replicas that must respond (the consistency level) is chosen per request or per session. Cassandra’s consistency is tunable: you can have strong or eventual consistency, or linearizable consistency through lightweight transactions. Which one you get depends on the replication factor of your keyspace and the consistency levels you use for reads and writes.
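The trade-off between acknowledgements and consistency comes down to a little arithmetic. The sketch below is a minimal illustration in Python, not Cassandra code: the function names are hypothetical, and it assumes the commonly cited rule that a read is guaranteed to see the latest acknowledged write when the replicas read plus the replicas written exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


def can_satisfy(required_replicas: int, replication_factor: int,
                replicas_down: int) -> bool:
    """True when enough replicas are still up to acknowledge the request."""
    return replication_factor - replicas_down >= required_replicas


rf = 3
q = quorum(rf)
print(q)                                    # 2
print(is_strongly_consistent(q, q, rf))     # True: quorum reads + quorum writes
print(is_strongly_consistent(1, 1, rf))     # False: eventual consistency only
print(can_satisfy(q, rf, replicas_down=1))  # True: still available
print(can_satisfy(q, rf, replicas_down=2))  # False: quorum cannot be reached
```

With a replication factor of 3, quorum reads and writes stay both consistent and available while one replica is down, which is one reason this combination is a popular starting point.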

Partition Tolerance:

One of the areas where Cassandra shines is its ability to continue operating even when part of the cluster becomes unavailable. There is no single point of failure because all nodes in a cluster gossip with each other, so they know when one or more nodes are down. Because data is replicated across different datacenters, racks, and nodes, the remaining replicas can serve requests for a node as soon as it is noticed to be down.

Although Cassandra can run on a single server, a deployment like this is prone to failure, and there is always additional overhead when you need to scale. Setting up Cassandra on a single machine is done for development purposes. For production, the recommended design is to deploy your infrastructure for high availability, partition tolerance, and a consistency level that fits your application’s needs.

CQL is Simple to Use:

Cassandra Query Language is the primary language used for communicating with Cassandra. CQL is simple to use, just like Structured Query Language (SQL). With CQL you can perform data definition operations and manipulate your data. CQL syntax is readable and easy to use, and if you come from the relational database world, where SQL is one of the simplest languages you have worked with, it will feel familiar.

Linear Scalability: 

Cassandra can scale linearly by adding more servers when the need arises; this capability is built in. One of the things I personally like about Cassandra is that there is no master or slave node relationship: every node has an equal chance of being chosen as a coordinator node. You can add more nodes to your existing cluster without disrupting ongoing read or write operations. Adding more nodes to an existing cluster is known as horizontal scaling, while adding more CPU, RAM, or disk space to an existing node to increase its computing capability is known as vertical scaling.

Now that we know why Cassandra is super cool, let’s go into a few important components that make up Apache Cassandra. 

Some of the components that Make up Cassandra

Cluster: A cluster contains one or more datacenters. It is a group of machines connected to function as a single system.

Datacenter: A datacenter groups related racks together; the racks are configured together for the purpose of replication.

From the Basics of Apache Cassandra presentation, this is a diagram showing a rack with four nodes.

Rack: A combination of nodes makes up a rack. A node always has its own CPU, memory, and storage disk. A rack, by contrast, has no CPU, memory, or storage disk of its own, because it is a grouping of multiple nodes that are connected in order to share resources.

From the basics of Apache Cassandra presentation, this is the diagram of a four rack cluster.

Node: A node is a single running instance of Cassandra. Each node can own several token ranges, which are referred to as virtual nodes (vnodes). Nodes are the entities that handle all big data workloads. There is no master-slave orientation here: all nodes are designed the same way and communicate with each other as peers. This design is called a peer-to-peer distributed architecture.

Commit Log: The commit log is an append-only log in which every write is recorded sequentially on disk. This happens on the local Cassandra node where the write operation is performed. One of the advantages of Cassandra is that writes are fast, and sequential appends to the commit log help make this possible. Data is written to the commit log and also to the MemTable. The purpose of the commit log is to avoid data loss: if a node fails before the MemTable is flushed, the writes can be replayed from the log.

MemTable: As the name suggests, a MemTable is an in-memory structure that Cassandra writes to. Ideally, a table has a single active MemTable, in which its data is kept sorted by key. Once the configured threshold is met, the data is flushed to disk.

Sorted String Table (SSTable): SSTables are immutable data files that Cassandra, like some other NoSQL databases, uses to persist data on disk. The SSTables hold the data that has been flushed from the MemTable.
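To see how the commit log, MemTable, and SSTables fit together, here is a toy model of the write path in Python. It is only a sketch under simplifying assumptions: the class and method names are hypothetical, the flush threshold counts rows rather than bytes, and a read simply returns the first match from the newest SSTable instead of merging values by timestamp as a real node would.

```python
class ToyTable:
    """Toy model of one table's write path on a single node."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical: rows, not bytes
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory writes, keyed by row key
        self.sstables = []     # immutable, sorted snapshots standing in for disk files

    def write(self, key, value):
        self.commit_log.append((key, value))  # record the write for durability
        self.memtable[key] = value            # then apply it in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted, immutable SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed writes no longer need replaying

    def recover(self):
        """Rebuild the memtable from the commit log, as after a crash."""
        self.memtable = dict(self.commit_log)

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            for k, v in sstable:
                if k == key:
                    return v
        return None


table = ToyTable(flush_threshold=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 10)]:
    table.write(key, value)

print(len(table.sstables))  # 1
print(table.read("a"))      # 10 (newer value still in the memtable)
print(table.read("b"))      # 2  (served from the flushed SSTable)
```

Note that the flushed SSTable is never modified: the update to "a" lands in a fresh MemTable, and the read path is what reconciles old and new data.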

Terminologies used in Cassandra

Primary Key: The primary key of a Cassandra table is super important. It contains the partition key (one or more columns) and, optionally, one or more clustering columns. It serves two purposes: it uniquely identifies each row, and its partition key part is used to distribute the data across nodes. The clustering columns define how rows are ordered within a partition. A primary key can consist of a partition key alone or of a partition key and clustering columns together.

Partition: In Cassandra, partitions are used to distribute the data across the nodes in a cluster. A hash function, called the partitioner, computes a token from each row’s partition key, and that token determines which nodes store the row. With the default Murmur3Partitioner the rows of a table are distributed evenly across the nodes; the alternatives are RandomPartitioner and ByteOrderedPartitioner. The partition key is what makes this distribution possible.
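The mapping from partition key to node can be sketched as a small token ring. This is an illustration rather than Cassandra’s implementation: the names TokenRing, token_for, and TOKEN_SPACE are hypothetical, MD5 from the Python standard library stands in for the Murmur3 hash, each node gets a single evenly spaced token instead of many vnodes, and replicas are picked by simply walking clockwise around the ring.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 64  # hypothetical ring size for this sketch


def token_for(partition_key):
    """Hash a partition key to a token (MD5 here, as a stand-in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    def __init__(self, node_names):
        # One evenly spaced token per node; each node owns the range that
        # ends at its token.
        step = TOKEN_SPACE // len(node_names)
        self.ring = [((i + 1) * step - 1, name)
                     for i, name in enumerate(node_names)]
        self.tokens = [token for token, _ in self.ring]

    def replicas(self, partition_key, replication_factor=1):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_left(self.tokens, token_for(partition_key))
        count = min(replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(count)]


ring = TokenRing(["node1", "node2", "node3", "node4"])
for userid in ["9876", "1234"]:
    print(userid, ring.replicas(userid, replication_factor=3))
```

Because the token depends only on the partition key, every node can work out where a row lives without asking a central coordinator.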

Clustering Column: A clustering column defines the sort order of rows within a partition of a table. Together with the partition key, it also makes each row unique. A table can have multiple clustering columns, and the data is stored in this order.

Keyspaces: A keyspace is the top-level object in a cluster, and it controls how data is replicated across nodes. Keyspaces contain tables, and tables contain columns.

Partition Key: The partition key is the column, or group of columns, declared first in a table’s primary key; it is the part that is hashed to decide which partition a row belongs to.
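The division of labor between the partition key and the clustering column can be shown with plain Python data structures. This is a sketch only: build_partitions and the sample rows (including the voted_at and item columns) are hypothetical and made up for illustration.

```python
from collections import defaultdict


def build_partitions(rows, partition_key, clustering_column):
    """Group rows by partition key, then sort each group by clustering column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for group in partitions.values():
        group.sort(key=lambda row: row[clustering_column])
    return dict(partitions)


# Hypothetical rows: one partition per voter, ordered by vote time.
rows = [
    {"userid": "9876", "voted_at": 3, "item": "c"},
    {"userid": "1234", "voted_at": 1, "item": "x"},
    {"userid": "9876", "voted_at": 1, "item": "a"},
    {"userid": "9876", "voted_at": 2, "item": "b"},
]

partitions = build_partitions(rows, "userid", "voted_at")
print([row["item"] for row in partitions["9876"]])  # ['a', 'b', 'c']
print([row["item"] for row in partitions["1234"]])  # ['x']
```

All rows for one userid end up together and already sorted, which is why a query that names the partition key and reads a range of clustering values is cheap.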

Data Modelling in Apache Cassandra

Data modeling is at the heart of Apache Cassandra: its performance depends on how you model your database. You need to define your goals from the start, even before you set up the database, because you will build your tables around your use cases. Using the right data model will make your database consistent, tolerant to faults, and quick to respond whenever you need it. Now, let’s look into the basics of data modeling in Cassandra. The primary goals of data modeling in Cassandra are:

  • Availability
  • Partition Tolerance
  • Consistency

I will not go into detail here about how to model your database for each of these goals. For a start, though, you should know that you must design your deployment to support availability and partition tolerance, and to keep your data consistent.

Apache Cassandra Installation Guide

You can install Apache Cassandra using a Docker image, a tarball binary file, a package installation (RPM, YUM), or Kubernetes. Installing Cassandra with Docker is simple. You’ll need to install Docker Desktop for Mac or Windows, or have Docker installed on Linux. After you have installed Docker, you can pull the appropriate image and then start Cassandra with the run command.

With the line below, we pull the latest Cassandra image from Docker Hub.

docker pull cassandra:latest

With the line below, we run the Cassandra image we pulled in the previous step. The -d flag runs the container in the background so the terminal stays free for the next command.

docker run -d --name cass_cluster cassandra:latest

The line below opens the CQL shell (cqlsh) inside the running cass_cluster container. Cassandra may take a little while to start, so retry if the connection is refused at first.

docker exec -it cass_cluster cqlsh

With this, you can now start interacting with Apache Cassandra. 

Querying the Cassandra Database

Cassandra Query Language shares many keywords with SQL, and CQL supports CRUD operations. The best practice for creating a table in Cassandra is to design it around what your read queries will look like. Try to create a table in such a way that a minimum number of partitions needs to be read, and avoid reading all the rows of your data; remember, you might have anywhere from a million to a billion rows in your database. Navigate to the cqlsh command line and follow along with the commands below to perform different operations.

  • Create your keyspace (database) with CQL as follows:

CREATE KEYSPACE IF NOT EXISTS voting_system WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };

  • You can list the Cassandra keyspaces with the following cqlsh command (DESCRIBE TABLE works similarly for a table):

DESCRIBE KEYSPACES;

  • Create your table with the CQL command below:

CREATE TABLE IF NOT EXISTS voting_system.voters (
    userid text PRIMARY KEY,
    item_count int,
    last_update_timestamp timestamp
);

  • Insert some data into your table:

INSERT INTO voting_system.voters (userid, item_count, last_update_timestamp)
VALUES ('9876', 2, toTimestamp(now()));

INSERT INTO voting_system.voters (userid, item_count, last_update_timestamp)
VALUES ('1234', 5, toTimestamp(now()));

Collection Columns

One of the advantages of using Cassandra is that it can store semi-structured data. It is worth leaving room for this when you design your database. There are two different ways to nest data in Cassandra: you can use multi-row partitions or you can use collection types. Cassandra supports many data types, but here we are going to look at collection columns. Collection columns are used to combine more than one value into a single column, and you are at liberty to adapt them to the needs of your big data application. In Cassandra, there are three collection types, set, list, and map; tuples and user-defined types (UDTs) also let you group several values into one column.

  • Sets are groups of unique values whose insertion order is not preserved. They are enclosed in curly brackets, with the values separated by commas. An example looks like this: {'Strawberry', 'Apple', 'Banana'}
  • Lists are ordered groups of values, enclosed in square brackets. A list can contain multiple values, including duplicates, that belong to one entity. A list is written as ['Strawberry', 'Apple', 'Banana']
  • Maps hold key-value pairs, in a question-and-answer style. Maps are also enclosed in curly brackets, and all values in a map share one declared type. An example of a map is {'State': 'New York', 'Fruit': 'Apple', 'Age': '90'}

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!