In Apache Cassandra Lunch #55: Migrating PostgreSQL to Cassandra, we discuss the differences between relational and non-relational databases, as well as the data modeling process for Cassandra. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
Before we talk about the advantages and disadvantages of relational databases like PostgreSQL, it's important to note that neither type is better than the other; rather, each has suitable use cases. Starting with relational databases: SQL databases provide ACID transactional guarantees and query data using JOINs on top of normalized relational data models. They offer consistent global secondary indexes, scale vertically, and are great for medium-sized data.
The problem arises when we need to scale such a database. To scale, we either have to buy increasingly expensive hardware or shard manually. Neither option is ideal when working with Big Data, and manual sharding leaves us with a large team coordinating schema changes and moving data by hand.
The solution to our scaling problem is a non-relational database such as Cassandra. Cassandra scales linearly, is highly available (meaning no downtime), and is easy to manage operationally. This solves many of our problems when handling Big Data, but Cassandra has a unique data model, and it takes some getting used to if you want to use it effectively.
In Cassandra, data is replicated automatically, and the replication factor is set per keyspace. Consistency levels can be configured to balance availability against data accuracy, either for an entire session or per individual read or write operation. Within cqlsh, the CONSISTENCY command sets the consistency level for all queries in the current session.
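To make this concrete, here is a minimal CQL sketch; the keyspace name `cycling` and the datacenter name `dc1` are placeholder assumptions for illustration, not from the talk:

```sql
-- Replication is configured per keyspace; here datacenter 'dc1' keeps 3 copies.
CREATE KEYSPACE IF NOT EXISTS cycling
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- In cqlsh, set the consistency level for all queries in the current session.
CONSISTENCY QUORUM;
```

With a replication factor of 3, QUORUM requires 2 of the 3 replicas to acknowledge each read or write, trading a little availability for stronger data accuracy.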
Now, the tricky part about migrating from PostgreSQL to Cassandra is how we model our data for Cassandra. The process differs in many ways, and we will cover the basics here.
The first thing to notice is that Cassandra is query-driven. The way to minimize partition reads is to model your data to fit your queries. Don't model around relations. Don't model around objects. Model around your queries. Try to determine exactly what queries you need to support up front; this is different from relational databases like PostgreSQL, where you think about queries after you have defined your schema. Primary keys are also different and carry much more meaning in Cassandra.
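As a sketch of what query-first design looks like in CQL (the `videos_by_user` table and its columns are hypothetical examples, not from the talk): if the application needs the query "find all videos uploaded by a given user, newest first," we create a table shaped exactly for that query:

```sql
-- One table per query: partitioned by the value we filter on,
-- clustered by the ordering the query needs.
CREATE TABLE videos_by_user (
  user_id     uuid,
  uploaded_at timestamp,
  video_id    uuid,
  title       text,
  PRIMARY KEY ((user_id), uploaded_at, video_id)
) WITH CLUSTERING ORDER BY (uploaded_at DESC, video_id ASC);

-- The query this table was modeled for reads a single partition:
SELECT video_id, title FROM videos_by_user WHERE user_id = ?;
```

In PostgreSQL you would normalize videos and users into separate tables and JOIN them; in Cassandra the query itself dictates the table, even if that means duplicating data across several tables.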
To get the best out of Cassandra, we need to make sure our data model is done right, and to do that we follow the Cassandra modeling methodology. First, we create our Conceptual Data Model. This is where we take a step back and identify entities, relationships, and their attributes.
This part of the modeling does not require developer skills, which means people without a technical background can also be involved in the process.
The next step in the process is to create our application workflow. This is generally where we determine what queries we need for our application. Remember, we are modeling our data around our queries, so we have to determine what those queries will be early in the process.
After we have these two parts ready, we can move on to the Logical Data Model, which combines the conceptual model and the application workflow into a table design for each query.
Now that we have the Logical Data Model, it's time to transform it into the Physical Data Model, and to do that we just have to assign CQL types to each column.
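As a sketch of this step (the table and column names are hypothetical, not from the talk), assigning CQL types turns a logical design such as "comments for a video, ordered by time" into a concrete schema:

```sql
-- Physical data model: the logical design plus concrete CQL types.
CREATE TABLE comments_by_video (
  video_id   uuid,       -- partition key: the logical 'video' entity key
  created_at timeuuid,   -- clustering column: time-ordered and collision-free
  author     text,
  comment    text,
  PRIMARY KEY ((video_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
```

Choosing `timeuuid` over a plain `timestamp` here is a common design choice: it preserves time ordering while guaranteeing uniqueness if two comments arrive in the same millisecond.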
A key thing to understand when migrating from PostgreSQL to Cassandra: as mentioned before, the primary key works quite differently in the two systems. In fact, this is the most important thing to get right when modeling with Cassandra, so let's dive in and look at exactly what a partition key and a clustering column are.
Let's start by comparing CQL and SQL: they are quite similar, and if you have worked with SQL, picking up CQL will be no challenge. The partition key has a special use in Apache Cassandra beyond establishing the uniqueness of a record in the database. Its other purpose, one that is very critical in distributed systems, is determining data locality. When data is inserted into the cluster, the first step is to apply a hash function to the partition key. The output determines which node (and which replicas) will get the data. Apache Cassandra uses the Murmur3 hash function, which takes an arbitrary input and produces a consistent token value. That token value falls inside the range of tokens owned by a single node. In simpler terms, a partition key always belongs to one node, and that partition's data will always be found on that node.
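A CQL sketch of this anatomy (the `rides_by_driver` table and its columns are hypothetical): the first component of the PRIMARY KEY is the partition key, which is hashed to place the row on a node; the remaining components are clustering columns, which sort rows within that partition:

```sql
CREATE TABLE rides_by_driver (
  driver_id  uuid,       -- partition key: hashed with Murmur3 to pick the node
  ride_start timestamp,  -- clustering column: orders rows inside the partition
  ride_id    uuid,       -- second clustering column: guarantees row uniqueness
  fare       decimal,
  PRIMARY KEY ((driver_id), ride_start, ride_id)
);

-- token() exposes the Murmur3 token Cassandra computed for each partition key:
SELECT token(driver_id), ride_start, fare FROM rides_by_driver;
```

All rows sharing a `driver_id` live in one partition on one node (plus its replicas), which is why a query filtering on the full partition key is a fast single-partition read.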
Hopefully, this gives you enough information to start the process, but it is just scratching the surface. There is a lot more to learn before you can actually migrate from PostgreSQL to Cassandra, so make sure to research deeply and do it right.
As mentioned above, the live recording, which includes a live walkthrough of the demo, is embedded below, so be sure to check it out and subscribe. Also, if you missed last week's Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra Part 2, be sure to check that out as well!
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!