Spark and Cassandra: Doing SQL and Joins on Cassandra Tables

In this blog post, we will introduce Spark, a unified analytics engine for large-scale data processing, and discuss how to use it to do SQL on a NoSQL database like Cassandra. We will also give you a quick demo to show how you can quickly test it out yourself. This is Part 2 of our series on “Doing SQL and Reporting on Apache Cassandra with Open Source Tools”, and part 1 is linked below. Also, be on the lookout for Parts 3 and 4 coming soon!

Apache Spark is a unified analytics engine for large-scale data processing. The most recent version that was released on 06/18/2020 is Spark 3.0.0. Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. It also offers over 80 high-level operators that make it easy to build parallel apps. Users can also use it interactively from the Scala, Python, R, and SQL shells. For our demo in this blog and in the video linked below, we will be using the Spark shell; however, the same commands can be used in the PySpark shell. If using the Spark-SQL shell, you do not have to append the spark.sql("") and .show as seen below when using the queries.

spark shell
Spark Shell
PySpark Shell
PySpark Shell
Spark-SQL Shell
Spark-SQL Shell

Spark also powers various libraries, which include SQL and DataFramesMLlib for machine learning, GraphX, and Spark Streaming. Users can combine these libraries seamlessly in the same application. We will mainly be focusing on the SQL library; however, if you want to check out some content on using Spark and Cassandra for Machine Learning, one of our team members Obioma Anomnachi has a series which you can check out by hitting the link on his name (blogs + videos with demos).

Spark can be run using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Users can access data in HDFSAlluxioApache CassandraApache HBaseApache Hive, and hundreds of other data sources. We will be focusing on Apache Cassandra for this blog and demo.

We will also take a look at the Spark UI that is run at Address:4040 when the Spark session is started to see the breakdowns of our queries, jobs, stages, DAG visualizations, and etc. More images of what the UI looks like can be seen at the link above.

Spark UI
Spark UI

In order to make the demo as streamlined as possible, we will be using DataStax Enterprise for Cassandra, Spark, Spark Shell, and Spark UI; as well as, DataStax Studio for importing and using the notebook for the demo. If you choose to use a local instance of Apache Cassandra and a standalone instance of Apache Spark to try this demo out yourself, then you can by following this DataStax quick start guide. Also, if running this without DataStax Studio, you can still download the Notebook file (linked below), and open the code.txt file once extracted. You can then copy and paste the CQL commands and queries into a terminal running CQLSH instead of running it in DataStax studio like we will be doing in this demo.

To begin the demonstration, we will need to download the DataStax studio notebook that we have already created for you.

A general overview of the notebook is that we will create 2 tables, seed those two tables, and try to run a join query using CQL, which will not work. We will try that join statement in the Spark shell, however, and visualize the results. 

We will need to run the DataStax Enterprise Cassandra server in analytics mode, Datastax Enterprise Spark, and DataStax Studio server. You can choose however to run them, but we ran them in 3 separate tabs within 1 terminal with each tab cd’d into each respective directory. 

  • DataStax Enterprise Cassandra Server
    • cd dse-6.8.1/
    • bin/dse cassandra -k
  • DataStax Enterprise Spark
    • cd dse-6.8.1/
    • bin/dse spark
  • DataStax Studio
    • cd datastax-studio-6.8.1/
    • bin/server.sh

Once we have all 3 servers running, we can open localhost:4040 for the Spark UI, and localhost:9091 for the Datastax Studio UI as seen below.

Demo Spark UI
Demo Spark UI
Datastax Studio UI
Datastax Studio UI

Now we can drag and drop the downloaded notebook file into Datastax Studio, which should import it. Once imported, we can open it.

Demo Notebook
Demo Notebook

Moving back to the notebook, we will start with the first cell. We can run the first cell by hovering over the cell and clicking Run LOCAL.QUORUM, which will create our demo keyspace.

Moving the terminal running Spark shell, if we run spark.sql("show schemas").show, we will see our newly created demo schema/keyspace.

Running spark.sql("show schemas").show in Spark Shell
Running spark.sql("show schemas").show in Spark Shell

Now, we will move back to the notebook, and run the second cell. Before running the cell, we will need to select the keyspace demo, and then hit Run LOCAL.QUORUM. Once that has run, we can run spark.sql("use demo") in the Spark shell to access the demo keyspace. After, we can then run spark.sql("show tables").show to confirm we have no tables.

Running show tables in Spark Shell
Running show tables in Spark Shell

We will now move onto the 2nd, 3rd, and 4th cells in the notebook. The 2nd cell will create our first table, the 3rd cell will seed that table, and the 4th cell will confirm that the table is seeded. Go ahead and run cells 2-4, and remember to select demo as the keyspace. Once the 4th cell runs, you will see the result of the query from a CQL aspect. Once that is completed, we will confirm that the table was created and seeded correctly by running the same query we ran in cell 4 and running it in the Spark shell: spark.sql("SELECT * FROM spacecraft_journey_catalog where spacecraft_name = 'vostok1'").show.

Running SELECT * FROM spacecraft_journey_catalog where spacecraft_name = 'vostok1' in Spark Shell
Running SELECT * FROM spacecraft_journey_catalog where spacecraft_name = 'vostok1' in Spark Shell

With this, we have now confirmed that Spark can connect to our local instance of Cassandra and make queries on it.

Also, we can check out our query analytics in the Spark UI at localhost:4040.

Visualizing the SELECT * FROM spacecraft_journey_catalog where spacecraft_name = 'vostok1' query in Spark UI
Visualizing the SELECT * FROM spacecraft_journey_catalog where spacecraft_name = 'vostok1' query in Spark UI

Moving back to the notebook, we can run cells 5, 6, and 7, and repeat the steps we did for cells 2, 3, and 4 to verify that the table was made, seeded correctly, and visualize the query analytics in the Spark UI.

Show Tables in Spark Shell after running cell 5
Show Tables in Spark Shell after running cell 5
Results of select * from spacecraft_speed_over_time in Spark Shell
Results of select * from spacecraft_speed_over_time in Spark Shell
Spark UI analytics of select * from spacecraft_speed_over_time query
Spark UI analytics of select * from spacecraft_speed_over_time query

We have now reached the end of the demo, and more importantly, the purpose of why we are using Spark with a NoSQL database like Cassandra. Say we want to get the speed of a spacecraft by spacecraft name, and also return the summary of the journey by spacecraft name as well with only one query. Will we be able to do that with only CQL? NOTE: The data model we are using is strictly for demo purposes to showcase how Spark can connect with Cassandra.

Take the command in the last cell of the notebook and try running it. It won’t work as we can’t do joins natively with CQL, and we would need to run at least 2 queries, one to each table, and then programmatically connect that data.

Error running join query with CQL
Error running join query with CQL

If we try running the same query in the Spark Shell, then it will work. We will get results of a join statement that joins 2 Cassandra tables based off of the journey_id in both tables.

Same join query, but run in Spark shell
Same join query, but run in Spark shell

We can also visualize the analytics of that join query in the Spark UI.

Spark UI analytics of the above query
Spark UI analytics of the above query

To also demonstrate some other cool things you can do with Spark’s UI, we can analyze queries by query. If we go the SQL tab in the UI, we can analyze the different show tables queries we made and use that for BI analysis.

Analyzing different show tables queries made during the demo

For this demo, we could use a BI tool, or manually plot the duration times of the show queries and make observations on them. However, these observations would not likely hold weight in the real world without a larger sample size and proper regression analysis.

With Spark, a unified analytics engine for large-scale data processing, we connected two NoSQL tables and ran a join query on them. This example demo only utilizes one data source, Cassandra, but if we want to use multiple supported data sources and connect them via Spark to run joins across them, then we can. This concludes part 2 of our “Doing SQL and Reporting on Apache Cassandra with Open Source Tools” series. Part 1 is linked below, and be on the lookout for parts 3 and 4!

If you want to watch this demo in video form, it is linked below:

Also, the slide deck associated with the video is linked below:

Doing SQL and Reporting on Apache Cassandra with Open Source Tools

  1. Presto and Cassandra
  2. Spark and Cassandra
  3. Open Source Notebooks and Cassandra
  4. Open Source BI Tools and Cassandra

Cassandra.Link

Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.

We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!