In this blog post, we will introduce Spark, a unified analytics engine for large-scale data processing, and discuss how to use it to do SQL on a NoSQL database like Cassandra. We will also give you a quick demo to show how you can quickly test it out yourself. This is Part 2 of our series on “Doing SQL and Reporting on Apache Cassandra with Open Source Tools”, and part 1 is linked below. Also, be on the lookout for Parts 3 and 4 coming soon!
Apache Spark is a unified analytics engine for large-scale data processing. The most recent version that was released on 06/18/2020 is Spark 3.0.0. Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. It also offers over 80 high-level operators that make it easy to build parallel apps. Users can also use it interactively from the Scala, Python, R, and SQL shells. For our demo in this blog and in the video linked below, we will be using the Spark shell; however, the same commands can be used in the PySpark shell. If using the Spark-SQL shell, you do not have to append the
.show as seen below when using the queries.
Spark also powers various libraries, which include SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Users can combine these libraries seamlessly in the same application. We will mainly be focusing on the SQL library; however, if you want to check out some content on using Spark and Cassandra for Machine Learning, one of our team members Obioma Anomnachi has a series which you can check out by hitting the link on his name (blogs + videos with demos).
Spark can be run using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Users can access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources. We will be focusing on Apache Cassandra for this blog and demo.
We will also take a look at the Spark UI that is run at
Address:4040 when the Spark session is started to see the breakdowns of our queries, jobs, stages, DAG visualizations, and etc. More images of what the UI looks like can be seen at the link above.
In order to make the demo as streamlined as possible, we will be using DataStax Enterprise for Cassandra, Spark, Spark Shell, and Spark UI; as well as, DataStax Studio for importing and using the notebook for the demo. If you choose to use a local instance of Apache Cassandra and a standalone instance of Apache Spark to try this demo out yourself, then you can by following this DataStax quick start guide. Also, if running this without DataStax Studio, you can still download the Notebook file (linked below), and open the code.txt file once extracted. You can then copy and paste the CQL commands and queries into a terminal running CQLSH instead of running it in DataStax studio like we will be doing in this demo.
To begin the demonstration, we will need to download the DataStax studio notebook that we have already created for you.
A general overview of the notebook is that we will create 2 tables, seed those two tables, and try to run a join query using CQL, which will not work. We will try that join statement in the Spark shell, however, and visualize the results.
We will need to run the DataStax Enterprise Cassandra server in analytics mode, Datastax Enterprise Spark, and DataStax Studio server. You can choose however to run them, but we ran them in 3 separate tabs within 1 terminal with each tab cd’d into each respective directory.
bin/dse cassandra -k
Once we have all 3 servers running, we can open
localhost:4040 for the Spark UI, and
localhost:9091 for the Datastax Studio UI as seen below.
Now we can drag and drop the downloaded notebook file into Datastax Studio, which should import it. Once imported, we can open it.
Moving back to the notebook, we will start with the first cell. We can run the first cell by hovering over the cell and clicking
Run LOCAL.QUORUM, which will create our
Moving the terminal running Spark shell, if we run
spark.sql("show schemas").show, we will see our newly created demo schema/keyspace.
Now, we will move back to the notebook, and run the second cell. Before running the cell, we will need to select the keyspace
demo, and then hit
Run LOCAL.QUORUM. Once that has run, we can run
spark.sql("use demo") in the Spark shell to access the
demo keyspace. After, we can then run
spark.sql("show tables").show to confirm we have no tables.
We will now move onto the 2nd, 3rd, and 4th cells in the notebook. The 2nd cell will create our first table, the 3rd cell will seed that table, and the 4th cell will confirm that the table is seeded. Go ahead and run cells 2-4, and remember to select
demo as the keyspace. Once the 4th cell runs, you will see the result of the query from a CQL aspect. Once that is completed, we will confirm that the table was created and seeded correctly by running the same query we ran in cell 4 and running it in the Spark shell:
spark.sql("SELECT * FROM spacecraft_journey_catalog where spacecraft_name = 'vostok1'").show.
With this, we have now confirmed that Spark can connect to our local instance of Cassandra and make queries on it.
Also, we can check out our query analytics in the Spark UI at
Moving back to the notebook, we can run cells 5, 6, and 7, and repeat the steps we did for cells 2, 3, and 4 to verify that the table was made, seeded correctly, and visualize the query analytics in the Spark UI.
We have now reached the end of the demo, and more importantly, the purpose of why we are using Spark with a NoSQL database like Cassandra. Say we want to get the speed of a spacecraft by spacecraft name, and also return the summary of the journey by spacecraft name as well with only one query. Will we be able to do that with only CQL? NOTE: The data model we are using is strictly for demo purposes to showcase how Spark can connect with Cassandra.
Take the command in the last cell of the notebook and try running it. It won’t work as we can’t do joins natively with CQL, and we would need to run at least 2 queries, one to each table, and then programmatically connect that data.
If we try running the same query in the Spark Shell, then it will work. We will get results of a join statement that joins 2 Cassandra tables based off of the journey_id in both tables.
We can also visualize the analytics of that join query in the Spark UI.
To also demonstrate some other cool things you can do with Spark’s UI, we can analyze queries by query. If we go the SQL tab in the UI, we can analyze the different
show tables queries we made and use that for BI analysis.
For this demo, we could use a BI tool, or manually plot the duration times of the
show queries and make observations on them. However, these observations would not likely hold weight in the real world without a larger sample size and proper regression analysis.
With Spark, a unified analytics engine for large-scale data processing, we connected two NoSQL tables and ran a join query on them. This example demo only utilizes one data source, Cassandra, but if we want to use multiple supported data sources and connect them via Spark to run joins across them, then we can. This concludes part 2 of our “Doing SQL and Reporting on Apache Cassandra with Open Source Tools” series. Part 1 is linked below, and be on the lookout for parts 3 and 4!
If you want to watch this demo in video form, it is linked below:
Also, the slide deck associated with the video is linked below:
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!