In this blog post, we discuss how to run an Apache Spark job against our DataStax Astra database. A webinar recording is also embedded below if you want to watch a live demo in which we use Gitpod, sbt, Scala, and spark-submit to run two Spark jobs against our DataStax Astra instance. If you missed part 1 of this series, Connect Apache Spark and DataStax Astra, it is linked below.
If you are not familiar with Apache Spark, it is a unified analytics engine for large-scale data processing. Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. It offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R, and SQL shells. For our demo, we utilize spark-shell using Scala. Apache Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. We will be running our instance in standalone cluster mode with 1 worker. Additionally, we can access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources. In our case, we will be accessing Cassandra data that lives in the cloud in the form of DataStax Astra.
If you have been following us here at Anant, then you know that we have been working with DataStax Astra for some time. If you are not familiar with DataStax Astra, it is a cloud-native Cassandra-as-a-Service built on Apache Cassandra. DataStax Astra eliminates the overhead of installing, operating, and scaling Cassandra, and it also offers a 5 GB free tier with no credit card required, so it is a perfect way to get started with Cassandra in the cloud or simply experiment with it.
Check out our content on DataStax Astra below!
As a note, if you are doing additional research on this topic, you may come across this article: https://www.datastax.com/blog/advanced-apache-cassandra-analytics-now-open-all. It states that “Spark Cassandra Connector 2.5.0 fully supports DataStax Astra”, and if you look at the Spark-Cassandra-Connector on GitHub, version 2.5.0 is only compatible with Apache Spark 2.4. This raised the question of whether we could also use Apache Spark 3.0, which is compatible with Spark-Cassandra-Connector 3.0, to connect to Astra. After I tested the method described in that article, I repeated it with Apache Spark 3.0 and Spark-Cassandra-Connector 3.0, and that worked fine as well.
We will be using Gitpod as our dev environment so that you can replicate this task without having to worry about OS incompatibilities/issues.
To build our JAR for the job, we are using sbt, which is an open-source build tool for Scala and Java projects. Additionally, we are using sbt assembly to build a fat JAR so that we can package the DataStax Spark Cassandra Connector with our JAR instead of having to build a thin JAR and pass in the package with spark-submit. There are some things to note about using sbt assembly vs sbt package, and those are discussed in the webinar embedded below, so be sure to watch that as well. Additionally, we wrote our Spark code in Scala, and if you want a breakdown of it, be sure to watch the video!
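To make the fat JAR idea concrete, here is a minimal sketch of what a build definition for this kind of project could look like. It is not the build file from our repo: the project name, the version numbers, and the merge rules are assumptions chosen for illustration, so check them against the versions you actually run.

```scala
// build.sbt: an illustrative sketch. The project name and every version number
// below are assumptions, not values taken from the repo.
//
// sbt-assembly is enabled from a separate file, project/plugins.sbt, e.g.:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

name := "astra-spark-job" // hypothetical project name
version := "0.1.0"
scalaVersion := "2.12.12" // Spark 3.0 is built against Scala 2.12

libraryDependencies ++= Seq(
  // Supplied by the Spark installation at runtime, so kept out of the fat JAR
  "org.apache.spark" %% "spark-sql" % "3.0.1" % "provided",
  // Bundled into the fat JAR, so spark-submit does not need --packages
  "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0"
)

// When many JARs are merged into one, files with the same path collide.
// These rules are one common way to resolve the collisions.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case PathList("META-INF", _*)             => MergeStrategy.discard
  case "reference.conf"                     => MergeStrategy.concat
  case _                                    => MergeStrategy.first
}
```

With a definition along these lines, running sbt assembly produces a single JAR that already contains the connector, whereas sbt package would produce a thin JAR containing only the project's own classes.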
You can open this repo: https://github.com/adp8ke/Apache-Spark-and-DataStax-Astra in Gitpod by going to https://gitpod.io/#https://github.com/adp8ke/Apache-Spark-and-DataStax-Astra. Once opened, you can navigate to the Job directory and get started with the instructions on the README.md.
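If you would like a rough idea of what a Spark job against Astra looks like before opening the repo, here is a minimal sketch. It is not the code from the Job directory: the object name, the argument order, and the row-count logic are hypothetical. The configuration keys are the Spark Cassandra Connector settings for connecting with a secure connect bundle, and the sketch assumes the connector is on the classpath (for example, via a fat JAR).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical job: reads one Astra table and prints how many rows it holds.
object AstraRowCount {
  def main(args: Array[String]): Unit = {
    // Assumed argument order; adjust to taste.
    require(
      args.length == 5,
      "usage: AstraRowCount <bundlePath> <username> <password> <keyspace> <table>"
    )
    val Array(bundlePath, username, password, keyspace, table) = args

    val spark = SparkSession.builder()
      .appName("astra-row-count")
      // Astra is reached through the secure connect bundle, not contact points
      .config("spark.cassandra.connection.config.cloud.path", bundlePath)
      .config("spark.cassandra.auth.username", username)
      .config("spark.cassandra.auth.password", password)
      .getOrCreate()

    // Load the table as a DataFrame through the connector's data source
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> keyspace, "table" -> table))
      .load()

    println(s"Rows in $keyspace.$table: ${df.count()}")
    spark.stop()
  }
}
```

A job like this would be packaged into a JAR and handed to spark-submit with the five arguments above. Note that the bundle path has to be readable wherever the Spark executors run, and that passing credentials as plain arguments is only reasonable for a demo.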
And that wraps up part 2 of our “Apache Spark and DataStax Astra” series. In Part 1, we covered connecting Apache Spark and DataStax Astra using spark-shell, and in Part 2, we covered running an Apache Spark job against our DataStax Astra database. Again, as mentioned before, we have a webinar embedded below where you can watch this demo live. Don’t forget to like and subscribe while you are there!
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!