In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we discussed how to run Cassandra ETL processes using Airflow and Spark. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, we show you how to set up a basic Cassandra ETL process with Airflow and Spark. If you want to hear why we used the bash operator instead of the Spark submit operator used in Data Engineer’s Lunch #25: Airflow and Spark, be sure to check out the live recording of Cassandra Lunch #53 below!
In this walkthrough, we will cover how to use Airflow to trigger Spark ETL jobs that move data into and within Cassandra. This demo is relatively simple; however, it can be expanded by adding other technologies like Kafka, scheduling the Spark jobs so the process runs repeatedly, or building more complex Cassandra ETL pipelines in general. We will focus on showing you how to connect Airflow, Spark, and Cassandra, which in our case today means DataStax Astra specifically. The reason we are using DataStax Astra is that we want everyone to be able to do this demo without having to worry about OS incompatibilities and the like. For that reason, we will also be using Gitpod, so the entire walkthrough can be done within your browser!
For this walkthrough, we will use two Spark jobs. The first Spark job will load 100k rows from a CSV file and write them into a Cassandra table. The second Spark job will read the data from that Cassandra table, apply some transformations, and write the transformed data into a different Cassandra table. We use PySpark to reduce the number of steps needed to get this working; if we used Scala, we would have to build JARs, which would take more time. If you are interested in seeing how to use the Airflow Spark Submit Operator and run Scala Spark jobs, check out this walkthrough!
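To make the structure of those two jobs concrete, here is a minimal PySpark sketch of what they might look like. Only the table names previous_employees_by_job_title and days_worked_by_previous_employees_by_job_title come from the demo; the CSV file name, the column names, and the transformation itself are hypothetical placeholders, so the actual jobs in the repo may differ.

# Minimal sketch of the two PySpark jobs (placeholder names, not the repo's exact code).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff

# Astra connection settings (secure bundle, credentials) are omitted here;
# see the Astra configuration sketch later in the walkthrough.
spark = SparkSession.builder.appName("cassandra-etl-demo").getOrCreate()

# Job 1: read ~100k rows from a CSV and append them to the first Cassandra table.
employees = spark.read.csv("previous_employees.csv", header=True, inferSchema=True)
(employees.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="previous_employees_by_job_title", keyspace="<your-keyspace>")
    .mode("append")
    .save())

# Job 2: read that table back, transform it, and write to the second table.
source = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="previous_employees_by_job_title", keyspace="<your-keyspace>")
    .load())

# Hypothetical transformation: derive the number of days worked from start/end date columns.
transformed = source.withColumn("number_of_days_worked", datediff(col("last_day"), col("first_day")))

(transformed.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="days_worked_by_previous_employees_by_job_title", keyspace="<your-keyspace>")
    .mode("append")
    .save())

In the demo these are two separate jobs, which lets Airflow trigger and monitor them as separate tasks.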
You can also do the walkthrough yourself using this GitHub repo! As mentioned above, the live recording is embedded below if you want to watch the walkthrough from the live session.
First, set up DataStax Astra:
1. Hit the Create Database button on the dashboard.
2. Hit the Get Started button on the dashboard. This will be a pay-as-you-go method, but they won’t ask for a payment method until you exceed $25 worth of operations on your account. We won’t be using nearly that amount, so it’s essentially a free Cassandra database in the cloud.
3. Hit create database and wait a couple of minutes for it to spin up and become active.
4. Once active, go to dashboard/<your-db-name> and click the Settings menu tab.
5. Select Admin User for the role, hit generate token, and keep the generated credentials for later.
6. Download the Secure Bundle: go to the Connect tab in the menu, click Node.js (it doesn’t matter which option under Connect using a driver you choose), and download the Secure Bundle.
7. Load the Secure Bundle into the running Gitpod container.
8. Copy and paste the contents of setup.cql into the CQLSH terminal.
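The Secure Bundle and the generated token are what the Spark jobs use to reach Astra. Below is a minimal sketch of how a PySpark session might be configured for Astra with the Spark Cassandra Connector; the bundle file name, Client ID, and Client Secret are placeholders, and the property names assume a connector version (2.5 or newer) that supports Astra’s cloud config.

# Sketch: configuring a SparkSession to talk to DataStax Astra (placeholder values).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-etl-demo")
    # Secure Connect Bundle downloaded from the Astra dashboard.
    .config("spark.cassandra.connection.config.cloud.path", "secure-connect-<your-db-name>.zip")
    # Client ID and Client Secret from the token you generated.
    .config("spark.cassandra.auth.username", "<client-id>")
    .config("spark.cassandra.auth.password", "<client-secret>")
    .getOrCreate()
)

When a job like this is submitted to a cluster, the bundle file typically also has to be shipped to the executors (for example with spark-submit’s --files option) so the path resolves on every node.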
Next, set up Airflow. We will be using the quick start script that Airflow provides here. Run it with:
bash setup.sh
Then start the Spark master:
./spark-3.0.1-bin-hadoop2.7/sbin/start-master.sh
Open port 8081 in the browser, copy the master URL, and paste it in the designated spot below to start a Spark worker:
./spark-3.0.1-bin-hadoop2.7/sbin/start-slave.sh <master-URL>
With Spark running, create the Airflow DAGs directory and move the DAG file, spark_dag.py, into it (a sketch of what this DAG might look like is shown below):
mkdir ~/airflow/dags
mv spark_dag.py ~/airflow/dags
Then update the properties file with your settings:
vim properties.conf
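Here is a minimal sketch of what a DAG like spark_dag.py could look like, using the bash operator to call spark-submit as discussed above. The script names, the connector package version, and the Airflow 2.x import path are assumptions, so the actual DAG in the repo may differ.

# Hypothetical sketch of spark_dag.py: two spark-submit calls chained with the bash operator.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # assumes Airflow 2.x

# The walkthrough stores the Spark master URL in the spark_default connection;
# this sketch simply uses a placeholder. Package version and script names are assumptions.
SPARK_SUBMIT = (
    "spark-submit --master <master-URL> "
    "--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 "
)

with DAG(
    dag_id="example_cassandra_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually in this demo
    catchup=False,
) as dag:

    # Job 1: load the CSV into previous_employees_by_job_title.
    load_csv_to_cassandra = BashOperator(
        task_id="load_csv_to_cassandra",
        bash_command=SPARK_SUBMIT + "~/cassandra_etl/load_previous_employees.py",
    )

    # Job 2: transform that data into days_worked_by_previous_employees_by_job_title.
    transform_within_cassandra = BashOperator(
        task_id="transform_within_cassandra",
        bash_command=SPARK_SUBMIT + "~/cassandra_etl/transform_days_worked.py",
    )

    load_csv_to_cassandra >> transform_within_cassandra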
Once Airflow is up, open the Airflow UI and confirm that the example_cassandra_etl DAG exists. If it does not exist yet, give it a few seconds to refresh.
Select example_cassandra_etl, and drill down by clicking on example_cassandra_etl.
Under the Admin section of the menu, select spark_default and update the host to the Spark master URL. Save once done.
Select the DAG menu item and return to the dashboard. Unpause example_cassandra_etl, and then click on the example_cassandra_etl link.
Once the DAG has been triggered and has run, check that the data landed in previous_employees_by_job_title:
select * from <your-keyspace>.previous_employees_by_job_title where job_title='Dentist';
Then check the transformed data in days_worked_by_previous_employees_by_job_title:
select * from <your-keyspace>.days_worked_by_previous_employees_by_job_title where job_title='Dentist';
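If you would rather check the tables from Python instead of the CQLSH terminal, here is a minimal sketch using the DataStax Python driver; the bundle path, Client ID, Client Secret, and keyspace are placeholders for your own Astra values.

# Sketch: verify the ETL results with the DataStax Python driver (pip install cassandra-driver).
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Placeholders: use your own Secure Bundle path and token credentials.
cloud_config = {"secure_connect_bundle": "secure-connect-<your-db-name>.zip"}
auth_provider = PlainTextAuthProvider("<client-id>", "<client-secret>")

cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect("<your-keyspace>")

rows = session.execute(
    "SELECT * FROM days_worked_by_previous_employees_by_job_title WHERE job_title='Dentist' LIMIT 10"
)
for row in rows:
    print(row)

cluster.shutdown()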
And that wraps up our walkthrough. Again, this is an introduction to setting up a basic Cassandra ETL process run by Airflow and Spark. As mentioned above, these baby steps can be expanded into more complex and scheduled/repeated Cassandra ETL processes run by Airflow and Spark. The live recording of Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark is embedded below, so if you want to watch the walkthrough from the live session, be sure to check it out!
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!
Subscribe to our monthly newsletter below and never miss the latest Cassandra and data engineering news!