In Data Engineer’s Lunch #33: Spark Cassandra and Elasticsearch for Data Engineering, we discuss how you can use Spark jobs to load data from a CSV file and move that data in and out of Cassandra and Elasticsearch. The live recording of the Data Engineer’s Lunch, which includes a more in-depth discussion, is embedded below in case you were not able to attend live. If you would like to attend a Data Engineer’s Lunch live, it is hosted every Monday at noon EST. Register here now!
Spark Cassandra and Elasticsearch for Data Engineering
For Data Engineer’s Lunch #33: Spark Cassandra and Elasticsearch for Data Engineering, we will walk through a demo project that can be found here. To understand how the demo works, we will first introduce each of the software components, along with some basic concepts about how each one handles data:
Apache Cassandra is an open-source, distributed NoSQL database designed to handle large volumes of data across many servers
Cassandra clusters can be scaled either by improving the hardware on existing nodes (vertical scaling) or by adding more nodes (horizontal scaling)
Horizontal scalability is a large part of why Cassandra is so powerful: inexpensive machines can be added to a cluster to significantly improve its performance
Note: the demo runs the DataStax Enterprise (DSE) version of Cassandra rather than the open-source version
The demo project contains three Scala class files, each corresponding to its own Spark job. The demo is built with SBT into a single jar, and the --class flag of spark-submit specifies which of the three main classes to run (an example invocation appears after the job descriptions below). The three Spark jobs are as follows:
Reading a CSV file into a Spark SQL DataFrame and saving it to Cassandra
This first job reads data.csv (located in /test-data/) into a Spark SQL DataFrame and then saves it to DSE Cassandra.
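Below is a minimal sketch of what such a job might look like, written against the Spark Cassandra Connector (which DSE Analytics bundles). The keyspace and table names (testkeyspace / testusers) are assumptions for illustration, not the demo's actual names, and the table is assumed to already exist in Cassandra:

```scala
import org.apache.spark.sql.SparkSession

object CsvToCassandra {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("CsvToCassandra")
      .getOrCreate()

    // Read the CSV into a DataFrame, using the header row for column names
    // and inferring column types from the data
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/test-data/data.csv")

    // Append the rows to an existing Cassandra table via the Spark Cassandra Connector
    // (keyspace/table names here are hypothetical)
    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testkeyspace", "table" -> "testusers"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```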
Loading data from a Cassandra table into a Spark SQL DataFrame and saving that data into Elasticsearch
This second job reads the data that the first job inserted into DSE Cassandra back into a Spark SQL DataFrame, then saves that data to Elasticsearch.
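A sketch of the second job follows, reusing the hypothetical keyspace and table names from above and writing through the elasticsearch-spark connector. The es.nodes hostname is an assumption based on the Compose network described further below:

```scala
import org.apache.spark.sql.SparkSession

object CassandraToElasticsearch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("CassandraToElasticsearch")
      .config("es.nodes", "elasticsearch") // hostname assumed from the Compose network
      .config("es.port", "9200")
      .getOrCreate()

    // Load the rows written by the first job back into a DataFrame
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "testkeyspace", "table" -> "testusers"))
      .load()

    // Index the rows into Elasticsearch; the index name comes from the demo
    df.write
      .format("org.elasticsearch.spark.sql")
      .mode("append")
      .save("testuserindex")

    spark.stop()
  }
}
```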
Loading data from Elasticsearch into a Spark SQL DataFrame
The third job reads from the Elasticsearch index created in the second job (testuserindex), puts this data into a Spark SQL DataFrame, and then displays the data in the console.
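A sketch of the third job, with the same assumed Elasticsearch connection settings as above:

```scala
import org.apache.spark.sql.SparkSession

object ElasticsearchToConsole {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ElasticsearchToConsole")
      .config("es.nodes", "elasticsearch") // hostname assumed from the Compose network
      .config("es.port", "9200")
      .getOrCreate()

    // Load the index written by the second job into a DataFrame
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .load("testuserindex")

    // Display the data in the console
    df.show()

    spark.stop()
  }
}
```

Since SBT builds all three jobs into one assembly jar, each one is selected at submit time with the --class flag. On a DSE analytics node, the invocation might look something like this (the class and jar names here are hypothetical):

```bash
# pick which of the three main classes to run
dse spark-submit --class CsvToCassandra demo-assembly.jar
```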
The demo also contains a Docker Compose file, which is used to run two Docker containers and link them to each other through a network. The following two Docker images are run (a rough sketch of such a Compose file follows the list):
DSE – DataStax Enterprise version of Cassandra with Analytics (Spark) enabled
Elasticsearch – the search engine that the second and third Spark jobs save data to and load data from
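For reference, here is a hedged sketch of what such a Compose file might look like; the image tags, service names, and network name are assumptions for illustration rather than the demo's actual file:

```yaml
version: "3"
services:
  dse:
    image: datastax/dse-server:6.8.4   # tag assumed
    command: -k                        # -k starts DSE with Analytics (Spark) enabled
    environment:
      - DS_LICENSE=accept              # required to run the DSE image
    networks:
      - demo-net
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.2  # tag assumed
    environment:
      - discovery.type=single-node     # single-node dev setup, no clustering
    networks:
      - demo-net
networks:
  demo-net:
```

Putting both services on the same network lets the Spark jobs reach Elasticsearch by its service hostname, which is why the sketches above set es.nodes to "elasticsearch".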
For more detailed information on which commands to run and in what order, along with additional commentary and details on each Spark job, take a look at the README file in the GitHub repo for the demo!
That wraps up this blog on using Spark jobs to load data from a CSV file and move that data in and out of Cassandra and Elasticsearch. As mentioned above, the live recording, which includes a live walkthrough of the demo, is embedded below, so be sure to check it out and subscribe. Also, if you missed last week’s Data Engineer’s Lunch #32: Converting JSON to CSV, be sure to check that out as well!
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra but also to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!