In Apache Cassandra Lunch #50: Machine Learning with Spark + Cassandra, we will discuss how you can use Apache Spark and Apache Cassandra to perform basic Machine Learning tasks. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #50, we discuss how to use Apache Spark and Apache Cassandra for basic machine learning tasks such as K-Means clustering. We cover basic machine learning concepts and the technologies used in the presentation, and we present a GitHub repository with a demo project that uses these technologies and can be run without any issues. The recording embedded below also contains a live demo, so be sure to watch it!
Machine Learning is “the study of computer algorithms that improve automatically through experience and by the use of data”. It is seen as a part of artificial intelligence. Machine learning is used today in many different fields, e.g. medicine (drug discovery), image recognition, email filtering, and more. Machine learning relies on complicated mathematical concepts and systems (heavily focused on linear algebra), which makes it appear intimidating to many beginners. However, many open-source and easy-to-access tools exist for working with machine learning in the real world. Some examples include:
Additionally, many well-constructed resources exist for learning machine learning online:
Below we list a couple of neat and well-known examples of machine learning:
Generally, there are four stages to a machine learning task. The following diagram splits these four stages up:
Machine learning model training can be roughly divided into two categories, supervised and unsupervised learning (varying combinations of the two exist as well):
The following three technologies are the primary drivers of the demo project associated with this blog:
GitHub Repo link for the demo: https://github.com/HadesArchitect/CaSpark
The demo we go over in this week’s Cassandra Lunch was written by HadesArchitect at DataStax and contains a Docker Compose file which will run three Docker images:
The GitHub repo contains many example Jupyter notebooks covering different machine learning tasks and models, but we will primarily look at KMeans.ipynb. This notebook runs the K-Means clustering algorithm on some social media data to determine whether a post is a status or a video based on the number of likes and the number of comments it has. In order to run the demo project, one change needs to be made to the docker-compose.yml file in the main folder of the project. The line which starts with “PYSPARK_SUBMIT_ARGS” needs to be replaced with the following:
PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
This change uses a more up-to-date version of the DataStax Spark Cassandra Connector and allows PySpark in the example Jupyter notebooks to run, that is, to connect successfully to Spark running in the DSE Cassandra Docker container. Apart from this change, instructions for running the demo are included in the readme of the repository.
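To make the K-Means idea behind KMeans.ipynb more concrete, here is a minimal plain-Python sketch of the algorithm. It is not the notebook's code, which runs on Spark against data stored in Cassandra. The `kmeans` function and the `posts` list of (likes, comments) pairs are hypothetical, made up for illustration only.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=50, seed=0):
    """Cluster points into k groups with Lloyd's algorithm.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid assigned to points[i].
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels


# Hypothetical (likes, comments) pairs: four low-engagement posts
# followed by four high-engagement posts.
posts = [
    (10, 2), (12, 3), (8, 1), (11, 2),
    (300, 40), (320, 55), (290, 45), (310, 50),
]

centroids, labels = kmeans(posts, k=2)
for post, label in zip(posts, labels):
    print(post, "-> cluster", label)
```

K-Means is unsupervised, so the cluster indices carry no meaning on their own. Deciding which cluster corresponds to "status" and which to "video" requires comparing the clusters against known post types afterwards.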
That will wrap up this blog on using Cassandra and Spark for basic machine learning tasks. As mentioned above, the live recording which includes a live walkthrough of the demo is embedded below, so be sure to check it out and subscribe. Also, if you missed last week’s Apache Cassandra Lunch #49: Spark SQL for Cassandra Data Operations, be sure to check that out as well!
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!