In Apache Cassandra Lunch #54: Machine Learning with Spark + Cassandra, we will discuss how you can use Apache Spark and Apache Cassandra to perform additional basic Machine Learning tasks. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. Register here now!
In Apache Cassandra Lunch #54, we discuss how you can use Apache Spark and Apache Cassandra to perform additional basic Machine Learning tasks such as Naive Bayes classification and Random Forest classification. We cover basic concepts of machine learning as a whole and the technologies used in the presentation, and we present a GitHub repository with a demo project that uses these technologies and can be run without any issues. The live recording embedded below also contains a live demo, so be sure to watch it!
Machine Learning is “the study of computer algorithms that improve automatically through experience and by the use of data”. It is seen as a part of artificial intelligence. Machine Learning is used in a large number of applications across many different fields, e.g., medicine (drug discovery), image recognition, email filtering, and more. Machine Learning relies on complicated mathematical concepts and systems (with a heavy focus on linear algebra), which makes it appear intimidating to many beginners. However, many open-source and easy-to-access tools exist for working with and using machine learning in the real world. Some examples include:
Additionally, many well-constructed resources exist for learning machine learning online:
There are many reasons why Cassandra is a great database to use for machine learning applications. Below we list a couple of them:
In the following section, we briefly discuss the Random Forest, which is an ensemble learning method for classification, regression, and similar tasks, and involves building a set of decision trees with incoming data. The building block of a Random Forest is a concept known as a decision tree:
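A decision tree can be pictured as a chain of yes/no questions about the features that ends in a class label. Below is a minimal Python sketch of that idea. The feature names and thresholds are hypothetical, loosely inspired by the wine data used in the demo, and are not taken from the demo notebooks.

```python
def classify(node, sample):
    """Walk a decision tree from the root until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        # Each internal node asks one question: is the feature value <= threshold?
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node


# Hypothetical hand-written tree; a real tree would be learned from data.
tree = {
    "feature": "alcohol",
    "threshold": 10.5,
    "left": "bad",
    "right": {
        "feature": "volatile_acidity",
        "threshold": 0.5,
        "left": "good",
        "right": "bad",
    },
}

strong_wine = {"alcohol": 11.0, "volatile_acidity": 0.4}
weak_wine = {"alcohol": 9.8, "volatile_acidity": 0.4}
strong_label = classify(tree, strong_wine)
weak_label = classify(tree, weak_wine)
```

Internal nodes are dictionaries and leaves are plain labels, so the walk stops as soon as it reaches something that is not a dictionary.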
A Random Forest, as the name may suggest, consists of a large group of decision trees, each built from a different segment of the data. Suppose we have N pieces of data with M different features. We choose some number n of decision trees to build as part of the random forest.
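One common way to pick the "different segments" is to give each tree a bootstrap sample of the N rows (drawn with replacement) and a random subset of the M features. The sketch below assumes that scheme, with roughly sqrt(M) features per tree; the function name and the per-tree feature choice are our simplifications, and many implementations re-draw the feature subset at every split instead.

```python
import math
import random


def draw_tree_inputs(num_rows, num_features, num_trees, seed=0):
    """Pick, for each tree, which rows and which features it will be trained on."""
    rng = random.Random(seed)
    # Assumption: about sqrt(M) features per tree, a common default for classification.
    features_per_tree = max(1, math.isqrt(num_features))
    plan = []
    for _ in range(num_trees):
        # Bootstrap sample: N row indices drawn with replacement.
        row_indices = [rng.randrange(num_rows) for _ in range(num_rows)]
        # Feature subset: distinct feature indices drawn without replacement.
        feature_indices = sorted(rng.sample(range(num_features), features_per_tree))
        plan.append((row_indices, feature_indices))
    return plan


# N = 10 rows, M = 9 features, n = 3 trees
plan = draw_tree_inputs(num_rows=10, num_features=9, num_trees=3)
```

Because rows are drawn with replacement, some rows appear several times in a given tree's sample and others not at all, which is what makes the trees differ from one another.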
The following diagram visually shows the process a single piece of data goes through in order to be classified by a Random Forest classifier:
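In code, that process amounts to passing the piece of data to every tree and taking a majority vote over the answers. The sketch below uses three hand-written stand-ins for trained trees; their feature names and thresholds are hypothetical and only serve to illustrate the voting step.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Ask every tree for a label and return the most common one plus all votes."""
    votes = [tree(sample) for tree in trees]
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, votes


# Hypothetical stand-ins for trained decision trees.
trees = [
    lambda wine: "good" if wine["alcohol"] > 10.5 else "bad",
    lambda wine: "good" if wine["volatile_acidity"] < 0.5 else "bad",
    lambda wine: "good" if wine["sulphates"] > 0.6 else "bad",
]

sample = {"alcohol": 11.2, "volatile_acidity": 0.7, "sulphates": 0.65}
label, votes = forest_predict(trees, sample)
```

With two classes, using an odd number of trees avoids ties. In this sketch a tie would be resolved in favor of whichever label was voted first.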
The Naive Bayes classifier is a collection of simple probability based classification algorithms based on Bayes’ Theorem:
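For two events A and B with P(B) not equal to zero, the theorem in its standard form reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here P(A | B) is the probability of A given that B has occurred, and P(B | A) is the probability of B given A.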
In a classification problem, however, we are trying to obtain the probability that some object is a member of a particular class given a large number of known values. If the input data can take many possible values, then applying Bayes’ theorem directly to a complicated problem is not very practical when we base our results on probability tables. We write out Bayes’ theorem in terms of the probability of some piece of data being part of a class Ck given some known variables about the data x:
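With the class written as C_k and the known variables collected in a vector x, the standard statement is:

```latex
p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}
```

The term p(C_k) is the prior probability of the class, p(x | C_k) is the likelihood of the observed values within that class, and p(x) does not depend on the class at all.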
If we have a complicated real-life problem where the known variables x have many components, then we may not have a single data point for a given class Ck with exactly the values contained in x. Thus, we cannot properly obtain a value for p(x | Ck) from a probability table for such a complicated problem. To resolve this issue, we make the “naive” assumption of conditional independence: assume that all features in x are mutually independent, conditional on the category Ck:
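Writing x = (x_1, ..., x_n), where n is our own label for the number of features (unrelated to the number of trees in the Random Forest section), the assumption lets the joint likelihood factor into one term per feature:

```latex
p(\mathbf{x} \mid C_k) = p(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} p(x_i \mid C_k)
```

Each factor p(x_i | C_k) involves only a single feature, so it can be estimated from the data even when the exact combination of values in x was never observed.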
Although different features of most data sets are not conditionally independent, it turns out that making this assumption and building a classifier based on it gives pretty good results on real-life data sets. After going through some math, we come to the following conclusion given this set of assumptions:
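Substituting the independence assumption into Bayes' theorem and dropping p(x), which is the same for every class, gives the usual form of the result (n again denotes the number of features, which is our notation):

```latex
p(C_k \mid \mathbf{x}) \;\propto\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),
\qquad
\hat{y} = \underset{k}{\operatorname{arg\,max}} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)
```

The predicted class is simply the one with the largest value of this product.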
From here, we can obtain values of p(xi | Ck) from a probability table and build a classifier based on this equation, which is the Naive Bayes classifier.
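To make this concrete, here is a minimal Python sketch of such a classifier for categorical features. It builds the tables p(Ck) and p(xi | Ck) by counting and then picks the class with the largest score. Two things are our own additions rather than part of the discussion above: add-one (Laplace) smoothing, so that an unseen value does not produce a zero probability, and summing logarithms instead of multiplying probabilities. The toy data is made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count how often each class and each (feature, value, class) combination occurs."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values_seen = defaultdict(set)         # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen, alpha


def predict(model, row):
    """Return the class maximizing log p(Ck) + sum over i of log p(xi | Ck)."""
    class_counts, feature_counts, values_seen, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Smoothed estimate of p(xi | Ck) from the count table.
            numerator = feature_counts[(i, label)][value] + alpha
            denominator = count + alpha * len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data: (alcohol level, sweetness) -> quality label
rows = [("high", "dry"), ("high", "dry"), ("high", "sweet"),
        ("low", "sweet"), ("low", "sweet"), ("low", "dry")]
labels = ["good", "good", "good", "bad", "bad", "bad"]

model = train_naive_bayes(rows, labels)
first_prediction = predict(model, ("high", "dry"))
second_prediction = predict(model, ("low", "sweet"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together, and it does not change which class wins.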
The following three technologies are the primary drivers of the demo project associated with this blog:
GitHub Repo link for the demo: https://github.com/HadesArchitect/CaSpark
The demo we go over in this week’s Cassandra Lunch was written by HadesArchitect at DataStax and contains a Docker Compose file that runs three Docker images:
The GitHub repo contains many example Jupyter notebooks covering different machine learning tasks and models, but we will primarily be looking at Random Forest.ipynb and Naivebayes.ipynb. The notebook Random Forest.ipynb runs the Random Forest classifier to classify wines by score based on their physical and chemical properties. The notebook Naivebayes.ipynb performs a similar classification task, but uses the Naive Bayes classifier instead of Random Forest. In order to run the demo project, one change needs to be made to the docker-compose.yml file in the main folder of the project. The line which starts with “PYSPARK_SUBMIT_ARGS” needs to be replaced with the following:
PYSPARK_SUBMIT_ARGS: '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1 --conf spark.cassandra.connection.host=dse pyspark-shell'
This change uses a more up-to-date version of the DataStax Spark Cassandra Connector and will allow PySpark in the example Jupyter Notebooks to run (successfully connect to Spark running in the DSE Cassandra docker container). Instructions for running the demo besides this change are included in the readme of the repository.
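To make the pieces of that setting easier to read, the small Python helper below rebuilds the same string from its parts. The helper itself is hypothetical and not part of the demo repository. The value after `--packages` is a Maven coordinate in which `_2.12` is the Scala version the connector was built for and `3.0.1` is the connector version, and `spark.cassandra.connection.host` tells the connector which host to contact. We assume `dse` is the name of the DSE container on the Docker Compose network.

```python
def build_pyspark_submit_args(connector_version="3.0.1",
                              scala_version="2.12",
                              cassandra_host="dse"):
    """Assemble the PYSPARK_SUBMIT_ARGS value from its three parts (hypothetical helper)."""
    # Maven coordinate of the DataStax Spark Cassandra Connector.
    package = (
        "com.datastax.spark:"
        f"spark-cassandra-connector_{scala_version}:{connector_version}"
    )
    # Host the connector uses to reach Cassandra.
    host_conf = f"spark.cassandra.connection.host={cassandra_host}"
    # The trailing "pyspark-shell" is kept exactly as in the original setting.
    return f"--packages {package} --conf {host_conf} pyspark-shell"


submit_args = build_pyspark_submit_args()
```

If you ever change the connector version, keep the Scala suffix in line with the Scala version of the Spark build you are running.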
That will wrap up this blog on using Cassandra and Spark to perform additional basic Machine Learning tasks. As mentioned above, the live recording which includes a live walkthrough of the demo is embedded below, so be sure to check it out and subscribe. Also, if you missed last week’s Apache Cassandra Lunch #53: Cassandra ETL with Airflow and Spark, be sure to check that out as well!
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was to not only fill the gap of Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!