Testing is how we guess at the efficacy of our machine learning models out in the real world. The basics may seem obvious, but specific test metrics can help you emphasize performance on the parts of your application that are the most important. Depending on the rewards for good predictions and the penalties for bad ones, we can change the type of tests we use to emphasize one over the other. Testing can help tell us that we need to go back and change something about our training process or tell us when we can move forward and finalize our model.

This post is part of a series on machine learning with Spark and Cassandra. The first part (found here) sets up the environment we work in and discusses why we might want to use Spark and Cassandra for machine learning. The previous part (found here) discussed data pre-processing methods. This part focuses on how we test the efficacy of our machine learning models and estimate how well they might generalize to real data.

Testing is the process of checking how well a machine learning model works. Usually, we run a statistical calculation on the model's predictions for some held-out data, which is drawn from the same dataset as the training data but is not used during training. For classification models, the predicted classes the model generates are compared to the actual classes from the testing dataset. For regression models, generated values are tested against the actual values and some sort of error metric is calculated. Testing tells us how well this particular trained model estimates data it did not see during training. We usually hope to use our test results to generalize to performance on the data that our model will process in production, once it is deployed.
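As a rough illustration of holding out test data, here is a minimal sketch in plain Python. The function name `train_test_split`, the 20% test fraction, and the fixed seed are assumptions made for this example and are not taken from the series notebooks.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))             # stand-in for ten labelled examples
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model is trained only on `train`, and every metric discussed in this post is calculated on `test`.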

Testing must obviously come after model training in our machine learning process. Sometimes, the results of our test tell us to go back to an earlier step in the process to achieve a better result. Our test results can tell us to pick a different algorithm to train or to try some different parameters for our algorithm. We can also place a test phase at the very end in order to test for overfitting.

Most individual test characteristics are taken from values in a confusion matrix. For binary classification, a confusion matrix is a two by two grid of predictions and results. Correct predictions are split into True Positives and True Negatives. Incorrect predictions are split into False Positives (prediction positive, actually negative) and False Negatives (prediction negative, actually positive). We then combine these four counts to learn about how well our models work in various situations.
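The four cells of a binary confusion matrix can be counted as in the sketch below. The function name `confusion_counts` and the toy label lists are hypothetical, and labels are assumed to be 1 for positive and 0 for negative.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for binary labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1                # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1                # predicted negative, actually negative
        elif p == positive:
            fp += 1                # predicted positive, actually negative
        else:
            fn += 1                # predicted negative, actually positive
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```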

The most common metric we pull out from the confusion matrix is accuracy, which divides correct predictions by all predictions. It results in a percentage that can be used as a general measure of how well our model works on the test data. Accuracy treats every error as equally costly, whether it is a false positive or a false negative. Depending on the cost of potential false positives and false negatives, we may prefer other measures.
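A minimal sketch of the accuracy calculation follows. The counts are hypothetical: they describe an imagined test set of 95 negatives and 5 positives scored by a model that always predicts negative, which shows how a high accuracy can coexist with missing every positive.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    # Returning 0.0 for an empty test set is a convention chosen here.
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: every positive example is missed.
acc = accuracy(tp=0, tn=95, fp=0, fn=5)
print(acc)
```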

Recall divides correctly predicted positives by all actual positive values. It measures how well our model detects the presence of positive values. High recall means that when we feed in a positive value, the model is likely to detect it. Precision divides the number of true positives by the total number of positive predictions. It measures the reliability of the positive prediction. High precision means that if we see a positive prediction, we can trust it.
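Both measures can be sketched from confusion matrix counts as shown here. The counts in the example are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for this sketch.

```python
def precision(tp, fp):
    """True positives divided by all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: positive predictions are usually right,
# but half of the actual positives go undetected.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
print(p, r)
```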

We can combine precision and recall to calculate the f1-score. This measure is the harmonic mean of recall and precision. If either recall or precision is low, the f1-score will also be low. Accuracy focuses on the value of true positives and true negatives and is therefore used in situations where incorrect predictions are relatively low stakes. If incorrect predictions can have a high cost, we use f1-score, which ignores true negatives and so emphasizes false positives and false negatives.
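The harmonic mean can be sketched as follows. The input values are made up for illustration, and the arithmetic mean is printed only to show how much more the f1-score is pulled down by the lower of the two values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0                 # convention assumed when both are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical model with high precision but very low recall.
f1 = f1_score(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2
print(f1, arithmetic_mean)
```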

Some machine learning algorithms predict a scalar value. In these cases, classification error tests cannot help us learn about our algorithm. To test our regression models we use a number of formulas that all rely on a simple error metric. For a single example, the difference between the predicted value and the actual value is the error we tend to use. We combine these values over our test set in order to retrieve a single measure for how well our model predicts these examples.

The most common method for combining these values is the sum of squared errors. A plain sum of errors would let positive and negative errors cancel out, potentially hiding some of the errors in our model. Squaring each error first makes every term non-negative, so nothing cancels. We also use mean squared error (MSE), which divides the sum of squared errors by the number of test examples. This results in a measure that is independent of the size of our test set. Beyond that, root mean squared error (RMSE) is also sometimes useful. It is calculated as the square root of the MSE. Both MSE and RMSE are non-negative and have a minimum value of 0, which would represent a perfect fit. Both grow as the errors grow, so a lower value is always better. The difference between them is that MSE is measured in squared units, while RMSE is measured in the same units as the values being predicted.
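Here is a minimal sketch of these calculations in plain Python. The actual and predicted values are invented so that the plain sum of errors cancels to zero while the squared measures do not.

```python
import math


def squared_error_metrics(actual, predicted):
    """Return the plain error sum, SSE, MSE and RMSE for paired values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)   # squaring removes the sign
    mse = sse / len(errors)            # normalize by test set size
    rmse = math.sqrt(mse)              # back to the original units
    return {"sum": sum(errors), "sse": sse, "mse": mse, "rmse": rmse}


actual = [10, 12, 8, 9]
predicted = [8, 14, 7, 10]             # errors: -2, +2, -1, +1
metrics = squared_error_metrics(actual, predicted)
print(metrics)
```

The plain sum is 0 even though every prediction is off, which is the cancellation problem that squaring avoids.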

We can also take the absolute value of each error rather than squaring it, which gives the sum of absolute errors or the mean absolute error. Squaring affects large values more than small ones, so large errors have an outsized effect on squared error calculations, making those measures more sensitive to outliers. The advantage of absolute error is that it weights every unit of error equally. However, if large errors have large consequences, it may be advantageous to use squared error measures anyway. The absolute error is also not differentiable at zero, which makes it more awkward to use for further processing and makes the squared error more convenient for things like gradient descent.
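To make the outlier effect concrete, the sketch below compares mean absolute error and mean squared error on two hypothetical lists of errors that differ only in one large value.

```python
def mean_absolute_error(errors):
    """Average of the absolute error values."""
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    """Average of the squared error values."""
    return sum(e * e for e in errors) / len(errors)


typical = [1, -1, 1, -1]           # hypothetical small errors
with_outlier = [1, -1, 1, -9]      # same, but one large miss

mae_ratio = mean_absolute_error(with_outlier) / mean_absolute_error(typical)
mse_ratio = mean_squared_error(with_outlier) / mean_squared_error(typical)
print(mae_ratio, mse_ratio)
```

In this made-up case the single outlier triples the mean absolute error but multiplies the mean squared error by 21.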

Testing for multi-class classification models is similar to testing for binary classification models. We still build a confusion matrix, only instead of having a two by two grid for positive and negative predictions and classes, we have an n by n grid where n is the number of possible classes.

In order to retrieve single test characteristics like accuracy and f1-score, we split the n by n matrix into n two by two matrices, one for each possible class, and calculate from there. Within the matrix for a given class, an example counts as a true positive if that class is both predicted and correct. If neither the prediction nor the actual class is this matrix's class, the example counts as a true negative in this matrix, even if the prediction confused two of the other classes. If the prediction is for the class whose matrix we are working in, but the reality is one of the other classes, we call that a false positive. If the prediction is one of the other classes, but the reality is the matrix's class, we call that a false negative.
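The per-class split can be sketched as below. This assumes the confusion matrix is stored as a list of lists where `matrix[i][j]` is the number of examples with actual class `i` and predicted class `j`; that layout, the function name, and the three-class counts are all assumptions made for the example.

```python
def one_vs_rest_counts(matrix, k):
    """Collapse an n by n confusion matrix into binary counts for class k.

    Assumes matrix[i][j] counts examples of actual class i
    that were predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                          # actual k, predicted k
    fn = sum(matrix[k]) - tp                   # actual k, predicted other
    fp = sum(row[k] for row in matrix) - tp    # actual other, predicted k
    tn = total - tp - fn - fp                  # neither actual nor predicted k
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


# Hypothetical three-class confusion matrix (20 test examples).
matrix = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]
per_class = [one_vs_rest_counts(matrix, k) for k in range(len(matrix))]
for k, counts in enumerate(per_class):
    print(k, counts)
```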

From there we can calculate the same test characteristics as we use in binary classification for each class and use those measures in the same way. We can also calculate an overall score by averaging the scores from the individual classes, which is called the macro-average of the test characteristic. This method puts the same importance on each class, so the representation of each class is not taken into account. If a single class has 90% of examples but has a low score, the macro-average could still be high if the other classes all score highly, regardless of the size of those classes in the dataset.

The micro-average averages over each individual example in the test dataset. It therefore reflects the class balance of the dataset, and classes with many examples dominate the score. This is another case where priorities shift based on how important certain aspects of model performance are. If predictive capability on a particular class is very important, especially a small one, it might be better to use the macro-average or even to just observe the individual test scores. Other times, the overall score across all examples will be more important than performance on a particular class, and the micro-average might be better in these cases.
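The difference between the two averages can be sketched with recall. The confusion matrix below is hypothetical: it uses the same row-is-actual, column-is-predicted layout assumed for this example, with one class holding 90% of the examples and scoring poorly.

```python
def per_class_recall(matrix):
    """Recall for each class: diagonal cell divided by its row total."""
    return [row[k] / sum(row) if sum(row) else 0.0
            for k, row in enumerate(matrix)]


def macro_recall(matrix):
    """Average the per-class recalls, weighting every class equally."""
    recalls = per_class_recall(matrix)
    return sum(recalls) / len(recalls)


def micro_recall(matrix):
    """Pool the counts over all classes, weighting every example equally."""
    tp = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)    # every example is a TP or FN
    return tp / total if total else 0.0


# Hypothetical data: class 0 has 90 of 100 examples and low recall.
matrix = [
    [45, 25, 20],
    [0, 5, 0],
    [0, 0, 5],
]
print(per_class_recall(matrix))
print(macro_recall(matrix), micro_recall(matrix))
```

With these made-up counts the macro-average is about 0.83 while the micro-average is 0.55, because the two small, well-predicted classes carry as much weight as the large one in the macro-average.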

Testing our model via predictions on a test set of data can help us decide what to do next in the machine learning process. Different types of models have different types of tests that work well for them. When choosing which test statistic to look at within those boundaries, pick the one that emphasizes what matters most for your particular application.

We held a webinar on this topic. The recording can be found below. Keep an eye out for some great discussion towards the end.

The slides can be found below.

The link for the base environment can be found here. The extra notebooks and datasets used in the presentation can be found here.

Part 1 – Spark and Cassandra For Machine Learning: Setup

Part 2 – Spark and Cassandra For Machine Learning: Data Pre-processing | Video | Deck

Part 3 – Spark and Cassandra for Machine Learning: Testing | Video | Deck


Photo by Pietro Jeng on Unsplash