Spark and Cassandra For Machine Learning: Cross-Validation

Cross-validation is a collection of methods for repeatedly training and testing our machine learning models. We do it in order to learn more than a single test can tell us. The results help us tune our model parameters, so we need to do this before any final evaluation takes place and before we move forward to deployment.


What is validation and how is it different from testing?

Validation is a method for repeated testing. By testing repeatedly on different portions of our dataset, we get a better idea of how our model will generalize. Testing on data that was used to train the model is useless, and it can mislead us about the model's actual efficacy. Therefore, each time we set aside a portion of the full dataset for testing, we train the model on the rest of the dataset. The advantage is that we get several test results, which we can average for a better estimate of how our algorithm performs. Every piece of data also gets used for testing at some point, and because the algorithm sees a different training set each time, we can judge how well our parameters are set.

Cross-validation usually encompasses the repeated training and testing of a specified machine learning algorithm with the same parameters. This means that it starts after we have already selected our algorithm. It ends before final testing and moving on to deployment.

Validation Schemes

Normal Train / Test Split

The simplest validation scheme is a train/test split. Sometimes we also want a third split of our data, in which case it is a train/test/holdout split. First, we train on the training split and run the first set of tests on the test split. We can use the results of these tests to tune the parameters of our model. Once we decide that our parameters are good enough, we test on the holdout set, which has played no part in training or tuning. We do this to check for overfitting and to determine whether we are ready to move forward in our machine learning process.
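
As a rough illustration of the three-way split described above, here is a minimal sketch in plain Python. The function name `train_test_holdout_split`, the 20% fractions, and the fixed seed are assumptions chosen for the example, not values from the webinar or from any Spark API.

```python
import random

def train_test_holdout_split(rows, test_frac=0.2, holdout_frac=0.2, seed=42):
    """Shuffle rows once and cut them into train, test, and holdout lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_holdout = int(n * holdout_frac)
    # The holdout slice is set aside first and not touched while tuning.
    holdout = shuffled[:n_holdout]
    test = shuffled[n_holdout:n_holdout + n_test]
    train = shuffled[n_holdout + n_test:]
    return train, test, holdout

train, test, holdout = train_test_holdout_split(range(100))
print(len(train), len(test), len(holdout))
```

The three lists are disjoint by construction, so a score measured on `holdout` is not influenced by any tuning done against `test`.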

K-Fold Cross-Validation

K-fold cross-validation is a standard way of checking the performance of a model with a given set of parameters. First, our dataset of n samples is split into k disjoint subsets, or folds. The folds do not overlap, and each is roughly of size n/k. Each fold in turn is designated as the test set, and the model is trained on the rest of the data. We can average the performance of our model over the folds or pick out the best-case or worst-case result. Measured performance tends to improve as k approaches n, since each model is trained on more of the data, but running the tests also takes more time and resources.
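
The fold bookkeeping can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `k_fold_indices` and `cross_validate` are hypothetical helper names, and `fit_and_score` stands in for whatever function trains a model on the training indices and returns its score on the test indices.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k disjoint folds over n samples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n, k, fit_and_score):
    """Return the mean, worst, and best score over the k folds."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n, k)]
    return sum(scores) / len(scores), min(scores), max(scores)

# Show the size of each train/test pair for 10 samples and 3 folds.
for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), len(test_idx))
```

Every sample lands in exactly one test fold, which is what lets us test on each piece of data once.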

Leave-One-(Group)-Out Cross-Validation

Leave-one-out cross-validation has each sample act individually as the test set, with the entire rest of the dataset acting as the training set. We train n models and test each one on its single held-out example. This is the same as k-fold cross-validation where k is equal to n.

Leave-one-group-out cross-validation holds out a specific subset of the dataset each time. This subset is not chosen randomly as in normal k-fold cross-validation. Instead, we pick some feature of our data and split along the divisions in that feature. For example, if the data in our dataset comes from a number of different sources, and the source is recorded as a feature, we can split the dataset by source. Each source would in turn be held out as the test set, and our model would be trained on the remaining data.
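
A small sketch of splitting by group, assuming each sample carries a group label such as its source. The function name `leave_one_group_out` and the `sensor_*` labels are made up for the example.

```python
from collections import defaultdict

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each distinct group label."""
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    for label, test_idx in by_group.items():
        # Train on every sample that does not belong to the held-out group.
        train_idx = [i for i in range(len(groups)) if groups[i] != label]
        yield label, train_idx, test_idx

# Hypothetical source labels, one per sample.
sources = ["sensor_a", "sensor_b", "sensor_a", "sensor_c", "sensor_b"]
for label, train_idx, test_idx in leave_one_group_out(sources):
    print(label, train_idx, test_idx)
```

Giving every sample its own unique label reduces this to plain leave-one-out cross-validation.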

Nested Cross-Validation

Nested cross-validation is a more complete validation scheme, closer to a train/test/holdout split than to a single cross-validation test. As in that scheme, some of our tests are used for hyperparameter tuning and the rest for checking for overfitting. The outer loop is a k-fold cross-validation used for final accuracy estimation and checking for overfitting. Inside each outer fold, the inner loop runs k-fold cross-validation on the outer training data only, and we use those results to tune the parameters before scoring the tuned model on the outer test fold. We can also substitute other validation schemes for either the inner or the outer loop.
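
One common way to arrange the two loops is to run the inner loop inside each outer training fold, so that tuning never sees the outer test data. The sketch below assumes that arrangement. `nested_cv`, `k_fold`, and the `score(param, train_idx, test_idx)` callable are hypothetical names, with `score` standing in for "train with this parameter on these indices and return the score on those".

```python
def k_fold(indices, k):
    """Split a list of indices into k disjoint (train, test) pairs, unshuffled."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def nested_cv(indices, candidates, score, outer_k=5, inner_k=3):
    """Return one (best_param, outer_score) pair per outer fold."""
    results = []
    for outer_train, outer_test in k_fold(indices, outer_k):
        # Inner loop: tune using only the outer training portion.
        def inner_mean(param):
            scores = [score(param, tr, va) for tr, va in k_fold(outer_train, inner_k)]
            return sum(scores) / len(scores)
        best = max(candidates, key=inner_mean)
        # Outer loop: score the tuned setting on data the tuning never saw.
        results.append((best, score(best, outer_train, outer_test)))
    return results

# Toy score that peaks at param == 3 and ignores the data, just to exercise the loops.
toy_score = lambda param, train_idx, test_idx: -abs(param - 3)
print(nested_cv(list(range(30)), [1, 3, 5], toy_score))
```

The spread of the outer scores gives a sense of how stable the tuned model is, and a large gap between inner and outer scores hints at overfitting to the tuning data.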

Time Series Cross-Validation

When working with time-series data, we sometimes need to avoid training a model on data that is further ahead in time than the data that we test that model on. In that case, we cannot use the random splits from either a train/test split or k-fold cross-validation. Instead, we split sequentially. We split into k+1 groups to do k individual tests and for each test set, we train on all preceding data.
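
The sequential split can be sketched as follows, assuming the samples are already sorted by time and indexed 0 to n-1. The function name `time_series_splits` is an assumption for the example.

```python
def time_series_splits(n, k):
    """Yield (train_idx, test_idx) for k tests over n time-ordered samples."""
    # Cut the ordered indices into k + 1 contiguous groups of near-equal size.
    bounds = [g * n // (k + 1) for g in range(k + 2)]
    for g in range(1, k + 1):
        # Train on everything before group g, then test on group g itself.
        train_idx = list(range(0, bounds[g]))
        test_idx = list(range(bounds[g], bounds[g + 1]))
        yield train_idx, test_idx

for train_idx, test_idx in time_series_splits(10, 4):
    print(train_idx, test_idx)
```

The first group is only ever used for training, which is why k tests need k+1 groups.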


Testing our machine learning models using these validation schemes can help us determine when our models are properly tuned and whether or not those models are overfitting to their training data. Different validation schemes have different advantages or are applicable in different situations, but the general idea is to repeat the training and testing of a model to extract performance data.

We held a webinar on this topic. The recording can be found below. Keep an eye out for some great discussion towards the end.

The slides can be found below.

The link for the base environment can be found here. The extra notebooks and datasets used in the presentation can be found here.


Photo by Pietro Jeng on Unsplash