Model selection tests are used to determine which of two trained machine learning models performs better. The point of these tests is to predict which model will generalize better to unseen data, so comparing a single pair of test results is not enough. Today we will run through a number of different model selection tests and discuss how they work and how we interpret their results.
We use model selection tests to compare the relative performance of two different machine learning models or algorithms. These results give information useful only within the specific domain for the data and algorithms used. They cannot be used to generally define one algorithm as better than the other in all cases. These tests require the models being compared to have a specific shared measure of model skill. Because of this, we cannot use these tests to compare classification and regression models. We also cannot compare the accuracy of one model to the precision or recall of a second model. The different tests discussed today all make different statistical assumptions, and so have advantages and disadvantages in different situations.
Model selection tests produce results similar to the statistical significance tests used in scientific research. We try to determine whether any observed differences are due to statistical chance or due to actual differences in model performance. Just like in scientific research, we use the results of these tests to determine whether we can reject the null hypothesis. In this case, the null hypothesis is that the performance of the two models does not differ. When we are able to reject the null hypothesis, we have evidence that the observed differences reflect real differences in model skill rather than chance alone.
The Wilcoxon signed-rank test is a non-parametric alternative to the paired Student’s t-test. It remains usable with a small number of samples and does not assume that the differences are normally distributed. We generate the samples via k-fold cross-validation, extracting accuracy or some other test statistic from each fold. The two models must be trained and tested using exactly the same cross-validation folds. We then feed those scores into the Wilcoxon signed-rank test, which calculates the difference between each pair of scores, ranks the absolute differences, and compares the rank sums of the positive and negative differences. In the end, this test, like the ones that follow, produces a p-value.
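To make the ranking step concrete, here is a minimal from-scratch sketch of an exact two-sided Wilcoxon signed-rank test using only the Python standard library. The function name `wilcoxon_signed_rank` and the fold accuracies are hypothetical, chosen for illustration; the sketch assumes the number of folds is small enough to enumerate every sign assignment.

```python
from itertools import product


def wilcoxon_signed_rank(scores_a, scores_b):
    """Exact two-sided Wilcoxon signed-rank test on paired scores.

    Returns (W, p_value), where W is the smaller of the positive and
    negative rank sums. Pairs with a zero difference are dropped.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # identical scores: no evidence of a difference

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)

    # Under the null hypothesis each rank is equally likely to carry a
    # plus or minus sign, so enumerate all 2**n sign assignments.
    total = sum(ranks)
    as_extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, sign in zip(ranks, signs) if sign)
        if min(s, total - s) <= w:
            as_extreme += 1
    return w, as_extreme / 2 ** n


# Hypothetical accuracies from the same 10 cross-validation folds.
model_a = [0.86, 0.84, 0.88, 0.85, 0.87, 0.83, 0.89, 0.86, 0.85, 0.88]
model_b = [0.84, 0.85, 0.85, 0.83, 0.84, 0.84, 0.86, 0.83, 0.84, 0.85]
w, p = wilcoxon_signed_rank(model_a, model_b)
print(f"W = {w}, p = {p:.4f}")
```

In practice you would more likely call a library routine such as `scipy.stats.wilcoxon`, which also handles larger samples through a normal approximation.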
As in scientific research, a p-value is the probability of seeing a difference at least as large as the one we observed if the null hypothesis were true, that is, if the two models really performed the same. To use a p-value, you decide on a threshold before running the test. If your observed p-value is below that threshold, then you have grounds to reject the null hypothesis. The least strict threshold in common use is p < 0.05, which means accepting a 5% chance of rejecting the null hypothesis when it is actually true.
McNemar’s test generates a p-value based on how often the two models’ predictions are correct or incorrect on the same samples. In order to do this, we build a contingency table, which is similar to the confusion matrix built for generating test statistics from classification results. In this case, the two models’ results are compared to the actual results as well as to each other.
If both models agree on a particular sample and they are both correct, that adds one to box A (row 1, column 1). If instead one of the models is correct and the other is wrong, one is added to either box B or box C. If both are incorrect, then one is added to box D. The counts in boxes B and C, where the models disagree, are combined to generate the χ² statistic, which is then used to calculate a p-value. McNemar’s test works best with a sufficient number of data points in boxes B and C, though there are variants that claim to work when you don’t have enough.
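The table-building and the statistic can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a definitive implementation: the function name `mcnemar_test` is hypothetical, box B is assumed to hold the samples only the first model gets right (box C those only the second gets right), and the statistic uses the common continuity-corrected form (|B - C| - 1)² / (B + C) with one degree of freedom.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Return ((A, B, C, D), chi2, p_value) for two models' predictions."""
    a = b = c = d = 0
    for truth, pa, pb in zip(y_true, pred_a, pred_b):
        a_correct, b_correct = pa == truth, pb == truth
        if a_correct and b_correct:
            a += 1  # box A: both models correct
        elif a_correct:
            b += 1  # box B: only the first model correct (assumed layout)
        elif b_correct:
            c += 1  # box C: only the second model correct (assumed layout)
        else:
            d += 1  # box D: both models wrong

    if b + c == 0:
        # The models never disagree, so there is nothing to test.
        return (a, b, c, d), 0.0, 1.0

    # Continuity-corrected chi-squared statistic on the disagreements.
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return (a, b, c, d), chi2, p_value


# Hypothetical labels and predictions: 50 both right, 15 only model A right,
# 5 only model B right, 10 both wrong.
y_true = [1] * 80
pred_a = [1] * 50 + [1] * 15 + [0] * 5 + [0] * 10
pred_b = [1] * 50 + [0] * 15 + [1] * 5 + [0] * 10
table, chi2, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(table, round(chi2, 3), round(p_value, 4))
```

Note that boxes A and D never enter the statistic; only the samples on which the two models disagree carry information about which one is better.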
The 5x2CV paired t-test is another paired test, like the Wilcoxon signed-rank test above. To run it, we take a random 50/50 split of our data, train both models on one half, and test them on the other half to get results. Then we flip which half is the training set and which is the test set and retrain our models with the new data. We repeat this five times with a fresh random split each time and feed the resulting score differences into a formula that produces our t statistic, which is then used to calculate a p-value.
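The split-flip-repeat procedure described above can be sketched as follows. This assumes the commonly cited formulation of the 5x2cv paired t-test, in which the statistic is the first score difference divided by the square root of the mean per-repetition variance and is compared against a t distribution with 5 degrees of freedom. All function names are hypothetical, and `score_a` and `score_b` stand in for whatever code trains a model on one set of indices and returns its score on the other.

```python
import math
import random


def five_by_two_splits(n_samples, seed=0):
    """Yield five random 50/50 splits of the sample indices."""
    rng = random.Random(seed)
    for _ in range(5):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        half = n_samples // 2
        yield idx[:half], idx[half:]


def score_differences(n_samples, score_a, score_b, seed=0):
    """Collect (d1, d2) score differences for each of the five repetitions.

    score_a / score_b are hypothetical callables taking
    (train_indices, test_indices) and returning a score such as accuracy.
    """
    diffs = []
    for first, second in five_by_two_splits(n_samples, seed):
        d1 = score_a(first, second) - score_b(first, second)  # train on first half
        d2 = score_a(second, first) - score_b(second, first)  # flipped roles
        diffs.append((d1, d2))
    return diffs


def paired_t_5x2cv(diffs):
    """t statistic from five (d1, d2) pairs; compare to t with 5 d.o.f."""
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    denom = math.sqrt(sum(variances) / 5)
    if denom == 0:
        raise ValueError("all repetitions gave identical differences")
    return diffs[0][0] / denom


# Hypothetical accuracy differences (model A minus model B).
example_diffs = [(0.02, 0.04), (0.03, 0.01), (0.02, 0.02),
                 (0.05, 0.03), (0.01, 0.03)]
t_stat = paired_t_5x2cv(example_diffs)
# 2.571 is the two-sided 5% critical value for 5 degrees of freedom.
print(f"t = {t_stat:.3f}, significant at 0.05: {abs(t_stat) > 2.571}")
```

Comparing against the fixed critical value avoids needing a t-distribution routine; with a statistics library available you would convert the statistic to an exact p-value instead.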
There is a variation of this test called the 5x2CV combined F test, in which the results from every split are combined to calculate an F statistic that is then used to calculate a p-value.
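As a rough sketch of how the combined variant differs, the snippet below uses all ten score differences in the numerator instead of just the first one. It assumes the usual published form of the statistic, the sum of squared differences divided by twice the sum of per-repetition variances, compared against an F distribution with 10 and 5 degrees of freedom. The function name and the input values are hypothetical.

```python
def combined_f_5x2cv(diffs):
    """F statistic from five (d1, d2) score-difference pairs.

    Compare the result to an F distribution with (10, 5) degrees of freedom.
    """
    numerator = sum(d1 ** 2 + d2 ** 2 for d1, d2 in diffs)
    variance_sum = 0.0
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2
        variance_sum += (d1 - mean) ** 2 + (d2 - mean) ** 2
    if variance_sum == 0:
        raise ValueError("all repetitions gave identical differences")
    return numerator / (2 * variance_sum)


# Hypothetical accuracy differences (model A minus model B), one pair per
# repetition of the 50/50 split.
example_diffs = [(0.02, 0.04), (0.03, 0.01), (0.02, 0.02),
                 (0.05, 0.03), (0.01, 0.03)]
f_stat = combined_f_5x2cv(example_diffs)
# 4.735 is the upper 5% critical value of F with (10, 5) degrees of freedom.
print(f"F = {f_stat:.3f}, significant at 0.05: {f_stat > 4.735}")
```

Because every split contributes to the numerator, the result does not depend on which split happens to come first.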
In conclusion, model selection tests give us a way to determine which of two models, even ones built with different machine learning algorithms, works better on our dataset. These tests generate a p-value that we use to decide whether a difference we see in the accuracy of our models is likely to be real. We make that decision by setting a threshold in advance: the risk we are willing to accept of declaring a difference between the models when there is none.