Evaluating Your ML Model
Chief among your concerns when developing a model is its accuracy. In this post, we’ll discuss how splitting and managing your data can help you better evaluate the accuracy of your model.
Before we begin our discussion of model evaluation, however, we should clarify what is meant by “model”. Often, the word is defined implicitly and it’s assumed that you will mostly figure out what it means over the course of the article or blog post or what-have-you.
I’m going to break with that practice and attempt to give you an intuitive understanding of what a model is. My favorite definition comes from Joel Grus in Data Science from Scratch:
“What is a model? It’s simply a specification of a mathematical (or probabilistic) relationship that exists between different variables.”
Think of a model like an equation. If you’ll recall from school, the equation for a line is:
f(x) = mx + b
The function takes x as input. In our case, x is our data set.
Let’s say we want to build an image classifier that, given a photo, can tell us if the animal in the photo is a cat or a dog.
In this case, x is our collection of photos of cats and dogs. Now we need to figure out for what values of m and b the equation is most accurate, i.e. what m and b make it such that, given any photo of a cat or a dog, the equation correctly classifies the animal in the photo.
A model can be thought of as the equation for which m and b have been discovered. Training a model is the process of choosing m and b so as to minimize error. Choosing an algorithm (linear regression, a neural network, etc.) is choosing the form of the equation and the means by which you figure out m and b.
Continuing with our example above, we have a large number of labeled photos of cats and dogs. We now need to divide this collection into two parts - a training set and a test set. Our goal is to “teach” our model using the training set and to confirm that the model is accurate using another set that the model has never seen before, i.e. our test set.
We begin with a model that has a given m and b (these may be set randomly or by other methods). We feed our model our training data and, via some algorithm, the model’s parameters, i.e. m and b, are adjusted so as to reduce the difference between the model’s output and the expected output. For instance, if our model were incorrect 30% of the time, the parameters would be adjusted to bring the error rate as close as possible to 0.
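To make “adjusting m and b” concrete, here is a minimal sketch in Python. It uses gradient descent on made-up one-dimensional data rather than our cat/dog photos (an image classifier needs a far richer model than a line), but the idea is the same: nudge the parameters until the error is small.

```python
# A minimal sketch of "training": adjust m and b to reduce the error
# between the model's output and the expected output.
# The data here is made up; a real image classifier would not be a straight line.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # inputs
ys = [2.1, 4.0, 6.2, 8.1, 9.9]   # expected outputs (roughly y = 2x)

m, b = 0.0, 0.0                  # start with arbitrary parameters
learning_rate = 0.01

for _ in range(1000):
    # Gradient of the mean squared error with respect to m and b.
    grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Nudge the parameters in the direction that reduces the error.
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(f"learned m={m:.2f}, b={b:.2f}")   # m ends up close to 2 for this data
```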
Thus far, we’ve ignored a couple important things. For starters, we haven’t explained how to go about choosing the appropriate algorithm for the problem at hand. There are general guidelines around what algorithms are best suited for certain categories of problems. Having said that, it is not always readily apparent what algorithm to choose.
We’ve also neglected to discuss how hyperparameters will be tuned. In case you’re not familiar with the concept, a hyperparameter is a parameter that is not tuned during the training process. Instead, hyperparameters are set by you, the human agent. Examples of hyperparameters include the number of layers in a neural network or its learning rate. Some hyperparameter configurations will produce more accurate results than others, so it’s important to get them right.
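For example, using scikit-learn’s MLPClassifier (one library among many, shown purely as an illustration), the hyperparameters are passed in before training ever starts, while the weights are learned during fitting:

```python
# Hyperparameters are set by you before training; parameters (weights) are learned.
# This sketch assumes scikit-learn's MLPClassifier, but any library works the same way.
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # hyperparameter: two hidden layers of 64 and 32 units
    learning_rate_init=0.001,      # hyperparameter: learning rate
)
# model.fit(X_train, y_train) would then learn the weights (the m's and b's)
# for this particular hyperparameter configuration.
```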
Let’s say we know neither what algorithm to choose nor what values our hyperparameters should take, but that we have combinations of algorithms and hyperparameters that we think might work, each with a corresponding model.
One option for choosing a model would be to take each one in turn, train it, evaluate it against the test set, modify its hyperparameters, and test it again. We could then compare the models and choose the most accurate. This method seems sound, but it presents a serious problem. The purpose of the test set is to check that the model can handle previously unseen data.
Testing is meant to preclude overfitting. Overfitting is when a model is very accurate when given data it has seen before and rubbish with unfamiliar data. Because we have optimized our hyperparameters against our test set, we have no way to gauge how our model performs with new photos.
This is where validation sets come in. A validation set allows us to evaluate our model before we get to the test set. There are two common methods of performing this evaluation - holdout validation and k-fold cross-validation.
Holdout Validation
Holdout validation requires that one part of the training data be set aside and not used in the training of the model. The isolated set is the validation set.
You may have multiple algorithms under consideration, each with multiple hyperparameters that need to be set. Thus, you could have many different combinations of hyperparameters and algorithms. For each combination, the model is trained against the training set and evaluated against the validation set.
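In code, holdout validation is little more than a loop over your candidate models. The sketch below assumes that X_train, y_train, X_val, and y_val have already been prepared; the algorithms and hyperparameter values are illustrative, not recommendations.

```python
# A sketch of holdout validation: every candidate model is trained on the
# training set and scored on the held-out validation set.
# Assumes X_train, y_train, X_val, y_val have already been prepared.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

candidates = {
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "random_forest_100": RandomForestClassifier(n_estimators=100),
    "random_forest_500": RandomForestClassifier(n_estimators=500),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                 # train on the training set only
    scores[name] = model.score(X_val, y_val)    # evaluate on the validation set

best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```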
K-fold Validation
The other method of evaluation we’re going to talk about is k-fold validation; in particular, nested k-fold validation.
In k-fold validation, the training set is split into k different sections. For instance, if your k equals 5, then the training set is divided into five folds. One fold is set aside as the validation set. The other four are used to train the model.
Then a different fold is chosen as the validation set, and the remaining four folds become the training set. This process repeats until each of the five folds has been used as the validation set once. The scores (e.g. error rate) achieved in each iteration are averaged, and that average is the score for that particular model. Each of the models goes through this process, and the model with the best score is ultimately chosen to proceed to the test set.
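With scikit-learn, the fold rotation and averaging can be handled by cross_val_score. This sketch reuses the hypothetical candidates dictionary and training data from the holdout example above.

```python
# A sketch of 5-fold cross-validation: each candidate model is trained and
# evaluated five times, once per fold, and its scores are averaged.
# Assumes X_train, y_train and the `candidates` dict from the holdout example.
from sklearn.model_selection import cross_val_score

cv_scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_train, y_train, cv=5)  # one score per fold
    cv_scores[name] = fold_scores.mean()                          # average across folds

best_name = max(cv_scores, key=cv_scores.get)
print(best_name, cv_scores[best_name])
```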
Nested k-fold validation is meant to mitigate the effects of using a set of data as both training data and validation data. Each of the folds is used to both train and evaluate the model, which creates a similar issue to the one discussed above - the set being used to validate the model is also being used to train it.
To lessen this effect, we use both an outer and an inner k-fold validation on the model. The outer k-fold validation splits the training set into n different folds. One of those folds is held out as the outer test fold and the other n − 1 folds are used as the training set.
We then take this training set and split it, in turn, into m folds, one of which is the validation set, while the other m − 1 folds make up the inner training set. This is the inner k-fold validation.
Within each outer iteration, the inner k-fold compares every hyperparameter/algorithm combination under consideration and selects the best one. That model is then evaluated against the held-out fold of the outer k-fold. We repeat this until each of the n outer folds has been used as the held-out fold once.
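In scikit-learn, nested k-fold validation is commonly written by placing a hyperparameter search (the inner k-fold) inside an outer cross-validation loop. The sketch below assumes X_train and y_train exist and uses a random forest with a small, purely illustrative parameter grid.

```python
# A sketch of nested cross-validation:
# the inner loop (GridSearchCV) picks hyperparameters, and the outer loop
# (cross_val_score) estimates how well that selection procedure generalizes.
# Assumes X_train and y_train have already been prepared.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

param_grid = {"n_estimators": [100, 500], "max_depth": [None, 10]}

inner_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=3,   # m inner folds: select the best hyperparameters
)

outer_scores = cross_val_score(inner_search, X_train, y_train, cv=5)  # n outer folds
print(outer_scores.mean())   # an estimate of generalization performance
```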
It’s important to note that once the most accurate model has been found, whether you are using holdout validation or k-fold cross-validation, you start over with a new model. That is, you generate a new model (using the highest-performing algorithm and hyperparameters) and train it on the entire training set. After training this model, you then check it against the test set.
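As a sketch, that final step might look like the following, assuming X_train, y_train, X_test, and y_test exist and that the hyperparameters shown are whichever combination won during validation.

```python
# Once the best algorithm and hyperparameters are chosen, a fresh model is
# trained on the entire training set and checked against the test set exactly once.
# Assumes X_train, y_train, X_test, y_test exist; the winning configuration is illustrative.
from sklearn.ensemble import RandomForestClassifier

final_model = RandomForestClassifier(n_estimators=500, max_depth=10)  # winning combination
final_model.fit(X_train, y_train)             # train on all of the training data
print(final_model.score(X_test, y_test))      # the one and only look at the test set
```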
One Last Thing
You may be wondering at this point how to go about splitting up your data. A common split is 80%/20%. Starting with a single collection of data, you split it such that 80% is training data and 20% is test data. You then take the training data and split it, in turn, such that 80% is training data and 20% is validation data. That works out to 64% training, 16% validation, and 20% test of the original data, or roughly a 60/20/20 split.
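Using scikit-learn’s train_test_split, and assuming X holds your photos (as feature arrays) and y holds the cat/dog labels, the two splits might look like this:

```python
# A sketch of the 80/20, then 80/20 split using scikit-learn's train_test_split.
# Assumes X (the photos, as feature arrays) and y (the cat/dog labels) exist.
from sklearn.model_selection import train_test_split

# First split: 80% training+validation, 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)

# Second split: 80% of the remainder for training, 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2)

# Overall: 64% training, 16% validation, 20% test of the original data.
```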