Suppose you are developing an ML system or trying to improve the performance of an existing one. A very important step is deciding which avenues are most promising to try next.
To make this concrete, suppose you are using linear regression to predict diamond prices, and that you have implemented regularized linear regression. When you test your hypothesis on a new set of diamonds, however, you find that it makes unacceptably large errors in its predictions. The question is what you should try next to improve the learning algorithm.
Some of the things you could try are:
- Get more training examples (although this does not always help)
- Try a smaller set of features (i.e., carefully select a small subset of them to prevent overfitting)
- Get additional features (i.e., collect more data to obtain more features)
- Try adding polynomial features.
- Try increasing or decreasing the regularization parameter.
All of the items on this list are plausible in such a situation, but most of the time people choose among them by personal intuition (for example, deciding to get more training data, use a smaller set of features, or decrease the regularization parameter), more or less at random, which can be very time-consuming.
Fortunately, there is a fairly simple technique that can rule out some (often about half) of the items on the list as unlikely to help, so that you can concentrate on the promising ones. This is done with an ML diagnostic. A diagnostic is a test you can run to gain insight into what is or is not working with a learning algorithm, and to get guidance on how best to improve its performance. Diagnostics can take time to implement, but doing so can be a very good use of your time. Below are some diagnostics to consider when making this decision.
Evaluating a hypothesis: You want to evaluate a hypothesis that your algorithm has learned. When you fit the parameters of a learning algorithm, you choose the parameters that minimize the training error. A low training error might seem like a good thing, but a hypothesis with low training error is not necessarily a good hypothesis: it might be overfitting the data and therefore fail to generalize to new examples that are not in the training set. So how do you tell whether a hypothesis is overfitting? In problems with one feature, we could plot the hypothesis function to see what is happening, but for problems with many features it becomes hard to plot the hypothesis function, so we need another way to evaluate it. The standard way to evaluate a learned hypothesis is to split the data into two portions, a training set and a test set, typically in a ratio of about 7:3.
A typical training/testing procedure for your learning algorithm consists of two steps:
- Learn the parameters from the training data.
- Use the learned parameters to compute the test set error. The exact definition of the test set error depends on your learning algorithm. Suppose, for example, that your implementation of linear regression (without regularization) is badly overfitting the training set. In that case, we would expect the training error to be low and the test set error to be high.
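The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic (a noisy straight line standing in for real measurements), and the helper names `design_matrix` and `squared_error` are hypothetical, as is the choice of the 1/(2m) squared-error convention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): one feature, noisy linear target.
m = 100
x = rng.uniform(0.0, 5.0, size=m)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=m)

# Shuffle the indices, then split them 70% / 30%.
idx = rng.permutation(m)
n_train = int(0.7 * m)
train_idx, test_idx = idx[:n_train], idx[n_train:]


def design_matrix(x):
    """Prepend an intercept column to a 1-D feature vector."""
    return np.column_stack([np.ones_like(x), x])


def squared_error(theta, X, y):
    """Average squared error, using the 1/(2m) convention."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


X_train, y_train = design_matrix(x[train_idx]), y[train_idx]
X_test, y_test = design_matrix(x[test_idx]), y[test_idx]

# Step 1: learn the parameters from the training data only.
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Step 2: use the learned parameters to compute the test set error.
train_error = squared_error(theta, X_train, y_train)
test_error = squared_error(theta, X_test, y_test)
print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```

A test error far above the training error would be the overfitting signature described above; here the model matches the data-generating process, so the two should be of similar size.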
Model selection and training/validation/test sets: Suppose you would like to decide what degree of polynomial to fit to a data set, which features to include in your learning algorithm, or what value to use for the regularization parameter. These are called model selection problems, and handling them properly is very important. As mentioned earlier, the fact that a learning algorithm fits a training set well does not mean it is a good hypothesis. More generally, this is why the training set error is not a good predictor of how well the hypothesis will generalize to new examples not seen in the training set. Coming back to the model selection problem, suppose you are trying to choose the degree of polynomial to fit to your data. You could choose a linear function, a quadratic function, a cubic function, and so on up to a 10th-degree polynomial, and you would like an estimate of how well each fitted hypothesis will generalize to new examples. One thing you could do is:
- First, pick a polynomial of degree one (i.e., linear regression) and minimize the training error to get the corresponding parameter vector.
- Take a second model (i.e., a quadratic function), fit it to your training set, and get the corresponding parameter vector.
- Continue in this way up to, say, a 10th-degree polynomial, obtaining a parameter vector for each model.
- Finally, take each of these parameter vectors and measure its performance on the test set, that is, compute its test set error.
To select one of these models, you could then see which has the lowest test set error. Suppose, for example, that you end up choosing the fifth-order polynomial. The question we need to ask is how well this model generalizes. One option is to look at how well it has done on the test set, but that will not be a fair estimate of how well the hypothesis generalizes. The reason is that we have effectively fitted an extra parameter, the degree of the polynomial, using the test set. The degree was chosen precisely because it gave the best performance on the test set, so the test set performance of the corresponding parameter vector is likely to be an overly optimistic estimate of the generalization error. Because the degree has been fitted to the test set, it is no longer fair to evaluate the hypothesis on that set: the hypothesis is likely to do better on it than on new examples it has not seen before.
To address this problem in a model selection setting, we split the data set into three sets: a training set, a cross validation set (or simply validation set), and a test set, typically in the ratio 6:2:2, although this can vary. We then define the training, validation, and test errors, each computed on its own set. Instead of using the test set to select the model, we now use the validation set. That is, we carry out the four steps above, but evaluate each hypothesis on the validation set and pick the one with the lowest cross validation error. Suppose this time that the polynomial of degree 4 has the lowest validation error. The extra parameter (the degree of the polynomial) has now been fitted to the validation set rather than to the test set, which leaves the test set aside for estimating the generalization error of the selected model. In this case, we would generally expect the validation error to be lower than the test error, because an extra parameter has been fitted to the validation set.
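As a rough sketch of the three-way split and validation-based selection described above, the following NumPy code fits polynomials of degree 1 to 10 on the training set, picks the degree with the lowest validation error, and only then touches the test set. The synthetic sine-shaped data and the helper names `poly_features` and `squared_error` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): a noisy nonlinear target.
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.1, size=m)

# Shuffle, then split 60% / 20% / 20% into train / validation / test.
idx = rng.permutation(m)
n_train, n_val = int(0.6 * m), int(0.2 * m)
train, val, test = np.split(idx, [n_train, n_train + n_val])


def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def squared_error(theta, X, y):
    """Average squared error, using the 1/(2m) convention."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


thetas, val_errors = {}, {}
for d in range(1, 11):
    # Fit each candidate model on the training set only.
    theta, *_ = np.linalg.lstsq(poly_features(x[train], d), y[train], rcond=None)
    thetas[d] = theta
    # Score it on the validation set, never on the test set.
    val_errors[d] = squared_error(theta, poly_features(x[val], d), y[val])

# Model selection: the degree with the lowest validation error.
best_degree = min(val_errors, key=val_errors.get)

# The test set is used once, to estimate generalization error.
test_error = squared_error(
    thetas[best_degree], poly_features(x[test], best_degree), y[test]
)
print(f"selected degree: {best_degree}, test error: {test_error:.4f}")
```

Because the degree was chosen to minimize `val_errors`, the validation error of the selected model is a slightly optimistic number; `test_error` is the fairer estimate.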
Diagnosing bias vs variance: If you run a learning algorithm and it does not do as well as you hoped, it is almost always because you have either a high bias problem or a high variance problem. It is very important to figure out which of the two is affecting the model, because that tells us which remedies are promising. Recall the training and validation errors defined above. Plotting these errors against the degree of the polynomial gives a good way of assessing the problem. The validation error curve is roughly U-shaped, while the training error decreases as the degree of the polynomial increases. So for a polynomial of degree one the training error will be high, whereas for a polynomial of, say, degree 10 the training error will be low. The validation error, by contrast, will be high for a polynomial of degree one, lower for an intermediate degree (say 2), and very high again when the degree is high (say 10). These curves give us a clue for distinguishing whether the algorithm is suffering from high variance or high bias. Concretely, in the high bias case both the validation and training errors are high (the validation error may be slightly higher than the training error). In the high variance case, the training error is low (that is, the algorithm is fitting the training set too well) whereas the validation error is much higher than the training error.
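The shape of these two curves can be checked numerically. The sketch below is illustrative only: it uses a deliberately small synthetic training set (an assumption, chosen so that high-degree polynomials can overfit) and hypothetical helper names, and it prints the errors rather than plotting them.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(m):
    """Synthetic stand-in data (an assumption): noisy sine curve."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=m)
    return x, y


x_train, y_train = make_data(20)   # small on purpose
x_val, y_val = make_data(200)


def poly_features(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def squared_error(theta, X, y):
    """Average squared error, using the 1/(2m) convention."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


degrees = list(range(1, 11))
train_errors, val_errors = [], []
for d in degrees:
    X_tr = poly_features(x_train, d)
    theta, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)
    train_errors.append(squared_error(theta, X_tr, y_train))
    val_errors.append(squared_error(theta, poly_features(x_val, d), y_val))

for d, e_tr, e_val in zip(degrees, train_errors, val_errors):
    print(f"degree {d:2d}: train {e_tr:.4f}  validation {e_val:.4f}")
```

Reading the printout: both errors high at low degree suggests high bias; a low training error paired with a much larger validation error at high degree suggests high variance. The training error can only stay the same or fall as the degree grows, because each higher-degree model contains the lower-degree ones.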
Regularization and the bias/variance problem: Let us now see how bias and variance are affected by regularization. Suppose you are fitting a high-order polynomial, say of degree 4. To prevent overfitting we use regularization. This is a modification of the cost function that adds a regularization term, weighted by a regularization parameter, to penalize large parameter values. The strength of the penalty depends on the value of the regularization parameter, so care should be taken in choosing it. If the regularization parameter is very large, say 1000, then most of the parameters will be heavily penalized and driven to approximately zero, so the hypothesis becomes approximately a constant (a flat line) and underfits the data; this is the high bias case. At the other extreme, if the regularization parameter is very small (approximately zero), then, given that we are fitting a high-order polynomial, we are back to the usual overfitting problem, since little or no regularization is applied; this is the high variance case. An intermediate value of the regularization parameter will give a reasonable fit to the data.
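For reference, one common way to write the regularized cost for linear regression is shown below. The notation is an assumption, since the text above uses no symbols: $m$ training examples $(x^{(i)}, y^{(i)})$, $n$ features, hypothesis $h_\theta$, regularization parameter $\lambda$, and, by the usual convention, an intercept $\theta_0$ that is not penalized.

```latex
J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
          + \lambda \sum_{j=1}^{n} \theta_j^2 \right]
```

With $\lambda$ very large, minimizing $J$ pushes $\theta_1, \dots, \theta_n$ toward zero, leaving roughly $h_\theta(x) \approx \theta_0$. With $\lambda$ near zero, the second sum has almost no effect and we recover ordinary least squares.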
The question of how to automatically choose a good value for the regularization parameter is therefore a central one. One way is to consider a range of values that you might want to try, say 0.01, 0.02,…, 10. Each value gives a corresponding model, and you then select among these models as follows:
- For 0.01, minimize the regularized cost function to obtain the corresponding parameter vector.
- For the second model, with 0.02, again minimize the regularized cost function to get the corresponding parameter vector.
- Continue in this way up to the last model, with 10.
- Next, evaluate all of these parameter vectors on the validation set and pick the model with the lowest validation error.
- Take the parameter vector of the selected model and see how it does on the test set.
- Lastly, it is useful to see how the validation error and the training error vary as we vary the regularization parameter. To do this, plot the training and validation errors against the regularization parameter. For small values of the regularization parameter, we can fit the training set relatively well because almost no regularization is applied; we are essentially just minimizing the squared error. So for small values we end up with a small training error, whereas for large values we have a high bias problem and the training error is high. The validation error is high at both extremes: with too large a value we underfit, and with too small a value we overfit. In between, there is usually some intermediate value that works best, in the sense of giving a small validation error or test error.
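The procedure above can be sketched as follows. Several things here are assumptions rather than part of the original recipe: the data is synthetic, the polynomial degree of 8 is an arbitrary choice, the candidate values are taken to double at each step (0.01, 0.02, 0.04, and so on up to 10.24), the intercept is left unpenalized, and `fit_ridge` is a hypothetical helper that minimizes the regularized squared-error cost in closed form.

```python
import numpy as np

rng = np.random.default_rng(2)


def make_data(m):
    """Synthetic stand-in data (an assumption): noisy sine curve."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=m)
    return x, y


x_train, y_train = make_data(30)
x_val, y_val = make_data(100)
x_test, y_test = make_data(100)

DEGREE = 8  # arbitrary high-order polynomial for illustration


def poly_features(x, degree=DEGREE):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def fit_ridge(X, y, lam):
    """Minimize sum of squared errors + lam * sum(theta[1:] ** 2)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unregularized
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)


def squared_error(theta, X, y):
    """Unregularized average squared error, 1/(2m) convention."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


# Assumed grid of candidate values: 0.01, 0.02, 0.04, ..., 10.24.
lambdas = [0.01 * 2**k for k in range(11)]

X_train, X_val, X_test = map(poly_features, (x_train, x_val, x_test))

# One parameter vector per candidate value, fitted on the training set.
thetas = [fit_ridge(X_train, y_train, lam) for lam in lambdas]
train_errors = [squared_error(t, X_train, y_train) for t in thetas]
val_errors = [squared_error(t, X_val, y_val) for t in thetas]

# Select on the validation set, then report on the test set.
best = int(np.argmin(val_errors))
test_error = squared_error(thetas[best], X_test, y_test)
print(f"selected value: {lambdas[best]:.2f}, test error: {test_error:.4f}")
```

Note that the regularization term is used only when fitting; the training, validation, and test errors are all computed without it, so that the curves are comparable across candidate values.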
Learning curve: A learning curve (LC) is a diagnostic tool used to figure out whether a learning algorithm may be suffering from a bias problem, a variance problem, or both. LCs are often very useful to plot, either to sanity check that your algorithm is working correctly or to improve its performance. To plot LCs, we plot the training error and the validation error, both computed without the regularization term, as a function of the number of training examples.
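A minimal sketch of how the points on a learning curve might be computed is given below, assuming synthetic linear data and plain (unregularized) linear regression; the helper names are hypothetical. For each training set size, the model is fitted on only the first few training examples, the training error is measured on those same examples, and the validation error is measured on the full validation set.

```python
import numpy as np

rng = np.random.default_rng(3)


def make_data(m):
    """Synthetic stand-in data (an assumption): noisy straight line."""
    x = rng.uniform(0.0, 5.0, size=m)
    y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=m)
    return x, y


x_train, y_train = make_data(50)
x_val, y_val = make_data(50)


def design_matrix(x):
    """Prepend an intercept column to a 1-D feature vector."""
    return np.column_stack([np.ones_like(x), x])


def squared_error(theta, X, y):
    """Average squared error, using the 1/(2m) convention."""
    residual = X @ theta - y
    return float(residual @ residual) / (2 * len(y))


X_val = design_matrix(x_val)

sizes = list(range(2, len(x_train) + 1))
train_errors, val_errors = [], []
for m in sizes:
    # Fit on the first m training examples only.
    X_m, y_m = design_matrix(x_train[:m]), y_train[:m]
    theta, *_ = np.linalg.lstsq(X_m, y_m, rcond=None)
    # Training error on the m examples used for fitting ...
    train_errors.append(squared_error(theta, X_m, y_m))
    # ... validation error always on the whole validation set.
    val_errors.append(squared_error(theta, X_val, y_val))

for m in (2, 5, 10, 25, 50):
    i = sizes.index(m)
    print(f"m={m:2d}: train {train_errors[i]:.3f}  validation {val_errors[i]:.3f}")
```

With two examples a straight line fits the training data exactly, so the first training error is essentially zero. If the two curves flatten out close together at a high error, that points to high bias; a persistent large gap between them points to high variance.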
We have seen above how to select a good model, but the central question from the beginning was “what next?” Recall the earlier list of things you could try. Here are the situations in which each item on that list is appropriate:
- Get more training examples to fix high variance (i.e., when the validation error is much greater than the training error).
- Try a smaller set of features to fix high variance. If you have high bias, don’t waste time trying to select features.
- Try getting additional features to fix high bias (though this does not always help).
- Try adding polynomial features to fix high bias.
- Try decreasing the regularization parameter to fix high bias.
- Try increasing the regularization parameter to fix high variance.