So at this stage, we are back to understanding what we were trying to say. At this stage we have the set of outputs. As I explained, just run the script, make sure the data file is there, look at what each line is doing at your leisure, and learn how to step through a script. Between you and me, what we are really aiming at is to understand, based on this, which is the best model. So if somebody tells you forward selection gives you the best model, you should ask which criterion they used. Here we used the adjusted R-squared. I could have used some other criterion, like Mallows' Cp or something else. You can look at a reference and simply use a different criterion; that's not difficult.

Now in this particular case, the best model is the eighth model. What is the eighth model? The eighth model does not have the metallic color and it does not have the doors; it uses all the variables except those two. Now, we're going to look at this model and see how good it is. But to do that, we're going to play the same trick: I'm going to fit the regression on the training data and test it on the validation dataset. So we're killing several birds with one stone. First, you've learned how to use a script. Second, we have understood what variable selection is. Third, we're going to go back and use our holdout data.

So when you fit the model with the selected variables, nothing big: you ignore the metallic color, you ignore the doors. Other than that, you're not doing anything different and you are running a regression again. When you run the regression, you get this output. This output, believe it or not, is on the validation dataset. You look at the fit and you see it's not bad: the predicted and the observed values are behaving as they should. The pseudo R-squared is 75 percent, close to what it was with all the variables.
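To make the idea concrete, here is a minimal sketch of what forward selection scored by adjusted R-squared is doing under the hood. This is my own illustration on synthetic data, not the course's script or dataset; the variable names are made up:

```python
import numpy as np

def adj_r2(y, X):
    # X already includes an intercept column; fit OLS by least squares
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - (ss_res / (n - p)) / (ss_tot / (n - 1))

def forward_select(y, X, names):
    # greedily add the variable that most improves adjusted R-squared
    chosen, remaining, models = [], list(range(X.shape[1])), []
    while remaining:
        scores = []
        for j in remaining:
            cols = chosen + [j]
            Xc = np.column_stack([np.ones(len(y)), X[:, cols]])
            scores.append((adj_r2(y, Xc), j))
        score, j = max(scores)
        chosen.append(j)
        remaining.remove(j)
        models.append(([names[k] for k in chosen], score))
    return models

# synthetic data: "age" matters a lot, "km" a little, "hp" not at all
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 3 * X[:, 0] + 1 * X[:, 1] + 0.5 * rng.normal(size=n)

models = forward_select(y, X, ["age", "km", "hp"])
for vars_, score in models:
    print(vars_, round(score, 3))
```

The strongest predictor enters first, and you would then pick the step with the best adjusted R-squared, exactly the comparison the lecture describes across the candidate models.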
Finally, and this is surprising, the root mean squared error on the validation dataset is actually a little lower: 1672. So what it's telling you is that dropping those two variables has not hurt you much. But of course you must go to a domain expert and ask, hey, does this make sense?

Now I can go back to the studio and very quickly take you through another way of looking at this model. I said I'd tell you about Mallows' Cp because it's a cute thing. If you click Cp, a table-like thing shows up on the side, the command appears down here, you hit Enter, and it gives you the Cp values. Looking at the Cp values: 1093 for the first model, 695 for the second, 228 for the third, 81 for the fourth, then 47, 19, 10, 7.6, 9. The values go down and then come back up, and the rule is: you stop when the value of Cp is approximately equal to the number of parameters in the regression. In this case, for the eighth model the Cp value is 7.6 and the number of parameters is about eight, so you stop at the eighth model. It's as simple as that. So Cp is another criterion.

Coming back, here are two more things you can do. I thought I shouldn't leave this without asking how else we can improve the model. What we have seen is: have a training dataset, have a validation dataset, and test the model on the validation set. Now you can play with different models and settle on one, but you may not be happy with how well that model is performing, so you may want to improve it to get better predictions. First, you may not like the RMSE; you want to make it smaller, and you want to improve the R-squared. Second, you want to make the model robust. Robust means that when I change some of the data, the coefficients I estimate shouldn't jump around a lot. Normally both of these can be done: predictive accuracy can be improved with more data or more variables.
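If you want to see where those Cp numbers come from, here is a small sketch of Mallows' Cp computed by hand, using the usual formula Cp = SSE_p / s² − n + 2p, where s² is the mean squared error of the full model and p counts the parameters including the intercept. The data here is synthetic, my own illustration, not the course's dataset:

```python
import numpy as np

def mallows_cp(y, X_subset, X_full):
    # Cp = SSE_p / s2 - n + 2p, with s2 taken from the full model
    n = len(y)

    def sse_and_p(X):
        Xi = np.column_stack([np.ones(n), X])   # add intercept
        beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
        r = y - Xi @ beta
        return r @ r, Xi.shape[1]

    sse_full, p_full = sse_and_p(X_full)
    s2 = sse_full / (n - p_full)                # MSE of the full model
    sse_p, p = sse_and_p(X_subset)
    return sse_p / s2 - n + 2 * p

# synthetic data: two real predictors plus one pure-noise column
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.5 * rng.normal(size=n)

cp_full = mallows_cp(y, X, X)          # equals p exactly for the full model
cp_good = mallows_cp(y, X[:, :2], X)   # drops only the noise column: Cp stays near p
cp_bad = mallows_cp(y, X[:, :1], X)    # drops a real predictor: Cp blows up
print(cp_full, cp_good, cp_bad)
```

This reproduces the stopping rule from the lecture: a subset that only drops junk has Cp close to its parameter count, while a subset missing a real variable shows a huge Cp, like the 1093 and 695 at the top of the table.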
Making it more robust can be done with more data and better data. As you solve more and more problems, you will see: do I need more variables, do I need more data? Often you iterate between the two. There are clearly two other issues. As you will have noticed, and I have been emphasizing this point, the estimates you get depend on the sample you have. If you have a very small sample, or a biased sample, say very few cars with CNG, then you cannot trust the estimate for that category. So when you have situations like this, you have to figure out a way to balance your sample: make sure it's more representative, make sure the categories are equally represented, and so on.

Finally, yes, inference. We have run the model on a training dataset and validated it on a holdout set, but what we have not discussed is the coefficients of the regression equation: how do I infer how accurate they are [inaudible]. These are topics for which I will refer you to a wonderful book, which we will talk about at the end of this module.
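One simple way to check the robustness point above, that the coefficients should not jump around when the data changes, is to refit the model on bootstrap resamples of the data and look at the spread of each coefficient. This bootstrap check is my own illustration on synthetic data, not a technique from the lecture:

```python
import numpy as np

def coef_stability(y, X, n_boot=200, seed=0):
    # refit OLS on bootstrap resamples; return mean and spread of each coefficient
    rng = np.random.default_rng(seed)
    n = len(y)
    Xi = np.column_stack([np.ones(n), X])   # intercept plus predictors
    betas = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)    # resample rows with replacement
        beta, *_ = np.linalg.lstsq(Xi[idx], y[idx], rcond=None)
        betas.append(beta)
    betas = np.array(betas)
    return betas.mean(axis=0), betas.std(axis=0)

# well-behaved synthetic data: the fit should be stable
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=n)

mean, std = coef_stability(y, X)
print(mean.round(2), std.round(3))  # small spread means a robust fit
```

A large standard deviation for a coefficient is exactly the "jumping around" the lecture warns about, and it is also where a small or unbalanced sample, like having very few CNG cars, would show up.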