So let's begin with an example as usual. What I'm going to do is use a classic data set of used cars, and you can gather your own; we have gathered some. It's the price of 1,436 used Corollas, and their information is given in a data set called Toyota.csv. CSV is a comma-separated text file, and the source of this is Uiowa. You can collect data like this from any of the used-car websites today. The data gives, as you would expect, a number of features of the car. If I look at it, it gives of course the price, which is what we're trying to predict. It gives the age of the car in months, as of August 2004. It gives the mileage in kilometers. It gives the kind of fuel it uses: CNG, petrol, or diesel. It gives the horsepower of the car. It tells you whether the paint is metallic or not, and whether the transmission is automatic or manual. It gives the cylinder capacity, the number of doors, and the weight of the car. Here's some data for you to look at: the first car has a price of 13,500 at an age of 23 months, and so forth. As I was looking at this data, I was asking myself, "Am I happy with that?" The reality is, I think the future is not that far off where some app will probably scan the car and give you all the condition information you ever wanted. There could also be devices fitted to cars, what we call IoT technology, that capture the way the car is functioning, so if you're going to buy the car, you get that data stream available to you. In the future we won't just have features like doors and weight; we can also get performance data: when was the last maintenance? It's not too far off. You can see that as more and more data becomes available, the number of features expands, and we will be talking about that soon. For the first time in this course, what we're going to do is break up this data into two parts.
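To make the layout concrete, here is a minimal sketch of one row of the data set described above. The column names and most of the values are assumptions for illustration; only the price (13,500) and age (23 months) of the first car come from the lecture, and the actual Toyota.csv file may name its columns differently.

```python
# Hypothetical column names for the Toyota.csv layout described in the lecture.
columns = ["Price", "Age", "KM", "FuelType", "HP",
           "MetColor", "Automatic", "CC", "Doors", "Weight"]

# First car: price 13,500 at age 23 months (other values are illustrative).
first_car = dict(zip(columns, [13500, 23, 46986, "Diesel", 90, 1, 0, 2000, 3, 1165]))

print(first_car["Price"], first_car["Age"])  # 13500 23
```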
I'm going to set aside 10 percent of the data, or 144 rows, for validation purposes, and I'm going to train the model on the rest of the data, okay? To do that, in this particular file where you see the split percentage or partition, I have selected Partition. If you don't select Partition, it will just use the entire data. I've selected Partition and I've set it to 90/10/0, okay? Hit Execute. So basically, you'll have to do both, and I'll walk through this in a second. What you do is you load the data, then you select the 90/10/0 partition and hit Execute. What it will do is randomly sample 90 percent of the data and keep aside 10 percent. As I've said before, the way it samples depends on the random seed you choose. Throughout these lectures, I'm going to use the seed 42, because that's the default. If you change that seed, the sample will change. The good thing about starting with a fixed seed is that every time you sample, you get the same sample if you use the seed of 42. Once you've done this, I just want to spend a minute on the categorical predictors in this data set. What do I mean? Until now, we didn't pay a lot of attention to this, but if you look at fuel type, you see that there are three types of fuel in this data set, okay? You can get a better description by asking for the summary statistics: select "Describe." When you look at the summary statistics with Describe, it tells you about the fuel type. There are 1,292 data points in our sample, of which petrol is 88 percent, diesel is 10 percent, and CNG is just 1 percent. So this sample is heavily biased in favor of petrol.
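The partition step above can be sketched in a few lines of plain Python. This is not what Rattle does internally; it is just the same idea, a seeded shuffle followed by a 90/10 cut, applied to row indices standing in for the 1,436 cars:

```python
import random

# Row indices standing in for the 1,436 cars in Toyota.csv.
rows = list(range(1436))

random.seed(42)        # fixed seed => the same sample every time you run this
random.shuffle(rows)

n_train = int(0.9 * len(rows))          # 90 percent for training
train, validation = rows[:n_train], rows[n_train:]

print(len(train), len(validation))      # 1292 144
```

The counts match the lecture: 1,292 training rows and 144 validation rows. Changing the seed changes which rows land in each piece, but not the sizes.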
I personally don't like a sample like this, simply because the model will now start fitting petrol-driven cars much better than diesels. There are various ways of getting around it, but we say this model is skewed in one direction, okay? So what is R going to do, or what is Rattle going to do, when you ask it to predict? Rattle knows that there are three types of fuel. What it will do is create two variables, and that may surprise you. It will create one column, fuel-type petrol, and another column, fuel-type diesel. Now, if the car uses petrol, the fuel-type-petrol column will be one; otherwise, zero. If the car uses diesel, the fuel-type-diesel column will be one; otherwise, zero. If both these columns are zero, then by default the third type must apply, so the third column is never coded. There's a reason for that: we do not include a set of variables in a regression where one variable can be explained using the others. In this particular case, fuel-type diesel plus fuel-type petrol plus fuel-type CNG will add up to one, so if I knew two, I could get the third. When we run the regression, it is only going to create two variables, and it will do that automatically for you, okay? So when we run the regression as usual by choosing Model and Linear, we get this output. There's nothing new about this output, right? When we're estimating the model, you can clearly see it has created two different fuel-type variables; it has split them into two columns. It tells you the R-squared of this model is 88 percent, the adjusted R-squared is about 87.9 percent, and it tells you which variables are significant, okay? Nothing new. So what's new? What's new is that I'm going to take this model and, besides looking at how well it fits the training data, see how well it fits the validation data. So you go to the Evaluate button and you press it, right?
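The dummy coding described above can be sketched directly. This is an illustration of the idea, not Rattle's internal code; the column names `fuel_petrol` and `fuel_diesel` are made up, and CNG is taken as the baseline level that gets no column of its own:

```python
fuels = ["Petrol", "Diesel", "CNG", "Petrol"]

encoded = [
    {"fuel_petrol": int(f == "Petrol"),   # 1 if the car runs on petrol, else 0
     "fuel_diesel": int(f == "Diesel")}   # 1 if the car runs on diesel, else 0
    for f in fuels
]

# CNG is the baseline: both dummies are zero. A third column would be redundant,
# since fuel_petrol + fuel_diesel + fuel_cng would always equal 1.
print(encoded[2])  # {'fuel_petrol': 0, 'fuel_diesel': 0}
```

Dropping one level like this is exactly what avoids the perfect collinearity the lecture warns about: knowing any two columns determines the third.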
Automatically, that will give you this chart. When you ask it to compute predicted versus observed, you can choose whether it's the training data or the validation data. If you choose the validation data, it will plot this. As we have already learned, one line is the 45-degree line and the other is the line going through the predicted values, and it's a pretty good fit. It's explaining almost 74.5 percent of the variability, and not just that, it is explaining it in the right direction: the 45-degree line and the line through the predicted values coincide quite closely. So you ask, where did you get the 74.5 percent? Remember, this 74.5 percent is no longer on the training data; it's on the validation data. It is saying that you're able to explain 74.5 percent of the variability in the validation data. Is this a good thing? Let's wait a second and develop one more measure. For this measure, I use the same Evaluate button, but instead of predicted versus observed, I have selected the Score option. When I select Score and press Execute, it says it is going to score, and write the scores to a file. We can give this a name; it automatically gives one, Toyota validate score idents.csv. So what is scoring? It's going to predict the price of each car in the validation set and write it into a file. What's the use of that? The use is to do things like this. If you look at this file, what I have done is take the predicted values, and the price, and copy them here. This is the actual price, and column F in this Excel file is the predicted price. I've subtracted the predicted price from the actual price, and the next column, G, is called the residual. Then in the next column, I squared the residual. So all I'm doing is computing the error.
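The spreadsheet columns just described can be reproduced in a short sketch. The prices below are made up for illustration; in the lecture they would come from the scored validation file:

```python
# Column E: actual prices; column F: predicted prices (illustrative values).
actual    = [13500, 13750, 12950]
predicted = [13120, 14010, 12800]

# Column G: residual = actual minus predicted.
residuals = [a - p for a, p in zip(actual, predicted)]

# Column H: squared residual.
squared = [r ** 2 for r in residuals]

print(residuals)  # [380, -260, 150]
print(squared)    # [144400, 67600, 22500]
```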
The error I'm defining is actual minus predicted, and I'm squaring it, and that's in column H. Then I add up the entire set of squared residuals, find the average, and take the square root. If you look at the formula at the top, it shows the square root of the average of the observations in column H. This is an Excel formula; I assume you know Excel. It says the root mean square error, RMSE, is 1,708. This is a measure of how well your model fits the validation data set. So now we have two measures. One measure is the R-squared, which we just saw was about 74 percent on the previous slide. The other is the root mean square error. So what have we achieved? We have fitted a model on our training data, and we're seeing how well the model performs on our hold-out sample. Based on this, what do we do? If we don't like it, we can go tweak the model, fit another model, and test it on the validation set. The advantage is that your evaluation is not biased by training and testing on the same data. Testing on the training data may lead to what we call overfitting, and we will see more of this. By keeping these two sets separate, we reduce the risk of fitting the training data very well and not being able to explain or predict outside this data set. Okay, we'll extend these ideas a little later. So what we have learned in this segment is the concept of computing a metric on a hold-out set, which could be the R-squared or the root mean square error. I have put both metrics here: I call predicted versus observed more of a visual fit on the validation data, and the root mean square error more of a numerical fit. Both of them say the same thing: I'm explaining about 74 percent of the variability, and I have an error of about 1,708, on the hold-out set.
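The RMSE step, square root of the average of column H, is a one-liner. The squared residuals below are made up for illustration; on the real validation data the lecture's value was 1,708:

```python
import math

# Illustrative squared residuals (the lecture's column H, for a few cars).
squared_residuals = [144.0, 625.0, 400.0, 1225.0]

# RMSE = sqrt(mean of squared residuals), the same as Excel's
# =SQRT(AVERAGE(H:H)) described in the lecture.
rmse = math.sqrt(sum(squared_residuals) / len(squared_residuals))

print(rmse)
```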
Now, if I develop another model, which I'm soon going to do, we'll be able to compare how well we do against these metrics. The advantage, to summarize, is that when you have a lot of data, you can keep aside a little bit, develop your model, test it on the hold-out sample using some standard metrics, then change your model and see whether you can improve the fit. Once you're satisfied, you may actually go and test it on a third set. That's why you have these three pieces: the training piece, the validation piece, and the testing piece, and we will come to the testing piece a little later today.