Welcome to Model Evaluation. In this video, you will learn techniques for evaluating the performance of your models. You’ll learn how to split your dataset into training and testing sets, build and train the model with a training set, and compute metrics on the testing set to assess the model’s performance. You’ll also learn a technique for handling cases with small datasets.

So far, you have used the entire dataset to train your model. If you evaluate your model on the same data you trained it on, the result will most likely be too optimistic. The whole point of a model is to work with unseen data, so the solution is to split your data into training and testing sets.

Separating data into training and testing sets is an important part of model evaluation. A training set is a subset of data used to train the model. A test set is a subset of data used to test the trained model; you use it to determine how well your model will perform in the real world. When you split a dataset, the larger portion is usually used for training and the smaller part for testing. For example, you can use 80% of the data for training and the remaining 20% for testing. You first use the training set to build a model, then use the testing set to evaluate its performance. When you have completed testing, you should retrain the model on all the data to get the best performance.

In this module, we are using the tidymodels package. Tidymodels is a collection of packages for modeling and machine learning using tidyverse principles. To create training and testing sets, you’ll use functions from the tidymodels library. The first step is to load the tidymodels library. Then, you provide a seed value using the set.seed() function. This function sets the starting number used to generate a sequence of random numbers, which ensures that you get the same result each time you start with the same seed.
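The splitting steps described above can be sketched as follows. This is a minimal sketch, assuming a hypothetical data frame called `flight_delays` standing in for the course dataset:

```r
# Load the tidymodels collection of modeling packages
library(tidymodels)

# Set a seed so the random split is reproducible; 1234 is arbitrary
set.seed(1234)

# Create a single binary split: 80% of rows for training, 20% for testing.
# `flight_delays` is a hypothetical data frame used for illustration.
flight_split <- initial_split(flight_delays, prop = 0.8)

# Extract the two subsets from the split object
train_data <- training(flight_split)
test_data  <- testing(flight_split)
```

Because the seed is fixed, rerunning this code produces the same split every time, which makes your evaluation results repeatable.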
You can set the seed to any arbitrary value; this example uses 1234. Finally, you use the initial_split() function to create a single binary split of the data into a training set and a testing set. You must specify the dataset you want to split and the proportion of the dataset to use for training. In this case, we want the training set to be 80% of the dataset. You'll find more readings about training and testing functions later in this module.

Now that you have training and testing sets, you can train a model of your choice on the training set. This example uses linear regression and the modeling functions in the tidymodels framework. The first step is to pick the model. This example specifies a linear regression model by calling the linear_reg() function and setting the engine to “lm”. We will call this specification lm_spec. You can now fit the model using the training data. The output shows the formula and the fitted coefficients.

Now it is time to evaluate the model to estimate how well it performs on new data, the test data. This example uses the model created in the previous step to make the predictions. The output is stored in a data frame with a single column called .pred. You can then add a new column to this data frame using the mutate() function. This new column is named “truth” and contains the values of ArrDelayMinutes from the test_data. In the end, you have a data frame with the predictions and the true values.

The testing root mean squared error, or RMSE, provides an estimate of how well your model performs on new data, so let’s compare it to the training RMSE. From the code, you can see that the RMSE is higher for the testing data than for the training data. The same happens to the R-squared value. Why is the test set RMSE higher than the training set RMSE? The reason is that the model corresponds too closely, or even exactly, to the training data, so it achieves a lower RMSE in training.
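The fit–predict–evaluate steps described above can be sketched as follows. This assumes hypothetical `train_data` and `test_data` tibbles, with ArrDelayMinutes as the outcome and DepDelayMinutes as an assumed predictor (the original formula is not shown in the transcript):

```r
# Pick the model: a linear regression specification with the "lm" engine
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Fit the model on the training data.
# DepDelayMinutes is a hypothetical predictor used for illustration.
lm_fit <- lm_spec %>%
  fit(ArrDelayMinutes ~ DepDelayMinutes, data = train_data)

# Predictions come back as a one-column tibble named .pred;
# add the true values from the test set alongside them
test_results <- lm_fit %>%
  predict(new_data = test_data) %>%
  mutate(truth = test_data$ArrDelayMinutes)

# Compute test-set metrics to compare against the training metrics
rmse(test_results, truth = truth, estimate = .pred)
rsq(test_results, truth = truth, estimate = .pred)
```

Running the same rmse() and rsq() calls on predictions for the training data gives the training metrics for comparison.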
In this case, the model fits the training data very well but performs poorly when predicting new data. This is called overfitting. You can also make a plot to visualize how well you predicted the arrival delay. This example plots the actual values, the truth of ArrDelayMinutes, against the model predictions as a scatter plot. It also plots the line y equals x through the origin. This line is a visual representation of the perfect model, where every predicted value equals the true value in the test set. The farther the points are from this line, the worse the model fit.

There is a noticeable problem when you do the training and testing split on small datasets. If you have a small dataset, splitting it into a training and testing set might leave you with a very small test set. This introduces bias into your evaluation because you’re reducing the size of your in-sample training data. Another problem is that you may be keeping some useful examples out of the training set. There is also a risk that you will not effectively evaluate model performance because you do not have enough data in the test set. One solution to this problem is to perform k-fold cross-validation.

One of the most common “out-of-sample” evaluation techniques is cross-validation. Cross-validation makes effective use of data because each observation is used for both training and testing. In this method, the dataset is split into k equal groups; each group is referred to as a fold. This example has four folds. Some of the folds are used as a training set, which you use to train the model, and the remaining fold is used as a test set, which you use to test the model. For example, you can use three folds for training and one fold for testing. This is repeated until each fold has been used for both training and testing. At the end, you use the average of the results as the estimate of out-of-sample error. The evaluation metric you use depends on the model.
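The truth-versus-prediction plot described above can be sketched as follows, assuming a hypothetical `test_results` tibble with `truth` and `.pred` columns like the one built in the evaluation step:

```r
# ggplot2 is loaded as part of tidymodels, but can be attached explicitly
library(ggplot2)

ggplot(test_results, aes(x = truth, y = .pred)) +
  geom_point(alpha = 0.3) +                                  # one point per test observation
  geom_abline(intercept = 0, slope = 1, color = "red") +     # the line y = x: a perfect model
  labs(x = "Actual ArrDelayMinutes", y = "Predicted ArrDelayMinutes")
```

Points falling near the red line are well predicted; the farther they scatter from it, the worse the fit.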
You might ask why it’s worth the effort to perform cross-validation. It’s because cross-validation is used to test the generalizability of the model. Generalizability is a measure of how useful the results of a study are for a broader group of people and situations. As you train a model on the training set, it tends to overfit most of the time; to avoid this, you can use regularization techniques. Cross-validation provides a check on how the model performs on test data (new, unseen data), and since you have limited training instances, you need to be careful about how many training samples you set aside for testing.

Moreover, cross-validation works well with small amounts of data. For example, assume you only have 100 samples. If you do a train/test split of 80 to 20 percent, you only have 20 samples in the test set, which is too small to yield reliable results. With cross-validation, you can have as many as k folds, so you can build k different models. In this case, you can make predictions on all your data and then average out the model performance. In R, you can use the vfold_cv() function to create the folds for cross-validation. Usually, you would fit multiple models and select the one with the smallest RMSE; this example demonstrates the process with only one model.

In this video, you learned that separating data into training and testing sets is an important part of model evaluation. You also learned the three steps to evaluating a model: train the model, make predictions, and compute metrics. Finally, you learned that cross-validation techniques, like k-fold cross-validation, can help improve model evaluation when datasets are small.
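As a closing sketch, the vfold_cv() workflow mentioned above can look like this, assuming the hypothetical `train_data` set, `lm_spec` specification, and formula from the earlier steps:

```r
# Make the fold assignment reproducible
set.seed(1234)

# Split the training data into 10 folds; each fold serves once as the test set
folds <- vfold_cv(train_data, v = 10)

# Fit the linear regression spec on each set of 9 training folds and
# evaluate on the held-out fold. DepDelayMinutes is a hypothetical predictor.
cv_results <- fit_resamples(
  lm_spec,
  ArrDelayMinutes ~ DepDelayMinutes,
  resamples = folds
)

# Average RMSE and R-squared across all 10 folds:
# the estimate of out-of-sample error
collect_metrics(cv_results)
```

To compare several candidate models, you would repeat fit_resamples() with each specification and keep the one whose averaged RMSE is smallest.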