If you're not careful about how you divide a data set into in-sample and out-of-sample portions, the split itself can introduce bias, and useful examples may be excluded from training. Sometimes a data set is complex enough that a test set which appears similar to the training set actually isn't. That's especially true for data sets with high dimensionality, where it's hard to visualize what's going on. How many people can think in 37 dimensions? I can't. I struggle with four. When I read Einstein's theory of relativity, three dimensions plus time was right at the edge of my comprehension. Some data sets have high dimensionality.

So, to come back to the question: there's a better way than making these arbitrary splitting decisions, and that's to use a process called cross-validation. Personally, this process appeals to me. You take your data set. Can you see the shading on the slide? It's running down across the diagonal here. The idea behind cross-validation is to divide your data set into what are called folds, where k, the number of folds, is some number you choose depending on how big your data set is. This example shows five folds. Then you make a first iteration: you train on four of the folds, then you test on the remaining fold and record the error. On the second iteration, you use a different fold for testing and train on the other four. You continue that all the way down, recording the error for each iteration. You can then average those errors together and make a decision about whether your algorithm is producing good results or undesirable results. Is the error too high? Is the error small? I guess small is good.
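The fold-by-fold loop described above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the model here is just a mean predictor and the error metric is squared error, both chosen only to keep the sketch self-contained.

```python
# Minimal k-fold cross-validation sketch: split the data into k folds,
# train on k-1 of them, test on the held-out fold, record the error,
# and average the per-fold errors at the end.

def k_fold_errors(xs, ys, k, fit, error):
    n = len(xs)
    fold_size = n // k
    errors = []
    for i in range(k):
        # The i-th fold is held out for testing; the last fold absorbs any remainder.
        end = (i + 1) * fold_size if i < k - 1 else n
        test_idx = set(range(i * fold_size, end))
        train_x = [x for j, x in enumerate(xs) if j not in test_idx]
        train_y = [y for j, y in enumerate(ys) if j not in test_idx]
        model = fit(train_x, train_y)
        fold_err = sum(error(model(xs[j]), ys[j]) for j in test_idx) / len(test_idx)
        errors.append(fold_err)
    return errors, sum(errors) / k

# Toy stand-in model: predict the mean of the training outcomes,
# scored with squared error.
def fit_mean(train_x, train_y):
    m = sum(train_y) / len(train_y)
    return lambda x: m

def sq_err(pred, actual):
    return (pred - actual) ** 2

xs = list(range(10))
ys = [2.0 * x for x in xs]
fold_errs, avg_err = k_fold_errors(xs, ys, k=5, fit=fit_mean, error=sq_err)
print(fold_errs, avg_err)
```

Every observation lands in the test fold exactly once, which is the property the lecture highlights: nothing is permanently locked away from testing.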
If the error's small, that's good. If the error's too high, then we have more work to do: maybe we need more data, maybe we need a different algorithm, and so forth. One claimed benefit of cross-validation is that it works regardless of the data set size. I have a little issue with that if the data set is only 10 or 16 housing examples; I'm not sure it would work very well on my cartoon example for linear regression. But given a data set of significant size, it does work regardless of the size, because by increasing the number of folds you're effectively increasing the size of your training set. Differences in the distribution of individual folds don't matter as much when you do this. And you are testing all of your observations, unlike the traditional approach where you just say, I'm going to use this data for training and this data for out-of-sample validation. You're testing every observation. That's one of the reasons I like it; it appeals to me as an engineer. By taking the mean of the error results, you get an expected measure of predictor performance. High variation in the cross-validation performance tells you the data is so variable that the algorithm isn't capable of properly modeling it. So this is something to be on the watch for: unusual behavior and erratic predictions from your model. Both R and Python offer functions to slice and dice your matrix of data, your data set, into training and test parts. It's pretty handy, and it's documented out there. This particular method has now been, what's the word, deprecated? When I run it, I get a message that a new method has replaced it, so I need to go update my code. But it's nice because you can give it the iris data set and call the cross_validation method train_test_split.
And it returns features for training, features for test, outcomes for training, and outcomes for testing. That's handy; you don't have to do it yourself. And you can control the test size. There's a whole bunch of parameters you can pass into this thing to control how many folds there are and how you want to slice and dice your data. And there it is, deprecated.
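The deprecated call mentioned in the lecture lived in scikit-learn's old `sklearn.cross_validation` module; the same function now lives in `sklearn.model_selection`. Assuming scikit-learn is installed, a sketch of the updated call looks like this:

```python
# Modern replacement for the deprecated sklearn.cross_validation.train_test_split:
# the function moved to sklearn.model_selection.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.2,    # hold out 20% of the 150 iris rows for testing
    random_state=0,   # fixed seed so the split is reproducible
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

The same `sklearn.model_selection` module also provides `KFold` and `cross_val_score`, which carry out the fold-based train/test/average procedure described earlier.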