Previously, we just looked at two kinds of pitches, changeups and fastballs, and two different features, effective speed and spin rate. But we have both many more features and many more classes of pitches, and just like SVMs, trees are flexible enough to work with any number of classes and features. Let's dig into this data in a bit more detail. I want to go back to our full pitching dataset, bring that in, and pop a visualization in here. That's the data we're looking at for pitches, and that's just with those two features.

Now, in our SVM work we looked at a number of different pitch metrics and game details, and it seems like a fair approach to see how trees might handle those, so I'm just going to copy and paste some of that in. We're going to look at the release spin rate, the release extension, the release position of the ball, and so on. We're going to bring in the player name, the game details, and so forth, and I'm going to turn this all into a vector of features. If you'd like, you can go back and reference that work from the SVMs.

Before we create the validation and training sets, let's talk about what we actually have in our data and look at the prevalence of each class, each type of pitch. Here I've called it pitch_type and pitch_type2, and I'm just going to apply the length function to those groups. We've got all sorts of different pitches here, including one, the eephus, which is almost never seen. I'm actually going to get rid of that and just acknowledge it as a limitation of our model, of our approach: the eephus is not going to be something that we can predict.

Now that we're in this multi-class scenario, I want to randomly sample from our DataFrame for the training set, and then we'll make our validation set everything that's not in the sample. Then we just need to impute the missing values throughout. Remember, we always do this imputation after we've separated our training set (in the notebook, that's df_pitches) from our validation set, because we don't want to leak any information. None of this is really new, but let's take a look at how a few different tree parameters might change the decision boundaries and the accuracy of our classification.

Just because we're using a new algorithm doesn't mean we can't use the same powerful techniques we've seen previously, like cross-validation. This is really one of the beautiful aspects of the sklearn architecture. I'm going to bring in cross-validation, reduce our data to just the features we're actually interested in, that whole set of features, and set our target y to pitch_type2, which is the numeric version. Then I'm going to create a dictionary called classifiers and add six different classifiers to it: decision trees with a max depth of 1, 2, 3, 4, and 5, and then one that's completely unbounded. Finally, I'll run through each one of these, train, validate, and print out the scores. Rough sketches of these steps follow below.
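First, a minimal sketch of the kind of scatter plot described above, assuming the data lives in a CSV with Statcast-style column names; the file name and column names are my assumptions, not the lesson's exact notebook:

```python
# A rough sketch of plotting the pitches by the two original features.
# The file name and column names are assumptions (Statcast-style).
import matplotlib.pyplot as plt
import pandas as pd

df_pitches = pd.read_csv("pitches.csv")  # hypothetical file name

# One scatter series per pitch type so the classes are distinguishable
for pitch, group in df_pitches.groupby("pitch_type"):
    plt.scatter(group["effective_speed"], group["release_spin_rate"],
                label=pitch, s=5, alpha=0.5)
plt.xlabel("effective speed (mph)")
plt.ylabel("release spin rate (rpm)")
plt.legend()
plt.show()
```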
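The feature selection and class-prevalence check might look like the following, reusing df_pitches from the sketch above. The column list is an assumption based on Statcast conventions, and "EP" is my guess at the code used for the eephus:

```python
# Sketch: select the expanded feature set and check class prevalence.
# Column names here are assumed Statcast conventions, not confirmed.
features = ["release_spin_rate", "release_extension",
            "release_pos_x", "release_pos_z", "effective_speed"]
# Categorical fields like player_name and the game details would need a
# numeric encoding before a tree could use them, so they're omitted here.

# Apply the length function to each (pitch_type, pitch_type2) group
print(df_pitches.groupby(["pitch_type", "pitch_type2"]).apply(len))

# The eephus ("EP" is assumed here) is almost never thrown; drop it and
# accept that our model simply won't be able to predict it.
df_pitches = df_pitches[df_pitches["pitch_type"] != "EP"]
```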
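The random split and the leak-free imputation could be sketched like this; the 5,000-row sample size is taken from the speed comment below, the random_state is arbitrary, and the frame names df_train and df_validate are mine rather than necessarily the notebook's:

```python
# Sketch: random sample for training, the remainder for validation, and
# imputation fit on the training data only so nothing leaks between sets.
from sklearn.impute import SimpleImputer

df_train = df_pitches.sample(5000, random_state=1337).copy()
df_validate = df_pitches[~df_pitches.index.isin(df_train.index)].copy()

imputer = SimpleImputer(strategy="mean")
df_train[features] = imputer.fit_transform(df_train[features])
# transform (not fit) on the validation set, using training-set means
df_validate[features] = imputer.transform(df_validate[features])
```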
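And a sketch of the classifier dictionary and the train/validate loop, using sklearn's cross_val_score for the cross-validation piece:

```python
# Sketch: six classifiers (max depths 1-5 plus one unbounded tree),
# cross-validated on the training data and scored on the validation set.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X_train, y_train = df_train[features], df_train["pitch_type2"]
X_val, y_val = df_validate[features], df_validate["pitch_type2"]

classifiers = {f"max_depth={d}": DecisionTreeClassifier(max_depth=d)
               for d in (1, 2, 3, 4, 5)}
classifiers["unbounded"] = DecisionTreeClassifier()  # no depth limit

for name, clf in classifiers.items():
    cv = cross_val_score(clf, X_train, y_train, cv=5)
    clf.fit(X_train, y_train)
    print(f"{name}: cv accuracy={cv.mean():.3f}, "
          f"validation accuracy={clf.score(X_val, y_val):.3f}")
```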
Well, there's a lot to unpack here, so let's start with one of the positives. Did you notice how fast that was? Absolutely amazing. The SVMs took what seemed like forever to train, but here the trees just whip through 5,000 entries like nothing. Speed is only one consideration, though it is a real consideration when you're building models and doing exploration. We see that our actual validation set accuracy is a bit lower than our cross-validation accuracy. This isn't uncommon, but it's actually not so far off, so I'm not feeling too bad about these numbers. Keep in mind that this is just one random sampling of the data for our training set. If you change or remove that random_state parameter, you're going to get different results; they'll likely be comparable, but there will be some outliers too.

Now, we talked previously about the issue with accuracy as a metric, and considering how unbalanced our dataset is, it seems like this may make things even more confusing. Let's take a look at the confusion matrix for this last model, the unbounded one (a sketch of how to produce it appears at the end of this section). We see a decently strong diagonal, with a few classes below 50 percent accuracy in this nine-class problem. One place that looks a lot like our boxing data is Class 6, which we tend to predict as Class 4 more often than as Class 6. How much does this matter? Well, it depends on what your use case is for this model. If you go back and look at our list of pitches, you'll see that Class 4 is a four-seam fastball and Class 6 is a two-seam fastball. The pitches are different, but not nearly as different as, say, a fastball and a changeup. Another good one to consider here is Pitch 7, which should be a knuckle curveball and which we only correctly predict about a third of the time; we regularly misclassify it as either a changeup or a slider. Both changeups and sliders join knuckle curveballs as off-speed pitches, moving slower than fastballs, and it's clear that the model we've built is picking up on this.

That's a bit about how decision trees are created and how they work. You'll see that it was actually really fast to learn how a new classifier can be used in sklearn, and that's the power of the sklearn API. Let's look next at sklearn's ability to create regression trees. I'm going to introduce you to a regression tree called the M5P, which I'm specifically very interested in.
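Before we move on, here's the promised minimal sketch of producing that confusion matrix for the unbounded tree, reusing the names from the sketches above and normalizing by true class so the diagonal reads as per-class accuracy:

```python
# Sketch: confusion matrix for the unbounded tree, normalized by row
# (true class) so each diagonal cell reads as per-class accuracy.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    classifiers["unbounded"], X_val, y_val, normalize="true")
plt.show()
```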