Decision trees are flowcharts of rules that describe how we segment data into classes given some features. The nice thing about decision trees is that they're very introspectable. They've been used quite a bit across fields when it comes to machine learning. With sports analytics they're especially nice because they're easy to understand by non-experts, so coaches, players, executives, and other members of your team. Now, decision trees are actually a very old technique in machine learning, one of the very earliest techniques, and there are a lot of different algorithms to choose from. ID3 was one of the first, C4.5 is one that I've used a lot throughout my life as a grad student and faculty member, and CART, the classification and regression tree, is the main decision tree algorithm that's used now in both sklearn and [inaudible]. Now, at a high level, you read a decision tree starting at the root, which is the top of the tree, and going down. Each branch, so each split, is binary, and it's based on some given feature and some value. At the bottom of the tree are the leaves, and these are your classification classes. Now, trees grow big quickly, and big trees tend to overfit the data, so there are numerous ways to constrain their growth. This is actually much easier to explain with data, especially if you don't know anything about trees as a data structure. Instead of giving you a whole bunch of slides to talk about these, I want to jump straight to the notebook this week and tackle decision trees there. Let's go back to the pitching data that we used previously for the SVMs; it's really good data to talk about trees as well. You'll see I'm just going to bring in all of the imports that we used previously. I'm going to read in the data file from a zip file into a DataFrame and work on visualizing this to look at the effective speed versus the release spin rate. That's what our point cloud looks like. 
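If you want to follow along outside the course notebook, here's a minimal sketch of the loading and plotting step. The course's actual zip file isn't reproduced here, so a small synthetic DataFrame stands in for it; the column names `effective_speed` and `release_spin_rate` are assumptions based on the Statcast naming used on screen.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for the pitching data; in the notebook this would be
# something like pd.read_csv("<the course's zip file>"), which pandas can
# read directly from a zipped CSV.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "effective_speed": rng.normal(90, 4, 500),      # mph
    "release_spin_rate": rng.normal(2300, 200, 500) # rpm
})

# The point cloud: effective speed versus release spin rate.
fig, ax = plt.subplots()
ax.scatter(df["effective_speed"], df["release_spin_rate"], s=5)
ax.set_xlabel("effective_speed")
ax.set_ylabel("release_spin_rate")
fig.savefig("point_cloud.png")
```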
Remember, this is actually many different kinds of pitches. I want to take a look at some of the overlaps between those pitches. I'm trying to build on what we did last time, so I want to bring in Matplotlib's colormaps here. I'm going to generate a list of colors, and then in this line down here, I'm actually going to assign a color to every single one of the pitches. That'll become a little more clear here. There we see, and remember, it's always with respect to the features that we're looking at. In this case, because we can only really see a 2D plot, we're looking at the effective speed and the release spin rate, but we're seeing how the different pitches are classified by color and how they're distributed. You can see that if we wanted to separate the green from the light blue, or the purple from the orange, we should be able to do that fairly easily, but there's a lot more messiness when we start to consider this class, whatever the blue one is. Some of those categories are pretty distinct and some not so much. Let's go back now to our initial two pitches, fastballs and changeups. We're just going to pull them out by pitch type and look at just those two pitches. Here we go. We can see that there's some interesting overlap in the center part of the image between the two, the fastballs and the changeups, where there's a difference in classification. For the sake of teaching, I actually want to zoom in on that area of data because it's going to be more interesting for showing you how decision trees work. I played around with this data a little bit. It's a little bit fabricated because I'm cherry-picking my data here to show you a few things, but let's limit our speed to between 85 and 90 miles per hour. I'll just take the first thousand pitches that are in that space. Now, look at this area for a moment. We're only considering two features: the effective speed of the pitch and the release spin rate. 
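A hedged sketch of the color-per-pitch-type trick and the filtering step described above. The pitch-type codes (`FF`, `CH`, and so on) and column names are assumptions, and a synthetic frame again stands in for the real data; only the pattern, mapping each pitch type to a colormap entry and then slicing down to two pitches between 85 and 90 mph, reflects what the lecture does.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in; the pitch-type codes are assumptions.
df = pd.DataFrame({
    "pitch_type": rng.choice(["FF", "CH", "SL", "CU"], 2000),
    "effective_speed": rng.uniform(78, 98, 2000),
    "release_spin_rate": rng.uniform(1800, 2800, 2000),
})

# One color per pitch type, drawn from a qualitative colormap.
types = sorted(df["pitch_type"].unique())
color_of = {t: plt.cm.tab10(i) for i, t in enumerate(types)}
colors = df["pitch_type"].map(color_of)

# Keep just fastballs and changeups between 85 and 90 mph, first 1,000 rows.
mask = df["pitch_type"].isin(["FF", "CH"]) & df["effective_speed"].between(85, 90)
df_small = df[mask].head(1000)
```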
Our fastballs are in red and our changeups are in yellow. A decision tree has to decide on one binary split to make: a rule that either cuts the release spin rate somewhere or the effective speed somewhere. If you were going to segment this space, if you were going to draw a line either horizontally or vertically, those being your only two options on this chart, where would you draw it? All right. The way the CART algorithm works is that it's going to look to separate the dataset into two smaller datasets, where each one of those smaller datasets is pure. This means homogeneous with respect to the classes in it. It's going to try to keep those subsets pure, weighted by their size. The tree algorithm is going to consider our two features and attempt to segment them based on this measure. This is actually very hard to do perfectly for large datasets, so instead the method aims to estimate impurity. Now, we only consider one feature at a time because this is actually a recursive process. For each of the smaller datasets, we'll just run the algorithm again, breaking each of them into two more datasets. We'll stop either when all of the data we have left in a node is of a single class, for instance fastballs, or when some other threshold has been hit. The nice thing is we can just leverage all of our knowledge about SKLearn. We don't actually need to do much differently than we did with the SVMs to start building these models. We'll create our X, our set of training instances or features, from df_small's effective speed and release spin rate. Then our y is going to be our pitch type, and we have to turn that into a number for it to work with the fitting that's being done here. Then we just pull in the decision tree classifier here instead of the SVC. 
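The X/y setup just described might look roughly like this. Synthetic data stands in for `df_small`, and the exact encoding call used in the course notebook isn't shown on screen; `factorize` is one reasonable way to turn the pitch-type strings into the integer labels that the plotting helpers expect.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1337)
# Synthetic stand-in for df_small: two pitch types, two features.
df_small = pd.DataFrame({
    "effective_speed": np.r_[rng.normal(88.5, 1, 300), rng.normal(86, 1, 300)],
    "release_spin_rate": np.r_[rng.normal(2400, 100, 300), rng.normal(2100, 100, 300)],
    "pitch_type": ["FF"] * 300 + ["CH"] * 300,
})

# Features and labels, just like the SVM setup.
X = df_small[["effective_speed", "release_spin_rate"]]
# factorize maps the strings to integers (FF -> 0, CH -> 1 here).
y = df_small["pitch_type"].factorize()[0]

# Same sklearn workflow as the SVC, just a different estimator.
clf = DecisionTreeClassifier(max_depth=1, random_state=1337)
clf.fit(X, y)
print(clf.score(X, y))
```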
Here I'm using two different parameters: the max depth, which I'm setting to one, and of course that random state variable again so that you should be able to see what I get to see. Then I want to leverage Sebastian's great code again to see the decision regions so that we can visualize it. We can see that the tree decided to split using the effective speed at around 87.5 miles per hour, and we can see that the accuracy at the top of this graph is only 92 percent. How close was this to what you would've done as far as splitting the data? Regardless, the orange triangles on the left and the blue squares towards the top suggest we could have done a bit better if we recursed a bit more deeply. Keep in mind that we're going to look at each side of this tree individually; we actually have two more splits that we can do, one for the left-hand side under 87.5 miles per hour and one for the right-hand side. These splits are completely independent, so we can choose totally different values. In fact, on one of them we could continue splitting on the bottom axis if we wanted to, and on the other we could choose the y-axis. Pause the video for a moment. Where would you split on the left-hand side, in the blue space, and where would you split on the right-hand side, in the orange space? To do that here, we just increase our max depth parameter to two, fit the data again, and run the plotting again. Great. The first thing to notice is that our accuracy shot up a bit; we're now at 96 percent. We see that the tree split on the second feature, our release spin rate, and actually did that for both the right-hand and the left-hand subtrees, but at different values: around 2,250 for the left-hand side and 2,500 for the right-hand side. We now have four leaf nodes in our tree. But what are the rules that it's actually created? We see how this boundary has changed. We see these different regions and it looks pretty good. 
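To peek at those per-subtree thresholds programmatically, you can inspect the fitted estimator's `tree_` attribute. This is a sketch on synthetic stand-in data (the decision-region plotting, presumably mlxtend's `plot_decision_regions`, is omitted here); the printed thresholds will differ from the 2,250 and 2,500 values in the lecture, which come from the real pitch data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1337)
# Synthetic stand-in for the filtered pitch data.
X = pd.DataFrame({
    "effective_speed": np.r_[rng.normal(88.5, 1, 300), rng.normal(86, 1, 300)],
    "release_spin_rate": np.r_[rng.normal(2400, 100, 300), rng.normal(2100, 100, 300)],
})
y = np.r_[np.ones(300, dtype=int), np.zeros(300, dtype=int)]

clf = DecisionTreeClassifier(max_depth=2, random_state=1337).fit(X, y)

# tree_ exposes the learned structure: the feature index each node splits
# on (-2 marks a leaf) and the threshold used at that node.
print(clf.tree_.feature)
print(clf.tree_.threshold)
print(f"accuracy: {clf.score(X, y):.2f}")
```

Note that the left and right children can split on different features at different values, which is exactly the independence the lecture points out.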
Now, most of the blue squares are in the blue space and most of the orange triangles are in the orange space. But what are the actual rules that have been created underneath by the CART algorithm? SKLearn has built-in functions to display the decision tree itself. We can use plot_tree to generate it and then render it inline. This is our decision tree; let's walk through it. In each node, we see the rule, which is a binary comparison, a greater-than or a less-than with respect to a single feature. At the root node, we see the split happens at 87.353 miles per hour of effective speed. Right under that is a Gini value. This is a measure of impurity, and we can control which measure the CART algorithm uses; the default is the Gini coefficient. Now, I'm not going to go into this more, but you can read about your options and how the Gini coefficient is calculated in the SKLearn docs. We then see the number of samples which are considered in the node; in this tree that's just over 600 pitches, which are either changeups or fastballs. Now remember, we only decided to look at a small portion of the data and then we filtered for just two pitches. The value line contains the true values for the observations within this split. In this case there were 125 instances of the zero class, those are the blue squares, and 499 instances of our one class, the orange triangles. Lastly, we have the class value, which is what would be predicted by this node for the samples that sit at it. This is always just the majority class; in this case, our orange triangles, or one. Now, let's take a look at the left-hand node. This would be all the data points which had an effective speed below the threshold of 87.353 miles per hour. We see that there are 156 samples here, and out of those, 114 were zeros, or blue squares, and 42 were ones, or orange triangles. The predicted class would be zero. We also see that the split at this position isn't great. 
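The rendering step uses sklearn's `plot_tree`. Here's a self-contained sketch on synthetic stand-in data; the `feature_names` and `class_names` arguments are how you get readable labels in each box, and every node shows the rule, the Gini impurity, the sample count, the per-class counts (`value`), and the majority class, exactly the fields walked through above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

rng = np.random.default_rng(1337)
# Synthetic stand-in for the two-pitch data.
X = np.c_[np.r_[rng.normal(88.5, 1, 300), rng.normal(86, 1, 300)],
          np.r_[rng.normal(2400, 100, 300), rng.normal(2100, 100, 300)]]
y = np.r_[np.ones(300, dtype=int), np.zeros(300, dtype=int)]

clf = DecisionTreeClassifier(max_depth=2, random_state=1337).fit(X, y)

# Draw the fitted tree: rule, gini, samples, value, and class per node.
fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(clf, feature_names=["effective_speed", "release_spin_rate"],
          class_names=["CH", "FF"], filled=True, ax=ax)
fig.savefig("tree.png")
```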
When we segment by a release spin rate of 2,042, the Gini coefficient is actually almost 0.4. The purity is much better, though, at the next level in the tree, especially on the right-hand side, which is largely our blue-class pitches. That's how the decision tree method works at a high level, and as you can see, interpretation of the model is pretty intuitive, but it's not always so nice and clean. Let's go down one more level in this tree. That's easy to do: we just set our max depth parameter to three and run it again. Now, this tree actually looks no different from the previous one, and our accuracy is the same too. So what gives? Let's take a look at the plot of the rules. Let's spend our time on the left-hand side. We see that both nodes split on another value of release spin rate, but it didn't change any of our classifications. Both subtrees still predict either one or zero as before, and it doesn't seem that any new information has been gained. This isn't actually completely true. There's more segmenting going on, but the decision boundary is still the same. That is, we can't see the splits because we're still predicting the same class outcomes. That tree is no more useful for prediction than the last. Now, if you go look at the SKLearn documentation, you'll see that there's a parameter to control the pruning of the tree, or removing branches just like this, called ccp_alpha. By default there's no pruning being done, so CART just continues to split based on the purity of the split choice until you get a Gini of zero, a completely homogeneous class. We can see that this actually happened in the right-hand side of the tree. What's going to happen if we decide to keep recursing down into this tree? Let's take a look. Wow, doesn't that look odd? You see, the previous level set up this level to split a bit more intelligently, and our accuracy has increased. 
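The `ccp_alpha` behavior is easy to see directly. This sketch uses deliberately noisy synthetic data; with the default of no pruning the tree grows until every leaf is pure, while even a small cost-complexity penalty collapses branches that don't improve the fit enough to justify their size.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Noisy two-class data, so an unconstrained tree grows many nodes
# chasing purity on what is mostly noise.
X = rng.normal(size=(600, 2))
y = (X[:, 0] + rng.normal(scale=0.8, size=600) > 0).astype(int)

# Default: ccp_alpha=0, no pruning -- CART splits until leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
# Cost-complexity pruning removes branches that don't pay for themselves.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print("unpruned nodes:", full.tree_.node_count)
print("pruned nodes:  ", pruned.tree_.node_count)
```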
We can see that there's a very small sliver of orange on the left-hand side which captures maybe four triangles, and we also have a bit more granularity on the diagonal boundary, which looks a little bit like a set of steps. This actually demonstrates something really important about decision trees versus, say, SVMs: decision trees are sensitive to the rotation of our data points. Splits are always on a single feature, either x or y in this case, while a linear SVM, for instance, can be a straight line in any direction. If you know your data is separated by a diagonal, like in this case, then it's better to use an SVM, or to transform your data by rotating it if you want to use a decision tree. This is just the basics of how decision trees work. Next we're going to dive in a little more depth and look at how they work in multi-class situations and how we can also do regression with decision trees.
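The rotation sensitivity is easy to demonstrate on toy data. Below, two classes are separated by a perfect diagonal: a depth-limited tree can only approximate it with a staircase, but after rotating the points 45 degrees so the boundary becomes axis-aligned, a single split classifies everything.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Two classes separated by the diagonal boundary y = x.
X = rng.uniform(-1, 1, (400, 2))
y = (X[:, 1] > X[:, 0]).astype(int)

# Axis-aligned splits can only build a staircase along the diagonal.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Rotate the points 45 degrees so the boundary becomes axis-aligned.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X @ R.T

# Now one vertical split separates the classes perfectly.
tree_rot = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_rot, y)

print("diagonal data:", tree_raw.score(X, y))
print("rotated data: ", tree_rot.score(X_rot, y))
```

The same trick, rotating or otherwise transforming your features before fitting, is what the lecture means by transforming the data if you still want to use a tree.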