Decision trees don't math good. I realize math isn't a verb and good isn't an adverb, but all they can do is narrow down to groups of individuals by delimiting ranges of values. If you're in this range for this variable, and you're in this range for this variable, you're high risk, that kind of thing. In comparison to other forms of predictive models that we're going to get into, decision trees can't do mathematical fancy footwork. But in this video, I'm going to show you why decision trees are awfully limited, but then why they totally rock anyway.

Let's look at the decision boundaries of this tree. Oh, wait. No, not that tree, it's too complex. This tree. This small, simple tree operates on two independent variables: spend and tenure, for how much the customer has spent so far and how many years they've been around. Let's say the training data used to generate the tree looks like this. The green customers are positive examples who responded to a marketing campaign, and the red ones are negative. As we did in the first course, we're going to refer to this kind of two-dimensional view of the data and draw the decision boundaries of models as a nice way to intuitively understand how they work and what they're actually doing, and as a nice visual way to compare their relative capabilities. This will continue for nine or ten videos, spilling over into the next module, for the entire survey of modeling methods. We're diving much deeper than we did in the first course, and this way of visualizing will remain throughout. But don't forget that this view greatly simplifies in one important way: real projects practically always have more than only two independent variables. They have dozens or even hundreds. So you usually can't picture it as two-dimensional like this. In fact, you can't picture or visualize it in your mind at all unless you have an 11-dimensional brain. However, this view is great for an intuitive sense of what the models do and to illustrate the basic function and mechanics of each kind of model.

So the decision tree modeling process started with spend and asked whether it was less than $100,000. That first distinction at the top of the tree corresponds with this decision boundary. Then, among those for which the answer is yes, for which the spend was below that threshold, the entire portion on the left was divided into two parts along the second dimension by asking whether the tenure was less than one year. That corresponds with this decision boundary. Finally, within the section for which the answer was yes, the lower-left portion, it revisited the first variable, spend, and thresholded it at $50,000, thus drawing this final decision boundary. Decision trees can only draw horizontal and vertical decision boundaries, like an Etch A Sketch, that toy with two knobs. The tree has four endpoints, four leaves, each corresponding to one rectangular region in the data. For example, this leaf, which is the segment of those with a tenure less than one year and a spend between $50,000 and $100,000, corresponds with this region. So to use the tree, you navigate from the root down to a leaf, and doing so is exactly equivalent to determining which rectangular region is the one within which the individual you're scoring belongs. This ability to draw straight perpendicular lines... I look like I'm striking a pose in that Madonna video. Strike a pose.
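To make that concrete, here's a minimal sketch in Python of that toy tree written out as plain nested if/else statements. The thresholds are the ones from the example ($100,000, one year of tenure, $50,000); the region names are just labels I'm making up for the four leaves.

```python
# A sketch of the toy tree above as nested if/else statements.
# Thresholds come from the example tree; the region names are illustrative.

def tree_region(spend, tenure):
    """Return which of the four rectangular regions a customer falls into."""
    if spend < 100_000:
        if tenure < 1:
            if spend < 50_000:
                return "new customer, lower spend"
            else:
                return "new customer, spend between $50,000 and $100,000"  # the highlighted leaf
        else:
            return "established customer, spend under $100,000"
    else:
        return "spend of $100,000 or more"

print(tree_region(spend=75_000, tenure=0.5))  # lands in the highlighted leaf
```

Navigating from the root to a leaf is just walking through these conditions until one of the return statements, each naming a rectangular region, is reached.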
Anyway, that limited ability is clunky in comparison with the modeling capabilities that you'll see are yet to come. It's just like standard marketing segmentation; in this case, the segmentation scheme is optimized automatically, as guided by the data. Also, a decision tree is actually just like some nested if-then-else statements, if you've done any computer programming. One limitation here is that the predictive score, the probability that the model assigns to each individual within any given rectangular region, is actually the same for that entire region. Take the one we have highlighted: it has six training cases, five of which are positive. So when you use the model, it would estimate, for any individual that falls anywhere within that region, the probability to be 5 out of 6, that is, an 83 percent chance of being positive. It doesn't have any basis for reliably distinguishing any further within that region. If it did, the decision tree would have expanded further and another decision boundary would've been drawn to divide it into even smaller regions. It didn't do so, so everyone in there gets treated the same. But as you'll see when we get to more sophisticated methods, every individual will indeed be treated uniquely, assigned its own probability as calculated from its specific independent variable values.

But despite all these limitations, decision trees totally rock. They're an extremely valuable tool in your analytical toolbox. For a model that's so easy to understand with the human eye, with if-then rules that read like an English sentence, it's amazingly versatile, dynamic, and capable. In industry polls, decision trees are often voted the most or second most popular method by machine learning practitioners, due to their balance of relative simplicity with effectiveness. They're often a great place to start for any given machine learning project, and often a great place to stop as well, when you consider that other methods will increase complexity and decrease understandability while gaining you what is sometimes a relatively small improvement in predictive performance. Remember that oftentimes, improving your data delivers a much better payoff than increasing the sophistication of your modeling method.

One advantage of this simplicity is that, in addition to deploying the model to drive operational decisions, you can sometimes derive strategic insights from examining the rules within the tree. This is an ad hoc, informal process, but it can turn up discoveries such as one found by a social network, which found that new users were more highly retained if they had taken certain key actions, such as updating a profile picture. This is an old example, back when Facebook hadn't fully dominated yet, since this wasn't Facebook, actually. But at the time, it led to operational changes that the business credited with doubling the retention rate of new users.

You may be wondering: if the discoveries or rules encoded by a tree are obvious, like adding a profile photo means more engagement and retention, or like the commonly known effect that more frequently active customers will also tend to be more active in the future, and if these discoveries could have just been written down by human experts in the first place, then what's the big advantage of an automated learning process? Let's address the fundamental value of machine learning in this light. Data matters. For one thing, many hunches turn out to be false. Modeling determines which are true, and to what degree they hold.
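Picking up the earlier point that everyone inside a given rectangular region gets the same score, here's a minimal sketch, assuming scikit-learn and a small made-up spend-and-tenure dataset (not the data from the video). It fits a shallow tree and then confirms that every training case landing in the same leaf receives one identical predicted probability, namely that leaf's fraction of positive cases.

```python
# Minimal sketch (assumes scikit-learn; the spend/tenure data is made up).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Columns: spend (dollars), tenure (years); label 1 = responded to the campaign.
X = np.array([[60_000, 0.4], [70_000, 0.6], [80_000, 0.3], [90_000, 0.8],
              [65_000, 0.2], [75_000, 0.9], [20_000, 0.5], [30_000, 0.4],
              [40_000, 2.0], [120_000, 3.0], [150_000, 1.5], [45_000, 4.0]])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

leaf_ids = tree.apply(X)                # which leaf each training case lands in
scores = tree.predict_proba(X)[:, 1]    # predicted probability of responding

# Every case in a given leaf gets the same score: that leaf's fraction of positives.
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    print(f"leaf {leaf}: {in_leaf.sum()} cases, score {np.unique(scores[in_leaf])}")
```

However the splits shake out, the tree can only hand out as many distinct scores as it has leaves; the more sophisticated methods coming later assign each individual its own score.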
Back to those hunches. When you see a discovery that resonates with you within, let's say, a decision tree, and you say, "I knew it," well, that's known as confirmation bias. It doesn't happen for the hunches you had that never arise in the data. So your bad hunches don't get the same attention, and you might be misled to think all your hunches are sound. Also, modeling discovers how different factors compare in their relative predictive contribution, that is, which are more important, and how to best combine them for overall model performance. This includes setting the optimal point for each threshold cutoff for the yes-no questions of a decision tree. It determines those thresholds automatically, and they don't usually end up being clean like $100,000 in the toy example. Then, of course, you'll find that modeling often does discover unanticipated trends that take you by surprise and were never hunches in the first place.

Besides, those strictly perpendicular decision boundaries actually often do serve reasonably well. Check out this data, where other modeling methods that can draw a diagonal line, but can only draw a single line, such as linear or logistic regression, which we're getting to soon, are just dead in the water. You can try, but there's no single line boundary that's going to do a good job putting the red dots on one side and the blue dots on the other. However, a decision tree's rectangular regions can do a much better job for this data set.

More generally, as you go to more realistically high-dimensional spaces, since you almost always have more than only two independent variables, you can start to imagine how dynamic and well adapted the boundaries can actually be. For example, with three variables, each individual would be in some position in three-dimensional space, and the thresholding yes-no questions of a decision tree would in effect draw boundaries that aren't lines, but rather planes. Each plane will either be parallel to the floor or to one of the walls, like this, or like this. Like, if this variable is greater than 10, that means we're referring to the region above this imaginary plane at my chest. As you add more and more planes, you can divide the space into arbitrarily shaped boxes. Because I took a mime class when I was eight years old. If it's more than three-dimensional, you can vaguely imagine the capabilities, especially as the tree grows to be fairly large. Although careful, don't try too hard or you'll get sucked into the 11th dimension. Another way to think of it is, if there are dozens of independent variables, you can visualize any two or three of them at a time as two-dimensional or three-dimensional space, with perpendicular lines or planes dividing up the regions, but only two or three at a time.

By the way, two other big advantages of decision trees also result from their relative simplicity. Number one, they're robust to noise and outliers in the independent variables. If an individual has a strangely large value, like an erroneous age of 302 years old, or an actual income of $50 million, well, since these unusually large values aren't directly added or multiplied by this model, only compared to thresholds, their presence doesn't throw the modeling process off. Number two, decision trees essentially do feature selection automatically as a built-in part of the modeling process. Since the tree's growth just keeps adding variables one at a time into the model, the leftover variables that it doesn't end up using are just left out.
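As a quick illustration of that built-in feature selection, here's a minimal sketch, again assuming scikit-learn, with a small made-up dataset in which the outcome depends only on the first variable and the second column is pure noise. The fitted tree simply never uses the noise column, so it gets left out.

```python
# Minimal sketch (assumes scikit-learn; the data is made up).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))    # column 0: informative, column 1: pure noise
y = (X[:, 0] > 0.5).astype(int)         # the label depends only on column 0

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The noise column never earns a split, so its importance comes out at (or near) zero.
print(tree.feature_importances_)        # roughly [1.0, 0.0]
```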
This means you usually don't need to do feature selection as a separate pre-processing step before running decision trees. So with these capabilities, pushing the Go button on decision tree software is pretty fun. It feels like pressing the gas pedal the first time you drove a car. There's a palpable source of energy at your disposal: the data, and the power to expose discoveries from it as the tree grows downward, defining smaller sub-segments that are more specific and precise. It feels like a juice squeezer crushing out knowledge juice. If there are patterns to be found, they can't escape undetected. They'll be squeezed out into view.