If you allow a decision tree to just keep growing, it'll ultimately fully master the training data, but it'll fail dismally on the test data. Where do you draw the line? In this video, we'll strike this critical balance between learning and over-learning.

In the last video, we showed a tree that divides the training data into four segments corresponding to these four regions. It looks like it's doing a pretty good job, since each region contains mostly either positive or negative cases. But why stop there? How about adding this new decision boundary, shown in blue? It divides that mostly positive region into two regions: one that has only positive cases, and another with a single negative case. Well, that's probably going too far. You can't generalize well from just one single example.

Here's a visual in which we've allowed the same thing to happen on more realistic data, although this data is also artificially generated to emulate a realistic distribution. You can see that the decision tree drew boundaries that stretch a narrow rectangle way out to the left, just to include that one lonely blue case within a narrow blue region. But when we try that model on test data not used to create it, we do indeed see that this narrow region going out to the left now includes a bunch of red cases. Yeah, it had definitely gone too far. But in general, when performing modeling, and therefore only looking at the training cases, we can't know that for sure.

The fact is, we don't know very much about what shape the decision boundaries should ideally take. We can see here that there's an area dominated by positive cases towards the bottom. Should it be a semicircle, an oval, a polygon? Just how tightly should it fit around that apparent zone of positive cases? That, my dear friends, is the ultimate mystery for induction, for learning decision trees. We take a stab at answering that question by assuming that the boundaries must form only rectangular areas, as delimited by perpendicular lines. That's all a decision tree can do. Now, all data scientists know that this is an oversimplifying assumption, but it helps structure and constrain the learning process, and ensures we don't just draw little boundaries around every single individual case; that would be totally ridiculous. Another constraint we need to incorporate is that those rectangular regions can't get too small. There must be some limit, after all.

Decision trees fail unless we tame their wild growth. They can over-learn like nobody's business. If you just keep growing the tree deeper and deeper, ultimately each leaf narrows down to just one individual in the training data. After all, if a rule in the tree references many variables, it can get really specific and eliminate all but one individual case. For example, remember this decision tree from the first course for predicting which mortgage holders will defect. We showed that if we follow this path from the root down to a leaf node, it corresponds with this business rule: if the mortgage is greater than or equal to $67,751 and less than $182,926, and the interest rate is greater than or equal to 8.69 percent, and the ratio of the loan to the value of the home is less than 87.4 percent, then the probability of defection is 25.6 percent. Well, if we keep going and make the rule even longer, adding that the borrower's age is between 38 and 39 years old, and their income is between $86,000 and $87,000, and so on, we quickly narrow down to a single case in the training data.
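To see why long rules get so specific, it helps to write one out as code. Here's a minimal sketch of that business rule in Python; the function and argument names are hypothetical, but the thresholds and the 25.6 percent figure are the ones quoted above.

```python
# A root-to-leaf path in a decision tree is just a conjunction of
# threshold tests. This hand-codes the mortgage-defection rule quoted
# above. Field names are illustrative; thresholds match the example.

def defection_probability(mortgage, interest_rate, loan_to_value_pct):
    """Return the leaf's estimated defection probability if this case
    follows the path above, or None if it falls down another branch."""
    if (67_751 <= mortgage < 182_926
            and interest_rate >= 8.69
            and loan_to_value_pct < 87.4):
        return 0.256  # 25.6% of the training cases in this leaf defected
    return None

print(defection_probability(100_000, 9.0, 80.0))  # prints 0.256
```

Append the age and income conditions and the matching segment shrinks to a single training case, at which point the rule has memorized an individual rather than learned a pattern.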
Believing that such a rule would hold in general would be accepting proof by example. With long rules like that, a large enough decision tree could essentially memorize the entire training data, with one leaf per training example. Then you'd have only rewritten the data in a new way, and you wouldn't actually have learned anything. We do need to limit its growth, but by how much? It's a tough balance to strike, and it's key to making sure machine learning works as well as possible. Like a parent, we strive to structure our progeny's growth and development so they're not too out of control; yet we cannot bear to quell creativity.

Remember that holding aside test data only empowers us to detect overfitting; it doesn't prevent overfitting. We need to design each modeling method to work well in this respect, and then we rely on the test data only as a way to validate that it has done so.

Let's go back to the mortgage churn trees to see how things worked out for them. As the tree grew, it arrived at this small one with only four segments, then this one with 10 segments, and then this one with 39 segments. As a technical note, it doesn't actually grow in that particular order, but the order in which it expands doesn't matter. The way it grows out at any given leaf node depends only on that leaf's segment of examples in the training data, so what happens there doesn't affect how other branches of the tree expand.

Anyway, as you can see, the performance of the tree on the training cases does indeed increase as we continue to grow the tree. This table shows the lift attained at 20 percent, that is, the lift for the 20 percent of individuals most likely to churn according to the model. The smallest tree gets a lift of 2.5 and the biggest gets a lift of three. This performance is evaluated over the 21,816 cases that were in the training set. This is not a surprise and will always be the case, since it's the training set that drives tree growth: each time it expands a leaf and blossoms out a new branch, it's doing so in a way that correctly classifies more training cases. In fact, if we let the tree go hog wild, it grows to 638 leaf nodes, 638 segments, achieving an even higher lift of 3.8.

But have we gone too far? Well, when we test each of these increasingly bigger trees on the 5,486 held-aside test cases that weren't used to create them, indeed we see that, yes, we have gone too far. The first three trees pan out well on the test data, showing the same lifts as achieved on the training data used to create them. But that last, biggest tree tanked; it bit the dust. Its lift on the test data is only 2.4, even lower than the lift of the smallest tree. Growing the tree that big definitely counts as torturing the data.

We can visualize that effect here, looking at the increased accuracy as the tree grows, going from left to right. This is for different trees on different data, but the same idea. When we view model performance on the test data instead, we see a very different picture. Starting from the left, as the tree grows and we move to the right, we see that the line initially does go up. That's because tree growth is helping: it's actually learning things that pan out on the held-aside test data. But then, after around 20 nodes, as the tree grows even bigger, we've passed the optimal point, and continuing to grow the tree is only overfitting on the training data. From then on, the performance of the tree on the test data just decreases.
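To make the lift-at-20-percent measurement concrete, here's a rough sketch of this experiment using scikit-learn on synthetic data, since we don't have the course's mortgage churn dataset. The lift_at_20pct helper and the dataset parameters are illustrative assumptions; only the four leaf counts and the train/test sizes echo the numbers above, and the lifts you get will differ from the video's.

```python
# Sketch: grow trees of increasing size and compare lift at the top 20%
# on training data vs. held-aside test data. Synthetic data stands in
# for the mortgage churn set, so the exact lift values will differ.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=27_302, n_features=10,
                           weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=5_486, random_state=0)  # 21,816 train / 5,486 test

def lift_at_20pct(y_true, churn_scores):
    """Churn rate among the top-scoring 20%, divided by the overall rate."""
    top = np.argsort(-churn_scores)[: int(0.20 * len(y_true))]
    return y_true[top].mean() / y_true.mean()

for n_leaves in (4, 10, 39, 638):  # the four tree sizes from the example
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_train, y_train)
    train_lift = lift_at_20pct(y_train, tree.predict_proba(X_train)[:, 1])
    test_lift = lift_at_20pct(y_test, tree.predict_proba(X_test)[:, 1])
    print("%3d leaves: train lift %.1f, test lift %.1f"
          % (n_leaves, train_lift, test_lift))
```

Run it and you should see the same qualitative pattern: training lift climbs with every extra leaf, while test lift climbs at first and then falls away as the tree starts memorizing.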
By the way, what you're looking at here is a somewhat unusual view. You're seeing the performance on the test data, even though that data is not influencing or informing the tree's growth at all. The tree is being grown based only on training data. This view was made for teaching purposes, not as part of a modeling tool used commercially.

How shall we determine the exact cutoff for when a decision tree should stop growing? This is sometimes referred to as the stopping criteria. Common approaches are to set an absolute limit on the depth of the tree, that is, the number of levels or questions asked, as well as limiting how small each segment can get. The problem is that any absolute limit you pick is bound to be either too limiting or not limiting enough for some dataset. Even if you think we shouldn't allow segments to get smaller than five training cases, there are times when that's actually too restrictive. Sure, any one segment with only four cases may be too small to be reliable on its own. Yet, if we make these fairly presumptuous generalizations based on only a few examples many times over, across many segments of a large tree, they may on average pan out a lot better than guessing, so the overall model is actually better off. Again, we face this fundamental question: how do we set the limit? Remember, we're working to solve the ultimate dilemma of machine learning here. We want limits so that we don't overfit, and yet we must not overly limit growth and unnecessarily reduce how much is successfully learned from the data. This really matters.

The most popular solution to this dilemma is pretty ironic. Instead of holding back to avoid learning too much, don't hold back at all. Go all the way; learn way too much. Then take it all back piece by piece, making the smallest cuts that provide the greatest improvement, unlearning until you're back to square one and have actually learned too little. We tell the computer: set forth and make mistakes. Why? Because the mistakes are apparent only after you've made them. In a word: grow the tree too big and bushy, and then prune it back. In fact, the technical word for this process is pruning. The trick is that pruning is guided not by the training data that determines the tree's growth, but by the testing data that now reveals where that growth went awry. You grow too much, then you prune back too much, and then look back on the pruning to decide where you should have stopped. This elegant solution strikes the delicate balance between learning and overlearning.

To prune back a tree is to backtrack on steps taken, undoing some of the tweaks that have turned out to be faulty. By way of these undos that hack and chop tree branches, a balanced model is unearthed, not timidly restricted and yet not overly presumptuous. It's like finding the perfect statue within a raw block of marble. As Michelangelo put it, "In every block of marble, I see a statue as plain as though it stood before me, shaped and perfect in attitude and action. I have only to hew away the rough walls that imprison the lovely apparition to reveal it to the other eyes as mine see it." Who says science isn't an art?
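For readers who want to try grow-then-prune themselves, one common concrete realization is scikit-learn's cost-complexity pruning. This is a sketch under the assumption that a held-aside validation split (standing in for the "testing data" described above) chooses the pruning level; it is not necessarily the exact procedure of the tool shown in the video.

```python
# Sketch of "grow too big, then prune back" with scikit-learn's
# cost-complexity pruning. A held-aside split picks the pruning level.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-set stopping criteria would instead look like this:
# DecisionTreeClassifier(max_depth=5, min_samples_leaf=5)

# Step 1: go hog wild. Grow the tree until its leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: enumerate the nested sequence of prunings, smallest cut first.
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Step 3: look back and keep the pruning level that validates best.
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(full.get_n_leaves(), "leaves before pruning,",
      best.get_n_leaves(), "after; validation accuracy",
      round(best.score(X_val, y_val), 3))
```

Note the division of labor: the training split grows the tree and enumerates the candidate prunings, while the held-aside split only chooses among them. And since that split now guides pruning, a separate, truly untouched test set is still what ultimately validates the final tree.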