I've mentioned in passing a few times that we don't have to come up with all of the parameters for building models by ourselves. We can actually leverage scikit-learn's built-in functionality for this, through a process called grid search. This method allows us to describe the different parameter values we might be interested in exploring and to build an individual model for each combination of those parameters. The result is that, once you have some indication of which features might be reasonable (and I think we've got that already for this model), you can fine-tune the model through brute-force computation. Let's give that a try. First, let's load the data set from our previous lecture. I'm going to bring in pandas and NumPy, bring in Sylvain's code, and then load the data set and build our X_train and y_train. Then I'm going to bring in grid search. Grid search actually comes in two different flavors, and the one we want is the cross-validation variety. The way we use grid search is to create a dictionary object; in this case I'll call it parameters. We list by name each of the parameters we want to explore, and the values in the dictionary are the different values for those parameters. In this case I'll take max_depth and say we want to try trees with a max depth of 3, 4, 5, and 6. We also want to set the minimum samples per leaf to 1, 3, and 6, and to explore whether or not we're going to use pruning, False or True. Now it's actually really easy to train the model: we train it as if it were just a regression model directly. Again, the scikit-learn API makes this really easy. We create the GridSearchCV, pass it the estimator we want to use (our base model, in this case the M5Prime), then our parameters, and then our regular cross-validation information. Then we just call fit.
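A minimal sketch of that workflow, assuming synthetic stand-in data: DecisionTreeRegressor stands in for the lecture's M5Prime estimator so the example runs without Sylvain's package, and the pruning flag is left out since its exact name on M5Prime isn't shown here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in training data; in the lecture these come from
# the NHL data set loaded from the previous lecture.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = 3 * X_train[:, 0] + rng.normal(scale=0.1, size=200)

# The dictionary of parameters to explore, keyed by the estimator's
# constructor argument names. The lecture's M5Prime grid also toggles
# a pruning flag (name assumed, so omitted here).
parameters = {
    "max_depth": [3, 4, 5, 6],
    "min_samples_leaf": [1, 3, 6],
}

# GridSearchCV fits one model per parameter combination per CV fold.
reg = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),  # M5Prime() in the lecture
    param_grid=parameters,
    cv=5,
)
reg.fit(X_train, y_train)
print(reg.best_params_)
```

The fit call is where the brute-force work happens: every combination in the grid is trained and scored across all folds.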
All right, that's going to take a moment to run. We already know what our naive approach did with a max depth of 6 and min samples per leaf of 3: an R squared that was really quite abysmal. So now let's look at the best model. We can find the best score by taking this regression variable (this is the GridSearchCV) and asking for best_score_. Okay, so that's actually not horrible. It's not amazingly better, but it's something, and we can find lots more detail about this model in the attribute called cv_results_. That includes the timing and the R squared values for every fold, for every parameter combination. But if I just want to see the best set of parameters, the ones which correspond to the score above, we can get those too: we just read best_params_. All right, we see that the best-performing model out of the parameter space we gave it had a max depth of 4 and a min samples per leaf of 6, and used pruning. Now, there are lots of other investigations we could consider, including expanding our grid search CV results or changing the regression model we use for leaf nodes. But I did sort of sell these M5 trees as an interpretable approach to the problem, and maybe we should take a minute to just look at that. Recall that our task is to generate a list of NHL players for a given season and the number of votes they will receive for the Hart Trophy from journalists. Votes are weighted, and we've chosen to predict the normalized weighting value. We should be able to look at a sorted list of our predictions and compare them to our validation year, the 2018 to 2019 season. And we should be able to look at our tree and gain some insight into how it comes up with its classifications. So let's do that now. There are a few ways we can actually look at the model, and I'm going to start just by visualizing the tree as a set of nodes.
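The inspection attributes just mentioned can be demonstrated with a small self-contained sketch (synthetic data, a shrunken grid so it runs quickly):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
reg = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [3, 4], "min_samples_leaf": [1, 3]},
    cv=3,
)
reg.fit(X, y)

# best_score_: mean cross-validated score of the best parameter set
print(reg.best_score_)

# cv_results_: fit times and per-fold scores for every combination;
# a DataFrame makes it easy to inspect
results = pd.DataFrame(reg.cv_results_)
print(results[["params", "mean_test_score"]])

# best_params_: the combination that produced best_score_
print(reg.best_params_)
```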
So let's take our best model, which can be found in reg.best_estimator_, and predict on our training set. We're going to take our training data, throw it into this tree, and actually see how well it worked. To do that, we're going to use the scikit-learn tree function called export_graphviz. Now, Graphviz is another system library that can take a text file in a specific format and visualize it for us, and we're going to do the conversion here by calling, at the command line, the dot program to take that Graphviz .dot file and render it as a tree. Okay, so here's what our tree looks like, and I just want to take a moment to explain it. We start at the root, and like all good trees, the root is at the very top. The very first decision made in this decision tree is on the points value, and that maybe makes some sense to us just based on our knowledge of the game. Players who had 103 points or fewer get put into this subtree over here on the left, and players who had greater than 103 points get put into this subtree over here on the right. Now, just to explain these nodes: the next line here is called the Friedman MSE, or mean squared error, and this is a measure of how much error there is in a given data set. In this case there are so many samples at the root node (we're told there are 17,000 of them) that the MSE we see just rounds to 0. The value at the bottom is our regression prediction of the number of votes somebody is going to get. Again, this is a percentage of the total number of votes, so the top player could at best get 1, and everybody else in that season would get 0. So nodes that have a larger value are nodes where the players were actually ranked high for the Hart Trophy. Here's an example: if somebody had more than 103 points, they were immediately classified into this node.
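The export-and-render step can be sketched like this, using a plain DecisionTreeRegressor and placeholder feature names (the lecture passes its fitted best estimator instead):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_graphviz

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Write the tree in Graphviz's text format; filled=True produces the
# color-by-value shading discussed in the lecture.
export_graphviz(
    tree,
    out_file="tree.dot",
    feature_names=[f"f{i}" for i in range(4)],  # placeholder names
    filled=True,
)

# Then, at the command line, the dot program renders the file, e.g.:
#   dot -Tpng tree.dot -o tree.png
# (requires the Graphviz system package to be installed)
```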
Now, this is a leaf node; there are no other nodes underneath it. We see that 22 players fell into this node, and they actually had a fairly high value. Basically, the model is telling us that if you had more than 103 points, roughly 104 or more, there was a good chance you were competitive for the Hart Trophy. Let's take a look at this subtree here on the left. We see that points is split on again, and that's one thing that's really interesting about the decision tree as a mechanism: we can continue to use features if those features provide value, and that's not something we normally get with a regression model, for instance. In this case, I want to look at the players on the right. There were 41 players in this node, and then we actually have a split on the amount of time they were on the ice per game when it was an even match. We can see that players who were at or below this number (again, that's in seconds) of even-strength on-ice time go off to further decision nodes, whereas players who were above that number, 1,000 seconds or so, were categorized over here. Now, the color of each node indicates the weight of the node, its value: the more orange the node, the more highly ranked the players in it are for the Hart Trophy. We could actually follow this whole tree down to the left. If you had 103 points or fewer, and in fact fewer than 92 points, and fewer than 40 wins, and in fact fewer than 83 points, you would fall into this bucket here. That's almost 17,000 players. Remember, the vast majority of players don't get a single vote for the Hart Trophy. It's really only the top 20 or so players out of over 1,000 every season, and that's why we have 17,000 samples who never got even a vote for the Hart Trophy, never mind winning it.
And so we can see here that this is a great way to prune out the vast majority of players who aren't actually going to be competitive at all for the Hart Trophy. This is interesting because it turns out there was at least one player who got some votes (and probably several players) who had 84 or more points but met the rest of these criteria, and thus was put down here into this node. So that's how we can inspect the decision tree and get a sense of the value of the different rules being generated and how different features come into play. Okay, that was a lot. The tree is actually just part of the analysis, though; we also have those regression equations at each leaf node. Remember that a regression equation is a set of coefficients, one for each feature, that effectively act as weightings which, when summed together, produce a target value, in this case our percentage of votes. Now, we can get access to these equations in a few ways, but Sylvain has nicely included a function which prints out the tree nodes and the linear model equations for us as well. So I'm just going to take our best estimator and pass it to this function, export_text_m5. When it prints this out, it's going to refer to our features as indexes within X, so I'm also going to print out all of the different feature names so we can talk about them. Okay, we can see here that the value of the top node, X sub 9, has to be 103 or less, and that causes a split. We can look up what feature 9 is in X, and we see down here that, yes, it's points, just like that visual tree told us. Then we can follow down: if it's less than that, we start looking at the next value. We can also see the MSE for each of the nodes and the number of samples. When we get to the leaf nodes, those leaf nodes don't have another decision layer to them.
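The shape of that text printout can be approximated with scikit-learn's own export_text, which shows the same split structure; the lecture's export_text_m5 from Sylvain's code additionally prints the linear model at each leaf, which this plain-tree stand-in does not.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
best = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Print the tree rules as text; with no feature_names given, features
# appear by index, just as export_text_m5 refers to indexes within X.
rules = export_text(best)
print(rules)

# Print the index-to-name mapping so the indexes can be interpreted.
feature_names = ["goals", "assists", "wins", "points"]  # hypothetical names
for i, name in enumerate(feature_names):
    print(i, name)
```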
Those leaf nodes are reached by the rules that lead to them in the tree, but you can see that they may have regression equations. Now, these two leaf nodes actually report just a single parameter, so there are no real regression equations for them: everybody who falls into those buckets is simply given these constant values. But we can see down here that some of the leaf nodes do have regression equations, and we can look those up. This is our linear model 2, this is our linear model 3, and this is our linear model 1, and you can actually see the intercept and the coefficients of the different features in each model. All of these models seem to use assists, and one of them also uses game-winning goals to come up with a score for the player. Okay, let's make this prediction now on our holdout. We don't need to retrain; we can just use our best estimator, which we've already got. So I'm going to bring in the validation set we created in the last lecture and apply the best estimator to it. Okay, so when I first saw this, I was a little deflated, right? Like, how can you even have an R squared value which is negative? What this really means is that, overall in our regression analysis, there is some constant horizontal line that would be a better predictor than the model we've created at those leaf nodes. So is all hope lost? I don't think so. While we've modeled this problem as a regression problem, our real-world use case is more likely to be something akin to ranking the players who are in the top 10 or 20 as far as their competitiveness for the Hart Trophy. And this shows some of the gritty challenges in applying machine learning to sports data: your conceptualization of the problem influences the model you choose and the evaluation methods you use. So what should we do next? Well, let's actually see how this works for the top 10 players in our holdout data set. So I'm going to take this df_validate, the full prediction data.
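A tiny illustration of how R squared goes negative, with made-up numbers: the score compares the model's error to the error of always predicting the mean, so a model worse than that constant line scores below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up vote shares and deliberately bad predictions.
y_true = np.array([0.0, 0.1, 0.9, 0.0, 0.2])
y_pred = np.array([0.8, 0.0, 0.1, 0.7, 0.9])

print(r2_score(y_true, y_pred))   # negative: worse than the mean line

# The constant "horizontal line" predictor scores exactly 0.
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline))
```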
I'm going to create a new column in there and put our predicted values into it, so our validation set will now hold our prediction for each one of those players. Then we're going to sort the values in this data frame by the prediction and reset the index. Then I'm actually going to pull down the data from Hockey Reference for this season to see how many votes players were really given; remember, we don't have that true label in our data yet. Then I'm going to join these two data frames together on the index. There's a lot of sorting going on here, but it's just to show how our model compares to the actual data. On the left-hand side we've got a number of columns: the full name of the player and our predicted percentage of the votes that player will get. On the right-hand side we have the player's name and then a vote percentage. That vote percentage is actually a tough number. Remember, this is one season, but individual journalists can vote for multiple players; in fact, they almost always do. So the vote percentages total something like 250%, but no one player could have more than 100%, which would mean they were always voted first place. So it's a little bit of a deceptive number. What's interesting is that you can see our choice for the first pick, Nikita, was actually the player who won the Hart Trophy. So our model seems like it did pretty well if we're just considering accuracy on the first pick: it predicted it correctly. But look at this difference between Nikita and Sidney Crosby in votes. Nikita got essentially twice as many votes as Sidney Crosby. And we actually had Patrick Kane coming in a very close second to Nikita, and then Aleksander Barkov coming in afterwards; in the actual data, Patrick Kane came in quite low down. But I think that what you can take out of this is that modeling this problem as a regression equation
actually gives us a lot of power, and we maybe didn't do horribly. We were able to identify, for instance, Sidney Crosby in our top 10. Nathan MacKinnon is in there too, as are Johnny Gaudreau and Connor McDavid. So we actually have a fair bit of similarity with the actual values we see when the voting is done. Now remember, the vast majority of players in the NHL don't even get a single vote for the Hart Trophy, and that makes it really challenging to build these predictive models when we have such an imbalanced data set. Okay, that's been a bit of a whirlwind tour of model trees. We've seen how cross-validation can help us understand our model evaluation; how grid search, specifically GridSearchCV, can help us explore different model parameters; how a decision tree can be used to aid in understanding your data; how M5 and M5 Prime trees can be used to mix regression and decision trees together; and some of the different challenges in conceptualizing and modeling sports analytics problems. Now, model trees actually aren't used that much in a lot of published data science work, and I think this is a shame. They're actually very useful for giving you, as a data scientist, insight into which features are impactful in segmenting the space, and, importantly, which data you should be focusing on collecting and cleaning and searching for new features. Beyond this, they're very easy to use to communicate with diverse stakeholders: think coaches, executives, scouts, players, et cetera, the kinds of people who will be interested in the analyses coming out of sports analytics. And lastly, and this is important for me as a computer scientist, they're actually very easy to turn into production system rules: nice, clean splits at different nodes.
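The sort-and-join comparison described earlier can be sketched as follows, with hypothetical stand-in DataFrames (the player letters and the vote_pct column name are illustrative, not the lecture's real data):

```python
import pandas as pd

# Stand-ins: the validation frame with our predictions, and the
# scraped Hockey Reference vote results.
df_validate = pd.DataFrame({"name": ["A", "B", "C"],
                            "prediction": [0.05, 0.40, 0.20]})
actual = pd.DataFrame({"name": ["B", "C", "A"],
                       "vote_pct": [0.55, 0.10, 0.00]})

# Sort each side by its own ranking and reset the index, then join on
# the index, so row i holds our i-th pick next to the real i-th place
# finisher.
left = df_validate.sort_values("prediction", ascending=False).reset_index(drop=True)
right = actual.sort_values("vote_pct", ascending=False).reset_index(drop=True)
comparison = left.join(right, rsuffix="_actual")
print(comparison.head(10))
```

With this layout, a correct first pick shows up as matching names in the first row, which is exactly the check made on the real data in the lecture.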