Sometimes a regression equation can describe a portion of the data well but struggles to build an understandable model for the whole dataset. It's possible to blend regression and decision trees into model trees, where the tree splits on coarse-grained attributes of interest and each leaf node then holds its own regression equation. The tree essentially segments your feature space into chunks, or subtrees, and then the regression equation provides a more nuanced understanding of the observations that fall within each chunk. Model trees bring a nice approach to inspectability and interpretability when doing regression. They're most appropriate when local context is important, and I find them a really compelling way to blend the classification and regression tasks. As an example, if you wanted to predict amateur footrace performance outcomes, coarse-grained features might be the geographic parameters of the track (the location, the weather, and so forth), the time of year or season of the race (many 5Ks, for instance, repeat yearly), and the motivational goals of the racer. Fine-grained features might be previous race performance, conditioning, training information, and so on. A popular model tree approach is the M5 algorithm and its M5 prime implementation, and that's what we're going to be talking about here.

I want to work through an example: the NHL MVP. The Hart Memorial Trophy is the annual award given to the NHL's most valuable player, and it has an interesting voting system. It's awarded by votes from journalists of the Professional Hockey Writers Association, using a ranked point system. The journalists in this association choose their first through fifth choices and assign points accordingly: 10, 7, 5, 3, or 1. The votes are weighted by the place each journalist gives a player, and the player with the most points at the end of the voting wins the Hart Trophy. The question I'm interested in is: can we use historical data to understand the likelihood a given player has of winning the Hart Trophy? There are lots of different ways we could conceive of this, and we'll talk a little bit about that through our analysis, but the approach I'm going to use here is to predict the number of points each player will be awarded by the journalists. It's a regression problem.

This analysis is actually really special to me. Two years ago we started an online Master of Applied Data Science degree here at UM, hosted on the Coursera platform. I've had the privilege of teaching in that program, and I run a collaborative social hackathon every week where students can work with me on a problem. This work was built in those hackathons, stretching over a period of maybe six months. It was a lot of fun to work together and pair program through this while challenging ourselves to think about what kinds of features would be useful. If you're interested in going further with your data science skills, I'll put a link to that degree program in this week's content. But let's get building our actual model. In our previous examples we've written all of our code in a single notebook, but that's actually a pretty rare case. A more authentic workflow is to put specific libraries or functions of interest in different files and then import them into the single notebook we're building. We're going to do that here.
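To make the idea of a model tree concrete, here's a tiny, hand-written sketch of what its prediction logic looks like. This is not the M5 prime algorithm itself, and the split threshold, feature names, and coefficients are all invented for illustration; it only shows the structure of splitting on a coarse-grained feature and then applying a leaf-specific linear equation.

```python
def model_tree_predict(row: dict) -> float:
    """Toy model tree: one split on a coarse-grained feature, then a different
    linear regression equation at each leaf. All numbers are made up."""
    if row["track_elevation_gain_m"] > 150:      # coarse-grained split: hilly course
        # leaf 1: weight training volume more heavily
        return 25.0 + 0.80 * row["previous_race_minutes"] - 0.05 * row["weekly_training_km"]
    else:                                        # flat course
        # leaf 2: previous performance dominates
        return 5.0 + 0.95 * row["previous_race_minutes"] - 0.02 * row["weekly_training_km"]

example = {"track_elevation_gain_m": 40, "previous_race_minutes": 26.5, "weekly_training_km": 35}
print(model_tree_predict(example))   # a toy predicted finish time in minutes
```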
Now, I can't show you all the ins and outs of building deployable Python packages, but we'll do a little bit of separation of concerns. Let's start by bringing in our data science imports, pandas and numpy, and I'm also going to bring in the import-ipynb package. This will allow us to call another Jupyter notebook from within this Jupyter notebook. Now we need a bunch of different data for this analysis. I'm going to show you how we got the data we need, but most of this won't actually work directly on the Coursera system because it doesn't support web scraping. Regardless, this will be useful if you want to replicate this work on your own computer, and of course I've included all of the data files I pulled down so that you can work on the machine learning part as part of this course.

Let's get information on the players; we're going to pull that in from the NHL API directly. Let's create a new notebook for that. This notebook is self-contained, and it has a number of interesting functions for us. You'll first notice that I've put everything inside a bunch of functions, and I've started each function name with a double underscore, that's two underscores. Python doesn't support some of the computer science encapsulation pieces that a lot of other languages do, like private functions, so these double underscores signal that we shouldn't be calling a function except from within our own code: it's a private piece. The first function I'm going to write downloads data for a given season from the NHL API. The season is going to be in the form of two years, say 2018-2019, where the first is the starting year, because seasons start in the fall, and the second is the ending year, because they end in the spring. There's going to be one parameter, the season, and it returns a DataFrame with an ID, which is the stats API key for the player. Some of the work I've done here is a little more complex: I'm using something called JSONPath to query the JSON document. As an example, one expression pulls out any element inside the JSON called person, and another pulls out any element called code. What's important here is that you can take this URL, put a season in, and go look at it in a web browser to get a sense of what the data coming back looks like. Just like in previous work, I then use pandas to merge this all down into one DataFrame and flatten our position information and our person, our player, information.

There's a second function here that pulls out statistics about a specific player. Now, this is going to be called many, many times: there are about 1,000 players, maybe 1,100 or 1,200, per season, and we might have, say, 20 or more seasons' worth of data. That's a lot of API calls. It's really nice that the NHL makes this API fully available; there's no login or cost or anything like that. Again, you can go to this URL in your browser, putting in a player identifier and a season identifier, to get a sense of what that data looks like. The last function I've added here is called get_player_stats_by_season. This is the one we provide to other code that wants to get this information. That's the overall web scraping that I did with the NHL API. Good. Now that we've got the official player stats ready to go from the NHL API, we have to consider our Hart Memorial Trophy voting data.
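Here's a rough sketch of the shape of that NHL API notebook. The endpoint URL, the JSONPath expressions, and the helper names are illustrative assumptions rather than the exact code from the course, and the public NHL stats API has changed over time, so treat this as a pattern.

```python
import requests
import pandas as pd
from jsonpath_ng import parse   # pip install jsonpath-ng

# Illustrative endpoint of the style the lecture describes; swap in the current NHL stats API.
SEASON_URL = "https://statsapi.web.nhl.com/api/v1/teams?expand=team.roster&season={season}"

def __get_season_rosters(season: str) -> pd.DataFrame:
    """Download all team rosters for a season like '20182019' and flatten them into one DataFrame.
    The double leading underscores mark this as 'private' to the notebook."""
    doc = requests.get(SEASON_URL.format(season=season)).json()
    # JSONPath queries: every element called 'person' and every element called 'code'.
    # This assumes the payload pairs one position code with each person, as the roster data did.
    people = [m.value for m in parse("$..person").find(doc)]
    codes = [m.value for m in parse("$..code").find(doc)]
    df = pd.json_normalize(people)            # flattens person dicts into columns (id, fullName, ...)
    df["positionCode"] = codes[: len(df)]
    return df.rename(columns={"id": "player_id"})

def get_player_stats_by_season(season: str) -> pd.DataFrame:
    """Public entry point: the roster plus (in the real notebook) one stats call per player."""
    roster = __get_season_rosters(season)
    # ... per-player calls to the player-statistics endpoint would be merged in here ...
    return roster

# Example usage: get_player_stats_by_season("20182019")
```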
The Hart votes are cast by reporters who cover the NHL. The voting procedures stayed relatively constant from 1996 up until the 2020-2021 season. That last season, whose playoffs are actually going on right now while I'm filming, modified the number of reporters who could vote because of structural changes to the season due to COVID. Will the data from the 2020-2021 season actually be useful in the future? I don't know; it might be. If the league changes the way the voting works over the next couple of years, maybe this is the new normal. What about the current analysis we're doing, is that actually going to be useful for the 2020-2021 season? I can't answer that either, since the Hart results haven't been shared yet. This would be a great place for you to extend this analysis and see what impact the COVID year has on building models like these.

Regardless, we need to get that Hart voting data. There are several places to get it, but the website hockey-reference.com is a great resource and makes it easy. Let's create another API file called hockey_reference_api.ipynb. You'll see that in this file I've decided to bring in pandas and import-ipynb, and, really importantly, I'm able to import our NHL API notebook directly. That's another Python notebook. This is really interesting for people who are used to traditional Python programming and haven't seen it before: we can actually import another notebook into this space. We start by building a private function to get the results of the Hart voting. It's done on a year-by-year basis, and it's actually really easy: the results are a table in HTML, so we can use the pandas read_html function. That lets us pull down an individual web page and parse it into however many tables are there. I happen to know that the very first one is the DataFrame we're interested in, so we return that as a DataFrame. One of the years in our dataset actually had a labor action involved, a lockout, I think in '04-'05, and there isn't data for that year; there was no Hart winner. So we have to build in a little more robust error handling here. Then we put together another function that brings all of this together: it generates the player stats and the Hart voting results from '96 through to 2019-2020 and saves everything as local CSV files. Again, if you're working on the Coursera platform you don't have to run this code; it's more for you to get a sense of what it takes to build this if you want to extend the investigation on your own or if you don't have those data files. It turns out web scraping is a huge part of data science, especially in areas like amateur sports and sports analytics.

With our data now in hand, let's get started on our analysis. I'm first going to read in our two different DataFrames, the Hart results and the player results. The first challenge we run into is one we saw in a previous module: sometimes the data isn't quite as clean as we'd like it to be. It's very common in manual data systems to find alignment issues. We saw that with the spelling of the team name, the Montréal Canadiens, or Montreal Canadiens if you prefer. In this case, it turns out that two players have different preferred names across these datasets, Alex Steen and Olaf Kolzig, so we just have to tweak a little of that by hand. This, unfortunately, is very common.
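A sketch of what that scraping function might look like follows. The URL pattern and the column handling are my assumptions about how hockey-reference.com lays out its award-voting pages; the actual notebook's details may differ.

```python
import pandas as pd

def __get_hart_votes(year: int) -> pd.DataFrame:
    """Scrape Hart Memorial Trophy voting for the season ending in `year` (e.g. 2019 for 2018-19).
    Assumes a hockey-reference.com awards page whose first HTML table is the voting results."""
    url = f"https://www.hockey-reference.com/awards/voting-{year}.html"
    try:
        tables = pd.read_html(url)           # parses every <table> on the page into a DataFrame
        votes = tables[0]                     # the first table is the one we want
        votes["season"] = f"{year - 1}{year}"
        return votes
    except (ValueError, OSError):             # e.g. the '04-'05 lockout year has no voting data
        return pd.DataFrame()

def build_hart_voting_history(start_year: int = 1996, end_year: int = 2020) -> pd.DataFrame:
    """Pull every year of voting and stack the results, skipping the missing lockout year."""
    frames = [__get_hart_votes(y) for y in range(start_year, end_year + 1)]
    return pd.concat([f for f in frames if not f.empty], ignore_index=True)
```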
Now we want to predict the number of votes that a given player will get in a year. Actually, we don't care about the raw number of votes per se, which is good, because membership in the PHWA changes and more or fewer votes are cast each year as the number of journalists changes. Instead, we want to predict the ratio of the votes that a given player gets in a season. We're going to call this the normalized vote percentage, and it's going to be our regression target. I'm going to calculate it by taking the votes column and dividing it by the sum of votes, grouped by season, because all of the seasons are in one data file.

There are some interesting statistics in the data on the amount of time each player spent on the ice, including time on the ice during power plays, which is when one team has a penalty and the other team has more players on the ice, and so forth. All of these time values are strings, and we want to convert them into a single integer value of total seconds so that we can use them in our model. This is important because we often need only numeric values: we have to take complex data types like time and turn them into something simpler. We can still think of it as time, we just change how it's represented. Here's a quick little function to do that, and I'm going to apply it to every column that has the term "on ice" in it; I figured that out just by inspecting the data.

Now let's merge these two files together and have a look at our data. We can see that there are 78 columns here. We've got things like assists, the name of the player, goals, games. We see that there are a whole bunch of missing values as well, which we're getting used to, and there's link information out to APIs for individual players. We basically took everything that was available to us in that API and tried to slam it into one giant DataFrame. Two quick bits of data cleaning based on our look: we need to fill in the missing values, and I'm just going to set them all to zero, and we need to make sure anyone who didn't get votes gets a zero as their voting percentage. Everyone who did not receive a vote should be recognized as having a zero percent share, so I fillna on the full DataFrame.

Now, as we explored the dataset, one of the students pointed out that a lot of the stats might only be relevant for some positions, that a player's position might influence a given statistic. This is actually very common in machine learning: a lack of independence between features. For instance, you would expect a forward to be in a better position to, say, score a goal than a player on defense. In addition, some of the stats in our DataFrame, like save percentage, are only calculated for goalies and are thus by definition nonexistent for the other positions. I want to explore this a little more. If we take our DataFrame and group it by the position code column, we can apply a lambda function I've written to get a sense of how many votes different positions get. Let's take a look at this for a minute. We've got our positions: centers, defensemen, goalies, left wingers, and right wingers. Then on the right-hand side, in scientific notation, we've got what is essentially the voting share for each. Now, we haven't normalized these values.
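Here is a compact sketch of those preparation steps before we continue. The column names (votes, season, positionCode, and the time-on-ice columns) are assumptions standing in for whatever the merged DataFrame actually uses.

```python
import pandas as pd

def to_seconds(value) -> int:
    """Convert an 'MM:SS' (or 'HH:MM:SS') string into total seconds; blanks and NaN become 0."""
    if pd.isna(value) or value == "":
        return 0
    seconds = 0
    for part in str(value).split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Build the regression target and make the time columns numeric."""
    # Normalized vote percentage: each player's share of all Hart votes cast that season
    df["vote_pct"] = df["votes"] / df.groupby("season")["votes"].transform("sum")

    # Convert every time-on-ice column from a string to integer seconds
    for col in [c for c in df.columns if "OnIce" in c]:
        df[col] = df[col].apply(to_seconds)

    # Missing values, including players who received no votes, become zero
    return df.fillna(0)

# The position exploration described above, roughly:
# df.groupby("positionCode").apply(lambda g: g["vote_pct"].mean())
```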
Because those are raw, un-normalized values, they're all very, very small numbers, but it's still fairly straightforward to see, for instance, that defensemen, just on a balance of probabilities, are much less likely to be awarded the Hart Trophy, and that goalies are actually more likely to be awarded it than any other group. This is an imbalance in our dataset, but it's a very natural imbalance, because it reflects what happens in the actual voting process. It looks like we should be considering position in some way in our model; there may be a signal here that we can capture. The position information isn't numeric, though. One way to incorporate it is to change these values into dummy indicators: five different features, one for each position, each either a zero or a one depending on whether the player plays that position. Pandas makes this really easy through the get_dummies function. Goalies, for instance, will get a column called position code underscore G with a one in it, while position code underscore C, for centers, will be a zero for them.

We also want to create our holdout and our training datasets. In this case I want our validation, or holdout, dataset to be the 2018-2019 season, and for training I want to use everything between the 2001-2002 season and 2018, which should give us plenty of training data. Now, there are times when you might not want all of the available training data in your model. It could be that there was something structurally different about the NHL in 2001-2002 or 2002-2003 that feeds a false signal into the model. You should normally pay attention to how much data the methods you're planning to use need, and how relevant that data is. Next I build up a list of features that we want to use; I just chose a number of different features that were in the data. Feel free to play with these, exclude some, or modify them. Because this is a regression problem, the features need to be numeric.

We're just about at the exciting part, building the model. But before we do, take a look at the features going into the model. Which features do you think are most informative for predicting the Hart Trophy winner? Assists? Points? Those seem pretty obvious. What about saves, which are only available for goalies? Wins? Do you have to win a lot of the individual matches, or can you just put on a good performance even if you're on an underpowered team?

I know I said we were going to go build the model, but I have to give one more shout out here. Despite being well known, there's no Python implementation of the M5 or M5 prime algorithms in sklearn. However, Sylvain Marie, in the Analytics and Cloud Platforms group at Schneider Electric, has coded up the algorithm and made it available to us as open source on GitHub. Even better, they're currently pursuing getting it added to sklearn, so maybe in the future this model will be available to us directly in the package. I've put Sylvain's code, as a Python file, here on the Coursera platform, and we can use the %run magic function to run it. That evaluates and runs the whole Python file in this interpreter, so all the functions in that file will be available to us in this notebook.
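Before we run the model code, here's a rough sketch of the dummy coding and the season-based split just described. Again, the column names, the season encoding, and the feature list are hypothetical stand-ins; pick features from your own merged DataFrame.

```python
import pandas as pd

def split_and_encode(df: pd.DataFrame):
    """One-hot encode position and split the data into training and holdout sets by season."""
    # Five 0/1 indicator columns, e.g. positionCode_C, positionCode_D, ..., positionCode_G
    df = pd.get_dummies(df, columns=["positionCode"])

    # Hold out 2018-19 for validation; train on the 2001-02 season up to 2018
    holdout_mask = df["season"] == "20182019"
    df_holdout = df[holdout_mask]
    df_full = df[(~holdout_mask) & (df["season"] >= "20012002")]

    # A hand-picked list of numeric features; add or remove columns to experiment
    features = ["goals", "assists", "points", "wins", "saves", "timeOnIce",
                "positionCode_C", "positionCode_D", "positionCode_G",
                "positionCode_L", "positionCode_R"]
    return df_full, df_holdout, features
```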
Now, actually building the model is pretty straightforward; we do it just like we did before. Here I'm going to create an M5Prime object with a bunch of parameters: max depth I'll set to six, min samples per leaf I'll set to three, and I'll say don't use smoothing. We create our X and our y values: X is everything in df_full, which, remember, we separated earlier from the validation set, with our features pulled out of it, and y is our Hart target value. I'm also going to write both the holdout and the full DataFrame to files, which I'll call model_tree_data and model_tree_holdout_data, for our next lecture. Then we import cross_validate and cross-validate, passing in the model, our X train, our y train, the number of folds we want, and the evaluation metric we're planning to use, in this case R squared, and we just run that.

You can see how sklearn's API makes it really easy to start using new models quickly without actually having to know much about the model. That's both good and bad: it allows you to test out a lot of things, but ideally, especially if you want future predictive performance, you should know how the model, the approach you're using, actually works. In this case we have quite a range of R squared values: our first fold came in at 0.24, but we also have some negative R squared values, which is strange to see; a negative R squared means the model did worse on that fold than simply predicting the mean. There's a huge standard deviation: our average R squared was 0.267 and our standard deviation was 0.346. This suggests to me that there might be a temporal nature to the accuracy of our models and that we shouldn't put too much stock in the predictive power of what we've currently got; that standard deviation is really high. Don't take this model and bet your house mortgage on it, is my suggestion. If we wanted to put this model into practice, we'd want to look at the particular folds in our validation that are especially bad and consider which features may have led the model astray. I also think it would be useful to revisit the hyperparameters I arbitrarily chose. Should the max depth of the tree actually be limited to six? I just pulled that number out of nowhere, so I think we're going to tackle this in our next lecture.
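Here's an approximate version of that modeling cell. It assumes the M5Prime class from Sylvain's open-source code (available on PyPI as m5py; in the lecture it's loaded with %run instead), plus the df_full, df_holdout, and features names from the earlier split sketch, and the parameter names are reproduced from the narration, so check them against the actual source.

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from m5py import M5Prime   # the lecture uses %run on Sylvain's Python file to get this class

def cross_validate_model_tree(df_full: pd.DataFrame, df_holdout: pd.DataFrame, features: list):
    """Fit and cross-validate an M5' model tree with the settings described in the lecture."""
    # Hyperparameters as narrated (chosen fairly arbitrarily): depth 6, 3 samples per leaf, no smoothing
    reg = M5Prime(max_depth=6, min_samples_leaf=3, use_smoothing=False)

    X_train = df_full[features]
    y_train = df_full["vote_pct"]          # the normalized vote share target (assumed column name)

    # Save the prepared data for the next lecture
    df_full.to_csv("model_tree_data.csv", index=False)
    df_holdout.to_csv("model_tree_holdout_data.csv", index=False)

    # Cross-validate, scoring with R squared; the fold count here is my assumption
    results = cross_validate(reg, X_train, y_train, cv=5, scoring="r2")
    return results["test_score"]

# scores = cross_validate_model_tree(df_full, df_holdout, features)
# print(scores, scores.mean(), scores.std())
```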