Great. We've learned about the machine learning workflow, so why don't we go through it? In this lecture, we're going to do just that: we're going to look at NHL game outcomes. Specifically, the problem I'm interested in is building a predictive model for NHL game outcomes. More specifically, we want to predict who's going to win, the home or the away team, based on a given matchup. We'll think about them as home and away teams. We want to leverage data from both the season in play as well as from last year's games and historical information. Let's also include in our model salary information in the form of team salary caps. Now, this probably won't be your go-to model for winning the office pool, but it should give you a great idea of what the machine learning workflow looks like and the places where we need to start making decisions. A bit of a caveat: we are web scraping, because I want to show you an authentic process. But web APIs change all the time, and it's quite possible that the code I gave you won't work by the time you're viewing this video. Even worse, the Coursera environment limits your ability to get data from web APIs, so you shouldn't expect it to work on that system directly. But don't despair: all of the data is also provided within the labs, so you can both see the process and explore the datasets. Let's dig in. First, let's bring in a few imports. Since we need to scrape some web data, I'm going to bring in the urllib and json libraries. Then I'm going to bring in some standard data manipulation imports, the pandas library and NumPy. Now, the goal in this case is to predict the winner or loser of a given match using logistic regression. For the time being, I'm going to take out ties. Conceptually, our feature set will include three things.
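The setup described here amounts to just a handful of imports:

```python
# The imports described above: urllib and json for pulling web data,
# pandas and NumPy for data manipulation.
import urllib.request
import json

import pandas as pd
import numpy as np
```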
First, the performance of the two teams, home and away, based on their statistics this season so far. Second, the salary cap for the teams, which is an indicator of the value of the players; one would think that teams which pay more will have stronger players earning that money. And third, the standings of the teams, again home and away, from the previous season: how good was this team last year? Now, take note of the temporal nature of some of this data. As we gain more information about the current season, our machine learning model should be able to better predict the immediate future. Also, we use last season's stats to help incorporate prior knowledge, but we could look back at several seasons, and this would affect the model accuracy as well. The first step is to write a function to retrieve data from the wonderful NHL APIs, which are available directly from the NHL. In this case, we're going to build a model for the 2017-2018 season. This season had 1,271 games in it. I'm just going to write a function, get_game_data. I'm going to pull down the JSON data directly; the URLs for the NHL API are very nice. We just put a game identifier in here, so 0001 will be the 1st game, 0002 the 2nd game, and so forth. You can find more information about this API; it's freely available online. Then we're just going to pull that data down and load it as JSON data. The JSON data is pretty rich. For this analysis, we want to get information on scoring, and this is stored inside the goals JSON object. Inside the data given to us by the NHL, we'll look at the liveData, then the plays, the currentPlay, the about, and the goals. The nice thing about using the NHL API is that because it's a URL with JSON data, you can just punch the URL into a browser, there's no authentication or authorization needed, and you can look at what the data actually looks like. We also need to get information on the home and away teams.
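A sketch of what get_game_data might look like. The URL pattern shown here follows the NHL stats API as it existed at the time of recording (and as the lecture warns, it may no longer work); the game-identifier format is the part worth noticing, with the game number zero-padded to four digits.

```python
import json
import urllib.request

def make_game_id(game_number, season=2017, game_type="02"):
    # Zero-pad the game number to four digits: game 1 -> "...0001".
    return f"{season}{game_type}{game_number:04d}"

def get_game_data(game_number):
    """Pull the live-feed JSON for one game. The URL below is the
    historical statsapi endpoint and may have changed since recording."""
    url = ("https://statsapi.web.nhl.com/api/v1/game/"
           f"{make_game_id(game_number)}/feed/live")
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())
```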
This isn't useful for our model per se, but we want to predict whether the home or away team will win, and we need this in order to connect to other data sources. I'm going to get this out of the game data's teams, home, and name sections. Then we'll include the time as well. Finally, I'll just bring these three dictionaries together. This might be unfamiliar syntax; it's called dictionary unpacking, but it just breaks each dictionary up and creates a new dictionary which combines them all. In the end, we want to work with pandas DataFrame objects. This is what we're going to return to the caller, indexed by the time of the game. We'll just call this function for every game in the season, Game 1 through Game 1,271. This is going to take a bit if you run it directly. When I bring the data down, I'm going to save it to a CSV called game_results.csv and take a look at it. Now, if you're on Coursera, you'll want to run this block of code instead, and I'll do that just to show you that it works. This will load the data file that I've already brought down for you, so you don't have to go hit the NHL API. Here we see the data that's in the file: the date of each match, how many points the away team got, how many points the home team got, the home team's name, and the away team's name. Then we see that the time column is still there. This gives us our game-by-game breakdown of the season. Now we need to add a new column which indicates which team won, either home or away. Remember, that's our goal here: not to predict goals scored, but to predict who's going to win. We can do this by setting the default winner to away, then looking at the game scores and flipping it back where appropriate, and this code should do that nicely.
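The dictionary unpacking and the default-then-flip winner logic can be sketched on a toy game_results frame (team names and scores here are made up for illustration):

```python
import pandas as pd

# Hypothetical record for one game, combined via dictionary unpacking.
goals = {"away": 2, "home": 3}
teams = {"away_team": "Winnipeg Jets", "home_team": "Edmonton Oilers"}
when = {"time": "2017-10-04"}
record = {**goals, **teams, **when}   # merges the three dicts into one

# Toy game_results frame with the winner logic described above: default
# every match to an away win, then flip to "home" wherever the home
# team outscored the away team.
df = pd.DataFrame([record,
                   {"away": 4, "home": 1, "away_team": "Sharks",
                    "home_team": "Flames", "time": "2017-10-05"}])
df = df.set_index("time")
df["winner"] = "away"
df.loc[df["home"] > df["away"], "winner"] = "home"
```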
We're going to set it so that the away team won every match, and then we're going to look at all of the places where the home team actually won and set the value to home. Now, let's bring in salary information. I'm going to pull this down from a website called CapFriendly. This website does not have an API, so we need to scrape it. Thankfully, pandas has a function which aims to turn HTML tables into DataFrames for us automatically, and that's called read_html. The result of this function is a list of DataFrames, and I've manually inspected it to see that there is only one which has all of our cap information. Now, the website pretty-prints values as dollars, but we actually want to deal with numeric values, so I'm going to change our column of interest, the final cap hit, to be stripped of commas and dollar signs. Again, just like with the previous data, I've written the code here to show you how it's done, then stored the result as a CSV in the assets folder, and you can load it on Coursera like this. Here we can see the cap spending for the San Jose Sharks, the Flyers, the Flames, and so forth. The dirty secret of data science and analytics is that most of the work is actually in obtaining and cleaning data. It's good to build in some checks to see that all of the teams in our salary data are actually in the game data that we have. We can do this through a set difference: I'll just compare the unique home team values in the game results with the teams in the salary data, and see what's different. Okay, so there are two problem teams, the Canadiens and the Golden Knights. Now, as a die-hard Canadian who is also a strong Oilers fan, I don't have a problem dropping the Canadiens from our analysis completely. But it turns out that my wife, a French-Canadian, disagrees with this, and she told me so. So instead, let's rename the team in our salary data.
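The real CapFriendly table has many more columns; here is a minimal sketch of the dollar-sign cleaning and the set-difference check, with made-up numbers:

```python
import pandas as pd

# Toy stand-in for the scraped CapFriendly table (figures are made up).
salary = pd.DataFrame({
    "team": ["Montreal Canadians", "Edmonton Oilers"],
    "final_cap_hit": ["$84,500,000", "$79,000,000"],
})

# Strip dollar signs and commas so the cap hit becomes numeric.
salary["final_cap_hit"] = (salary["final_cap_hit"]
                           .str.replace(r"[$,]", "", regex=True)
                           .astype(int))

# The set-difference check: teams in the game data that are missing
# from the salary data.
game_teams = {"Montreal Canadiens", "Vegas Golden Knights", "Edmonton Oilers"}
problem = game_teams - set(salary["team"])
```

Note the spelling mismatch ("Canadians" vs. "Canadiens") shows up in the difference, exactly as described.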
We can do this fairly easily: we'll call replace on the salary data's team column, find where it says Montreal Canadians, and replace it with Montreal Canadiens. Now I'm going to promote the team name column to the index of the DataFrame and get rid of the column, because we're not going to use it anymore; we'll just keep the final cap hit. Now, the Golden Knights present another important problem. They didn't exist in the league in the 2016 season, so we don't have salary cap information for them. This is going to be a problem when looking at their stats from the previous season too. I'm going to fill in their data as missing using the NumPy NaN, or not-a-number, value, but it turns out we're going to have to address this a little bit later too. Okay, so let's just run these cleanup pieces. Great, we've got two data sources down and ready for analysis. Now we need to get some prior information about teams from the previous season. This will be useful for our model when we want to make early predictions and don't have much current season data. The NHL API has another great endpoint to get standings for the whole season, so we're going to use that, and I can't stress enough how great the NHL API is for doing data science work. We'll just pull down the JSON data directly, like we did before, and in this case we want to take the record. If you open this URL, you'll see that it's nested JSON: we want to get the records sub-frame, and then the team_records piece inside. These are both collections that we'll have to iterate through, so we're actually doing a bit of nested iteration. Now, we have to decide which standings we actually want to incorporate in our analysis. Do we just want the rank of the team from last season, the number of games they won, the number of goals they scored?
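The rename, the index promotion, and the missing-value fill for the Golden Knights can be sketched like this (cap figures are made up):

```python
import numpy as np
import pandas as pd

salary = pd.DataFrame({
    "team": ["Montreal Canadians", "Edmonton Oilers"],
    "final_cap_hit": [84_500_000, 79_000_000],
})

# Fix the spelling mismatch, then promote the team name to the index
# so we can look teams up by name; only final_cap_hit remains.
salary["team"] = salary["team"].replace("Montreal Canadians",
                                        "Montreal Canadiens")
salary = salary.set_index("team")

# The Golden Knights didn't exist yet, so mark their cap hit as missing.
salary.loc["Vegas Golden Knights"] = np.nan
```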
This is where your knowledge of the sport can come in to add context and value to the analysis. I'm just going to include everything for now, but this is usually a poor choice in practice. Since this is a JSON structure and we want to turn it into a DataFrame, we can just use the handy json_normalize function in pandas to flatten things out. We can then add that DataFrame to the bottom of our df_standings. Let's run this, and I'll again save it for offline use, so you can go ahead and load it directly. Look at all of this wonderful information: goals against, goals scored, points, divisionRank, divisionL10Rank, the roadRank. You can see how you could use a lot of the information in this DataFrame in a meaningful way, and even build new metrics of your own to explore. We now have our three sources of data for features. First, we have a game-by-game breakdown of teams and scores for the season in game_results; our target column, the one that we actually want to predict, is the categorical outcome column. We also have the salary information in the salary series, and we have last year's data in the previous season standings. What we're missing, however, is cumulative knowledge about how the teams are performing in the current season of interest. Our game results DataFrame only has who won and the game time, but it doesn't tell us what the stats are for each team thus far in the season. Of course, we expect the stats for the team this season to have the highest predictive power for an upcoming game. We're going to have to build this cumulative DataFrame ourselves, because unfortunately there isn't an API endpoint for us to get it from. Let's create a new DataFrame with won and lost columns, initialize it with the teams in our game results data, and set the initial values to 0. We can increment this as we gain new evidence of game performance. I'll call this df_cum.
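The nested iteration and flattening of the standings described a moment ago can be sketched with a hypothetical slice of the JSON (the win/loss figures are illustrative, not real standings):

```python
import pandas as pd

# Hypothetical nested standings JSON in the shape described above:
# a "records" list, each entry holding a "team_records" list.
standings_json = {
    "records": [
        {"division": "Pacific",
         "team_records": [
             {"team": "Edmonton Oilers", "wins": 47, "losses": 26},
             {"team": "Calgary Flames", "wins": 45, "losses": 29},
         ]},
    ]
}

# Nested iteration: outer loop over records, inner collection flattened
# with json_normalize, then appended to the bottom of df_standings.
rows = []
for rec in standings_json["records"]:
    rows.append(pd.json_normalize(rec["team_records"]))
df_standings = pd.concat(rows, ignore_index=True)
```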
I'm going to use a bit more advanced pandas here, in the form of a MultiIndex on the columns, by calling stack and unstack and then adding a time row. This is just an entry for the default zeros, and we'll get rid of it after we've built the cumulative frame. You don't have to worry too much about the more advanced usage of pandas here. But if you'd like to learn more about it, and certainly if you're going to do a lot more data cleaning you will need to, there are some excellent resources here on Coursera and in the book. This is what our DataFrame looks like: we've got the name of a team, for instance the Jets, and then whether they've won or they've lost, and then we're going to have all of the times in the index. Now we just need to iterate through all of the results in our game results and calculate the cumulative wins and losses as appropriate. Pandas provides a nice way to do this using the iterrows function. Identifying the winners and losers is actually pretty easy. Remember, I'm simplifying our analysis here and getting rid of ties. Maybe that's appropriate, maybe not; that's something for us to think about and maybe talk about. Then I just want to update the entry in our cumulative DataFrame. Again, the syntax might be a bit surprising here because we've got this MultiIndex: we have to give the team name as the top level, so this would be, for instance, the Winnipeg Jets, and then whether they won or lost. But my guess is that if it's the Jets, we're probably filling in the lost column more than the won column. Let's do df_cum.head and take a look at this. Now, it's a sparse matrix, because most teams don't all play on the same day. We have, for instance, in the first game here, that the Winnipeg Jets haven't won anything, but predictably perhaps, they've already lost a game. But we see that the Pittsburgh Penguins haven't played anything and the Oilers haven't played anything. By the time we get down here, we've got NaNs in here for the Jets.
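A simplified sketch of the cumulative frame and the iterrows update, assuming a two-team league and one game; the MultiIndex column access with a (team, won/lost) tuple is the part that mirrors the lecture, while the exact update logic here is illustrative:

```python
import numpy as np
import pandas as pd

# MultiIndex columns of (team, won/lost), seeded with a row of zeros
# that we would drop after building the frame (teams are illustrative).
teams = ["Winnipeg Jets", "Edmonton Oilers"]
cols = pd.MultiIndex.from_product([teams, ["won", "lost"]])
df_cum = pd.DataFrame([[0] * len(cols)], columns=cols, index=["start"])

games = pd.DataFrame({
    "home_team": ["Winnipeg Jets"],
    "away_team": ["Edmonton Oilers"],
    "winner": ["away"],
}, index=["2017-10-04"])

for time, row in games.iterrows():
    winner = row["away_team"] if row["winner"] == "away" else row["home_team"]
    loser = row["home_team"] if row["winner"] == "away" else row["away_team"]
    prev = df_cum.ffill().iloc[-1]   # last known running totals
    df_cum.loc[time] = np.nan        # sparse row for this game time
    # MultiIndex access: (team name, won/lost) as a column tuple.
    df_cum.loc[time, (winner, "won")] = prev[(winner, "won")] + 1
    df_cum.loc[time, (loser, "lost")] = prev[(loser, "lost")] + 1
```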
The Jets did not play at that time at all, and neither did the Penguins. But we see the Oilers played and, again predictably, they won. We can now propagate our scores forward in time, and we can do this with the ffill, or forward fill, function. This gives us a really nice, fully populated record of the cumulative wins and losses each team has. We can see that the Jets have losses from that first game forward; for the Penguins, it's 0 in both columns until they play a game, and then losses appear; and then the Oilers here. I'm getting excited, we're almost at the good part. Now we need to turn these three different data objects into a feature vector for prediction. Let's write another function, and we can have this function operate on a single row of game_results data and pull from the other DataFrames to create a feature vector. I'm going to call this create_features, and it's going to be applied across the whole observations DataFrame, working on a given row. Inside here, where that row is a sub-DataFrame, I want to pull out the team that was away, how many times they've won, and how many times they've lost, and the team that was home, how many times they've won, and how many times they've lost at this point in time. I want to adjust to ensure that we're not leaking the result of this match: because of how we determined who won and lost, when we put that in, it's actually going to be minus 1, so the features reflect only the games before this particular day. Then I'm going to add in the salary cap information from last year. Here I'm just propagating: I'm taking data that we have in some of these static DataFrames and moving it into the features DataFrame for this given time period, this slice of data. I'm going to do that with the previous season standings too, and I'm going to add an indicator here, a prefix, just for inspection later, about home and away. These Golden Knights continue to be a problem.
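A hypothetical sketch of create_features, showing the leakage adjustment (subtracting this game's own result) and the propagation of static salary data; the column and variable names here are illustrative, not the lecture's exact code, and the standings lookup is omitted for brevity:

```python
import pandas as pd

def create_features(row, df_cum, salary):
    """Build a flat feature record for one game_results row, using the
    cumulative frame and the salary table (names are hypothetical)."""
    feats = {}
    for side in ("home", "away"):
        team = row[f"{side}_team"]
        won = df_cum.loc[row.name, (team, "won")]
        lost = df_cum.loc[row.name, (team, "lost")]
        # Subtract this game's own result so only prior games count.
        if row["winner"] == side:
            won -= 1
        else:
            lost -= 1
        feats[f"{side}_won"] = won
        feats[f"{side}_lost"] = lost
        # Propagate the static salary cap in, prefixed home/away.
        feats[f"{side}_cap"] = salary.loc[team, "final_cap_hit"]
    return pd.Series(feats)

# Toy invocation (all values hypothetical).
cols = pd.MultiIndex.from_product([["Jets", "Oilers"], ["won", "lost"]])
df_cum = pd.DataFrame([[0, 1, 1, 0]], columns=cols, index=["t1"])
salary = pd.DataFrame({"final_cap_hit": [80, 75]}, index=["Jets", "Oilers"])
game = pd.Series({"home_team": "Jets", "away_team": "Oilers",
                  "winner": "away"}, name="t1")
feats = create_features(game, df_cum, salary)
```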
They didn't exist in the previous season, so our code that converts the values to a dictionary won't work, and we've got to be robust in this case. We're just going to create an empty dictionary for teams that have no previous season; this is a nice general solution. Then let's pass those back and create this set of observations. The observations DataFrame that we get back is really the core of our analysis, with a little bit more cleaning left to do. We see that there are 80 different features here, 80 different columns. But some of them are a little unclear to me in what they mean or how they might be useful, and a number of them are actually not numeric values. When we're using logistic regression in scikit-learn, we need to have numeric values. Now, there are strategies we can use to turn these into numeric values. For instance, the clinchIndicator: we could turn it into numeric values by using something called dummy variables, or indicator variables. But I don't think we want to go down that road; I think we can just strip those out. I'm actually going to get rid of a few pieces of data. The first piece of data cleaning I want to do is to get rid of the away and home columns. These are the goals that were scored by the away team and the home team, and we don't know them until after the game. This model of observations is supposed to contain things that we know coming into the game, that we are going to predict based on. These columns perfectly correlate with the outcome; in fact, that's how we made our classification label. So we're just going to drop those two columns.
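The leakage-driven drop can be sketched on a toy observations frame (column names are illustrative):

```python
import pandas as pd

# Toy observations frame: "away" and "home" are goals scored, which we
# only know after the game, so keeping them would leak the label.
obs = pd.DataFrame({
    "away": [2, 4],
    "home": [3, 1],
    "home_won": [5, 6],          # legitimately known before the game
    "outcome": ["home", "away"],
})
obs = obs.drop(columns=["away", "home"])
```

Had we wanted dummy variables for a categorical column instead of stripping it, `pd.get_dummies` would produce them in one call.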
We're also going to get rid of some of these text-based columns, the team names, and then a whole bunch of other data, like the streak code. The streak code is interesting because it tells you what the team's streak was in the previous season, but sometimes it's a winning streak and sometimes a losing streak, and it's really unclear to me how valuable that would be. For this demonstration, it just gets in the way to have to deal with this text-based data, so I'm going to get rid of it too. I've made some pretty arbitrary and questionable choices here. I got rid of some semantic information that might be useful about these streaks, but I left in the numeric data about the streaks, and this is really not a great approach, though I don't think it'll be a problem here. It's a place where you could take this model and start to more iteratively refine it and critically inspect it. In the end, the best model is going to be the one that's clean and has thoughtful data coming into it, where the data is indicative of future data. We're going to talk a little bit about the principle of parsimony in a future lecture. One last bit of data cleaning: let's get rid of the time column, it's just noise that we've captured.
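Once the observations frame is fully numeric, the logistic regression fit named at the start of the lecture is a one-liner in scikit-learn. This is a minimal sketch with random stand-in data, not the lecture's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))      # stand-in numeric features
y = rng.integers(0, 2, size=200)   # stand-in home/away labels (0/1)
clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)
```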