In the last session, we generated forecasts of the outcomes of Premier League games using a within-sample model. What that means is that we used all of the data we had available to generate a model, the ordered logit regression model, and then we used that model to fit the results for those same games. So we were using our model to predict the games from which we derived our model. In some sense, that's having it both ways: you're fitting the model to data it has already seen. What we want to do now is produce out-of-sample forecasts. So we're going to use one part of the data to generate the model, and then we're going to use that model to forecast results for another set of data. The precise use we're going to put that to here is that we're going to look at the games played in the English Premier League up to the end of calendar year 2019, and then use those to forecast games played in the first few months of 2020. Now, this season is a particularly odd one, because it was interrupted by the lockdown in the UK, which started in March as a result of the Covid-19 pandemic. In fact, the Premier League season was suspended on March the 13th, 2020. At the time that we're recording this, I don't have access to the results that have occurred, or may occur in the future, when they go back to playing games. So in that sense, this is a truncated season. The out-of-sample games we have in this case are the games that were played in the first few months of 2020, up to the shutdown on March the 13th. Another thing we're going to do in this session is compare our results not only to the bookmaker's odds but also to another forecaster, and that forecaster is Nate Silver's FiveThirtyEight.
FiveThirtyEight is a website that is devoted to analyzing all sorts of data and trying to do the same kinds of things we're doing, which is using data to forecast results. They do it for politics as well as for sports, and they have a whole section devoted to soccer in the United States and in the rest of the world. They have forecasts of Premier League results as well. So we're going to use their forecasts of Premier League results and compare our forecasts to theirs. Okay, so let's get going with the analysis. We can import the packages that we need in order to run our code, and we can import the data, which is the Premier League data for the 2019/2020 season. You can see here again the format that we've used before. We've got the date, the home team and the away team. This is a list of all 380 games scheduled for the season. Some of these games, because of the suspension of the season, have not been played, so we've given each game a value of 1 if it has been played, or 0 otherwise. We have the month of the year, the day of the month and the year in which the games were played. We have the full-time home team goals, the full-time away team goals, and the result: home win, draw or away win. We have the TM values for the home team and the TM values for the away team, which we're going to use to generate our forecasts. Then we have the decimal betting odds from the Bet365 bookmakers: the decimal odds of a home win, the decimal odds of a draw and the decimal odds of an away win. And then here we have the probabilities taken from the FiveThirtyEight website: the probability of a home win, the probability of a draw and the probability of an away win. We can use those for comparisons later on. And if we run describe here to look at the data, we can see that we have 288 games played in the season.
And we're going to use those games played up to the beginning of 2020 in order to generate our model. So we want to identify the most likely outcome of each game based on the probabilities. For the Bet365 data, we don't need to calculate the probabilities directly. We know that the decimal odds are inversely related to the probabilities, whether we scale them or whether we don't. What that means is that the lowest value of the decimal odds corresponds to the most likely outcome. So if we want to identify what Bet365 was predicting as the most likely outcome of the game, we can just take the lowest value of the decimal odds. That will be their prediction of the most likely outcome. Now, it's also worth bearing in mind that if you do this yourselves with other data sets in the future, you might run into the problem that sometimes the odds on two different events are the same. So, for example, suppose that according to Bet365 the most likely outcome was both a home win and a draw: they had the same decimal odds, and the away win was less likely. What should you do in that situation? Well, firstly, it's worth pointing out that that's very rare. It doesn't happen very often, certainly when looking at game results in major leagues. And secondly, if you do come across a number of those cases, then you need to adopt some kind of tiebreaker to say which one you're going to make the most likely. You could do that at random, so you could just toss a coin and say, well, if it's heads it will be a home win, and if it's tails I'll give it a draw. If you're working with a large data set, that fix will not really matter. That problem actually doesn't arise in the data that we're using here today, but I just wanted to draw your attention to it.
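The Bet365 step described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names `B365H`, `B365D`, `B365A` and `B365res`, the helper `most_likely`, and the odds themselves are assumptions made up for the example. The third row has tied odds on purpose, to show the coin-toss tiebreaker.

```python
import numpy as np
import pandas as pd

# Toy decimal odds (hypothetical values and column names, not the course data).
odds = pd.DataFrame({
    "B365H": [1.50, 3.40, 2.60, 2.90],
    "B365D": [4.20, 3.50, 2.60, 3.30],
    "B365A": [6.50, 2.10, 3.10, 2.40],
})

labels = {"B365H": "H", "B365D": "D", "B365A": "A"}
cols = list(labels)
rng = np.random.default_rng(0)  # seeded so the tiebreak is reproducible

def most_likely(row):
    """Return the outcome with the lowest decimal odds, breaking ties at random."""
    lowest = row[cols].min()
    tied = [c for c in cols if row[c] == lowest]
    pick = tied[rng.integers(len(tied))]  # only random when more than one is tied
    return labels[pick]

odds["B365res"] = odds.apply(most_likely, axis=1)
print(odds)
```

Without the tiebreaker, `odds[cols].idxmin(axis=1)` would do the same job in one line, but it always returns the first of any tied columns.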
Okay, so now we show the predicted result based on the Bet365 odds, and we can do a cross tab to show the congruence between the results predicted by Bet365 and the actual results. This cross tab is a useful way of looking at things. Along the vertical axis, we have the full-time results: away win, draw and home win. And on the horizontal axis we have the Bet365 results. One thing to note here is that the Bet365 betting odds never give a draw as the most likely outcome. That's perhaps a little bit surprising, but these data sets generally don't do that. You'll find that's also true of our prediction models, and it's also true of the FiveThirtyEight data. So in some sense it's a property of many of these forecasting models that they don't predict draws. It's not impossible, but it's very unlikely, and mostly they predict either away wins or home wins. In that sense, they're going to get quite a lot of results wrong anyway, because of the existence of draws. But apart from that, if you look at the table, you could ask what the success rate is when the result is an away win. Well, Bet365 is more likely to get it right than get it wrong. And when it comes to a home win, Bet365 is actually much more likely to get it right. That's one thing we were discussing in the previous session: the betting odds and the models seem to be better at predicting home wins than anything else. Okay, so we want to estimate our regression model now, in order to generate our predictions. To do that, we've first got the TM ratios. But remember, we're going to use the log of the TM ratio as seen by the home team in order to run our regression models, and we need to create a win value, where the win value is 2 if it's a home win, 1 if it's a draw, and 0 if it's a loss. That's going to be used as the dependent variable in the ordered logit regression.
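A cross tab of this kind can be sketched with `pd.crosstab`. The eight games below are invented for illustration, and `B365res` is an assumed name for the column holding the Bet365 prediction; only `FTR` is the result column mentioned in this session.

```python
import pandas as pd

# Toy results (hypothetical): actual full-time result vs. the Bet365 favourite.
games = pd.DataFrame({
    "FTR":     ["H", "D", "A", "H", "A", "D", "H", "A"],
    "B365res": ["H", "H", "A", "H", "H", "A", "H", "A"],
})

# Rows are actual results, columns are predicted results.
# Note there is no "D" column, because a draw is never the favourite here.
table = pd.crosstab(games["FTR"], games["B365res"], margins=True)
print(table)

# Overall share of games where the favourite matched the actual result.
hit_rate = (games["FTR"] == games["B365res"]).mean()
print(hit_rate)
```

Reading along a row gives the success rate for one kind of actual result, for example how often an actual home win had been predicted as a home win.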
And so we need to identify the observations we're going to use. Our data is in date order, so we're going to take all those rows up to the end of the year 2019. You can see here that it goes up to row 197. Remember that Python counts from 0 as the first row, so in fact there are 198 games played up until the beginning of 2020. We're going to take that as a subset of our data, and we call this season 19 rather than season 1920. That's just the first 198 rows of the data, and you can see that otherwise it's the same as everything we've got. We're going to use this subset to run our ordered logit regression. So, as before, we import the commands we need in order to run the ordered logit regression, and then we fit the win value to the log of the TM ratio. We get the regression results here, and the coefficients are not exactly the same, but essentially the output is the same as we got when we did this for all the seasons 2011 to 2019. So we're going to repeat what we did then, which is to take the coefficients that we need. We need the beta, which is the effect of the TM ratio, and we need the two intercepts, which mark the boundaries between home win and draw on one side, and draw and away win on the other side. We're going to use those boundaries to generate our probabilities. I describe here how those probabilities are constructed, and here we actually write down an equation which solves for the predicted probabilities. And we can now see that we have our predictions based on our model. Now, note here that I've incorporated our predictions into the full data set, not the truncated data set. So what Python is doing here is taking the coefficients from our model and applying them to the data in the full data set, not the truncated data set.
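The sketch below illustrates how three outcome probabilities follow from one slope and two cut points, and how coefficients estimated on the 2019 subset can be applied to every row of the full season. Several things here are assumptions: the column names, the toy TM values, and above all the numbers for `beta`, `cut1` and `cut2`, which are placeholders rather than estimates from the course data. It also assumes the common convention that the probability of an outcome at or below a cut point is `logistic(cut - beta * x)`; some packages report the intercepts with the opposite sign, so check the convention of whichever library you fit with. The fitting step itself is left out.

```python
import numpy as np
import pandas as pd

# Toy fixtures; values and column names are hypothetical.
season = pd.DataFrame({
    "year":   [2019, 2019, 2019, 2020, 2020],
    "FTR":    ["H", "D", "A", "H", "A"],
    "TMhome": [900.0, 300.0, 150.0, 500.0, 200.0],
    "TMaway": [300.0, 300.0, 600.0, 250.0, 800.0],
})

# Log of the TM ratio as seen by the home team, and the ordered outcome.
season["lhTMratio"] = np.log(season["TMhome"] / season["TMaway"])
season["winvalue"] = season["FTR"].map({"H": 2, "D": 1, "A": 0})

# Estimation sample: games played in 2019 only (the model would be fitted on this).
season19 = season[season["year"] == 2019]

# Placeholder coefficients standing in for the fitted slope and cut points.
beta, cut1, cut2 = 1.0, -1.0, 0.2

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Apply the coefficients to the FULL data set, including the 2020 games.
x = season["lhTMratio"]
season["predA"] = logistic(cut1 - beta * x)                             # P(away win)
season["predD"] = logistic(cut2 - beta * x) - logistic(cut1 - beta * x)  # P(draw)
season["predH"] = 1.0 - logistic(cut2 - beta * x)                       # P(home win)
print(season[["year", "predH", "predD", "predA"]])
```

Because the three probabilities are differences of the same cumulative curve, they sum to one for every game, and a larger TM ratio in favour of the home team shifts probability from the away win towards the home win.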
So what that means is we now have predictions not just for the games played up until the end of 2019, but also for all of the games played in 2020. As long as we have a TM ratio, we have a prediction for the game. Note also that this includes predictions for games that haven't yet been played because of the interruption due to the coronavirus pandemic. But we're going to be interested in those games played up until the suspension of the season before the lockdown. In order to evaluate our forecasts, we need to find out exactly what results our model is predicting and how often the model is correct. To do that, we find the outcome that has the highest probability. Then we identify the predicted outcome, logit pred, based on which column, predA, predD or predH, is equal to that maximum probability. And then we can generate another column, logit true, which is true when the value of our logit pred is equal to FTR, the actual result. When we do that, the data frame now looks like this, and you can see we have values here of true or false in terms of predicting the outcome. So, armed with our predictions, we can calculate Brier scores, of course. And with this logit true, we can just calculate our average success rate.
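The evaluation step described above might look roughly like this. The probabilities are invented, and the column spellings `logitpred` and `logittrue` are guesses at how the notebook names them. The Brier score is written in its multi-category form (squared error summed over the three outcomes, then averaged over games), which is assumed to match the definition used in the earlier session.

```python
import pandas as pd

# Toy predictions (hypothetical probabilities) alongside the actual results.
games = pd.DataFrame({
    "FTR":   ["H", "A", "D", "H"],
    "predH": [0.60, 0.20, 0.45, 0.30],
    "predD": [0.25, 0.25, 0.30, 0.30],
    "predA": [0.15, 0.55, 0.25, 0.40],
})

prob_cols = {"predH": "H", "predD": "D", "predA": "A"}

# Predicted outcome = the column holding the highest probability.
games["logitpred"] = games[list(prob_cols)].idxmax(axis=1).map(prob_cols)

# True where the predicted outcome matches the actual result.
games["logittrue"] = games["logitpred"] == games["FTR"]
success_rate = games["logittrue"].mean()

# Multi-category Brier score: squared gap between each probability and the
# 0/1 indicator of what actually happened, summed per game, then averaged.
squared_error = sum(
    (games[col] - (games["FTR"] == outcome).astype(float)) ** 2
    for col, outcome in prob_cols.items()
)
brier = squared_error.mean()

print(games)
print(success_rate, brier)
```

A lower Brier score is better, while a higher success rate is better. The same two numbers can be computed for the Bet365 and FiveThirtyEight probabilities to compare forecasters on equal terms.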