In the previous video, we learned that econometric methods seek to make sense of economic data. In principle, though, these methods are useful for making sense of any type of data on variables. The first question that we will try to answer, making use of correlation, is: do home teams get more penalty kicks than away teams? This example concerns data on penalty kicks in Dutch football, given by 15 referees to home and away teams. A basic premise of using econometric techniques is that you need a benchmark, and this benchmark could be the average.

Now let's have a look at the data on the penalties given by these 15 referees. There are 15 observations on a variable that we call z. When we denote them as z_1 to z_15, the average of the 15 observations is z-bar = (1/15) Σ z_i, where capital sigma (Σ), the Greek version of capital S, means summation; here the summation runs from the first observation to the fifteenth. In general, we write the mean of n observations in shorthand as z-bar = (1/n) Σ z_i.

For the penalty kicks given to the home team, the 15 observations range from 0 to 53, which is of course related to the number of games each referee officiated. The average of these 15 observations is 24.267. For the penalty kicks given to the away team, the average is 12. Note that the average number of penalty kicks can be an actual observation, as in the case of 12, but it can also be a number that cannot occur, like 24.267. At first sight, home teams seem to have an advantage. The median is somewhat similar to the average. It is obtained by ordering the observations, as for the home teams here, and then taking the middle number, which is 20. Now we turn to the range, or better, the variance. The range is of course one way to measure the spread of the observations, but it is a little inconvenient, as it does not benchmark the spread relative to the mean.
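As a small illustration, the mean and median just described can be computed in a few lines of Python. The full list of 15 referee observations is not given here, so the data below are illustrative placeholders, not the actual penalty counts.

```python
def mean(z):
    """Average: sum the observations and divide by their number."""
    return sum(z) / len(z)

def median(z):
    """Order the observations and take the middle number
    (the average of the two middle numbers when n is even)."""
    s = sorted(z)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

z = [12, 20, 7, 31, 20]   # illustrative data, not the actual referee counts
print(mean(z))            # 18.0
print(median(z))          # 20
```

Note that the median only needs the ordering of the observations, which is why it is less sensitive to extreme values than the mean.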
So one would want a measure of the spread of the observations around that average, and that measure is called the variance. When we again have n observations on z, denoted as z_1 to z_n, with the average given by z-bar as before, the variance is defined as s_z^2 = (1/(n-1)) Σ (z_i - z-bar)^2. For the 15 observations on penalty kicks given to the away team, this is 129.72. Note that this variance is always non-negative, as it amounts to a summation of squares. The square root of the variance of the variable z is called the standard deviation. For these 15 observations it equals the square root of 129.72, which is about 11.39.

We have now dealt with some basic properties of the data, like the mean, median, and standard deviation. We turn to relations between variables. Variables can be related. For example, families with more children spend more on groceries than smaller families. This relation does not necessarily run both ways: people who spend a lot on groceries do not necessarily have many children. Sometimes it is evident that when one variable obtains larger observations, another will too. Think of tall people, who typically also have larger feet. Or think of a grocery store that lowers its prices for detergent: sales may go up due to these lower prices, and in that sense there is some kind of causality, because prices go down and sales go up. It could also be that such causality is not obvious, or even absent. National economies may show similar growth patterns, but it might be unclear which of the countries makes it possible for the others to grow. Think of China: was China's economic growth in part due to increased exports to the US, because the US economy was doing well and Americans could afford them, or was the US economy doing well because it could import cheaper products from China, which in turn allowed China to grow? Disentangling what comes first and what comes next can thus be difficult.
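The variance and standard deviation can be sketched the same way. This assumes the usual sample variance, with n - 1 in the denominator; the numbers are again illustrative, not the actual referee data.

```python
import math

def variance(z):
    """Sample variance: squared deviations from the mean,
    summed and divided by n - 1 (the usual convention)."""
    zbar = sum(z) / len(z)
    return sum((zi - zbar) ** 2 for zi in z) / (len(z) - 1)

def std(z):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(z))

z = [1, 2, 3]             # illustrative data
print(variance(z))        # 1.0 -- a sum of squares, so never negative
print(std(z))             # 1.0
```

Because every term in the sum is a square, the variance can only be zero when all observations equal the mean.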
A useful measure to describe co-movement, or common properties, is the sample covariance. Suppose you have n observations on a variable x, denoted as x_1 to x_n, with average x-bar, and another set of n observations on a variable y, denoted as y_1 to y_n, with average y-bar. The sample covariance between x and y is then defined as s_xy = (1/(n-1)) Σ (x_i - x-bar)(y_i - y-bar).

Let us have a look at the penalty kicks again. The unit i is a referee, who in various matches gives penalty kicks to home and away teams. So we look again at this table. When we call the number of home-team penalties x, with an average of 24.267, and the number of away-team penalties y, with an average of 12, we can compute the sample covariance. It is positive, and hence this means that apparently, if a referee gives more penalties to one team, the referee is also more likely to give more penalties to the other team.

A drawback of the covariance is that it is not scale-free. For the grocery-spending example: if grocery spending had been recorded not in US dollars but in thousands of US dollars, meaning that the observations would have been like 0.5 or 0.06, then the covariance would have become 1,000 times smaller. Likewise, when the data are not in US dollars but in euros, the covariance becomes different. A scale-free measure of the strength of the relationship between two variables x and y is the correlation. The correlation between these variables is defined as r_xy = s_xy / (s_x s_y), which in words means that the covariance is scaled by the standard deviations of x and y. For the penalty kicks, we get 0.809. The upper bound of a correlation is one and the lower bound is minus one, so 0.809 is rather close to one. This confirms the notion that referees who give more penalties to home teams also do so to away teams. We may now also want to know whether the number of penalties given to the home team is significantly larger than the number given to the away team.
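The covariance and correlation just defined can be sketched as follows, including a check of the scale issue discussed above: rescaling one variable changes the covariance but leaves the correlation essentially untouched. The data are illustrative, not the referee data.

```python
import math

def mean(z):
    return sum(z) / len(z)

def covariance(x, y):
    """Sample covariance with n - 1 in the denominator."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar)
               for xi, yi in zip(x, y)) / (len(x) - 1)

def std(z):
    zbar = mean(z)
    return math.sqrt(sum((zi - zbar) ** 2 for zi in z) / (len(z) - 1))

def correlation(x, y):
    """Covariance scaled by both standard deviations: scale-free."""
    return covariance(x, y) / (std(x) * std(y))

x = [1, 2, 3]
y = [2, 4, 6]
print(covariance(x, y))    # 2.0
print(correlation(x, y))   # 1.0

# Record y in "thousands" instead (dollars -> thousands of dollars):
y_scaled = [yi / 1000 for yi in y]
print(covariance(x, y_scaled))   # roughly 0.002: 1,000 times smaller
print(correlation(x, y_scaled))  # still (essentially) 1.0
```

The scaling in the last line is exactly why the correlation, not the covariance, is the number to report when comparing strengths of relationships.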
For that purpose we can use a so-called t-test. Let us see how that test works. Consider n observations z_1 to z_n with sample average z-bar. To test the null hypothesis that the population mean is equal to some value mu, one can use the t-test statistic t = (z-bar - mu) / (s_z / sqrt(n)), where s_z is the standard deviation. This statistic approximately follows the standard normal distribution, which we will explain in a later video.

When there are two matched samples for the same individuals, which is what we have here for the referees and the penalties, with observations x_1 to x_n for the first sample and y_1 to y_n for the second, one can compute the differences d_i = x_i - y_i and their mean d-bar. For the penalties, this mean difference is 24.267 - 12 = 12.267. To test the hypothesis that the population mean of the differences is equal to some value mu, one can use the t-test statistic t = (d-bar - mu) / (s_d / sqrt(n)), where again N(0,1) means the Gaussian distribution with mean zero and variance one, and where s_d is the standard deviation of the differences, here equal to 10.504. To test whether the differences between penalties for home teams and away teams have mean zero, that is, mu = 0 in the notation above, we compute t = 12.267 / (10.504 / sqrt(15)), which is about 4.52. This t-test value is way above the two which associates with the 95 percent interval of the standard normal distribution, and hence we conclude that, statistically speaking, the differences between the penalties for home teams and away teams do not have mean zero. In other words, home teams seem to be in a favorable position.

In this video, we investigated the question: do home teams get more penalty kicks than away teams? To answer it, we used an econometric method called correlation. We calculated the mean, the median, the standard deviation, and the covariance, and we performed the t-test. In the end we can conclude that the number of penalties given to the home team seems to be significantly larger than the number given to the away team, and that there is a correlation between the numbers of penalties given to home and away teams.
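Plugging in the summary statistics reported above (n = 15 referees, averages 24.267 and 12, standard deviation of the differences 10.504), the paired t-value follows directly; a minimal sketch:

```python
import math

# Summary statistics from the text (penalty kicks per referee).
n = 15                    # number of referees
mean_diff = 24.267 - 12   # home average minus away average: 12.267
s_d = 10.504              # standard deviation of the differences

# Paired t-test of the null hypothesis that the mean difference is zero.
t = mean_diff / (s_d / math.sqrt(n))
print(round(t, 2))        # 4.52 -- well above 2, so reject a mean of zero
```

With a value this far into the tail of the standard normal distribution, the conclusion that home teams are favored does not hinge on the exact critical value used.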