Now, we are ready to discuss the notion of correlation between two numeric variables. Let me consider four different data sets. Each data set contains two numeric variables, X and Y, and I will draw the corresponding scatter plots. This is the first data set, this is the second, this is the third, and finally, this is the fourth. Let me enumerate them: 1, 2, 3, 4. Now, let us discuss what kind of relation between the variables each of these scatter plots expresses. Let us begin with scatter plot number 2. We see here that if X is relatively large, so the values of X are somewhere here, then the corresponding values of Y are also large; the values of Y are somewhere here. And if the values of X are small, the values of X are somewhere here, then the corresponding values of Y are also small; our data points are somewhere here. We see that there is a relation between X and Y. This relation is more or less monotonic: when X increases, Y also increases, and we can see that this relation is approximately linear. We can draw a straight line that roughly approximates this tendency, this relation between X and Y. Of course, if we select a pair of points, we can find some violation of this tendency. For example, if I select this point and this point, we see that here the larger X corresponds to the lower Y. But on average, larger X corresponds to larger Y. On this graph, the situation is the opposite. If we know some value of X, if we know, for example, that X is somewhere here, then we know that Y is somewhere here; but it is the same as if we knew that X was somewhere here. So the value of Y is more or less independent of the value of X, at least on average. In this case, we say that there is no correlation between X and Y. Now, let us consider the third case. Here, we see that larger values of X correspond to lower values of Y. In our terms, we will say that there is a negative correlation between X and Y.
Now, let us discuss this picture and this picture. We see that in both pictures there is a positive correlation between X and Y, but we also see that here this correlation is, in a sense, more straightforward. For a given X, the range of possible values of Y is rather small, while here, for a given X, the range of possible values of Y is rather large. So we can say that in this picture we see a much stronger correlation between X and Y than in this picture. Now, we have some intuition about the notion of correlation, and we want a mathematical definition of some value that will encode this intuition. So I want to create a formula that will allow me to distinguish between these pictures and estimate the strength of the correlation between the variables. I begin with the notion of sample covariance. Let me draw, again, a scatter plot like this one. First, I want to find the average value of all Xs and of all Ys. The average of all Xs is somewhere here, this is X-bar, and the average of all values of Y is somewhere here, this is Y-bar. Now, let me draw two straight lines for these values. I will be interested in the following sum: the sum over all data points of the product, X_i minus X-bar times Y_i minus Y-bar. For example, if I have this point, the corresponding differences are positive. This distance, from the point to this vertical line, is x minus x-bar, and this distance is y minus y-bar. Let me assume that this is point number 1, so this is x_1 and this is y_1. We see that in this part of the graph, the x coordinates of all points are larger than x-bar and the y coordinates of all points are larger than y-bar, which means that here this product will be positive. What about this part? In this part, both differences, this one and this one, are negative, because, for example, the x coordinate of this point is less than x-bar.
So this difference will be negative, this difference will also be negative, and the product of two negative values is positive. So these points also make a positive contribution to this sum. Let me mark the points with positive contributions with this pink color. What about this point? Here, the difference x minus x-bar is negative, but this difference is positive. So this point gives a negative contribution to this sum, as does this point. We see that if the picture looks like this one, so we have a lot of points here and a lot of points here, and only a small number of points here and here, this sum will be positive. But if we work, for example, somewhere here, then the corresponding sum will be negative. Indeed, let me draw x-bar and y-bar here, and you see that we have a lot of points with negative contributions, these points, and only a small number of points with positive contributions. So for this picture, this value will most likely be negative. For this picture, we see that the points with positive contributions, and the corresponding products, will more or less balance the points with negative contributions. These points give a positive contribution and these points give a negative contribution, and they can compensate for each other. So we can expect that for this picture, the sum will be close to zero. We see that this sum in fact captures the idea of the relation between x and y. If I divide this sum by n, I get the sample covariance between the variables x and y. So this is the covariance between x and y. To make a correlation out of this covariance, I have to divide the covariance by the product of the standard deviations of x and y. So the correlation, specifically Pearson's correlation of x and y, is the covariance of x and y over the standard deviation of x times the standard deviation of y. I have to use the same denominator when I calculate the standard deviations here.
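As a sketch of these definitions in Python (the helper name `pearson_r` is mine, not from the lecture), the sample covariance and Pearson's correlation can be computed directly, using the same denominator n throughout:

```python
import math

def pearson_r(xs, ys):
    """Sample covariance divided by the product of standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance: average of the products (x_i - x_bar)(y_i - y_bar)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    # Standard deviations, with the same denominator n as the covariance
    sd_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# An exactly linear increasing relation gives r = 1
print(round(pearson_r([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]), 6))   # → 1.0
# Reversing the direction gives r = -1
print(round(pearson_r([1, 2, 3, 4, 5], [11, 9, 7, 5, 3]), 6))   # → -1.0
```

Note that each product in the covariance sum is exactly the signed contribution of one data point discussed above: positive in the upper-right and lower-left quadrants, negative in the other two.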
So if I put n here, I have to put n in the denominator of the formula for the standard deviation, here and here. The advantage of Pearson's correlation is that it does not change when x or y is scaled. For example, suppose I rescale the variable x and multiply all values x_i by, for example, 10. It means that I just change the coordinate system on the horizontal axis; I just change the unit of measure. Then the covariance will be multiplied by 10, but the standard deviation of x will also be multiplied by 10, and these factors cancel each other. So we see that Pearson's correlation does not depend on the choice of units of measure for x and y. This is another good property for a correlation measure, because we understand that the strength of the relation between two variables should not depend on the units in which those variables are measured. Returning to this picture, I can say that in this picture, Pearson's correlation is close to zero. This correlation is usually denoted by the letter R, and here it is close to zero. Here, where our relation is almost linear, Pearson's correlation is close to one, but it is not more than one, because Pearson's correlation ranges from negative one to one. So this value lies in the segment from negative one to one. The value R equals one means exactly that we have an exact linear relation between the two variables. Here, Pearson's correlation is positive but not very large, probably about 0.6 or 0.7, and here Pearson's correlation is negative, approximately negative 0.7. So we see that Pearson's correlation, which is usually denoted by R, is a number from negative one to one that allows us to measure the strength of the relation between variables, and it works pretty well when we assume that the relation between the variables is linear or approximately linear, as in these pictures. For other kinds of relations, we can use other kinds of correlation, which we will discuss later.
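The cancellation described above can be checked numerically; the small data set below is made up purely for illustration:

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(xs, ys):
    """Sample covariance with denominator n."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)

def sd(v):
    """Standard deviation with the same denominator n."""
    v_bar = mean(v)
    return math.sqrt(sum((x - v_bar) ** 2 for x in v) / len(v))

xs = [1.0, 2.0, 4.0, 5.0, 7.0]
ys = [2.1, 3.9, 8.3, 9.8, 14.2]
xs10 = [10 * x for x in xs]  # change the unit of measure of x

# The covariance and the standard deviation of x both pick up the factor 10...
print(round(cov(xs10, ys) / cov(xs, ys), 6))  # → 10.0
print(round(sd(xs10) / sd(xs), 6))            # → 10.0

# ...so the factors cancel in r = cov(x, y) / (sd(x) * sd(y))
r1 = cov(xs, ys) / (sd(xs) * sd(ys))
r2 = cov(xs10, ys) / (sd(xs10) * sd(ys))
print(abs(r1 - r2) < 1e-9)  # → True
```

The same cancellation happens for any positive scaling of x or y, which is why R is a unit-free measure of the strength of a linear relation.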