The final topic that I want to discuss in probability and statistics is linear regression. Here we'll look at the equation of the linear regression line, how good a fit it is, and correlation coefficients, and then I'll make some final comments.
So, first of all, the idea of linear regression arises when we want to fit a straight line through some data. This example data here is some variable plotted versus time. It happens to be the weight of an individual on the vertical axis and the number of days on the horizontal axis. What we'd like to do is fit a straight line through that data, for example, the red line shown here. Generally speaking, the equation of a linear regression line is y hat = a + bx, where this little hat over the top of the y is sometimes called a circumflex. This is the notation they use in the reference handbook. To find the equation of that line, we usually use the method of least squares.
Here's the extract from the reference handbook that explains this. In that equation, the y-axis intercept is a = y bar - b x bar, where y bar and x bar are the mean, or average, values of y and x respectively. So y bar is 1/n times the summation from i = 1 to n of yi, and x bar is 1/n times the summation from i = 1 to n of xi, the normal averages of those variables. The slope of the line, b, is given by Sxy / Sxx, where Sxy is the sum of the xy products, corrected for the means, and is defined by this equation here, and Sxx is the corresponding sum of the squares of the x values, as defined here. Later on, we'll also need Syy, which is the corresponding sum of the squares of the y values, as shown here.
So, what is this method actually doing? If I look at any individual data point and compute its distance, say delta, from the line here, what we're doing is computing delta squared. Delta can be plus or minus, but squaring makes them all positive.
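Written out in symbols, the handbook quantities referred to above take the following form. The expressions for Sxy, Sxx, and Syy are reconstructed from the way they are evaluated in the numerical example later in this section, and the last line assumes that delta is the vertical distance from a data point to the line, which is the usual least-squares convention.

```latex
\begin{aligned}
\hat{y} &= a + b x, \qquad a = \bar{y} - b\,\bar{x}, \qquad b = \frac{S_{xy}}{S_{xx}}, \\
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \\
S_{xy} &= \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)\Bigl(\sum_{i=1}^{n} y_i\Bigr), \\
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2, \\
S_{yy} &= \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} y_i\Bigr)^2, \\
\delta_i &= y_i - (a + b x_i).
\end{aligned}
```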
And then we want to find the equation of the line which minimizes the sum of these squares, in other words, minimizes the sum of the delta squared values. That's why this is called the method of least squares, and these equations give us the line which accomplishes that.
A related question is: how good a fit is this line to the data? In the handbook, they give some measures of this, for example, the standard error of estimate, the confidence interval for the intercept, the confidence interval for the slope, etc. But what's usually more useful is the correlation coefficient between the variables, R, which is given by Sxy divided by the square root of Sxx times Syy, or the square of that, which is commonly called the R squared value. In the handbook they call this the coefficient of determination, but normally we just call it the R squared value.
So, going back to the previous example, the red line here is the linear regression line, which I obtained just by fitting a line in an Excel spreadsheet. This automatically gives me the equation of the line and also the R squared value. If I take the square root of that, I can see that the correlation coefficient between those two variables is approximately 0.89. In other words, the straight line in this case is quite a good fit to the data.
But what about other possibilities? This data also has the same line through it, but here the data is obviously much more scattered and the fit is not as good. The equation of the line is the same, but now you see that the R squared value is much smaller, or the correlation coefficient is much smaller. It's 0.47, which is a much poorer fit. Now suppose the scatter of the data is much less. This data also has exactly the same linear regression line, and I obtained it just by adding random values to the data. The equation is the same, and the R squared value in this case is 0.7823.
With R squared at 0.7823, the correlation coefficient, the square root of that, is 0.88, which is quite a good fit. And in the final example, the data are so closely correlated that you can't even see the difference between the data and the line. If I fit the equation through it, again I get the same equation, but now the agreement is perfect, so R squared is equal to 1 and the correlation coefficient is equal to 1.
So, generally speaking, as a rule of thumb we could say that it's a good fit if the correlation coefficient is greater than about 0.85, which is true for all of the cases here except the top right-hand one. If the correlation coefficient is zero, that means that the two variables are completely uncorrelated. On the other hand, if the two variables are perfectly correlated, as in the last example here, then the correlation coefficient is equal to 1.
Let me do a numerical example on that. I have this data series consisting of four points, x and y, as shown here. We can only do a numerical example with a few points because the computation rapidly becomes too cumbersome. Here we're going to answer two questions about this data. First of all, what is the correlation coefficient between those two variables? Which of those alternatives is it? And secondly, what is the equation of the linear regression line through the data? Which of those four alternatives is it?
So, first of all, I'll plot out the data and take a look at it. This is a graph of the data, x versus y. Just looking at it, it seems like, yes, those two variables are reasonably well correlated. If I just plug those data into Excel and compute the line through them, this is the linear regression line I get, and this is the equation of the line and the R squared value. So we've already answered both of those questions; it's given there. But let's go through the calculation and see how those values arise.
So, to do that, we have to compute a number of quantities, for example Sxy and the others. I've added some extra columns and rows to the table here. The first extra column is the product xi yi, so for example, 2 times 8 is 16. The second column is x squared, so for example, 2 squared is 4. And the last column is y squared, for example, 8 squared is 64. The first additional row is the summation of each of those columns, so this number is summation x, this is summation y, this is summation xy, etc. And finally, the additional row on the bottom is the average values, so x bar is 4.75 and y bar is 12.0.
Now that we have all these values, we can compute these quantities here. Sxy is equal to the summation of xi yi, which is 262, minus 1 over n, and there are four samples so that's a quarter, multiplied by the summation of xi, which is 19, multiplied by the summation of yi, which is 48, which gives me 34.00 for Sxy. Next, I compute Sxx. That is equal to summation xi squared, which is 109, minus 1 over n multiplied by the summation of xi, all squared, in other words 19 squared, and the answer is 18.75. Next, we'll compute Syy, which is given by this expression here. That is equal to summation yi squared, which is 648, minus 1/4 multiplied by the summation of yi, all squared, in other words 48 squared, and that is equal to 72.00.
So now I can compute the correlation coefficient, which is Sxy divided by the square root of Sxx Syy, which is equal to 34.00 divided by the square root of 18.75 times 72.00, which is equal to 0.93, and the answer is C. Note that if I square that, I get the R squared value of 0.86, which agrees with what Excel told me over here. Next, we want to find the equation of the linear regression line.
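The hand calculation above can be checked with a few lines of Python, and the same five sums also give the slope and intercept of the line. This is only a sketch: the variable names are illustrative assumptions, and the inputs are the column sums quoted from the table in the example rather than the raw data points.

```python
import math

# Column sums quoted in the worked example (four data points).
n = 4
sum_x = 19.0     # summation of x_i
sum_y = 48.0     # summation of y_i
sum_xy = 262.0   # summation of x_i * y_i
sum_x2 = 109.0   # summation of x_i squared
sum_y2 = 648.0   # summation of y_i squared

# Handbook quantities built from those sums.
s_xy = sum_xy - sum_x * sum_y / n    # 262 - 228
s_xx = sum_x2 - sum_x ** 2 / n       # 109 - 90.25
s_yy = sum_y2 - sum_y ** 2 / n       # 648 - 576

# Correlation coefficient and R squared.
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

# Slope and intercept of the least-squares line y_hat = a + b x.
b = s_xy / s_xx
a = sum_y / n - b * sum_x / n        # y_bar - b * x_bar

print(f"Sxy = {s_xy:.2f}, Sxx = {s_xx:.2f}, Syy = {s_yy:.2f}")
print(f"R = {r:.2f}, R squared = {r_squared:.2f}")
print(f"y_hat = {a:.2f} + {b:.2f} x")
```

Rounded to two decimal places, this reproduces the values quoted in the example: 34.00, 18.75, and 72.00 for the three sums, and 0.93 and 0.86 for R and R squared.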
Here is the general expression for the linear regression line in the notation of the reference handbook: y hat is equal to a plus bx, where the slope of the line, b, is equal to Sxy divided by Sxx. That is equal to 34.00 divided by 18.75, so the slope is 1.81. The intercept a is given by y bar minus b x bar, where y bar and x bar are computed as normal, and we already have those values over here: y bar is 12.00 and x bar is 4.75. So, substituting in, we find that a is equal to 12.00 minus the slope, which is about 1.81, times 4.75, which tells us that the intercept is 3.39. That is with the unrounded slope carried through; with the rounded value of 1.81 you would get 3.40. Substituting those values of a and b back into the equation for y hat, we see that y hat is equal to 3.39 + 1.81x, so the correct answer is A. And again, if we look at the equation here which Excel gave us, we see that it does indeed agree.
Now I'll just make a couple of final comments about probability and statistics. There are a great many other topics which are covered in the reference handbook. For example, one thing that I haven't covered here is hypothesis testing by means of one-way analysis of variance, or ANOVA, and the corresponding tables here for one-way and two-way analysis. But I'm not going to cover those, because I don't think they're very likely to occur in the exam. I also haven't covered the fundamental definitions of sets. I would also mention that there is a table given in the book here of probability and density functions. The first two of these would probably be useful: the binomial coefficient and the binomial distribution occur in topics that we've looked at previously. However, all the rest of these, for example the hypergeometric, Poisson, and geometric distributions, are given in the reference handbook, but I think they are quite unlikely to occur in the actual exam, so I won't cover them here.
So, this concludes my discussion of linear regression and this also concludes the module on probability and statistics.