In addition to modeling and prediction, we can also use linear regression models to do inference. In this video we're going to talk about hypothesis testing for the significance of a predictor and a confidence interval for the slope estimate. We're also going to talk a little more about conditions for regression, specifically what additional conditions may need to be satisfied if we want to be able to do inference based on these data.

In 1966, Cyril Burt published a paper called "The genetic determination of differences in intelligence: A study of monozygotic twins reared apart." The data consist of IQ scores for an assumed random sample of 27 pairs of identical twins, one twin raised by foster parents and the other by the biological parents. Later on, this study received a lot of criticism suggesting that the data may have been non-random, non-representative, or even entirely falsified. Regardless, for this example we're going to work with the original data set from the paper published back in 1966.

In the scatter plot we can see the relationship between the foster twins' IQs and the biological twins' IQs: as one goes up, the other goes up as well. We have a positive and relatively strong relationship, with a correlation coefficient of 0.882. The results of this study can be summarized using a regression output that looks something like this, with an estimate for the intercept as well as the slope. Based on this, we can write our linear model as: the predicted IQ score of the foster twin is 9.2076 plus 0.9014 times the biological twin's IQ. The 9.2 value is the intercept, and the 0.9 value is the slope. To assess the fit of the model, we can also take a look at R-squared. R-squared is 0.78, meaning that 78% of the variability in foster twins' IQs can be explained by the biological twins' IQs.

Within the framework of inference for regression, we're going to be doing a hypothesis test on the slope. The overall question we want to answer is: is the explanatory variable a significant predictor of the response variable? The null hypothesis, as usual, says there's nothing going on; in other words, the explanatory variable is not a significant predictor of the response variable, i.e., there's no relationship, or the slope of the relationship is 0. The alternative hypothesis says that there is something going on, that the explanatory variable is a significant predictor of the response variable; in other words, there is a relationship between these two variables, and the slope of the relationship is different from zero. In notation, the null hypothesis says that beta one is equal to 0 (remember, beta one is the population parameter for the slope), and the alternative hypothesis says that beta one is not equal to 0.

How do we go about actually carrying out this test? In linear regression we always use a t-statistic for inference, and remember that a t-statistic is a point estimate minus a null value, divided by a standard error. In this case, our point estimate is simply our slope estimate, b1, and our standard error is the standard error of that estimate. So the t-statistic for the slope is b1 minus 0 (remember, in the null hypothesis we set beta one equal to 0, which means no relationship, or a horizontal line), divided by the standard error of b1. And whenever we have a t-score, we also have a degrees of freedom associated with it; in this case, the degrees of freedom is n minus 2.
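If you want to see where a regression output like this comes from, here is a minimal R sketch. It assumes the Burt data are stored in a data frame called twins with columns foster and biological; those names are hypothetical placeholders, so substitute whatever your data are actually called.

```r
# Fit the regression of the foster twins' IQs on the biological twins' IQs.
# Assumes a data frame `twins` with numeric columns `foster` and `biological`
# (hypothetical names; substitute your own data).
m <- lm(foster ~ biological, data = twins)

# The summary shows the estimates for the intercept and slope, their standard
# errors, the t-statistics and p-values, and R-squared.
summary(m)
```

The Estimate column of this output is where values like the 9.2076 intercept and the 0.9014 slope show up, and R-squared appears near the bottom of the summary.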
Let's pause for a moment and think about why the degrees of freedom is n minus 2; we haven't really seen that before. In the past we've seen the degrees of freedom for a t-statistic equal n minus 1. Remember that with degrees of freedom, we always lose one degree of freedom for each parameter we estimate. When we fit a linear regression, even if we're only interested in the slope, we always end up estimating an intercept as well. Since we're estimating both an intercept and a slope, we lose two degrees of freedom, and that's why in linear regression the degrees of freedom associated with the t-score is n minus 2.

For calculating the test statistic, we're actually going to make use of the regression output, and you'll see that we don't have to do any hand calculations at all. The t-statistic, we said, is our point estimate, 0.9014, the point estimate for the slope, minus 0, the null value, divided by the standard error of the point estimate, which we can simply grab from the regression output as well. We're not going to ask you to calculate any of this by hand. You should know how the regression output works, and that's why we're going through the calculation of the t-score, but you're never going to be asked to calculate the standard error of the slope by hand; it's a tedious task that can be error prone, and we usually use computation for it. It is important, though, to understand what that standard error means and how the mechanics of the regression output work. If we do the math here, we get 9.36 for our t-score, which is simply the value that's already given on the regression output anyway.

The degrees of freedom is 27 twin pairs minus 2, which is 25, and the p-value is going to be the area under the t curve that's greater than 9.36 or less than negative 9.36; remember, we had a two-sided alternative hypothesis. This comes out to be a pretty low value, as you can imagine: 9.36 standard errors from the null value is a really unusual outcome, and therefore the p-value is approximately 0. We can see that the p-value is given as exactly 0 on the regression output, but note that that's simply rounded, meaning that even at four digits we still see essentially no probability. The p-value is probably never exactly equal to 0, but it's a very, very small number.

Just like we can do a hypothesis test for the slope, we can also construct a confidence interval. Remember, a confidence interval is always of the form point estimate plus or minus a margin of error. In this case, our point estimate is b1, and our margin of error can be calculated as usual: a critical value times a standard error. We said that in linear regression we always use a t-score, so we're going to use a t-star for our critical value, and the standard error of the slope, we said, comes from the regression output.

Using these, we can calculate the 95% confidence interval for the slope of the relationship between biological and foster twins' IQs. The degrees of freedom, we said, is 25, and what we want to do first is find the critical t-score associated with this degrees of freedom and the given confidence level. To find the critical t-score, let's draw our curve and mark the middle 95%, and note that each tail is now left with 2.5%, or 0.025. So the cutoff value, or the critical t-score, can be calculated in R using the qt function, as qt of 0.025 with 25 degrees of freedom.
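Here is a small R sketch of those calculations. The slope estimate comes straight from the regression output; the standard error isn't quoted in the video, so the 0.0963 below is just the approximate value implied by a slope of 0.9014 and a t-score of 9.36.

```r
b1    <- 0.9014   # slope estimate from the regression output
se_b1 <- 0.0963   # its standard error (approximate, implied by t = 9.36)
n     <- 27       # number of twin pairs
df    <- n - 2    # we estimate an intercept and a slope, so df = n - 2

# t-statistic for H0: beta1 = 0
t_stat <- (b1 - 0) / se_b1
t_stat                                             # about 9.36

# two-sided p-value: area in both tails beyond the observed t-score
2 * pt(abs(t_stat), df = df, lower.tail = FALSE)   # essentially 0

# critical t-score cutting off the lower 2.5% with 25 degrees of freedom
qt(0.025, df = 25)                                 # about -2.06
```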
This is going to yield a negative value, roughly negative 2.06, but note that for confidence intervals the critical value we use always needs to be positive, so t-star is simply 2.06. We take our slope estimate, 0.9014, plus or minus 2.06, the critical value, times the standard error that also comes from the regression output, and that gives us 0.7 to 1.1 as our confidence interval. What do these numbers mean? How do we interpret this confidence interval? Basically, it means that we are 95% confident that for each additional point on the biological twins' IQs, the foster twins' IQs are expected, on average, to be higher by 0.7 to 1.1 points.

So, to recap: we can do a hypothesis test for the slope using a t-statistic, where our point estimate is b1, we subtract from it a null value and divide by the standard error, and the degrees of freedom associated with this test statistic is n minus 2. To construct a confidence interval for the slope, we simply take our slope estimate b1 and add and subtract the margin of error, which is composed of a critical t-score and a standard error (a short R sketch at the end of this section puts these pieces together). Note that the null value is often 0, since we usually check for any relationship between the explanatory and the response variables. Also note that the regression output gives us b1, the estimate for the slope, the standard error for that estimate, and the two-tailed p-value for the t-test for the slope where the null value is 0. So if this is the standard test you're trying to do, you shouldn't have to do any hand calculations; you should simply be able to make your decision based on the p-value given to you on the regression output.

We didn't really talk about inference for the intercept here. We've been focusing on the slope because inference on the intercept is rarely done. Earlier we said that in some cases the intercept is actually not very informative, and usually when we fit a model, we want to evaluate the relationship between the variables involved in the model. The parameter that tells us about the relationship between those variables is the slope, not the intercept. So we're going to focus our inference for regression on the slope and not really worry about the intercept.

Before we wrap up, a few points of caution. Always be aware of the type of data you're working with: is it a random sample, a non-random sample, or population data? Statistical inference and the resulting p-value are completely meaningless if you already have population data. We usually use statistical inference when we have a sample and we want to say something about the unknown population. If you have a sample that is non-random, so it's biased in some way, note that the results arising from that sample are going to be unreliable as well. And lastly, remember that the ultimate goal is to have independent observations in order to do statistical inference, and by now in the course you should know how to check for independent observations. Remember, we like random samples, and we do like large samples, but we don't want them to be too large; we have the 10% rule that we check if we're sampling without replacement.
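To put the recap into code, here is a hedged sketch of the 95% confidence interval for the slope, reusing the approximate standard error implied by the regression output above.

```r
b1     <- 0.9014               # slope estimate from the regression output
se_b1  <- 0.0963               # its standard error (approximate)
t_star <- qt(0.975, df = 25)   # positive critical value, about 2.06

# 95% confidence interval: point estimate plus or minus the margin of error
b1 + c(-1, 1) * t_star * se_b1 # roughly 0.7 to 1.1
```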