So how do we find the equation of this best fit line? In SAS, the procedure that we will be using is called PROC GLM. GLM stands for general linear model. As you can see from the sample syntax, after PROC GLM; you use the word model, followed by the response variable, an equal sign, and then the explanatory variable, followed by a semicolon. For this sample research question from the Gapminder data set, we'll type PROC GLM; model internetuserate, which is our response variable, equal to urbanrate, which is our explanatory variable, followed, of course, by a semicolon. So let's run this program and look at the output. First, you can see the number of observations and the number of observations with complete data that were used in the model. Here we see the name of our response variable. The F statistic is 113.7 and the p value is very small, considerably less than our alpha level of .05, which tells us that we can reject the null hypothesis and conclude that urban rate is significantly associated with Internet use rate. Let's move to the parameter estimates at the end of our output. Here we have our estimates, also known as coefficients or beta weights, for both our intercept and for the variable urbanrate. The beta sub one value here is 0.72 and the beta sub zero value is -4.90. So we now know that the equation for the best fit line of this graph is internetuserate = -4.90 + 0.72 * urbanrate. Before we analyze this equation a little more in depth, let's look at some more components of our output. For example, looking just at the output for the coefficients, we have a column labeled Pr greater than the absolute value of t, which gives us the p value for our explanatory variable's association with the response variable. This p value will be the same one we would get if we ran a Pearson correlation on these two variables. The p value is 0.000, which means that it's really small. Here you would report the p value as p < .0001. 
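The least-squares fit that PROC GLM performs can be sketched outside of SAS as well. The Python snippet below is only an illustration on synthetic data (the real Gapminder values are not reproduced here), generated to roughly follow the slope and intercept reported above; it shows how the beta weights and the R-square come out of an ordinary least-squares fit:

```python
import numpy as np

# Synthetic stand-in for the Gapminder variables (assumption: not the real data).
rng = np.random.default_rng(0)
urbanrate = rng.uniform(10, 100, 200)  # explanatory variable (percent urban)
internetuserate = -4.90 + 0.72 * urbanrate + rng.normal(0, 10, 200)  # response + noise

# Ordinary least squares: beta_1 (slope) and beta_0 (intercept) of the best fit line.
beta1, beta0 = np.polyfit(urbanrate, internetuserate, 1)

# R-square: proportion of variance in the response explained by the model.
predicted = beta0 + beta1 * urbanrate
ss_res = np.sum((internetuserate - predicted) ** 2)
ss_tot = np.sum((internetuserate - internetuserate.mean()) ** 2)
r_square = 1 - ss_res / ss_tot
```

Because the data here are simulated, the fitted beta weights only approximate the 0.72 and -4.90 from the course output; the point is the mechanics, not the numbers.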
The GLM procedure also gives an R-square value, a value that we talked about in course two, Data Analysis Tools, in the module on Pearson correlation. It is the proportion of variance in the response variable that can be explained by the explanatory variable. We now know that this model accounts for about 38% of the variability we see in our response variable, Internet use rate. >> Look at how our equation is written. Y is a function of the variable x and some constant. Thus, as x changes, y will change with it. In building this model, we're saying that we believe that x relates to y in some meaningful way. What's exciting about this equation is that we can also use it to generate predicted values for y. The symbol that we use for predicted values of y is y hat. For example, let's say we're told that a country has 80% urbanization. Can we predict their level of Internet use? Yes. We just plug the value 80 into our equation where we have our x value. As you can see, in a country with 80% urbanization, we would expect 52.7 people out of every 100 to use the Internet. Also, note from our beta sub one that this value is how much Internet use would increase for every one unit increase in urban rate. For example, if we had a country with 81% urbanization, we would expect their Internet use rate to be 0.72 people higher, that is, almost one person, than a country with 80% urbanization. However, note that this is only the expected Internet use rate, given what we know about urbanization. It's the value that rests exactly on the best fit line. Unless our data were perfectly correlated, we would anticipate that our expected values and our observed values would differ from one another to some extent. From our analysis, we now know there's a statistically significant association between urban rate and Internet use rate. And we can also tell you what we would expect the Internet use rate to be for a given country, given its urban rate. 
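The plug-in arithmetic above can be written out directly. As a small illustrative sketch (in Python rather than SAS), using the coefficients reported in the output:

```python
def predicted_internet_use(urbanrate):
    """Best fit line from the PROC GLM output: y-hat = -4.90 + 0.72 * x."""
    return -4.90 + 0.72 * urbanrate

yhat_80 = predicted_internet_use(80)  # expected Internet use at 80% urbanization
yhat_81 = predicted_internet_use(81)  # expected Internet use at 81% urbanization

print(round(yhat_80, 1))              # 52.7, as in the example above
print(round(yhat_81 - yhat_80, 2))    # 0.72: the slope, the change per one-unit increase
```

The second printed value makes the interpretation of beta sub one concrete: moving from 80% to 81% urbanization raises the expected Internet use rate by exactly the slope, 0.72.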
This statistical model has opened the doors to being able to better understand what's really going on between Internet use rate and urbanization. As long as we keep in mind that we're limited by the fact that we imposed a causal model, rather than being able to directly test for causation, and that expected data is not the same as observed data, we're still able to explain much about this relationship of interest. >> For example, Canada has an urban rate of about 80%. However, its Internet use rate is observed at 81.3%, not 52.7%. This is exactly why we include an error term in our model. We are not perfect diviners of the future. What we can do with statistics, however, is identify trends in our data and use those trends to look at what we would expect our data to look like. These trends are incredibly important.
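The Canada example is exactly the error term in action: the residual is the gap between the observed value and the value the best fit line expects. A quick illustrative check of that gap (again in Python rather than SAS), using the numbers above:

```python
def predicted_internet_use(urbanrate):
    """Best fit line from the PROC GLM output: y-hat = -4.90 + 0.72 * x."""
    return -4.90 + 0.72 * urbanrate

observed = 81.3                        # Canada's observed Internet use rate
expected = predicted_internet_use(80)  # expected value at 80% urbanization (52.7)
residual = observed - expected         # the error term: observed minus expected
print(round(residual, 1))              # 28.6
```

Canada sits 28.6 points above the best fit line, a reminder that the line describes the trend across countries, not any single country exactly.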