This equation makes a lot of sense to us when we're working with a quantitative explanatory variable and a quantitative response variable. But what about a categorical explanatory variable and a quantitative response variable? It obviously wouldn't make very much sense, for example, for us to create a scatter plot and use gender as our predictor variable. However, a regression model will still be informative.

Let's look at the output testing the linear relationship between depression and the number of nicotine dependence symptoms, where major depression is a binary categorical explanatory variable and the number of nicotine dependence symptoms, ranging from zero to seven, is a quantitative response variable. Our research question is: is having major depression associated with an increased number of nicotine dependence symptoms?

In this code, the response variable comes first, then the explanatory variable. The full code is PROC GLM; model NDSymptoms=MAJORDEPLIFE/solution; with the model statement ending in a semicolon. The solution option is necessary in order to generate parameter estimates for the regression model. We see the same output format as with the Gapminder regression example: the number of observations, the number of observations with complete data that were used in the model, and the name of our response variable. And here are our parameter estimates and p-values. Thus, we know that our equation is NDSymptoms = 2.19 + 1.36(MAJORDEPLIFE).

>> Let's consider what this equation actually means, since it's not a best fit line of a scatter plot. We know that the variable MAJORDEPLIFE is our depression variable, and it takes on the value zero if the individual does not have major depression and the value one if the individual does have major depression. Thus, we can plug the values zero and one into our MAJORDEPLIFE variable to get the expected number of nicotine dependence symptoms for each group.
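Written out with the rounded estimates quoted above, the substitution looks like this. The hat marks an expected (fitted) value; that notation is an addition here and is not used in the lecture.

```latex
\begin{align*}
\widehat{\text{NDSymptoms}} &= 2.19 + 1.36 \times \text{MAJORDEPLIFE} \\
\text{MAJORDEPLIFE} = 0: \quad \widehat{\text{NDSymptoms}} &= 2.19 + 1.36(0) = 2.19 \\
\text{MAJORDEPLIFE} = 1: \quad \widehat{\text{NDSymptoms}} &= 2.19 + 1.36(1) = 3.55
\end{align*}
```

The intercept, 2.19, is the expected value for the group coded zero, and the slope, 1.36, is the difference between the two groups' expected values.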
>> As we can see, we would expect daily smokers without depression to have 2.19 nicotine dependence symptoms and daily smokers with depression to have 3.55 nicotine dependence symptoms. Remember that we previously subset our data to daily smokers aged 18 to 25. Notice that these are also the mean numbers of nicotine dependence symptoms for each group, which we can see by running summary statistics.

>> So although we may not be working with a best fit line, we are still generating important descriptive information out of this equation. Again, this does not mean that everyone in my sample with depression has exactly three and a half symptoms. Obviously no one can have half of a symptom. Our low R-squared value of 0.10 tells us that we're only capturing a small amount of the variability, 10%, in the number of nicotine dependence symptoms among daily smokers. But nonetheless, this is the value that we would expect given our data.

Also note that the categorical variable here is a binary categorical variable. If your categorical variable has more than two levels, you will need to create dummy variables for your analysis. We'll go over this process in supplementary material.

There are a lot of factors that contribute to internet use rate and to nicotine dependence, the response variables in each of my examples. If we had more information, and if we included those other factors in our model, it's quite possible that our expected values would be even closer to our observed values. We could include several explanatory, or predictor, variables in our model in order to evaluate the independent contribution of each explanatory variable in predicting our response variable, and also to evaluate whether specific variables confound the relationship between our explanatory variable of interest and our response variable.
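To make concrete the earlier point that plugging zero and one into the equation reproduces each group's mean, here is a minimal Python sketch. It is not the SAS code used in this lecture, the data are a made-up toy sample rather than the course data set, and the helper names `fit_simple_ols` and `r_squared` are hypothetical.

```python
from statistics import mean

# Hypothetical toy data, NOT the lecture's data set.
# x: 0 = no major depression, 1 = major depression
# y: number of nicotine dependence symptoms (0 to 7)
x = [0, 0, 0, 0, 0, 1, 1, 1]
y = [1, 2, 3, 2, 2, 3, 4, 5]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Return the share of variability in y captured by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


intercept, slope = fit_simple_ols(x, y)
r2 = r_squared(x, y, intercept, slope)

# Group means computed directly, as summary statistics would report them.
mean_group0 = mean([yi for xi, yi in zip(x, y) if xi == 0])
mean_group1 = mean([yi for xi, yi in zip(x, y) if xi == 1])

print("expected value, x = 0:", intercept, "| group mean:", mean_group0)
print("expected value, x = 1:", intercept + slope, "| group mean:", mean_group1)
print("R-squared:", r2)
```

With a single binary explanatory variable, the intercept matches the mean of the group coded zero and the intercept plus the slope matches the mean of the group coded one, which is why the regression output and the summary statistics agree.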