Up to this point, we've talked about multiple regression analysis with quantitative explanatory variables and with binary explanatory variables, which are categorical variables with two categories. But we haven't yet discussed what to do when we have a categorical explanatory variable that has more than two categories, and such variables are not uncommon. Fortunately, it is relatively simple to incorporate these types of explanatory variables into a multiple regression analysis. There are many different methods for examining explanatory variable group differences on a response variable. The type of comparison depends on how we choose to code our explanatory variable. The process of coding categorical explanatory variables is called dummy coding, or parameterization. These coding methods can produce group comparisons ranging from very simple to very complex. For example, if our response variable is number of nicotine dependence symptoms, we might want to compare the number of symptoms from one group to the average number of symptoms for the other groups combined. This type of comparison is called effect coding, or effect parameterization. In this course, we're going to use one of the most basic parameterizations, which is called reference group coding, or reference group parameterization. This method is very similar to the post hoc pairwise comparisons that you may have conducted as a follow-up to running an analysis of variance in the second course of this specialization, Data Analysis Tools. That is, reference group coding allows us to compare each of the other groups of our explanatory variable, referred to as the comparison groups, to one designated group, which is referred to as the reference group.
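To make reference group coding a little more concrete, here is a minimal sketch of the dummy variables it creates for a four-category variable. It assumes the patsy package (the formula library that statsmodels uses) and a tiny made-up data frame with one row per category code. The data frame and variable names are for illustration only and are not the course data.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical toy data: one row for each of four category codes (0, 1, 2, 3)
toy = pd.DataFrame({"ETHRACE": [0, 1, 2, 3]})

# C() marks the variable as categorical. With the default treatment
# (reference group) coding, the first category, code 0, is the reference.
design = dmatrix("C(ETHRACE)", toy, return_type="dataframe")

# Each comparison group gets its own 0/1 column; the reference group
# is the row where all of those columns are 0.
print(design)
```

The reference group has no column of its own. It is absorbed into the intercept, which is why each of the other columns ends up measuring a difference from the reference group.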
For example, if our response variable is the number of nicotine dependence symptoms, reference group coding allows us to compare the number of nicotine dependence symptoms for each group of our categorical variable to a designated reference group. However, unlike an analysis of variance post hoc test, for which we conduct the comparisons after testing the ANOVA, the comparisons are part of the estimation of the multiple regression model. This allows us to examine explanatory variable group differences on the response variable after adjusting for the other explanatory variables in the model. To demonstrate how to analyze a categorical explanatory variable with three or more categories, we will return to our NESARC data multiple regression analysis, predicting number of nicotine dependence symptoms from multiple explanatory variables. We will also add an ethnicity-race explanatory variable. Our ethnicity-race variable has four categories coded 0 = Hispanic, 1 = non-Hispanic White, 2 = non-Hispanic Black, and 3 = non-Hispanic Other ethnic or racial group. In this example, what we want to know is whether Hispanic individuals have more or fewer nicotine dependence symptoms compared to individuals from the other three racial-ethnic groups. That is, we want to compare Hispanic individuals, the reference group, to individuals from the other racial-ethnic groups, the comparison groups, on number of nicotine dependence symptoms after controlling for the other explanatory variables in the model. To do this, we will use the same smf.ols function that we used to test our earlier multiple regression model. So we have our regression equation, in which our NDsymptoms response variable is predicted by the explanatory variables DYSLIFE, MAJORDEPLIFE, numbercigsmoked_c, age_c, and SEX. We add our ethnicity-race variable, ETHRACE, to the list of explanatory variables.
But to tell Python that it is a categorical variable, we need to type a capital C and then put the name of the categorical variable in parentheses after the capital C. In this example, we want to compare the Hispanic group to the three other ethnicity-race groups, so Hispanic will be our reference group. If you remember, our ethnicity-race variable is coded 0 for Hispanic. The default in Python is reference group coding, which in Python is called treatment coding. And the default reference category is the first category in sorted order, which here is the group with a value equal to 0, Hispanic. Since this is what we're looking for in this example, we do not need to add any code to change the default. If we hadn't added a capital C with the ETHRACE variable in parentheses, Python would have assumed that our ethnicity-race variable was a quantitative variable and estimated a single slope across the codes 0 through 3, so the regression coefficient would make no sense. Here's the output. It is basically the same output that we have seen before from the smf.ols function. But if we look at our table of parameter estimates, we see that there are three regression coefficients for our categorical ethnicity-race variable. Note that there is no estimate for the Hispanic reference group. The T. and the number after it tell us that the treatment (that is, reference group) parameterization was used, and the number is the categorical variable code for the comparison group. For example, the non-Hispanic White group in our ETHRACE variable was coded 1. So T.1 indicates the regression coefficient for the comparison of the non-Hispanic White group to our Hispanic reference group. The three regression coefficients compare each of the other ethnicity-race groups to the Hispanic group. We can see that none of these three groups were significantly different from the Hispanic group in number of nicotine dependence symptoms, because the p values all exceed our alpha level of .05.
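As a rough sketch of what this model looks like in code, the example below fits the regression with C(ETHRACE) using the default reference group. We do not have the NESARC data here, so it assumes a small randomly generated data set that simply reuses the variable names from the lecture. The numbers it produces are therefore made up and will not match the output discussed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT NESARC); names mirror the lecture's variables
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "DYSLIFE": rng.integers(0, 2, n),
    "MAJORDEPLIFE": rng.integers(0, 2, n),
    "numbercigsmoked": rng.integers(1, 41, n).astype(float),
    "age": rng.integers(18, 26, n).astype(float),
    "SEX": rng.integers(0, 2, n),
    "ETHRACE": rng.integers(0, 4, n),  # 0=Hispanic, 1=White, 2=Black, 3=Other
})
# Center the quantitative explanatory variables
data["numbercigsmoked_c"] = data["numbercigsmoked"] - data["numbercigsmoked"].mean()
data["age_c"] = data["age"] - data["age"].mean()
# Made-up response so that the model has something to estimate
data["NDsymptoms"] = (2 + 1.3 * data["MAJORDEPLIFE"]
                      + 0.04 * data["numbercigsmoked_c"]
                      + rng.normal(0, 1.5, n))

base = "NDsymptoms ~ DYSLIFE + MAJORDEPLIFE + numbercigsmoked_c + age_c + SEX"

# C(ETHRACE): categorical, default treatment coding, reference group = code 0
model = smf.ols(base + " + C(ETHRACE)", data=data).fit()
print(model.summary())

# One coefficient per comparison group; none for the reference group
ethrace_terms = [name for name in model.params.index if name.startswith("C(ETHRACE)")]
print(model.pvalues[ethrace_terms])

# Without C(), ETHRACE is treated as quantitative and gets a single slope
numeric_model = smf.ols(base + " + ETHRACE", data=data).fit()
print(numeric_model.params["ETHRACE"])
```

Comparing the two fits shows the point about the capital C: the categorical version has three ETHRACE coefficients, while the version without C() has only one, which treats the category codes as if they were a quantity.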
As with the previous regression analysis, we see that lifetime major depression and number of cigarettes smoked are positively associated with number of nicotine dependence symptoms. If we wanted to make other comparisons, for example, to compare non-Hispanic White to non-Hispanic Black, then we would need to override the default reference group so that the value of 1 in the ETHRACE variable, which indicates the non-Hispanic White group, is used as the reference group. The code here shows how to do it. It's mostly the same code, but now, because we are not using the default, we need to add some code to tell Python to continue to use treatment, or reference group, coding and to designate the reference group. We do this by adding a comma after the name of our ETHRACE variable inside the parentheses, then Treatment with a capital T, and then, within another set of parentheses, reference=1. This additional Python code provides a comparison of the three other ethnicity-race groups to the non-Hispanic White group. Here's the output. Now the group coded 1 no longer has a parameter estimate, and the other coefficients, T.0, T.2, and T.3, compare each of the other three racial-ethnic groups to the non-Hispanic White group. Participants in the non-Hispanic Other ethnic-racial group had a significantly greater number of nicotine dependence symptoms compared to non-Hispanic White participants. There are no significant differences for Hispanic and non-Hispanic Black participants compared to non-Hispanic White participants.
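Here is a minimal sketch of overriding the reference group with Treatment(reference=1). As before, it assumes a small randomly generated data set in place of NESARC, with fewer explanatory variables to keep it short, so the estimates are made up. It also checks one thing we would expect: changing the reference group changes which comparisons are reported, but not the model's predictions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT NESARC); names mirror the lecture's variables
rng = np.random.default_rng(1)
n = 400
data = pd.DataFrame({
    "MAJORDEPLIFE": rng.integers(0, 2, n),
    "SEX": rng.integers(0, 2, n),
    "ETHRACE": rng.integers(0, 4, n),  # 0=Hispanic, 1=White, 2=Black, 3=Other
})
data["NDsymptoms"] = (2 + 1.3 * data["MAJORDEPLIFE"]
                      + 0.5 * (data["ETHRACE"] == 3)
                      + rng.normal(0, 1.5, n))

# Default: reference group is code 0 (Hispanic)
default_fit = smf.ols(
    "NDsymptoms ~ MAJORDEPLIFE + SEX + C(ETHRACE)", data=data).fit()

# Override: reference group is code 1 (non-Hispanic White)
releveled = smf.ols(
    "NDsymptoms ~ MAJORDEPLIFE + SEX + C(ETHRACE, Treatment(reference=1))",
    data=data).fit()
print(releveled.summary())

# Coefficient labels now end in [T.0], [T.2], [T.3]; code 1 has no estimate
ethrace_terms = [name for name in releveled.params.index if "ETHRACE" in name]
print(ethrace_terms)

# Hispanic vs. White is the mirror image of White vs. Hispanic
t0_name = [name for name in ethrace_terms if name.endswith("[T.0]")][0]
same_gap = np.isclose(releveled.params[t0_name],
                      -default_fit.params["C(ETHRACE)[T.1]"])

# Both parameterizations describe the same fitted model
same_fit = np.allclose(default_fit.fittedvalues, releveled.fittedvalues)
print(same_gap, same_fit)
```

So the choice of reference group is about which pairwise comparisons we want to read directly from the table of parameter estimates, not about fitting a different model.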