So, in this section, we'll look at some real data examples of results from simple logistic regression models where we have either a binary or categorical predictor, and we'll focus on interpreting the results in terms of the slope and intercept estimates. After viewing this lecture, you should continue to grow your understanding of how logistic regression relates a function of the probability or proportion of a binary outcome occurring to a predictor via a linear equation. You should be able to interpret the resulting intercepts and slopes from a logistic regression model in which the predictor of interest is either binary or categorical.

So, let's go back to our data set from the National Health and Nutrition Examination Survey, collected in the 2013 to 2014 wave. The data include 10,000-plus observations on persons zero to 80 years old, and we have a subset of data on 5,847 adults, aged greater than or equal to 18 years, for whom we have body mass index (BMI). Body mass index is used to classify the physical health of persons, usually at the population level, because it's not a perfect measure of individual health, but there are cutoffs used to define different weight categories, and obesity in adults is defined by a body mass index of greater than or equal to 30. So, we're going to apply that cutoff and look at the relationship between sex and obesity in this sample of persons.

Of these 5,000-plus people, there are 3,056 females, of which over 1,200 are classified as obese by that BMI cutoff. So, the proportion of females that are obese is 1,253 out of the total of 3,056 females, or 41 percent, and the proportion for males is 899 obese males out of 2,791 males total, which is numerically lower at 32.2 percent. So, we've got the estimated risks, or proportions, of the outcome in both sex groups. We can compute the risk difference by taking the p-hat for females and subtracting the p-hat for males, or in the other direction if we were so interested. We can compute the relative risk by taking the ratio of those p-hats, but we can also compute the odds ratio by hand here based on those estimated p-hats. The odds of being obese for females is the proportion who were obese, 41 percent, divided by the proportion who were not, 59 percent, and the odds for males is the proportion who were obese among males, 32.2 percent, divided by the 67.8 percent who were not. If we take the ratio of these odds, it turns out to be about 1.46: in this sample, females have 46 percent greater odds of being obese than males, based on that cutoff of BMI greater than or equal to 30.

The resulting logistic regression equation for this analysis is, again, on the log odds scale. Our outcome y equals one is being obese, so we're modeling the log odds of y equals one, the log odds of obesity. If we fit these data using the computer, we get an equation that looks like this: an intercept of negative 0.74 plus a slope of 0.38 times x1, where in this coding scheme I've made x1 equal to one for females and zero for males. We could certainly do it in the other direction; we'd get different results numerically, but the overall picture would remain the same. So, I'll just put this out here: could you figure out what the resulting intercept and slope would be if we had coded one for males and zero for females instead? We'll go over that in the additional exercises.
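To make the arithmetic concrete, here is a minimal sketch in Python (my addition, not part of the original lecture) that reproduces both the by-hand odds ratio and the fitted equation from the 2x2 counts above. Using statsmodels' grouped-binomial GLM is my choice of tool, not necessarily what was used to fit the lecture's model:

```python
import numpy as np
import statsmodels.api as sm

# 2x2 counts from the NHANES 2013-2014 example:
# 1,253 of 3,056 females and 899 of 2,791 males classified obese (BMI >= 30)
obese = np.array([1253, 899])
total = np.array([3056, 2791])
female = np.array([1, 0])  # x1 = 1 for females, 0 for males

# Odds and odds ratio "by hand" from the estimated proportions (p-hats)
p_hat = obese / total                # ~0.410 for females, ~0.322 for males
odds = p_hat / (1 - p_hat)
print(f"odds ratio, females vs males: {odds[0] / odds[1]:.2f}")  # ~1.46

# The same analysis as a logistic regression on the grouped counts:
# endog is (successes, failures); the logit link is the Binomial default
X = sm.add_constant(female)
fit = sm.GLM(np.column_stack([obese, total - obese]), X,
             family=sm.families.Binomial()).fit()
b0, b1 = fit.params
print(f"intercept: {b0:.2f}, slope: {b1:.2f}")  # ~ -0.74 and ~0.38
```

Exponentiating the slope, np.exp(b1), returns the same 1.46 odds ratio, which is the equivalence the lecture walks through next.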
So, for this example, we're only estimating the log odds of the outcome for two groups. For females, who are coded x1 equals one, the log odds of being obese is the intercept of negative 0.74 plus the slope of 0.38 times one, the indicator of being female. For males, who are the reference group in this comparison, x1 equals zero, so the log odds of obesity for the males is just the intercept of negative 0.74. If we take the difference in these log odds, the difference in the log odds for two groups who differ by one unit in x1, females compared to males, it's just that slope. So, the slope Beta one hat equals 0.38 estimates the difference in the log odds of obesity for females compared to males. It is the difference in the log odds of obesity for a one-unit difference in x1, and of course here, because we only have two x1 values, one or zero, the only possible one-unit difference is between females, who are coded one, and males, who are coded zero.

Now, a difference in log odds doesn't sound that useful or intuitive. But recall, a difference in log odds can be re-expressed as the log of an odds ratio. So, Beta one hat equals 0.38 is not only the difference in the log odds of obesity for females compared to males, but it can be re-expressed as the log of the odds ratio of obesity for females to males. So, we are one scale removed from a measure of association we've used before, the odds ratio. To get it off the log scale and get the estimated odds ratio, we can exponentiate that slope Beta one hat. If we take e to the 0.38 power, it turns out to be 1.46, which we saw already when we computed this by hand in the two-by-two table format. These two approaches will sync up perfectly because they are mathematically equivalent.

Again, the resulting logistic regression equation for this analysis is that the log odds of obesity is equal to the intercept of negative 0.74 plus the slope of 0.38 times x1, where x1 equals one for females and zero for males. So, if we focus on the males only, the group with x1 equals zero, their estimated log odds is simply the intercept. So, Beta-nought-hat equals negative 0.74 estimates the log odds of obesity for males, in other words, the log odds of obesity when x1 equals zero. If we were to exponentiate that intercept, e to the negative 0.74 power, we get the odds of obesity for males, approximately 0.48. We could re-express this as an "a to b" odds, but odds at the individual level are not that intuitive a measure, in my opinion. We'll show shortly that if we have the odds estimate for any group, like the odds of obesity for males here, we will be able to translate that back into a proportion or probability.

Let's look at another example that we used throughout the first term and into the second term as well: response to antiretroviral therapy amongst 1,000 HIV-positive individuals sampled from a citywide population of HIV-positive persons. One of the goals of this study was to look at predictors of response to antiretroviral therapy, and one of the things we were interested in looking at is the CD4 count at baseline, the time they started the treatment. This was split into two groups: those who had baseline CD4 counts of less than 250 cells per cubic millimeter and those who had CD4 counts of greater than or equal to 250. Amongst the 503 persons who had lower CD4 counts, 127 responded to therapy, for a response proportion of 25.2 percent.
In the group with the larger starting CD4 counts, those with counts greater than or equal to 250 cells per cubic millimeter, 15.9 percent responded. So, if we were to work by hand from this two-by-two table setup and these estimated p-hats, and estimate the odds ratio based on these numbers, it turns out to be 1.78. The odds ratio of treatment response for subjects with baseline CD4 less than 250, compared to subjects with baseline CD4 greater than or equal to 250, is 1.78. So, persons with lower CD4 counts when they started therapy have 78 percent greater odds of responding to therapy than persons who had higher CD4 counts. This doesn't quantify the increase in the probability on either the relative risk or risk difference scale, but we can say that if the odds are larger for the first group than the second, then the resulting probability or proportion who responded is higher as well. We can already see that from the proportions here, but if we only had the odds ratio, we could still make that general statement.

So, the resulting logistic regression equation for this analysis is that the log odds of response to therapy is equal to negative 1.67 plus a slope of 0.58 times x_1, where x_1 again is an arbitrary coding of one for subjects with baseline CD4 count less than 250 and zero for subjects with baseline CD4 count greater than or equal to 250. We could have coded it in the opposite way as well, and the results would be different numerically for the equation, but overall the end results would be the same. For subjects in the group with x_1 equal to one, those with CD4 counts of less than 250, the log odds of response is equal to negative 1.67 plus the slope of 0.58 times one. For subjects with baseline CD4 greater than or equal to 250, the log odds of response is simply equal to the intercept. So again, the difference in the log odds for these two groups, who differ by one unit in the x_1 predictor, is simply that slope of 0.58. The slope beta one hat equals 0.58 estimates the difference in the log odds for subjects with baseline CD4 count less than 250, compared to subjects with baseline CD4 count greater than or equal to 250, two groups who differ by one in the x value.

Recall again, a difference in log odds can be re-expressed as the log of an odds ratio. So, another way to interpret this, perhaps more intuitively, is that this slope of 0.58 is the log of the estimated odds ratio of response for subjects with baseline CD4 count of less than 250, compared to subjects with baseline CD4 count of greater than or equal to 250. To get that odds ratio from this analysis, we would antilog, or exponentiate, that log odds ratio. In other words, exponentiate the slope: e to the 0.58 power gives us an odds ratio estimate of 1.78, exactly what we got by doing it the old-fashioned way of computing the p-hats for both groups and then computing the odds ratio from those. Again, this approach is mathematically equivalent to what we would do if we had the two-by-two table; you could show with some algebra that they sync up.
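As a quick check of that equivalence, here is a short sketch (again my addition, using the proportions as reported in the lecture):

```python
import math

# Estimated response proportions, as reported in the lecture
p_low = 0.252   # baseline CD4 < 250 (127/503 responded)
p_high = 0.159  # baseline CD4 >= 250

# Odds ratio "by hand" from the two p-hats
odds_low = p_low / (1 - p_low)
odds_high = p_high / (1 - p_high)
print(f"odds ratio by hand: {odds_low / odds_high:.2f}")  # ~1.78

# Exponentiating the fitted slope recovers the same odds ratio
print(f"exp(0.58) = {math.exp(0.58):.2f}")  # ~1.79; matches 1.78 up to rounding of the slope
```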
So again, the resulting logistic regression equation for this analysis is that the log odds of response to therapy equals the intercept of negative 1.67 plus a slope of 0.58 times x_1, where x_1 is one for subjects with baseline CD4 count of less than 250 and zero for subjects with baseline CD4 count greater than or equal to 250. For those with x_1 of zero, the reference group with baseline CD4 count greater than or equal to 250, the estimated log odds of response is simply the intercept of negative 1.67. If we exponentiate the intercept, we get the estimated odds of response to therapy in the group with CD4 counts of greater than or equal to 250: the estimated odds is 0.19. Again, not particularly easy to interpret per se, but we'll show again soon that we can translate that easily into an estimated proportion or probability of those who respond in that group.

What if we have a multi-categorical predictor? How will we handle that in terms of x's? Well, we're going to handle it exactly like we did with linear regression. We're going to designate one of the categories as a reference group, and then create indicators for each of the other categories, such that the slopes for each of those indicators compare that category to the same reference group.

So, this is a study published in JAMA in 2010 by the Consortium on Safe Labor, titled "Respiratory morbidity in late preterm births." The objective of this study was to assess short-term respiratory morbidity in late preterm births compared with term births in a contemporary cohort of deliveries in the United States. They collected data from 12 institutions, constituting 19 hospitals, in the United States, and over 230,000 deliveries occurring between the years 2002 and 2008. They abstracted charts for all neonates with respiratory compromise admitted to the NICU, and then the late preterm births were compared with term births with regard to various respiratory outcomes. What they ultimately did, and something we're building up towards in this course, is something called a multivariate logistic regression analysis. They were able to look at multiple predictors of respiratory failure at once, with a particular eye toward gestational age, but they wanted to control for other factors that may be related to gestational age and respiratory outcomes. We're going to start with a simple logistic regression where we only look at the unadjusted association between respiratory failure and gestational age. Then we'll build up to the multivariate analysis and use this as an example later in the course.

So, just to give you a sense of what the sample looks like in terms of the main predictor of interest, I've categorized gestational age into four categories here: 34 weeks, 35 weeks, and 36 weeks, which are all considered preterm, and then 37 to 40 weeks, which in this sample was considered term or full term. You can see the majority of the infants in the sample, 90 percent, were full term; another five percent had gestational ages of 36 weeks, another three percent 35 weeks, and the remaining two percent 34 weeks. So, even though the gestational age categories are ordinal, the authors did not want to assume that the relationship between the log odds of respiratory failure and the gestational age categories was necessarily linear. I'll speak to this in the additional exercises section; after we talk more about the idea of the linearity assumption linking the log odds to a continuous predictor in the next section, I'll come back and fully address that statement. So, what they did was treat the four categories as if they were nominal, even though, again, they are ordinal.
So, they made the full-term group, those at gestational ages of 37 to 40 weeks, the reference, and then they made binary indicators for each of the other three preterm categories. They made a variable called x1 equal to one if the child's gestational age was 34 weeks, and zero if not, that is, if it were in any of the other three categories including the reference. A variable called x2 is equal to one if the child's gestational age was 35 weeks, and zero if not, and x3 is equal to one if gestational age was 36 weeks, and zero if not.

So, the resulting logistic regression equation for the log odds of respiratory failure looks complicated, because we have three x's and it looks like a large linear equation, but you have to keep your eyes on the prize: there are only four gestational age categories, and we're ultimately estimating four log odds from this equation. It's equal to the intercept, plus a slope times the indicator of gestational age of 34 weeks, plus a slope times the indicator of gestational age of 35 weeks, plus a slope times the indicator of gestational age of 36 weeks. If we parse this, the intercept is the estimated log odds of respiratory failure when all x's are zero. That's for the reference group, the full-term children, so it's the log odds of respiratory failure for full-term children. Beta one hat compares those with an x1 value of one to the same reference group of 37 to 40 weeks. This is the indicator for 34 weeks, so Beta one is the difference in the log odds of respiratory failure for children born at 34 weeks compared to the reference children born at 37 to 40 weeks. Because it's a difference in log odds, we can re-express it as the log of the ratio of those odds, or the log of an odds ratio. Beta two, going through the same logic, is similarly the log odds ratio of respiratory failure for children with gestational age of 35 weeks to the same reference, and Beta three is the log of the ratio of the odds of respiratory failure for those born at 36 weeks to the odds of respiratory failure for the full-term group.

Here are the results from these data: the intercept is negative 5.5, the slope for the indicator of 34 weeks is 3.4, the slope for the indicator of 35 weeks is 2.8, and the slope for the indicator of 36 weeks is 2.0. So, you can see that each of the preterm groups has higher log odds, and hence higher odds, and hence higher probability, of respiratory failure than the reference of full-term children. The log odds ratio, the slope comparing the log odds of respiratory failure for those born at 34 weeks to the reference group, is 3.4. If we exponentiate that, we get the odds ratio, and it's a very large number, 30, indicating that children born at 34 weeks have a much larger odds, 30 times the odds, of respiratory failure compared to full-term children. As we move to 35 weeks, the odds ratio goes down: the log odds ratio is 2.8, and the resulting odds ratio is 16.4, which is certainly less than 30. So, children with gestational ages of 35 weeks have lower odds than children born at 34 weeks, but they still have much higher odds than the full-term reference group: those at 35 weeks have 16.4 times the odds of respiratory failure as compared to the same reference of 37 to 40 weeks. Then finally, those born at 36 weeks also have higher odds, but not as much as the other two groups when compared to the same reference; the resulting odds ratio is 7.4.
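Here is a short sketch (mine, with hypothetical gestational ages just to illustrate the coding) of how the indicators are built and how each slope exponentiates to the odds ratios quoted above:

```python
import math

# Hypothetical gestational ages, to illustrate the indicator (dummy) coding
ga_weeks = [34, 35, 36, 39]
x1 = [int(g == 34) for g in ga_weeks]  # 1 if 34 weeks, else 0
x2 = [int(g == 35) for g in ga_weeks]  # 1 if 35 weeks, else 0
x3 = [int(g == 36) for g in ga_weeks]  # 1 if 36 weeks, else 0; the 37-40 week
                                       # reference has x1 = x2 = x3 = 0

# Coefficients quoted in the lecture (log odds scale, reference 37-40 weeks)
slopes = {"34 weeks": 3.4, "35 weeks": 2.8, "36 weeks": 2.0}
for group, b in slopes.items():
    # Each slope is a log odds ratio vs. the same full-term reference
    print(f"{group}: odds ratio = {math.exp(b):.1f}")
# 34 weeks: 30.0, 35 weeks: 16.4, 36 weeks: 7.4
```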
In summary, logistic regression, again, is a method for relating a binary outcome to a predictor via a linear equation. The predictor can be binary or categorical, and we've covered those two cases in this section; in the next section we'll hit on how to interpret the results when the predictor is continuous. When the predictor is binary, we have an equation that looks like this: the log odds of our outcome y equals one is equal to some intercept plus some slope times x1. The slope Beta one hat is the estimated log odds ratio of y equals one for the group with x1 equals one, compared to the reference group with x1 equals zero, and this slope can be exponentiated to get an estimated odds ratio. The intercept Beta-nought-hat is the estimated log odds of y equaling one for the group with x1 equals zero, and this can be exponentiated to get the estimated odds for that reference group.

When the predictor is multi-categorical with p categories, we just extend the equation: we choose one of the groups to be our reference, and then we have p minus one indicators for the remaining p minus one groups. The slopes for these p minus one indicators are the estimated log odds ratios of y equals one for the respective group with x sub i equals one, compared to the same reference group for all comparisons, the group with all x's equal to zero. These resulting slopes can also be exponentiated to get the estimated odds ratios of the outcome for each of the non-reference groups compared to the same reference. The intercept Beta-nought-hat is the estimated log odds that y equals one for the reference group, with x1 through x sub p minus one all equal to zero. So, this is the estimated log odds for the reference group, and this result can be exponentiated to get the estimated odds.

So, you might say: John, we've done nothing new here. We are just computing odds ratios between two groups at a time when we have a binary outcome, whether it be with a single binary predictor or with multiple binary comparisons when we have a multi-categorical predictor. So, why are we doing this? Why are we making things more complicated? Don't the regressions shown in this section represent analyses that were done in Statistical Reasoning 1, analyses we started with in the first two examples where we had a two-by-two table? The end results are odds and odds ratios that could have easily been computed without using logistic regression. I absolutely agree with you, but what we will be able to do with regression that we couldn't do when we looked at things in just a simple two-by-two table is extend these models to include multiple predictors in one fell swoop. That will allow us not only to better predict the likelihood, through the log odds, of a binary outcome by taking into account multiple factors at once, but we'll also see that it provides an easy mechanism for computing adjusted associations between our outcome and each predictor, adjusted for the other predictors in the model, and this is really useful in situations where we potentially have confounding.
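Finally, the lecture promises a few times that an estimated odds for a group can be translated back into a proportion or probability. As a preview, the arithmetic is just p = odds / (1 + odds); here is a sketch (my addition) applying it to the two reference-group intercepts from this section:

```python
import math

def odds_to_prob(odds):
    # Invert odds = p / (1 - p) to get p = odds / (1 + odds)
    return odds / (1 + odds)

# NHANES males: intercept -0.74, so odds = exp(-0.74), roughly 0.48
print(f"{odds_to_prob(math.exp(-0.74)):.3f}")  # ~0.323, the 32.2% observed

# CD4 >= 250 group: intercept -1.67, so odds = exp(-1.67), roughly 0.19
print(f"{odds_to_prob(math.exp(-1.67)):.3f}")  # ~0.158, the 15.9% observed, up to rounding
```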