In this section, we'll consider simple logistic regression models where our predictor is continuous. Some of the things I hope you'll get out of this lecture set: you'll begin to understand why we need to put things on the log odds scale to estimate the logistic regression equation, and why we can't just start with the outcome of 1 or 0, or the probability of the outcome being 1, and model that as a linear function of x, especially when x is continuous. It will also bring back the idea of a LOWESS smoother plot, which gives you a snapshot, if you will, a rough scatter plot of the relationship between the log odds of y equaling 1 and the continuous predictor x1, to see if it meets the linearity assumption of the model. We'll also interpret the slope and intercept from simple logistic regression models where x1 is continuous, translate the estimated slope into an estimated odds ratio, and show again that the estimated intercept is an estimated log odds: when x is continuous, it's the estimated log odds when x equals 0, which may or may not be relevant to our analysis.

Why model the log odds as a linear function of a continuous predictor x, instead of modeling something more intuitive, like the odds without logging it, or the probability or proportion that y equals 1? Here's the reason, and it's particularly germane to situations where our predictor is continuous. Our outcome y is binary, and perhaps the most useful summary measure of it for any single sample is p hat, the probability or proportion of y's that equal 1. But the way we define proportions, the number with value equal to 1 divided by the total number in the sample, means they have to live between 0 and 1; they're very restricted in value. So if we tried to model the proportion as a linear function of x, we would need to constrain our estimates of the intercept and slope such that all predicted proportions, for any given value of x, were between 0 and 1. That's a very difficult mathematical challenge, and it makes it hard to freely estimate an equation from the data we've got.

You might say, "Well, how about transforming it not to the log odds, but just to the odds, the odds of y equaling 1, and modeling that as a linear function of your predictor?" The odds certainly lives on a larger range than the probability. If the proportion or probability of the outcome is 0, then the odds is 0, but as the proportion or probability gets closer to 1, p hat over 1 minus p hat approaches infinity. We open up the range, but notice that it includes only positive real numbers, not negative numbers. Again, if we were estimating the odds as a linear function of an intercept plus a slope times x1, we'd have to do a constrained estimation so that every predicted odds for a given value of x came out between 0 and infinity.

Finally, if we take the log of the odds, we open up our estimates to all real values, negative and positive. The log of the odds can live on the entire real number line, between negative infinity and positive infinity, so any values of beta naught hat and beta 1 hat are technically allowable, and that makes for a much easier estimation process.
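As a quick numerical illustration (this code is mine, not part of the original lecture materials), here is a minimal Python sketch of how the same set of proportions maps onto the odds and log odds scales, with the range expanding at each step:

```python
import numpy as np

# Proportions are confined to (0, 1); odds to (0, infinity); log odds to all reals.
p = np.array([0.01, 0.25, 0.50, 0.75, 0.99])
odds = p / (1 - p)         # odds that y = 1
log_odds = np.log(odds)    # log odds that y = 1

for row in zip(p, odds, log_odds):
    print("p = %4.2f   odds = %8.3f   log odds = %7.3f" % row)
```

Note that the log odds scale is symmetric about 0: p = 0.5 maps to a log odds of 0, and complementary proportions like 0.25 and 0.75 map to log odds of equal magnitude and opposite sign.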
By transforming things to the log odds scale, we can estimate freely with the data at hand, without worrying about constraining the intercept and slope so that the results for any given value of x1 adhere to a restricted scale. With some algebra, you can show that if the log odds of an outcome, the log of p over 1 minus p, where p is the proportion or probability of the outcome occurring, is equal to a linear function of x1, then the estimated probability of the outcome is equal to the odds over 1 plus the odds. In terms of our linear equation, the odds for any given value of x1 comes from exponentiating the regression equation, e to the intercept plus the slope times x1, and 1 plus the odds is just 1 plus that same quantity. In other words, p hat = e^(beta naught hat + beta 1 hat times x1) / (1 + e^(beta naught hat + beta 1 hat times x1)). As you can see, the constraint that all estimated proportions come out between 0 and 1 is built in, because the numerator is always less than the denominator (the denominator is just 1 plus the numerator); these quantities keep each other in check in terms of allowable estimates. Again, e here, as we've said before, is the natural constant; you would find it on your calculator with the e-to-the-x button, and p here represents the probability that y equals 1. In a subsequent section, we'll show that, based on this equation, we can estimate things from the resulting regression on the log odds scale for any value of x1 and then translate that to the estimated proportion or probability of the outcome among individuals with that particular value of x1. But for now, we're going to continue to work on the regression scale and interpret our intercept and slope.
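As a preview of that translation, here is a minimal Python sketch (again mine, not the lecture's) of the conversion from a fitted log odds equation to an estimated probability:

```python
import numpy as np

def estimated_probability(b0, b1, x1):
    """Convert a fitted log odds b0 + b1*x1 into p-hat = odds / (1 + odds)."""
    odds = np.exp(b0 + b1 * np.asarray(x1))  # exponentiate the linear predictor
    return odds / (1 + odds)                 # always strictly between 0 and 1

# Any intercept, slope, and x1 give a valid probability; these particular
# values happen to preview the estimates from the example that follows.
print(estimated_probability(1.2, -0.034, [20, 50, 80]))
```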
Let's go back to data from the National Health and Nutrition Examination Survey, the 2013-2014 wave. We're going to look at body mass index and how it relates to HDL cholesterol levels. You'll remember that before, we used the cutoff for obesity in adults of a body mass index greater than or equal to 30. We're going to do that again and see how obesity is associated with cholesterol levels. We're going to model an equation of the following form: the log odds of obesity is equal to some intercept plus some slope times HDL. For the moment, we're going to treat HDL cholesterol level as continuous, because it was measured continuously in milligrams per deciliter. But by doing this, we're putting a pretty strict assumption on the nature of the relationship between obesity and HDL cholesterol. Namely, we're assuming that the relationship between the log odds of obesity and HDL cholesterol is linear in nature. In other words, if we were able to look at a scatter plot of the log odds of obesity versus HDL, we would expect the points to line up in a way reasonably well described by a line. But it's not easy to get a scatter plot of the log odds at face value, because the log odds is not observed for a single individual; it's a function of the proportion, and hence the odds, of y equaling 1 in a given group of x1 values. The way to do this, and we looked at this idea for continuous outcomes as well, is to use what's called a running-mean or LOWESS smoothing plot. This is not a confirmatory analysis; it's just descriptive.

What it essentially does is this: across the range of x values, in our case across the range of HDL values, it takes a window of points around each HDL value, and within that window it computes the proportion who have the outcome, the proportion with y equal to 1 in that range. It then uses that to compute the odds, and then the log odds, of the outcome for values in that window around our particular x value of interest, and plots that log odds on the graph. For example, it estimates the log odds of being obese for persons with HDL levels of 80 by using not just those who have HDL of exactly 80, but those with HDL at or close to 80. It actually gives more weight to the values closer to 80, just FYI. It plots that log odds, then moves the window over to the next observed value and does the same thing: it estimates the log odds for the next x value and plots that on the graph, and then a line connects the dots of those estimated log odds in small windows of HDL. This gives us a sense of the nature of the relationship.

If we look at this picture (I'll now show it without my markings, so they don't get in the way), it looks like a pretty reasonable fit to the linear assumption. At the very high end, we see an uptick. I can tell you, because I've seen the data, and if you were doing these analyses you would want to ask whoever prepared the data about this, that this is generally because there are very few points out there with such high HDL measurements; the uptick is usually the influence of only one or two points. So I'm going to go ahead, for the moment, and say that we've met the assumption of linearity between the log odds of obesity and the continuous measure HDL, and we'll estimate going forward.
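For those curious how such a descriptive plot might be built, here is a rough sketch using the LOWESS smoother in Python's statsmodels (the variable names and smoothing fraction are hypothetical choices, not from the lecture):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def empirical_log_odds(y, x, frac=0.3):
    """Smooth a 0/1 outcome y against x, then convert the smoothed
    proportions to log odds for a descriptive linearity check."""
    smoothed = lowess(y, x, frac=frac)               # columns: sorted x, smoothed proportion
    p_hat = np.clip(smoothed[:, 1], 1e-4, 1 - 1e-4)  # keep proportions strictly inside (0, 1)
    return smoothed[:, 0], np.log(p_hat / (1 - p_hat))

# Usage with hypothetical arrays: xs, lo = empirical_log_odds(obese, hdl)
# Plotting lo against xs approximates the smoothed log odds plot described above.
```

Like the windowed procedure described above, LOWESS gives more weight to points near each x value; the clipping guards against windows where the smoothed proportion hits exactly 0 or 1.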
Running this on the computer, the resulting logistic regression equation for this analysis is: the log odds of obesity equals an intercept of 1.20 plus a slope of negative 0.034 times x1, where x1, again, is HDL measured continuously in milligrams per deciliter. Just to show you again that the slope compares the log odds of obesity for any two groups who differ by one unit in x1: since HDL is continuous, I'm going to set this up generically and compare the log odds of obesity for two groups with consecutive values of HDL, a value generically labeled h, and a value one unit higher, h plus 1. For the first group, the estimated log odds from this equation is the intercept of 1.20 plus the slope of negative 0.034 times the quantity h plus 1. If we multiply that out, it equals 1.20 plus negative 0.034 times h, plus another copy of negative 0.034. For the group with HDL level equal to h, one unit less than the previous, the log odds of obesity is equal to the same intercept plus the slope of negative 0.034 times h. If we take the difference in these estimated log odds, the intercepts cancel, the negative 0.034 times h terms cancel, and we're left with only that additional copy of the slope from the first estimate. The difference in the estimated log odds of obesity for these two groups, who differ by one unit in x1, is simply the slope value of negative 0.034.

This slope estimate, beta 1 hat equals negative 0.034, estimates the difference in the log odds of obesity for two groups of persons whose HDL levels differ by one milligram per deciliter. In other words, when x1 is continuous, like HDL in milligrams per deciliter, the slope is the estimated difference in the log odds of obesity per one milligram per deciliter difference in HDL. As we've discussed so often, a difference in log odds can be re-expressed as the log of an odds ratio, so this slope estimate of negative 0.034 is the log of an odds ratio: the odds ratio of obesity for two groups whose HDL levels differ by one milligram per deciliter. If we exponentiate it, the odds ratio is 0.967, indicating that each one-unit increase in HDL is associated with slightly more than a 3 percent decrease in the odds of obesity; because the odds ratio of 0.967 is less than one, the odds of obesity decreases as x increases. The estimate of 0.967, or roughly 0.97, says that the odds ratio of being obese for two groups of persons who differ by one milligram per deciliter in HDL is about 0.97, higher HDL to lower HDL. So, roughly speaking, subjects higher by one milligram per deciliter in HDL have 3 percent lower odds of being obese when compared to the lower-HDL subjects. This estimate applies to any two groups who differ by one milligram per deciliter in HDL in the population from which the sample was taken. That's the power of the slope: it compares any two groups who differ by one unit in our x value, across the x range in our data.

Our intercept here, just like before, is the log odds of the outcome when x1 equals zero. When x1 equals zero, the log odds of obesity is 1.20, so beta naught hat equals 1.20, and if we exponentiate that, we get the estimated odds of obesity when x1 equals zero. But here x1 is HDL, and there are no groups where HDL equals zero. This is just a starting odds, a starting point for all of our comparisons. It's necessary for estimating the totality of the linear relationship, but on its own this intercept has little scientific relevance, on either the log odds or the exponentiated scale.

You may say, "John, I'm not only interested in comparing the relative odds of obesity for persons who differ by one milligram per deciliter in their HDL levels. I also want to be able to compare groups who differ by other amounts." We can do that, because all the information about any group comparison on HDL is contained in that slope. You might ask: what is the odds ratio of being obese for persons with HDL of 100 milligrams per deciliter versus persons with HDL of 80 milligrams per deciliter? Using the equation we had before, we can write out the scenario for both groups. When x1 equals 100, the log odds of obesity equals the intercept of 1.20 plus the slope of negative 0.034 times 100. When x1 equals 80, similarly, it's the intercept plus that slope times 80. If we take the difference in the log odds, which is also the log odds ratio, it's the slope of negative 0.034 times the difference of 100 minus 80, or the slope times 20.
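Here is that arithmetic as a short check in Python, using the fitted values from the lecture:

```python
import math

b1 = -0.034                       # fitted slope: log odds ratio per 1 mg/dL of HDL

print(math.exp(b1))               # ~0.967, the odds ratio per 1 mg/dL difference
print(b1 * (100 - 80))            # -0.68, the log odds ratio for 100 vs 80 mg/dL
print(math.exp(b1 * (100 - 80)))  # ~0.51, the corresponding odds ratio
```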
The log odds ratio of obesity for persons with HDL of 100 versus 80 is the slope times 20, or negative 0.68. To get the corresponding odds ratio, we exponentiate that, and it equals 0.51. Even though the decrease per milligram per deciliter is slight, about 3 percent, when we accumulate it over 20 milligrams per deciliter we get an odds ratio of 0.51, indicating that those with HDL of 100 have 49 percent lower odds of obesity compared to those with HDL of 80. So this accumulates pretty quickly. Notice, and I'll speak to this a little more in the additional exercises, that we didn't have to go back to the log odds scale: we could have taken the original odds ratio per one-unit difference, 0.967, and raised it to the 20th power, and that would give exactly the same answer. Those approaches are mathematically equivalent; we'll detail the math a little more in the additional exercises, but I'd like you to start thinking about it now. Again, the slope quantifies everything about group differences in the outcome for any difference in our predictor x. The only catch is that it's on the log scale, a difference in log odds, so we're one step removed from the scale we'd want to present on for people to comprehend. That difference in log odds can then be exponentiated to give us an odds ratio, the relative odds of the outcome for any two groups who differ by any multiple of one unit in x.
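The equivalence is just a rule of exponents, e^(20b) = (e^b)^20, which a one-line check confirms:

```python
import math

b1 = -0.034
print(math.exp(b1) ** 20)  # the per-unit odds ratio raised to the 20th power
print(math.exp(b1 * 20))   # exp of the slope times 20; identical result, ~0.51
```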
Let's look at another example where x is continuous in logistic regression. We have data on a random sample of 192 Nepalese children between one and three years old, so we're looking at a slightly different group than before: those between 12 and 36 months. Information in the sample includes each child's breastfeeding status at the time of the study and the child's age at the time of the study. Breastfeeding is coded as 1 if the child was being breastfed at the time of the study, and 0 if not. The following model can potentially be used to estimate this breastfeeding and age association. Because our outcome is binary, we're going to use logistic regression to estimate the log odds of being breastfed as a linear function of age in months. We can do this if we roughly meet the assumption that the log odds of being breastfed tracks linearly with age in months. We look at our LOWESS smoothing plot, which tracks the estimated empirical log odds of being breastfed in small windows of age and then connects the dots. It's not a perfect line, but again, these are just estimates based on a window; if we changed the width of the window, we'd get a slightly different result. It's certainly clear, as we might expect, that the log odds, and hence the odds and the probability, of being breastfed decreases consistently as a function of age. I'm going to suggest that it's very appropriate to fit a line to approximate this association. If we go and do this, we get a result that looks like this: the log odds of being breastfed is equal to an intercept of 7.29 plus a slope of negative 0.24 times age in months. The slope is again negative, because we have a decreasing relationship between the log odds of the outcome and x.

Generically, if we compare two groups of children who differ by one month in age, a plus 1 versus a, the log odds of being breastfed for the first group is the intercept plus the slope of negative 0.24 times the quantity a plus 1. Multiplying that out, it's 7.29 plus negative 0.24 times a, plus another copy of negative 0.24. For the group with age equal to a, the log odds of being breastfed is equal to the intercept plus negative 0.24 times a. If we take the difference of these two, the intercepts cancel, the negative 0.24 times a parts cancel, and we're left with that single copy of the slope, negative 0.24. Again, this difference, the slope beta 1 hat equals negative 0.24, estimates the difference in the log odds of being breastfed for two groups of children who differ by one month in age. And again, a difference in log odds can be re-expressed as the log of an odds ratio: this slope of negative 0.24 is the estimated log of the odds ratio of being breastfed for two groups of children who differ by one month in age. If we exponentiate it, we get an odds ratio of 0.787, or roughly 0.79. The odds ratio of being breastfed for two groups of children who differ by one month in age is 0.79, older compared to younger, so this decreases pretty quickly as a function of age. In other words, children older by one month have 21 percent lower odds of being breastfed when compared to children one month younger. This is the estimated odds ratio of being breastfed for any two groups who differ by one month in age, in the population of Nepalese children 12 to 36 months old.

Let's look at what the intercept is. Again, it's not going to make a lot of sense scientifically, because our sample does not include newborns. It's the estimated log odds of being breastfed when x1 equals 0. Certainly an age of zero is a real quantity, it describes newborns, but our sample only consists of children 12 to 36 months old, so it's not relevant to our sample or population. It's nevertheless the starting point for estimating the log odds for any other group given their age. The intercept estimates the log odds of being breastfed for newborn children, a group that is realistic in real life but not necessarily relevant to our data, which is based only on children 12 to 36 months old. If we exponentiate this intercept of 7.29, we get an odds of roughly 1,466, which simply means we're starting at a very high point. We'll appreciate this when we get to a subsequent section where we crank out some of the probabilities or proportions of being breastfed as a function of age; you'll see they tend to be very high at the early ages, and that's fueled by this large starting point.
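A quick check of the two exponentiated numbers quoted above, using the fitted values from the lecture:

```python
import math

b0, b1 = 7.29, -0.24  # fitted intercept and slope from the breastfeeding model

print(math.exp(b1))   # ~0.787: odds ratio of being breastfed per one-month age difference
print(math.exp(b0))   # ~1,466: estimated odds of being breastfed at age 0 (outside our 12-36 month data)
```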
Now one for you to try, which we will review in the additional exercises section; you can build on what we did in the previous example. What is the estimated relative odds, the odds ratio, of being breastfed for children who are 30 months old compared to children who are 24 months old?

What if there is not a linear relationship? Suppose we look at the LOWESS plot between the log odds of an outcome and our predictor x and we see something that attenuates. If we fit a line to it, we may overestimate things for early x values and underestimate them for later x values, and we'd miss the key feature that the drop-off happens early on. Or suppose we see something that rises to a threshold and then comes back down; if we try to fit a line to that, the best-fitting line will be relatively flat, with a slope near zero and an odds ratio near one, and we won't see much of an association. How can this be handled?

Well, one option in these situations is to categorize the continuous predictor into ordinal categories and then treat those as standard categorical variables, with a reference group and indicator x's (a rough code sketch of this approach appears at the end of this section). For example, you may recall the example in the last section with respiratory failure and gestational age. Gestational age was categorized, with 36-40 weeks as the reference group and indicators for children with gestational ages of 34 weeks, 35 weeks, and 36 weeks, and we got a result that looks like this. I'm just going to ask you this now, and we'll come back and dissect it in the additional exercises: what do these results imply about the relationship between the log odds of respiratory failure and gestational age? Does it look like it's linear in nature, or is there something else going on? Think about that and we'll revisit it in the additional exercises.

In summary, we've now seen, with this section as the culmination, that simple logistic regression can be done with binary and categorical predictors, as in the previous section, and also with continuous predictors. When the predictor x1 is continuous, the model estimates a linear relationship between the log odds of the outcome y equaling 1 and the predictor x1. This assumption should definitely be investigated empirically, either by LOWESS smoothing or by categorizing the continuous predictor prior to fitting the regression model. The resulting estimated slope from a logistic regression with a continuous predictor still has a log odds ratio interpretation. The intercept has a "log odds when x1 equals zero" interpretation, although, as we've seen, when x1 is continuous that's not always a relevant domain for the data we're working with. In the next section, we'll take on the uncertainty aspect and look at getting confidence intervals and p-values for the parameters we're estimating, namely the slope and intercept from our logistic regression.
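As referenced above, here is a minimal sketch of the categorization approach, using simulated data and arbitrary cut points (this is my illustration, not the lecture's respiratory failure analysis):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: y is a 0/1 outcome, x a continuous predictor.
rng = np.random.default_rng(1)
x = rng.uniform(20, 100, 500)
p = 1 / (1 + np.exp(-(1.2 - 0.034 * x)))  # true log odds linear in x
df = pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

# Categorize the continuous predictor into ordinal bins (cut points are arbitrary here).
df["x_cat"] = pd.cut(df["x"], bins=[20, 40, 60, 80, 100], include_lowest=True,
                     labels=["20-39", "40-59", "60-79", "80-100"])

# One indicator per non-reference category; "20-39" serves as the reference group.
fit = smf.logit("y ~ C(x_cat, Treatment(reference='20-39'))", data=df).fit()
print(np.exp(fit.params))  # exponentiated coefficients are odds ratios versus the reference
```

If the exponentiated coefficients step down (or up) roughly evenly across adjacent categories, that supports treating the predictor as linear on the log odds scale; uneven jumps suggest otherwise, which is the point of the gestational age question above.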