So, in this section, we will talk about applying the principles of the normal distribution to sample data to estimate characteristics of population data. Upon completion of this lecture, you will be able to create ranges containing a certain percentage of observations from an approximately normal distribution, using only an estimate of the mean and standard deviation; figure out how far any individual data point is from the mean of its distribution in standardized units, that is, in units of standard deviation, which we'll sometimes call the z-score computation; and convert z-scores to statements about relative proportions or probabilities for values that follow an approximately normal distribution. The normal distribution, just to reiterate, is a theoretical probability distribution. To take this one step further, no real-life data is perfectly normally distributed. For example, even though almost all of the values in a theoretical normal distribution fall within three standard deviations of the mean, the theoretical distribution has tails, and they go on forever. So, even though values on the low end and high end are relatively unlikely, the distribution theoretically takes on values going to negative infinity in the negative direction and positive infinity in the positive direction. No real-life data spans a range that great, even though, again, most of the observations under a normal distribution fall within plus or minus three standard deviations of the mean. However, the distributions of some data will be well approximated by a normal distribution. In such situations, if we believe this is the case, we can use the properties of the normal distribution to characterize aspects of the data distribution. So, let's go back to our example with systolic blood pressures on 113 men.
We've done some exploratory data analysis: we computed the mean, the median, and the standard deviation, and we've seen some empirical evidence that the distribution of the values in the sample is roughly symmetric. We see that the mean and median are also similar in value. This may make us think that we have roughly symmetric, bell-shaped data in the population from which we've drawn the sample, that the population distribution of blood pressures is approximately normal. As a visual test, I've superimposed a smooth normal-distribution curve, with the same mean of 123.6 and the same standard deviation of 12.9, on top of the histogram to see how well it fits the observed 113 observations. While certainly not a perfect fit, because this is a theoretical curve made up of an infinite number of observations, it's not a bad fit to start. So, if we're willing to believe that these data come from a roughly normally distributed population of values, we can use only the sample mean and standard deviation to estimate quantities like the 2.5th and 97.5th percentiles of systolic blood pressure in the population from which the sample is drawn. Again, we'll use our sample estimates to stand in for the unknown population values. We talked before about how, under a normal curve, the mean minus two standard deviations gives us roughly the 2.5th percentile for data that follow a normal distribution; similarly, the mean plus two standard deviations gives the 97.5th percentile. Doing this on the lower end, we estimate a 2.5th percentile of 97.8 millimeters of mercury, and on the upper end, a 97.5th percentile of 149.4 millimeters of mercury.
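This mean-plus-or-minus-two-SD computation takes only a few lines. The lecture works in R, but here is a minimal sketch of the same arithmetic in Python's standard library, using the sample values quoted above (the variable names are mine):

```python
from statistics import NormalDist

# Sample estimates from the 113 systolic blood pressures (mmHg)
mean_sbp = 123.6
sd_sbp = 12.9

# Under approximate normality, mean +/- 2 SD estimates the
# 2.5th and 97.5th percentiles of the population distribution
lower = mean_sbp - 2 * sd_sbp
upper = mean_sbp + 2 * sd_sbp
print(round(lower, 1), round(upper, 1))  # 97.8 149.4

# The "2" is a convenient rounding of the exact standard normal
# multiplier for the middle 95 percent, about 1.96
print(round(NormalDist().inv_cdf(0.975), 2))  # 1.96
```
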
So, one way to think of this: based on the sample data, and using the properties of the normal distribution that we validated by looking at the histogram and concluded was reasonable to use, we estimate that most of the men in this sample, and hence in this population, have systolic blood pressures between 97.8 and 149.4 millimeters of mercury; the middle 95 percent of values would fall between these two. Note that if we look at the actual 2.5th and 97.5th percentiles of the 113 sample values, these are 100.7 and 151.2 millimeters of mercury, respectively. So, not exactly equal to what we get if we assume normality, but somewhat close. Ordinarily, if I had all the data at hand and were able to analyze it myself, I would still take the observed percentiles, because I have the 113 data measures in front of me. But if I only had the mean, the standard deviation, and some visual evidence, like the histogram, of an approximately normal distribution of the data points, I could use the mean and standard deviation to estimate these and other percentiles. Another question we might have: instead of creating a range that includes most of our data values, we might say, "Well, suppose you want to use the results from the sample of 113 men to evaluate individual male patients relative to the population of all such patients." So, you estimate characteristics of the population from the sample of 113 men, and then you use the results to analyze characteristics of future men, men who are not part of the original 113, in relation to the values in the population. Suppose a patient in your clinic has a systolic blood pressure measurement of 130 millimeters of mercury. He might want to know, "Where do I fall with regard to all men in this clinical population?" One way to ask that is: what proportion of men at the clinic have systolic blood pressure measurements greater than mine, greater than 130 millimeters of mercury?
So, here's what we want to figure out. Suppose we only have the summary statistics to start, plus this histogram showing us that the sample distribution reasonably approximates a normal curve, so that we can assume the underlying population is approximately normal. We want to know what proportion of men have blood pressures greater than 130; it's the shaded area here. We've got a mean and standard deviation, and we're assuming underlying normality of the data in the population from which the sample is taken. What we can do is create something called a z-score: we're going to translate this measurement of 130 millimeters of mercury into units of standard deviations. Then we can find out how many sample standard deviations this person's systolic blood pressure is above or below the sample mean. To do this, we take this individual observation and measure the distance, in millimeters of mercury, between this person's value of 130 and the sample mean of 123.6; that raw distance is 6.4 millimeters of mercury. Then, to understand where a value 6.4 millimeters of mercury greater than the sample mean of 123.6 sits in the distribution, we have to standardize it by how much variation there is in individual values across the distribution, and that's the standard deviation, 12.9 millimeters of mercury per single standard deviation. Do the math and we get that this person's value is, with a little rounding, approximately 0.5 standard deviations above the mean of the data distribution. So now we can re-ask the original question, "What proportion of men have systolic blood pressures greater than 130 millimeters of mercury?", as "What percentage of observations in a normal curve are more than 0.5 standard deviations above the mean?" So, let's go to R.
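That standardization step can be sketched directly. Here it is in Python rather than R (the helper name `z_score` is mine, not from the lecture), using the sample mean and standard deviation quoted above:

```python
def z_score(x, mean, sd):
    """Distance of one observation from the mean, in SD units."""
    return (x - mean) / sd

# (130 - 123.6) / 12.9: a raw distance of 6.4 mmHg, standardized
z = z_score(130, 123.6, 12.9)
print(round(z, 2))  # 0.5
```
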
We've got R, and the value we want to evaluate is 0.5 standard deviations above the mean. But remember what we're going to get from the pnorm command in R: it's going to tell us the proportion of values under a normal curve that are less than or equal to 0.5 standard deviations above the mean of the curve. What we get here is that this shaded area is 0.69, or 69 percent, and so, conversely, the proportion of values that are more than 0.5 standard deviations above the mean is 100 minus 69 percent, or 31 percent, 0.31 as a decimal. So, again, 69 percent of the observations described by a standard normal curve are less than or equal to 0.5 standard deviations above its mean of zero, and hence the remaining 31 percent are more than 0.5 standard deviations above zero. In terms of the original question posed, this means that an estimated 31 percent of the males in the population have blood pressures greater than 130 millimeters of mercury. In other words, using only the mean and standard deviation, we have estimated the 69th percentile of this blood pressure distribution to be 130 millimeters of mercury. Again, just for context, to see whether this gives us a reasonable result: if I looked at the empirical 70th percentile of the 113 values, the value that was greater than or equal to 70 percent of the data values and less than 30 percent, it is 130. So, very close: it's the 70th percentile, not the 69th, but essentially very similar. Again, if I had all 113 data points, I would use the observed percentiles to answer this question.
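The lecture runs this in R with pnorm; as a sketch, the same computation in Python's standard library uses `NormalDist`, whose `cdf` plays the role of pnorm for a standard normal (mean 0, SD 1):

```python
from statistics import NormalDist

# Proportion of a standard normal at or below z = 0.5 (R's pnorm(0.5))
below = NormalDist().cdf(0.5)
above = 1 - below  # proportion more than 0.5 SDs above the mean
print(round(below, 2), round(above, 2))  # 0.69 0.31
```
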
But in some situations, I may only have summary statistics like the mean and standard deviation, and if I'm willing to assume that the data are roughly normally distributed in the sample, because they come from a population of data points that are normally, or approximately normally, distributed, I can use just those two values to get this. Another way to think about this 31 percent: the probability that any male in the population has a blood pressure measurement more than 0.5 standard deviations above the mean is 0.31, or 31 percent. So, if I were to select a male at random from this population, 31 percent of the time he would have a blood pressure greater than 0.5 standard deviations above the mean, or greater than 130 millimeters of mercury. This type of computation we did, converting the systolic blood pressure of 130 to the number of standard deviations above or below the sample mean, is sometimes called a z-score. There is nothing special about a z-score; it is simply a measure of the relative distance and direction of a single observation from the mean of its distribution, with that distance converted to units of standard deviation. This is akin to converting kilometers to miles, changing units of distance, or dollars to rupees, changing units of currency. So, let me give you a silly example to illustrate that this is all we are doing. Suppose you're an American who is apartment hunting in an unnamed European city. You wish to find an apartment within walking distance, plus or minus 1.5 miles, of a large organic supermarket on Main Boulevard, which runs East to West. You are only considering apartments on Main Boulevard, and you're only considering the East and West directions; we've ruled out North and South. Kind of a silly example, but let's think about it. The supermarket is two kilometers West of the main city square, and you are interested in three apartments.
Apartment one is six kilometers West of the city square, apartment two is 0.75 kilometers West of the city square, and apartment three is one kilometer East of the city square. Here's a schematic, looking at apartment number one. Here's the supermarket, as we've noted; CS stands for city square. The distance between these two is two kilometers, and the distance between apartment one and the city square is six kilometers. So, certainly, we can figure out the distance between apartment one and the supermarket by taking the apartment's distance from the city square and subtracting off the supermarket's distance from the city square, our reference point. Apartment one is four kilometers from the supermarket. This is what we call the raw distance, but I don't know or understand the metric system, so I need to convert it into units I'm comfortable with, in this case miles, to check whether it's within my desired walking distance, or rule it out if not. How many miles is four kilometers? Well, one mile equals roughly 1.6 kilometers, so four kilometers is approximately four kilometers divided by 1.6 kilometers per mile, or 2.5 miles. So, in terms of miles, apartment one is 2.5 miles West of the supermarket. I've first gotten the distance between my apartment and the supermarket by subtracting the supermarket's distance from the city square from my apartment's distance from the city square, and then I've converted kilometers to miles so that I can evaluate whether it's close or not by my criteria. Similarly, apartment two is 0.78 miles East of the supermarket, and apartment three is 1.88 miles East of the supermarket. So, in some sense, the z-score is the statistical mile: it allows us to convert observations from different distributions, with different measurement scales, to comparable units.
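The apartment arithmetic above can be written out in a few lines. Here is a sketch in Python (the names and the sign convention, West positive, are mine):

```python
# Positions in km West of the city square; East positions are negative
SUPERMARKET = 2.0
apartments = {"apartment 1": 6.0, "apartment 2": 0.75, "apartment 3": -1.0}

KM_PER_MILE = 1.6  # the rough conversion used in the lecture

for name, km_west in apartments.items():
    raw_km = km_west - SUPERMARKET      # signed distance from the market
    miles = abs(raw_km) / KM_PER_MILE   # convert to familiar units
    direction = "West" if raw_km > 0 else "East"
    print(name, round(miles, 2), "miles", direction)
```

This reproduces the 2.5, 0.78, and 1.88 mile figures: first subtract the common reference point, then rescale into the units you understand, which is exactly the two-step structure of a z-score.
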
When dealing with data that follow an approximately normal distribution, these z-scores tell us everything we need to know about the relative positioning of individual observations in the distribution of all observations. We can compute z-scores for data arising from any type of distribution; it doesn't have to be normal. However, for data from non-normal distributions, while the z-score will still inform us about relative position, it may not translate into correct percentile information; we'll look at some examples of this in the next section. So, let's look at another example. Here's a basic histogram of weights for 236 Nepali children, males and females combined, at one year old. The mean of this sample is 7.1 kilograms and the median is 7.0, so very similar in value, and we can see that this histogram, the distribution of these 236 weights, is almost perfectly symmetric, not quite, and bell-shaped in some sense. So, it may be reasonable to assume that these data come from a population of approximately normally distributed child weights. Here I've superimposed a normal curve with the same mean of 7.1 and the same standard deviation of 1.2. It doesn't line up perfectly, but it's not a bad fit, and you can imagine that a larger sample would fill in some of these gaps and let us better ascertain the fit. So, I'm going to use only the sample mean and standard deviation, assuming that these data come from a population of approximately normally distributed weights, and estimate a range of weights for most, 95 percent, of Nepali children who are 12 months old, using only the data from this sample. Again, to estimate the middle 95 percent, the range we would give runs from the 2.5th percentile to the 97.5th percentile.
That interval would contain the middle 95 percent, and I can estimate it because I've assumed these data come from a population with an approximately normal distribution. I take the sample mean minus two standard deviations, which gives me a value of 4.7 kilograms for the 2.5th percentile, and doing the same thing but adding two standard deviations to the mean gives me the estimated 97.5th percentile. Based on that computation, we estimate that most, the middle 95 percent, of Nepali children who are 12 months old had weights between 4.7 and 9.5 kilograms. Again, we're estimating based on an assumed underlying normal population distribution. The value here is 9.5; the value here is 4.7. This cuts off the 95 percent in the middle, and we're left with 2.5 percent who have weights greater than 9.5 and 2.5 percent with weights less than or equal to 4.7. Just FYI, because I have all 236 observations, I could actually find with the computer the 2.5th and 97.5th percentiles of these 236 values, and those are 4.4 and 9.7, respectively. So, very similar to what we get when we use the normality assumption and just the mean and standard deviation. Now, suppose a mother brings her child to a pediatrician for his or her 12-month checkup and wants to evaluate where the child's weight falls relative to the population of 12-month-olds in Nepal. Her child is five kilograms, and she wants to know: is this child close to average weight, way above average, or way below? She wants to place him or her relative to other 12-month-old children in Nepal. So, the information we're trying to ascertain looks like this: using the sample data, and assuming these data come from a normally distributed population, we want to figure out the percentage of children who are smaller in weight than this child.
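The computation is the same mean-plus-or-minus-two-SD recipe as before, so it's natural to wrap it in a small helper. A Python sketch (the function name is mine, not lecture code), applied to the weight data:

```python
def middle_95(mean, sd):
    """Estimated 2.5th and 97.5th percentiles under approximate normality."""
    return mean - 2 * sd, mean + 2 * sd

# Sample mean 7.1 kg, SD 1.2 kg for the 236 Nepali 12-month-olds
lo, hi = middle_95(7.1, 1.2)
print(round(lo, 1), round(hi, 1))  # 4.7 9.5
```
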
That is, who have weights less than or equal to five kilograms. So again, what we're going to do is create a z-score. We're going to translate this measurement of five kilograms into units of standard deviation, so we can find out how this child's weight compares to the mean of all such children in terms of standard deviations. What we do is take the observed child's weight of five, subtract the mean for all children in our sample of 7.1, and divide by the units in the standard deviation, 1.2 kilograms per standard deviation. This child is small, weighing less than the mean by 2.1 kilograms, so the difference in kilograms is negative 2.1. When we standardize that, we find that this child's weight is 1.75 standard deviations below the sample mean of 7.1 kilograms. So, the original question asked by the parent, "How does my child's weight compare to that of other children of the same age?", can be rephrased as, "What percentage of observations in a normal curve are more than 1.75 standard deviations below its mean?" We're going to go to our friend pnorm, and pnorm is perfectly set up for this, because that's exactly what it calculates. If we type pnorm(-1.75), what it's going to give us is the proportion of observations in the standard normal curve that are more than 1.75 standard deviations below zero, below the mean. In that case, we see that roughly four percent of the children would be that far below the mean if these data came from a normal distribution. So, four percent of the observations described by a standard normal curve are more than 1.75 standard deviations below the mean of zero. In terms of the original question posed, this means that an estimated four percent of the children in the population have weights less than five kilograms. So, in other words, using only the mean and standard deviation, we have estimated the fourth percentile to be five kilograms.
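As with the blood pressure example, here is the R pnorm(-1.75) computation sketched in Python's standard library, using the weight sample values quoted above:

```python
from statistics import NormalDist

z = (5.0 - 7.1) / 1.2        # the child's weight in SD units
below = NormalDist().cdf(z)  # proportion at or below z, like pnorm(-1.75)
print(round(z, 2), round(below, 2))  # -1.75 0.04
```
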
Just for context and comparison, the observed fifth percentile of these 236 measurements is five kilograms, the very measurement we were looking at; we estimated it to be the fourth percentile, so very close. Again, I have the 236 observations, so I could compute the percentiles and use them, but if I only had the mean and standard deviation, I'd get an answer that was very similar. We can also answer a broader question about the child who weighed five kilograms: what percentage of 12-month-old children in Nepal have weights more extreme, or unusual, than this child's? Another way to think about this is: what percentage of weights are farther than 1.75 standard deviations from the mean in either direction, either above or below? So, we would have z-scores less than negative 1.75 or greater than positive 1.75; we can express this in terms of the absolute value of the distance, the absolute value of z being greater than 1.75. This could also be phrased, "What is the probability that a 12-month-old Nepali child will have a weight measurement more than 1.75 standard deviations from the mean of all such children, either above or below?" Well, we've seen from our result that if we look only in one direction, the proportion more than 1.75 standard deviations below the mean is four percent. To get the probability of being that far away in either direction, we just double the result, because of the symmetry of the normal curve. So, the total proportion of children who have weights more than 1.75 standard deviations away from the mean is two times 0.04, or 0.08, eight percent. So, in summary, the normal distribution is a theoretical probability distribution that can be completely defined by two characteristics: the mean and standard deviation.
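The doubling step, which relies on the symmetry of the normal curve, looks like this as a Python sketch mirroring the R pnorm computation:

```python
from statistics import NormalDist

one_tail = NormalDist().cdf(-1.75)  # proportion more than 1.75 SDs below
two_tail = 2 * one_tail             # either direction, by symmetry
print(round(one_tail, 2), round(two_tail, 2))  # 0.04 0.08
```
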
No real-world data has a perfect normal distribution; however, some continuous measures are reasonably approximated by a normal distribution. When dealing with samples from populations of approximately normally distributed data, the distribution of sample values will also be approximately normal, and that's why we can use the distribution, or histogram, of our sample data to make that judgment call. We can use the observed sample mean and standard deviation estimates, x-bar and s respectively, to create ranges containing a certain percentage of observations, or in other words, to estimate the probability that an observed data point falls within a certain range of values; for example, 95 percent of the values fall within plus or minus two standard deviations of the mean. We can also figure out how far any individual data point in the distribution is from the mean of the distribution in standardized units, units of standard deviation. We can convert these standardized units, or z-scores, to statements about relative proportions or probabilities, and hence percentiles, for values whose distribution is approximately normal. Now, in these examples I said, well, if we had all of the data for the sample, we could compute the percentiles without assuming normality, although the answers would match closely if our assumption was correct. But we're going to see later in the course that we'll only be privy to the mean and standard deviation for some distributions of interest that happen to be normal, and because we can estimate the mean and standard deviation of those distributions, we can use these properties of normal curves to make statements with only those two measures. In the next section, we'll look at what happens if we improperly apply the properties of normally distributed data to data that is not normally distributed.