Welcome to course five on Dealing with Missing Data: weighting and imputation are the topics that we'll cover here. My name is Richard Valliant. I'm a research professor at the University of Michigan and the University of Maryland. There are also contributions to this course from Frauke Kreuter, who's a professor at the University of Maryland. We've got four modules that we'll cover in this course. Module 1 covers the general steps in weighting. The idea in weighting is to expand the sample to a finite population, and we'll talk about how to do that. Module 2 gives you the specific steps involved in doing that. Module 3 will tell you how to implement the steps, with an emphasis on software examples there, in particular R software that can be used for this. Then Module 4 is imputing for missing items, which is an important part of many surveys. We'll cover all of those as we go along.

Now, the first part of this module is just an introduction to weighting, which we'll cover now. What are the purposes of weights? The general purpose is just to expand the sample to the full finite population. We've got this small-world sample that we're trying to blow up to a larger world, the population, and we use weights to do that. We'll talk about the particular things that are being accomplished by weighting. One of them is to correct for coverage problems in a sample or frame. Many times you've got a defined universe, but the frame that you're able to draw the sample from does not have the same size or coverage as the desired universe. We need to adjust for that in some way. Another important step in weighting is using auxiliary data, if we've got any, to create unbiased and more precise estimators.

First, let's look at a picture of the situation. We've got here, in this grayish ellipse, the universe U that we're trying to make inferences for, and then in this rectangular shape is the frame. That's what we're able to draw the sample from. You can see that we've got two areas here that are non-overlapping. There's this one right here, which is the part of the universe that's not actually in the frame; we're not able to draw a sample from this part for various reasons. Suppose we're surveying businesses and our business list is out of date; there have been some new ones created that didn't get onto our frame. Now, the other area that's a problem is this F minus U section of the picture. This consists of units down here that are in our frame but are not part of the universe, this gray ellipse. I draw my sample; that's in the orange ellipse here. You can see it overlaps with both the universe and the part of the frame that's not in the universe. When I make inferences, I'm going to have to get rid of this little section of units down here that are ineligible. They're not going to be useful to me for making inferences about the rest of the universe. I will have to take the section of the sample that overlaps with the universe and expand it both to the area that is in the frame but not in the sample, and also to the area out here that is in the universe but not covered by my frame. There are big assumptions that need to hold for this expansion to make sense, but it's standard procedure to attempt to blow these units up to the full universe, even though my sample may not have covered all of them.

Now, weights and estimators are intimately linked. The scale of weights can vary. One way to scale weights is so that they estimate finite population totals.
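To make the idea of scale concrete, here is a minimal R sketch, assuming a simple equal-probability design; the population size, sample size, and outcome are made up for illustration and are not from the course. Each base weight is the inverse of the unit's inclusion probability, so the weights sum to the population size and weighted totals estimate finite population totals.

```r
# A minimal sketch, assuming an equal-probability sample; illustrative numbers only.
set.seed(1)
N <- 10000                 # finite population size (assumed)
n <- 100                   # sample size (assumed)
pi_i <- rep(n / N, n)      # inclusion probability for each sampled unit
w <- 1 / pi_i              # base weight = inverse inclusion probability
sum(w)                     # weights sum to N = 10,000, the population count

y <- rbinom(n, 1, 0.09)    # e.g., an indicator of having some condition
sum(w * y)                 # weighted total estimates the finite population total
sum(w * y) / sum(w)        # weighted mean estimates the population proportion
```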
Another way to scale weights is just to force them to sum up to the sample size. You'll see both of these methods used. Weights that are scaled to sum up to the sample size are called normalized weights, and partly this is a holdover from the days when software for analyzing survey data was not readily available. The idea there was that if you do an analysis that uses degrees of freedom, the old software would tend to report the degrees of freedom as the sum of the weights minus p, the number of parameters in a model or something like that. The sum of the weights, if they were scaled to estimate finite population totals, would be enormous. In a country like the United States, the sum of the weights would be over 300 million if we're talking about people. You certainly don't have 300 million degrees of freedom if you've got a sample of, let's say, size 1,000. If you normalize the weights, though, the sum of the weights minus p will be a more sensible number: the sample size minus p. That was the general idea in doing it. What we're going to deal with in this course are weights that are scaled to estimate population totals. That's the standard approach in most federal government agencies, where finite population totals, like the total number of people unemployed, are of major importance in the survey.

Now, why use the weights at all? Here is a little example to illustrate what the issue is. One way to proceed, if you have a sample from a finite population, is to just ignore the weights entirely and hope that things come out all right. In this example, we're going to estimate the prevalence of diabetes across a set of ethnic groups. Suppose the sample produces unbiased estimates for each ethnic group, but equal-size samples are selected from each group and the groups themselves are different sizes. This refers to the US population, but the same thinking applies to other countries. I've got several race/ethnicities here: non-Hispanic whites, Asian Americans, Hispanics, non-Hispanic blacks, and then American Indians and Alaska Natives, which are a very small group in the US. Among those groups, let's suppose that the proportion with diabetes is found to be these numbers here: about 0.076 or 7.6 percent for non-Hispanic whites, up to a high of 0.159 or 15.9 percent for American Indians and Alaska Natives. Here are the population proportions in those groups. I've got 65 percent non-Hispanic whites, which is the biggest group in this table. Then, as you can see, American Indians and Alaska Natives are a very small proportion of the total, even though their rate of diabetes is relatively high. If I take column B times column C, that gives me a population value of 0.0496 for non-Hispanic whites. I add all those products up, and my prevalence of diabetes in the population is 0.093 or 9.3 percent. Now let's suppose further that the sample is evenly divided among these groups. That's in column E here, and I've got 0.2 or 20 percent of my sample in each of those groups. You can draw samples in that way; you just have to find the people and equalize the sample sizes. If I take an unweighted sample value, that would be column B times column E, summed up; that's the equivalent of doing an unweighted analysis in this case.
The answer that I get if I do the unweighted analysis is 0.117, or 11.7 percent prevalence of diabetes in the population. But the actual population value, assuming that these proportions are really population values and that my sample estimates these prevalences exactly right, is 9.3 percent. That's quite a ways off if I do the unweighted analysis. That's the reason to use weights. You may have a sample that's disproportionately distributed among groups that matter in the population, and they matter in this case because the prevalence of diabetes, which we're estimating, is much different across the groups. If you ignore the weights in a case like that, then you'll get biased estimates. That's why we use these weights, and in the later sections we'll talk about the details of how to compute them.
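To make the arithmetic in this example concrete, here is a small R sketch of the weighted and unweighted calculations. Only the 0.076 and 0.159 prevalences and the 65 percent population share come from the lecture; the remaining group values are assumptions chosen for illustration, so the results only approximately reproduce the 0.093 and 0.117 figures.

```r
# Sketch of the diabetes example. Only the 0.076 and 0.159 prevalences and the
# 0.65 population share are given in the lecture; other values are assumed here.
prev      <- c(NHwhite = 0.076, Asian = 0.090, Hispanic = 0.128,
               NHblack = 0.132, AIAN = 0.159)   # column B: prevalence by group
pop_share <- c(0.65, 0.05, 0.16, 0.13, 0.01)    # column C: population proportions
smp_share <- rep(0.20, 5)                       # column E: equal-size samples

sum(prev * pop_share)   # weighted (population) prevalence, roughly 0.093
sum(prev * smp_share)   # unweighted analysis of the equal-allocation sample, 0.117
```

The gap between those two numbers is the bias that comes from ignoring the weights when the sample allocation does not match the population shares.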