Welcome to Course 5, Dealing with Missing Data; weighting and imputation are the topics that we'll cover here. My name is Richard Valliant; I'm a research professor at the universities of Michigan and Maryland. There were also contributions to this course from Frauke Kreuter, who is a professor at the University of Maryland.

We've got four modules that we'll cover in this course. Module 1 covers the general steps in weighting; the idea in weighting is to expand the sample to a finite population, and we'll talk about how to do that. Module 2 gives you the specific steps involved in doing that, and Module 3 will tell you how to implement those steps, with an emphasis on software examples, in particular R software that can be used for this. Then Module 4 covers imputation for missing items, which is an important part of many surveys. So we'll cover all of those as we go along. The first part of this module is just an introduction to weighting, which we'll cover now.

Now, what are the purposes of weights? The general purpose is just to expand the sample to a full finite population. We've got this small-world sample that we're trying to blow up to a larger world, the population, and we use weights to do that. We'll talk about the particular things that are accomplished by weighting. One of them is to correct for coverage problems in a sample or a frame. Many times you've got a defined universe, but the frame that you're able to draw the sample from does not have the same size or coverage as the desired universe, so we need to adjust for that in some way. Another important step in weighting is using auxiliary data, if we've got any, to create unbiased and more precise estimators.

First, let's look at a picture of the situation. We've got here, in this grayish ellipse, the universe U that we're trying to make inferences for, and in this rectangular shape is the frame F; that's what we're able to draw the sample from. You can see that we've got two areas here that are non-overlapping. There's the part of the universe that's not actually in the frame, so we're not able to draw a sample from that part, for various reasons. Suppose we're surveying businesses and our business list is out of date: some new businesses have been created that didn't get onto our frame. The other problem area is the F minus U section of the picture. This consists of units that are in the frame but are not part of the universe, the gray ellipse. So I draw my sample; that's the orange ellipse here. You can see it overlaps with both the universe and the part of the frame that's not in the universe. When I make the inferences, I'm going to have to get rid of that little section of units that are ineligible; they're not going to be useful to me in making inferences about the rest of the universe. I will have to take the section of the sample that overlaps with the universe and expand it both to the area that is in the frame, S superscript C, that is not in the sample, but also to the area that is in the universe but not covered by my frame. There are big assumptions that need to hold for this expansion to make sense, but it's standard procedure to attempt to blow these units up to the full universe even though my sample may not have covered all of them.

Now, weights and estimators are intimately linked. The scale of weights can vary.
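To make that expansion concrete, here is a minimal sketch in R; the selection probabilities and the 0/1 survey item are made-up numbers for illustration, not data from the course. Each sampled unit gets a base weight equal to the inverse of its probability of selection, so the weights sum to an estimate of the population size, and weighted sums estimate population totals.

```r
# Minimal sketch with made-up numbers: base weights expand the sample
# to the full population.
sel_prob <- c(1/500, 1/500, 1/200, 1/200, 1/1000)  # hypothetical selection probabilities
y        <- c(1, 0, 1, 1, 0)                       # hypothetical 0/1 survey item

w <- 1 / sel_prob   # base weight: each unit "represents" 1/prob population units
sum(w)              # estimated population size: 500 + 500 + 200 + 200 + 1000 = 2400
sum(w * y)          # estimated population total of y: 500 + 200 + 200 = 900

# Rescaling the same weights to sum to the sample size (n = 5) gives the
# "normalized" weights discussed next; only the scale changes.
w_norm <- w * length(w) / sum(w)
sum(w_norm)         # = 5
```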
One way to scale the weights is so that they estimate finite population totals. Another way is to force them to sum to the sample size. You will see both of these methods used. Weights that are scaled to sum to the sample size are called normalized weights, and partly this is a holdover from the days when software for analyzing survey data was not readily available. The idea there was that if you do an analysis that uses degrees of freedom, the old software would tend to compute the degrees of freedom as the sum of the weights minus p, the number of parameters in the model, or something like that. The sum of the weights, if they were scaled to estimate finite population totals, would be enormous: in a country like the United States, the sum of the weights would be over 300 million if we're talking about people. You certainly don't have 300 million degrees of freedom if you've got a sample of size 1,000, say. If you normalize the weights, though, the sum of the weights minus p will be a more sensible number: the sample size minus p. So that was the general idea in doing it.

What we're going to deal with in this course are weights that are scaled to estimate population totals. That's the standard approach in most federal government agencies, where finite population totals, like the total number of people unemployed, are of major importance in a survey.

Now, why use the weights at all? Here is a little example just to illustrate the issue. One way to proceed, if you have a sample from a finite population, is just to ignore the weights entirely and hope that things come out all right. In this example we're going to estimate the prevalence of diabetes across a set of ethnic groups. Suppose the sample produces unbiased estimates for each ethnic group, but equal-sized samples are selected from each group, and the groups themselves are different sizes. This example refers to the US population, but the same thinking applies to other countries.

I've got several race-ethnicities here: non-Hispanic Whites, Asian Americans, Hispanics, non-Hispanic Blacks, and then American Indians and Alaska Natives, which is a very small group in the US. Among those groups, let's suppose that the proportion with diabetes is found to be the numbers in column B, ranging from about 0.076, or 7.6%, for non-Hispanic Whites up to a high of 0.159, or 15.9%, for American Indians and Alaska Natives. Column C gives the population proportions in those groups. I've got 65% non-Hispanic Whites, the biggest group in this table, and as you can see, American Indians and Alaska Natives are a very small proportion of the total even though their rate of diabetes is relatively high. If I take column B times column C, that gives me a population value of 0.0496 for non-Hispanic Whites; I add all of those products up, and my prevalence of diabetes in the population is 0.0930, or 9.3%.

Now, let's suppose further that the sample is evenly divided among these groups; that's in column E here. I've got two-tenths, or 20%, of my sample in each of those groups. You can draw samples that way; you just have to find the people and equalize the sample sizes. If I take column B times column E and sum those up, that's the equivalent of doing an unweighted analysis in this case. The answer that I get from the unweighted analysis is 0.117, or an 11.7% prevalence of diabetes in the population.
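To check that arithmetic, here is a short R sketch. Only the 0.076 and 0.159 prevalences and the 65% share are stated in this lecture; the remaining prevalences and shares below are illustrative stand-ins chosen to be plausible for the US, so small rounding differences from the slide's figures are to be expected.

```r
# Groups in order: non-Hispanic White, Asian American, Hispanic,
# non-Hispanic Black, American Indian/Alaska Native.
# Values not read out in the lecture are illustrative stand-ins.
prev  <- c(0.076, 0.090, 0.128, 0.132, 0.159)   # proportion with diabetes (column B)
share <- c(0.650, 0.055, 0.160, 0.123, 0.012)   # population share (column C)

sum(prev * share)   # weighted estimate: about 0.093, or 9.3%
mean(prev)          # unweighted estimate (equal 0.2 shares): 0.117, or 11.7%
```

The gap between 9.3% and 11.7% is the point of the example: with equal sample sizes per group, the unweighted mean over-represents the small, high-prevalence groups.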
But the actual population value, assuming that these proportions really are the population values and that my sample estimates the prevalences exactly right, is 9.3%. So I'm quite a ways off if I do the unweighted analysis, and that's the reason to use weights. If you get a sample that's disproportionately distributed across groups that matter in the population (they matter in this case because the prevalence of diabetes, which we're estimating, is much different across the groups), then ignoring the weights will give you biased estimates. So that's why we use weights. In later sections we'll talk about the details of how to compute them.