We spent a lot of time talking about how to come up with a question, how to frame it as a hypothesis that you can test, and how to analyze data to test it, but we haven't said much yet about best practices for collecting data. In this video, we'll turn our attention to that. There are basically only two rules for data collection. Rule number 1: it's better to have more observations rather than fewer, as long as they're relevant to your question. Rule number 2: when you're collecting a data sample, it's crucial that the data be a random sample. Let's discuss these in turn. More observations are better than fewer. We've already talked about why it's better to have more observations. You get more precise estimates, and you can be more confident about the results. There's less likelihood that your sample happens to be an unusual set of observations. Simply put, more is better. This picture is just to give you a flavor for this. Imagine that you work in a bank that has several hundred thousand customers. For wealth management, it matters a lot to know the average wealth of your customers. Let's assume that the true average wealth is $200,000, but we don't know that, and there's variation across people. This shows an example of the averages you're likely to draw from a sample of 100 people, or 1,000 people, or 10,000 people. Note that all samples are centered fairly close to $200,000. If you take a random sample of 100 people and get a very unusual draw, you could end up with a sample average that's tens of thousands of dollars away from the true average in either direction. As you move to a 1,000-person or 10,000-person sample, even an unusual random sample will have an average that's relatively close to the true average. Just two more thoughts about this. Although more observations of relevant data always enhance precision, this improvement is not linear. In other words, adding ten more observations doesn't always add the same amount of benefit. 
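You can simulate the narrowing spread of sample averages yourself. Here's a minimal sketch, assuming (hypothetically) that customer wealth is normally distributed around $200,000 with a $150,000 standard deviation; those specific numbers are illustrative, not from the example's actual data.

```python
import random
import statistics

random.seed(42)

TRUE_MEAN = 200_000   # hypothetical true average wealth
WEALTH_SD = 150_000   # hypothetical variation across customers

def sample_average(n):
    """Draw n customers' wealth at random and return the sample average."""
    draws = [random.gauss(TRUE_MEAN, WEALTH_SD) for _ in range(n)]
    return statistics.mean(draws)

# Repeat the sampling many times at each sample size to see how far
# sample averages typically stray from the true $200,000.
spreads = {}
for n in (100, 1_000, 10_000):
    averages = [sample_average(n) for _ in range(100)]
    spreads[n] = statistics.stdev(averages)
    print(f"n={n:>6}: sample averages spread about ${spreads[n]:,.0f} around the truth")
```

Running this shows exactly the pattern described: the 100-person samples wander tens of thousands of dollars from the truth, while the 10,000-person samples stay close.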
Specifically, your precision improves by the square root of your number of observations. So if you increase your sample from 10 people to 40 people, that's four times as many people, your precision is going to be twice as good, because the square root of four is two. If you then increase from 40 people to 70 people, which is another 30-person increase, your precision turns out to be about 30 percent better. If you then add yet another 30 people, so you end up at 100, your precision improves another 20 percent. The bottom line: if it's costly or time-consuming to collect data, then you should not feel too bad about stopping after a few hundred or a few thousand observations. But remember, if you're going to divide your data into subsets such as white women with high school education, white women with university education, black women with high school education, and so on, then your n for each of these subsets will be much smaller than your total n. You should collect data with those ultimate n's in mind. In your own organization, it's often easy to get data about every single employee. In that case, I would recommend collecting data on everybody. But for external research, such as getting feedback from potential customers about a new product design, there are millions of potential customers and it's costly to get feedback from any of them. Here is where you may want to think carefully about sample size. Now for point number 2: when you're collecting a data sample, it's important that the data be a random sample. What does this mean? For whatever population or sub-population you care about, the observations should be randomly drawn from that population. Ideally, this means that every person in the population or sub-population is equally likely to be picked for your sample. If you make it more likely that some people are sampled rather than others, then you're introducing bias into your sample. This is crucial. 
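The square-root arithmetic above can be checked directly. This is just the ratio sqrt(n_new / n_old), nothing specific to any dataset:

```python
import math

def relative_precision(n_old, n_new):
    """How much precision improves when the sample grows from n_old
    to n_new observations. Precision scales with the square root of n,
    so the improvement factor is sqrt(n_new / n_old)."""
    return math.sqrt(n_new / n_old)

print(relative_precision(10, 40))    # quadrupling n doubles precision: 2.0
print(relative_precision(40, 70))    # about 32% better
print(relative_precision(70, 100))   # about 20% better
```

Notice the diminishing returns: each extra 30 observations buys less than the last, which is why stopping at a few hundred or a few thousand is often reasonable.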
There's a lot of very cool statistical theory as to why random samples are so wonderful, but I'll just give a few real-world examples to convey this. Let me give you a couple of examples, and see whether you think they're random or not. Let's go back to 1936. The number 1 movie in the theaters is Modern Times, starring Charlie Chaplin. The number 1 song on the radio is Pennies from Heaven, sung by Bing Crosby. Franklin Delano Roosevelt is President and running for a second term as President of the United States. He ran that year against Alf Landon, who was governor of Kansas. Literary Digest was a magazine with huge circulation back then. The magazine conducted a massive poll of Americans to see who they planned to vote for. They polled 2.4 million people, and here's how they did it. They went through telephone directories and called people at random. They went through the membership directories of the American Auto Club, the US counterpart of the Canadian Auto Club, and randomly wrote to a subset of the members. What do you think of this study design? Let's start with the sample size: 2.4 million people. Does that sound like a lot to you? It's huge. Pollsters today typically survey 300, 500, or 1,000 people. If these people are randomly sampled from among all US voters, and if they provide accurate responses, then the true proportion voting for a candidate should be within one or two points of the numbers that are generated by the poll. Going to 2.4 million respondents narrows this further. Remember, though, the precision improves only with the square root of n, so the improvement may not be worth the cost. Now, how about the randomness of the sample? They randomly chose people from the telephone directories and randomly chose people from the Auto Club. Is this a good random sample of voters? No. Remember, this is 1936. Who owned automobiles in 1936? Not many of the common folk; mostly the wealthy. The same was true for telephones. 
The majority of Americans did not have a phone in their home in the 1930s. Although this poll randomly selected from the population of car owners and telephone owners, the respondents were nothing like a random sample of voters. Literary Digest made its prediction in fall 1936: Alf Landon would win with 57 percent of the vote; Roosevelt would have only 43 percent. The actual vote: Landon, 38 percent; Roosevelt, 62 percent. This was the most lopsided popular vote landslide in modern US history. Interestingly, a very young rookie pollster predicted the Roosevelt landslide. George Gallup rocketed to fame with his analysis published in 1936 that both predicted Roosevelt's win and meticulously pointed out all of the errors in the polling practice of Literary Digest and others. You might never have heard of Literary Digest, but I'm guessing you've heard of the Gallup Organization, which is still a preeminent polling organization today. That's one example of the bias that can creep in when you don't select randomly. Here's another example where you have to think very carefully about whether nature is presenting data to you randomly. Sometimes we call this thinking about the data generation process. During World War II, British airplanes that returned from bombing raids over Germany were often riddled with bullets from anti-aircraft fire. The planes that returned typically had bullet holes in the wings, but not the tail. The Royal Air Force's engineers realized that they could better fortify a few parts of the planes with additional armor to protect them from anti-aircraft shells. They couldn't fortify the entire plane because the extra weight would prevent the plane from taking off. The key question, a matter of life and death, was which parts of the airplanes should receive the better armor. What do you think? Protect the wings or protect the tail? 
The answer comes down to what you make of the fact that the planes that returned typically had bullet holes in the wings but not the tail. One possibility is that, for some reason, all of the planes that are hit are hit in the wings and rarely in the tail, and some of those planes manage to make it back. This is like seeing a random sample of all the planes that are hit. The other is that if a plane is hit in the tail, it crashes, while if it's hit in the wings, it can still fly home. This is like seeing a biased sample of all the planes that are hit. It's like only seeing the people with automobiles and telephones. The Royal Air Force recognized that the second of these was more likely to be true. Anti-aircraft shells could hit anywhere on a plane, and virtually all airplanes that were hit in the tail ended up crashing before making it back to England. The RAF added its protection to the tails of its planes. Planes still limped home with bullet holes in the wings, but more planes made it back because the tails were no longer damaged so as to cause the plane to crash. The key point here is to recognize why you are seeing the data that you're seeing. Are you obtaining a true random sample? Are you actually reaching the population that you care about in an unbiased way? Within your organization, this may be less of an issue. You usually know the universe of employees, you have data (or you should have data) on them, and you should know how to reach them if you want to get more information. Although even in this instance, we can have unrealized blinders that lead us astray. A few years ago, the Harvard Business Review carried an insightful article about how the Cleveland Clinic dramatically improved its stature as one of the world's preeminent hospitals for excellent care. A key aha moment came when the executives who were trying to understand why the Cleveland Clinic's patients had an ambivalent feeling about their care recognized that all aspects of the hospital, from the cleanliness of the halls and the rooms to the clarity of the signage and directions to the coordination of food delivery, contributed to the patient's perception of care. After this, the clinic actively engaged staff members who had previously been overlooked, such as janitors, catering staff, and guards, integrating them into problem-solving teams that had previously been solely the domain of doctors and occasionally nurses. This input turned out to be extraordinarily valuable, but it had never occurred to the clinic's executives that they should be soliciting input from these staff before. We should try not to fall into that blinder trap. Outside of your organization, this is typically a larger issue. You may not know all of your customers, so it can be difficult to tell whether you're receiving input from a representative sample or not. You certainly don't know who all of your potential customers are, and if some types of people are more reticent to provide information or are more difficult to reach than others, then it's easy to come up with a seemingly random sample that's actually biased. For example, my impression is that in recent years, relatively few younger people will pick up the telephone as opposed to texting. If this is accurate, then a telephone survey is likely to underrepresent younger respondents. More generally, there can be a range of reasons why you might find certain groups under-represented in your data collection. It may be the case that those for whom English or French is not their first language will be reluctant to respond due to feeling self-conscious about their language ability. My grandfather was a little bit like this, although he was also sufficiently opinionated that he ultimately overcame his reticence. It may be the case that those who have a greater desire for privacy will be more reluctant to respond. 
If women in general have a higher desire for privacy, let's say because they perceive or experience a higher degree of physical threat in general, then women may be under-represented in an apparently random data collection effort. It may be the case that people who are suspicious of organizations or governments will be reluctant to respond. If visible minorities perceive or experience a higher degree of conflict with government, for example, then a government-related organization might see visible minorities underrepresented in a random data collection. There might also be under-representation of people with libertarian beliefs if they are similarly wary of interacting with government officials. How can we deal with this? First, we have to anticipate or recognize such challenges in our data generation process. Next, we can make extra effort to collect data from the hard-to-reach or hard-to-engage folks. One final point. Occasionally, the two basic principles of data collection can seem to conflict when you plan to compare subsets of data. For example, imagine that you want to compare how people with different educational backgrounds value your proposed product. Toward this end, you would like to create subsets for women with PhDs, women with university degrees, women with high school degrees, and so on. Out of all of your customers, there aren't that many people with PhD degrees. If you randomly sample the entire population of customers, you may end up with a small number of PhD holders in your sample. This is tough. On one hand, you want to have a large n within each subset, including women with PhDs and men with PhDs. On the other hand, you want to sample randomly, and you don't want to have to sample 5,000 people just to get 100 with PhDs. One way that you can address this is to sample randomly within each category that you care about. You might take 1 percent of all women with university degrees, but 10 percent of all the women with PhDs. 
If your question is about how many people have different types of degrees, then you can't do this because you're choosing them based on their degree type. But if you're asking about anything else, then this is a way to get a large enough n for each of your subsets while still respecting the randomness principle.
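That within-category approach, sometimes called stratified sampling, is easy to sketch in code. The customer records and sampling rates here are made up purely for illustration:

```python
import random

random.seed(0)

# Hypothetical customer records: (customer id, education level).
# Many university-degree holders, few PhD holders, as in the example.
customers = (
    [(f"univ_{i}", "university") for i in range(5_000)]
    + [(f"phd_{i}", "phd") for i in range(100)]
)

def stratified_sample(records, rates):
    """Sample randomly *within* each stratum at that stratum's own rate,
    so small groups (like PhD holders) still yield enough observations."""
    sample = []
    for stratum, rate in rates.items():
        group = [r for r in records if r[1] == stratum]
        k = round(len(group) * rate)
        sample.extend(random.sample(group, k))
    return sample

# 1 percent of university-degree holders, 10 percent of PhD holders.
picked = stratified_sample(customers, {"university": 0.01, "phd": 0.10})
print(len([r for r in picked if r[1] == "university"]))  # 50
print(len([r for r in picked if r[1] == "phd"]))         # 10
```

Within each stratum the draw is still fully random; what changes across strata is only the sampling rate, which is why the randomness principle survives (as long as your question isn't about the strata themselves).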