[MUSIC] We'll conclude week five with a couple of applications of statistical inference using sample proportions. Specifically, we'll do an example of confidence intervals, and another example looking at hypothesis testing. And as we proceed through these, please do look back at the analogous cases when doing inference based on the sample mean when estimating a population mean from a normal distribution. So in the previous section we saw the central limit theorem, and a very special application of this when considering a sample proportion, i.e., under the conditions of Bernoulli sampling. So when might we seek to apply these kinds of ideas in real life? An excellent example would be that related to opinion polling in the run-up, let's say, to some political election. And indeed, next time you look in the media at the results of some opinion poll, please do look at the small print, where it should explain in detail what the margins of error are in any estimation of the popularity of particular presidential candidates or political parties, as well as the general statistical methodology employed. So we begin with confidence intervals. So let's imagine there was a presidential election in some country and, of course, we cannot feasibly ask all members of the electorate what their voting intentions would be, because the ultimate opinion poll is on election day itself, when those who want to participate actually turn out and cast their votes. We then count the votes, and we can see and declare who the winner is. Of course, not everyone will exercise their democratic right to vote, so we'll need to consider turnout issues as well. But nonetheless, in the run-up to this election, let's suppose we'd like some sense of how the various candidates are performing in the polls. Indeed, we need to conduct some opinion polls. So usually in political science, we would tend to have sample sizes of the order of magnitude of about 1,000. Why?
Well, let's imagine we did a random sample of 1,000 voters and we asked them, would you vote for candidate A or would you not vote for candidate A? Of course, the non-candidate A voters could be voting for one of several other candidates. So let's say we're trying to assess the popularity purely of candidate A. And imagine 630 of the respondents said that they would vote for candidate A. So if we take 630 over 1,000, that represents the sample proportion of those supporting candidate A, giving a sample proportion there of 0.63. Now of course, in most democracies that would be an almost unheard-of level of popularity, but we'll work with it nonetheless. So 0.63, or if we multiply it by 100 we could express it as a percentage, 63%, let's say, represents our point estimate, our best guess of this candidate's level of support. But we know that there is a potential here for sampling error. Who's to say our 1,000 voters whom we've sampled, albeit sampled randomly, just how representative are they of the electorate as a whole? We may be somewhat suspicious about it being purely, perfectly representative. So what will we do? We will convert our point estimate into a confidence interval. For that, we're going to need a formula. So before we look at it related to our sample proportion, let's backtrack a moment and remind ourselves of the confidence interval formula when estimating a population mean. There we took our sample mean x-bar plus or minus our margin of error, which took the form of z times sigma over root n, assuming sigma was known; otherwise we replace it with the sample standard deviation, s. And we noted those three components to the margin of error, namely our level of confidence reflected in the z value, the variability in the population as a whole, represented by sigma, or proxied by s if we had to work with the sample standard deviation, and also our sample size n.
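As a quick refresher, that mean-based interval can be sketched in a few lines of Python; the numbers plugged in below are purely illustrative assumptions, not figures from the lecture:

```python
import math

def mean_ci(xbar, sigma, n, z=1.96):
    """Confidence interval for a population mean: x-bar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)   # z times the standard error of the mean
    return xbar - margin, xbar + margin

# Illustrative numbers (assumed, not from the lecture): xbar = 50, sigma = 10, n = 100
lo, hi = mean_ci(50, 10, 100)
print(round(lo, 2), round(hi, 2))   # margin of error is 1.96 * 10 / 10 = 1.96
```

If sigma is unknown, the same sketch applies with the sample standard deviation s substituted in its place, as noted above.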
Now, you may recall in our section on a hypothesis test for a single mean, we called sigma over root n the standard error, i.e., the standard deviation of our sample mean. So using that terminology, we can now consider our margin of error as really being the product of two components: a confidence coefficient, this z value, which we said took different values depending on the level of confidence we wanted, multiplied by the standard error. So when we now relate this to the case of a sample proportion, all we're going to do is replace x-bar, our generic sample mean notation, with p, our sample proportion notation, plus or minus our margin of error. So remember with the central limit theorem, we said this approximation held particularly well asymptotically as n tends to infinity. So in any kind of opinion polling, we will require quite a large sample size. Now, how large is large? We saw a few simulation results previously, where we saw that for about 50 observations or more, the normal approximation by the CLT was reasonably good. So let's say 50 or above constitutes large. So our margin of error has two components. Our confidence coefficient z, where we can use the standard normal distribution due to our use of the central limit theorem, times the corresponding standard error, which was sigma over root n when estimating a population mean. The equivalent when estimating the population proportion, which is very much what we're doing in opinion polling, estimating the proportion of support for some presidential candidate, the corresponding standard error here will be the square root of p(1-p)/n. So yet again, the margin of error has those three components: a confidence coefficient z, which will increase as we increase the level of confidence, but remember the costs and benefits, the trade-offs involved. Other things equal, it's good to be more confident, but it comes at the cost of a larger z, hence a larger margin of error.
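Putting this together for the candidate A numbers above (630 supporters out of 1,000 sampled, at the conventional 95% level of confidence, so z = 1.96), a minimal sketch might look like this:

```python
import math

def proportion_ci(p, n, z=1.96):
    """Confidence interval for a population proportion via the CLT approximation."""
    se = math.sqrt(p * (1 - p) / n)   # standard error of the sample proportion
    margin = z * se                   # margin of error: confidence coefficient times standard error
    return p - margin, p + margin

# Candidate A: 630 supporters out of a random sample of 1,000 voters
p = 630 / 1000
lo, hi = proportion_ci(p, 1000)
print(f"{lo:.3f} to {hi:.3f}")   # roughly 0.600 to 0.660, i.e. 63% plus or minus about 3 points
```

Note how the margin of error here works out to roughly 3 percentage points, exactly the polling convention discussed next.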
And hence, a wider and less precise confidence interval. The sample size n, again under our control. As the sample size n increases, the margin of error decreases and the confidence interval becomes narrower. And p times 1 minus p, that term really reflects the amount of variation in the population, or, technically speaking, the variation occurring within our sample. Now in most opinion polls, sample sizes tend to be of the order of magnitude of about 1,000. Why? Well, in short, that tends to ensure a margin of error of approximately 3 percentage points if we are interested in a 95% level of confidence, which is the general convention we opt to follow. And indeed, that 3 percentage point margin of error, hence leading to a confidence interval of width 6 percentage points, because it's the sample proportion p minus 3% and plus 3%, is deemed to be an acceptable tolerance on our estimation error within an opinion polling environment. So a little example there of point estimation leading into interval estimation of a population proportion. And we'll just round off our discussion with the analogous case of a hypothesis test for a population proportion, and relate it back to our discussion of the hypothesis test for a single population mean. So now we're not testing the mean of, let's say, a normal distribution. Rather, we're testing this Bernoulli parameter, i.e., the proportion of a population of a particular type. So our null hypothesis would represent some claim about the value of this parameter, e.g., that pi is equal to 0.5, or 0.4, whatever we wish to test against some appropriate alternative hypothesis. Now for the case of this introductory MOOC, we'll just look at an alternative hypothesis which considers all values other than that specified in the null hypothesis. So if our null is that pi is equal to 0.4, our alternative hypothesis would be that pi is not equal to 0.4.
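Why about 1,000? A short sketch makes the point: the margin of error z times the square root of p(1-p)/n is widest when p = 0.5, so evaluating it there shows the worst case a given sample size guarantees at 95% confidence.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion at the confidence level set by z."""
    return z * math.sqrt(p * (1 - p) / n)

# p(1 - p) is maximised at p = 0.5, so this is the worst case for each sample size:
for n in (100, 1000, 10000):
    print(n, round(100 * margin_of_error(0.5, n), 1), "percentage points")
# n = 1,000 caps the worst-case margin at roughly 3.1 percentage points
```

Going from 1,000 to 10,000 respondents only trims the margin from about 3 points to about 1, which is why pollsters rarely pay for the tenfold larger sample.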
Allowing for values both greater and less than the designated value in the null hypothesis. So when we did a test for a mean, we created a standardized variable where we standardized x-bar by subtracting mu, the hypothesized value under the null hypothesis, divided by its standard error of sigma over root n. And we said this transformation led to a standard normal random variable. Again, appealing to our central limit theorem approximation, rather than standardizing x-bar we now seek to standardize the sample proportion p, but we proceed in the same way. Namely, from the approximate sampling distribution of p, we will subtract its mean, pi, and divide by its standard error, i.e., the square root of pi times 1 minus pi over n. So that will be our test statistic formula. And given some numeric data from our observed sample, we'll be able to calculate the value of our test statistic and, subsequently, its p-value. So I'll just perhaps round off with a simple numerical example. Let's say we did an opinion poll. And historically it was known that some candidate achieved 40% support in the polls. So we may wonder whether this candidate has achieved any change in popularity. Of course, change can be in one of two ways: become more popular, or indeed, less popular. We want to perhaps try and detect any change based on a random sample of 1,000 voters. So our null hypothesis is that pi is equal to 0.4, so no change in support, against the alternative hypothesis that there has been some change, maybe becoming more or less popular. And suppose, out of these 1,000 voters, let's say 44%, i.e., 440, indicated that they supported this candidate. So now we have a sample proportion p of 0.44, a hypothesized value of pi under the null hypothesis of 0.4, and a sample size n of 1,000. Do the number crunching and this will give you a standardized value, a so-called z score.
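That number crunching can be sketched as follows; the two-sided p-value here comes from the standard normal CDF, written via math.erf since the CLT approximation is what licenses the normal distribution in the first place:

```python
import math

def z_stat(p, pi0, n):
    """Test statistic (p - pi0) / sqrt(pi0 * (1 - pi0) / n), standard error taken under H0."""
    return (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)

def norm_cdf(x):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = z_stat(0.44, 0.40, 1000)          # sample proportion 0.44, H0: pi = 0.4, n = 1,000
p_value = 2 * (1 - norm_cdf(abs(z)))  # two-sided alternative: pi not equal to 0.4
print(round(z, 2), round(p_value, 3)) # z is about 2.58; the p-value is close to 0.01
```

Note that the standard error uses the hypothesized pi of 0.4, not the observed 0.44, because the test statistic's distribution is derived assuming the null hypothesis is true.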
And as we saw with our test of a population mean, a z score of this sort of order of magnitude leads to a p-value of around about 1%, in that sort of ballpark. So based on our p-value interpretation rule, remember the unit interval from 0 to 1 and a significance level of 5%: if our p-value falls far below that 5%, we are very comfortable rejecting that null hypothesis. And hence in this case, this would be indicative that the presidential candidate's popularity has changed. Indeed, we may infer that there's been an increase in popularity, given that sample proportion. [MUSIC]