Previously we talked about statistical significance. But in genomic studies, you're often considering more than one data set at a time. In other words, you might be analyzing the expression of every one of the genes in your body, or you might be looking at hundreds of thousands or millions of variants in the DNA, or many other multiple-testing scenarios. In these scenarios, what you're doing is calculating a measure of association between some phenotype that you care about, say cancer versus control, and every single data set that you collected, say a data set for each possible gene. What's happened is that people are still applying the hypothesis testing framework, using P-values and things like that, but the issue is that that framework wasn't built for doing many, many hypothesis tests at once.

If you remember when we talked about what a P-value was, it's the probability of observing a statistic as extreme as, or more extreme than, the one you calculated in the original sample. One property of P-values that's very important, and that we should pay attention to, is that if there's nothing happening, suppose there's absolutely no difference between the two groups that you're comparing, the P-values are what's called uniformly distributed. So this is a histogram of some uniformly distributed data: on the x-axis you see the P-value, and on the y-axis is the frequency of P-values that fall into each bin. What a uniform distribution means is that 5% of the P-values will be less than 0.05, 20% of the P-values will be less than 0.20, and so forth. In other words, when there is no signal, the P-value distribution is flat.

So what does that mean? How does that play a role in a multiple testing problem? Here's an example with a cartoon. Imagine that you're trying to investigate whether jelly beans are associated with acne. What you could do is perform a study where you compare people who eat a lot of jelly beans and people who don't eat a lot of jelly beans, and look to see if they have acne or not. If you do that, you probably won't find anything. So at the first test, people go ahead and collect the data on the whole sample, they calculate the statistic, the P-value is greater than 0.05, and they conclude there's no statistically significant association between jelly beans and acne. But then you might consider, oh, well, it might be just one kind of jelly bean. So you could go back and test brown jelly beans and yellow jelly beans and so forth, and in each case, most of the time, the P-value would be greater than 0.05, so it would not be statistically significant and you wouldn't report it. But since P-values are uniformly distributed when there's nothing going on, about one out of every 20 tests that you do, even if there's absolutely no association between jelly beans and acne, will still show up with a P-value less than 0.05. The danger is that you do these many, many tests, find the one with a P-value less than 0.05, and report just that one. So here's an example where there's a news article saying that green jelly beans have been linked to acne. Again, they're reporting a statistical significance measure that was designed for performing one hypothesis test, when in reality they performed many.
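To make that "one in 20" intuition concrete, here is a minimal sketch, not from the lecture itself, that simulates many tests where the null hypothesis is true for every single one. The number of tests and the group sizes are arbitrary choices for illustration, and it assumes numpy and scipy are available.

```python
# Sketch: when nothing is going on, P-values are uniform, so about 5% of them
# fall below 0.05 even though there is no real signal anywhere.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 10_000       # one test per hypothetical gene (or jelly bean color)
n_per_group = 30       # samples per group

pvals = []
for _ in range(n_tests):
    # Both groups come from the same distribution, so the null is true for every test
    x = rng.normal(size=n_per_group)
    y = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(x, y)
    pvals.append(p)

pvals = np.array(pvals)
print("Fraction of P-values < 0.05:", (pvals < 0.05).mean())  # close to 0.05
print("Fraction of P-values < 0.20:", (pvals < 0.20).mean())  # close to 0.20
```

A histogram of these simulated P-values would look flat, just like the uniform distribution described above.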
So how do we deal with this? How do we adapt the hypothesis testing framework to the situation where you're doing many hypothesis tests? The way that we do that is with different error rates. The two most commonly used error rates that you'll probably hear about when doing a genomic data analysis are the family-wise error rate and the false discovery rate. The family-wise error rate says that if we're going to do many, many hypothesis tests, we want to control the probability that there will be even one false positive. This is a very strict criterion: if you find many things that are significant at a family-wise error rate that's very low, you're saying that the probability of even one false positive is very small. The other very commonly used error measure is the false discovery rate. This is the expected number of false positives divided by the total number of discoveries. What does this do? It quantifies, among the things that you're calling statistically significant, what fraction of them appear to be false positives. The false discovery rate is often a little bit more liberal than the family-wise error rate: you're not controlling the probability of even one false positive, you're allowing some false positives in order to make more discoveries, but you're quantifying the error rate at which you're making those discoveries.

To interpret these error rates you have to be very careful, because they have different interpretations. You do different things to the data, but you also have to interpret the results differently. Just because you find more statistically significant results when you use the false discovery rate than when you use the family-wise error rate, it doesn't mean that magically, all of a sudden, there were more results that were truly different. It just means that there's a different interpretation of the analysis that you do.

So I'm going to give you a very simple example. Suppose you're doing a differential gene expression analysis with 10,000 genes, and you discover that 550 of those genes are significant at the 0.05 level. Now, I said the 0.05 level, but imagine that 0.05 level means one of three different things. First, suppose those 550 were discovered just by thresholding the P-values at 0.05; in other words, we got those 550 by calling all P-values less than 0.05 significant. In this case, remember we said the P-values are uniform when there's nothing going on, so we would expect about 0.05 times 10,000, or 500, false positives out of a total of 550 discoveries. Even though we found statistically significant results, they might mostly be false positives. Alternatively, suppose that when we declared those 550 to be significant we were controlling the false discovery rate at 0.05. In this case we're quantifying the rate of errors among the discoveries that we've made, so about 5% times the 550 things we discovered, or about 27.5, would be expected to be false positives. We discovered the same number of things, but using a different error rate means we controlled the errors at a much lower level than if we had just thresholded the P-values at 0.05. Finally, suppose we used the family-wise error rate. If we had found 550 genes differentially expressed out of 10,000 at a family-wise error rate control of 0.05, that means the probability of even one of those 550 being a false positive is less than 0.05.
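To see how these error rates play out in practice, here is a small sketch with made-up numbers rather than the lecture's example. It compares a raw P < 0.05 cutoff, a Bonferroni correction (one common way to control the family-wise error rate), and the Benjamini-Hochberg procedure (one common way to control the false discovery rate). The split into 9,500 null genes and 500 truly different genes, and the beta distribution used for the signal P-values, are assumptions made purely for illustration.

```python
# Sketch: three different meanings of "significant at 0.05" on simulated tests.
import numpy as np

rng = np.random.default_rng(1)
n_null, n_signal = 9_500, 500                  # hypothetical: 10,000 genes, 500 truly different
p_null = rng.uniform(size=n_null)              # null P-values are uniform
p_signal = rng.beta(0.1, 10, size=n_signal)    # true signals tend to have small P-values
pvals = np.concatenate([p_null, p_signal])
is_null = np.concatenate([np.ones(n_null, bool), np.zeros(n_signal, bool)])

alpha, m = 0.05, pvals.size

# 1) Raw threshold: expect roughly alpha * n_null false positives
raw = pvals < alpha

# 2) Bonferroni: controls the family-wise error rate at alpha
bonf = pvals < alpha / m

# 3) Benjamini-Hochberg: controls the false discovery rate at alpha
order = np.argsort(pvals)
ranked = pvals[order]
below = ranked <= alpha * np.arange(1, m + 1) / m
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
bh = np.zeros(m, bool)
bh[order[:k]] = True

for name, rej in [("raw P < 0.05", raw), ("Bonferroni (FWER)", bonf), ("BH (FDR)", bh)]:
    print(f"{name}: {rej.sum()} discoveries, {(rej & is_null).sum()} false positives")
```

Running this shows the pattern described above: the raw cutoff makes many discoveries but most are false, the false discovery rate approach makes fewer discoveries with a bounded fraction of errors, and the family-wise approach makes the fewest discoveries but almost never includes a false positive.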
So that means that almost all of them would probably be true positives. In this case, we've illustrated three different ways that you could calculate statistical significance, and in each case "statistically significant" means something totally different depending on what error rate you're controlling.

One last thing to consider when looking at multiple hypothesis tests is an almost inevitable scenario. Everybody who's done some real science has run into the situation where the P-value they calculated is just greater than 0.05. The natural reaction is to be very sad and to think, game over, I've got to start all over again because my P-value is greater than 0.05. It's a really good idea not to do that. First of all, it's important to report negative results, even if you can't get them into the best journals, to avoid what's called publication bias. But more importantly, it's important to be careful to avoid P-value hacking. A very typical email a statistician might get after reporting a P-value greater than 0.05 is this one that my friend Ingo got. It said, "Curse you, Ingo! Yet another disappearing act!" because the P-value was greater than 0.05 after doing some correction. While this is a joke and it was totally said in jest, in general there can be pressure to try to make more discoveries look statistically significant. It's very important to avoid that temptation, because you'll run into something called P-value hacking. In general, P-value hacking means doing things to the data, or changing the way that you do the calculations, in order to manufacture a statistically significant result even when your original analysis didn't produce one. There's an example of a paper where people took a very simple simulated data set, made very sensible-seeming changes to that data set and to the statistical methods they used, and turned almost any result into a statistically significant one. A way to avoid this is to specify a data analysis plan in advance of looking at the data, and stick to it.
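As a rough illustration of why P-value hacking is dangerous, here is a sketch, not the analysis from the paper mentioned above, that simulates data with no true group difference and then "tries" several analysis choices, keeping the best P-value. The specific alternative analyses (switching tests, subsetting, transforming the data) are hypothetical choices made only for this example.

```python
# Sketch: picking the best P-value from several analysis choices inflates the
# false positive rate well above the nominal 5%, even when nothing is going on.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n = 2_000, 40
honest_hits = 0
hacked_hits = 0

for _ in range(n_sims):
    x = rng.normal(size=n)
    y = rng.normal(size=n)          # no true difference between the groups
    candidate_p = [
        stats.ttest_ind(x, y)[1],                                   # the planned analysis
        stats.mannwhitneyu(x, y, alternative="two-sided")[1],       # switch the test
        stats.ttest_ind(x[:20], y[:20])[1],                         # "peek" at a subset
        stats.ttest_ind(np.exp(x), np.exp(y))[1],                   # transform the data
    ]
    honest_hits += candidate_p[0] < 0.05
    hacked_hits += min(candidate_p) < 0.05

print("Planned analysis, false positive rate:", honest_hits / n_sims)      # about 0.05
print("Best of four analyses, false positive rate:", hacked_hits / n_sims)  # noticeably higher
```

Sticking to a pre-specified analysis plan corresponds to reporting only the first entry in that list, which keeps the error rate at its advertised level.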