This lecture is about experimental design, and in particular about sample size and variability. If you remember from the previous lecture, the central dogma of statistics is that we have this big population, and it's expensive to take whatever measurement we want, genomic or otherwise, on that whole population, so we take a sample with probability. Then on that sample we make our measurements and use statistical inference to say something about the population. We also talked a little bit about how the best guess that we get from our sample isn't all that we get; we also get an estimate of variability. So let's talk a little bit about variability and what its relationship is to good experimental design. There's a sample size formula that you may have heard of: if N is the number of measurements that you could take, or the number of people that you could sample, then, since scientific research often runs on grant money, N ends up being the number of dollars that you have divided by how much it costs to make a measurement. While this is one way to arrive at a sample size, it's maybe not the best way. The real idea behind sample size is to understand variability in the population. Here's a quick example of what I mean by that. Here are two synthetic, made-up data sets, one for Y and one for X. The measurement values are on the x-axis, and the two data sets, Y and X, are shown on the y-axis. I have two lines here: the red line is the mean of the Y values and the blue line is the mean of the X values. What you can see is that the means are different from each other, but there's also quite a bit of variability around those means. Some measurements are lower and some measurements are higher, and the two data sets overlap. So the question is: if the two means are different, how confident can we be about that?
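To make the idea concrete, here's a small simulated sketch (my own illustration with made-up numbers, not the lecture's actual data sets) of two groups whose means differ but whose individual values overlap:

```python
import random
import statistics

random.seed(1)

# Two hypothetical data sets: different true means, the same spread.
x = [random.gauss(10, 3) for _ in range(50)]
y = [random.gauss(14, 3) for _ in range(50)]

print(f"mean(x) = {statistics.mean(x):.2f}, sd(x) = {statistics.stdev(x):.2f}")
print(f"mean(y) = {statistics.mean(y):.2f}, sd(y) = {statistics.stdev(y):.2f}")

# The means differ, but some x values are higher than some y values,
# so the ranges overlap -- that overlap is why we need statistics to
# decide how confident we can be that the means really differ.
print("ranges overlap:", max(x) > min(y))
```

The question in the lecture is exactly about this overlap: given the spread around each mean, how sure can we be that the difference in means is real?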
If we know what the variation is around the measurements that we've taken and the means that we have, how confident can we be that these two means are different from each other? This comes down to how many samples you need to collect, and how much variability you expect to observe, to be able to say whether the two things are different or not. The way that people do this in advance, as part of experimental design, is with power. Basically, power is the probability that, if there's a real effect in the data set, you'll be able to detect it. It depends on a few different things: it depends on the sample size, it depends on how different the means are between the two groups, like we saw with the red and the blue lines, and it depends on how variable the measurements are, like the variation we saw around the means in both the X and the Y data sets. This is actually code from the R statistical programming language. You don't have to worry about the code in this lecture, but you can see, for example, that if we want to do a t-test comparing the two groups, which is a certain kind of statistical test, the probability that we'll detect an effect of size 5 (that's the delta there), with a standard deviation of 10 in each group and 10 samples per group, is about 18%. So it's not very likely that we'll detect the effect even if it's there. But you can also turn the calculation around and say that, as is customary, we want 80% power; in other words, we want an 80% chance of detecting an effect if it's really there. For an effect size of 5 and a standard deviation of 10, we can calculate back out how many samples we need to collect. In this case the calculation tells us we need 64 samples in each group in order to have an 80% chance of detecting this particular effect size.
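The R code from the slide isn't reproduced in this transcript, but the same two calculations can be sketched in Python using a normal approximation to the t-test's power. Because it's an approximation, the numbers come out close to, but not exactly equal to, the lecture's quoted 18% power and 64 samples per group:

```python
from math import sqrt, erf, ceil

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Standard normal quantiles (alpha = 0.05 throughout).
Z_TWO_SIDED = 1.959964  # upper 0.025 quantile
Z_POWER_80 = 0.841621   # upper 0.20 quantile

def approx_power(n, delta, sd):
    """Approximate power of a two-sided, two-sample test with n samples
    per group, true mean difference delta, and common standard deviation sd."""
    se = sd * sqrt(2.0 / n)  # standard error of the difference in means
    return norm_cdf(delta / se - Z_TWO_SIDED) + norm_cdf(-delta / se - Z_TWO_SIDED)

def approx_n(delta, sd):
    """Approximate per-group sample size needed for 80% power."""
    return ceil(2.0 * ((Z_TWO_SIDED + Z_POWER_80) * sd / delta) ** 2)

print(approx_power(n=10, delta=5, sd=10))  # about 0.20 (lecture's exact t-test answer: ~18%)
print(approx_n(delta=5, sd=10))            # 63 (lecture's exact t-test answer: 64)
```

The exact t-based calculation the lecture describes accounts for the extra uncertainty from estimating the standard deviation, which is why its answers (18% and 64) are slightly more pessimistic than this normal-approximation sketch.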
Similarly, you can do the calculation by asking how many samples you need in each group if you're only going to test in one direction or the other. Suppose I know that expression levels will always be higher in the cancer samples than in the control samples. Then it's possible to collect fewer samples and still get the same power, because you have a little bit more information. Later statistics classes will talk more about power and how you calculate it, but the basic idea to keep in mind is that power is actually a curve. It's never just one number, even though you might hear 80% thrown around quite a bit when talking about power. In this plot, I'm showing on the x-axis all the different potential sizes of an effect. The effect could be 0, which is the center of the plot, or it could be very high or very low, and on the y-axis is the power for different sample sizes. The black line corresponds to a sample size of 5, the blue line to a sample size of 10, and the red line to a sample size of 20. As you move out from the center of the plot, the power goes up: the bigger the effect, the easier it is to detect. Also, as the sample size goes up, moving from the black to the blue to the red curve, you get more power as well. So as you vary these different parameters, you get different power, and a power calculation is a hypothetical calculation based on what you think the effect size might be and what sample size you can get. It's important to pay attention, before performing a study, to the power that you might have, so you don't run the study and end up at the end of the day unable to detect a difference even when there really was one there.
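Continuing the same normal-approximation sketch (an illustration, not the lecture's exact R code), you can see both points: a one-sided test needs fewer samples for the same power, and power traces out a curve as the effect size and sample size vary:

```python
from math import sqrt, erf, ceil

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

Z_POWER_80 = 0.841621  # upper 0.20 standard normal quantile

def approx_n(delta, sd, z_alpha):
    # Per-group n for 80% power, normal approximation.
    return ceil(2.0 * ((z_alpha + Z_POWER_80) * sd / delta) ** 2)

# A two-sided test uses the upper 0.025 quantile; a one-sided test, knowing
# the direction in advance, uses the upper 0.05 quantile instead.
n_two_sided = approx_n(delta=5, sd=10, z_alpha=1.959964)
n_one_sided = approx_n(delta=5, sd=10, z_alpha=1.644854)
print(n_two_sided, n_one_sided)  # 63 50 -- the one-sided test needs fewer samples

def approx_power(n, delta, sd=10, z_alpha=1.959964):
    # One-tail normal approximation; adequate away from delta = 0.
    se = sd * sqrt(2.0 / n)
    return norm_cdf(abs(delta) / se - z_alpha)

# Power is a curve over effect sizes, with one curve per sample size,
# like the black (n=5), blue (n=10), and red (n=20) curves in the plot.
for n in (5, 10, 20):
    curve = [round(approx_power(n, d), 2) for d in (-10, -5, 0, 5, 10)]
    print(f"n = {n:2d}: {curve}")
```

Each printed row is one curve from the plot: power rises as the effect size moves away from zero in either direction, and the whole curve shifts up as the sample size grows.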
So there are three types of variability. We've been talking about variability in terms of the sampling variability that you get when you take a sample and then look at how it relates to the population, but there are actually three kinds that are very commonly considered when performing experiments in genomics. The variability of a genomic measurement can be broken down into three types. The first is phenotypic variability: imagine you're doing a comparison between cancers and controls; then there's variability between the cancer patients and the control patients in their genomic measurements. This is often the variability that we care about, because we want to detect differences between groups. The second is measurement error: all genomic technologies, whether they measure gene expression, methylation, or the alleles in a DNA study, measure with error, so we have to take into account how well the machine actually makes the measurement, how we quantify the reads, and so forth. The third is a component of variation that often gets ignored or missed, which is natural biological variation. For every kind of genomic measurement that we take, there's natural variation between people. Even if you have two people who are healthy and have the same phenotypes in every possible way, who are the same sex, the same age, and eat the same breakfast, there is still going to be variation between them, and that natural biological variability has to be accounted for when performing statistical modeling as well. An important consideration is that when a new technology appears, there's often a rush to claim that it's so much better than the previous technology, and one way people do that is by claiming that the variability is much lower. That may be true for the technical, or measurement error, component of variability, but it doesn't eliminate biological variability.
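One way to see that last point is with a small simulation (hypothetical numbers, purely for illustration): model each measurement as a person-specific true level (natural biological variation) plus technology noise (measurement error), and notice that shrinking the technology noise can't push the total variability below the biological component:

```python
import random
import statistics

random.seed(42)

def measure(group_mean, bio_sd, tech_sd, n_people=500):
    """Simulate one gene across n_people: each person has their own true
    level (biological variation), measured with technology noise on top."""
    values = []
    for _ in range(n_people):
        true_level = random.gauss(group_mean, bio_sd)     # biological variation
        values.append(random.gauss(true_level, tech_sd))  # measurement error
    return values

# Hypothetical noise levels for an older and a newer technology.
old_tech = measure(group_mean=100, bio_sd=5, tech_sd=5)  # noisier measurement
new_tech = measure(group_mean=100, bio_sd=5, tech_sd=1)  # much lower measurement error

# Total sd shrinks toward the biological floor (sd = 5 here) but not below it:
print(round(statistics.stdev(old_tech), 2))  # near sqrt(25 + 25) ~ 7.1
print(round(statistics.stdev(new_tech), 2))  # near sqrt(25 + 1)  ~ 5.1
```

The variances add, so better technology removes only its own term; the biological term stays, which is exactly what the sequencing-versus-array comparison below shows with real data.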
Here I'm showing an example of that. There are four plots in this picture. The top two plots show data that was collected using next generation sequencing, and the bottom two plots show data that was collected with microarrays, an older technology. Each dot corresponds to a sample, and it's the same samples in all four plots. What you can see is that for the gene on the left, colored pink, there's lower variability across people, and this is true whether you measure it on the top with sequencing or on the bottom with arrays. Similarly, the gene on the right, which I've colored in blue here, is highly variable whether it's measured with sequencing or with arrays. What this suggests is that biological variation is a natural phenomenon that is always a component of genomic data, and it does not get eliminated by new technology. So what we talked about here is variability and sample size calculations and how those things relate. One of the most important components of statistics is paying attention to how variation exists both in your sample and in the population that you measure.