Another important component of study design and experimental design is confounding and batch effects. So, what is confounding? I'm going to give you an example using a very simple data set. So, this is a picture of me and my son. My son is three years old, he has small shoes, and he's not very literate yet. I have bigger shoes, and I guess you could say I am somewhat literate. So using this data set, we might conclude that shoe size is associated with literacy. But is that really true? Do we really believe that small shoes equals low literacy and big shoes equals high literacy?
The reason why we might not believe this is that there's actually one piece of data that we hadn't included in our analysis. My son is relatively young, and I'm middle-aged. And it turns out that age is more closely causally related to literacy. So if you make a plot of the relationship between shoe size, literacy, and age, you see that age is related to shoe size. When you're young you have small shoes, and when you're older you have bigger shoes. And it's also related to literacy. When you're young, you're not very literate, and when you're older, you become more literate. And so this variable that's related to both shoe size and literacy is what's called a confounder. So a confounder is a variable that's related to two other variables, and may make it look like there's a relationship between those variables even when there isn't.
So, this is actually a very common problem in genomics, and the most common confounder, the one that trips up the most people, is what's called batch effects. And so, here's an example of a batch effect. This is a paper that was originally published looking for differences in gene expression between ethnic groups. It reported that 78% of genes were differentially expressed between the two ethnic groups. You can see that in the p-value histogram on the lower left. There are tons of tiny p-values.
So that's lots of genes that look like they're differentially expressed between the two groups. This seems like a really big and important result, because there's actually very little genetic variation between different ethnic groups. So it's surprising that almost all of the genes are differentially expressed. It turns out that if you go back and look at when the data were collected, all of the samples that come from Europeans were collected in 2003, 2004 and 2005, whereas all of the samples that come from Asians were collected later, in 2006. There's just enough overlap that you can kind of distinguish between the genes that are different because of the date and the genes that are different because of the population. So if you look at the between-population differences, it looks like 78% of genes are differentially expressed. If you look at the differences between the years when the samples were taken, 96% of the genes are differentially expressed. And once you adjust for the fact that the samples were taken in different years, all the difference between the populations goes away.
So this is what's called a batch effect. It basically suggests that there's a confounder, which is the date that the samples were taken. Why would the date matter? Well, the technology might change, the assays might change, the aliquot that they take might change, or maybe the freezer broke in between the two sets of samples. There are a number of reasons that the date might be associated with differential expression, and in fact in almost every study, this is a major effect.
So it's not true just in gene expression studies, it's also true in genetic studies. This is another big genetic study, which looked for relationships between SNPs, single nucleotide polymorphisms, and human longevity. And they claimed there was a small set of genes that would predict whether you would live to be 100 or not.
But it turns out they measured all the younger people with one technology and all the older people with another technology, and this study was subsequently retracted. Similarly, there's an example from proteomics, where a predictor was developed based on proteomic patterns to predict ovarian cancer, and in this case it also fell apart, largely because of study design. The ovarian cancer patients were sampled at a different time than the healthy patients. And so it was impossible to distinguish whether the signal was due to the confounder, batch, or whether it was due to the actual difference in biology that we care about. This ends up being a huge problem, and it affects many technologies. And so, this is a paper where there's a discussion of how batch effects impact almost every genomic measurement.
How do we deal with these potential confounders? One way is randomization. So imagine, for example, that we are trying to do a comparison here. Without randomization, this is what you might see. What I have is experimental units, where the samples are shown as circles. The treatments they might be given are the red and green circles around those samples, in the right-hand column. Now suppose that there's another confounding variable. It might be the age or the date or whatever variable you might consider. Here, in this case, the date or the age is related to the treatment. In other words, the darker circles more often get the green treatment, and the lighter circles more often get the red treatment. One way that you could address this is by simply randomly assigning treatments. So every new patient comes in and you assign them to either red or green, and you do it with the toss of a coin. This will break down the relationship between the treatment and the confounding variable. And since it's random, it will break down that relationship regardless of what the other confounding variable is.
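To make the coin-toss idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the 90%/10% assignment probabilities in the non-random scheme, and the sample size are all made up for the example and do not come from any real study. It compares how strongly the treatment tracks the confounder (dark versus light circles) under a confounded assignment and under randomization.

```python
import random

def assign_confounded(is_dark, rng):
    # Hypothetical non-random scheme (assumed probabilities):
    # dark units mostly get green, light units mostly get red.
    p_green = 0.9 if is_dark else 0.1
    return "green" if rng.random() < p_green else "red"

def assign_randomized(is_dark, rng):
    # Coin toss that ignores the confounder entirely.
    return "green" if rng.random() < 0.5 else "red"

def green_rate_gap(units, treatments):
    """Share of green among dark units minus share of green among light units."""
    def rate(flag):
        group = [t for u, t in zip(units, treatments) if u == flag]
        return sum(t == "green" for t in group) / len(group)
    return rate(True) - rate(False)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
units = [rng.random() < 0.5 for _ in range(10_000)]  # True = dark circle

confounded = [assign_confounded(u, rng) for u in units]
randomized = [assign_randomized(u, rng) for u in units]

gap_confounded = green_rate_gap(units, confounded)
gap_randomized = green_rate_gap(units, randomized)
print(f"gap without randomization: {gap_confounded:.2f}")
print(f"gap with randomization:    {gap_randomized:.2f}")
```

A gap near zero means the treatment carries essentially no information about the confounder. The randomized assignment never looks at `is_dark`, which is why the same result would hold for any other confounder you swapped in.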
So randomization is one way to address the potential problem of confounding. Another way is stratification. In addition to randomization, you can actually design your experiment around confounders that you already know about. So here's an example. It's a study in mice. There are 20 males and 20 females. Half are going to be treated, and the other half will be left untreated. And you can only perform this experiment on four mice per day. So the question might be, how do you assign individuals to treatment groups and to days?
A bad design would be to run all the controls as only females in the first week, and all the treated as only males in the second week. Here, you have all sorts of confounding. The treatment and the control are related to the date the samples were collected. They're also related to the sex of the mice, and so it's very difficult to separate out the different sources of signal.
A stratified design, on the other hand, might do something like this. You would run both treated and controls in both week one and week two. You would make some of the treated be males and some of the treated be females. And some of the controls would be males and some females, run in both week one and week two. So when we balance out the variables in this way, since we knew the potential confounders, the date and the sex of the mice, we were able to design the experiment around these confounders. And so we can estimate their effects independently of each other.
So these are some other good study characteristics, and in fact there's a long class that can be given on experimental design. We'll talk a little bit more about it in the statistics class.
Just to give you an idea, here are a couple of other things that are important for doing good experimental design. In general, it's better to have a balanced design. In other words, if you're going to do treated and control, you should have about equal numbers of treated and control samples. A study should also be replicated. In other words, if you only take one sample from one person, you have no idea about the variability, either in the technology or between people in the population. So it's a good idea to take both technical replicates, that is, to run two experiments on the exact same sample to try to measure how well your technology works, and biological replicates, where you take samples from different individuals so you can measure the inter-person biological variability. Good designs also have controls, both negative and positive controls, to make sure both that your technology is working and that any effects you've detected aren't just due to an artifact of the computation or of the experimental design.
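As an illustration of the stratified, balanced mouse design described earlier, here is a minimal Python sketch. It is one possible layout, not the only valid one, and the function name, mouse labels, and seed are hypothetical. It assumes 20 males and 20 females, half treated, and four mice per day, which gives 10 days, and it puts one mouse from each sex-by-treatment combination on every day. Within each sex, which mice are treated is decided at random.

```python
import random
from collections import Counter

def stratified_design(n_per_sex=20, seed=0):
    """Return (day, mouse, sex, arm) rows with sex and arm balanced on every day.

    Assumes four mice per day: one treated and one control from each sex,
    so the number of days equals n_per_sex // 2.
    """
    rng = random.Random(seed)
    schedule = []
    for sex in ("M", "F"):
        mice = [f"{sex}{i + 1}" for i in range(n_per_sex)]
        rng.shuffle(mice)  # randomize which mice end up treated
        half = n_per_sex // 2
        treated, control = mice[:half], mice[half:]
        # Pair one treated and one control mouse of this sex on each day.
        for day, (t, c) in enumerate(zip(treated, control), start=1):
            schedule.append((day, t, sex, "treated"))
            schedule.append((day, c, sex, "control"))
    return sorted(schedule)

schedule = stratified_design()

# Check the balance: every sex-by-arm cell should have the same count.
cell_counts = Counter((sex, arm) for _, _, sex, arm in schedule)
for cell, count in sorted(cell_counts.items()):
    print(cell, count)

# Show the four mice scheduled for the first day.
for row in schedule[:4]:
    print(row)
```

Because every day contains both sexes and both arms, day (and therefore week), sex, and treatment are not tied to each other, so their effects can be estimated separately.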