One of the main issues in the original motivating example I gave you was that the data weren't available, so they couldn't be analyzed by other people. That's why it's important to have a data sharing plan. This is an important component of the statistical analysis of any genomic data set.

A data set consists of four components. First is the raw data. In the case of sequencing data, this is often the raw sequencing reads. That might be something like a FASTQ file or an aligned file, a BAM file of aligned reads. Then there's a tidy data set. The tidy data set, which we'll talk about in just a minute, is a data set where you've actually done some processing and cleaning so the data is easily analyzable and easily made interactive. Then you have to have a code book. This code book describes each variable and its values in the tidy data set. And finally, you need an explicit and exact recipe that you used to go from the raw data to the tidy data set and the code book. Without all four of these parts, a data set is incomplete when you're sharing it.

So the first thing to keep in mind is the raw data. Here I'm showing a FASTQ file, which is an example of raw data in genomics. You know it's the raw data if you did no processing, no computing, no summarizing, and no deleting to the data set. You can't do any kind of analysis at all to the raw data in order for it to still be the raw data. It's also important to know that this is a relative term. So for example, you're seeing here the sequencing reads that you might get from the machine. But there are also images, as Stephen explained, underlying the actual sequence reads that you get here. You most often don't get access to those images. Those are the raw data to someone else. So the raw data is the rawest form of the data that you have available when it comes to you.

A tidy data set is described with these four terms. 
So it's one variable per column, one observation per row, and one table per kind of data set, along with a linking indicator if you have multiple data sets. So for example here, I'm showing a data set where each of the observations that we've collected is in a row. And the variables are the problem ID and the subject ID, and so forth. So in general, if you have a genomics data set, you might have the genomics part of the data in one tidy data set. And you might have, say, metadata or phenotype data in another file. It's important that both are tidy and that there is a linking indicator between them. So the tidy data set goes along with the raw data set when distributing your results.

You also need a code book. The code book should have things like the variable names, their descriptions, and their units: things that you couldn't put into the actual raw data or the tidy data. So for example, if you measured height in feet versus height in meters, you'd want to include that in the code book. And we know there have been major disasters, say, when we were trying to send a satellite to Mars and people didn't know what units they were working in. So all of this needs to be recorded in the code book.

You also need a recipe. The recipe has to take the raw data, execute some commands, and produce the tidy data set. The best way to create a recipe is to use some kind of script. A script is a set of commands that a person can run without any interference from the original analyst to produce the tidy data set. These are typically written in R or Python. The input is the raw data, and the output is the tidy data. No parameters are allowed in these sorts of scripts. They have to be able to be run without the user interfering with them at all.

If you're not comfortable creating a script, then you could also create an explicit list of instructions about how you went from the raw data to the processed data. 
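To make the script-based recipe described above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the recipe used in the lecture: the file names, the three variables in the tidy table, and the Phred+33 quality encoding are all hypothetical choices, and it assumes the simple four-lines-per-read FASTQ layout.

```python
import csv

# Hypothetical fixed file names: the recipe takes no parameters, so the
# paths are hard-coded rather than supplied by the person running it.
RAW_FASTQ = "raw_reads.fastq"
TIDY_CSV = "tidy_reads.csv"
CODEBOOK_TXT = "codebook.txt"

# Code book: one entry per variable in the tidy table, including units.
CODEBOOK = {
    "read_id": "Read identifier taken from the FASTQ header line (text)",
    "read_length": "Number of bases in the read (bases)",
    "mean_quality": "Mean Phred quality score, assuming Phred+33 encoding",
}


def fastq_to_rows(lines):
    """Turn FASTQ lines into tidy rows: one read per row, one variable per column."""
    # Drop blank lines and trailing newlines before grouping.
    lines = [line.strip() for line in lines if line.strip()]
    rows = []
    # Assumes each record is exactly four lines: header, sequence, '+', qualities.
    for i in range(0, len(lines) - 3, 4):
        header, seq, _, qual = lines[i:i + 4]
        scores = [ord(ch) - 33 for ch in qual]
        mean_quality = round(sum(scores) / len(scores), 2) if scores else 0.0
        rows.append({
            "read_id": header[1:].split()[0],
            "read_length": len(seq),
            "mean_quality": mean_quality,
        })
    return rows


def main():
    """Raw data in, tidy data set and code book out, with no user input."""
    with open(RAW_FASTQ) as handle:
        rows = fastq_to_rows(handle)
    with open(TIDY_CSV, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(CODEBOOK))
        writer.writeheader()
        writer.writerows(rows)
    with open(CODEBOOK_TXT, "w") as handle:
        for name, description in CODEBOOK.items():
            handle.write(f"{name}: {description}\n")
```

With the raw file in the working directory, calling `main()` would regenerate both the tidy table and the code book without any decisions left to the person running it, which is the point of having a recipe.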
There's a lot of danger here because you have to be extremely explicit. You have to list every parameter of every piece of software you ran and every version of every piece of software you ran. If you did something in Excel, you might even have to make a video of what you were doing in Excel when you did it, so that people will have a record of everything that you did. And you have to distribute all of that along with the processed and the raw data. I've coded this in orange here: it's an okay thing to do, but you have to pay very careful attention. You have to be careful to avoid any vague instructions, any missing versions, or any skipped steps. These are common if you do the recipe outside of a script. So it's highly recommended that you use scripts for creating processed data from raw data.

If you need a data sharing plan, I've created one here. It's available on GitHub, and you can go and use it. It explains a lot of what I've said in this lecture in even more detail, if you would like to get into how you actually share data with people.
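As one way to avoid the missing versions and parameters mentioned above, the following sketch records them automatically so they can be distributed with the data. It is an illustration only: the package names and the `min_mean_quality` parameter are hypothetical placeholders, and it covers only Python packages, not external tools.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages, parameters):
    """Collect software versions and analysis parameters in one record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of silently skipping it.
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "parameters": parameters,
    }


record = describe_environment(
    packages=["numpy", "pandas"],         # hypothetical packages used in the analysis
    parameters={"min_mean_quality": 20},  # hypothetical analysis parameter
)

# Print the record; in practice it could be saved next to the tidy data set.
print(json.dumps(record, indent=2))
```

A record like this does not replace a script, but it removes some of the guesswork if the steps were written down by hand.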