Let's start again by thinking about all the various molecules in the context of the cell. So we have the cell, we have its nucleus, and a number of organelles. If you think about the genetic material, we have at any given time one copy of the genome, shown in this green squiggly line. For the genes that are being expressed, we have one or several copies of the mRNA, either in the nucleus or in the cytosol; you see those represented with blue squiggly lines. And similarly we have a number of copies of every protein, both in the nucleus and in the cytosol. And the question is, what do these sequences look like, what do they do, and how many of them are there? We'll address the question of sequence representation and generation here.
So, you probably know this by now, but I will say it nevertheless. Genomic sequences can be thought of as one-dimensional objects, or strings, formed of combinations of four nucleotides: adenine, cytosine, guanine and thymine, or very simply, A, C, G, T. The mRNA, or gene, sequences are strings over an alphabet of four letters, A, C, G, U, where A is adenine, C is cytosine, G is guanine, and U is uracil. However, for convenience, we shall simply use T to represent U. And proteins are words or strings over a 20-letter alphabet, but we will not be addressing proteins in our presentation.
So, this is the representation. The question is, how do we generate these sequences? How do we figure out what those strings are? Well, ideally we would like to have an instrument, let's call it a sequencing instrument, that starts at the beginning of the sequence and tells us every letter all the way to the end. Unfortunately, that is not possible with current technologies. So instead, current technologies shear, or split, the substrate, whether it's DNA or RNA, into a number of pieces, so-called fragments, and then every fragment gets sequenced and those pieces get put together. So there are two processes embedded here.
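Before looking at those two processes, here is a minimal Python sketch of the string representation described above. It is only an illustration: the function names `rna_to_dna` and `is_valid_dna` and the example mRNA string are made up for this sketch and do not come from any particular library or dataset.

```python
# Sequences as strings over a four-letter alphabet.
DNA_ALPHABET = set("ACGT")


def rna_to_dna(seq):
    """Write an mRNA string over the DNA alphabet by using T to represent U."""
    return seq.upper().replace("U", "T")


def is_valid_dna(seq):
    """Return True if every character of seq is one of A, C, G, T."""
    return set(seq.upper()) <= DNA_ALPHABET


# Hypothetical mRNA fragment, used only to exercise the functions.
mrna = "AUGGCCAUUGUAAUGGGCCGCUGA"
dna_style = rna_to_dna(mrna)
print(dna_style)
print(is_valid_dna(mrna), is_valid_dna(dna_style))
```

After the conversion, the same four-letter alphabet can be used for both genomic and mRNA sequences, which is the convention used in the rest of this section.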
The first one is the sequencing: the biological, wet lab, or experimental part, which means isolating the DNA or the RNA, then fragmenting it, and then sequencing. The second is the bioinformatics part, which means assembly. So, let me go a little bit into each one of them.
For the sequencing, once we have the substrate, it usually gets sheared into a number of fragments that are roughly the same in length. Roughly the same means that we can actually characterize the lengths statistically by a normal distribution with a given average and a given standard deviation. Each of those fragments, in turn, may or may not be amplified, depending on whether there is enough material. And each of them is sequenced either from both ends, which is what you're seeing there with the red arrows pointing towards each other, or from one end only. If sequencing happens from both ends, we call the resulting reads paired-end reads. If it happens from only one end, we call that a single-end read. So that's, in a nutshell, the process.
Once the sequences are produced, we have to put them together. We have to use them to reconstruct the sequence of the original molecule, and I'm only going to give you a flavor of what's in the bioinformatics world. So imagine that you have two segments that are co-located on the genome and that share a common portion. By looking at the common portion, for instance the fact that the end sequence of one read is very, very similar to the beginning sequence of another read, we can tell that they belong to the same location on the genome, and it also tells us something about the order: read number one should come before read number two. So just imagine that you're putting all this information together for all the reads, and what you create is a data structure that is called an assembly graph.
In this graph, the nodes, shown here in yellow, are the reads, and we establish a connection between two nodes whenever there is an overlap like the one I've described. The task of the bioinformatician is to write an algorithm that traverses this graph so that all the reads are included. By putting that information together, one can then create a consensus sequence for the genome. So those are the two steps.
Now how do we generate the sequences? Up until 2008, the prevailing method was Sanger sequencing. Sanger sequencing produces relatively long reads, 500 to 800 bp, but it was expensive, it was slow, and it required amplification in a cloning vector. It was also medium to relatively high throughput, especially in its last years. However, in 2008, next generation sequencing marked a revolution in our ability to produce reads. At this point, new sequencing instruments, particularly Illumina sequencing instruments, started being able to produce much shorter reads, somewhere between 50 and 450 base pairs, but at much reduced cost, I'm talking about less than one dime per megabase, and very, very fast, to the point where we can generate hundreds of millions of reads within a matter of days on one instrument. But while next generation sequencing clearly produced a revolution in our ability to generate sequences, it also brought major challenges, especially for the problems of mapping and genome assembly. So that's the story of how we generate sequences.
Now how do we represent sequences? For Sanger sequences, the prevailing format is FASTA, or Pearson, format, and it's very simple; you can see it here. There's one line that starts with a greater-than sign and represents the header of the sequence. So you have the greater-than sign, followed by an identifier that, very importantly, uniquely identifies the sequence, followed optionally by some information about the sequence itself.
In this case it tells us that this is the interleukin 2 receptor gene in Homo sapiens and that it's an mRNA. The header line is then followed by one or multiple lines containing the actual sequence, and one has to be very careful not to include any blank lines.
Now clearly we might want to put multiple such sequences together, especially if we're talking about the genome. As we said, there are many chromosomes, 22 plus X and Y. In order to do so, all we have to do is attach these records together. So we have one header line followed by the actual sequence, followed immediately by another FASTA header and its sequence, and so on. Again, importantly, there cannot be any blank lines or any lines that start with a space. So that's the multi-FASTA format, which was used and is still being used for Sanger sequences as well as for representations of genes, genomic sequences, chromosomes and so on.
With the advent of next generation sequencing technologies, we started adding more information to the sequences, which becomes important in bioinformatics applications that need to take into account the quality, that is, how much we can rely on the sequences being produced. What I'm showing you here is in fact not one but two so-called FASTQ records. So let's break that into pieces. Every record contains four lines. The first line is a header line, and it always starts with an at sign followed by an identifier, which can be fairly long. Usually these identifiers give information about the precise location and geometry of the cluster within the sequencing instrument. Oftentimes you will see that the last two characters are a slash followed by a one or a two. That one or two tells us which read in the pair this is, in case we have paired-end reads. More recent formats also show additional information, such as length, at the end of the header. Now, following the header is the actual sequence.
You will see A's, C's, G's, and T's, and occasionally an N, which marks a low-quality or unknown base. The third line starts with a plus sign and either repeats the information in the header exactly or is just a plus sign. And lastly, the fourth line is a so-called line of base qualities. There's one letter, or one symbol, here for every single one of the bases that you see in line 2. And that tells us how much we can rely on, how much we can trust, the automatic reading, the base calling, at that particular position. As you can see, just like we've done with multi-FASTA sequences, we can stack together many, many such records and create a FASTQ file.
Now I'd like to look a little more closely at the base quality scores and explain to you what they mean. So let's assume that we have a mathematical model of base calling, and that for a base b we have a probability, let's call it p_b, that the call at base b is incorrect. Then the quality value is simply an integer obtained with the formula given here: minus ten times log base ten of p_b. So an error probability of 1 in 100 gives a quality of 20, and 1 in 1,000 gives a quality of 30. There are a variety of systems for encoding quality scores, but they now tend to converge to one system, which is Sanger and which started from the Phred quality scores. In this system, the quality score associated with a base is a value that ranges between 0 and 93. Obviously, larger values mean higher quality and lower values mean poorer quality. When you write them out, these correspond to ASCII characters between 33 and 126. And just to take a look at them, I've listed them there. So you're going to see 94 positions, where the lowest quality is represented by the exclamation point on the left-hand side and the highest quality by the tilde, that squiggly line, at the other end. Now, in practice, you're generally not going to see quality values over 40.
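The quality formula and the ASCII encoding can be sketched in a few lines of Python. This is a minimal illustration, assuming the Sanger convention (character code equals quality plus 33) and taking p to be the probability that the base call is wrong, which is how the Phred scale is defined. The function names are made up for this sketch.

```python
import math

PHRED_OFFSET = 33  # Sanger encoding: quality 0 is written as '!' (ASCII 33)


def error_prob_to_quality(p_error):
    """Q = -10 * log10(p_error), rounded to the nearest integer."""
    return round(-10 * math.log10(p_error))


def quality_to_error_prob(q):
    """Invert the formula: p_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def quality_to_char(q):
    """Encode a quality value in the range 0..93 as one ASCII character."""
    if not 0 <= q <= 93:
        raise ValueError("quality must be between 0 and 93")
    return chr(q + PHRED_OFFSET)


def decode_quality_line(line):
    """Turn the fourth line of a FASTQ record into a list of integers."""
    return [ord(c) - PHRED_OFFSET for c in line]


print(error_prob_to_quality(0.01))    # 1 error in 100 calls
print(quality_to_error_prob(30))      # error probability at quality 30
print(quality_to_char(0), quality_to_char(93))
print(decode_quality_line("!5I~"))
```

The decoded list has one integer per base, so it lines up position by position with the sequence in line 2 of the record.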
Lastly in terms of interpreting this data, quality values that are below 20 or sometimes even 25, depending on the application, are generally considered to be low. So, this concludes our first section on sequence generation and sequence representation. Next, we're going to look at genomic features.
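To tie the four-line FASTQ record and the quality threshold together, here is a small Python sketch. It is a minimal reader, not a full parser: it assumes every record is exactly four lines and that qualities use the Sanger (plus 33) encoding. The two records are invented for illustration, and the default threshold of 20 is the rule of thumb mentioned above.

```python
import io

PHRED_OFFSET = 33  # assumed Sanger encoding


def read_fastq(handle):
    """Yield (identifier, sequence, qualities) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


def low_quality_positions(qualities, threshold=20):
    """Return the 0-based positions whose quality is below the threshold."""
    return [i for i, q in enumerate(qualities) if q < threshold]


# Two made-up records: the first and second read of a hypothetical pair.
EXAMPLE = (
    "@read_1/1\n"
    "ACGTN\n"
    "+\n"
    "IIII!\n"
    "@read_1/2\n"
    "TTGCA\n"
    "+read_1/2\n"
    "I5#II\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
for name, seq, quals in records:
    print(name, seq, quals, low_quality_positions(quals))
```

Depending on the application, the threshold could be raised to 25, as noted above, by passing `threshold=25`.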