So, suppose that we already have the sequence of the genome. What is the next step? That would be to go along the genome and annotate the important features. Essentially, we would like to determine the precise location and structure, meaning intervals or lists of intervals (think about exons, or maybe genes), together with the associated biological information, all along the genome. We call this genome annotation. What can these genomic features be? They can be genes, as I said; they can be exons, promoters, protein binding sites, translation start and stop sites, sites of DNA cutting, and so on. So, a variety of features.
I'm going to give you one example that's going to come in handy, which is gene annotation. So what do we call gene annotation? When we're annotating genes, we would like to know several types of information. We would like to know where the gene is located and how it's structured into exons and introns. If you look at this example here, there are two genes. One of them is in red, the other one in green. The first one has three exons, and we know the precise coordinates of the exons. For instance, the first exon of the red gene starts at position 10,135 and goes to 10,600, and so on for the remaining exons of the red gene and for the green gene. We would also like to know the strand: you can see we have an arrow showing that the red gene is on the forward strand, whereas the green gene is on the reverse strand. And maybe we want to know something more about the gene structure, for instance where translation starts and where it stops within every gene. So that is what we call annotation.
Now, of course, we have multiple ways of representing genomic features. The first type of representation that I'm going to tell you about is the BED format. BED is an abbreviation; it stands for Browser Extensible Data. In its basic form, a BED file only needs to represent the data in three columns.
These basically signify the location within the genome. The first column gives the chromosome or the scaffold, that is, the genomic substrate. The second and third columns show the start and the end of the feature. And I'm going to say something important here: the coordinates are zero-based, which means the first base in the genome is considered to be at position zero. So that's the minimum we need to specify in the BED format.
However, we can add more fields to show more information. In the representation here at the bottom, you'll see that in addition to the chromosome, start, and end, we've also added a name for the feature (exon one, exon two, exon three, and so on), a score, the strand, two numerical values called thick_start and thick_end, and finally an RGB color representation. So what are these? I should point out that this refers to the example of gene annotations, the two genes that I showed you on the previous slide. As you can see, we have exon one, exon two, and exon three, which correspond to the three exons in the red gene, and then exon four and exon five, which correspond to the two exons in the green gene.
Now, the score is typically a value between zero and a thousand; it simply indicates to a browser or visualization tool how intense the color should be. The strand can be plus, for forward, or minus, for reverse, depending on which strand of the genome the feature is located on. The thick_start and thick_end mark a visualization feature: they tell the browser where the rectangle representing the feature should become thick. This is a common way of representing the start and end of the protein-coding portion within a gene. Lastly, column nine is an RGB (red, green, blue) representation of the color. As you can see, the first gene is shown in red, whereas the second gene is shown here, well, in blue.
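To make the column layout concrete, here is a minimal sketch of reading one BED line in Python. The chromosome name `chr7`, the score of 0, and the helper names (`BedFeature`, `parse_bed_line`, `feature_length`, `one_based_start`) are assumptions made for illustration; only the coordinates of the first exon come from the example in the lecture. It also assumes the usual BED convention that the end coordinate is not included in the interval, so the length is simply end minus start.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    """One BED record; only the first three fields are mandatory."""
    chrom: str
    start: int          # zero-based start
    end: int            # end coordinate, not included in the interval
    name: str = "."
    score: int = 0
    strand: str = "."


def parse_bed_line(line: str) -> BedFeature:
    """Parse a tab-separated BED line with at least three columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    name = fields[3] if len(fields) > 3 else "."
    score = int(fields[4]) if len(fields) > 4 else 0
    strand = fields[5] if len(fields) > 5 else "."
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


def feature_length(feature: BedFeature) -> int:
    """Length in bases: with a zero-based start this is end minus start."""
    return feature.end - feature.start


def one_based_start(feature: BedFeature) -> int:
    """Start of the same feature when counting bases from one."""
    return feature.start + 1


# Hypothetical line for the first exon of the red gene.
feat = parse_bed_line("chr7\t10134\t10600\texon1\t0\t+")
print(feat.name, feat.start, feat.end, feature_length(feat), one_based_start(feat))
# prints: exon1 10134 10600 466 10135
```

The optional columns default to a dot or zero here, which is a design choice of this sketch and not something the format prescribes.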
Now, I want to go a little bit more over the concept of zero-based versus one-based notation. On the previous slide, you might have looked at the coordinates and seen that the start coordinate of the first exon is 10,135, whereas in the BED representation here the start coordinate is 10,134. The end coordinates, however, match between the two slides. So let me explain a simple way of thinking about this representation.
Let's assume that we have a sequence that reads A C A G C T A C A G. You can think about zero-based coordinates as a system that counts the spaces between letters. So there's one space, space number zero, before the first letter. Space number one is between letters one and two, and so on. So the spaces here are numbered zero to ten. In contrast, we can count bases, which is usually what systems do. We start with the first letter, A, corresponding to letter one, C to two, and so on, up to G at ten. So in a zero-based system, we start at the space right before the first letter of the string we're interested in, and we count all the way to the space that immediately follows the last letter of that string. So if we want to represent the string C A G C in zero-based coordinates, it will be listed as the interval from one to five, whereas in one-based coordinates it will be the interval from two to five. So that's how we represent intervals in the BED format.
Now, what I've shown you referred to the case where each entity was a separate interval. We talked about exons as being separate entities. However, most often we would like to show the exons that belong to the same gene together. So we would like to have a concept of clustering intervals, or features, that belong to the same larger feature. And we can do so by putting additional fields at the end of the already existing BED record. So here's how we do it. We have two lines here.
The first one corresponds to the red gene, the second one to the green gene. And you can see there, in bold font, the additional fields. We're giving the block count, where a block corresponds to an exon; so we have three exons in the first case and two exons in the second case. Then come the sizes of each block, that is, the size of each exon: in this case, for the first gene, 466 bases for exon one, 31 for exon two, and 903 for exon three. And lastly we give the start coordinates of the blocks in a relative coordinate system, where zero marks the beginning of the gene, that is, the beginning of the first exon. This diagram at the bottom shows you the correspondence between the two notation systems. So now, instead of starting at 10,135, we're marking that as position zero in the new system. You see the lengths of the three exons, 466, 31, and 903. And you see what the start coordinates are: 843 and 3,275 for exons two and three, with exon one starting at zero. So, that's fairly simple.
In addition to the BED format, several other types of formats have evolved for representing genomic features. And some of them, especially the next two that I'm going to be talking about, were geared toward representing gene and transcript information. So, the GTF format. Let me just say, GTF stands for Gene Transfer Format. We have more columns, and it is, perhaps, more structured. In the first column, just like in the BED format, we show the chromosome or the scaffold, so that's an identifier for the genomic axis. The second column tells us the source, which can be a program, or a website, or anything that you can think of. The third column tells us the type of feature. The fourth and fifth columns, which are numerical, simply tell us the beginning and the end of the feature along the genomic axis, here along chromosome seven.
Then we have the score, just like before, which represents the shading and perhaps the confidence that we have in that particular feature. Then the strand, signifying whether the feature is located on the forward or the reverse strand. Then the dot here: almost anywhere, we can place a dot if the field is not applicable or we don't have the information. But this column, had we had the information, would represent the frame, in particular, for coding sequences, frame zero, one, or two. And then comes the ninth column, which is typically a composite column that has at least two different types of fields. The first one gives the gene identifier; it's introduced by gene_id, followed by the actual name of the gene. The second piece of information is the transcript ID, in this case genA.1, meaning it's variant number one. There can be additional fields that refer to the abundance, to the exon number, and so on. Within column nine, the fields are separated by semicolons and spaces. In the GTF format, as well as in the BED format, the columns themselves are separated by tabs.
Now, as to how genes are structured here: all the exons corresponding to one gene are grouped together. So you might see that the first three lines correspond to the three exons in genA, in that particular order, and that the last two lines correspond to the two exons that form genB. Another important distinction from the BED format is that the coordinates here are one-based. So we're going back to one-based notation, and the first position for the gene along chromosome seven would be 10,135. So that's the Gene Transfer Format. Next, a format that is closely related to GTF, and that also evolved for gene representation, is the so-called GFF, now in its version three.
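Before moving on to GFF, the link between the BED block fields and the one-based GTF exon lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bed12_to_gtf_exons`, the source label `example`, the chromosome name `chr7`, and the placeholder values for score, thick start, thick end, and color are hypothetical, and the gene end of 14312 is derived from the block sizes and starts given in the lecture, not read off a slide.

```python
from typing import List


def bed12_to_gtf_exons(line: str, source: str = "example") -> List[str]:
    """Expand one 12-column BED line into one GTF exon line per block."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected 12 tab-separated BED columns")
    chrom, chrom_start, name, score, strand = f[0], int(f[1]), f[3], f[4], f[5]
    block_count = int(f[9])
    # Trailing commas are tolerated, hence the "if x" filter.
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("block count does not match block sizes/starts")

    gtf_lines = []
    for size, rel_start in zip(sizes, starts):
        zero_based_start = chrom_start + rel_start   # block starts are relative
        end = zero_based_start + size                # same value in both systems
        attributes = f'gene_id "{name}"; transcript_id "{name}.1";'
        gtf_lines.append("\t".join([
            chrom, source, "exon",
            str(zero_based_start + 1),               # GTF starts are one-based
            str(end), score, strand, ".", attributes,
        ]))
    return gtf_lines


# Hypothetical BED12 record for the red gene (genA) from the lecture example.
example_line = "\t".join([
    "chr7", "10134", "14312", "genA", "0", "+",
    "10134", "14312", "255,0,0", "3", "466,31,903", "0,843,3275",
])
for gtf_line in bed12_to_gtf_exons(example_line):
    print(gtf_line)
```

Under these assumptions the three exons come out as 10135 to 10600, 10978 to 11008, and 13410 to 14312, so the first one agrees with the coordinates quoted for the red gene.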
So GFF stands for General Feature Format, and the file usually needs to start with a line at the top that specifies the version, GFF version three. Then, just like with the GTF format, we have one line for every exon in the gene. We have column one representing, sorry, not the exon, the genomic substrate, whether it's a chromosome or a scaffold. This is followed by the source; here, let's say GF stands for some generic gene finder, but it can be Cufflinks, it can be GeneMark, it can be C4. Then comes the type of feature; here I'm exemplifying with mRNA and exon, but other features can be there as well. Columns four and five hold the numerical values that represent the start and end coordinates of the feature along the genome. Then come a score in the following column, the strand, and the frame, or a dot if we don't know it. And then, in the last column, we have some information about the feature and the cluster that it belongs to, the upper-level entity.
The specific fields there start with an identifier, so we're giving a name that uniquely represents this feature. For instance, our two genes are going to be named mrna0001 and mrna0002. And then we're also giving the name of the gene, in this case genA and genB. Other fields can be included as well, but that's a minimum representation. Now, as you can see, after the mRNA line, which is one line representing a summary of the gene, we have in the first case three exon lines, each corresponding to one of the exons. Here, we have an ID for every feature: exon0001, exon0002, exon0003. But we also have a parent field that identifies each of them as being part of the parent mRNA. So the parent for all three is mrna0001. So, that's the way we can represent genes in the GFF3 format.
And that concludes our brief section on the representation of genomic features. In the next section, we'll talk about a specific type of feature, alignments.
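As a recap of the GFF3 parent and child structure, here is a minimal sketch that groups exon lines under their parent mRNA. The helper names (`parse_attributes`, `exons_by_parent`), the chromosome name `chr7`, and the sample text are assumptions for illustration; the exon coordinates for exons two and three are derived from the block sizes and starts given earlier. The sketch relies on two GFF3 conventions: the file starts with a `##gff-version 3` line, and the last column holds `key=value` pairs separated by semicolons.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical GFF3 text for the red gene only; "GF" stands for a generic gene finder.
GFF3_TEXT = (
    "##gff-version 3\n"
    "chr7\tGF\tmRNA\t10135\t14312\t.\t+\t.\tID=mrna0001;Name=genA\n"
    "chr7\tGF\texon\t10135\t10600\t.\t+\t.\tID=exon0001;Parent=mrna0001\n"
    "chr7\tGF\texon\t10978\t11008\t.\t+\t.\tID=exon0002;Parent=mrna0001\n"
    "chr7\tGF\texon\t13410\t14312\t.\t+\t.\tID=exon0003;Parent=mrna0001\n"
)


def parse_attributes(column: str) -> Dict[str, str]:
    """Split the ninth column into a dictionary of key=value pairs."""
    attributes = {}
    for item in column.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


def exons_by_parent(text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Collect one-based (start, end) exon intervals under each Parent ID."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank lines and directives
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue                      # keep only well-formed exon lines
        parent = parse_attributes(fields[8]).get("Parent")
        if parent:
            groups[parent].append((int(fields[3]), int(fields[4])))
    return dict(groups)


print(exons_by_parent(GFF3_TEXT))
```

Grouping by the Parent field is what lets a tool reassemble a transcript from its exon lines, whatever order they appear in.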