So the flag represents a combination of multiple pieces of information, all of which are represented in binary, so at the level of bits. What we're seeing is a base 10 representation, a number that represents this collection of information. So for instance, imagine that you're writing this number in binary. Then the right-most bit, 0x1, tells you whether there are multiple segments. In other words, it tells you whether these are paired-end reads or single-end reads. Another piece of information, conveyed by the second bit from the right, is whether the segments are properly aligned. So this is the concordance information that I mentioned previously. The next piece of information is whether the segment is mapped or unmapped; the bit is set if it is unmapped. Oftentimes, files may contain records for the sequences that cannot be mapped, just for the purposes of providing a unified and complete view of the original data set. The next bit, moving again from the right to the left, tells us whether the next segment in the template, basically its mate in the pair, is unmapped. Following this, we have some information as to whether the sequence for the current read had to be reverse complemented to match the genome (0x10), or whether the sequence of the mate, the next segment, had to be reverse complemented to match (0x20). The next two bits, marked here with 0x40 and 0x80 in hex, tell us whether the read that we're looking at is the first segment in a pair of reads or whether it is the last segment. Moving on, 0x100, so looking at that particular bit, tells us whether this is a secondary versus a primary alignment for that read, based on the quality of the mapping, the alignment score. 0x200, if that bit is set, means the read is not passing the quality checks. Maybe the base quality values are not high enough, or maybe there are too many Ns. On the line that's marked 0x400, the bit will tell us whether this is a PCR or optical duplicate.
And lastly, the left-most bit in the representation, 0x800, will tell us whether this is a supplementary alignment. So let's take one example. Let's say on the previous slide we saw the code 99. So 99 is a representation in base 10, which is what we use for numerical representations in day-to-day life. We're going to write that in base 2, decompose it, and what you see on the right-hand side, as a sequence of blocks in green, black, and red, is the binary representation, 0000 0110 0011. So we have 12 bits. We're going to start from the right and move to the left, and we're going to interpret the meaning of the flag 99 using the individual meanings of the bits as reflected in the table above. Remember, we're moving from the right to the left. So the 0011, or rather, moving from the right to the left, 1100, marked in red, gives us the information that we read on the first four lines: multiple segments, segments are properly aligned, and so on. So 0011, or rather 1100, tells us that these are paired-end reads, that the pair is properly aligned, since that bit is set to one, that this segment is mapped, and that its mate is also mapped. That's because we would have had a one if the segment was unmapped. So that was easy, now let's check the next block, 0110. Again we're moving from the right to the left, but in this case it's a palindrome, so it makes it easier. What this information tells us is that the sequence was matching the forward direction, basically that it did not need to be reverse complemented, and that its mate, on the other hand, had to be reverse complemented to match the genome. It tells us that this read is the first in the pair; I'm reading the line that corresponds to 0x40, first segment. It is not the second, or not the last, in the pair. So that's the second block. As for the third block, it's all zeros, and it tells us that this is a primary alignment, that the read passed the quality check, that it is not a PCR duplicate, and that it is not a supplementary alignment.
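As a minimal sketch of the decoding we just walked through by hand, here is how the same bit tests could be written in Python. The names `FLAG_BITS` and `decode_flag` are made up for this illustration and do not come from any particular library; the descriptions simply paraphrase the table above.

```python
# Bit values and their meanings, from the right-most bit (0x1)
# to the left-most bit (0x800), paraphrasing the flag table.
# FLAG_BITS and decode_flag are hypothetical names used only for this sketch.
FLAG_BITS = [
    (0x1, "multiple segments (paired)"),
    (0x2, "segments properly aligned"),
    (0x4, "segment unmapped"),
    (0x8, "next segment (mate) unmapped"),
    (0x10, "sequence reverse complemented"),
    (0x20, "mate sequence reverse complemented"),
    (0x40, "first segment in the pair"),
    (0x80, "last segment in the pair"),
    (0x100, "secondary alignment"),
    (0x200, "not passing quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of every bit that is set in the flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


if __name__ == "__main__":
    print(format(99, "012b"))  # the 12-bit binary form of 99
    for description in decode_flag(99):
        print(description)
```

For 99 the set bits are 0x1, 0x2, 0x20, and 0x40, which matches the reading above: paired, properly aligned, mate reverse complemented, first in the pair.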
So that's how you can interpret each one of the codes. And there are specific ways in which you can use, for instance, Python or Perl commands to extract this information automatically. Now the next and last piece of information, the last field that I will be talking about, is the CIGAR field, which is used to represent the alignment in a very compressed form, rather than a lengthy, longer form. The CIGAR form can be represented as a sequence of short strings. In each string, there's a number followed by a letter. The letter marks an edit operation, and the number tells us how many such edit operations we have. And here are the types of edit operations, which expand on the set that I gave you earlier. So M, for instance, says that it's an alignment match, a read nucleotide aligned to a reference nucleotide. I means an insertion to the reference, just as we talked about earlier. D is a deletion from the reference. N is a very special case that represents an intron. S marks a portion that is soft clipped. For instance, the start or the end of the sequence could not be aligned, and it tells us how many bases were clipped. However, these bases will still be shown in the sequence field in the SAM format. H marks a hard clipping. So it's just like soft clipping, except that this time those letters will not be shown in the sequence field in the SAM format. P is padding, a silent deletion from the padded reference. With = and X, in some cases the M value, which was marked as a match, is further identified as either an identity, a sequence match, so an A matches an A in the genome, or as a sequence mismatch, for instance, an A is different from a C in the genome. So that will be a substitution. And here are two examples. In the first example, I'm giving you the reference at the top, and then the read, and how the two align. If you're looking at the alignment starting from the left to the right, you will see that A C T match between the two sequences. So we're going to mark those as 3M.
Then we have one insertion, so one letter that is present in the read but not in the genome, so it's inserted to the reference, and we mark it as 1I. It is followed by three letters, G A A, that again match between the read and the genome. We're marking that as 3M. Now you notice that there is a C, a letter that is present in the reference genome but not in the read. That one marks a deletion from the reference, 1D. And, lastly, we have a block of five letters that are either matches or substitutions, and we mark those as 5M. So M does not distinguish between substitution and identity. Put together, the CIGAR string is 3M1I3M1D5M. And in the second example, we're showing a spliced alignment, where we have the read spanning two different exons. So we have that the first three letters, A C T, will match at the end of the first exon, and we'll have 3M. Then there will be a long intron, a long jump or gap of 1000 bases in the reference that are not present in the read. We mark that with 1000N, followed by five matches, 5M, for five bases at the beginning of the next exon. That's how we represent a spliced alignment, 3M1000N5M. So this concludes our presentation of alignments as specific genomic features, and of genomic features in general. And in the following few sections we will be looking at how we retrieve this information, and how we can manipulate it using basic, or not so basic, UNIX commands.
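To make the two CIGAR examples above a bit more concrete, here is a small Python sketch that splits a CIGAR string into its number and letter pairs and counts how many bases each alignment covers on the read and on the reference. The function names `parse_cigar` and `cigar_spans` are hypothetical, chosen only for this illustration. The sketch assumes that M, I, S, = and X consume read bases and that M, D, N, = and X consume reference bases.

```python
import re

# One CIGAR operation: a number followed by one of the operation letters.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# A whole CIGAR string: one or more such operations and nothing else.
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")

# Assumption for this sketch: which operations advance along the read
# and which advance along the reference.
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (count, operation) pairs."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(count), op) for count, op in CIGAR_OP.findall(cigar)]


def cigar_spans(cigar):
    """Return (bases covered on the read, bases covered on the reference)."""
    ops = parse_cigar(cigar)
    read_len = sum(count for count, op in ops if op in CONSUMES_READ)
    ref_len = sum(count for count, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


if __name__ == "__main__":
    for cigar in ("3M1I3M1D5M", "3M1000N5M"):
        print(cigar, parse_cigar(cigar), cigar_spans(cigar))
```

For the first example, the insertion adds a base on the read side and the deletion adds one on the reference side, so both come out to 12. For the spliced example, the read covers 8 bases while the reference span is 1008, because the 1000N intron is counted only on the reference.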