So far we've talked about sequences and genomic features: what they are, what they look like, and how they are represented. Now we're going to talk about how we can retrieve the sequences and how we can retrieve the features, because once sequences are generated, they are deposited into databases for the use of us all. I'll show you two different large repositories. The first one is GenBank, and how to access sequences from GenBank. In the second section, we'll talk about how we can retrieve genomic features from the UCSC Table Browser.

So let's get started with GenBank. I'm going to ncbi.nlm.nih.gov, and we make the point that it is the primary repository of genomic information in the US, and that it contains a variety of types of information in a variety of databases, which are listed here: Assembly, BioProject, Clone, EST, Epigenomics, Gene, and so on.

First, let's assume that we would like to retrieve the sequence of one gene. Let's take the IL-2 gene in human, which is the example that I gave for the FASTA format, as you might recall from one of the previous sections. So I'm going to go to the text box here and type IL-2, or Interleukin 2. I'm going to say that I want it in Homo sapiens, and that I want the mRNA. I have to select the appropriate database, and the appropriate database for extracting nucleotide sequences is Nucleotide. Let's see what happens. It's a string-based search, and you can see the number of results I'm getting: 226. If you look at the first ones, you will notice that this is not very specific. The first sequence is a mouse sequence, followed by a frog, and so on, plus genomic sequences. That's not what I want. However, right here at the top, in a summary, it shows me that what I'm interested in might be one of the sequences listed as either
Transcript or Proteins, so I'm going to select Transcript. Let's run it again. There are other ways in which I can filter the information to reach precisely what I need to obtain.

So we have the Homo sapiens Interleukin 2 mRNA. This is shown here in the GenBank format, which I have not described. In addition to the sequence, which is listed at the bottom, there's information about papers that refer to the sequence, about [INAUDIBLE], about the identifier, a specific identifier for the sequence, the length, and many others. What I want to do is extract the FASTA sequence, so I'm going to click the little button that says FASTA, which gives me a rendition of the sequence in FASTA format, just like I've shown you earlier. This I can now cut and paste, since it's a fairly short sequence. For longer ones, we can send the complete record, or just portions of it, to a file, and this will download the item in FASTA format. I just click that and it saves it into a text file. That's fairly simple. There are also a number of databases of sequences that can be downloaded directly from NCBI, and which you can access from the entry page of NCBI. It's fairly simple, so I'm not even going to show you that.

I'd like to show you how to access one other type of sequences, and those are the FASTQ sequences. Let's say now that we would like to find some very specific data, let's say strawberry. So I'm going to type strawberry, and I'm going to type RNA-seq to make sure that it's RNA-seq. Then I'm going to go to SRA, the Short Read Archive. Search, and let's see what we get. I have a number of results, 66, which I can further filter down. You can see that there are three different species represented: Fragaria vesca, with 37, Fragaria ananassa, and one more. I'm really interested in woodland strawberry, which is Fragaria vesca, so I'm going to narrow it down to the 37.
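As an aside on the FASTA file saved from GenBank earlier: a quick sanity check is to count the records and residues it contains. The following is only a minimal sketch. The function name fasta_summary and the file name il2.fasta are made up for illustration and are not part of the lecture.

```shell
# Summarize a FASTA file: number of records and total sequence length.
# fasta_summary is a hypothetical helper name, not an NCBI tool.
fasta_summary() {
    awk '
        /^>/ { records++; next }                         # header lines start with ">"
        { gsub(/[ \t\r]/, ""); residues += length($0) }  # strip whitespace, add up sequence length
        END { printf "%d records, %d residues\n", records, residues }
    ' "$1"
}
```

For example, fasta_summary il2.fasta (assuming the record was saved under that name) should report a single record for the IL-2 mRNA.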
All 37 Fragaria vesca results, as we see here, are RNA-seq. I'm going to pick one set that is relatively short, so it will transfer in a short time. Let's look at this Illumina sequencing of floral transcriptomes of woodland strawberries. It was produced with the Illumina HiSeq 2000 and it has 15.5 million spots, that is, 15.5 million reads, so that's not that much data.

Clicking on this takes me to a page that describes the particular experiment and sample. The experiment identifier is SRX426518, and we have some information about the run: the number of reads, the number of bases, and how much they represent as a download in binary. This is followed by more information about the project, because a project might have multiple samples, and they usually do: the scope or goal of the project, the larger BioProject, the experiments, and so on. What this also tells us is that these are single-end reads. And then we have, in the table, information and links to the actual data.

So let's click on this number here, and it brings us to a page that allows us to access various types of information about this particular experiment and sample. First, on this tab, there's metadata, the number of reads, and so on. If we click on Reads, we see that there is one entry for each read, with basically the identifier and some information about where it came from and the sequencing configuration. What we're mostly interested in here is the download. There are ways we can download this information by clicking and dragging, but what I'm going to show you is a way to download this SRA data using command-line utilities, which I have found to be much more useful.

So first let's write down some information which will come in handy. First of all, the identifier for this run: SRR1107997. You can see that it is also associated with the experiment, sample, and study identifiers, which are listed here.
It is normally a good idea to write those down on a piece of paper as well. Let's follow the link to information on how to download SRA data. This takes us to a help page on NCBI with a variety of ways, using [INAUDIBLE] and so on. One command-line utility that I've found to be very, very useful is wget. I have to find that. Here's one example. wget can be typed from any Unix-type interface, and it will go and fetch, that is, download, the data from the Internet address that's listed here as the second command-line argument.

I'm going to copy this, go to a terminal, and modify it slightly. The benefits of the wget command are that you can do the transfer remotely, it is convenient, it can happen in the background while you're running other commands, and, most importantly, it deposits the data precisely in the directory where you want it to be located. So we have some fixed string here. Then we have SRR followed by the first three digits of the run number, which were 110, and then SRR followed by all the digits of the identifier, 1107997. This is the identifier I asked you to write down on a piece of paper, SRR1107997.

Let's see what's happening. You see that it connects to NCBI and then it starts the transfer, and we have a nice report of the progress on the left-hand side. It's going fairly fast. We're downloading 554 MB of data, and we're almost there. So we're downloading the SRA file, and obviously the time will depend on the speed of the connection.

So we've saved the file, but now let's take a look at it. Remember that head will give us the top ten lines. This is definitely not something that we can understand, and definitely not what we expected. In order to produce the appropriate format, we have to run another utility, called fastq-dump, followed by the name of the file. Because I know it's going to take a while, I'm going to run it in the background.
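The address passed to wget follows the pattern described above: a fixed prefix, then SRR plus the first three digits, then the full identifier. Here is a rough sketch of how that could be assembled in the shell. The prefix below is only a placeholder (the real fixed string comes from the NCBI help page), the function names are hypothetical, and the .sra file name at the end of the path is an assumption.

```shell
# Placeholder for the fixed part of the address; replace it with the string
# shown on the NCBI help page. This example.org value is NOT a real server.
SRA_PREFIX="${SRA_PREFIX:-ftp://example.org/FIXED-PREFIX}"

# Print the variable part of the address for a run identifier, e.g.
# SRR1107997 -> SRR110/SRR1107997/SRR1107997.sra (the .sra name is assumed).
sra_relative_path() {
    acc="$1"
    short=$(printf '%s' "$acc" | cut -c1-6)   # "SRR" plus the first three digits
    printf '%s/%s/%s.sra\n' "$short" "$acc" "$acc"
}

# Download a run into a chosen directory (default: the current one).
# wget -P sets the directory where the file is saved.
fetch_sra() {
    wget -P "${2:-.}" "$SRA_PREFIX/$(sra_relative_path "$1")"
}
```

With the real prefix filled in, fetch_sra SRR1107997 would place the file in the current directory, which matches the point that wget deposits the data where you want it.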
To run it in the background, I'm going to start with the command nohup, which means no hangup. That means that even if I close my environment on my laptop, this is still going to run on the remote machine if I walk someplace else. It is followed by the fastq-dump command and the SRA file, and I want this to happen in the background while I keep access to the shell for other commands, so I'm going to put an & at the end. While the command is being executed in the background, any errors or problems are automatically written to the file nohup.out, and in the meantime I can run other commands.

I've done this previously and I know it takes a while, so I'm going to just show you the final result. For the output, let's run head again to get the top ten lines. This is what the output looks like: there's the identifier, the header line, which contains the length, and so on.

Now, these were single-end reads. In case you have to process paired-end reads, you need to use a command-line modifier, an option, and that would be fastq-dump --split-3. But here's a way in which you can usually see what the possible options are and use them accordingly. You can always type the command, fastq-dump, followed by help with a dash dash, --help, and that gives you information about what other command-line parameters are available. In this particular case, --split-3 would split the SRA file into mate one and mate two for paired-end reads.

So, as you can see, we were able to retrieve sequences produced with the Sanger technology, so maybe genes or genomic sequences, from GenBank, and I've also shown you how you can download FASTQ-type data, next-generation sequencing data. This concludes this section. In the next section we'll be looking at where and how we can download genomic features or annotations.
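To recap the conversion step, here is a small sketch of the commands discussed in this section, wrapped in shell functions. The function names are hypothetical, fastq-dump is assumed to be installed from the SRA Toolkit, and the read count relies on the usual four-lines-per-read FASTQ layout.

```shell
# Convert a downloaded SRA file to FASTQ in the background.
# nohup keeps the job alive after logout; when run from a terminal,
# its messages are written to nohup.out.
# Pass "paired" as the second argument to add --split-3 for paired-end reads.
convert_in_background() {
    sra_file="$1"
    if [ "${2:-single}" = "paired" ]; then
        nohup fastq-dump --split-3 "$sra_file" &
    else
        nohup fastq-dump "$sra_file" &
    fi
}

# Count the reads in a FASTQ file, assuming each read occupies four lines
# (header, sequence, separator, qualities).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}
```

For the single-end run used here, one would call convert_in_background SRR1107997.sra, and later check the result with head or count_fastq_reads on the resulting FASTQ file (the input file name is assumed from the download step).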