Hi everyone. I've stepped you through assembly, annotation, comparing protein families, and building trees in PATRIC. There's a lot of clicking involved in a lot of analysis steps. Well, today I'm going to introduce you to a pretty new service that we launched that's really popular with our users, called the Comprehensive Genome Analysis service. It's a streamlined analysis meta-service that accepts raw reads or contigs and performs a comprehensive analysis. That includes assembly (or not, if you just submit contigs) and annotation. It'll identify the nearest neighbor to the genome, and it will give you a subsystem summary, a phylogenetic tree, and the features that distinguish the genome from its nearest neighbors.
In this particular video, I'm going to show you how to submit a job that uses read files. If you've looked at other videos I've done, especially for assembly, I've gone into great detail about how you can load your own read files into PATRIC. You could use the text box to start typing and search for read files that are already in your workspace. If you've loaded them recently, you could click this down arrow here and it would show you the most recently loaded read files. You'll notice that the names end in fastq or fastq.gz, which is the gzipped version. PATRIC accepts both of those. Single reads would be the same thing: they are read files that end in fastq or fastq.gz. If you haven't uploaded your data into PATRIC yet, you can click the folder icon here. You can either search for data in your workspace or click this upload icon here. First it opens another pop-up window that tells you the file endings that are acceptable for the Comprehensive Genome Analysis service, and then you click "Select file," and that'll open a dialog window on your computer. So if you've had a genome sequenced and the sequencing service gave you the files, this is a way you can load them.
But we're not going to do that today, because I've gone over in great detail how to do this in my other assembly videos. Today we're going to talk about loading data from the Sequence Read Archive. Now, I decided to get some files that were done by the FDA. There's a great resource right now, FDA-ARGOS, where they're doing all of these reference-grade microbial genomes. At the Sequence Read Archive, I wanted to try a hybrid assembly, which combines short and long reads, and I found a particular genome that does that from Staphylococcus aureus. If I were to click on one of these, this is what opens up at the Sequence Read Archive. It's got the experiment number. This is the run number. PATRIC needs this run number. I mouse over it, I copy it, and I go back here to the Comprehensive Genome Analysis page and I enter that run number in this text box, but I've got to get it over into the selected libraries box. To do that, I click on this arrow and move it over. You'll notice a little pop-up message appeared that said, "We're validating, making sure that that number is correct." This is just for PATRIC's benefit, to make sure you've entered the right number.
Now, what we used to do is make you go there, sorry, and download it. At one point I think I had a tutorial showing you how I'd locate the files at SRA, which was easier to search. But then I would go to the European Nucleotide Archive, ENA, to download the data, because their download was easier, because I'm just a biologist, I'm not a computer scientist, so I have to do things the harder way. Now all of that has been skipped, so that you can just enter the number here and PATRIC does it for you. What could be better than that? We've got one number in, but we need a few more to go. I need this SRR number for the second one. The first one was a long read.
These are the short reads, so I just overwrite that with this guy and move it over and look for the message right here, validating that number, and quickly enough it says, "Yeah, it's okay. We can do that." Then there was a third file related to this one genome, from a different run on the Illumina HiSeq 2500. I go here, put that little guy in there and move it over. Now I've got three. But wait, we're not done yet, because remember, it's doing assembly and annotation all on the same page. We have to choose an assembler and we have to choose a taxonomy. There are some parameters that we have to fill out.
There are a number of strategies that we offer in PATRIC: Auto, Unicycler, SPAdes, Canu, metaSPAdes, plasmidSPAdes, and MDA (single-cell). Let's start with Unicycler. I'll discuss Auto at the end. Unicycler can use both long and short reads. It generally takes a longer time, but it does a really good job. What it'll do is take the long reads and then bring in the short reads to improve, not polish, the assembly. That's an important distinction between Unicycler and Canu. Canu also will take long and short reads, but it first builds the initial assembly with the long reads, and then it polishes it with the short reads. There's a little bit of a difference there. SPAdes takes only short reads, and there are several variations of SPAdes: metaSPAdes for metagenomic reads, plasmidSPAdes if you've sequenced a plasmid, or MDA. What's gaining in popularity is people isolating single cells and generating an assembly from that. MDA does that.
What happens if you pick Auto? If you pick Auto and you have only long reads, it's going to do Canu. If you have both long and short reads, it's going to pick Unicycler. If you only have short reads, it's also going to pick Unicycler. I've actually generated this assembly before, but I did it with Unicycler. This time I want to try something different, and I'm going to click Canu.
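The Auto rule just described is small enough to write down. Here is a minimal sketch in Python of that decision as it is described in this video; it is not PATRIC's actual code, and the function name `pick_auto_strategy` and its two arguments are made up for the illustration.

```python
def pick_auto_strategy(has_short_reads: bool, has_long_reads: bool) -> str:
    """Return the assembler that Auto would choose, per the rule in the video.

    Hypothetical helper for illustration only; not part of PATRIC.
    """
    if not (has_short_reads or has_long_reads):
        raise ValueError("at least one read library is required")
    # Only long reads: Canu.
    if has_long_reads and not has_short_reads:
        return "canu"
    # Long plus short reads, or short reads alone: Unicycler.
    return "unicycler"


# The job in this video has one long-read run and two short-read runs.
print(pick_auto_strategy(has_short_reads=True, has_long_reads=True))
```

So for the three runs entered here, Auto would have picked Unicycler, which is why Canu has to be chosen by hand to try something different.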
Notice that when I click Canu, it tells me I have to estimate the genome size. Luckily, I've run this before, so I know what it is. It might be a little bit daunting if I didn't. It isn't five million, it's three million, the size of that genome. If I didn't know that, I'd just go look at other Staph aureus genomes and give it some ballpark figure, maybe a little bit bigger, and see what happens. Trimming: do you want to trim or not before the assembly? Trimming uses Trim Galore. I'm sure these things have been trimmed.
Then you see this thing, Racon iterations and Pilon iterations. What are those things? Those are what we call polishing the genome. What the service will do is build the first assembly, and then, if you have a polishing step, it'll go through and take the reads and try to improve areas, making sure that there's not some discrepancy in the reads at individual locations. It's using the short reads, or I guess the long reads, to do that. Those are two other options you can take. We have it set at two for each. I did some searching on this before, and generally, at least on Biostars, people thought that two polishing iterations were good. Now, Racon is for long reads and Pilon is for short reads, so we're asking it to do both.
Now, you can also adjust the contig length and the contig coverage. Generally, GenBank has requirements that, as I recall, they only accept contigs of 300 base pairs or longer. You can mess with this and change that, and you can change the coverage too, if you want. But let's just leave it at the default parameters. Bacteria is the domain. If I were to click down here, you'd see I could do archaea or viruses right now. Viruses is just for bacteriophages, not for anything else. But since we've recently joined forces with ViPR, the viral BRC, eventually we will have a virus annotation service. Let's click Bacteria. We need to give it a taxonomic name. I know that this is Staphylococcus aureus.
You notice it started giving me designations. This is the one I want, which is the species level. Now I have to give it a unique identifier. I'm going to give it one of these SRR numbers, just so I can remember where it's from, and then I'm going to put a tag on it, that it was Canu. I've got that, and then I have to put it in an output folder. If you've created the folder before, you can start typing it here, or you could go in and look at your workspace, clicking here to find folders that you've created. I know I have one in here for Comprehensive Genome Analysis. I'm going to click on that one and say OK. Now look, the blue button is ready, so I can submit this job. In the later videos, I'm going to show you how to submit assembled contigs. Then we're going to talk about the files that come with the Comprehensive Genome Analysis job, and also how you find help when you have problems. So let's click on "Submit". I get a dialog box that says it was being prepared, and now it's been queued. Down here on my Jobs monitor, you can see that it's already running. In the next one, we're going to submit a job with assembled contigs.
Okay, everybody, it's time to roll up your sleeves. You've got some work to do. It's actually quite a bit of work to do with this first assignment, so it's got a few steps. The first is to create a folder in your workspace. If you've watched some of the videos, you know that I bemoan the fact that my workspace is very cluttered. So I'm trying to save you from being me, and help you organize your workspace so that you can find everything easily. Remember that when you put things in folders, you can use folders within folders to keep it as neat and organized as you wish. Then I want you to copy some files into your workspace. Then I want you to submit assembly jobs. One of those files is PacBio, but then there are two pairs of read files that you'll need to use as well. Where do you find those things?
You click on the "Workspace" tab, and then you click on "Public Workspaces". You'll scroll down until you see the MOOC PATRIC course. Double-click on that globe, and that'll take you to the folders within the course. Double-click on the "Comprehensive_Genome_Analysis" folder, and then you'll see three folders: assignment, contigs, and reads. For this assignment, click on "reads". Then you'll see five different files there. You can shift-click on all of them. Notice that when you do that, it populates the green bar with possible downstream actions. Then you click on "Copy". This will open a pop-up window where you can copy them to your workspace. So that's the easy part.
Now here's the hard part: you need to submit 27 different assembly jobs. You're going to be busy. What's the reason to do this? I want you to start exploring how different assembly strategies change the results. If you start adding in different Racon or Pilon iterations, does that change the assembly? And not only does it change the assembly, does it change the annotations at all? This will give you some sense of how it all works together to produce the annotation. Twenty-seven jobs, am I going to tell you what those are? Don't worry. Yes, I am. So go back to the Comprehensive Genome Analysis folder, and there's the assignment folder. Click on that. I've created an Excel template for you that lays out exactly what you have to do. In this case, you'll click on that "CGA_hybrid_testing_template", and this will populate the vertical green bar with other actions. This time, you download the file. This is what it looks like. I've created this for you. In the first part, you can see the reads and everything that I want you to choose. Then as the lectures go on, you'll start filling in all these other columns. There are a lot of them, because a lot of data comes down with an assembly and with an annotation file. You'll need to decide on a unique name for each of the 27 jobs you submit.
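One way to keep that many job names unique is to build each name from the settings that change between jobs, the same way the earlier job was named with its SRR number plus a Canu tag. The sketch below is only an illustration in Python: the label `sample`, the helper `job_name`, and the iteration values 0 and 2 are made-up placeholders, and this small grid is not the list of 27 jobs. The Excel template is what tells you which combinations to actually submit.

```python
from itertools import product


def job_name(label: str, strategy: str, racon: int, pilon: int) -> str:
    """Build a job name that encodes the settings used (hypothetical scheme)."""
    return f"{label}_{strategy}_racon{racon}_pilon{pilon}"


# Placeholder values for illustration; take the real ones from the template.
strategies = ["unicycler", "canu"]
racon_iterations = [0, 2]
pilon_iterations = [0, 2]

names = [
    job_name("sample", strategy, racon, pilon)
    for strategy, racon, pilon in product(strategies, racon_iterations, pilon_iterations)
]

# Every combination of settings yields a different name.
assert len(names) == len(set(names))

for name in names:
    print(name)
```

Because the name carries the strategy and the iteration counts, you can tell the jobs apart later in your workspace without opening them.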
Then notice that we're going to be comparing Unicycler to Canu, those two assembly methods. You're going to get a sense of how long these different strategies take and what differences occur in the annotation following them. Then we have differences in Racon and Pilon iterations. So we're going to be comparing all of these things. Here's a quick hint. After you submit an assembly job, don't reload the page. Leave the read files already entered, and you can just change the name, change the strategy, change the Racon and Pilon iterations, and resubmit the job, so you don't have to keep uploading the reads. Once the job has been submitted, as long as you change a few key things, the blue button will pop up at the bottom and allow you to resubmit the job. I know it's a lot of work, but I know you can do it. I know you're a dedicated PATRIC warrior, and this task is not beyond you. Good luck. See you next time.