Let's say you get your reads back from your sequencer, and you just want a sanity check that it was a pure sample and that it is what you think it is. Well, PATRIC has a service that can help you with that. It's called Similar Genome Finder. To locate it, click on the Services tab, open that up, and click on Similar Genome Finder. This will take you to the landing page for that service.

Similar Genome Finder runs on Mash/MinHash. That's not a program we developed at PATRIC, but it's one a lot of biologists are using. This is the publication on it. Basically, what it does is take large datasets and sequence sets and reduce them to small representative sketches, so you can estimate the global distances pretty rapidly. If you want to learn more about the program, you can look at it here. We like it in PATRIC. It's really fast for this.

In other videos, we've shown you how to look up individual genomes or FASTA files. This time, we're going to look at a FASTQ file. I've got some that can be used here, so I'm going to go to this folder, and I'm going to look within a workspace that we have in PATRIC. I'm going to click on Public Workspaces from up here, so I'm going outside of my private workspace to the ones PATRIC makes public. This is one we use at the PATRIC Workshop, so I'll click on that, and there are some reads I like in Assembly. In this particular folder there's a very nice FASTQ file. You can see the size of it and when we put it in here. So I'll click OK. We may have to wait a minute. Sometimes it's a little bit tricky to load, so click on it again.

You can choose the number of hits that you want and the P value threshold. Now, this is important: when you're using genomes or FASTA files, you can mess around and be more stringent with the P value threshold, and also with the distance. But when you're using reads, remember that reads are small, and the chances of them covering all of these K-mers are not as great.
So don't mess with that. Just leave it at the lowest settings for both of these when you're using reads. And then you can say, do I want to look at NCBI reference and representative genomes, or do I want to look at all public genomes? I'm just going to choose all public genomes here, so it's searching across all of PATRIC. Then you hit Search.

Sometimes when you're looking at FASTQ reads, this can take a while. Similar Genome Finder launches the job on the fly on this page. If you close the page down, you'll kill the job. That's unlike other services, like assembly, annotation, taxonomic classification, variation, and the myriad of other services we have, where you launch the job, go away from it, and monitor it on your Jobs page. So we're going to have to wait and watch the spinning blue wheel for a while until it returns. Generally, it's slower with reads than it is with contigs and genomes, because the files are bigger. Generally, read files are pretty big.

It returns a table, and we set the limit to 50 hits. It says there are 51. I imagine it's counting the first row too. It sorts those from closest relative to least close relative, and it shows you the name, the genome status, and the contigs, and we bring in other metadata when we can find it. This particular genome was isolated in the United States and collected in 2016. Now, if you're used to looking at genomes and contigs, you're going to look at these values, the distance, the P value, and the K-mer counts especially, and you're going to go, no, this is terrible. 71 K-mers out of 1000? But remember, it's reads, so it's okay. It should be less, because reads don't have the combined coverage that the contigs do.

And then if I wanted to select either one or all of these genomes, there are a number of things I could do from that. I could download them, I could go to the Genome View, or I could create a group of them. I can even copy it and move it someplace. And you can also download the table here.
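
As a rough illustration of why 71 shared K-mers out of 1000 still gives a usable result, the published Mash distance formula can be evaluated by hand. This is only a sketch under stated assumptions: the K-mer size of 21 is Mash's documented default, and the video does not say which K-mer size the service actually uses. The function name `mash_distance` is made up for this example and is not part of PATRIC.

```python
import math

def mash_distance(shared, sketch_size, k=21):
    """Mash distance from a shared-hash count: D = -(1/k) * ln(2j / (1 + j)).

    j is the Jaccard estimate (shared / sketch_size).
    k=21 is Mash's default K-mer size, assumed here for illustration.
    """
    if shared == 0:
        return 1.0  # Mash reports a distance of 1 when nothing is shared
    j = shared / sketch_size
    # max() guards against a negative zero when the sketches are identical
    return max(0.0, -math.log(2 * j / (1 + j)) / k)

# The read-based hit from the video: 71 of 1000 K-mers shared
print(round(mash_distance(71, 1000), 4))
# A hypothetical assembled genome sharing 900 of 1000, for comparison
print(round(mash_distance(900, 1000), 4))
```

The distance grows slowly as the shared count falls, which is one way to see why a low K-mer count from reads can still point at the right genome.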
This particular download will give you the table either as text or as comma-separated values. Or you could get a number of other different combinations of the data using this. So that's the Similar Genome Finder. We've shown you how to look up by genome name, using contig files, and now using read files. Thanks for watching.

This is very close to what you had in your second assignment. This time we're going to look at the read files and try to see what the closest hits are. You can imagine the power of this if you just had some sequencing done and you want a sanity check to see if it's what you thought it was. Or maybe you don't know what it is. You could submit it to Similar Genome Finder and see what the closest genome is.

Remember way back in assembly, which was the first set of the instructional videos, you had to go into the PATRIC MOOC public workspace and go into Assembly, and there were these test reads there: MOOC_test_1, which was on the forward strand, and MOOC_test_2, on the reverse strand. You were supposed to copy those to your private workspace. So now, click on the folder underneath the text box labeled Upload FASTA or FASTQ files. Navigate to that part of your workspace and do the test 1 reads first, the ones on the forward strand. Submit the job. FASTQ files take a little bit longer than contigs and genomes, so be patient. When the results return, what's the closest genome? And once again, look at the distance, the P value, and the K-mers.

I always like to have another tab open so I can go back and forth between these. In the other tab for Similar Genome Finder, this time use the test 2 FASTQ file from the reverse strand, and launch the job. Do both read files hit the same genome, and which read file had a closer hit to the genome that was identified? I think you'll find this one pretty interesting. Good luck.
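
For anyone curious what "reducing a sequence to a small representative sketch" looks like, here is a toy bottom-s MinHash sketch in Python. It is a simplified illustration, not the code Mash or PATRIC runs: it uses SHA-1 from the standard library instead of the MurmurHash3 that Mash uses, it ignores reverse complements, and the function names, tiny K-mer size, and sketch size are all chosen just for this example.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k=5, size=8):
    """Bottom-s sketch: hash every K-mer and keep the `size` smallest hashes."""
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]

def shared_hashes(sketch_a, sketch_b, size=8):
    """Count hashes present in both sketches among the smallest `size`
    hashes of their union. Returns (shared, compared)."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared, len(merged)

# Hypothetical sequences, invented for the example
genome = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGGCTTAACGTAGCTAGGCT"
read = genome[10:30]  # a short fragment standing in for a sequencing read

shared, compared = shared_hashes(sketch(genome), sketch(read))
print(f"{shared} of {compared} sketch hashes shared")
```

A short read only contains some of the genome's K-mers, so it can only match part of the genome's sketch. That is the same effect behind the lower K-mer counts you see when you submit reads instead of contigs.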