In the last instructional video, we built a tree. We requested 100 genes and got fewer than 100. So let's scroll down and look at the tree statistics here. I asked for 100 single-copy genes and it only found 52. I'm going to show you how to increase this, if at all possible, to 100. It involves using a different tool at PATRIC, the Protein Family Sorter. I've opened up a new tab because I want to keep that guy open. Let's go into Services and find, under Protein Tools, Protein Family Sorter, and click on that. Now we're going to add our genome groups to this. This is a tool that takes all the protein families within your selected group and allows you to see how many of them are shared and to visualize patterns within those protein families. So under genome group, I click the down arrow to find my most recent ones. Here's one with Buchnera. Buchnera are the endosymbionts of aphids, named for Paul Buchner, who was the father of endosymbiotic biology. Riesia, I want to add that group, named for Erich Ries, who was a German zoologist who studied lice. Click down to Blochmannia, named for Friedrich Blochmann and the Blochmann bodies that he discovered back in the 1880s in ants. We found out later that those bodies, those big clumps, were big clumps of bacteria, and that's what earned him a genus named after him. Move that over. This is pretty much like the phylogenetic tree box: we need to add them all together. Then Wigglesworthia. Anybody who's listened to this knows what a fan I am of Wigglesworthia, the endosymbionts of tsetse flies. Move them over. And if you have a private genome (I have a private Wigglesworthia), start to type it, and if it has a little lock in front of it, that indicates a private genome. So I'm going to add that too. Now because I have different genera here, I need to use the cross-genus families.
In PATRIC you can use the local families, which are at the genus level, or the cross-genus families, which are at the global level. So this is the one I want. Let's click Search, and it's going to launch the tool. Now sometimes we have to wait a few minutes. But what we're going to see when this pops up is this: here on the left will be a dynamic filter that shows us all the genomes and also the metadata. Here on the right will be a table showing all the protein families. It'll open showing the pan-genome for your selection, meaning it'll show all the genes that they share and all the genes that are unique or shared only within different subgroups. When the table opens, you'll be able to see the protein family ID in the first column, and the second column tells you the number of proteins within that family. And here we are. The third column has the number of genomes that have those proteins, then come the name of the protein family, the minimum amino acid length, the maximum amino acid length, the mean, and the standard deviation. It's a pretty powerful tool, and we have a webinar series on it that I would encourage you to watch. One of the things we can do pretty quickly is with this filter. You can see "present in all families" in the first column, then "absent in all families" and "either/mixed". Like I told you earlier, it's set at the pan-genome, so it's showing me everything. But the codon tree pipeline in PATRIC looks for the protein families that are shared: everybody in my selection has to have that protein family. Moreover, the club is even more exclusive, because they have to have only a single copy of the gene. If you have two copies, it won't use it unless we give it specific instructions. So to see how big the core genome is, the one that they all share, I'd have to click here. Suddenly it's showing me that there are 85. So 85. Let's go back to this tab. How come it only found 52? Well, because, for example, here we had 76 genomes, 79 proteins.
That protein family is not used. It's out of here. Anything like that doesn't get used. Let me show you that you can look at the heatmap too, and see that even more clearly. I click on the heatmap view, and because we're looking at ones that they all share, we don't see any black cells, which would indicate zero. But anywhere you see cells in these mustard or tangerine colors, we can't use any of those. So that's a huge number of them. How can we use something like that? I want to use more of those. These are genes that have more than one copy. And there's another thing that we could do too. Let me click back on the table. Actually, I want to leave this here because this is a powerful demonstration of what I want to show you. I want to open up a new tab, because I'm all about all my tabs all the time, and I'm going to do the same thing again. It'll be fast because I've already run it. But this will just cement it into your mind, using the tool and moving stuff over, watching me do it, and I know that it's not at all boring to watch someone repeat the same actions again. We all enjoy that very, very much. Moving them over, we launch the tool. Remember, if a genome has more than one copy of a particular gene in a protein family, that family gets eliminated. But what if it has fewer? We can also use the filter to see that. So, since there are 76, let me set the number of proteins per family from 75 down to 50. We don't want to go all crazy here. Right now we have 2612, and that filters it to 269. Another thing we can do: I want to go from the most proteins to the least. So I can click on that, and I think that got me to least. So I click again and that'll get me to most. And why did I do that? Because it's going to look gorgeous in the heatmap. Let's click the heatmap view and then you can see. These guys could all potentially be added. There aren't that many, but we could probably add them. Here, let me open this up so you can see it better.
I wouldn't want to add this guy up here, where a lot are missing, but these guys just have single ones that are missing. That might not be bad to add. So we know that there are more protein families we can draw on if we weren't so exclusive. I'm going to show you how to do that. I'm going to go back to my first tab where I have the job. I'm going to click on Services, and then I'll click on Phylogenetic Tree. Here we go again. We're adding them again, and I know, well, I can't imagine that people would get tired of me. Each time I click on one of these things I say Buchnera, aphids, Riesia, [inaudible]. You might get tired of me saying those things, but there's no one who would get tired of me saying Wigglesworthia, tsetse flies. And then my private genome, add that. I'm going to click on the down arrow there to open up the most recent folder, save it in endosymbiont trees, and I'm going to call it adjusted endosymbiont tree. People who have listened to me talk about these things before know this is the way I name things. I'm going to ask for 100 genes. This time, I'm going to try, watch me while I click over here, to bring in some of these that have just one or two deletions. I'm going to say 10 deletions and 10 duplications. The duplications setting applies to this group, where you could see these guys that had duplications in there, and it will use one of those genes. Okay, this is a good question: which one are you going to use? Well, actually, the [inaudible] trees algorithm does a BLAST between the two, finds the one that's closest in homology, and uses that. How cool is that? Pretty cool. I'm going to launch this thing. I'm hoping that this will get me to 100 genes. So I submit that and I get the message that the job has been submitted. But guess what? I already ran it earlier. So it's like Julia Child baking a cake. Here is the finished cake. I click on my jobs folder. I can see this one I just did is already running, but here's one I did earlier.
I click on that, and then you can see I did several iterations. This is the one we opened before: 100 genes, zero duplications, zero deletions. I did 100 genes, zero duplications, up to 10 deletions, and then this one was 10 duplications, zero deletions. Let's look at this. I highlight the row and I click View, and we're going to go once again to my favorite document, the report.html. Highlight that and click the View button, and look at this gorgeous tree. Look at all the [inaudible]. And look, this is what's so cool about these endosymbionts. You see the names here? These are the names of the insect hosts they were found in, and they're all clustering together, which is what you would expect. For the most part, even the host genus clusters together. I'm not even going to try to pronounce this one, but this is Aphis, which is a type of aphid. Cinara, these are aphids that are on trees like pine trees. Look at them now, they're all clustered, nice and neat together. This looks like my favorite tree. Blochmannia, all nestled together, all snug in their clade. Wigglesworthia, they look so beautiful, don't they? Here the endosymbionts of lice are all nestled together, and what's fun about this is some of these lice are from gorillas, some are from chimpanzees, and some are from humans. cola [inaudible] is from humans, this genome is from a chimpanzee, and this one's from a gorilla. How fun is that? This is what I really wanted to show you: 100 genes requested, 100 genes found. We have the number of amino acids that were used to build the tree, the number of alignments there, the number of nucleotides. This is a very powerful, beautiful tree.
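If it helps to see the deletions and duplications idea written down, here is a minimal Python sketch. It is not the PATRIC tree pipeline itself. The function name `eligible_families`, the toy family IDs, and the copy counts are all made up for illustration, and the sketch assumes that "deletions" means the number of genomes missing a family and "duplications" means the number of genomes carrying more than one copy.

```python
from typing import Dict, List

# Hypothetical toy data: copies of each protein family per genome,
# one list entry per genome, similar in spirit to a heatmap row.
CopyCounts = Dict[str, List[int]]


def eligible_families(counts: CopyCounts,
                      max_deletions: int = 0,
                      max_duplications: int = 0) -> List[str]:
    """Return family IDs that pass the tolerances (assumed semantics)."""
    usable = []
    for family_id, copies in counts.items():
        # Genomes that lack the family entirely.
        deletions = sum(1 for c in copies if c == 0)
        # Genomes that carry more than one copy.
        duplications = sum(1 for c in copies if c > 1)
        if deletions <= max_deletions and duplications <= max_duplications:
            usable.append(family_id)
    return usable


toy: CopyCounts = {
    "fam_A": [1, 1, 1, 1],  # one copy everywhere
    "fam_B": [1, 2, 1, 1],  # one genome has a duplicate
    "fam_C": [1, 0, 1, 1],  # one genome is missing it
    "fam_D": [0, 0, 3, 1],  # two missing, one duplicated
}

strict = eligible_families(toy)
relaxed = eligible_families(toy, max_deletions=1, max_duplications=1)
print(strict)   # ['fam_A']
print(relaxed)  # ['fam_A', 'fam_B', 'fam_C']
```

With zero tolerance only the single-copy, everywhere-present family survives. Loosening both settings by one lets two more families in, which is the same effect as going from 52 genes to 100 in the tree job.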
In the next video in this series, I'm going to show you how to take that Newick file and open it in FigTree so that you can get a better picture, because the bootstrap values here are overwriting the names. That's not publication quality, and we want you to be able to publish this stuff. Then in the last video of this series, we'll talk about how to cite it and put it into your paper, and some of the words that you should use. Thanks so much and I will see you next time.

Actually, this assignment probably should have come before the previous assignment, when you had the Mesorhizobium genome that was influencing the number of protein families that were used. Probably even before you launch the tree, this might be a good thing to do, to see how many genes you expect to get. But we're going to do it now, and that's going to be okay. I love the Protein Family Sorter. In fact, the tool is based on some early work that I did, so I'm very fond of it. Here's the assignment for you to do with Protein Family Sorter, and this is just to get you comfortable with it when thinking about building trees. I want you to take that one genome, that bacteria contact genome that we've been using in the trees, and put it and all the groups into the Protein Family Sorter. If you wanted to pretend that you hadn't removed one of the Mesorhizobium genomes, go ahead and use that group, but whatever. Have it set to PGFams. Why? Well, because you have different genera here. Get it going, and when the table returns, here are your questions. What's the size of the pan-genome? How many protein families do all the genomes have at least one protein in? How many perfect protein families are there? In a perfect protein family, there's one protein per genome. They would be those yellow boxes in the heatmap viewer. What's your estimate of the number of genes that'll be used to build a tree with the following settings: 100 genes with zero duplications and zero deletions?
This would be fine if you still have that Mesorhizobium, that complete group, if you hadn't removed it, but even if you did, you can play with it. Or what about 500 genes with one duplication and zero deletions? The whole purpose of this is to let you know if your genomes are even tree-able together, whether you can even create a tree with them. It'll also let you know if you have a really bad problem genome in there. And it'll help me, because when your tree jobs fail, you send an error message to PATRIC and I get it, and I have to go look at it. So it'll benefit us both. It's better for you, it's better for me, it's better for all of us, and it helps you move faster to that publication that you want. Good luck. Let me know if you have any questions. Bye.
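For anyone who wants to check their assignment answers outside the browser, here is a small Python sketch of the counting. It assumes you have copied the protein family table into rows with a family ID, a protein count, and a genome count. The `FamilyRow` class, the `summarize` function, and the toy rows are hypothetical, not a PATRIC export format. Only the 76 genomes and 79 proteins row echoes the example from the walkthrough.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FamilyRow:
    family_id: str   # hypothetical ID, not a real PGFam identifier
    proteins: int    # number of proteins in the family
    genomes: int     # number of genomes that have the family


def summarize(rows: List[FamilyRow], total_genomes: int) -> Dict[str, int]:
    """Count pan-genome, core, and perfect families from table rows."""
    # Pan-genome: every family seen in the selection.
    pan = len(rows)
    # Core: families present in all genomes (at least one protein each).
    core = sum(1 for r in rows if r.genomes == total_genomes)
    # Perfect: present in all genomes with exactly one protein per genome.
    # If every genome has at least one and the total equals the genome
    # count, each genome must have exactly one.
    perfect = sum(
        1 for r in rows
        if r.genomes == total_genomes and r.proteins == total_genomes
    )
    return {"pan": pan, "core": core, "perfect": perfect}


rows = [
    FamilyRow("fam_1", 76, 76),  # perfect family
    FamilyRow("fam_2", 79, 76),  # in all genomes, but with duplicates
    FamilyRow("fam_3", 75, 75),  # missing from one genome
    FamilyRow("fam_4", 3, 2),    # rare family
]

summary = summarize(rows, total_genomes=76)
print(summary)  # {'pan': 4, 'core': 2, 'perfect': 1}
```

The perfect count is a reasonable first estimate for a tree job with zero duplications and zero deletions, under the assumption that the tree service applies the same single-copy rule described in the video.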