In this lesson we're going to discuss decision trees. Decision trees are a type of supervised learning algorithm most commonly associated with classification, but they can be used for regression as well. They are capable of accepting both continuous and categorical data. Internally, they form a tree-like structure with different nodes corresponding to different features. Here is a fictional sample of a dataset containing ten samples from an officer's speeding stops. She always patrols the same stretch of highway where the speed limit is 90 kilometers per hour, or about 55 miles per hour. Looking at this, we might start to derive rules by analyzing the impact each feature has on the outcome. For example, does the good weather feature impact whether someone got a ticket? How about the speed? Is there a threshold? Internally, decision trees are if-else structures. Here we see a decision tree for the previous dataset. Root and branch nodes represent decisions based on some value in the data, and leaf nodes represent an output. The first rule the machine found is regarding speed. If it's less than or equal to 62.5, other factors determine whether a ticket was issued; if it's over, a ticket was issued. The next thing it checks is fought_with_spouse. If less than or equal to 0.5, then a ticket wasn't issued; otherwise, it was. Of course, fought_with_spouse was categorical, either zero or one, but the rule works. The gini value that you see is the score of the internal cost function of that node. 0.5 is the worst for a two-class classification tree, meaning there was an even split between the two class outputs, and zero is the best, meaning the node outputs only one class, making it a leaf node. The tree internally evaluates features and potential splits with this function over and over until it reaches a state in which each input will flow to a terminal leaf node. Keep in mind, however, that oftentimes decision trees aren't 100% accurate.
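The split scoring just described can be sketched with a tiny Gini impurity function. This is a minimal, hypothetical helper (not the lesson's code) that mirrors what the tree computes at each node for a two-class problem:

```python
def gini(labels):
    """Gini impurity of a list of class labels.

    0.0 means the node is pure (a leaf); 0.5 is the worst
    possible score for a two-class problem (a 50/50 split).
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# An even two-class split scores 0.5 (worst); a pure node scores 0.0 (leaf).
print(gini([0, 0, 1, 1]))  # 0.5
print(gini([1, 1, 1, 1]))  # 0.0
```

The tree tries candidate thresholds on each feature and keeps the split that most reduces this impurity in the resulting child nodes.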
This is an engineered example, so we won't worry too much about this, and fully understanding the mathematics isn't necessary to use the power of decision trees. Now that you have an idea of what decision trees are and how they work, let's dive in. As usual, we'll start by ensuring we have the necessary libraries installed and imported. Then we'll set up PyMongo. For this lesson, we will be connecting to the class Atlas cluster. A link to the dataset in JSON format is included in the comment. You can also download the dataset from the UCI Machine Learning Repository. The dataset consists of various measurements taken by radio telescopes. It contains eight continuous features and one class label, whether it is a pulsar or not. After the setup, we'll issue a query to the database, projecting away the _id field. We'll create a pandas DataFrame from that, and then look at the head of that DataFrame. Here we can see the first five results: the eight continuous features I spoke about, and the one class feature, 0 meaning it wasn't a pulsar, and 1 meaning it was. Next, let's use pandas' describe method to get an idea of the distribution of our data. We've used the describe method, and here's our distribution. It looks like we have some good distributions of data for a lot of features, and not so much for a few others. For example, looking at the skew_dmsnr feature, we can see we have a mean of 104.86, with a standard deviation of 106.5. The min is negative 1.9 and the max is 1,191. That max is about 11 standard deviations from the mean. So, let's look at a distribution plot to get a visual indication. Here, I've extracted the data in the skew_dmsnr column where it is not a pulsar and assigned it to a variable q, and assigned the skew_dmsnr column where it was a pulsar to a variable named r.
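The query-and-inspect flow above might look roughly like this sketch. The connection details and collection names are hypothetical (they're shown only in comments), and a couple of stand-in documents are used so the DataFrame steps run anywhere; in the lesson the documents come from the Atlas cluster:

```python
import pandas as pd

# In the lesson, the documents come from MongoDB, roughly:
#   from pymongo import MongoClient
#   coll = MongoClient(uri).pulsars.htru2        # hypothetical db/collection names
#   docs = list(coll.find({}, {'_id': 0}))       # project away the _id field
# Stand-in documents so this sketch is self-contained (values illustrative):
docs = [
    {'mean_ip': 140.5, 'std_ip': 55.6, 'skew_dmsnr': 7.9, 'pulsar': 0},
    {'mean_ip': 102.5, 'std_ip': 58.8, 'skew_dmsnr': 99.4, 'pulsar': 1},
]

df = pd.DataFrame(docs)
print(df.head())      # first rows: continuous features plus the class label
print(df.describe())  # count, mean, std, min, quartiles, max per column
```

The describe output is what lets us spot skewed features like skew_dmsnr, whose max sits many standard deviations from its mean.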
I'll then plot each according to the distribution and overlay them on top of each other. And here we can see the results. We can see that the distributions do not line up, and we have these long tails. This may cause us problems in the future. Let's look at another visualization using a seaborn PairGrid. This is a pretty interesting visualization, and we can see that some features more evenly separate the class labels than others. For example, here it looks like we have good, clean splits at the ends, but we can see overlap right here, and the same for this and this. Now, let's split our data, removing class labels from features. I'll assign the features to x_origin and the class to y. And then let's look at a correlation matrix. This magnify function I found on Stack Overflow is a great addition to viewing the correlation matrix. Thinking back to our correlation lesson, remember that positive numbers indicate a positive correlation, with numbers approaching one indicating a stronger and stronger correlation. Conversely, negative numbers indicate a negative correlation, and as they approach negative one, the negative correlation gets stronger and stronger. So, we would expect that we have a perfect correlation between pulsar and pulsar. We can also see some other pretty strong correlations. For example, ex_kurt_ip correlates to a pulsar with 0.79. And we can see that mean_dmsnr is 0.4, mean_ip strongly negatively correlates, and so on and so forth. Now that we have an idea of how our variables correlate, would we expect that the variables which strongly correlate, positively or negatively, with being a pulsar have a higher weight within this decision tree? Let's go ahead and find out. First, we use scikit-learn's DecisionTreeClassifier. Because it's supervised, we'll pass in the class labels. And here we use the train_test_split method to generate a training set and a test set. Here, we use our trained model to generate predictions.
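The split-fit-predict steps just mentioned can be sketched as follows. The synthetic arrays here are stand-ins for x_origin and y (the real features come from the pulsar DataFrame); the scikit-learn calls are the same ones the lesson uses:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_origin / y: 8 continuous features, a binary label.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Generate a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Supervised learning: we pass in the class labels when fitting.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Use the trained model to generate predictions on held-out data.
predictions = clf.predict(X_test)
print(clf.score(X_test, y_test))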
We'll also generate a confusion matrix. The function plot_confusion_matrix is taken from scikit-learn's documentation on confusion matrices. We'll also print out a classification report for better metrics. This seems pretty impressive for a single decision tree with not very many training instances. Ideally, we'd prefer hundreds of thousands or even millions of samples. The columns of this confusion matrix are the predicted labels, and the rows are the actual labels. The top left is true negatives, which means it wasn't a pulsar and was predicted not to be. The bottom left is false negatives, where it was a pulsar but was predicted not to be. The top right is false positives, where it wasn't a pulsar but was predicted to be one. And the bottom right is true positives, where our model correctly predicted that it was a pulsar. Let's look at the classification report and discuss what precision, recall, f1 and support are. First, precision. Precision is a metric that gives us information about the performance of our model in terms of how well it predicts true positives compared to false positives. The better the precision, the more picky our model is. If we have a cutting-edge treatment for some catastrophic disease but the side effects are severe, we'd want very high precision in diagnosing the disease. Conversely, recall deals with false negatives. It gives us performance information on our model in terms of how well it predicts true positives compared to false negatives. The better the recall, the more general our model is. If that same catastrophic disease is spreading and we're in charge of quarantine, we might want a model with higher recall. More people will be placed in quarantine and have to be screened, maybe they had a sniffle or a cough, but that's better than missing someone. The f1 score combines precision and recall; it is the harmonic mean of the two. Finally, support. Support is the occurrence count of each class label in this particular run.
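The matrix layout and the metric definitions above can be made concrete with a tiny example. The labels here are made up for illustration; the scikit-learn functions are the ones the lesson uses:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Tiny illustrative labels: 0 = not a pulsar, 1 = pulsar.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are actual labels, columns are predicted labels:
#   [[TN, FP],
#    [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # true positives vs. false positives ("pickiness")
recall = tp / (tp + fn)     # true positives vs. false negatives ("coverage")
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(cm)
print(classification_report(y_true, y_pred))
```

With these counts, precision and recall both work out to 2/3: one real pulsar was missed (a false negative) and one non-pulsar was flagged (a false positive).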
So we can see that 4,866 were not a pulsar and 504 were. Let's go over recall and precision with another example. Imagine we are an evil power programming a model to catch a very particular pair of hobbits traveling with an item that is our only true weakness. In this scenario we'd want a model with higher recall than precision. We'd get fewer false negatives at the expense of more false positives, meaning we'd capture and imprison more hobbits, and perhaps anything that looked like a hobbit. Not all may carry the one ring, but we'll most likely catch the ones that do. Conversely, let's imagine we have a model that determines whether or not to buy a stock. A model with higher precision than recall may miss some stocks that were worth buying, but the ones it does buy will be good, meaning it is less likely to gamble with our hard-earned money. Okay, let's go back to our notebook and look at the pulsar data some more. Looking at our first model, we can see that when predicting whether something is a pulsar, it is more likely to incorrectly label an object as a pulsar than it is to dismiss one. And, in all honesty, our current model has both poor precision and poor recall. Low scores in both mean we are missing a lot of pulsars as well as making a lot of false predictions. I wouldn't be happy or comfortable putting this model to work. Is there anything else we can do to improve? Well, for starters, we could try to scale our data. Decision trees are more resilient to non-standardized data than many other algorithms, but scaling may help a little. So let's scale the data and see if we get any improvements. And the results are negligible. Let's try something else: enter the random forest classifier. The random forest classifier is an ensemble method, meaning it combines output from many models to produce a better model. It builds many decision trees on random subsets of the data and features, then combines their predictions. Using it is as easy as assigning it to a variable and fitting the data.
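The scaling attempt and the switch to a forest might look like this sketch. Again the data is a synthetic stand-in for the pulsar features; the scaler and classifier are the real scikit-learn classes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pulsar features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 1] - X[:, 5] > 0).astype(int)

# Scaling is largely optional for tree-based models, but easy to try.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=0)

# An ensemble of 100 decision trees whose predictions are combined.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```

Usage is deliberately identical to the single tree: assign the classifier to a variable, fit, and predict; only the model class changes.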
I've chosen a value of 100 for n_estimators here based on playing with the data previously to produce the best results. This is the number of decision trees the forest will consist of, and we can see much better results. Our precision has increased substantially, and our recall went up as well. At this point we might want to go back to the source data itself to clean it more, or maybe to see if we can eliminate features or create new features from existing ones, otherwise known as engineered features. One nice thing about decision trees and random forests is that we can ask the computer what relevancy it assigned to our features. Pretty neat. And these match very closely with the absolute values of the correlation coefficients that we calculated earlier in the lesson. Now, one last thing. The describe method is very valuable in getting an idea of our data, but sometimes we may just want that information before we fetch everything from the database. A recipe is included at the bottom of this notebook to calculate that information with an aggregation. First, we start by removing the _id field. Then, we find all possible keys in the data. Once we've found all those keys, we'll merge and flatten them into a unique set. After this, we need to do a few things. In order to calculate our percentiles, we need to use the $bucketAuto stage followed by a $group stage to get the relevant information. We do that in this portion here, collecting the segments and creating an entry for each key. We also perform grouping operations by key to get the total count, standard deviation, and mean. Ultimately, all of this is done within a $facet stage, here. We also use a $project stage after the $facet to shape our results into an easy-to-parse format. Here's an example of the full aggregation pipeline the above produces. We can see we have a $facet, our $bucketAuto stage, our $group stage, and all of our keys.
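A minimal sketch of that describe-style aggregation for a single key might look like this. The stage shapes follow MongoDB's $project, $facet, $bucketAuto, and $group operators, but the field name ('mean_ip'), the bucket count, and the facet labels are illustrative assumptions, and the lesson's full recipe repeats this per key and adds a final $project:

```python
# Build the pipeline as plain Python dicts, as PyMongo expects.
key = 'mean_ip'  # hypothetical example key; the recipe loops over all keys

# $bucketAuto splits the values into evenly-populated buckets, which gives
# us quartile-style boundaries; $group collects the bucket boundaries.
percentiles = [
    {'$bucketAuto': {'groupBy': f'${key}', 'buckets': 4}},
    {'$group': {'_id': None, 'splits': {'$push': '$_id'}}},
]

# A parallel branch computes count, mean, and standard deviation.
summary = [
    {'$group': {
        '_id': None,
        'count': {'$sum': 1},
        'mean': {'$avg': f'${key}'},
        'std': {'$stdDevSamp': f'${key}'},
    }},
]

pipeline = [
    {'$project': {'_id': 0}},  # remove the _id field first
    {'$facet': {f'{key}_percentiles': percentiles, f'{key}_summary': summary}},
    # A final $project (not shown) reshapes the facet output for parsing.
]
print(pipeline)
```

Because $facet runs its sub-pipelines over the same input documents, all of the per-key statistics come back in a single round trip, which is what lets the server do the describe work before anything is fetched to the client.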
And if we scroll down to the bottom, we'd see the $project stage cleaning everything up before it's sent back to us. And here is the aggregation describe and the pandas describe, nearly identical. All right, this has been a rather long lesson, so let's wrap it up and discuss what you've learned. We've talked about how decision trees work at a high level, how to use a simple decision tree model from scikit-learn, what precision, recall and f1-score are and why you may prefer higher results in one category over another, and how to use a random forest classifier for better results. Lastly, we included a bonus aggregation recipe that shows you how to describe your data without having to pull everything down to the client side.