As usual, we'll start by ensuring we have the necessary libraries installed and imported. Then we'll set up PyMongo. For the lesson, we'll be connecting to the class Atlas cluster; a link to the data set in JSON format is included in the comments, or you may navigate to the UCI Machine Learning Repository to download the original data set. This data set consists of various measurements taken by radio telescopes. It contains eight continuous features and one class label: whether the object is a pulsar or not.

Let's look at the data. As we can see, there are nine columns: eight continuous features and one class label. Let's use the describe method to get an idea of the data distribution. It looks like we have good distributions for a lot of the features and not so much for a few others. For example, looking at the skew_dmsnr feature here, we see a mean of 104.86 and a standard deviation of 106.51, with a max of about 1,191. That max is roughly 10 standard deviations from the mean. So let's look at a distribution plot of that feature to get a visual indication.

Here, I've extracted the data from the skew_dmsnr column into two variables: one where the object is a pulsar and one where it isn't. Then I'll overlay a plot of the two so we can see their distributions. And we have a pretty neat visualization: red is the distribution for pulsars and blue is the distribution for non-pulsars. As we can see from this long tail, this distribution may cause us problems.

Next, let's use seaborn's PairGrid to visualize all the features. We'll specify the features to plot in the vars argument and the class label in the hue argument. Here we see our PairGrid, where the features are plotted against each other and colored by class. We can see a clear separation on some of the features and a big overlap on others.

We now have a pretty good idea of what our features look like compared against each other, but how do they correlate with whether something is a pulsar or not? For that, we use a correlation matrix. This function, taken from scikit-learn's documentation, will help us view the correlation matrix in an easier-to-understand manner. Here, with my mouse over the pulsar column, I can see how the other features correlate with whether something is a pulsar. It looks like ex_kurt_ip has a strong positive correlation, whereas mean_ip has a strong negative correlation.

So, now that we have a good understanding of how our data correlates and what it looks like, let's train a DecisionTreeClassifier. As usual, we'll start by splitting our data into a training and testing set, instantiating a DecisionTreeClassifier, assigning it to the clf variable, and then fitting it with our training data. Now let's get predictions from our model and look at them in a confusion matrix to see how well the model did. Again, this function is taken almost verbatim from scikit-learn's documentation; it's only been changed a little to improve the formatting. We'll also output a classification report that will help us judge how well our model did. And we can see we have average scores of 0.97 for precision, recall, and f1-score. This seems pretty good.
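To tie the exploration steps above together, here is a minimal sketch of what that notebook code might look like. It is written under some assumptions: the lesson loads the data from the class Atlas cluster, while this sketch reads a hypothetical local CSV; the "target_class" column name and the kdeplot/heatmap helpers are stand-ins for whatever the lesson's own helper functions use, and only skew_dmsnr, ex_kurt_ip, and mean_ip are column names actually mentioned in the lesson.

```python
# A rough sketch of the exploration steps above, assuming the data is in a
# pandas DataFrame. File name and "target_class" column name are placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("HTRU_2.csv")   # hypothetical local copy of the data set
print(df.describe())             # summary statistics for each feature

# Overlaid distribution plot of skew_dmsnr, split by class
pulsar = df[df["target_class"] == 1]["skew_dmsnr"]
not_pulsar = df[df["target_class"] == 0]["skew_dmsnr"]
sns.kdeplot(pulsar, color="red", label="pulsar")
sns.kdeplot(not_pulsar, color="blue", label="not a pulsar")
plt.legend()
plt.show()

# PairGrid of every feature against every other, colored by class
features = [col for col in df.columns if col != "target_class"]
grid = sns.PairGrid(df, vars=features, hue="target_class")
grid.map_offdiag(plt.scatter, s=5)
grid.map_diag(plt.hist)
grid.add_legend()
plt.show()

# Correlation matrix, including the class column, shown here as a heatmap
# rather than the documentation helper the lesson uses
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
```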
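And here is a comparable sketch of the training and evaluation steps, reusing the DataFrame and the features list from the block above. The 30 percent test size is an assumption, and the raw confusion matrix is printed directly instead of using the plotting helper the lesson adapts from scikit-learn's documentation.

```python
# A minimal sketch of the training and evaluation steps described above.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

X = df[features]          # the eight continuous features
y = df["target_class"]    # 1 = pulsar, 0 = not a pulsar

# Split into training and testing sets (the 30% test size is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit a decision tree and predict on the held-out data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Rows are actual labels, columns are predicted labels
print(confusion_matrix(y_test, predictions))

# Per-class precision, recall, f1-score, and support
print(classification_report(y_test, predictions))
```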
However, let's take a closer look at what those scores actually mean. To do that, we'll look at the confusion matrix. The columns of the confusion matrix are the predicted labels and the rows are the actual labels. So the top left is the true negatives, which in this case means it wasn't a pulsar and was predicted as not a pulsar. The bottom left is the false negatives, where it was a pulsar but was predicted not to be. The top right is the false positives, where it wasn't a pulsar but was predicted to be one, and the bottom right is the true positives, where it was a pulsar and was predicted to be a pulsar.

Knowing what the confusion matrix represents, let's take a look at our classification report again. Are we really getting 97 percent in precision, recall, and f1-score? Technically we are, but the model is still kind of trash. Let's look at support, which is the number of occurrences of each class within the testing data. We had 5,370 samples to test. Of those, 504 were a pulsar and 4,866 were not. So if we divide 4,866 by the total, we get about 90 percent accuracy just by always guessing "not a pulsar."

Let's discuss what precision, recall, and f1-score are. Precision is a metric that gives us information about the performance of our model in terms of how well it predicts true positives compared to false positives. The better the precision, the pickier our model is. So if we have a cutting-edge treatment for some catastrophic disease but the side effects are severe, we'd want very high precision in diagnosing the disease. Recall deals with the false negatives. It gives us performance information on our model in terms of how well it predicts true positives compared to false negatives. The better the recall, the more general our model is. If that same catastrophic disease is spreading and we're in charge of quarantine, we might want a model with high recall. More people will be placed in quarantine and have to be screened; maybe they only had a sniffle or a cough, but that's better than missing someone. The f1-score is a combined score of precision and recall: the harmonic mean of the two.

Let's reinforce this with a slightly different example. Imagine we are an evil power programming a model to catch a particular pair of hobbits travelling with an item that is our only true weakness. In this scenario, we'd want a model with higher recall than precision. We'd get fewer false negatives at the expense of more false positives, meaning we'd capture and imprison more hobbits, and perhaps anything that looked like a hobbit. Not all of them may carry the One Ring, but we'll most likely catch the ones that do. Now imagine we have a model that determines whether or not to buy a stock. A model with higher precision than recall may miss some stocks that were worth buying, but the ones it does buy will likely be good, meaning it is less likely to gamble with our hard-earned money.

Back to our pulsar model. We can see that our model is more general than it is picky: when predicting whether something is a pulsar, it is more likely to incorrectly flag an object as a pulsar than it is to miss one that really is a pulsar. In all honesty though, our current model has both poor precision and poor recall for determining whether something is a pulsar. We should expect that our model could get 90 percent accuracy just by labeling everything as not a pulsar. Looking at the support column again, out of 5,370 test samples, 4,866 were not a pulsar, and 4,866 divided by 5,370 gives us a 90 to 91 percent accuracy rating to beat. I wouldn't be happy putting this model to work.
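To make those numbers concrete, here is a small sketch that pulls the counts out of the confusion matrix and recomputes the metrics by hand, reusing y_test and predictions from the earlier sketch. The individual cell counts depend on your run; the 4,866 and 5,370 figures are the support numbers quoted above.

```python
# Recomputing the metrics discussed above from the confusion matrix counts.
from sklearn.metrics import confusion_matrix

# For a binary problem, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

precision = tp / (tp + fp)                            # how picky the model is
recall = tp / (tp + fn)                               # how general the model is
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two
print(precision, recall, f1)

# Baseline: labeling everything "not a pulsar" already scores about 90 percent
print(4866 / 5370)   # roughly 0.906
```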
Is there anything else we can do to improve? Well, we can try scaling our data. Decision Trees are much more resilient to non-standardized data than many other algorithms, but scaling may help a little. Let's scale the data to see if we get any improvements. Here, I'll make a copy of all my features so that I can scale them without disturbing the originals. I'll use the StandardScaler from scikit-learn to fit and transform my features, then train a new DecisionTreeClassifier with those scaled features. Now let's look at the results. It looks like the difference is negligible, and depending on your run and how the algorithm worked out for you, the results could in fact be worse.

Let's try something else: a Random Forest Classifier. A Random Forest Classifier is an ensemble method, meaning it combines the output of many models to produce a better model. It trains many Decision Trees on random subsets of the data and combines their predictions. As before, we assign the classifier to a variable and fit it to the training data. I've chosen a value of 100 for n_estimators here, based on testing and playing around with the data prior to the lesson; this is the number of Decision Trees the forest will consist of. Now that I've trained it and used it to predict, let's see the results. These are much better results. Our precision has increased by over 10 percent and our recall went up as well. At this point, we might want to go back to the source data itself to clean it further and see if we can eliminate features, or create new features from existing ones, otherwise known as engineered features.

One thing of note with Decision Trees and Random Forests is that we can ask the model what importance it assigned to each of our features. Pretty neat. And these map very closely to the absolute values of the correlation coefficients we looked at earlier in the correlation matrix. A consolidated code sketch of these last few steps follows the recap below.

We've covered a lot of information in this lesson, so let's recap what you've learned. You learned how Decision Trees work at a high level, how to use a simple Decision Tree model from scikit-learn, what precision, recall, and f1-score are and why you may prefer higher results in one over another, and how to use a Random Forest Classifier for better results.
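As promised, here is a consolidated sketch of the scaling, Random Forest, and feature-importance steps walked through above. It reuses the names from the earlier sketches (X, y, X_train, X_test, y_train, y_test, features), the 30 percent test size remains an assumption, and the exact scores and importances will vary from run to run.

```python
# Scaling the features, retraining a tree, then trying a Random Forest.
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Copy the features so the originals are left untouched, then standardize
X_scaled = StandardScaler().fit_transform(X.copy())
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_scaled, y, test_size=0.3)

scaled_clf = DecisionTreeClassifier()
scaled_clf.fit(Xs_train, ys_train)
print(classification_report(ys_test, scaled_clf.predict(Xs_test)))  # usually a negligible change

# Random Forest: an ensemble of 100 decision trees
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train)
print(classification_report(y_test, rf_clf.predict(X_test)))

# Ask the forest what importance it assigned to each feature
for name, importance in zip(features, rf_clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```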