In the previous video, we discussed how regression models, or QuAMs, can be evaluated for their performance. In this video, we'll take a look at how a classification model can be evaluated. By the end, you'll understand the most important aspects of evaluating classification models and when different evaluation measures are most appropriate for your problem. Remember that classification is about finding a model that labels examples with the correct class or category. As with regression, classification models can be assessed by comparing the predicted classes to the actual labels that came with the dataset.

The most common and natural way to assess a classification model is to calculate the accuracy. You're used to this from school, and even Coursera: your quiz grade is calculated by dividing the number of questions you got correct by the total number of questions. In other words, accuracy measures the percentage of correct predictions. That's the ratio of the number of predictions that matched the labels to the total number of data points in the set.

But as we've already mentioned, accuracy doesn't always tell the complete story. Take the example of medical diagnosis. Often, the cost of mislabeling a sick patient as healthy is worse than the cost of mislabeling a healthy patient as sick, especially early on, because it might prevent further diagnostic tests from being run or delay the start of treatment. Mislabeling a sick patient as healthy is a type two error, and accuracy can't directly measure this. So let's introduce some more detailed measures, some of which will be familiar from previous lectures.

Let's start with a particular binary classification problem: labeling transactions as fraudulent or not. Every new transaction is fed into our QuAM and labeled with a one for fraudulent or a zero for not. There are four possible outcomes in this case. We could correctly identify fraudulent transactions; those are true positives. We could correctly identify non-fraudulent transactions; those are true negatives. We could incorrectly label non-fraudulent transactions as fraud; those are false positives, positive because we called them fraudulent, false because we were wrong. These are type one errors. Similarly, false negatives are transactions we labeled with a zero when they are in fact fraudulent; these are type two errors. This set of four outcomes is referred to as the confusion matrix, because it provides details about how the model is confused, not because type one and type two errors are so easy to confuse. If you want, you can review the video from course one on different kinds of wrong to refresh your memory further.

Now that we have definitions for these different classification outcomes, we can see that accuracy is actually our true answers, true positives plus true negatives, divided by the total number of data points. We can use these distinctions to calculate some other important evaluation measures, in particular precision, recall, and the F-measure.

Precision measures how much you can trust the accuracy of your positive labels. It's the number of true positives divided by the total number we labeled positive, that is, true positives plus false positives. In other words, of all the things we called positive, what percentage were correct? A precision of one means we never call a transaction fraudulent inappropriately: when our QuAM calls a transaction fraudulent, you can be confident it's correct.

Recall measures how much you can trust the coverage of your positive labels. It's the number of true positives divided by the number of actual positive instances, which is true positives plus false negatives. In other words, of all the positive instances, what percentage did we catch? A recall of one means we didn't miss anything.
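To make those four outcomes and the measures built from them concrete, here's a minimal sketch using scikit-learn's metrics package. The labels are made up purely for illustration, mirroring the fraud example (one for fraudulent, zero for not), and the hand-computed values simply match what accuracy_score, precision_score, and recall_score report.

```python
# Minimal sketch of the confusion matrix, accuracy, precision, and recall.
# The labels below are invented for illustration: 1 = fraudulent, 0 = not fraudulent.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # labels that came with the data
y_pred = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # what the QuAM predicted

# The four outcomes of binary classification.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # same value as accuracy_score(y_true, y_pred)
precision = tp / (tp + fp)                  # same value as precision_score(y_true, y_pred)
recall = tp / (tp + fn)                     # same value as recall_score(y_true, y_pred)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```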
If you know that what's most important is not to call something category X when it isn't, then you should be evaluating your models using precision as well as accuracy. If what's most important is not to miss anything belonging to category X, then recall is the most important measure. But what if you want to balance the two? Well, the F-measure gives us the ability to do that.

When the F score is one, it means we have perfect precision and recall. That also means we have perfect accuracy, so we already have a measure for that. The usefulness of the F score comes into play when we don't have perfect accuracy, in other words, in pretty much any real application. It reports the average, specifically the harmonic mean, but you don't need to worry about that, of precision and recall. The default way to use the F score is the F1 score, which puts equal weight on precision and recall. The weighted F-beta score lets you shift the balance between precision and recall, with beta less than one putting more weight on precision and beta greater than one weighting recall more heavily. Scikit-learn lets you use either of these, with a default beta of one in the case of F-beta. If you look at the metrics package in scikit-learn, you'll see a handful of other error measures appropriate to different kinds of classification or to putting different emphasis on different kinds of errors. Although these measures were defined for binary classification, all of them can be extended to evaluate multi-class classification problems as well.

There's another case where plain accuracy may not be good enough: when you have imbalanced classes. This means you have many more examples of one class, say healthy individuals, than of another, say those with some rare disease. Misclassifying one of those rare instances counts just as much as misclassifying any other instance, but because there are only a few examples of the rare case, the overall significance of those mistakes is far less than the significance of mistakes on the majority class. In the worst case, the best hypothesis will be the one that completely ignores the rare cases and just predicts the majority class. This may be fine in some circumstances, but not if the QuAM you're building is meant to detect those rare cases. When you want to know how well a classifier is performing on a specific class label, you can use class-wise accuracy. In the scikit-learn metrics package, class-wise accuracy can be found with the classification_report function, which, not surprisingly, gives details broken down by class.

Evaluation measures are specific to the domain, so you should think about how your model is going to be used to determine which measure is most useful. The evaluation measure you use on the test data is what you use to determine how well your model is going to actually perform in the real world. Now that you've seen a handful of classification error measures and understand what they actually report, you're ready to make an informed decision about evaluating your QuAMs. In the next video, we'll discuss one more sophisticated tool for evaluating generalization capabilities. See you there.
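For reference, here's a second minimal sketch covering the F scores and the class-wise breakdown discussed above, again assuming scikit-learn and using made-up labels in which the positive class is rare.

```python
# Small sketch of the F measures and the per-class report.
# Invented labels: class 1 is the rare positive class, class 0 is the majority class.
from sklearn.metrics import f1_score, fbeta_score, classification_report

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

# F1 is the harmonic mean of precision and recall: 2 * P * R / (P + R).
print(f1_score(y_true, y_pred))

# F-beta shifts the balance: beta < 1 weights precision more, beta > 1 weights recall more.
print(fbeta_score(y_true, y_pred, beta=0.5))
print(fbeta_score(y_true, y_pred, beta=2.0))

# Precision, recall, and F1 broken down by class, which is where class imbalance shows up.
print(classification_report(y_true, y_pred))
```

With beta equal to one, fbeta_score gives the same value as f1_score, which matches the default behaviour described in the lecture.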