Okay. So now that we have a theoretical understanding of text classification, let's see how to build one in Python. There are quite a few toolkits available for supervised text classification. Scikit-learn is one of them. For those of you who have gone through course three of the specialization, you have seen scikit-learn before. The other toolkit is one we have already seen in this course: NLTK. And in fact, NLTK interfaces with scikit-learn and also has interfaces for other machine learning toolkits like Weka. We will not be covering Weka in this course, but I would encourage you to check it out as well.

Let's first start with scikit-learn. Scikit-learn is an open-source machine learning library. It started as a Google Summer of Code project in 2007, and it has a very strong programmatic interface, especially when you compare it against Weka, which is primarily driven by a graphical user interface. Scikit-learn is used extensively as a machine learning library in Python. Scikit-learn has predefined classifiers; the algorithms are already there for you to use. So you could use the Naive Bayes classifier if you want to train one.

Let's go through the functions and calls you'd use for the Naive Bayes classifier. First, you need to import naive_bayes from sklearn. Then, you're going to call clfrNB = naive_bayes.MultinomialNB(), and that would be your Naive Bayes classifier. Now, we have seen earlier that there are two big ways in which Naive Bayes models can be trained: one is the multinomial model, the other is the Bernoulli model. You also have a Bernoulli model here; you would use naive_bayes.BernoulliNB() if you want that model. Once you have defined a Naive Bayes classifier, you can train it on the training data. You would use classifier.fit and pass the training data and the training labels as two parameters. If you are completely comfortable with this, you could actually merge the two together and say naive_bayes.MultinomialNB().fit(train_data, train_labels).

Once you have trained the model, you would predict the labels for a new dataset using the predict function. So you'll call classifier.predict and pass the test data, for which you have already extracted the features and so on. When you call predict, you get back the predicted labels. And once you have the labels, you can see how well you have done in your classification. In particular, if you have labeled test data, you could use metrics.f1_score, which is one of the measures you could use: you give it the test labels, that is, the gold standard, the actual labels, then the predicted labels, and you specify what kind of averaging you want to do, micro averaging or macro averaging. You have covered some of these concepts already in course three, so I'll not repeat them here, but I would point you to some of the reading material on what measures you could use instead of f1_score, what F1 means, and what kind of averaging you could do, like micro averaging and macro averaging, and so on.
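To make those calls concrete, here is a minimal sketch of the scikit-learn Naive Bayes workflow just described. The variables train_data, train_labels, test_data, and test_labels are placeholders for feature matrices and label arrays you would have prepared already.

```python
# A minimal sketch of the Naive Bayes calls described above.
# train_data / test_data are assumed to be feature matrices (for example,
# produced by a vectorizer); train_labels / test_labels are the gold labels.
from sklearn import naive_bayes, metrics

clfrNB = naive_bayes.MultinomialNB()          # or naive_bayes.BernoulliNB()
clfrNB.fit(train_data, train_labels)          # train on the labeled training data

predicted_labels = clfrNB.predict(test_data)  # infer labels for new data

# Compare predictions against the gold labels; 'micro' is one averaging option.
score = metrics.f1_score(test_labels, predicted_labels, average='micro')
```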
Sklearn, that's scikit-learn, also has an SVM classifier. So how do you train a support vector machine, or SVM? Well, the steps are very similar. In this case, you are going to import svm from sklearn and use svm.SVC as the classifier. SVC stands for support vector classifier. For SVC, you need to pass some parameters. Typically, for text classification models, you're going to focus on linear classifiers, so you would use a linear kernel: you set the kernel parameter to linear. And then you can specify the C parameter. We have talked about that in one of the earlier videos; that is the parameter for the soft margin. The default value for kernel is RBF, a radial basis function kernel, and the default value for C is one, where you are neither too hard nor too soft on the margin. Once you have defined this SVM classifier, you can fit it, or train it, the same way as you trained the Naive Bayes one: you call classifier.fit and pass the data and labels as two separate parameters. And then you'll be able to predict the same way as last time, calling classifier.predict on the test data.

Now, we need to talk briefly about model selection. You'd recall that there are multiple phases in a supervised learning task; we talked about this earlier. There's a training phase and an inference phase. You have data that is already labeled, in this case green and red, and you split that labeled data into the training data set and the hold-out validation data set. And then you have the test data, which could also be labeled and would be used to say how well you have performed on unseen data. But typically, the test data set is not labeled. So you're going to train something on a labeled set and then you need to apply it on an unlabeled test set. That means you need to use some of the labeled data to see how good these models are, especially if you're comparing between models or if you are tuning a model. For example, you have the C parameter in SVM, and you need to know what a good value of C is. So, how would you do it? That problem is called the model selection problem, and while you're training, you need to make sure that you have ways to do that.

There are two ways you could do model selection. One is keeping some part of the labeled data set separate as the hold-out data. The other option is cross-validation. For the first one, if you're doing it in scikit-learn, you're going to say: from sklearn import model_selection. That will give you the options available to you. First we'll see how you could use train_test_split. So I'm going to say model_selection.train_test_split, give it the training data and the training labels, and then specify how large your test size should be. So for example, suppose you have these data points; in this case, I have 15 of them, and I say I want to do a two-thirds, one-third split. My test size is one third, or 0.333. That would mean 10 of them would be the train set and five of them would be the test set. Now, you could shuffle the training data, the labeled data, so that you have a roughly uniform random distribution over the positive and negative classes. And you could say I want to keep, let's say, 66 percent in train and 33 percent in test, or go 80-20 if you want to: four out of five parts go into my train set and one out of five, the fifth part, becomes the test. When you do it this way, you are losing a significant portion of your training data to the test side. Remember that you cannot see the test data when you're training the model; the test data is used exclusively to tune the parameters. So your training data effectively reduces to 66 percent in this case. The other way to do model selection would be cross-validation.
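Before getting to cross-validation, here is a minimal sketch putting together the SVM calls and the hold-out split just described. As before, train_data, train_labels, and test_data are placeholders, and the values C=0.1 and random_state=0 are only illustrative choices.

```python
# A sketch of the linear SVM and the hold-out split described above.
from sklearn import svm, model_selection

# Linear-kernel SVM; C controls how soft the margin is
# (the defaults would be kernel='rbf' and C=1.0).
clfrSVM = svm.SVC(kernel='linear', C=0.1)
clfrSVM.fit(train_data, train_labels)
predicted_labels = clfrSVM.predict(test_data)

# Hold out roughly one third of the labeled data for tuning / model selection.
X_train, X_heldout, y_train, y_heldout = model_selection.train_test_split(
    train_data, train_labels, test_size=0.333, random_state=0)
```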
So five-fold cross-validation would look something like this: you split the data into five parts; these are the five folds. And then, you train five times, basically. Each time, four parts are in the train set and one part is in the test set, so you're going to train five models. Let's see. First, you train on parts one to four and test on part five. The next time, you say I'm going to train on parts two to five and test on part one, and so on. So you have one iteration where fold five is the test set, the next iteration where fold one is the test set, a third iteration where fold two is the test set, and so on. When you do it this way, you get five ways of splitting the data; every data point is in the test set once across these five folds. And then you average the five results to see how well the model performs on unseen data. The number of cross-validation folds is a parameter. In this explanation, I took it as five. It's fairly common to use 10-fold cross-validation, especially when you have a large data set: you keep 90 percent for training and 10 percent as the cross-validation hold-out data set, but because you're doing it 10 times, you're also averaging over multiple runs. In fact, it's fairly common to run cross-validation multiple times so that you reduce the variance in your results. Both of these approaches, the train-test split and cross-validation, are fairly commonly used and are critical when you're doing any model selection.

Okay. Now let's move to NLTK. How do you do supervised text classification in the Natural Language Toolkit that we have seen in fair detail in this course? NLTK has some text classification algorithms of its own. For example, it has a Naive Bayes classifier; it also has decision trees, conditional exponential models, maximum entropy models, and so on. But the really interesting thing is that it has something called a WekaClassifier or a SklearnClassifier, which gives users of NLTK a way to call the underlying scikit-learn or Weka classifiers through their code in Python.

Specifically, if you are using the Naive Bayes classifier that is available in NLTK, you are going to say from nltk.classify import NaiveBayesClassifier. Then the classifier is NaiveBayesClassifier.train, so you directly train on the train set; note that there are not two separate calls here, as there are in scikit-learn, where you first define a base model and then call a training function. Here, you say NaiveBayesClassifier.train, you train this model, and then you classify using the classify function: classifier.classify(unlabeled_instance). If it's one instance, you use the classify function; if there are many, you use classify_many and give it a set of unlabeled instances. You can also get the accuracy, the performance of the classifier, using the nltk.classify.util module and calling the accuracy function there, where you give it the classifier and the test set. That will tell you how well you did, the accuracy of the classifier that you have trained. You can also use other utility functions, like labels: classifier.labels tells you all the labels that this classifier has been trained on. And you can use show_most_informative_features, which gives you the top few features; you specify how many you want, say the top five or top 10 features that are most important or informative for the classification task. This is especially useful with Naive Bayes classifiers, when you want to know which features have the most information in them, which ones are most informative for this particular classification.
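Stepping back to model selection for a moment: scikit-learn's model_selection module can also run the cross-validation loop for you. A minimal sketch, reusing the placeholder clfrSVM, train_data, and train_labels from above:

```python
from sklearn import model_selection

# Five-fold cross-validation: each fold is held out once and the five
# scores are averaged to estimate performance on unseen data.
scores = model_selection.cross_val_score(clfrSVM, train_data, train_labels, cv=5)
print(scores.mean())
```

And here is a minimal sketch of the NLTK Naive Bayes calls just described. NLTK expects each instance as a (feature_dictionary, label) pair, so train_set and test_set are placeholders for lists of such pairs, and unlabeled_instance and unlabeled_instances are feature dictionaries without labels.

```python
import nltk
from nltk.classify import NaiveBayesClassifier

# One call both defines and trains the model, unlike scikit-learn.
classifier = NaiveBayesClassifier.train(train_set)

classifier.classify(unlabeled_instance)        # label a single instance
classifier.classify_many(unlabeled_instances)  # label a list of instances

nltk.classify.util.accuracy(classifier, test_set)  # accuracy on a labeled test set
classifier.labels()                                # labels the classifier was trained on
classifier.show_most_informative_features(5)       # top 5 most informative features
```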
For support vector machines, there is no native NLTK implementation. But as I said, you can use the scikit-learn SVM through NLTK. So here, you're going to say from nltk.classify import SklearnClassifier. And then you can actually use both Naive Bayes models from scikit-learn, using from sklearn.naive_bayes import MultinomialNB or BernoulliNB, and you can use the SVM model that you have seen earlier, with from sklearn.svm import SVC. You call these in a very similar way to how you do it in scikit-learn: you wrap the classifier in SklearnClassifier, giving it the classifier you want, and then call .train and give it the train set. Now, for MultinomialNB there were no parameters that you needed to pass, and that's okay. But for the support vector machine there are, right? You need to specify the kernel, for example. You can specify that inside the SklearnClassifier call: you say that you're going to use SVC and you pass its parameters, a linear kernel, and the C parameter, for example, can also be specified here. And then you call .train(train_set). The rest is very similar to how you would do it in sklearn: you classify your test instances in the same way, and so on.

So, the take-home messages here are: scikit-learn is the most commonly used machine learning toolkit in Python, but NLTK has its own implementation of Naive Bayes, and it has a way to interface with scikit-learn and with other machine learning toolkits like Weka, by which you can call those implementations through NLTK.
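To round things off, here is a minimal sketch of wrapping scikit-learn classifiers inside NLTK, as described above. The variables train_set and test_featuresets are placeholders for NLTK-style data (lists of feature dictionaries, with labels for training), and the value C=0.1 is only illustrative.

```python
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Wrap a scikit-learn estimator so it can be trained on NLTK-style
# (feature_dictionary, label) pairs; .train returns the trained wrapper.
clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)

# Parameters such as the kernel and C go into the wrapped estimator itself.
clfrSVM = SklearnClassifier(SVC(kernel='linear', C=0.1)).train(train_set)

# The wrapped classifiers are then used through NLTK's own interface.
predicted = clfrSVM.classify_many(test_featuresets)
```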