In this hands-on lecture, we'll discuss document classification and use one technique, logistic regression, to try to classify documents. On top of logistic regression, as I explained in the lecture note, we have LIBSVM classification available in yTextMiner. But LIBSVM needs a huge amount of data to meaningfully predict the target class, so we're only going to supply the LIBSVM classification module for your own exercise. For the purpose of this live session, I'm going to stick with the LingPipe version of logistic regression.

If you look under the test.main package, you'll see LibSVMClassificationMain.java and LibSVMMain.java. The difference between the two is that LibSVMClassificationMain.java employs a third-party library called Java-ML, while LibSVMMain.java simply uses libsvm.jar. Except for that difference, those two classification modules work very similarly, so you can pick one over the other. As I said before, I'm going to stick with LingPipeLogisticRegressionMain.java.

We'll use the New York Times dataset for this exercise. The second file is nytimes_news_json.txt, which we use to grab a label for each news article. For the label, we use the section heading of each news article, so we need to parse nytimes_news_json.txt so that each news article is assigned its section label.

We'll use the Collection class. If you look at the Collection class, it has two constructors. For simple document classification or TF-IDF calculation we use the first constructor, which takes a list of String objects for the documents. But for logistic regression based classification we're going to use the second constructor, which takes the number of documents and the number of classes. Classes means categories.

So let's go back to LingPipeLogisticRegressionMain.java. You have two array lists: one for the documents and one for the class labels of the documents. The next code is for parsing the news articles, and also for parsing the JSON file to grab the correct category of each particular news article. Then you close the two scanners. From line 67 downward, first you transform, or calculate, the value for each feature, that is, each term, and then you simply create a classifier model based on logistic regression, which is built into the LingPipe library. After you successfully build the trained classifier, you test it, okay? By testing, you will see how well the trained classifier performs. If you recall what you learned in the lecture note, we can apply ten-fold or five-fold cross validation, but since the purpose of this exercise is creating a logistic regression classifier and seeing how well it performs, we simply use the measure of accuracy.

Okay, let's take a closer look at the code. Since news articles were chosen for this demo, we only use a hundred, okay? If we reach 100, it exits out of the loop, so we're going to process 100 news articles. And here, we're going to grab the section, which is the class label information for the particular news article. We simply use a regular expression. Instead of relying on a regular expression, you can use a third-party JSON library: you simply create a JSONObject and then parse it. Anyway, in this case I'm using a regular expression to grab the label information, as sketched below.
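To make that label-grabbing step concrete, here is a minimal sketch of pulling a section heading out of one JSON line with a regular expression. The field name "section_name" and the JSON layout are assumptions for illustration only; the actual nytimes_news_json.txt may use a different key, and in the real code you could equally read the field from a JSONObject with a JSON library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionLabelExtractor {

    // Hypothetical pattern: the real field name and layout in
    // nytimes_news_json.txt may differ; treat this as an illustration.
    private static final Pattern SECTION_PATTERN =
            Pattern.compile("\"section_name\"\\s*:\\s*\"([^\"]+)\"");

    // Returns the section heading found in one JSON line, or null if none.
    public static String extractSection(String jsonLine) {
        Matcher m = SECTION_PATTERN.matcher(jsonLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "{\"headline\":\"Example story\",\"section_name\":\"Sports\"}";
        System.out.println(extractSection(line));   // prints: Sports
    }
}
```

The regular expression works here because each article sits on a single line, but it can break on escaped quotes inside the field, which is why a real JSON parser is the safer option.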
Yeah, I already explained getting the TF-IDF value for each feature. Then here, what you do is divide the dataset into train and test. You split the data at 90%, which means you're using 90% of the data for training and the other 10% for testing. After that, you build the logistic regression classification model and then call it. So you basically train the classifier, build the model, and serialize it to the local file system. Then you load it, create the LingPipe logistic regression classifier, and simply predict the class label of a particular news article.

So let me execute this. It processes a hundred documents; the left column is the label, the section heading, which is the class label for each document. Then it preprocesses the hundred documents; of course, if you have a million records it takes a while. TF-IDF computation is also heavily constrained by your computer's memory size, so when you're dealing with a high-dimensional feature space you're probably going to run into memory problems. Here it is extracting features, which means extracting the set of terms and their values, and then you basically train the classifier, which is the logistic regression. It takes some time depending on the size of the training data; if it's more than several hundred thousand documents, it takes at least a couple of hours depending on the memory size of your computer.

So, let's go back, yeah. What happens here is it builds a model. Then you simply load the trained model into memory, and at the end it prints, for each document, the actual and predicted class values for that particular document. If the predicted and actual values are the same, that means it predicted correctly; if the actual and predicted values are different, that means it gave a wrong prediction. Based on this information you can simply calculate accuracy, or precision and recall measures.
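Since the 90/10 split and the accuracy measure are the two moving parts here, a minimal sketch of both follows. The class and method names are my own, and the label lists are toy stand-ins for what the trained LingPipe classifier would actually predict on the held-out 10%.

```java
import java.util.Arrays;
import java.util.List;

public class SplitAndAccuracyDemo {

    // Index where training documents end and test documents begin.
    static int splitPoint(int totalDocs, double trainFraction) {
        return (int) (totalDocs * trainFraction);
    }

    // Accuracy = correctly predicted documents / all test documents.
    static double accuracy(List<String> actual, List<String> predicted) {
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    public static void main(String[] args) {
        // With 100 news articles and a 0.9 fraction, documents 0-89 train
        // the classifier and documents 90-99 are held out for testing.
        int cut = splitPoint(100, 0.9);
        System.out.println("Training docs: 0.." + (cut - 1) + ", test docs: " + cut + "..99");

        // Toy actual vs. predicted section labels for the test portion,
        // standing in for the classifier's output on the held-out 10%.
        List<String> actual    = Arrays.asList("Sports", "World", "Arts", "Sports");
        List<String> predicted = Arrays.asList("Sports", "World", "Sports", "Sports");
        System.out.println("Accuracy: " + accuracy(actual, predicted)); // prints 0.75
    }
}
```

The same counting of matches between the actual and predicted columns is all the printed output at the end of the run gives you, so computing accuracy (or precision and recall per section) is just a matter of tallying those lines.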