In this hands-on lecture, I will talk about how to assign tf-idf values to tokenized terms. To do that, let's create a Java file called TFIDFMain.java and place it in the edu.answer.test.main package. In the main function of TFIDFMain.java, we need to either add a throws clause or write a try-catch block; I'll simply throw the exception as part of the signature of the main function. This is because we handle text file input and output, so some exceptions need to be thrown for that to work.

As far as data is concerned, we are going to use New York Times news articles. This data is found in the yTextMiner path, under the data folder, then corpus, then the text file. Same as before, we're going to use a Scanner object, and when you use a Scanner, you need to close it at the end because you opened a file. For tf-idf to calculate properly, it needs many documents, so we need to read multiple news articles from our text data.

To handle multiple documents, we're going to use the Collection object. The Collection class, which is in util (double-click on it to see), is basically an extension of ArrayList, so essentially it is an array. The constructor of the Collection object takes a list of strings as the documents. As you see here, you create a document and then add the created document by using the add function. For demonstration purposes, let's limit the number of documents to ten or some other small number, so that we can immediately confirm the result.

The next block of code, from line 114 to 137, is where you parse the text file, grab each sentence, and build the document collection. You simply parse the text and insert the pieces into documents. What is documents? It is a list of strings. After that, you instantiate the Collection object by passing one argument, the list of strings. Then you preprocess, same as before, with Stanford CoreNLP: you just kick off the Stanford CoreNLP pipeline.

Once you have preprocessed the text data with Stanford CoreNLP, you simply call the Collection's getTFIDF method. Let's go there by clicking Open Declaration. Inside this simple function, getTFIDF instantiates a TFIDF object and returns it. So what is TFIDF? Let's go there. If you look at it, the TFIDF constructor does the majority of the work: preprocessing and calculating. Remember the formula of tf-idf, which is tf multiplied by idf; it is calculated inside the constructor of the TFIDF object. That's how this logic works in yTextMiner.

Once tf-idf is calculated, the TFIDF object contains the tf-idf value of each term for a given document. What you do then is create a two-dimensional array of double. This two-dimensional array of double is basically a document-term matrix: each column is a vocabulary term and each row is a document. After that, you simply print out what kind of columns we have. Inside the for loop, this means we only print out the first 100 columns, that is, the first 100 unique vocabulary terms. These 100 terms can also be called 100 columns, 100 dimensions, or 100 features; the words are used interchangeably, so use whichever fits you best. The short sketches below illustrate these steps.
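Here is a minimal sketch of the setup described above; the package and class names follow the walkthrough, but the exact corpus file name is an assumption for illustration (the real file sits under yTextMiner's data/corpus folder).

    package edu.answer.test.main;

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class TFIDFMain {
        // Throwing Exception from main spares us a try-catch around the file I/O.
        public static void main(String[] args) throws Exception {
            List<String> documents = new ArrayList<String>();
            // "nytimes.txt" is a hypothetical file name standing in for the real corpus file.
            Scanner input = new Scanner(new File("data/corpus/nytimes.txt"));
            while (input.hasNextLine() && documents.size() < 10) { // limit to ten documents
                String line = input.nextLine().trim();
                if (!line.isEmpty()) {
                    documents.add(line); // treat each non-empty line as one document
                }
            }
            input.close(); // close the Scanner because it holds an open file
            System.out.println("Loaded " + documents.size() + " documents.");
        }
    }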
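From there, the flow through yTextMiner looks roughly like this. The method names follow what the lecture mentions (a constructor taking the list of strings, a CoreNLP preprocessing step, and getTFIDF), but treat the exact signatures as my assumptions rather than the library's documented API.

    // Sketch of the flow, under the assumptions stated above.
    Collection collection = new Collection(documents); // constructor takes the List<String>
    collection.preprocess();                           // kicks off the Stanford CoreNLP pipeline
    TFIDF tfidf = collection.getTFIDF();               // the TFIDF constructor computes tf x idf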
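For reference, the formula is usually written as follows, where N is the total number of documents and df(t) is the number of documents containing term t; the log-scaled idf shown here is one common variant, and the lecture does not specify which weighting yTextMiner uses:

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}

Note that if a term does not occur in a document at all, its tf is zero, so its tf-idf is zero as well; that is exactly why the printed matrix below contains zeros.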
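The printing step at the end then looks roughly like the following; the getMatrix accessor and the variable names are my guesses for illustration, not necessarily the real yTextMiner names.

    // Print the document-term matrix: i walks the rows (documents),
    // j walks the columns (terms), capped at the first 100 columns.
    double[][] matrix = tfidf.getMatrix(); // hypothetical accessor for the matrix
    int columnsToPrint = Math.min(100, matrix[0].length);
    for (int i = 0; i < matrix.length; i++) {      // row position: document
        for (int j = 0; j < columnsToPrint; j++) { // column position: term
            System.out.print(matrix[i][j] + "\t");
        }
        System.out.println();
    }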
You can see what kind of terms are contained in the vocabulary. Then, as you see down below, we have two for loops, which means we go through this two-dimensional array, the matrix, and print it out per column and per document. The first index, i, is the document position, which is the row; the second index, j, is the term position, which is the column. You simply print out this document-term matrix. So all this program does is preprocess the data using Stanford CoreNLP, apply the tf-idf transformation, and print out the matrix of the document collection.

Let's simply execute this: I right-click on TFIDFMain.java, select Run As, and select Java Application. As you see, we have ten documents. The red log information comes from Stanford CoreNLP, which means preprocessing is running on top of Stanford CoreNLP; it preprocesses document zero, one, and so on up to nine, which means we have ten documents to process. After this phase is done, as you see, we have ten rows, which are the ten documents, and ten columns printed. That is only because I want to see what the matrix looks like; the number of columns can grow as large as the size of the vocabulary. Based on this information, the zeros mean, for example, that "Washington" has no tf-idf value in document one, while "month" has some small tf-idf value in document one, and so on and so forth. By using this information, we can apply this matrix to document classification or document clustering. For our next exercise, we are going to apply it to document classification.
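To give a taste of how this matrix feeds into clustering or classification, each row is one document's tf-idf feature vector, and documents are typically compared through those vectors. Below is a minimal cosine-similarity helper; it is my illustration of the general technique, not part of yTextMiner.

    // Cosine similarity between two tf-idf vectors, a standard way to
    // compare documents before clustering or classifying them.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

For example, cosine(matrix[0], matrix[1]) measures how similar documents zero and one are.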