In this hands-on lecture, I will discuss tokenization and lemmatization and walk through them using the code base yTextMiner. Suppose that you have already opened Eclipse, and through Eclipse, you have opened yTextMiner. Let's go to src. By clicking on the arrow on the left-hand side next to src, you can expand and collapse the tree. For this exercise, we need to look at the package called edu.yonsei.util. Under that package, there is a class called Token.java. Double-click on Token.java and you will see the Token class. We are going to use this Token.java for today's exercise. As you see, Token.java has several global variables: five String variables called token, lemma, pos, ner, and stem, and one boolean variable called isStop. There is one constructor, Token, and it accepts six arguments: token, lemma, pos, ner, stem, and isStop. By using this Token class, we are going to do tokenization and lemmatization. As you see, token is the token itself. Lemma is the lemmatized form of the token. Pos is the part-of-speech tag of the token. Ner is the named entity type of the given token. Stem is the stem form of the given token. And isStop is a boolean variable indicating whether the token is a stop word or not. Aside from the several get and set methods implemented in Token.java, there is one particular method called initializeTheTokenData, which is used during preprocessing. The preprocessing calls the CoreNLPPreprocess class, which I implemented in this yTextMiner package. CoreNLPPreprocess.java is implemented in the package called edu.yonsei.preprocess. We'll talk about this later on. For now, what you need to do is go to edu.yonsei.test.main, then right-click on the main package > New > Class. For the name of this new Java class, type in NormalizationMain. In my case, I have already implemented this NormalizationMain.java class, so I'm going to cancel, but you need to type NormalizationMain and then click on the Finish button.
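From the description above, we can sketch roughly what Token.java looks like. This is a reconstruction for illustration only: the exact field order, modifiers, and method bodies in yTextMiner may differ.

```java
// Hypothetical sketch of yTextMiner's Token class, reconstructed from the
// lecture description; the actual implementation may differ in details.
public class Token {
    // five String fields plus one boolean, as described in the lecture
    private String token;   // the surface form itself
    private String lemma;   // lemmatized form of the token
    private String pos;     // part-of-speech tag
    private String ner;     // named-entity type
    private String stem;    // stemmed form
    private boolean isStop; // whether the token is a stop word

    // one constructor taking all six values
    public Token(String token, String lemma, String pos,
                 String ner, String stem, boolean isStop) {
        this.token = token;
        this.lemma = lemma;
        this.pos = pos;
        this.ner = ner;
        this.stem = stem;
        this.isStop = isStop;
    }

    // getters (setters omitted for brevity)
    public String getToken() { return token; }
    public String getLemma() { return lemma; }
    public String getPos()   { return pos; }
    public String getNer()   { return ner; }
    public String getStem()  { return stem; }
    public boolean isStop()  { return isStop; }

    public static void main(String[] args) {
        Token t = new Token("running", "run", "VBG", "O", "run", false);
        System.out.println(t.getToken() + " -> " + t.getLemma());
    }
}
```

The getters are what we will call later in the exercise: getToken for tokenization and getLemma for lemmatization.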
So assuming that you have NormalizationMain.java already created under the main package, you will have only a skeleton NormalizationMain class at this point. What we're going to do is implement one simple main function. The main function, as most of you may know, has a signature like this: it's public, it's static, its return type is void, the function name is main in all small letters, M-A-I-N, and it takes one argument of type String array, which you can name whatever you want, but the most conventional name is args. So you create one function called main. The next thing you need to do is find the input data, open it, and read the contents of the file. For this exercise, I'm going to use the Java class called Scanner, which is included in the JDK. I instantiate the Scanner and give it the variable name s. I instantiate the Scanner with new, and it takes one argument, a new FileReader object. The FileReader object requires one argument, the filename. For the filename, if you look under data, there is a corpus subfolder. In the corpus subfolder, you have twitter_stream.txt. This data was collected via Twitter's streaming API. Once you finish this line, let's say we only care about the first ten lines. What we need to do is go through each line of the Scanner object using a method called nextLine. When you call nextLine, it gives you the corresponding line of the file. Once you have the input line, you need to pass it to a Sentence object. Let's look at the Sentence object: right-click on Sentence and select Open Declaration. It takes you to Sentence.java, which is implemented in edu.yonsei.util.
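Putting the steps above together, here is a minimal sketch of the skeleton so far. The file path follows the lecture's description of the project layout, and the helper method name readFirstLines is my own; the real NormalizationMain.java may be structured differently.

```java
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class NormalizationMain {

    // Read at most maxLines lines from the given file using Scanner,
    // as described in the lecture. (Helper name is hypothetical.)
    static List<String> readFirstLines(String filename, int maxLines) throws IOException {
        List<String> lines = new ArrayList<>();
        Scanner s = new Scanner(new FileReader(filename));
        while (s.hasNextLine() && lines.size() < maxLines) {
            lines.add(s.nextLine());
        }
        s.close();
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Path from the lecture's project layout; adjust to your checkout.
        for (String line : readFirstLines("data/corpus/twitter_stream.txt", 10)) {
            System.out.println(line);
            // In the full exercise, each line would be wrapped in a
            // Sentence object and preprocessed here.
        }
    }
}
```

The comment inside the loop marks where the Sentence object and the preprocess call go in the next step.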
So this Sentence object, basically what it does is contain each sentence or each line, and you do further analysis from that point on. So let's go back to NormalizationMain.java. Once you instantiate the Sentence object by supplying each line, what you do is call preprocess. Okay, for preprocess, let's go to its declaration in the same manner, by clicking on preprocess and selecting Open Declaration. It takes you to the preprocess method. At the very beginning of this lab session, I showed you Token.java. Just like Token.java, Sentence.java has a function called preprocess. What it does is instantiate CoreNLPPreprocess and then call the preprocess function of the CoreNLPPreprocess Java class. Let's go to that function by clicking Open Declaration. CoreNLPPreprocess.java instantiates a Stanford CoreNLP Sentence by passing in each line that you just read from the input file, and once it creates this Stanford CoreNLP Sentence object, the next line calls the other preprocess of CoreNLPPreprocess.java, which is right below the first preprocess. So this is function overloading, as you know: you have the same function name, but with different sets of arguments. In this case, the first argument, Sentence, is the class we implemented, and the second argument is the Stanford CoreNLP Sentence. Inside this preprocess function, what it does is tokenize and create our yTextMiner Tokens in a for loop. This for loop basically tokenizes the sentence and creates a Token object for each token. All right, let's go back to NormalizationMain.java. So you create the Sentence object and you call preprocess. And then all you need to do after that is simply print out what's in the Sentence object and what's in each Token object. Okay, so here you print out each iteration, and you print out each sentence, and after you apply tokenization, what you have is a Sentence that holds a set of Tokens. Okay?
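To make the overloading pattern concrete, here is a heavily simplified, self-contained sketch. In the real CoreNLPPreprocess, the second argument is a Stanford CoreNLP Sentence and the tokenizer comes from CoreNLP; here a plain String and a whitespace split stand in for both, and the Sentence and Token classes are minimal stand-ins, so treat this only as an illustration of the call structure, not as yTextMiner's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for yTextMiner's Sentence and Token classes.
class Token {
    private final String token;
    Token(String token) { this.token = token; }
    public String getToken() { return token; }
}

class Sentence {
    private final String text;
    private final List<Token> tokens = new ArrayList<>();
    Sentence(String text) { this.text = text; }
    public String getText() { return text; }
    public List<Token> getTokens() { return tokens; }
    public void addToken(Token t) { tokens.add(t); }
}

public class CoreNLPPreprocess {
    // First overload: takes only our Sentence. In yTextMiner this would
    // build a Stanford CoreNLP sentence; here a String stands in for it.
    public void preprocess(Sentence sentence) {
        String stanfordSentence = sentence.getText(); // stand-in object
        preprocess(sentence, stanfordSentence);       // overloaded call
    }

    // Second overload: same name, different argument list (overloading).
    // A whitespace split stands in for the CoreNLP tokenizer.
    public void preprocess(Sentence sentence, String stanfordSentence) {
        for (String tok : stanfordSentence.split("\\s+")) {
            sentence.addToken(new Token(tok)); // create a yTextMiner Token
        }
    }

    public static void main(String[] args) {
        Sentence s = new Sentence("Hello tokenization world");
        new CoreNLPPreprocess().preprocess(s);
        for (Token t : s.getTokens()) System.out.println(t.getToken());
    }
}
```

The point is the shape of the code: the one-argument preprocess builds the CoreNLP representation and delegates to the two-argument preprocess, which fills the Sentence with Tokens.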
A set of tokens: you have a set of tokens, and you use a for loop to iterate through the sentence and print out each Token object's getToken. For this tokenization task, you call getToken; for other tasks, you'll call other functions provided in the Token object. I will do other examples in the next lab session. So once you finish this line of code, you close the for loop, you close the Scanner, and you close the main function. Now, what you have to do next is save this file. You can do this either by selecting File and then Save, or with the equivalent shortcut Ctrl+S, okay. So click on this, and save what you have developed so far, okay? So now what you need to do is, under edu.yonsei.test.main, you have NormalizationMain.java. So right-click on NormalizationMain.java, then select Run As, and then select Java Application. It prints out some logging information, okay. It loads several pipeline models, and then it prints out each sentence, okay. And for each sentence, it applies tokenization and prints out each token. Okay, one thing I'd like to share with you is, since Stanford CoreNLP has huge models, huge model JAR files, you will probably run into a problem. The problem is called an OutOfMemoryError. The way to overcome this is as follows. Right-click on NormalizationMain.java, select Run As, and this time do not select Java Application; instead, select Run Configurations. In this pop-up window, select the Arguments tab. In the second half, you see VM arguments; those are the Java virtual machine arguments. In this box, you type a dash, a capital letter X, a small letter m, and a small letter x, followed by the heap size; let's say we give it 1300 megabytes, which is about 1.3 gigabytes, so -Xmx1300m. After this, you click on Apply and then you click on Run. As you see, it will execute the program.
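For those running outside Eclipse, the same heap setting can be passed directly to the JVM on the command line. The classpath entries below are hypothetical and depend on where your yTextMiner build output and Stanford CoreNLP JARs live.

```shell
# Give the JVM up to 1300 MB of heap (-Xmx1300m) to avoid OutOfMemoryError
# when the Stanford CoreNLP models are loaded. The classpath here is a
# placeholder; point it at your own yTextMiner build and CoreNLP JARs.
java -Xmx1300m \
     -cp "bin:lib/*" \
     edu.yonsei.test.main.NormalizationMain
```

This is exactly what Eclipse does under the hood when you put -Xmx1300m in the VM arguments box of the Run Configuration.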
But some of you may not be able to do this because your computer doesn't have enough memory. In that case, you need to adjust the memory size. But the bottom line is that Stanford CoreNLP 3.6 requires a huge amount of memory, so you need to find the right computer to execute this sample as well as yTextMiner. Now, let's move on to lemmatization. It's very simple. By simply calling getLemma, which I implemented in Token.java, you will be able to see the lemma form of a given term. So let's say, where you call getToken on the Token object, you simply copy that line and place the copy right underneath. Okay, let's look at the other methods of Token. As you see, you can call getLemma, getNer, getPos, getStem, getToken, and so on and so forth. But instead of calling getToken, you will call getLemma; by simply replacing getToken with getLemma, you will be able to see the lemma form of the token. After this change, you save your Java file. Then this time you can either use the Run button on the top toolbar, or right-click on NormalizationMain and, from the drop-down menu, select Run As > Java Application. Let's select NormalizationMain from the top menu. As you see, the output shows some examples of the lemma form. As I told you before, you can also print the other information that Token.java has already obtained, such as getPos, getStem, and so on and so forth. By doing this, you will become familiar with what yTextMiner can do as far as normalization is concerned.
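The change described above amounts to a loop like the following. The Token class here is a minimal stand-in with hard-coded lemmas, since in yTextMiner the lemma values are filled in by the Stanford CoreNLP pipeline; the point is simply swapping getToken for getLemma in the print statement.

```java
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for yTextMiner's Token; in the real project the lemma
// is produced by the Stanford CoreNLP pipeline, not hard-coded.
class Token {
    private final String token;
    private final String lemma;
    Token(String token, String lemma) { this.token = token; this.lemma = lemma; }
    public String getToken() { return token; }
    public String getLemma() { return lemma; }
}

public class LemmaDemo {
    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("cats", "cat"),
            new Token("running", "run"),
            new Token("better", "good"));

        // Same loop as in the tokenization exercise, but now printing
        // getLemma() alongside getToken().
        for (Token t : tokens) {
            System.out.println(t.getToken() + " -> " + t.getLemma());
        }
    }
}
```

In the same way, you could print t.getPos() or t.getStem() to inspect the other normalized forms the Token already holds.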