In this hands-on lecture, I will discuss stop words and stemming, and we will look at them through the code base of the text miner. If you recall, there are several utility Java classes for this exercise, which I mentioned last session. One of them is Token.java, so why don't you open it by double-clicking on Token.java? You will see the constructors and some functions provided in Token.java.
Stemming is handled by a Porter-stemmer-based stemming class, so let's look at the Porter class located in the preprocess package. In the lecture notes, I covered the Porter stemmer. This Porter class is a Java implementation of the Porter stemming algorithm. We do not use it directly; it is used by the CoreNLPPreprocess Java class, through what we call the preprocess method.
Inside the CoreNLPPreprocess Java class there are several preprocess functions, and inside one of them there is one line of code that calls Porter. If you right-click on stripAffixes and select Open Declaration, it will take you to the Porter stemmer class and its stripAffixes function. stripAffixes is based on the several rules defined in this Porter stemmer; it simply implements the Porter stemmer algorithm in Java. After cleaning and then going through six steps, the original form of the word is reduced to the stem form of the word.
So you call preprocess by passing a Sentence object to the preprocess function of CoreNLPPreprocess.java. Inside this function, it calls another preprocess function, which is private, and in there it calls the Porter stemming algorithm on the original text. The original text is stemmed and then passed to the Token object, so the Token object gets the stemmed form of the original token. So let's go back to Token.java.
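The call chain described above (a public preprocess that delegates to a private preprocess, which calls the stemmer on the original text and stores the result on the token) can be sketched roughly as follows. This is only an illustration under stated assumptions: ChainToken, ChainStemmer, ChainPreprocess, and CallChainSketch are hypothetical stand-ins, not the course's Token, Porter, or CoreNLPPreprocess classes, and the stripAffixes shown here is a toy rule, not the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the course's Token class: it keeps the original
// text and receives the stem during preprocessing.
class ChainToken {
    private final String original;
    private String stem;

    ChainToken(String original) { this.original = original; }

    String getToken() { return original; }
    String getStem() { return stem; }
    void setStem(String stem) { this.stem = stem; }
}

// Toy stand-in for the Porter class. This is NOT the Porter algorithm: it
// only lowercases the word and strips a trailing plural "s".
class ChainStemmer {
    String stripAffixes(String word) {
        String w = word.toLowerCase();
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }
}

// Hypothetical stand-in for CoreNLPPreprocess: a public entry point that
// delegates to a private overload, which is where the stemmer is called.
class ChainPreprocess {
    private final ChainStemmer stemmer = new ChainStemmer();

    // Public entry point: split on whitespace and preprocess every token.
    List<ChainToken> preprocess(String sentence) {
        List<ChainToken> tokens = new ArrayList<>();
        for (String piece : sentence.trim().split("\\s+")) {
            if (piece.isEmpty()) continue; // empty input yields one empty piece
            ChainToken token = new ChainToken(piece);
            preprocess(token);
            tokens.add(token);
        }
        return tokens;
    }

    // Private step: stem the original text and hand the stem to the token.
    private void preprocess(ChainToken token) {
        token.setStem(stemmer.stripAffixes(token.getToken()));
    }
}

class CallChainSketch {
    public static void main(String[] args) {
        // Prints "Yellow -> yellow" and then "hashtags -> hashtag"
        for (ChainToken t : new ChainPreprocess().preprocess("Yellow hashtags")) {
            System.out.println(t.getToken() + " -> " + t.getStem());
        }
    }
}
```

The point of the sketch is the shape of the delegation: callers only see the public preprocess, while the stemmer call stays hidden in the private one, so the stemmer could be swapped without touching the calling code.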
In Token.java, when you call preprocess of the Token class, it uses CoreNLPPreprocess, and the next line is the preprocess function of CoreNLPPreprocess. So now let's create one simple main function. By now, you should be able to create a very simple main function for this exercise. Let's call this one StemAndStopMain.
So let's create that simple test driver class in the main package. Right-click on the main package and select New, then select Class, and then type StemAndStopMain. In my case I already created this main class, so I see an error, but in your case you won't see this error. By simply clicking the Finish button, you will have the skeleton of StemAndStopMain.java.
The structure is very simple, the same as before. I instantiate a Scanner and pass it the sample data set, in this case data collected by the stream API. For the first ten lines, I create a Sentence object as before, and then the Sentence object's preprocess function is called and executed. After that, we call getStem on the token, which gives you the stem form of the token. Let's say I also want to see the original form of the token; then on the token I select getToken, which returns the original form. Then let's save this.
Up to here this is very similar to the previous example, so you should be able to follow it easily. Once you have the sentence tokenized and preprocessed, you can get each token from the sentence. The sentence holds a number of tokenized terms along with their stem forms and whether each is a stop word or not, and you will be able to print out all of that valuable information.
After that, let's call isStopword. For the same set of tokens, we are going to examine whether each token is a stop word or not by calling the token's isStopword function. If the token is a stop word, you print stop word; if not, you simply print not stop word.
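As a rough, self-contained sketch of the driver just described, the following reads up to ten lines with a Scanner, preprocesses each one, and reports the original token, its stem, and whether it is a stop word. Everything here is an assumption for illustration: DriverToken, DriverSentence, and StemAndStopSketch are hypothetical stand-ins for the course classes, the stop word list is a tiny made-up one, the "stem" is just the lowercased token, and the sample text is invented rather than taken from the course data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

// Hypothetical stand-in for the course's Token class.
class DriverToken {
    private final String token;
    private final String stem;
    private final boolean stopword;

    DriverToken(String token, String stem, boolean stopword) {
        this.token = token;
        this.stem = stem;
        this.stopword = stopword;
    }

    String getToken() { return token; }
    String getStem() { return stem; }
    boolean isStopword() { return stopword; }
}

// Hypothetical stand-in for the course's Sentence class.
class DriverSentence {
    // Tiny illustrative stop word list (an assumption, not the course's list).
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "and", "of", "to"));

    private final String text;
    private final List<DriverToken> tokens = new ArrayList<>();

    DriverSentence(String text) { this.text = text; }

    // Whitespace tokenization; lowercasing stands in for real stemming.
    void preprocess() {
        tokens.clear();
        for (String piece : text.trim().split("\\s+")) {
            if (piece.isEmpty()) continue;
            String normalized = piece.toLowerCase();
            tokens.add(new DriverToken(piece, normalized, STOPWORDS.contains(normalized)));
        }
    }

    List<DriverToken> getTokens() { return tokens; }
}

class StemAndStopSketch {
    // Processes at most maxLines lines and returns one report line per token.
    static List<String> describe(String data, int maxLines) {
        List<String> report = new ArrayList<>();
        try (Scanner scanner = new Scanner(data)) {
            int lines = 0;
            while (scanner.hasNextLine() && lines < maxLines) {
                DriverSentence sentence = new DriverSentence(scanner.nextLine());
                sentence.preprocess();
                for (DriverToken t : sentence.getTokens()) {
                    // Note the explicit spaces between the printed fields.
                    report.add(t.getToken() + " " + t.getStem() + " "
                            + (t.isStopword() ? "stop word" : "not stop word"));
                }
                lines++;
            }
        }
        return report;
    }

    public static void main(String[] args) {
        String sample = "Yellow is the song\nColdplay and Beyonce";
        for (String line : describe(sample, 10)) {
            System.out.println(line);
        }
    }
}
```

Returning the report lines from describe rather than printing inside the loop is a design choice made here so the output can be checked; the driver in the lecture simply prints as it goes.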
Having done this, you simply right-click on StemAndStopMain.java, select Run As, and then Java Application. You will see the same logging messages, and after the end of the logging messages you will see the tokenized words. So, let's say, Yellow is not a stop word. This hashtag is not a stop word, Coldplay is not a stop word, Beyonce is not a stop word, and so on. This way you check whether each tokenized term is a stop word or not. After this, you are probably going to create a bag of words after removing stop words and after stemming or lemmatization, and then you will do other text mining tasks such as clustering, classification, and so on.
So that was stop words. The other one is stemming. Stemming is very simple, as I explained before: you get the resulting token in its stemmed form. Let's look at the output after stemming: 44 is 44, and Yellow is yellow. We intentionally make this word lowercase; the original form, Yellow, starts with a capital Y. The reason is that we want to normalize the words to minimize the variation of terms. In this case, help is the stem for help. I should have included a white space between the two, but I did not, so that is why you see the two terms concatenated. But let's say they are separated: come is the stem form and coming is the original form, which means the suffix ing is replaced with e.
So once you are familiar with this stop word and stemming preprocessing stage, you will be much more comfortable dealing with unstructured text data.
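To make the coming to come example concrete, here is a minimal sketch that covers only the -ing branch of step 1b of the Porter algorithm, plus the lowercasing mentioned above. It is not the full stemmer and not the course's Porter class (whose internals are not shown in this lecture); IngRuleSketch and its method names are hypothetical. The rule removes ing when the remaining stem contains a vowel, and then restores an e for short stems that end in consonant, vowel, consonant, which is why coming becomes come rather than com.

```java
class IngRuleSketch {
    // Porter's definition: a letter is a consonant unless it is a, e, i, o, u,
    // or a 'y' that directly follows a consonant.
    static boolean isConsonant(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return false;
        if (c == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    // The measure m: the number of vowel-to-consonant transitions in the word.
    static int measure(String w) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < w.length(); i++) {
            boolean vowel = !isConsonant(w, i);
            if (!vowel && prevVowel) m++;
            prevVowel = vowel;
        }
        return m;
    }

    static boolean containsVowel(String w) {
        for (int i = 0; i < w.length(); i++) {
            if (!isConsonant(w, i)) return true;
        }
        return false;
    }

    // True when the word ends consonant-vowel-consonant and the final
    // consonant is not w, x, or y (Porter's *o condition).
    static boolean endsCvc(String w) {
        int n = w.length();
        if (n < 3) return false;
        return isConsonant(w, n - 3) && !isConsonant(w, n - 2) && isConsonant(w, n - 1)
                && "wxy".indexOf(w.charAt(n - 1)) < 0;
    }

    // Applies only the -ing part of Porter step 1b to a lowercased word.
    static String stemIng(String word) {
        String w = word.toLowerCase();
        if (!w.endsWith("ing")) return w;
        String stem = w.substring(0, w.length() - 3);
        if (!containsVowel(stem)) return w; // e.g. "sing" is left alone
        if (stem.endsWith("at") || stem.endsWith("bl") || stem.endsWith("iz")) {
            return stem + "e";
        }
        int n = stem.length();
        boolean doubleConsonant = n >= 2 && stem.charAt(n - 1) == stem.charAt(n - 2)
                && isConsonant(stem, n - 1);
        if (doubleConsonant && "lsz".indexOf(stem.charAt(n - 1)) < 0) {
            return stem.substring(0, n - 1); // "hopp" becomes "hop"
        }
        if (measure(stem) == 1 && endsCvc(stem)) {
            return stem + "e"; // "com" becomes "come"
        }
        return stem;
    }

    public static void main(String[] args) {
        System.out.println(stemIng("coming"));  // come
        System.out.println(stemIng("Helping")); // help
        System.out.println(stemIng("Yellow"));  // yellow
    }
}
```

A full Porter stemmer runs several further steps after this one, so for other words its final output can differ from what this single rule produces.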