In this video, we are going to talk about basic NLP tasks and introduce you to NLTK. So what is NLTK? NLTK stands for Natural Language Toolkit. It is an open-source library in Python, and we're going to use it extensively in this video and the next. The advantage of NLTK is that it has support for most NLP tasks and also provides access to numerous text corpora. So let's set it up. We first bring NLTK in with the import statement, import nltk, and then we can download the text corpora using nltk.download(). It's going to take a little while, but once it comes back you can issue a command like from nltk.book import *, and it's going to show you the corpora that it has downloaded and made available. You can see that there are nine text corpora. Text1 stands for Moby Dick, text2 is Sense and Sensibility, you have a Wall Street Journal corpus in text7, you have some Personals in text8 and a Chat Corpus in text5. So it is quite diverse here. As I said, text1 is Moby Dick. If you look at the sentences, using sents(), it will show you one sentence each from these nine text corpora, and sentence one, "Call me Ishmael.", is from text1. And then, if you look at how sentence one is stored, sent1, you see that it has four tokens: Call, me, Ishmael, and then the full stop.

Now that we have access to multiple text corpora, we can look at counting the vocabulary of words. Text7, if you recall, was the Wall Street Journal, and sent7, which is one sentence from text7, is this: [ 'Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.' ]. These words have already been parsed out, so each comma is a separate token and the full stop is a separate token. So the length of sent7 is the number of tokens in this sentence, and that's 18. But if you look at the length of text7, that is the entire text corpus, you'll see that the Wall Street Journal corpus has 100,676 tokens. It's clear that not all of these are unique. We can see in the previous example that the comma appears twice and the full stop is there, and words such as "the" and "a" are so frequent that they take up a good chunk of this 100,000-token count. So if you count the unique words using len(set(text7)), you'll get 12,408. That means the Wall Street Journal corpus really has only about 12,400 unique words even though it is a 100,000-token corpus.

Now that we know how to count words, let's look at these words and understand how to get their individual frequencies. If you want to type out the first 10 words from this set, the first 10 unique words, you'd say list(set(text7))[:10]. That would give you the first 10 words, and in this corpus the first 10 words in the set are 'Mortimer' and 'foul' and 'heights' and 'four' and so on. You'll notice that there is this 'u' and a quote before each word. Do you recall what it stands for? You'd recall from the previous videos that 'u' here indicates a Unicode string, so each token is represented as a Unicode string. Now, if you have to find out the frequency of words, you're going to use a frequency distribution, FreqDist. You create this frequency distribution from text7, that is the Wall Street Journal corpus, store it in a variable called dist, and then you can start pulling statistics out of this data structure. So len(dist) will give you 12,408; these are the unique words in this Wall Street Journal corpus.
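To make these commands concrete, here is a minimal sketch of the counting steps just described, assuming nltk is installed and the book collection has already been fetched with nltk.download(); the counts in the comments are the ones quoted in the video.

```python
import nltk
from nltk import FreqDist
from nltk.book import *        # loads text1..text9 and sent1..sent9

print(sent1)                   # ['Call', 'me', 'Ishmael', '.']  -- four tokens
print(len(sent7))              # 18 tokens in the Wall Street Journal sentence
print(len(text7))              # 100676 tokens in the whole corpus
print(len(set(text7)))         # 12408 unique tokens

print(list(set(text7))[:10])   # first 10 unique words (set order is arbitrary)

dist = FreqDist(text7)         # frequency distribution over the corpus
print(len(dist))               # 12408 -- one entry per unique word
```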
Then you have dist.keys(), which gives you the actual words, and that would be your vocab1. And then if you take the first 10 words of vocab1, you will get the same 10 words as we saw at the top of the slide. And then, if you want to find out how many times a particular word occurs, you can say, "Give me the count of the word 'four' from this distribution," and you'll get the response 20. That means that in this Wall Street Journal corpus, 'four' appears 20 times.

What if you want to find out how many times a particular word occurs and also put a condition on the length of the word? So if you have to find frequent words, and you say that you'll call a word frequent if it is more than five characters long and occurs more than a hundred times, then you can use a list comprehension: w for w in vocab1 if len(w) > 5 and dist[w] > 100. Then you'll get the list of words that satisfy both conditions, and you see that million and market and president and trading are the words that satisfy this. Why did we have a restriction on the length of the word? Because if you don't, then words like "the" or the comma or the full stop are going to be very, very frequent; they occur more than 100 times, and they would come up as frequent words. So this is one way to say, "Oh, you know, the really interesting words are the ones that are fairly long, more than five characters, and occur fairly often." There are, of course, other ways to do that.

Now, let's look at the next task. We know how to count words and how to find unique words. The next task becomes normalizing and stemming words. What does that mean? Normalization is when you transform words so that they are represented, and counted, the same way even though they look very different on the surface. For example, there might be different forms in which the same word occurs. Let's take this example of input1 that has the word "list" in different forms: you have it capitalized, you have it plural (lists), you have listing and listings, and you have listed as a verb in the past tense, and so on. So the first thing you would want to do is to lowercase them. Why? Because you don't want to distinguish the capitalized List from the lowercase list. So lower() would bring it all to lowercase, and then if you split it on space, you'll get five words: list, listed, lists, listing, and listings. So that was normalization.

Then comes stemming. Stemming is finding the root word, or root form, of any given word. There are multiple algorithms to do stemming. One that is quite popular and widely used is the Porter stemmer, and NLTK gives you access to it. nltk.PorterStemmer() creates a stemmer, and we call it porter. Then, if you stem each word using the Porter stemmer, you will get the word list for all of them. So no matter whether it is list or listed or listing, it still gives you the stem as list. This is advantageous because you can now count the frequency of list as the word list occurring by itself or in any of its derived forms, any of its morphological variants. Do you want to do it that way? That's a call that you have to make. You may really want to distinguish list and listing, which have slightly different meanings, so you probably would not want to merge those, but you may want list and lists to be merged together and counted as one word. So it is a matter of choice here. The Porter stemmer has a particular algorithm to do it, and it just makes all of these words the same word: list.
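Here is a minimal sketch of the frequent-word filter, the lowercasing, and the Porter stemming steps described above, recomputing dist and vocab1 so the snippet stands on its own:

```python
import nltk
from nltk import FreqDist
from nltk.book import *                   # provides text7 (Wall Street Journal)

dist = FreqDist(text7)
vocab1 = dist.keys()

# Words longer than five characters that occur more than 100 times
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
print(freqwords)                          # e.g. 'million', 'market', 'president', 'trading', ...

# Normalization: lowercase, then split on space
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
print(words1)                             # ['list', 'listed', 'lists', 'listing', 'listings']

# Stemming with the Porter stemmer
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in words1])   # ['list', 'list', 'list', 'list', 'list']
```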
A slight variant of stemming is lemmatization. Lemmatization is where you want the words that come out to be actual, meaningful words. Let's take an example. NLTK has the Universal Declaration of Human Rights as one of its corpora. So if you say nltk.corpus.udhr.words('English-Latin1'), that is the Universal Declaration of Human Rights in its English, Latin-1 encoded version, this will give you the entire declaration as a variable, udhr. If you print out the first 20 words, you'll see "Universal Declaration of Human Rights", then there is a "Preamble", and then it starts: "Whereas recognition of the inherent dignity and of the equal and inalienable rights of" and so on. So it continues that way.

Now, if you use the Porter stemmer on these words and get the stemmed version, you'll see that it strips off common suffixes. So Universal became univers, without even an e at the end, Declaration became declar, of is of and Human is the same, Rights became right, and so on. But now you see that univers and declar are not really valid words. Lemmatization would do that kind of stemming, but keep the resulting terms valid words. It is sometimes useful because you want to normalize somehow, but normalize to something that is also meaningful. So we could use something like the WordNet lemmatizer that NLTK provides. You have nltk.WordNetLemmatizer(), and then if you lemmatize the words from the set you've been looking at so far, what you get is "Universal Declaration of Human Rights", "Preamble", "Whereas recognition of the inherent dignity", and so on; basically all of these words are valid. How do you know the lemmatizer has worked? If you look at the first string up there and then the last string down here, rights has changed to right, so it has lemmatized it. But you will also notice that the fifth word here, Rights in "Universal Declaration of Human Rights", was not lemmatized, because with a capital R it is treated as a different word, and that one was not changed to right. If you had it in lowercase, then rights would become right again, okay? So there are rules for why something was lemmatized and something else was kept as is.

Once we have handled stemming and lemmatization, let us take a step back and look at the tokens themselves: the task of tokenizing text. Recall that we looked at how to split a sentence into words and tokens, and we said we could just split on space, right? So if you take a text string like this, text11 is "Children shouldn't drink a sugary drink before bed.", and you split on space, you'll get these words: Children, shouldn't as one word, drink, a, sugary, drink, before, but then unfortunately the full stop goes with bed, so it's "bed." as one piece. Okay, so you got one, two, three, four, five, six, seven, eight; you got eight words out of this sentence. But you can already see that this is not really doing a good job because, for example, it keeps the full stop attached to the word. So you could use NLTK's inbuilt tokenizer. The way to call it is nltk.word_tokenize, and you pass the string there, and you'll get this nicely tokenized sentence. In fact, it differs in two places. Not only is the full stop taken out as a separate token, but you will notice that shouldn't became should and "n't", where "n't" stands for "not". That is important in quite a few NLP tasks because you want to detect negation, and the way you would do it is to look for tokens that are a representation of not; "n't" is one such representation.
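Here is a minimal sketch of the stemming-versus-lemmatization comparison and the word tokenization just described, assuming the udhr, wordnet, and punkt data packages have been downloaded via nltk.download():

```python
import nltk

# The Universal Declaration of Human Rights (English, Latin-1 file) as a word list
udhr = nltk.corpus.udhr.words('English-Latin1')
print(udhr[:20])

# Porter stemming chops off suffixes, so some results are not valid words
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in udhr[:20]])      # e.g. 'univers', 'declar', ...

# The WordNet lemmatizer keeps the output as valid words
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in udhr[:20]])    # lowercase 'rights' -> 'right'; capitalized 'Rights' is left as is

# Tokenization: naive split on space versus NLTK's word tokenizer
text11 = "Children shouldn't drink a sugary drink before bed."
print(text11.split(' '))                        # 8 pieces; "shouldn't" is one piece and the full stop stays on "bed."
print(nltk.word_tokenize(text11))               # 10 tokens; "n't" and '.' become separate tokens
```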
But now you know that this particular sentence does not really have eight tokens but ten of them, because "n't" and the full stop come out as two new tokens. So we talked about tokenizing a particular sentence, the fact that punctuation marks have to be separated, that there are some special tokens like n-apostrophe-t that should also be separated, and so on. But there is an even more fundamental question: what is a sentence, and how do you know where the sentence boundaries are? The reason that is important is that you want to split sentences out of a long piece of text, right?

So suppose this example: text12 is "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence?", with a question mark, "Yes, it is!", with an exclamation. So already you know that a sentence can end with a full stop or a question mark or an exclamation mark and so on. But not all full stops end sentences. For example, "U.S.", which stands for US and is just one word, has two full stops, but neither of them ends a sentence. The same thing with $2.99; that full stop is part of a number, not the end of a sentence. We could use NLTK's inbuilt sentence splitter here, and if you say something like nltk.sent_tokenize, instead of word_tokenize it's sent_tokenize, and pass the string, it will give you the sentences; a short sketch of this is shown below. If you count the number of sentences in this particular case, we should have four. Yay! We got four. The sentences themselves are exactly what we expect: "This is the first sentence." is the first one, "A gallon of milk in the U.S. costs $2.99." is the second one, "Is this the third sentence?" is the third one, and "Yes, it is!" is the fourth one.

So, what did you learn here? NLTK is a widely used toolkit for text and natural language processing. It has quite a few very handy tools to tokenize text, split it into sentences, and go from there: lemmatize, stem, and so on. It gives access to many text corpora as well. These tasks of sentence splitting, tokenization, and lemmatization are quite important preprocessing tasks, and they are non-trivial. You cannot really write a regular expression in a trivial fashion and expect it to work well, and NLTK gives you access to the best algorithms, or at least the most suitable algorithms, for these tasks.
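Finally, here is a minimal sketch of the sentence-splitting example mentioned above, again assuming the punkt sentence model has been downloaded; the wording of text12 follows the video.

```python
import nltk

text12 = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
          "Is this the third sentence? Yes, it is!")

sentences = nltk.sent_tokenize(text12)
print(len(sentences))   # 4 -- the full stops in "U.S." and "$2.99" do not end sentences
for s in sentences:
    print(s)
```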