In this screencast we will try to find the most distinctive words in the most famous books of Charles Dickens. You will learn two things: how to preprocess documents and how to calculate tf-idf values. I borrowed several ideas from Julia Silge and her amazing work on tidy text. You can read more about her work on this website, and it is a great resource for learning R and tidy text.

So, let's start with this step: let's load the necessary packages into the current R session. We'll need these packages. If you don't have any of these packages, you need to install them with install.packages. Library tidytext, library SnowballC, library udpipe, and library gutenbergr. So, we have loaded these packages.

Now, let's find the following books in the gutenberg_metadata data set within the gutenbergr package. These books are of interest. We need to extract the values of gutenberg_id from gutenberg_metadata with the pull function for these books and assign these values to the dickens_books_id variable, keeping gutenberg_id only for those cases where has_text is TRUE and where the books are written in English.

Now, gutenberg_metadata: this is the data set that has all the books we are interested in. It consists of nearly 52,000 rows, that is, 52,000 books, and eight columns: gutenberg_id, the column that will actually help us download the content of these books; the title of the book; its author; the author ID; the language of the book; the bookshelf within Gutenberg; the rights to this book; and a logical vector showing whether Gutenberg has text for this book or not.

Now, we need these books. So, let's do it like this: filter, title %in% c, and let's create a vector of these books' titles. Yeah, one more, and let's add this book also. And after that, let's add this one: Oliver Twist, then also Bleak House, David Copperfield.
And Great Expectations. Now let's see what this code returns. It returns a data set of 20 rows and eight columns: language, rights, has_text. And if we look at the second page, as you can see, most of these observations on the second page are copyrighted, and Gutenberg doesn't have text for these books, and one of these books is actually written in French, which is not helpful for us. So, let's add a couple of conditions. Let's also add language equal to English, and let's also add has_text, and now we have ten observations: ten books, all of them written in English, and we have text for all of them.

So, the next thing we need to do is pull the gutenberg_id column. So, let's use pull, gutenberg_id. And we need to assign these values to the dickens_books_id name. So, now we have a vector of ten elements.

Now let's download these books with gutenberg_download, with their titles included, and assign the result to dickens_books. Let's use the help to see how we can add titles and how we can actually download these books. So, gutenberg_download helps you download one or more works by their Project Gutenberg IDs into a data frame with one row per line per work. All right, it has a mirror argument, which lets you specify where you want to download the books from. And it also has a meta_fields argument, which lets you specify which fields you want to add from gutenberg_metadata describing each book, like, say, title.

So, we know what we want to do; now let's just do it. We need to use gutenberg_download on dickens_books_id with meta_fields equal to title. However, I will add one more argument, the mirror argument. Why is that? Because when I tried to download with the default mirror, I had several problems, so I will add this mirror. If this mirror doesn't work for you, you can see the mirrors here. And let's see, yes.
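
The steps so far can be sketched roughly as follows. This is only a sketch under a few assumptions: dplyr is loaded for filter(), pull() and the pipe (the recording does not name it), the title vector contains only the titles that can be made out in the recording (the on-screen list may be longer, and the exact spellings in gutenberg_metadata are assumed), the language column is assumed to store the code "en" for English, and the mirror URL is just an example that may not work for you. It needs an internet connection for the download.

```r
library(dplyr)       # assumed: provides filter(), pull() and %>%
library(gutenbergr)  # provides gutenberg_metadata and gutenberg_download()

# Titles that can be made out in the recording; the real list may be longer,
# and the spellings are assumed to match gutenberg_metadata exactly.
dickens_titles <- c(
  "A Tale of Two Cities", "Oliver Twist", "Bleak House",
  "David Copperfield", "Great Expectations"
)

# Keep only English editions that Gutenberg has text for, then pull the IDs.
dickens_books_id <- gutenberg_metadata %>%
  filter(
    title %in% dickens_titles,
    language == "en",  # assumption: English is stored as the code "en"
    has_text
  ) %>%
  pull(gutenberg_id)

# Download the books with the title attached to every line.
# The mirror below is an example only (an assumption); use one that works for you.
dickens_books <- gutenberg_download(
  dickens_books_id,
  meta_fields = "title",
  mirror = "http://mirrors.xmission.com/gutenberg/"
)
```

If the download fails, leaving out the mirror argument falls back to the package default, which is the setting that caused problems in the recording.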
We assign the result to dickens_books. Let's wait until the books are downloaded. We have downloaded the books; now let's see what this data set looks like. It has 300 thousand observations. It has, as you can see here, text, gutenberg_id, and title, and yes, we can see here the lines of the book.

If we use the View function on this data set, you will see one more thing that we cannot see with the basic output of R Markdown: several of these rows have empty elements, like the second row, the fourth row, etc. How do we solve this problem? Well, it is quite simple. All we need to do is take dickens_books and filter where text is not equal to the empty string. And then let's use View on the result, and now we don't have any of these rows with empty strings. So, we can delete this part with the View function and overwrite the dickens_books data set.

Now, how many rows are there per book? Arrange the output by the number of rows in descending order. You already know how to do it. We have to group by title, and then we need to count how many rows there are, so let's use summarise and call the result number_of_lines. And this is the result: we have the number of lines. However, we need to arrange the output by the number of rows in descending order. So, let's do it: arrange by number_of_lines in descending order. And now, as you can see, Bleak House has the largest number of lines, and A Tale of Two Cities has only 12 thousand lines. All right.
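
The cleaning and counting steps above can be tried on a tiny stand-in data set. This is a minimal sketch: the tibble below is hypothetical and only mimics the three columns of the downloaded data (gutenberg_id, text, title) with placeholder lines, and the name lines_per_book is introduced here for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the downloaded data: same columns, placeholder text.
dickens_books <- tibble(
  gutenberg_id = c(98, 98, 98, 1023, 1023, 1023, 1023),
  text = c("first line", "", "second line",
           "first line", "", "second line", "third line"),
  title = c(rep("A Tale of Two Cities", 3), rep("Bleak House", 4))
)

# Drop the rows whose text is an empty string and overwrite the data set.
dickens_books <- dickens_books %>%
  filter(text != "")

# Count the remaining lines per book, largest first.
lines_per_book <- dickens_books %>%
  group_by(title) %>%
  summarise(number_of_lines = n()) %>%
  arrange(desc(number_of_lines))

lines_per_book
```

On the stand-in data this leaves five rows, with Bleak House first (3 lines) and A Tale of Two Cities second (2 lines). The same pipeline applies unchanged to the real dickens_books data set.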