A simple data processing workflow can be roughly divided into four steps: data collection, data cleaning, data description and data analysis. These steps may sometimes overlap, and their order may also vary, but data collection, or acquisition, is usually the first step. You may still remember that, in previous sections, I discussed how to acquire local data and network data. There, we acquired the information of the 30 Dow Jones Industrial Average stocks. Actually, that was rather troublesome; you'll see easier ways after watching this lecture. Previously, we introduced functions like "open()", "read()", "write()" and "close()" to open, read, write and close local files, and we also introduced how to use the Requests third-party library, the Beautiful Soup library and the regular expression module to acquire network data, such as data from these two websites.

Let's demonstrate this program. It uses the "get()" method of the Requests library and the "findall()" function of the regular expression module to acquire the basic data of the Dow Jones Industrial Average stocks and, in the end, puts them into a DataFrame. Let's run the program. This is the result: it includes data such as company codes, long names and the latest trading prices. Next, look at the second program. It acquires the historical stock data of one company over the past year, say, here, American Express. Similarly, we use the "get()" method of the Requests library and the "findall()" function of the regular expression module. Let's run it. This is the returned result. Again, let's convert it into a DataFrame. It includes a lot of data: 1, 2, 3, 4, 5, 6, six columns in all. Here, we also deleted one of the columns; we'll talk about the specific method later. Note that these URL addresses might not be fixed; they are only the currently available ones. We call the two DataFrames "djidf" and "quotesdf", respectively.

Apart from this method of scraping and parsing web pages, is there any more convenient way? For example, can we easily, conveniently and rapidly acquire the historical stock data of a company from finance websites? Yes, we can. For instance, we've found that data can be downloaded from a finance website, and what is downloaded is often a csv file or a json file. We've already talked about the json format. Then what about csv? It is a plain text format that uses commas to separate values, often used to store tabular data, and by default it is opened with Excel. Now, let's have a look at the csv file downloaded from this website. This is the historical stock data of American Express that we downloaded. It contains many columns of data. Since the comma is the separator in a csv file, the columns split naturally when the file is opened with Excel. Then, how can we convert these data into a DataFrame? Let's have a look. Quite easy: we can use the "read_csv()" function in pandas to very conveniently create a DataFrame from a csv file. Let's execute it and see the result. Since there are many columns, they're displayed across several lines. Isn't that simple? We'll talk about the relevant functions in more detail in the later section on data access. Apart from converting directly acquired data into a DataFrame, we may also use the Web API of some websites to conveniently acquire data.
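Before moving on to Web APIs, here is a rough sketch of the two approaches just demonstrated. This is only an illustration under assumptions: the URL, the regular expression, the column labels and the csv file name ("AXP.csv") are placeholders, since, as noted above, the real finance-site addresses and page layouts are not fixed.

import re

import pandas as pd
import requests

# Scrape-and-parse sketch: fetch a page, pull rows out with a regular
# expression, then build a DataFrame. URL and pattern are placeholders.
url = "https://example.com/dow-jones-components"
html = requests.get(url).text
rows = re.findall(r'<td>(\w+)</td><td>([^<]+)</td><td>([\d.]+)</td>', html)
djidf = pd.DataFrame(rows, columns=["code", "name", "lasttrade"])

# Reading a downloaded csv file is simpler: pandas splits the
# comma-separated columns into a DataFrame directly.
quotesdf = pd.read_csv("AXP.csv")
print(quotesdf.head())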
Some individuals and third-party agencies also encapsulate these Web APIs to form new APIs. Well, why do we say it's more convenient to acquire data with a Web API? The reason is that in this way we get the data themselves rather than an HTML file. An HTML file, as we know, needs further parsing before we can reach the actual contents inside. Let's look at an example, the API of book.douban.com, to understand the effect of an API. Suppose we want to get some basic information about the book The Little Prince. Previously, we directly scraped its HTML page and then parsed it to acquire the contents. With the API, it becomes simpler. For example, we use the book API, say, to get the book information; then we only need to write "get()" like this. Look, the acquired result is not an HTML file but very clear data. How do we use it, specifically? Have a look. To acquire book information, we only need to use the "get()" method. At this position, as we see, "id" represents the ID of our book. Say the ID of our book is 1084336; we only need to replace this "id" in the URL with the specific ID. Let's try it. First, we import the Requests library, and then we build the URL: the front part is a fixed API address, and the following part identifies the book information we're going to get, "v2/book" followed by our "id". Let's check the result. Have you got it? The data are in json format. Quite convenient, right? No additional parsing is needed. Apart from book information, as we see, the API can provide much more. Detailed explanations and demo programs are provided on the website. In addition to the book API at Douban.com, plenty of other diverse APIs are also available for us to use.

Well, if a website provides a Web API for developers, we may try to use it, since this way is easier and more convenient than the scraping and parsing we talked about before. Of course, acquiring data directly with a Web API sometimes has downsides too. For example, it may not provide all the capabilities we need, and it might not be very convenient for acquiring large amounts of data. We should decide which way to go based on the specific situation.

Besides, it's also possible for us to use some corpora directly. One example is the well-known natural language toolkit of Python, NLTK, which includes the Gutenberg Corpus, the Reuters Corpus and the Brown Corpus, among others. Let's take the Gutenberg Corpus as an example. It contains a small portion of the texts from the electronic documents of Project Gutenberg. We might as well have a look at Project Gutenberg. This is its official website. It contains tens of thousands of electronic books. Here, we may click in and see that it holds various kinds of books, like history and children's history, which are further classified. NLTK only includes a small portion of them. Let's look in more detail at how to download the NLTK corpora in Python. First, we import the NLTK package, and then execute the download function. After executing NLTK's "download()" function, a downloader window opens. Here, we find the tab "Corpora". These items have already been installed; it says "installed" at the end of each line. Select the corpus you want to download and simply click "download". After downloading, the data are stored under this directory. As we see, abundant corpora are available. Since the NLTK corpora are downloaded to the local machine, they need to be loaded in first.
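Here is a minimal sketch of the Web API call described above. The base address follows the "fixed API address + v2/book + id" pattern from the lecture, and 1084336 is the ID mentioned for The Little Prince; as an assumption, note that the service may now require an API key or may no longer be publicly open.

import requests

# Fetch book information through the Douban book API; the response is
# JSON, so no HTML parsing is needed. The base address is an assumption.
url = "https://api.douban.com/v2/book/1084336"
book = requests.get(url).json()
print(book.get("title"), book.get("author"))

And this is roughly how the NLTK downloader described above is opened from Python; passing a corpus name such as "gutenberg" is an alternative that fetches a single corpus without opening the graphical downloader.

import nltk

nltk.download()             # opens the interactive downloader with the "Corpora" tab
nltk.download("gutenberg")  # or download one corpus directly by name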
If we'd like to load the Brown Corpus instead, it's quite simple: just change the corpus name to "brown" and import that corpus from NLTK. We can then use the functions provided inside to conduct various statistical analyses. The "fileids()" function, say, lets us view the list of Project Gutenberg books included in the Gutenberg Corpus, including some books we might be familiar with: Emma by Jane Austen and several works by William Shakespeare. This one, Hamlet, for example, is quite familiar to us. Besides, the "words()" function conveniently lists the words of a book. Very convenient, right? We'll explain some other uses of NLTK in detail in the next sections. If you're interested, you may download them and have a look first; detailed explanations are provided inside.
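A small sketch of the corpus functions mentioned above, assuming the Gutenberg and Brown corpora have already been downloaded; "austen-emma.txt" is the file identifier NLTK uses for Emma.

from nltk.corpus import brown, gutenberg

# List the Project Gutenberg texts included in NLTK's Gutenberg Corpus,
# e.g. 'austen-emma.txt' and 'shakespeare-hamlet.txt'.
print(gutenberg.fileids())

# words() returns the word list of one book; here, Emma by Jane Austen.
emma_words = gutenberg.words("austen-emma.txt")
print(len(emma_words))

# The Brown Corpus is used the same way once it is imported.
print(brown.categories())
print(brown.words(categories="news")[:10])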