Data processing can be roughly divided into four steps: data collection, data exploration and preprocessing, data analysis and mining, and result evaluation and presentation. Data collection, or acquisition, is usually the first step. You should still remember that we have already talked about how to acquire local data and network data; there, we acquired information on the corporate stocks in the Dow Jones Industrial Average. Actually, that was rather troublesome, as you'll see after watching this lecture.

With the development of the demands and technology for data analysis and mining, the four steps of data processing have taken on more connotations and capabilities. From data acquisition to analysis and presentation, a great deal of work, such as data processing, analysis and mining, needs to be done for decision-making and evaluation, and the steps involved often go by different names. In terms of methodology, common statistical analysis methods may be adopted, and machine learning models may also be utilized for data mining.

In this module, we'll follow the four steps as the main line of the lectures. To be specific, I'll discuss convenient and fast data acquisition; the basics of plotting with Python, which is a necessary means for data exploration and display; the methods used in several important links of data preprocessing; data exploration and statistical analysis of data; the methods of model building and evaluation for machine learning, with KMeans clustering analysis as an example; and application examples of Python in the humanities, social sciences, science and engineering, which will be discussed in two chapters.

In this part, we'll first discuss common ways and methods of convenient and fast data acquisition. Previously, we introduced the use of functions like open(), read(), write() and close() to open, read and close local files, and we also introduced how to use the Requests third-party library, the Beautiful Soup library and the regular expression module to acquire and parse network data. For example, on the two websites, based on the current website structure, we used the requests and re modules to scrape the source code of the webpages and wrote appropriate regular expressions to parse out the data we wanted.

Let's demonstrate the programs. The first program uses the get() function in the requests module, together with a regular expression and the findall() function in the re module, to acquire from the first website the basic data of the Dow Jones Industrial Average stocks and finally store them in a DataFrame. Let's run the program. This is the result: the data include the company codes, names and the latest trading prices.

Next, look at the second program. It acquires from the second website the historical stock data of one company for the most recent year, say American Express here, whose stock code is AXP. Similarly, we use the requests and re modules. Let's execute it. This is the result of the execution. Again, let's convert it into a DataFrame. It includes quite a few data columns: count them, 1, 2, 3, 4, 5, 6, six columns in all, while the raw data have seven columns. That's because we've also used the drop() method of the DataFrame object to delete a column we don't need. The drop() method is safer than deleting a column directly with del, since it doesn't modify the raw data in place, and it's more frequently used in practice. Moreover, if a certain day is a dividend payment day, the record of that day includes a "type" attribute whose value is DIVIDEND. Such records have no trading information we need, so we don't collect these special records; you may output one and have a look. Besides, we also use a statement to reverse the order of the records so that the latest data come first.
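To make these steps concrete, here is a minimal sketch of the workflow we just walked through. The URL, the regular expression and the small sample of records are placeholders I've made up for illustration, since the real ones depend entirely on the structure of the websites; the pandas calls, drop(), boolean filtering and reverse slicing, show one common way to do the cleanup described above, not necessarily the exact statements used in the demo programs.

# A minimal sketch of the scrape-and-clean workflow described above.
# The URL, the regular expression and the sample records are made up.
import re
import requests
import pandas as pd

url = 'https://example.com/djia'                 # hypothetical address
page = requests.get(url, timeout=10).text        # source code of the webpage
pattern = r'<td>([A-Z]+)</td><td>(.*?)</td><td>([\d.]+)</td>'   # made-up markup
djidf = pd.DataFrame(re.findall(pattern, page),
                     columns=['code', 'name', 'lasttrade'])

# A tiny made-up stand-in for the historical records of one stock (AXP).
quotesdf = pd.DataFrame({
    'date':  ['2019-07-01', '2019-07-02', '2019-07-03'],
    'close': [124.3, 125.1, 0.39],
    'type':  ['', '', 'DIVIDEND'],   # the dividend-payment record
    'extra': [1, 2, 3],              # a column we don't need
})

quotesdf = quotesdf.drop('extra', axis=1)            # safer than: del quotesdf['extra']
quotesdf = quotesdf[quotesdf['type'] != 'DIVIDEND']  # skip the special records
quotesdf = quotesdf[::-1]                            # latest data first
print(quotesdf)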
As for the results acquired by the two programs, i.e., the two DataFrames, we call them djidf and quotesdf respectively. Here is part of the records of the two groups of data.

Apart from this method of scraping webpages and parsing their contents, is there any more convenient and faster way? For example, can we easily, conveniently and rapidly acquire the historical data of corporate stocks from finance websites? Yes, we can. Some websites provide data for direct downloading, or for downloading through an Application Programming Interface (API).

First, let's focus on downloading data directly from a website. For example, we've found a finance website from which data can be downloaded: there's a "Download Data" link. What's downloaded is often a csv file or a json-format file. We've already talked about json files, so what's the csv format? It's a kind of plain text file that uses commas to separate its values, and it's often used to store tabular data. If we open such a file in Notepad, we can see the commas between the data; by default, though, a csv file is opened with Excel.

Now, let's have a look at how a csv data file is downloaded from this website. After making the necessary selections on the website, such as the time range and the frequency of the data, click "Apply" and then "Download Data" to download. This is the historical stock data of American Express Company that we downloaded from the website. It contains many columns of data; since the comma is used as the separator in a csv file, the columns split naturally when the file is opened in Excel. Then, how can we convert these data into a DataFrame? Let's have a look.
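First, though, if you want to convince yourself that the downloaded file really is just plain text with commas, a couple of lines of Python will show it; the filename and the column names in the comment are only assumptions about what the website delivers.

# Peek at the downloaded file as plain text (the filename is an assumption).
with open('AXP.csv') as f:
    for line in f.readlines()[:3]:
        print(line.strip())

# The output might look roughly like this (values invented for illustration):
# Date,Open,High,Low,Close,Adj Close,Volume
# 2019-07-01,124.05,124.83,123.76,124.29,121.50,2268700
# 2019-07-02,124.36,125.20,124.00,125.11,122.30,1974300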
Quite easy. We may use the read_csv() function in pandas to very conveniently create a DataFrame from a csv file. Have a look. Quite convenient, isn't it? And if the default value of the sep argument, i.e., the separator between the data, which is the comma, is changed to some other symbol, the pd.read_csv() function can read other types of text files as well, so it's used quite frequently in daily practice. The other arguments may be viewed through help(pd.read_csv), which gives a lot of help information; as we can see, the first part lists many arguments that make the function easy to use. For instance, the index_col argument designates a certain column of the file as the index column of the DataFrame.

What if we want to write a DataFrame object into a csv file? Just use the to_csv() method of the DataFrame object and specify the file path and the filename in the parentheses. Let's think further: what if we want to read the data in an Excel worksheet into a DataFrame? I'm sure you've got it already: just use the pd.read_excel() function and set the sheet_name argument to 0, 1, 2, 3, etc. to specify the worksheet to be read. And to write a DataFrame object into an Excel file, naturally, use its to_excel() method. Similarly, you can use help(DataFrame.to_excel) to view how these functions and methods are used, like this.

What we've just talked about is downloading data from finance websites, and many other websites offer similar facilities. Let's look at the famous Kaggle website. Among the datasets on Kaggle, a wide variety of datasets on all kinds of topics are provided for downloading, like the open data of Airbnb in New York. You may give it a go.

Apart from downloading data directly and converting them into a DataFrame, we may also use the APIs of some websites to acquire data conveniently. Why do we say it's more convenient to acquire data through an API? Because what we acquire in this way is cleaned data rather than the source code of a webpage; as we know, the source code of a webpage needs further parsing before we get at the actual contents inside. For example, let's have a look at using the API functions in the two modules of pandas_datareader to acquire data from multiple Internet data sources. Pay attention that the sources may change, so please follow the list on the official website. For example, we may use the DataReader() API function in the pandas_datareader.data module to acquire, with the Stooq website as the source, the historical trading data of the stock of American Express Company over the past few years. The first 5 records are listed here.
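Here is a short sketch of the calls just mentioned. The filenames, the choice of the "Date" column as the index and the date range are assumptions for illustration; the pandas and pandas_datareader functions are the ones discussed above, but keep in mind that the list of working data sources changes over time.

# Reading and writing csv / Excel files with pandas (filenames are assumed).
import pandas as pd

quotesdf = pd.read_csv('AXP.csv', index_col='Date')   # use the Date column as the index
quotesdf.to_csv('AXP_copy.csv')                       # write a DataFrame back to csv

sheet = pd.read_excel('AXP.xlsx', sheet_name=0)       # read the first worksheet
sheet.to_excel('AXP_out.xlsx')                        # write a DataFrame to Excel

# Acquiring data through an API with pandas_datareader.
import datetime
import pandas_datareader.data as web

start = datetime.datetime(2019, 1, 1)
end = datetime.datetime(2019, 12, 31)
axp = web.DataReader('AXP', 'stooq', start, end)      # AXP history from the Stooq source
print(axp.head())                                     # the first 5 records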
Isn't it quite convenient? The well-known tushare third-party library in China also provides similar facilities. On the Internet, quite a few other free or paid APIs let you acquire data in many fields conveniently and quickly; when in need, you may explore them. By the way, if a website provides an API for developers, we may well try that route, since it's easier and more convenient than the scraping and parsing we talked about before. However, acquiring data directly through an API sometimes has downsides too: the API may not provide all the data we want, it may be troublesome to acquire a large amount of data, and some APIs charge high fees. We should decide which way to go based on our needs and the specific situation.

Besides, for data mining and exploration we may also directly use the datasets or corpora embedded in some modules. For example, let's look at two classic modules: sklearn and nltk.

First, sklearn. sklearn is the most famous machine learning package in Python, and it provides some standard datasets. These datasets are all classics, and we often use them for tests and for learning how to build models. Let's have a look at how to utilize the sklearn module to acquire the classic Iris dataset. First, import the datasets module from the sklearn package. View this module through dir(datasets); as you can see, these functions load the classic datasets. Acquire the Iris data through the load_iris() function; the Iris data are stored in the ".data" attribute. The data contain size measurements of 150 iris flowers. Each record has four features, aka attributes: the sepal length (cm), the sepal width (cm), the petal length (cm) and the petal width (cm). Print iris or iris.feature_names and we'll see the corresponding feature names. The species labels are stored in the ".target" attribute. There are 3 species in total, namely setosa, versicolor and virginica, with 50 records for each species, and the class names are expressed as 0, 1 and 2. For the complete information of the Iris data, directly input iris to view it. Similar classic datasets, like the Boston housing price data, may be acquired in the same way.

Then, let's look at the famous natural language toolkit of Python, NLTK, which contains some classic corpora like Gutenberg, Brown and Reuters, as well as dictionaries like WordNet. Take the Gutenberg corpus as an example: it contains a small portion of the texts from the electronic documents of Project Gutenberg. We might as well have a look at Project Gutenberg. This is its official website, which contains tens of thousands of electronic books; here we may click in, and there are different ways of classification. NLTK only includes a small portion of them.

Now, some details about how to download the corpora of NLTK in Python. First, import the nltk module and execute the download() function; it opens the downloader, the NLTK Downloader. In the downloader, open the "Corpora" tab: these are all the downloadable corpora. Here, Gutenberg, for instance, says "installed", meaning it has been downloaded successfully. Select the corpus you want and just click "download". After downloading, the data are stored under a directory like this one, under your own user directory: there's an nltk_data directory, and among its subdirectories look at the folder "corpora", which contains the corpora I've downloaded. Now that the corpora of NLTK have been downloaded to the local machine, let's see how to load and use them.
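As a quick recap before we turn to using a corpus, here is a compact sketch of the two loading steps just described. Passing a corpus name to nltk.download() fetches that corpus directly into the nltk_data directory, while calling it with no argument opens the graphical downloader shown in the lecture; "gutenberg" is just one example of a corpus name.

# Loading the classic Iris dataset from sklearn.
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)        # (150, 4): 150 flowers, 4 features each
print(iris.feature_names)     # sepal/petal length and width, in cm
print(iris.target[:5])        # species encoded as 0, 1 and 2
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']

# Downloading an NLTK corpus; download() with no argument opens the
# graphical NLTK Downloader instead.
import nltk
nltk.download('gutenberg')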
To use the Gutenberg corpus, for instance, just do this. Then, with the functions provided by the module plus the functions supported by Python itself, we can conduct various statistical analyses. The fileids() function, say, lists the books collected in the corresponding corpus, which here is the Gutenberg corpus. Another example is a book we may all be familiar with, Hamlet by William Shakespeare: the words() function lists the words in this book. Very convenient, right? What if we want to load the Brown corpus? So easy: just change "gutenberg" into "brown". In later lessons, we'll introduce some application examples of NLTK in detail; if you're interested, you may download them and have a look first, as they all come with detailed explanations inside.

In this part, we've introduced many ways to acquire data conveniently and quickly. Are there any other ways? You may think about them.
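For reference, here is a minimal sketch of the corpus access we just demonstrated; it assumes the Gutenberg and Brown corpora have already been downloaded with nltk.download().

# Accessing the Gutenberg and Brown corpora (assumes they are already downloaded).
from nltk.corpus import gutenberg, brown

print(gutenberg.fileids())                          # the books collected in the corpus
hamlet = gutenberg.words('shakespeare-hamlet.txt')  # the words of Hamlet
print(len(hamlet), hamlet[:10])

print(brown.categories()[:5])                       # the Brown corpus works the same way
print(brown.words()[:10])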