Prior to data processing and analysis, we need to select data. In real practice, data often consist of many columns and rows, and in Python we can usually select the most representative columns and rows with only one or two functions. Amazing, isn't it? Let's see how they work in detail.

Data reduction is an essential step in data preprocessing. Datasets we meet in practice may have hundreds of dimensions, namely attributes or features. High dimensionality leads to the so-called curse of dimensionality: data samples become sparse (to put it simply, "sparse" means there are lots of zeros in the samples), distance calculations become difficult, and so on. So we need to reduce the dimensions, also known as feature reduction, to alleviate this problem. Apart from the dimensions, the number of records contained in a dataset may also be quite large, so we often need to select some of the data from the dataset as well, which is known as numerosity reduction. Feature reduction and numerosity reduction are the two parts of data reduction. Reducing features and numerosity helps us acquire a reduced representation that is much smaller than the raw dataset. Of course, that is the ideal: whether we select part of the attributes or part of the records, we hope to stay close to the completeness of the original data and, when mining the reduced dataset, to achieve almost the same analysis result.

Common methods for feature reduction include forward selection, backward deletion, decision trees, and PCA. Forward selection starts with an empty attribute set, and the currently optimal attribute is added at each step; it stops when no optimal attribute can be selected or a certain threshold constraint is satisfied. Backward deletion is just the opposite: it deletes the worst attribute from the current attribute set at each step, and it stops when nothing more can be deleted or a certain threshold is satisfied (a minimal code sketch of both appears just after this overview). The decision tree is a dedicated machine learning algorithm. PCA (principal components analysis) is the most frequently used method of linear dimensionality reduction. Through a linear projection, it maps high-dimensional data to a low-dimensional representation, choosing the projection so that the data variance along the projected dimensions is as large as possible. In this way fewer dimensions, i.e., principal components, are selected, unimportant components of the data description are ignored, and the characteristics of the raw data are retained as much as possible.

Numerosity reduction consists of parametric and non-parametric methods. A parametric method uses a model to evaluate the data, so only the parameters, rather than the actual data, need to be stored; common ones include regression and the log-linear model. A non-parametric method needs to store actual data; common methods include the histogram, clustering, and sampling. With consideration to the foundation of our course, in this part we'll discuss the PCA algorithm for attribute reduction, plus the histogram and sampling methods among the non-parametric numerosity reduction methods.

First, let's look at the PCA (principal components analysis) method. You may explore the detailed mathematical computation steps on your own; here we focus on how to use Python functions to realize the PCA algorithm for dimensionality reduction. In Python, the PCA() function in the sklearn.decomposition module may be used to reduce attributes.
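Before we turn to those operations, here is a minimal sketch of forward selection and backward deletion as just described. It assumes scikit-learn's SequentialFeatureSelector is available; the estimator (LinearRegression) and the diabetes dataset are only stand-ins for whatever model and data you actually work with.

# Forward selection and backward deletion, sketched with scikit-learn.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)   # stand-in data: 10 numeric features

# Forward selection: start from an empty set and greedily add the best attribute.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward"
).fit(X, y)
print(forward.get_support())    # boolean mask of the selected attributes

# Backward deletion: start from the full set and greedily remove the worst attribute.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward"
).fit(X, y)
print(backward.get_support())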
Let's look at the actual operations. First, import the PCA() function. Here we still use the Boston Housing Price dataset: first normalize the data without the target attribute, and then call the PCA() function. The most commonly used argument of PCA() is n_components, which sets the number of principal components retained by the algorithm. It is None by default, i.e., all the features are retained. If its value is set to 2, only 2 features are retained; if its value is set to 'mle', the algorithm will automatically select the number of features satisfying the required variance percentage. Here we set the number of principal components to be retained to 5 and then train the PCA model on the data X with the fit() method. Another essential attribute is explained_variance_ratio_, which returns the variance percentage of each component, i.e., the variance contribution of each variable; the bigger the variance percentage, the greater the weight of that component. Let's calculate it. The sum is 0.8073, which means the first 5 components account for 80.73% of the variation in the data. Conversely, if we discover that the cumulative contribution of fewer components than we selected is already very high, we may reset n_components to that smaller, more reasonable number and recalculate. For example, if n_components is set to 5 but the calculation shows that 3 features alone account for 99% of the variation in the data, then the argument may be set to 3. Of course, we may also set n_components to 'mle' so that the algorithm selects the principal components automatically. Have a try. As we see, 12 of the 13 features are selected. Calculate the cumulative contribution: 0.995.
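Here is a minimal sketch of these PCA steps, assuming scikit-learn is installed. The wine dataset (which also has 13 numeric features) stands in for the Boston Housing data, since that dataset is no longer bundled with recent scikit-learn releases, so the exact numbers will differ from those quoted above; StandardScaler is shown as one way to normalize the features.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Normalize the 13 feature columns (no target attribute involved).
X = StandardScaler().fit_transform(load_wine().data)

# Retain 5 principal components and train the model.
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)         # variance percentage of each component
print(pca.explained_variance_ratio_.sum())   # cumulative contribution of the 5 components

# Let the algorithm pick the number of components automatically.
pca_mle = PCA(n_components='mle')
pca_mle.fit(X)
print(pca_mle.n_components_)                     # how many components were kept
print(pca_mle.explained_variance_ratio_.sum())   # their cumulative contribution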
Really very strict! Having introduced attribute reduction, let's look at numerosity reduction. First, look at the histogram. A histogram looks quite like a bar chart, yet the concepts are different: a histogram is essentially binning, which we introduced when discussing the discretization of continuous attributes. First, let's use a group of data to plot a histogram and understand its essence from the plot, say, based on 50 random integers in [1, 10), like this. Let's do the actual operations: generate the 50 random integers first and then, based on this group of data, use the hist() function in the pyplot module to generate a histogram. Observe the generated histogram and guess the values and meanings of the x-axis and y-axis, looking at them with reference to the data. Have you noticed that each bin, aka bucket, in the histogram actually represents an attribute-frequency pair? For example, one bin here represents the attribute 9, and the corresponding bin height, i.e., the y value, is 3, indicating that 9 appears 3 times in the dataset. Data reduction with histograms means reducing the number of bins in the plot from the number of observed values, n, to some smaller k, so as to represent the data in sections. Furthermore, the hist() function has, like the cut() and qcut() functions in pandas we saw before, a "bins" argument whose meaning is similar to what we mentioned before. For example, if we want to put the data into 2 bins, we may set the value of "bins" like this: use the linspace() function to find the 3 split points of the 2 bins, from the minimum value to the maximum value, endpoints included. Output "bins" and have a look: its value is [1, 5, 9]. If this value is passed to the "bins" argument of the hist() function, how should we read it? Like this: based on the 3 points (1, 5, and 9), the range is split into two intervals, [1, 5) and [5, 9]; the bins are left-closed and right-open, except that the last bin also includes its stop value. Of course, if the split points can be determined directly, we may simply write out such a list ourselves. We re-plot the histogram with the new "bins" argument value, also setting the width and edge color of the bars. As we see, the data are split into two bins: around 27 to 28 records have values 1 through 4, and around 22 to 23 have values 5 through 9. Let's compare them. You may also try putting the data into, say, 3 bins. (These histogram steps are consolidated in the short sketch below, before the hands-on sampling.)

Sampling is another highly common method of numerosity reduction. It randomly collects samples from the raw dataset to form a subset, so as to reduce the data size. Common approaches include random sampling, cluster sampling, and stratified sampling. Random sampling is further divided into sampling without replacement and sampling with replacement; the concept is the same as sampling with or without replacement from the numpy array we introduced before. Stratified sampling refers to dividing a dataset into non-overlapping parts, i.e., strata, and then conducting random sampling on each stratum to obtain the result. Let's use the Iris dataset to discuss random sampling and stratified sampling of a DataFrame object. Look at random sampling first. The sample() method of the DataFrame object conveniently realizes random sampling. If its "replace" argument is set to True, the sampling is with replacement and the same record may be selected more than once; "replace" is False by default, meaning without replacement, so the same record will not be selected twice.
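Before the hands-on sampling, here is the promised sketch of the histogram steps, assuming numpy and matplotlib are installed. The integers are drawn with randint(1, 10), i.e., values 1 through 9, which matches the [1, 5, 9] bin edges quoted above.

import numpy as np
import matplotlib.pyplot as plt

data = np.random.randint(1, 10, 50)   # 50 random integers from 1 to 9

plt.hist(data)                        # default histogram: roughly one bin per value
plt.show()

# Reduce the data to 2 bins: 3 evenly spaced edges from min to max,
# e.g. [1, 5, 9], giving the intervals [1, 5) and [5, 9].
bins = np.linspace(data.min(), data.max(), 3)
print(bins)

plt.figure()
plt.hist(data, bins=bins, rwidth=0.9, edgecolor='black')   # bar width and edge color
plt.show()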
Let's try it in practice. First, generate the data: convert the Iris data into a DataFrame and use the sample() method of the DataFrame object. Two arguments are often used in random sampling. One is "n", which specifies the number of records to sample; for example, to sample 10 records without replacement, just write it like this. It's also possible to use the "frac" argument to specify the fraction of the dataset to sample; for example, to sample 30% of the data, write it like this, giving 45 records in total. To conduct sampling with replacement, just set the "replace" argument to True. Let's try sampling 10 records with replacement; written this way, the same record may be sampled more than once. Then, look at stratified sampling. The first step of stratified sampling is stratification, and the subsequent sampling may still be conducted with the sample() method. For example, what if we want to extract 30% of the subset whose category is 0 (i.e., the value of iris.target is 0) from the Iris dataset, 15 records in all? First, we add the target attribute to iris_df to acquire the needed data, and then sample. Many approaches are available; a simple one relies on the filtering capability of the DataFrame object, which can select the 50 records whose iris_df.target equals 0. Then just use the sample() method. Quite easy, right? If we go on to extract 20% of the data whose category value is 1, does it work like this? Yes, 10 records in total. What if we want to combine the sampling results of the two strata? So easy: suppose we assign the result of the first stratum to variable A and the result of the second stratum to variable B; we can use the append() method to return the combined data. The DataFrame object also has an append() method, just like the list. Isn't that surprising? (These sampling calls are consolidated in the short sketch at the end of this part.) In later lessons, we have dedicated sections introducing these more advanced data processing methods. In this part, we've introduced feature reduction and numerosity reduction. Have you found that the functions and methods in Python are really fantastic, simple and easy to use? Life is short; let's use Python!
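As promised, here is a minimal sketch consolidating the sampling steps above, assuming pandas and scikit-learn are installed. One note: DataFrame.append() was removed in pandas 2.0, so the combination step is shown with pd.concat(), the current equivalent of the append() call used in the lesson.

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target                 # category column, used for stratifying

# Random sampling without replacement: 10 records, then 30% of the data (45 records).
print(iris_df.sample(n=10))
print(iris_df.sample(frac=0.3))

# Random sampling with replacement: the same record may be drawn more than once.
print(iris_df.sample(n=10, replace=True))

# Stratified sampling: filter each stratum, then sample it.
a = iris_df[iris_df.target == 0].sample(frac=0.3)   # 30% of the 50 class-0 records -> 15
b = iris_df[iris_df.target == 1].sample(frac=0.2)   # 20% of the 50 class-1 records -> 10
combined = pd.concat([a, b])                        # modern replacement for append()
print(len(combined))                                # 25 records in total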