Even after data cleansing is done well, we often need to browse the data during subsequent processing or when viewing the final results. Sometimes the result of a program turns out to be incorrect; at that point, you might keep wrestling with the program, feeling overwhelmed with sorrow and unable to fall asleep. To avoid such tragedies, let's waste no time in seeing what another common task of data preprocessing, data transformation, involves.

Data transformation is an essential step of data preprocessing. To put it simply, it transforms data into an appropriate form. Common approaches include normalization, discretization of continuous attributes, and feature binarization, as well as some attribute construction.

Let's look at data normalization first. Data often consist of multiple attributes; to put it simply, multiple attributes mean multiple data columns. Different attributes may have different dimensions, like centimeters for stature and kilograms for body weight, possibly with different value ranges. Such differences make the data incomparable, so we need to make all the attribute indexes dimensionless. After eliminating the influence of dimensions and value ranges, the data can then be fed into various algorithm modules and actual applications. Common normalization methods include Min-Max normalization (also known as dispersion normalization), Z-Score normalization (also known as zero-mean normalization or, as we usually call it, standardization), and decimal scaling normalization. Different types of models and applications often call for different normalization methods. For example, clustering measures similarity by distance, so Z-Score normalization will usually produce better performance there.

Next, let's take the Boston Housing Price dataset as an example to illustrate data normalization. Like the Iris dataset, the Boston Housing Price dataset is a classical machine learning dataset, mainly used for regression analysis. It records the median housing prices in suburban Boston in the mid-1970s, with 506 records in total, including details on the houses and their surroundings, such as the per capita crime rate by town (CRIM), the nitric oxide concentration (NOX), the average number of rooms per dwelling (RM), and so on: 13 feature attributes in total, plus 1 target attribute, the median value of house prices (MEDV).

Let's look at how to load it. Loading it works just like loading the Iris dataset: call the load_boston() function of datasets and assign the result to the variable "boston". Through its "data" attribute, we can view the detailed data records: 506 rows and 13 columns in total. The names of the 13 feature columns can be obtained through the object's "feature_names" attribute, and through the object's "target" attribute we can view the values of the target attribute, i.e., MEDV, the median value of house prices. Detailed information about these attributes can be seen by printing "boston" directly, which includes detailed introductions to the attributes.

To make observation easier, we select 3 features (NOX, RM, and AGE) for normalization. They represent the nitric oxide concentration, the average number of rooms per dwelling, and the proportion of owner-occupied units built prior to 1940. Let's construct a DataFrame for these 3 features; they are located in columns 4, 5, and 6 of the data array. First, import the "pandas" module. To keep it simple, we name this DataFrame "df": slice the array to take those columns and attach the column names, as shown in the sketch below.
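Here is a minimal sketch of the loading and construction steps just described, assuming an older scikit-learn release in which load_boston() is still available (it has been deprecated and removed in recent versions); the column positions 4, 5, and 6 are zero-based indexes into the data array.

    # Load the Boston Housing Price dataset and build a DataFrame
    # with the three features NOX, RM, and AGE
    import pandas as pd
    from sklearn import datasets

    boston = datasets.load_boston()      # requires an older scikit-learn (< 1.2)
    print(boston.data.shape)             # (506, 13)
    print(boston.feature_names)          # names of the 13 feature columns
    print(boston.target[:5])             # first few MEDV values

    # NOX, RM, and AGE sit in columns 4, 5, and 6 (zero-based) of boston.data
    df = pd.DataFrame(boston.data[:, 4:7], columns=['NOX', 'RM', 'AGE'])
    print(df.head())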
The df correctly contains the 3 columns we need, so the DataFrame is properly constructed. Next, based on it, let's carry out the various normalization treatments.

First, look at Min-Max normalization. Its formula is simple: take the value, subtract the minimum of the attribute the value belongs to, and divide by the difference between the maximum and the minimum of that attribute. Since the DataFrame object supports vectorized calculation, it is easy to write the code by hand: (df - df.min()) / (df.max() - df.min()), where the min() and max() methods obtain the minimum and maximum values, respectively. Looking at the first few records of the processed result, it's easy to see that the values all fall in the interval [0, 1]. This form of normalization is quite appropriate for scenarios where no distance measurement is involved. Its problems are that, if future values exceed the original min or max, they fall out of range and the transformation has to be redefined, and that if one value is extremely large, the normalized values of the rest will be squeezed together close to 0.

Apart from writing the code by hand, the "preprocessing" module in "sklearn" can also be used for normalization. The "preprocessing" module offers many powerful functions for all kinds of preprocessing tasks. For example, we can do Min-Max normalization by importing and using the minmax_scale() function from "preprocessing"; as we can see, the result is the same.

Next, look at Z-Score normalization. Its formula is: the difference between the value and the mean of the attribute it belongs to, divided by the standard deviation of that attribute. This is the most frequently used kind of normalization: the processed data have mean 0 and standard deviation 1. It's also easy to write by hand with this formula: (df - df.mean()) / df.std(). Execute it and we get the standardized result. Similarly, we can use functions in the "sklearn" preprocessing module; a convenient way is to call the scale() function directly.

Then, let's look at decimal scaling normalization. A common goal is to make the data fall into the interval [-1, 1], which is achieved by moving the position of the decimal point. The number of digits to move depends on the maximum absolute value of the attribute. Suppose the maximum absolute value of the AGE attribute is 90, which has 2 digits; then we only need to divide all the values of this attribute by 10 to the power of 2 (i.e., 100) so that the data fall into [-1, 1]. How do we write this formula? The key is to determine the number of digits to move, j. Think about it: can we use the log10() function in the "numpy" module to work out which power of 10 the maximum absolute value of the attribute corresponds to? Sure, but the value calculated this way may be a decimal, so we apply the rounding-up function ceil(); for example, 1.95 becomes 2 after processing with ceil(). Looking at the processed data, this column must have been divided by 10, and that column by 100. A combined sketch of the three normalization methods follows.
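This sketch applies the three normalization methods to the df built earlier; note that the hand-written Z-Score uses pandas' sample standard deviation (ddof=1), so its result can differ very slightly from sklearn's scale(), which uses the population standard deviation.

    import numpy as np
    from sklearn.preprocessing import minmax_scale, scale

    # Min-Max normalization: values fall into [0, 1]
    df_minmax = (df - df.min()) / (df.max() - df.min())
    df_minmax_sk = minmax_scale(df)            # same values, returned as an ndarray

    # Z-Score normalization: mean 0, standard deviation 1
    df_zscore = (df - df.mean()) / df.std()    # pandas std() uses ddof=1
    df_zscore_sk = scale(df)                   # uses the population std (ddof=0)

    # Decimal scaling normalization: divide each column by 10**j,
    # where j is the number of digits of the column's largest absolute value
    j = np.ceil(np.log10(df.abs().max()))
    df_decimal = df / (10 ** j)

    print(df_minmax.head())
    print(df_zscore.head())
    print(df_decimal.head())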
Next, let's look at the second common way of data transformation: discretization of continuous attributes. Common methods include binning and clustering.

First, the binning method. Apart from noise smoothing, binning can also be used to discretize continuous attributes, and it can be further divided into equal-width and equal-frequency binning. As we may guess from the names, the equal-width method divides the data range into intervals of equal width according to the chosen number of bins, while the equal-frequency method puts an equal number of records into each bin. Whichever method we adopt, the data falling into the same bin get the same label. We usually use the cut() and qcut() functions of pandas to perform equal-width and equal-frequency discretization of continuous attributes.

Let's run it. Suppose we want to process the AGE attribute of the Boston Housing Price dataset. With either method, we divide the data into 5 bins and label the bins from 0 to 4. Look at the equal-width method first: 5 is the number of bins, passed as the value of the "bins" argument, and the labels are 0 to 4. In the result of the equal-width method, the numbers of records carrying each label may differ; in the data shown here, the label 0 appears only a few times. Moreover, by setting the value of the "bins" argument explicitly, we can specify bin edges at a finer granularity. For example, instead of setting bins to 5 as we just did, we can set bins = [1, 2, 3, 4, 5]. Supposing the original range of the data is 1 to 5, excluding 1 but including 5, this divides the range into 4 left-open, right-closed intervals, namely (1,2], (2,3], (3,4], and (4,5]. Then, in the pd.cut() function, assign the variable "bins" to the argument "bins", and that's enough. Of course, such settings require some experience in the relevant data domain.

Then, let's look at the equal-frequency method: just change "cut" into "qcut". In the result of the equal-frequency method, the number of records in each category is the same. To illustrate this better, we can look at the first 20 records only; without counting in detail, you can see that each category contains the same number of records, 4 records each. By the way, applying the equal-width method to the same 20 records gives 3 zeros and 3 ones, different from the equal-frequency method.

Both methods are frequently used and easy to apply. Both require some experience to determine the number of bins, and both have shortcomings: the equal-width method is prone to the influence of outliers, and after binning some intervals may hold many values while others hold very few; the equal-frequency method is prone to putting identical values into different bins. These shortcomings are sometimes not very friendly to models. A sketch of both binning methods is given below.
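A minimal sketch of equal-width and equal-frequency binning on the AGE attribute with pd.cut() and pd.qcut(), including the comparison on the first 20 records mentioned above; df is assumed to be the DataFrame built earlier from the Boston data.

    import pandas as pd

    # The AGE column of the DataFrame built earlier
    age = df['AGE']

    # Equal-width binning: 5 bins of equal width, labeled 0 to 4
    age_equal_width = pd.cut(age, bins=5, labels=[0, 1, 2, 3, 4])

    # Custom bin edges are also possible; for data in the range (1, 5] this
    # gives the left-open, right-closed intervals (1,2], (2,3], (3,4], (4,5]
    # bins = [1, 2, 3, 4, 5]
    # pd.cut(some_column, bins=bins)

    # Equal-frequency binning: each bin receives (roughly) the same number of records
    age_equal_freq = pd.qcut(age, q=5, labels=[0, 1, 2, 3, 4])

    # Compare the two methods on the first 20 records
    print(pd.cut(age[:20], bins=5, labels=[0, 1, 2, 3, 4]).value_counts())
    print(pd.qcut(age[:20], q=5, labels=[0, 1, 2, 3, 4]).value_counts())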
Another common way of data transformation is feature binarization. Let's briefly look at what it means. The core of feature binarization is to set a threshold: assign 1 to values greater than the threshold and 0 to values less than or equal to it. Applied to the target attribute, it is quite suitable for converting multi-class or continuous targets into two-class problems. For instance, suppose there is a movie dataset containing multiple feature attributes of many movies, such as the genre, the release month, and whether it is a Chinese movie or a foreign-language movie, and the target attribute is the mean of the movie's ratings by many users, with the mean lying in the range [0, 10]. If we want to judge whether a new movie should be recommended or not, we may, based on experience, convert the user ratings into two categories, recommended or not recommended: say, movies rated 6 points or above are recommended and labeled 1, and those below 6 points are not recommended and labeled 0. Once the labels are determined, we can apply a suitable classification algorithm; which one depends on the actual data and task.

How do we implement such feature binarization? We can again resort to the "preprocessing" module in "sklearn", which provides several options, such as the Binarizer() and LabelEncoder() functions. For example, let's look at Binarizer(). To keep it simple, we will apply it to the target attribute of the Boston Housing Price data, even though, given the usual regression task, this target attribute normally requires no binarization. We only need to prepare the data, set the value of the "threshold" argument of the Binarizer() function, and then call the fit_transform() method. Executing it gives the binarized result. It's worth mentioning that boston.target holds the median house prices, and the threshold we just set was 20: that means we treat prices greater than 20 as one category and prices less than or equal to 20 as the other. The two categories here could be high-priced houses and non-high-priced houses. By the way, we called the reshape() method in the program. Why? Because the raw boston.target is a flat array of 506 values, and since we need each record to have its own category, it's necessary to convert it into 506 rows of one column; reshape(-1, 1) gives exactly the desired shape. A sketch of this binarization step is given at the end of this section. What if we want more than two categories, say, three? The next chapter will provide a complete case for this and also introduce some other common functions and methods.

Well, that's all for the three common ways of data transformation: normalization, discretization of continuous attributes, and feature binarization. We've listed some common methods, and these are tasks you may need to perform when solving actual problems in the future. Many methods are available, possibly with some variations; what matters most is that we develop the awareness of data transformation.
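To close, here is a minimal sketch of the binarization step discussed above, assuming the boston object loaded earlier; Binarizer with threshold=20 marks prices above 20 as 1 and the rest as 0.

    from sklearn.preprocessing import Binarizer

    # boston.target is a flat array of 506 median prices;
    # reshape it into 506 rows and 1 column so each record gets its own label
    y = boston.target.reshape(-1, 1)

    # Values greater than 20 become 1, values less than or equal to 20 become 0
    binarizer = Binarizer(threshold=20)
    y_bin = binarizer.fit_transform(y)

    print(y_bin[:10])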