[MUSIC] Hi. In this video clip, let me give you an example of data preprocessing: you can combine or divide a data frame. The pandas library is widely used for data preprocessing. To give you an example, we first need to take a dataset from scikit-learn's embedded datasets. Scikit-learn already ships with many datasets in its library. To get the data from scikit-learn's embedded datasets, we first need to import the scikit-learn library. This is what I already explained in the previous video clip. Now, using the load_iris function, we can take the iris dataset and assign it to the iris_data object. So this is a way of assigning the iris data that exists in the scikit-learn datasets to this object. And you use the keys function to see what kind of data exists in iris_data. So let's execute this cell. It returns this information: data, target, frame, target_names, DESCR, feature_names, and filename. So those names are contained in iris_data. What is returned is simply the names of the keys. It means that the dataset actually takes a dictionary-like format, and those are the key names for each part of the dataset. So the iris data can be retrieved by using the data key. If you want to see the description of iris_data, the DESCR key returns the description of the dataset. So let me execute this cell again, and we can see the printed information from iris_data.DESCR. By using this command line, you can get the basic information of the iris dataset. There is attribute information; attribute is another name for variable or feature. And there are three classes of iris: setosa, versicolour, and virginica. That is the basic information contained in the iris dataset. So now the whole dataset is contained in iris_data. Now we can unpack some data from this loaded dataset, iris_data, which holds the whole information of the iris dataset. From that iris dataset, we are taking data and assigning it to iris_feature.
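The loading steps described so far can be sketched roughly as follows. This is a minimal sketch, assuming scikit-learn is installed; the exact cells shown in the video may differ, and the 300-character slice of the description is only there to keep the output short.

```python
from sklearn.datasets import load_iris

# Load the bundled iris dataset; the result behaves like a dictionary.
iris_data = load_iris()

# List the available keys (data, target, target_names, DESCR, feature_names, ...).
print(iris_data.keys())

# Print the beginning of the dataset description.
print(iris_data.DESCR[:300])

# Take the feature values out of the loaded dataset.
iris_feature = iris_data.data
print(iris_feature.shape)  # (150, 4)
```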
It means that the feature columns will be assigned to iris_feature. And from iris_data we are getting target, and it is assigned to iris_target. The iris dataset is often used in AI education. It is used for classification with labels, that is, for supervised learning. The target contains the class information. As I already explained, there are three types of iris, so the target variable contains this information. The data contains the four features. So let's see what it looks like. After assigning iris_feature and iris_target, let's first see the shape of each. iris_feature is a two-dimensional array. As you see here, the row number is 150, which means there are 150 observations, and the number of columns is 4, so it is a 150 by 4 dataset. iris_target is also a numpy array, but it is one-dimensional. If you want to combine two numpy arrays, the dimensions should match. So we can make the iris_full dataset. This is another numpy array, a two-dimensional array. We are using hstack, the horizontal combination of two arrays: iris_feature and iris_target will be combined horizontally. But before combining, we need to reshape iris_target because it is one-dimensional. To convert this one-dimensional numpy array into a two-dimensional numpy array, we specify 150, the row number, and 1 column. So it is the same data, but this time the target data exists in a two-dimensional array. So we execute this one and check the shape after horizontal stacking. This is what we already studied when we studied numpy, right? You remember, we are horizontally adding two numpy arrays, and the shape becomes 150 by 5, right? So now the two datasets are combined. If you want to see the first five data points, this is what is returned. So there are five columns and five rows in a two-dimensional array.
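The reshape-then-hstack step can be sketched like this. To keep the sketch independent of scikit-learn, it uses stand-in arrays with the same shapes as the iris data; the zero feature values are placeholders, not real measurements, and the name iris_target_2d is introduced here only for illustration.

```python
import numpy as np

# Stand-in arrays with the same shapes as the iris data (placeholder values).
iris_feature = np.zeros((150, 4))       # 150 observations, 4 features
iris_target = np.repeat([0, 1, 2], 50)  # 50 observations per class, 1-D

print(iris_feature.shape)  # (150, 4)
print(iris_target.shape)   # (150,)

# Turn the 1-D target into a 150 x 1 column so the dimensions match.
iris_target_2d = iris_target.reshape(150, 1)

# Stack the features and the target side by side.
iris_full = np.hstack((iris_feature, iris_target_2d))
print(iris_full.shape)  # (150, 5)

# First five rows of the combined array.
print(iris_full[:5])
```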
Now we are transforming the numpy arrays into data frames. This time, rather than using the combined array, I'd like to give you an example problem: we construct two data frames from the two iris arrays and then combine the two data frames. So we can convert the combined array into a data frame, or we can convert the feature data into a data frame and the target into another data frame and then combine them. You can use either way, but here let me give you an example of the second one. We are creating iris_feature_df from the feature data, with the column names given here. Feature names are used; if we go above, there were the feature names, right? The iris dataset also contains feature names, and if we execute again here, target names. Target names and feature names. Feature names are the variable names. So we are assigning variable names to this data frame, and if we execute, what you see is this one. Here is the row index, which is automatically assigned to this data frame. And now the column names are appearing, four column names: the first is sepal length, the second sepal width, the third petal length, and the fourth petal width, right? So it is 150 by 4; the shape information is 150 and 4. Now, if you look at the data frame created here, the column names contain empty spaces. They also contain the centimeter unit. So if you want to use such a column name, it is not easy to type. It is better to change the column names into simpler ones. For example, here we can rename the columns using this command line. As I already introduced, you use iris_feature_df, which was created here, right? Now we are changing the column names using this command line, but here you need to strictly follow the order of the columns. If you list the column names in an arbitrary order, you will misidentify the column information.
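Creating the feature data frame and renaming its columns by position might look like the sketch below. The feature values are placeholders, the original column names are the ones scikit-learn provides in feature_names, and the new underscore-style names are an assumption made for illustration, since the exact names typed in the video are not in the transcript.

```python
import numpy as np
import pandas as pd

# Column names as provided by scikit-learn's iris feature_names.
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Placeholder feature values with the same shape as the iris features.
iris_feature = np.zeros((150, 4))

# Build the data frame; the row index 0..149 is assigned automatically.
iris_feature_df = pd.DataFrame(iris_feature, columns=feature_names)
print(iris_feature_df.shape)  # (150, 4)

# Positional rename: the new names must follow the existing column order.
# (These new names are hypothetical.)
iris_feature_df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
print(iris_feature_df.head())
```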
So sepal length: here's sepal length, the same name but now with no empty space, and then sepal width, petal length, and petal width. Then let's see how the names change. These are the changed names you see. But you may want to use another approach, because what if there are very many columns? Then you cannot remember the order of the column names correctly. In that case, you can use a dictionary to change column names. So let me comment this out and see another case. First, we need to rerun this cell, and then this cell shows you how to rename columns. Here's iris_feature_df, the object name; you are using the rename function, and then columns is equal to a dictionary. Each original column name is matched with the new column name given here, so four new column names are specified. And also remember inplace=True here: if you do not give this, the column names of the original data frame do not change. So if you want to change the column names in the original iris_feature_df, you need to put inplace=True in this command line. In that case, the column names change. So these are the new column names created by the rename function. So you can use the rename function, and in this case the order does not matter as long as the names match. So when there are many columns and you cannot remember their order correctly, maybe it is better to use this approach. Now, we are creating another data frame based on the target. This time the column name is target, because there is only one variable, iris_target. Let's execute this one. Then what you see is the row index and the target variable, which contains 0, 1, 2. For example, if we use tail instead of head, the last five observations will be returned, from 145 to 149. For the last five observations the target is 2. It means that there are three kinds of iris.
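The dictionary-based rename and the target data frame could be sketched as follows. The data are stand-ins with the same shapes as the iris arrays, and the new column names are hypothetical; the dictionary is deliberately written in a shuffled order to show that only the key-to-value match matters.

```python
import numpy as np
import pandas as pd

feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
iris_feature_df = pd.DataFrame(np.zeros((150, 4)), columns=feature_names)

# Rename by mapping old names to new names; order in the dict is irrelevant.
# inplace=True changes iris_feature_df itself instead of returning a copy.
iris_feature_df.rename(
    columns={
        "petal width (cm)": "petal_width",
        "sepal length (cm)": "sepal_length",
        "petal length (cm)": "petal_length",
        "sepal width (cm)": "sepal_width",
    },
    inplace=True,
)
print(iris_feature_df.columns.tolist())

# Target data frame with a single column named "target".
iris_target = np.repeat([0, 1, 2], 50)  # same layout as the iris target
iris_target_df = pd.DataFrame(iris_target, columns=["target"])
print(iris_target_df.shape)   # (150, 1)
print(iris_target_df.tail())  # rows 145..149, all with target 2
```

An alternative to inplace=True is to assign the returned data frame back, for example `iris_feature_df = iris_feature_df.rename(columns=...)`.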
For setosa 0 is assigned, for versicolor 1 is assigned, and for virginica 2 is assigned. So let me see the shape information: 150 by 1. Now it is two-dimensional. Also, as you have seen, the original dataset provides target names. So if you execute this one, you can see the target names: setosa, versicolor, and virginica. This is not iris_target; this is the original iris_data. From that dataset we can take out target_names, because it is a dictionary-type dataset. That's why, if you put the key here, the matching values are presented, right? But as you see here, the target variable contains 0, 1, 2. You may want to create another label that contains the name of the kind. In that case you can use this command line. Let me explain the command line. Here's iris_target_df, the data frame we created. You are taking a specific chunk of the data using loc, which uses the label index. From that data frame you are getting the target information, and if the target is equal to zero, then we are creating a new variable called label in iris_target_df and assigning the string value setosa to it. And then we keep repeating: for versicolor, it is the same kind of slicing of the data. If the target is 1, the matching label value is versicolor, and if it is 2, virginica. So if we execute this one, what happens? Now iris_target_df has another variable called label, because it is created based on this. Wherever the target matches, the name of the iris kind is assigned, so you can easily check whether the new label variable is correctly created or not. Again, if you use tail, you can check: obviously the virginica target number is 2, with the matching label. So we can add another variable to an existing data frame using these command lines. Now we are combining two datasets.
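Creating the label variable with loc can be sketched as below, using a stand-in target column laid out like the iris target (50 rows per class). This is one plausible form of the command lines described; the exact code in the video is not shown in the transcript.

```python
import numpy as np
import pandas as pd

# Stand-in for the target data frame: 0, 1, 2 repeated 50 times each.
iris_target_df = pd.DataFrame(np.repeat([0, 1, 2], 50), columns=["target"])

# For the rows where the condition holds, set the "label" column.
# The first assignment creates the new column.
iris_target_df.loc[iris_target_df["target"] == 0, "label"] = "setosa"
iris_target_df.loc[iris_target_df["target"] == 1, "label"] = "versicolor"
iris_target_df.loc[iris_target_df["target"] == 2, "label"] = "virginica"

# Check the result at both ends of the data frame.
print(iris_target_df.head())
print(iris_target_df.tail())
```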
We already created iris_feature_df; this is the data frame containing the feature information. This is another data frame, containing the target and label information. We are adding them horizontally, with axis equal to 1. So we are creating a new dataset, and it looks like this: the target and label data frame is added to the existing iris_feature_df, creating a new data frame called iris_full_df, and the first five observations are printed here. If you use tail, you can again see the last five observations in the same way. What is the dimension? Because two variables were added, it is 150 by 6. So far, we studied how to add new variables based on existing information and how to combine two data frames. Now it's time to study dividing an existing data frame into sub data frames. What if you want to take out a setosa-only dataset? You want to create another dataset, setosa_df. You can take out the chunk, or part of the dataset, which has the setosa label. It's easy because it is slicing. You are slicing this way: iris_full_df, label. If the label value is equal to setosa, you are taking only those rows from iris_full_df. You are taking part of the rows based on the label information. So if you take it out, what do you see? Only 50 observations are separated, from 0 to 49. It contains only setosa. If you want to see the shape, it is 50 by 6. As I said before, only the setosa data is sliced, horizontally sliced. The same approach can be applied to the versicolor case. Let me show you the first command line. In this case we are using the same approach as before, but we changed it to versicolor. Then what do you see here? The index numbers run from 50 to 99, that is, starting from the 51st observation, right?
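Combining the two data frames side by side and then slicing by label might look like this sketch. The feature values and column names are placeholders, and the label column is built here with map only as a shortcut for setting up the stand-in data.

```python
import numpy as np
import pandas as pd

# Stand-in feature and target/label data frames (placeholder feature values).
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_feature_df = pd.DataFrame(np.zeros((150, 4)), columns=cols)
iris_target_df = pd.DataFrame({"target": np.repeat([0, 1, 2], 50)})
iris_target_df["label"] = iris_target_df["target"].map(
    {0: "setosa", 1: "versicolor", 2: "virginica"}
)

# Horizontal combination: axis=1 places the columns side by side.
iris_full_df = pd.concat([iris_feature_df, iris_target_df], axis=1)
print(iris_full_df.shape)  # (150, 6)

# Keep only the rows whose label is setosa.
setosa_df = iris_full_df[iris_full_df["label"] == "setosa"]
print(setosa_df.shape)  # (50, 6)

# Same approach for versicolor; the original row index (50..99) is kept.
versicolor_df = iris_full_df[iris_full_df["label"] == "versicolor"]
print(versicolor_df.index[0], versicolor_df.index[-1])  # 50 99
```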
But in this case the index is not starting from zero, because we are using the existing index numbers. So if we want to reset the index, you use this one. You add this command, reset_index, and see what happens. A new index starting from zero is added, but the original index does not disappear. It creates a new variable called index that contains the old index numbers. So what if you don't want to keep the old index numbers? In that case, you use the same reset_index but you add drop equal to True. Then, obviously, the old index no longer exists in the dataset. So if you want to drop the old index, you add drop=True. Using the same approach, we create another dataset for virginica. You see here we already added reset_index with drop equal to True, right? So we create this one with index numbers from 0 to 49. Now we have divided the whole data frame into three chunks. But what if we again create the whole dataset from the three sub-datasets? In that case we use the same function, concat. In the previous case we used concat for combining horizontally; for combining vertically we also use concat, but in this case we use axis equal to 0. We are adding the three sub-datasets, setosa_df, versicolor_df, and virginica_df, and we need to reset the index, because the data now has 150 observations, so the starting index is 0 and the ending index should be 149. The old indexes from 0 to 49 should be dropped. So after rebuilding, we now have a new dataset, which is exactly the same as the original whole dataset. Now here is another case: scores_df and scores_ni. We are combining them using the concat function. In this case we are resetting the index, and what you see here is that we didn't specify the axis information.
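The reset_index behaviour and the vertical rebuild of the iris data frame described above can be sketched as follows. The data frame is a stand-in with only the target and label columns, and the names kept and rebuilt_df are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Stand-in for the full data frame (target and label columns only).
iris_full_df = pd.DataFrame({"target": np.repeat([0, 1, 2], 50)})
iris_full_df["label"] = iris_full_df["target"].map(
    {0: "setosa", 1: "versicolor", 2: "virginica"}
)

versicolor_rows = iris_full_df[iris_full_df["label"] == "versicolor"]

# Plain reset_index: new index 0..49, old index kept as a column "index".
kept = versicolor_rows.reset_index()
print(kept.columns.tolist())

# reset_index(drop=True): new index 0..49, old index discarded.
setosa_df = iris_full_df[iris_full_df["label"] == "setosa"].reset_index(drop=True)
versicolor_df = versicolor_rows.reset_index(drop=True)
virginica_df = iris_full_df[iris_full_df["label"] == "virginica"].reset_index(drop=True)

# Vertical combination (axis=0), then a fresh index 0..149.
rebuilt_df = pd.concat([setosa_df, versicolor_df, virginica_df], axis=0)
rebuilt_df = rebuilt_df.reset_index(drop=True)
print(rebuilt_df.shape)  # (150, 2)
```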
Then, automatically, it means that we are stacking vertically. We are combining the two datasets vertically, and a numerical index from 0 to 5 is added. The original index is converted into a variable: computer, chemistry, food, math, econ, and physics are all included here. And of the two data frames, in the first one Num exists, and in the second one Num is not there. In the first one, Lee did not exist. That's why, when they are combined, NaN, not a number, is printed where there is no information. Now, making unique values by combining variables. For example, you can add a year variable using this one. scores_full, this is the combined dataset, right? It is already presented here. But we want to add a year variable by creating a sequence of years. So if we execute this one, the year data will be added, right? First, to simplify the output, let me comment out the other command lines and execute this one. What you see is this: now year is added, right? And then what I want to show here is that you want to create course_year, another new variable. By combining the course name in the index variable with the year, you are creating a unique label. In that case, how can you create the course_year unique label? From the scores_full dataset we are taking index; we are concatenating an underscore, given as a string; then another concatenation, and from scores_full we are taking year. But in this case year is not a string; year holds integer numbers. So in order to combine strings, we need to turn the integer values into strings. How can we convert integer values into strings? You use this method: you take the values from year and apply astype. astype is a function that converts the data type, so the type is changed to string. So from integer it will be converted to string. So now all three parts are strings.
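A rough sketch of the scores example is given below. The transcript does not show the actual contents of scores_df and scores_ni, so every score, the student column Kim, and the year values are hypothetical; only the course names, the columns Num and Lee, and the overall steps follow the description above.

```python
import pandas as pd

# Hypothetical score tables: courses in the index, students in the columns.
scores_df = pd.DataFrame(
    {"Kim": [90, 80, 70], "Num": [85, 75, 65]},
    index=["computer", "chemistry", "food"],
)
scores_ni = pd.DataFrame(
    {"Kim": [88, 78, 68], "Lee": [92, 82, 72]},
    index=["math", "econ", "physics"],
)

# No axis given, so concat stacks vertically; reset_index turns the course
# names into a column called "index" and adds a numeric index 0..5.
scores_full = pd.concat([scores_df, scores_ni]).reset_index()

# Cells with no information (Num in the second table, Lee in the first) are NaN.
print(scores_full)

# Add a hypothetical year column, one value per row.
scores_full["year"] = [2015, 2016, 2017, 2018, 2019, 2020]

# Build a unique label: string + "_" + year converted from integer to string.
scores_full["course_year"] = (
    scores_full["index"] + "_" + scores_full["year"].astype(str)
)
print(scores_full[["index", "year", "course_year"]])
```

Without astype(str), adding the integer year column to the string pieces would raise an error, which is why the conversion comes first.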
This part is a string, the underscore is a string, and the year information is converted into a string. That's why they can be concatenated, and a new variable is created here. course_year is created, right? So based on the information in two variables, you can create a unique id for each score record. After creating this one, what do you see? The number of variables keeps increasing. So after creating the new unique id for the scores, you may want to delete the two variables index and year, because they are redundant information. In that case, on the original dataset you use the drop function, specifying the column names. It means that you want to drop index and year, and a new data frame is created. And see, the final data frame created here is this one. course_year is there, and the index and year variables have disappeared. So far, I explained how to preprocess data before analysis. This is a good example showing how good the pandas library is. The pandas library is powerful. With this library, you can transform existing data in many ways. You can easily chop data up and combine it again, or add new variables. Before finishing this video clip, here's a review question, true or false: simple use of reset_index results in keeping the old index after adding a new index to a data frame. This is what I explained. If you want to drop the old index, you need to use a parameter setting, right? What was that? It is drop equal to True; then the old index does not appear. And the question, true or false? True. If you do not add drop equal to True, the old index will be added as a new variable to the existing data frame.
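Dropping the redundant columns and checking the review answer can be sketched as follows. The small scores_full table, the student column Kim, and the names scores_final and demo are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical combined table with the columns discussed above.
scores_full = pd.DataFrame(
    {"index": ["math", "econ"], "Kim": [90, 85], "year": [2015, 2016]}
)
scores_full["course_year"] = (
    scores_full["index"] + "_" + scores_full["year"].astype(str)
)

# Drop the two redundant columns; drop returns a new data frame.
scores_final = scores_full.drop(columns=["index", "year"])
print(scores_final.columns.tolist())

# Review question: plain reset_index keeps the old index as a new column,
# while reset_index(drop=True) discards it.
demo = pd.DataFrame({"value": [1, 2]}, index=[50, 51])
print(demo.reset_index().columns.tolist())
print(demo.reset_index(drop=True).columns.tolist())
```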