Hi. In the real world sometimes, you need to combine two dataframes containing different information. Should it there is a key variable based on that key variable you can combine two datasets, but two datasets contains different information, there are some common information, but some part of the information of two dataframe are different. In that case, what happens? How can you efficiently handle dataset for combining two different datasets? Let me give you a realistic case. There are two dataframes which is created based on two dictionaries. The first dataframe is class_df, class dataframe. It contains four variables; course_id, lecture_id. It is probably lecturer information. This is not course_id, it is lecture_id. This is professor code, faculty code. Each faculty is providing course number ECN101 or BTM103, those are courses taught by each professor id. Because this is one courses offered by one professor and there are registered students, 70 students are taking ECN101. CAG means that course average grade, for example, 3.0 in case of ECN101. Seventy students took ECN101 from professor F1501. The average score student to get from the faculty member is 3.0, so this is the course average grade. On other dataframe, course evaluation dataframe. Here is faculty id. Surely, faculty id is different. F1506 is not contained in the previous dataset and also if the course number is the same, registered student number is the same and because each professor teaches one course, there's a matching course number and faculty number. Those are the same as above, but shares some course numbers are different and faculty numbers are different. CES, course evaluation score, after semester. Student evaluate, faculty members teaching. This is the student evaluation score, 4.0, 3.9, 4.3, the last case, 4.7. The first dataframe looks like already here. I don't know, this is the second one, the first one is here. Lecturer id, student registered course number, and CAG. There are five courses and five lecturers. The second case, there are also five lecturers and five courses but the informations contained in the second case is different from the information contained in the first case. Now how can you merge them? You can merge two dataframes, concat, you can use, in the previous video clip I used concat, but surely you can also use merge. When you merge two dataframes, probably use concat or merge. Should it later our ratio on other function join. You can use join but concat or merge is fine, probably. On my experience, merge is the best one. Let's combine two DataFrames based on lecture ID. Then what we see, the outcome is this one. Only the data point existing concurrently into DataFrame are presented. In this case, in the condition you supply [inaudible]. If a data point exist on either one of the DataFrame, it does not appear. In this case based on lecture idea, three data points commonly existing in the two DataFrame is combined in new DataFrame, mergered underbar DataFrame 0. What about you want to use two merging criteria, lecturer ID and course ID? In that case, if you want to use two variable as a criteria of merging two DataFrame, you can do so as it is presented here. Two variables are contained in on. If you execute this one, the same DataFrame is created. Because only three observations are commonly existing in the two DataFrame and they share same faculty ID and course ID. This happens because there's one-to-one matching between lecture ID and course ID. What if one faculty member teaches two courses? In that case, there might not be one-to-one matching between lecture ID and course number. In that case, the outcome will be different. Now, column slicing. In this case, what do you see in the previous case? Actually, register student number existing in course, in the first DataFrame, the same information contained in the second DataFrame. In that case, what happens? Automatically, new_x, _y is added. I didn't add it. The Pandas library automatically add x and y. This is the left DataFrame. X is added to left DataFrame, y is added to right DataFrame. There's duplication, so you want to delete duplicated variable. You can use drop or you can positively select a few variables that you want to keep. In this case, we are dropping student register number with _y. Here, we are keeping only the first column. In this case, we have a reduced DataFrame. There are different ways of combining two DataFrame. How? After how parameter, you can use inner. It means that end. Inner means end. You are combining dataset based on two variable indexes. In this case, the outcome is the same as before we have seen. So only three. What if sometimes you want to use left dataset as your basic data frame. In that case, you use left. Then what you have is that, the first data frame is kept and the other dataset is added when the tool data lecture id does not appear on other dataset, obvious there's no data point so that's why NaN appears. Surely if you can use a left you can use right also. Then course evaluation dataset will become a base dataset and left dataset is added to the course evaluation data frame. When you merge two data frames, you can also use outer, outer here means that all. You are combining two data frames without losing any data point. What you see here is this one. Now all data point you can just find in this data frame. In that, if we want to get this data frame, you use outer parameter for how. Then you get this one. Also you can use join function. But join function is very restrictive. It means that data frames with law row label indexes, and no overlapping variables. In that case, two data frames can be merged using join function. Also set under bar index using this function, you can set row indexes with existing variables. What I mean by this set under bar index, you can create new index using column information. Let me show you the case example. There are many command lines. If I run the cell, add one, then I commit message. First class under bar df. This is the first data frame that I created. Set index using lecture id and course number. Then what happens? Here, now id changes to this combined id. Lecturer id, course number together used as a row index. The remaining variables are existing in the dataset. Another case, course evaluation, also in this case, we are creating new id. New row index using lecturer id and course number. This is the same as before, same as d over, but we are adding this one CES, it means that we are keeping on the CES in order to reduce or in order to overcome overlapping variable issue because in this course under bar eval under bar df, there's also registered student information. This one is still under bar reg. But if you want to use join, this join function two data frame shouldn't not contain duplicate variables. That's why we are keeping on a course evaluation score variable. Let's execute one, so in this case, there is label based on two variables. Only variable you could just see that the data frame is CES. Between class under bar df1 course under bar eval under bar df1. Between two data frame there's no overlapping variables. In that case, we can create another data frame using joint function. We use the outer join. It means that all the data will be kept. In this case, register student number, and course average grade, course evaluation score, three papers are remaining and a new row index is presented. Based on this new data-set, what if you want to remove a specific row. In that case you can use this command line. What is the DataFrame name merged_df4? You want to remove a specific row from this module on divide df4 and you're restating that DataFrame in order to select specific variable CES. Course evaluation score. If course evaluations score is 3.9, it'll be deleted because there is this tilde. Tilde sign means that negative removing that observation. So in this case if you execute, what happens? What happens is this one. A row row disappeared with 3.9 CES. If CES variable's value is 3.9, all rows will be deleted. But there's only one row this, so one row is deleted. What if we remove this tilde from the sign, then only this role will be kept. So if you execute this one, this new merged_df4 gives, only one-row observation. So if you want to positively select all field rows, you use this command line. But if you want to remove a few row information based on from existing DataFrame, you include this tilde and it means that you are removing all few rows based on this condition. Isin is a function checking, whether this information is existing in this CES variables. If this value exists in this CES variable then that observation is deleted. This is a very convenient, we will preprocess data set. Another function that I need to explain is dropna. Dropna means that here if you see NaN observations you want to delete all rows with at least one NaN, not a number. So in that case you use this one, an outcome will be this one, dropna. So only remaining data is without NaN, three rules are kept, four rows are deleted. So dropna using this function you can clean the data set. You can remove the observations with NaN, it means that missing data. If you want to remove observations with missing data, you can use dropna function. Before finishing this video clip, again, here goes a review question; true or false. In combining data frames, the merge function is more flexible than the join function. Yes, it is true. Join function can be used only when a few conditions are satisfied. That's what I already explained. So if you want to manage multiple data frames, you can use either concat or merge. But merge is, based on my experience, I recommend for you use the merge function.