Hi. In this video clip, let me explain how to get datasets and how to save them to, and retrieve them from, your computer. The main topics I'm going to cover in this video clip are reading and saving files, and an introduction to dataset repositories. There are three famous dataset repositories. The first one is the UC Irvine Machine Learning Repository. Here is the link. The second one is Rdataset. R is another very famous statistical package; you can use R for statistical analysis or coding instead of Python, and Rdataset contains lots of datasets you can use. The third one is Kaggle datasets, which is also a very famous repository, from which you can get many kinds of datasets.
First, let me introduce the UC Irvine Machine Learning Repository. Click it. What you see here is the home page. I already introduced the Iris dataset, but what if we want to get the Iris dataset from the UC Irvine Machine Learning Repository? Simply type "Iris" in this search window, then press Enter. The first link provided is the Iris dataset in the UC Irvine Machine Learning Repository. Click it. Then you come back to the UC Irvine Machine Learning Repository, but this time the Iris dataset is presented. Here is the description of the dataset. If you want to get the dataset, you click the data folder at the top. The top folder is here, and if I enlarge it a little, here is the Iris data. If you want to get the Iris data, you can get it from the link here. To get the link, move the mouse over this Iris data and right-click. When I right-click, the menu appears in Korean, but on your computer you will see an English drop-down menu with Copy link. This item is Copy link, so left-click it. Then the link to this dataset is copied to your computer's clipboard. Copy it, then go back to your coding window, and let me show you whether it was copied correctly. Actually, I already put it here: the Iris dataset is read with pd.read_csv.
The Iris dataset provided by the UC Irvine Machine Learning Repository is a CSV file. That's why I'm reading it with the read_csv function, which is contained in the pandas library. Because this dataset is not stored on my computer, I'm getting it from the web page. This is the data URL, and it is exactly the same as the one I copied. Let me paste the link here. This is the link that I copied just a few seconds ago. If you compare the two links, they are exactly the same.
Now, header=None means that this Iris dataset provided by UC Irvine has no header row. It means that there are no variable names contained in the dataset. This Iris dataset contains only the data values, without showing their variable names. I am putting header=None because, if you do not put this information, the first row of data will be used as the header. That's why you need to put this information. Be careful: you cannot use zero in place of None. header=0 means the opposite, namely that the first row of the file holds the variable names. When the dataset already has a first row for variable names, you use header=0, which is also what pandas does by default. The header argument takes None or a row number, not True or False, and because rows are counted from zero, header=1 would use the second row as the header.
After downloading the data into your working memory, you assign the column names: SepalLength, SepalWidth, PetalLength, PetalWidth, and the last column, Name, for the kind of iris. Setosa, Versicolor, and Virginica are already there in the fifth column, which contains the Iris class. Let me execute this one and see what happens. It takes a little time because we are downloading it, and then the dataframe is presented here. This time, the Iris dataset is downloaded, brought from the UC Irvine data repository to your working space. Now, in order to provide the column names, I used irisdata.columns =.
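The header handling and column naming described above can be sketched on a small in-memory sample. This is only a minimal sketch, not the exact notebook cell: the four sample rows and the io.StringIO stand-in for the download URL are assumptions, used so the example runs without network access. The column names are the ones listed in the video, and the names argument shown at the end is an alternative way to pass them while reading.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: a few Iris-style rows
# with no header row, like the UCI file described above.
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

column_names = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "Name"]

# header=None: no row is treated as a header, so every row stays data.
irisdata = pd.read_csv(io.StringIO(sample_csv), header=None)
irisdata.columns = column_names  # assign the names after reading

# Alternative: pass the names while reading, via the names argument.
named = pd.read_csv(io.StringIO(sample_csv), header=None, names=column_names)

# Without header=None, pandas takes the first row as column names,
# so the first observation is no longer part of the data.
wrong = pd.read_csv(io.StringIO(sample_csv))

print(irisdata.shape)
print(wrong.shape)
```

Comparing the two printed shapes shows that leaving out header=None costs one observation, which is why the argument matters for a file without a header row.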
We already studied this way of naming columns. If you want, you can assign the names when you download. For example, you copy this list of names and put it in a names argument, either after header or before header, which is a little bit easier. You put names, separated by a comma. Then what happens? Exactly the same dataset appears with the same names. It takes time because we are downloading, but anyway, the same dataset is obtained. You can use the names argument to assign column names when no header information is provided in the dataset.
We downloaded a CSV file from the repository. Surely you can download other datasets from the UC Irvine repository, or from the second data repository, Rdataset, or from Kaggle datasets. I'm not introducing all the cases, but you can do it, no problem. Let's print how many observations are in the data frame. Because we are already familiar with the Iris dataset, the length is 150. We already know there are 150 observations in the Iris dataset. The same dataset is downloaded. We previously used the Iris dataset contained in the scikit-learn library. But if your dataset is not contained in an existing library, you sometimes need to download it from somewhere. In that case, you can use this approach. The shape is 150 by 5, a two-dimensional dataframe.
So far, I have introduced the famous dataset repositories. Now, once you have downloaded the dataset into your working memory, if you turn off the computer, the data in your working memory disappears. Before finishing your coding, you need to save the dataset. Now let me explain how you can save the dataset. First, let me explain the concept of an absolute address, or absolute file path, and a relative address, or relative file path. You can use either term; address simply means file path. On a Windows computer, an absolute file path typically starts from the C drive, which is usually the main drive. In my case, on the C drive there is Users, and under Users there is the CELT directory. Under the CELT directory, I created the Coursera directory.
Within this Coursera directory, I put all the coding files right now. But in this case, I'm going to create a data subdirectory under Coursera, because when you code, you may download many datasets. Do not mix datasets with your coding files, because your directory becomes messy. Rather than mixing coding files and dataset files, you need to separate them. Let me show you my directory structure right now. Under Coursera, here is the images subdirectory, where I'm putting all the image files. But in order to put datasets in a data directory, we need to create another subdirectory. Here is the folder icon with a plus sign on it. Click it; it is New Folder. If you click it, an untitled folder appears, and then you rename it: you assign the name data to that directory. I'm creating the subdirectory data. You had better follow this, because in the following video clips I'm going to use this file path.
Now, an absolute file path shows the whole address, from the root, the C drive, down to the directory. But where am I working right now? I'm working in the Coursera directory. It means that I'm here, so when you save or retrieve data, you don't have to type the absolute address. As I did here, I'm using a relative address, a relative file path, because I am in the Coursera directory right now. Then, relative to my working directory, I only designate where the data is stored: in the data subdirectory. I'm going to save irisdata, which I brought into my working space. To save irisdata, the object that I created here by downloading the file, I'm saving it to this data subdirectory with the file name iris.csv. To save this data frame, irisdata, you use to_csv. This to_csv function belongs to the pandas library. Because I am working with a pandas data frame, I'm applying this saving function contained in the pandas library. If I use this one, what happens?
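The difference between the two kinds of path can be sketched in a few lines. This is a minimal illustration under one assumption: it uses Python's standard pathlib module, which is not shown in the video, simply to display what the relative path data/iris.csv refers to from the current working directory.

```python
from pathlib import Path

# Relative path: interpreted starting from the current working directory.
relative_path = Path("data") / "iris.csv"

# The absolute path that the relative one refers to right now.
absolute_path = Path.cwd() / relative_path

print(relative_path.is_absolute())  # False
print(absolute_path.is_absolute())  # True
print(absolute_path)  # differs from one computer to another
```

The relative path reads the same on every computer, while the absolute one printed on the last line depends on where the working directory sits on that particular machine.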
Before executing this cell, let's go into the data subdirectory by clicking it. There is nothing in there, because I have never saved any file in this subdirectory so far. Go back to Coursera and execute this cell. Then what happens? There must be a file in the data subdirectory. If you click the data subdirectory, you see the iris.csv that you just saved. Click it if you want to see that dataset here: sepal length, sepal width, petal length, petal width, and name, which is the fifth column. If you scroll down, you see all 150 rows. This is a benefit of using JupyterLab. This way of looking at the data is similar to looking at data in an Excel file. The layout is almost exactly the same as the Excel layout, so you can check the dataset in JupyterLab.
After closing it, you can now retrieve the Iris dataset that we just saved. For that, we use read_csv. We are assigning another object name in order to distinguish this retrieved data frame from the original Iris dataset. Let me execute this line. Then we see another data frame containing the same information. But this time, the information is retrieved from the dataset you saved.
Try to use relative file paths; do not use absolute file paths in saving and retrieving data. Why am I emphasizing this? Because if you put absolute file paths, what happens? When you share this Jupyter Notebook file with others, the file paths depend on who uses the computer. Other people may assign a different name to the same directory, or they may use other file paths altogether, so an absolute path from my computer may not exist on theirs. By using relative file paths, you make file sharing easier. It is also easier to type the file path, as long as you use the same subdirectory name, data, under this Coursera directory.
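The save-and-retrieve steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact notebook cells: the small data frame stands in for the downloaded irisdata, the object name irisdata2 is hypothetical, creating the data subdirectory from code with mkdir replaces the manual New Folder step, and index=False is an assumption that keeps the row index out of the saved file.

```python
from pathlib import Path

import pandas as pd

# Small hypothetical stand-in for the downloaded irisdata frame.
irisdata = pd.DataFrame(
    {
        "SepalLength": [5.1, 7.0, 6.3],
        "SepalWidth": [3.5, 3.2, 3.3],
        "PetalLength": [1.4, 4.7, 6.0],
        "PetalWidth": [0.2, 1.4, 2.5],
        "Name": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

# Relative path: the data subdirectory under the current working directory.
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)  # create it if it does not exist yet
file_path = data_dir / "iris.csv"

# Save the frame; index=False (an assumption) omits the row index column.
irisdata.to_csv(file_path, index=False)

# Retrieve it under a different name to keep it apart from the original.
# The header row written by to_csv is used for the column names by default.
irisdata2 = pd.read_csv(file_path)

print(irisdata2.shape)
print(list(irisdata2.columns))
```

Because the path is relative, the same cell works unchanged for anyone who runs it from a working directory that is allowed to contain a data subdirectory.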
Without changing the code at all, you can use the code that I posted somewhere, or when I share this file with you, you simply use the code without correcting it. Use relative file paths. Sometimes you may have to use absolute file paths, but that will be very rare. Finally, the review question is: what is the syntax to use the first-row values of a CSV file as column names? header=0, which is also the default of read_csv. That is the argument you need to use. header=1 would use the second row instead, and header=True is not accepted by pandas.
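The answer to the review question can be checked on a small file that does have a header row. This is a minimal sketch: the two-row sample and the io.StringIO stand-in for a file on disk are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first row holds the variable names.
csv_with_header = (
    "SepalLength,SepalWidth,PetalLength,PetalWidth,Name\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

# header=0: the first row (row number 0) supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_with_header), header=0)

# Leaving header out gives the same result here, because the default
# behaves like header=0 when no names argument is passed.
default = pd.read_csv(io.StringIO(csv_with_header))

# header=None: the row of names is read as an ordinary data row.
no_header = pd.read_csv(io.StringIO(csv_with_header), header=None)

print(list(with_header.columns))
print(with_header.shape, no_header.shape)
```

With header=None the frame gains one extra row, the names themselves, which is the mirror image of the problem that arises when header=None is left out for a file without a header row.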