Of course, creating the dataset is normally 80 percent of your work. But here, we'll assume that the data already exists, so all we need to do is some basic prep work to get our dataset ready for machine learning. There are really two key steps here. First, creating a subsample of our data so that we can quickly prototype our machine learning model. Second, once we have built this modeling infrastructure, splitting our dataset into a training dataset and an evaluation dataset.

Building a machine learning model involves several steps: creating the dataset, building the model, and then operationalizing the model. The cooking analogy is choosing high-quality grocery ingredients, preparing them with your expertise, and then serving up delicious meals. Each of these steps will be discussed in detail.

What makes a feature "good"? You want to take your raw data and represent it in a form that's amenable to machine learning. So, it has to be related to the objective; you don't want to just throw random data in there. It has to be known at prediction time, which can be surprisingly tricky; we'll talk about some instances of this. It also has to be numeric. At the end of the day, computers deal with numbers, not text or categories, so if our data is represented as text, we need to find a way to convert it to a scaled numeric value. It also has to have enough examples of each feature value. Finally, you need some human insight; often, the most impactful features incorporate domain knowledge from human experts.

Let's take the second aspect: you need to know the value at the time that you're predicting. Remember that the whole reason to build the machine learning model is so that you can predict with it; if you can't predict with it, there's no point in building it. A common mistake a lot of people make is to look into their data warehouse, take all the related fields they find in there, and throw them into the model. So, if you take all these fields and use them in your machine learning model, what happens when you go to predict with it? Maybe you'll discover that your data warehouse had sales data, like how many things were sold the previous day, and that was an input into your model. But it turns out that daily sales data actually comes in a month later; it takes some time for information to come out of your stores, so there's a delay in collecting this data. Your data warehouse has the information because somebody went through the trouble of gathering the data, joining the tables, and putting it in there. But in production, at prediction time, in real time, you don't have it.

Will we know all of these things at prediction time, such as the sex of the baby and the plurality? Well, it depends. If we have an ultrasound, yes, we'll probably know these. Without an ultrasound, it's doubtful that we'll be able to get these values. Mothers often have ultrasounds performed, but not always. Thus, in practice, it would be a good idea to build two models, one with sex and plurality and one without. Another approach is that instead of building two separate models, we can build only one model but train it both with fully known data and with masked data. That way, the same model can be used in both situations.
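To make that masked-data idea concrete, here is a minimal BigQuery sketch. The table and column names (the public natality sample, is_male, plurality) are assumptions for illustration, not the exact query from this course: each row appears once as-is and once with the sex masked to 'Unknown' and the plurality collapsed, so a single model sees both variants during training.

-- Sketch only: assumed table and columns. Train one model on the union of
-- fully known rows and rows with sex and plurality masked.
SELECT
  weight_pounds,
  CAST(is_male AS STRING) AS is_male,
  CAST(plurality AS STRING) AS plurality,
  gestation_weeks,
  mother_age
FROM `bigquery-public-data.samples.natality`
UNION ALL
SELECT
  weight_pounds,
  'Unknown' AS is_male,                                          -- mask the baby's sex
  IF(plurality > 1, 'Multiple(2+)', 'Single(1)') AS plurality,   -- mask the exact count
  gestation_weeks,
  mother_age
FROM `bigquery-public-data.samples.natality`;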
So, switching gears, how are we going to create our datasets? Remember that we need to create a training dataset and an evaluation dataset, and maybe even an independent test dataset. How will we split our data into these three parts? Well, the simplest option is to sample rows randomly. Each data point is a birth record from the natality dataset, and random sampling eliminates potential biases due to the order of the training examples. But there are a few issues with taking random samples for the training and evaluation datasets. Can you think of any? Well, what about triplets, for example? These are three rows with essentially the same data: three babies for whom the mother's age, gestation weeks, and other features are all the same. We don't want what are essentially nearly identical data points to end up in both training and evaluation; we want them to fall into one or the other. Otherwise, information would leak from the training dataset into the evaluation dataset, and that's something we want to avoid as much as we can.

So, how can we solve this? How can we make the two datasets non-overlapping? For machine learning, you want to be able to repeatedly sample the data you have in BigQuery. One way to achieve this is to use the last few digits of a hash function on the field that you're using to split your data. One such hash function available in BigQuery is the farm fingerprint. Farm fingerprint will take a value like the string "December 10th, 2018" and turn it into a hashed numeric value, and this hash value will be identical for every other row dated December 10th, 2018. The split is now repeatable, because the farm fingerprint function returns the same value any time it is invoked on a specific date, so you can be sure that you will get the exact same 80 percent of the data each time. I'm showing you an example of splitting an airline dataset based on the date. If you want to split your data by arrival airport instead, so that 80 percent of airports are in the training dataset, compute the farm fingerprint on the arrival airport rather than the date. Looking at the query here, how would you get a new 10 percent sample for evaluation? Well, you could change the "less than eight" in the query above to "equals eight", and for the testing data, you could change it to "equals nine". This way, you'll get 10 percent of samples in evaluation and 10 percent in testing.

Developing the machine learning model software on the entire dataset can be expensive, and it is better to develop your modeling pipeline on a smaller sample. This is a mistake I see a lot of data scientists make: they start out trying to build their fully fledged model on the whole dataset. A much better approach is to start out simple and develop your TensorFlow code on a small subset of the data, then scale it out to the cloud. Chances are your model isn't going to execute properly the very first time, and it's much better to debug on a small dataset. If you were to use the full dataset, it could take hours or even days to make updates to your code. Then, once the application is working, you can run it on the full dataset and scale it out to the cloud.

So, we can take a random subsample of our training dataset by combining RAND() < 0.01 with the farm fingerprint sampling technique. This will allow us to keep only one percent of the training dataset.
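Putting those pieces together, here is what such queries might look like in BigQuery. This is a sketch under assumed table and column names (a flights table with a date field, following the airline example), not the exact query shown on the slide.

-- Repeatable 80 percent training split: hash the date into buckets 0-9
-- and keep buckets 0 through 7.
SELECT *
FROM `project.dataset.flights`   -- hypothetical table name
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(date AS STRING)), 10)) < 8;

-- For the evaluation set, change "< 8" to "= 8"; for the test set, use "= 9".

-- One percent prototyping subsample of the training split: the hash keeps the
-- split repeatable, while RAND() thins it out for quick development.
SELECT *
FROM `project.dataset.flights`
WHERE ABS(MOD(FARM_FINGERPRINT(CAST(date AS STRING)), 10)) < 8
  AND RAND() < 0.01;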