Welcome back. So, one of the big struggles in teaching a course on data science is that it's really hard for students to get access to a real, live dataset. At its worst, that means a student is practicing SQL, which is all about joining datasets, on a single CSV file with only one table. The other difficulty, especially when creating new tables, is that in real life, time moves forward. New data is continually added to the dataset. We just don't have the ability to simulate that in a course like this, but we are going to talk about it. By the end of this section, you'll define and use some new vocabulary, and you'll be able to answer questions about dependencies between tables.

Let's go through some examples of questions that we have covered or will cover in this course, and figure out what would be different if we were creating these tables with the latest data, always. First, let's talk about the user-counting problem from the last section. Imagine that we went with the naive solution, where we just removed every user who ever got deleted. Then, what if a bunch of users were deleted? Maybe we found out recently that a ton of them were bots, and we didn't want to include them in the user count. If those users were deleted yesterday, that could affect the user count way back in time, all the way back to the day they were created. So we'd need to rerun the whole table every single day. Fortunately, in our solution video, we figured out an approach where we don't have to do this: we count just the net users added each day. We can look at just the most recent days, and the count of net users added in a day will never be wrong, even if something changes later. If our dataset were really big (this one isn't all that big), our computational time might be worth conserving, and we could actually get by with just inserting one new row per day.

So what would that look like in the code? Well, we'd be altering the table rather than creating it, and we'd want to add some WHERE clauses so that we compute the results just for users created recently or deleted recently. If you're working with a system to automate this, there'll be some way to stick a date variable into the code, but it will depend on your environment, so we're not going to talk about it too much. With a date variable, the query can be rerun each day without having to manually edit the date. So those are the things we'd add, and it'll look a little bit like this. Now imagine that all of the tables in the database are automated this way.

Next, let's talk about dependencies. We're going to skip ahead a little and think about the view-item query that we wrote in the previous module. In the next section, we're going to practice creating that table, but for a moment, let's imagine that we also want to automate it. In that case, we might initially create the entire table, like we do in the next section, then modify the query with some date restrictions and just add new events to it daily. Okay, but there's more to it than just slapping on some date variables. The view-item table we want to build would depend on the events table being up to date. The moment new events come in, the view-item table becomes stale, meaning it is no longer up to date. Also, if these updates happen, say, nightly, then it only makes sense to kick off the job, the task of updating the view-item table, after the events table has been refreshed.
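To make the date-restriction idea concrete, here's a minimal sketch of what a daily, incremental load of a view-item table might look like. The names here (view_item_events, events, event_name, event_date) and the :run_date variable are illustrative assumptions rather than the exact names from the course dataset, and the way you pass in the date depends on your environment.

```sql
-- A hedged sketch of a daily incremental load, not the course's exact query.
-- Assumes a raw events table with event_name and event_date columns, and a
-- :run_date variable filled in by whatever tool schedules the job.
INSERT INTO view_item_events
SELECT *
FROM events
WHERE event_name = 'view_item'   -- keep only view-item events
  AND event_date = :run_date;    -- restrict to the day being loaded
```

The important part is the date restriction in the WHERE clause: instead of rebuilding the whole table, each run appends only one day's worth of rows, which is also why the job has to wait until the events table itself has been refreshed for that day.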
If we want to build something on top of the view-item table, for example, if we always want to keep an up-to-date list of the most recent item that you viewed, then that most-recently-viewed-item table would depend on the view-item table. So here's our pipeline: first the events table needs to be updated, then the view-item table, then the most-recently-viewed-item table. If I build out all these tables and schedule them to run in the proper order, I could say that I've built a data pipeline with multiple dependencies. Sometimes the abbreviation ETL, which stands for Extract, Transform, Load, is used to describe the steps happening under the hood during table creation, and sometimes the software that manages all of the scheduling is called an ETL system. There are lots of tools you can use to manage this scheduling, but we aren't going to cover them in this class.

Lastly, a question: if we design our queries to run one day at a time, how would we fill up the table with dates from the past? The answer is a process often called backfilling. If you've written the query well, you can just use the date variable and put in an older date; a rough sketch of what one backfill run might look like appears at the end of this section. Depending on your setup, there will be different tools for this, and we aren't going to talk about how to do it because it isn't really a SQL process. If this subject interests you, this area of focus is called data engineering. It involves a variety of aspects of making data available to data users, including managing the software that schedules the pipelines.

To review, here are the vocab words that I'd like you to feel comfortable with after this section.
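And here is that backfill sketch, using the same hypothetical view_item_events setup as the earlier example. In practice a scheduler or a small script would loop over the missing dates; the point is just that the same parameterized query works for past dates too.

```sql
-- Hypothetical backfill run: the same daily insert, pointed at an older date.
-- A scheduler or script would typically repeat this once per missing day.
INSERT INTO view_item_events
SELECT *
FROM events
WHERE event_name = 'view_item'
  AND event_date = DATE '2020-03-01';  -- an example past date to backfill
```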