Hey there. In this course, we'll be building off your existing SQL knowledge. On a dataset which is a generalized example of an e-commerce site. The format of the data we use is general enough that it should extend to many other applications and also hopefully simple enough that it doesn't take long to get familiar with the dataset. To succeed in this course, you'll need to have mastered select statements, where clauses group and order by, inner and left joins, logical operators. If any of these concepts are rusty, I recommend revisiting the first course in this series SQL for Data Scientists. You should also be comfortable with aggregation functions like, max, count, sum and so on. Also case when or if statements and sub-queries. You should at least have an idea of what they're used for. So, that you know when to review them when they would be useful. We'll also be talking about windowing functions. So, I'll link some reference material in case you need a refresher on the syntax. But we won't be spending a lot of time covering them. At the end of this video, I'll introduce you to an exercise. If the exercise takes you a long time, you may find that the best use of your time is to review these concepts before proceeding with the course. Otherwise, it's a good chance to practice your skills and explore the data set for the course. So, let's get started. So, here is a simple wireframe drawing of a website of a company we might work for. They have some items which can be viewed and hopefully purchased via the users. In order to purchase an item a visitor must first create an account and be logged in and then add the item to their cart and then check out. So, we'll need to keep a table of users in our database, that's up-to-date. Which means we keep track of the user ID, the name and email address of the users any other information we want like when the account was created, and then if there was a change such as, this user for example, was deleted. So, whenever that happened, which is much later, much after the created at timestamp. We would hope that this would show up immediately in our user's table in our analytics database. As is typical of tables like this the ID column here in the users table is referred to elsewhere as the user ID. So, if you hear me say user ID, even though the column is actually called ID. That is a pretty common convention. We also keep track of the items that we have for sale on our imaginary website. So, here's some metadata about. Each of these we've got the item ID. We have a name for the item, miniature apparatus and then some other information like the adjective category. In this case the modifier is null. You'll get a chance to explore the item dataset. There's lots of funny things in there. Then when the item was created and then the current price for the item. Keep in mind that the item's price could change. So, this would reflect their most recent price. All of these tables are effectively copies of what's running the website, and because we don't want to run a heavy query and then slow down the website for everybody else. This is just a copy and it exists in a separate place. So, here we have an invoice table. When a user completes the purchase, we would create an invoice with one line per item they purchased along with important information about the purchase like, when they paid and what the price of each item was when they paid for it. When they purchased it et cetera. We also have this other table that I call events, and this is not meant to stimulate a copy of a live production table. It is more like a stream of receipts for activity information and we'll be using it to keep track of more ephemeral things like page views. So, that we can do analysis later. When we think about the events we might like to record, one of the things that's tricky is, we don't really know how many columns we'll want. So, for example if I want to record a page view or an item view event, a user has gone to specific item page. I'm going to need a few things and I know that already. I'm going to need an event ID. Here in this example, you can see I'm using a very long and messy looking string. It's called universally unique identifier, and there's some advantages to using that rather than just an integer like the way we used for user ID. I'm also going to want to have an event name. So, here the name of the event is, item view and also time. So, the timestamp of the event, and in this case I'm definitely going to want the user ID. I want to know who was doing this viewing of items, and then I'm also going to want to know what item they were viewing. I might also want some extra information. I might want to know what platform they're on. Are they on the web? Are they on a mobile device? I might also want to know the refer. Where they came from. Maybe it was a Google search. Maybe they clicked on a recommendation. I'd really like to know because it might help me sell more of these items. So, for an event like this an item view, here's how it will look in our course dataset. All of these rows have the same event ID, and the same event time, and the same event name, and in the way I've structured it the same user ID, and then for the additional stuff, that could be of arbitrary length. We wanted three things. We wanted an item ID. So, here I have that item ID as the parameter name, and then in the column parameter value I have this, and here need to be stored as a string not an integer because when I look at some of the other parameter values they might be non numeric. So, the platform here this item do you happened on web and the refer for this was a user profiles. So, maybe our user ID 87182 was looking at someone's profile and reading their reviews and then they clicked on this item and that's how they came to view it. So, if we wanted to record an event for something else like a user creating a review, we might want some different parameters and maybe even a different number of parameters. So,in that case we could add more columns to this. This format is very accommodating. The structure is very generic and there's a few other ways we could do this. So, for example, we could just have one row per event with the things that we keep every time. The event ID, the event time, the event name, the user ID, and then we could have a last column that is a little bit more free form. Here, there's a JSON that lists in a standardized structured way that we also want to have an item ID parameter with this value, a platform with this value, a refer with this value. So, this is another way we could do it. It's got some advantages and some disadvantages. Like each company you might work for could do this in a different way. So, this data is meant to mimic online e-commerce site. Maybe it's selling real things that need shipping. Maybe it's selling digital things, and you could imagine this being Amazon or an Airbnb situation where you have to create an account and then you view things, and then you decide if you want to purchase them. We could also adopt this to other cases where maybe the action isn't to purchase but instead to save. For example, on Pinterest you could pin an item to a list and that could be, we could make a table that similar to our orders table. But instead of price, it has something else, and it's more focused on the image or the link or something like that, or on Instagram, or you're favoriting or bookmarking a post. There's are some other cases that we'd need to do some more significant changes to represent. So, a messaging service, we need to have some messages table. and we need to think about things like, what conversation does a message belong to, and is the message a draft or has it been sent? Then we might also want to think about how these messages are viewed. Whether they are read or unread. Whether we want to show a dot when they were read. We might also want to adapt a dataset or think about a dataset where we're incorporating viewing of video or listening to audio, and in that case we'd have to think about how we'd want to store people's progress in the media that they were consuming, and also how we'd want to record there listening time, and another case that we haven't really covered here is, something more like a news feed. Where we might want to think about how we represent a user viewing or getting an impression on a piece of content that's in their feed. So, we have an exercise coming up where I'm going to ask you to fix some of the broken queries in this workbook and it should go through some of the concepts that you've learned in other classes and if you get stuck. So, if you run this query and you find the error and you don't know how to fix it, my recommendation is to find a part of the error message and to type it into your search bar and see what you get. This is a really common way of problem solving. I still get errors in my SQL queries all the time, and this is how I problem-solve, and then next step, there's an open discussion if you'd like to participate. To figure out what a messages table should look like. What columns would you need? What would you want to include versus not include? When you're working for a company, that's building out software and new services. You might get to be part of a discussion about how a new table is created and how it's formatted. So, you can use this as a practice to talk to your peers about what that might look like