Hey there. Hive is a common data warehousing infrastructure that lets you use SQL on a distributed data system. Partitions are a way of subdividing a table into pieces that can be stored or queried individually. In this video, you'll see how to create partitions in a table, and we'll discuss some ways they can be used to make your large queries more efficient. This is just an optional video, where we're going to spend a bit of time talking about Hive partitions.

So let's start out by learning some new words. Hadoop is a whole ecosystem of open source tools developed by Apache that work on a distributed file system. If your data is really big, you might choose to store it across multiple servers, sometimes called a cluster. When you want to retrieve data stored across multiple servers, or nodes, you need a way for those nodes to communicate with each other. Hadoop is the set of tools needed to do that. Hive is a specific tool that takes your SQL and translates it into something that can be run on nodes running Hadoop. The SQL variant is called HQL, but I've mostly heard it called Hive SQL or just Hive. So I could say, "I'm writing a Hive query."

One of the features of Hive that we're going to talk about is partitions. Maybe you could say this is a feature of Hadoop, since it has to do with how the data is stored in the file system. But I think it's fine, because it affects the SQL part too, and that's what we're going to talk about. So, if a table is just a single file, a partitioned table is multiple files, sorted or partitioned into folders. Having partitions doesn't change the functional correctness of a query when you're retrieving data, but it changes the way you create and insert data into the table. So let's look at some of the changes that happen when you want to have a partition. Here on the left is what the table creation statement would look like by default in Hive.
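As a rough sketch of what that might look like (the table and column names here are made up for illustration, not taken from the video):

```sql
-- A hypothetical default (unpartitioned) Hive table. Note the Hive type
-- names (STRING, BIGINT, DOUBLE) and that, unlike MySQL, we don't declare
-- nullability or primary keys.
CREATE TABLE sales (
    sale_id   BIGINT,
    amount    DOUBLE,
    sale_date STRING,
    category  STRING
);
```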
It's already a little different from the MySQL in most of this section, because the data types have different names and we don't specify nullable columns or primary keys here. The major change is that if we want to partition by category, first we need to list that column last, and then we need to pull it out of the create table statement into a partition statement. Category will still work like a column, but it will be pulled from the file system structure, not the table itself. Then, when you insert into the table, you add in this line. Again, notice that in the query I need to list category last so that it can be partitioned in this format.

So, why would I want to partition my data? It has the ability to make table updates faster, to make data retrieval faster, and to make joins faster. So, sounds pretty great. Basically, anytime you have data where you'll be frequently updating or querying some specific part of it, that's when partitioning might make sense. I'm not actually sure that category is the right thing to partition on; I'm not sure I'd be writing queries where I specifically needed one category. So, let's think a little more about when we should partition.

Date is a really common thing to partition on. It's really common to want to insert just the most recent day's worth of data, and remember, partitions can help make inserting data easier. Imagine also that we'll be writing a lot of queries where we care about a recent date window. For example, if we have a downstream job with a "WHERE" clause like the one I have here, we'll have to scan the entire table just to look for dates in the most recent 30 days. But if we've partitioned the table by date, then we can avoid scanning the entire table, and that table could be decades long. We can just look at 30 of those folders, 30 of those dates, 30-ish. That could save us a lot of time.
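To make those changes concrete, here's a hedged sketch using the same hypothetical sales table: the partition column moves out of the column list into a PARTITIONED BY clause, the insert names the partition and lists category last, and a date-partitioned table lets a 30-day filter prune down to just those folders. (The table names, column names, and use of dynamic partitioning here are my assumptions, not from the video.)

```sql
-- Partitioned by category: the column moves out of the column list
-- and into the PARTITIONED BY clause.
CREATE TABLE sales_by_category (
    sale_id   BIGINT,
    amount    DOUBLE,
    sale_date STRING
)
PARTITIONED BY (category STRING);

-- Dynamic partitioning (letting Hive pick the partition per row)
-- typically requires this setting:
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Inserting with a PARTITION clause: category is listed last in the
-- SELECT so Hive can route each row to its partition folder.
INSERT INTO TABLE sales_by_category
PARTITION (category)
SELECT sale_id, amount, sale_date, category
FROM sales;

-- If a table is instead partitioned by date (say, a ds column holding
-- 'yyyy-MM-dd' strings), filtering on the partition column lets Hive
-- read only the ~30 matching folders instead of scanning every date.
-- Assumes Hive 1.2+ for current_date().
SELECT *
FROM sales_by_date
WHERE ds >= date_sub(current_date(), 30);
```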
We need to make sure that we mention that we only want those dates close to where the table is being loaded into memory, so within a subquery or in the join condition. That's how we can avoid a full table scan. It's totally fine to add in some extra, functionally redundant conditions to make sure that you're loading as little of the table as possible.

I might have sold you on partitioning, but a note on partitioning on multiple columns: it's possible, but it only really helps if it cuts down on the amount of the table that gets scanned. So, pick partitions that you know you're actually going to use. The last tip is to pick partitions that have roughly similar sizes. Hive will split the query by partition and run it across multiple machines; if one partition is really large, everything else will be waiting on it to finish, and that slows the whole query down.

So, we've made it to the end of the second module, and you should be feeling really ready to go forth and create some new tables.