Now one of the most popular topics in SQL has had to mash up multiple data sources together in a single query to answer more complex insights. Now in this module we'll tackle how to append additional historical datasets together, through UNIONs, as well as how to join together different datasets horizontally through SQL JOINs. Let's walk through the basics and I'll highlight some key pitfalls along the way. All right, so now let's move into one of my most favorite topics to talk about which is enriching your dataset through the use of JOINs and UNIONs. And if your SQL guru, there's a couple of different tricks and tips that you can use inside a big query that I think you'll like. But let's go over some of the basics. A JOIN, for example if you have the tables for the IRS annual tax history for 2015, but you want to get the name of the charity. Now that's in two different tables. You actually have to link those two tables together, in SQL we call that a JOIN, and you do that on a common identifying field that shared between the two tables. And there's a lot of caveats when it comes to whether or not those fields are duplicative or which type of JOIN you're going to use. And we're going to get into all that nuance as part of this module. Now a little bit simpler than that is a UNION which on the right. If you have historical data that shares the same schema, you can actually append records or mash them together vertically, I like to say, through what's called the SQL UNION. So combining both of these two concepts together, you could actually bring together the annual tax filing history for 2012 or however long the dataset goes for 2012 to 2013, 2014 to 2015, and so on. And get an enriching information about the organizational details that's not present in the individual filings by using a JOIN. So let's explore these concepts a little bit more in depth with a walk through example. So the example that we're going to use to illustrate JOINs and merging of data is going to be temperature and weather station data. So the example that we're going to be using for this walkthrough is the NOAA weather dataset. NOAA is research agency in the US which tracks the patterns of weather and does a bunch of other cool things with meteorological studies. And as such, they have a lot of really, really good weather data. So we have temperature recordings, not just for today, but all the way back into the 1920s for a lot of their different weather stations all around the world. So you have these different temperature readings, and you have the locations of the weather stations that they belong to. And as you see with the dividing line, those two pieces of data are in two separate tables that we're going to have to join together. So it's many more tables than just two. So for those daily temperature readings, it actually goes back to 1929. So there's an individual table for the annual readings from 1929, [LAUGH] and I believe this is hourly, every hour, right? So you're talking millions and millions of records, all the way to the current day. And then on the right hand side you have a look up table that has the information about the stations that took those readings. So this is like Wake Island Airfield. The latitude and longitude and a bunch of other look up information that we don't want to repeatedly store in those daily temperature reading tables that we're going to use something like a JOIN to enrich those two together. So let's talk a little bit more about the nuances of that. Okay, so we need to find a unique identifier for our weather stations before we can even hope to join them against another table. So when you inherit a dataset, this is honestly one of the trickiest things to uncover, is how do you uniquely identify a row in a dataset that you've been given? So there's two fields that look like they're identifiers, the usaf for the United States Air Force, weather identifier there, usaf. And then you have the wban, the wban in there. So what you likely saw in the first lab as part of the first course is introduce you to counting, duplicative, or distinct records. And we use two simple account functions as you see there in the SQL code. The regular count, and the count distinct. That's a quick way to see whether or not those records are the same, meaning that it is unique for those row counts. And as you see here with the red X is they are not. So you have an interesting dilemma. If you don't have any one field that can uniquely identify a row, what could you potentially do? So two general options. One, you could create your own primary key. You could insert arbitrary row number and have that be unique identifier for that record set. But if we're going to have any hope of joining it against a different dataset, maybe we'll have to make a deal with what we have. And one of the things that you can do is create a combination key. Where the combination of these two identifiers, the usaf field and the wban field together kind of smashing them together through that concatenation or that merging, is in itself a unique identifier. So you can see for that first record, right? So for the 912450, you can see row 1 and row 4 are duplicative for the usaf. But if you merge them together with the wban, even though the wban has 41606 repeated three times, the combination of both of those fields together uniquely identifies them. Okay, so how are we actually going to join that combination key on all of these different tables? We need to somehow UNION potentially bring all these data tables historically together, mash them vertically. But that might be a really long effort. So let's cover exactly what a UNION is first before we talk about the syntax to how to do it. So UNION in brief, if you have tables with the same schema or at least from the field that you're selecting across the tables are the same, what you can then do is apply a UNION and it mashes those two tables together vertically. So if you have records from the year 1929 and then 1930 into two tables as you can see the result they're added vertically, it's going to give you a consolidated table with records from both 1929 and 1930. So let's go into a little bit more of nuance. So there's two different types of UNIONs. You can have a DISTINCT UNION, or you can have what's called a UNION ALL. So if you just wanted to blindly mash together all the records, without worrying about whether or not one table had a duplicative record from another table, you would be using the UNION ALL. So now say if you wanted to actually be mindful of duplicates across both the different tables, you would be using a UNION DISTINCT. Now here, we shouldn't see repetitive temperatures across different tables, since those temperatures are fenced in by the year of the particular table. But if you just want to be extra cautious, you could do that UNION DISTINCT.