In this week, I want to introduce you to a few more exploratory data analysis methods for distribution data. With a handle on our basic ggplot work, and an understanding of some of the fundamental chart types such as line, bar, scatter, and histograms, I think we can start to look at a few more advanced visual methods for distributions of data. While the theme of this week stays with visualization of distributions, I'm going to bring in a different dataset, this time from the Department of Natural Resources in the State of Michigan. This dataset, called creel data, looks at fishing on the Great Lakes, which together form the largest surface freshwater system in the world. The data captures the effort anglers, that is, people who fish, put in, and the results of that effort. Most of this data is collected manually by the Department of Natural Resources. Let's jump in and take a look.

I'm going to start my usual way: bring in tidyverse, janitor, and tidylog. I'm going to bring in tidylog regularly now. We're just going to read in the CSV and pipe it to clean_names. We can see here that we only have a few different variables: a unique identifier; time, as month and year; some location information, such as lake, county, and port site; a mode, which seems to be how the fishing was done; and then an estimate, which is either ANGTRIPS or EFFORT. The first of these estimates is how many trips anglers made in a time period, while the second is the number of hours they actually put in fishing.

Now, this data would be considered non-tidy in that, for a given date, there is both an estimate for the ANGTRIPS and one for the EFFORT. Those should really be represented by different columns, or variables. We can clean this up a bit to make it easier to work with. There are two functions which can help us convert rows to columns and vice versa: pivot_wider and pivot_longer. Right now, our data is in long form, because we have two rows for each observation, one with the angler trips and one with the effort. If we use pivot_wider, we can convert those pairs of rows into two columns, one which holds the estimate of the number of trips, and the other which holds the estimate of the effort.

Let's take a look at how to do that. First, we're going to drop the unique identifier. If we left it in there, it would cause problems, because pivot_wider would say, "Well, these rows aren't actually the same. What should I be doing with this?" We don't really need a unique identifier in here anyway. When I bring in the data, I'm going to do this with a negative selection. I find that a handy way to get rid of a column, instead of listing all of the columns that I want to keep. I'm also going to drop a bunch of other columns that we won't need for the analysis.

Now we want to pivot wider, that is, we want to add columns. We want the column names to come from the estimate variable, and the values to come from the estimate value. Every time you pivot wider like this, you need two things: the name of what your new column is going to be, and what its value should be. We tell pivot_wider which columns these are with the names_from and values_from arguments, respectively. I'm going to set names_from to the estimate column and values_from to the estimate value column. Let's take a look at what this would do to the first 10 or so entries.
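A minimal sketch of that pipeline is below. The file name ("fishing.csv") and the column names (id, estimate, estimate_value) are assumptions based on the narration, so the real course files may use different names.

```r
# Sketch of the read-in and first pivot, with hypothetical file and column names
library(tidyverse)  # read_csv, select, pivot_wider
library(janitor)    # clean_names
library(tidylog)    # logs what each dplyr/tidyr verb did

df <- read_csv("fishing.csv") %>%
  clean_names()

df %>%
  # negative selection: drop the unique identifier (and any other unneeded
  # columns) instead of listing every column we want to keep
  select(-id) %>%
  # one new column per unique value of `estimate`, filled from `estimate_value`
  pivot_wider(names_from = estimate, values_from = estimate_value) %>%
  head(10)
```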
You can see that we pivoted this out, and as we move across, we now have an ANGTRIPS value and an EFFORT value. For this first entry here, we've got the month of 4, we've got the site information, and we've got the year of 1989; the next row is a different date, with a different month. This is how things should look. You can see that we've got a unique year and month for each line, and then we have new variables for ANGTRIPS and EFFORT. We have widened our table, though since I dropped a bunch of columns we don't need, it doesn't necessarily look that much wider.

Let's now do this to all of our data. I'm just going to run this function across all of our data and take a look at the results. This doesn't look quite right. When we do this to all of our data, the values in the ANGTRIPS and EFFORT variables become vectors of doubles. If we page through most of this, they seem to be of length one, or maybe a few NA values, which should be fine. So what's happening? Why is pivot_wider creating these values as vectors?

Well, we need to do some sleuthing in our data. Usually, when this happens, it means that there are more than two rows that match somewhere, so pivot_wider isn't quite sure what to do, and it just decides to put vectors in those variables for us. pivot_wider never wants to lose data because of something it did implicitly. If we tell it which columns should pivot out, and those columns are not unique, with multiple entries for the rest of our data, pivot_wider is just going to put vectors inside the variables for us. Remember that a single column has to be of the same datatype, so if it finds one place where it has to put a vector of doubles, everything becomes a vector of doubles, even the entries that only hold a single value.

A handy function that we haven't looked at yet, which actually comes from the janitor package, is called get_dupes. This function allows you to specify which columns you're interested in, and tells you how many rows share the same values in those columns. Let's use it to find out what our issue is. By default, get_dupes returns a new column in our data called dupe_count. Let's sort by that. We can indicate the columns we want to exclude from the duplicate checking; in this case, it's just our estimate and estimate value columns. We expect those not to be duplicated, because those are the things we're trying to pivot out. Then I'm going to arrange descending by the dupe_count, so we'll see anything with a high dupe_count number first.

Let's take a look at this. Yeah, we've got some higher dupe_count numbers. We see that there are a number of entries which have three duplicates, when we only want two: one for the ANGTRIPS and one for the EFFORT. If we look at the first three entries, we see that there's one entry for angler trips and two entries for effort, with different values. Welcome to the chaos of real-world data.

Let's see how many instances of this bad data we have. I'm going to run get_dupes again, filter where dupe_count is greater than two, and then pipe that to nrow, which counts how many rows are left. We can see that this filter removes most of our data, and only 18 rows fit the criterion. I'm going to consider that an acceptable amount of data loss and build it into our data manipulation pipeline.
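Here is a rough sketch of that duplicate hunt, using the same assumed column names as before; excluding the two columns we plan to pivot out relies on get_dupes accepting a tidyselect-style negative selection.

```r
# Look for rows that duplicate each other on everything except the two
# columns we intend to pivot out
df %>%
  select(-id) %>%
  get_dupes(-c(estimate, estimate_value)) %>%  # adds a dupe_count column
  arrange(desc(dupe_count))

# How many rows have more than the expected two matches?
df %>%
  select(-id) %>%
  get_dupes(-c(estimate, estimate_value)) %>%
  filter(dupe_count > 2) %>%
  nrow()
```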
Before I do that though, I want you to think about the get_dupes function I just introduced. It identifies duplicates given a set of columns. We don't really need it, though; we have all of the tools to do this ourselves with dplyr. If we didn't have this function, how would you identify duplicates?

Now, let's finish off our data cleaning activity. I'm going to use get_dupes as a filter on our data, then pivot the remaining values. I call get_dupes, then filter where the dupe_count is two, so I keep just the things that I want, then drop the dupe_count column because we don't need it anymore, and then pivot. Let's see if that fixes our problem. Let's page over in our data a little bit here. Yeah, we see now that our ANGTRIPS and our EFFORT hold the numbers that we're interested in. Just notice, though, that we now have some non-clean names in there, some capitalized words, whereas previously, when we ran things through clean_names, everything came back in snake case. We have a table now where the observations are unique to a given date, location, and mode, the way the fishing is being done, and we have our angler trips and effort as new variables.

Now, there's a philosophical question at play here. What is an observation, really? When should we stop pivoting? For this data, I've defined an observation as a date, a location, and a mode. But somebody else might say, hey, it's just a date and a location, and that mode should be combined into columns with the angler trips and effort. This all depends on the question and what visualization you're interested in doing. Still, some might argue that if you only have a couple of locations and you're looking to compare between them, and every location has a measurement in a given month and year, maybe it would make sense to have the months and years as the columns. I don't think there's a clear answer to what a tidy DataFrame is in this context; it's really your inquiry that's going to drive what is tidy for you.

Here's an example of how we could pivot this even wider and create new columns for each pair of mode and angler trips or effort. You'll notice that a lot of it is empty, and for good reason: some methods of fishing, like ice fishing, aren't something you can do in the summer on some of the Great Lakes. I'm not going to save this; I'm just going to use it to demonstrate. We pivot out with the names from the mode and the values from ANGTRIPS and EFFORT, which gives us something like the Cartesian join we've seen before, a Cartesian product between the modes and the angler trips and effort measurements. Here's what that looks like: for ANGTRIPS_BOAT we have a number, but there are also entries for ANGTRIPS_PIER/DOC, ANGTRIPS_OPEN ICE, ANGTRIPS_SHANTY, which is a shack on the ice, and so forth. You can see that as we move across, we also have those columns for effort as well. The cleanup pipeline and this demonstration pivot are sketched just below.
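Here is a sketch of that cleanup pipeline, plus the demonstration pivot by mode. As before, the column names (estimate, estimate_value, mode, ANGTRIPS, EFFORT) are assumptions based on the narration.

```r
# Cleanup: keep only date/location/mode combinations with exactly one ANGTRIPS
# row and one EFFORT row, then pivot those pairs into two columns
clean_df <- df %>%
  select(-id) %>%
  get_dupes(-c(estimate, estimate_value)) %>%
  filter(dupe_count == 2) %>%
  select(-dupe_count) %>%
  pivot_wider(names_from = estimate, values_from = estimate_value)

# Demonstration only (not saved): one column per mode/measurement pair,
# producing names like ANGTRIPS_BOAT, EFFORT_BOAT, ANGTRIPS_SHANTY, ...
clean_df %>%
  pivot_wider(names_from = mode, values_from = c(ANGTRIPS, EFFORT))
```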
This wraps up our data for the week: fishing data. Michigan is a very outdoorsy state with lots of hiking, fishing, and hunting, and vast portions of the state have a very low population density and are quite remote. This week we saw how pivot_wider works to turn rows into columns and change what an observation is in our dataset. This is a pretty essential skill in most data manipulation activities. We also introduced the handy get_dupes function. One more thing, though, which I haven't shown you yet, is how to save your tables to a file. Up until this point, we've just copied and pasted our data from the previous lecture. But since we're going to be working with this data all week, I want to show you how we can write it to a file. It's not exactly rocket science. There are a number of different mechanisms, but it's still pretty common to use a CSV file. I'm going to run the data through clean_names to tidy up those new variable names, and then I just call write_csv and name the file clean_fishing.csv. The defaults for write_csv are all fine for us today, but you could do things like append to a file if it already exists, suppress writing the column names, or change delimiters. There's also a host of other functions for other file formats. A quick sketch of this save step is below. Now that we've got some clean data, let's go exploring.
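Putting that save step together, a minimal sketch might look like this, assuming the cleaned table is stored in a variable called clean_df:

```r
# Save the cleaned table to CSV with snake_case column names
clean_df %>%
  clean_names() %>%                 # e.g. ANGTRIPS -> angtrips
  write_csv("clean_fishing.csv")

# write_csv() accepts options such as append = TRUE and col_names = FALSE;
# for other delimiters use write_delim(), and readr also provides
# write_tsv(), write_rds(), and friends for other formats.
```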