In this lesson, we're going to see how MongoDB fits into a linear regression workflow, or, more generally, into an overall data science workflow. Specifically, we're going to see how MongoDB's aggregation framework is very useful when performing linear regression or other types of analysis on large or unstructured data sets. And then we're going to work through a practical example of using Scikit-learn to do a linear regression on some data stored in MongoDB.
So, there are many different types of data analysis that can be performed directly in MongoDB using its powerful aggregation framework. However, linear regression currently is not one of them. There are other tools, like Scikit-learn, that are much better suited to this type of analysis. So why are we talking about MongoDB when we're trying to talk about linear regression? Well, there are several reasons why you'd want to use MongoDB in your data analysis workflow, alongside more traditional data science tools like Python and Scikit-learn.
The first reason is that you might have a large volume of data that you need to analyze. When you have large volumes of data, data sizes that generally require more than one machine to store and process, you need to sample your data in some way, because a single machine doesn't have enough computing resources to run a linear regression on all of it. Moreover, you might have some type of unstructured data that needs to be transformed before you can analyze it. Or your data might just already live in MongoDB.
These are all reasons why you'd want to leverage MongoDB's powerful aggregation framework to pull data out of MongoDB, clean it, transform it, process large amounts of it, and then pipe that information into something like Scikit-learn, where you can then do your machine learning analysis.
Using the aggregation framework this way is going to be a much more efficient way to manage large amounts of unstructured data than trying to transform and clean that data in a traditional programming language like Python. Moreover, once you have performed that analysis, it's very easy to store your results in MongoDB for easy retrieval later down the road. But enough with words and diagrams. Let's actually see this in action.
So here, we're going to take a look at the 100YWeatherSmall data set, which is a collection of weather data collected by NOAA, the National Oceanic and Atmospheric Administration, here in the United States. And we're going to use MongoDB Compass to visualize the distribution of our data. As you can see here, there are a whole bunch of different data points in each document. But we're going to focus on three: air temperature, dew point, and pressure.
When you look at the distribution of the embedded value field for each of these top-level fields, you can see that we have some serious outliers. Over here, we have outliers for pressure in the 9,000s, outliers for dew point in the 900s, and outliers for air temperature also in the 900s. This is how NOAA signifies erroneous data. So we're going to need to filter these out before we perform our linear regression. And then we're going to perform that linear regression and see if we can predict air temperature given dew point and pressure. So let's go ahead and do that right now.
So here, I've already set up the imports that I need. We're going to connect to our MongoDB Atlas cluster, and then we're going to connect to the 100YWeatherSmall database and the data collection, which holds this NOAA data set. And like I was saying earlier, we want to filter out those outliers.
We want to filter out those erroneous values. And so, for airTemperature.value, we're going to say less than 900, for dewPoint.value less than 900, and for pressure.value less than 9,000, because pressure here is measured in hectopascals, and valid readings are frequently above 1,000. So, we can go ahead and create our filter. We also want to do a projection, so we can remove our _id and just keep these three fields. So, go ahead and do that.
We also want to sample this data. There are tons of data points in this collection, but we really only want a small sample, so we're going to take 10,000 documents. And with a sample stage, we'll get a random selection. We now have all of our stages, and we can pass them to the aggregate method and get our cursor. And then we're going to exhaust this cursor by wrapping it with list and storing that list in this weather data variable. Great.
So, let's take a look at one of these example documents. As you can see, we now have the three fields we care about: air temperature, dew point, and pressure. And we have the three values that correspond to those fields. Now that we know our data looks the way we want, we can use the json_normalize function from pandas to marshal this into a data frame. Now that it's in a data frame, let's make sure the data frame looks the way we want. And there we go, pandas created three columns holding the values for the three dimensions.
Now I want to turn on the matplotlib inline setting so that seaborn is able to display its graphs in the notebook. And here, I'm using the pairplot function on our data frame to create pair-wise comparisons of every variable against every other. We can visualize that right here. As you can see, air temperature, dew point, and pressure run along one axis.
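
The filter, projection, and sample stages described above can be sketched in Python roughly as follows. This is only an illustration, not the notebook from the lesson: the camel-case field name `airTemperature` is assumed from the spoken name, the helper names `build_pipeline` and `fetch_weather_data` are made up for this sketch, and the connection details in the comments are placeholders.

```python
def build_pipeline(sample_size=10_000):
    """Build the three aggregation stages: filter, projection, random sample.

    Field names are assumptions based on how the fields are spoken in the lesson.
    """
    # Drop NOAA's sentinel values for erroneous readings.
    match_stage = {
        "$match": {
            "airTemperature.value": {"$lt": 900},
            "dewPoint.value": {"$lt": 900},
            "pressure.value": {"$lt": 9000},
        }
    }
    # Remove _id and keep only the three embedded value fields.
    project_stage = {
        "$project": {
            "_id": 0,
            "airTemperature.value": 1,
            "dewPoint.value": 1,
            "pressure.value": 1,
        }
    }
    # Take a random selection of documents.
    sample_stage = {"$sample": {"size": sample_size}}
    return [match_stage, project_stage, sample_stage]


def fetch_weather_data(collection, sample_size=10_000):
    """Run the pipeline and exhaust the cursor into a list of documents.

    `collection` is expected to behave like a PyMongo collection, e.g.
    (placeholder URI, not executed here):
        # from pymongo import MongoClient
        # collection = MongoClient("<your Atlas URI>")["100YWeatherSmall"]["data"]
    """
    cursor = collection.aggregate(build_pipeline(sample_size))
    return list(cursor)
```

The list this returns can then be handed to `pandas.json_normalize`, which flattens each embedded `value` field into its own column, such as `dewPoint.value`.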
And the same three variables, air temperature, dew point, and pressure, run along the other axis. So when we look at these pair plots, we can see that dew point and air temperature have a pretty good linear correlation. And you can see the same chart again, mirrored, right here. Pressure and air temperature also have some interesting clustering, but I wouldn't say it's very linear in nature.
Now that we know there is a good correlation between dew point and air temperature, let's drop air temperature and create a data frame that just has dew point and pressure. And then we want to create another data frame that just has air temperature. That's because we want to try to predict air temperature, given a dew point and a pressure.
We can then create a linear regression object using the ordinary least-squares model from Scikit-learn. We're then going to split up our data frames so that we have both a training set and a test set, with our test set comprising only 20 percent of our data. Then we train the model, and that is as simple as calling reg.fit. We just call the fit method on our linear regression object with the training data, and it's really that easy.
We can now actually look at the line of best fit. We can look at the coefficients for our terms. This first term represents our dew point, and as you can see, it has a very large coefficient. It's very close to one. That's because, as we saw earlier, dew point has a high linear correlation with air temperature, whereas pressure still contributes a small component, but not nearly as much as dew point, which is what we observed earlier. We can also look at the intercept for our equation, which is negative 24. That doesn't really matter as much, but I just wanted to show you that you have access to both of these underlying values. And then we can go ahead and predict, using our test data set, and compute some predicted temperatures.
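
The split, fit, and predict steps just described can be sketched like this. To keep it self-contained, the sketch uses synthetic data with a made-up relationship between the variables rather than the NOAA sample, so the coefficients and intercept it prints will not match the ones in the lesson. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather sample (NOT real NOAA data).
rng = np.random.default_rng(0)
n = 1000
dew_point = rng.uniform(-10.0, 25.0, size=n)
pressure = rng.normal(1013.0, 8.0, size=n)
# Made-up relationship: temperature depends mostly on dew point.
air_temperature = 0.9 * dew_point + 0.02 * pressure + rng.normal(0.0, 2.0, size=n)

# Features: dew point and pressure. Target: air temperature.
X = np.column_stack([dew_point, pressure])
y = air_temperature

# Hold out 20 percent of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Ordinary least-squares linear regression.
reg = LinearRegression()
reg.fit(X_train, y_train)

print("coefficients (dew point, pressure):", reg.coef_)
print("intercept:", reg.intercept_)

# Predicted air temperatures for the held-out dew point / pressure pairs.
predictions = reg.predict(X_test)
```

Fixing `random_state` is only there to make the sketch repeatable. It is not something the lesson mentions.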
Since we know the actual air temperatures for these pairs of dew points and pressures that we're predicting here, we can just subtract the predictions from them using NumPy math, square the differences, and then take the mean of all those squares, and this right here will be our mean squared error. And so, when we look at this, we see a value of about six. Since this is a squared error, the units are degrees squared, so taking the square root tells us that, given a dew point and pressure, we'll typically be about two and a half degrees from the actual air temperature, which is not too bad. This value by itself isn't super useful, but it can be really helpful when you want to compare this linear model against some other, maybe more complex, model.
So, let's recap what we've learned. We've discussed some different reasons why you'd use the aggregation framework with MongoDB alongside your current data analysis workflow. And then we've seen how to actually perform a very basic linear regression analysis using Scikit-learn on data that's stored directly in MongoDB.
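
For reference, the mean squared error calculation described above (subtract, square, take the mean) can be sketched with NumPy as follows. The function name and the three sample temperatures are made up for illustration and are not values from the lesson.

```python
import numpy as np


def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))


# Made-up temperatures, purely for illustration.
actual = [10.0, 12.0, 15.0]
predicted = [11.0, 10.0, 15.0]

mse = mean_squared_error(actual, predicted)  # (1 + 4 + 0) / 3
rmse = mse ** 0.5  # back in the original units (degrees)
print(mse, rmse)
```

The square root is the number to read as "typical error in degrees", while the mean squared error itself is the one usually compared across models.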