In this lesson, we're going to discuss clustering algorithms. We'll cover how they work at a high level, explore scikit-learn's KMeans clustering algorithm, and talk about how we can pre-process data to obtain better results.

At a basic level, clustering involves calculating distances and moving points that we call centroids. In this image, the centroids are the green dots, which are initially placed randomly. The distance from each centroid to each point is calculated, and each point is assigned to its nearest centroid. Once all the distances have been calculated, each centroid is moved to the center of the points assigned to it, thereby minimizing those distances, and this process is repeated until the centroids no longer move and no points are assigned to a new centroid.

We'll be using KMeans from scikit-learn for clustering in this lesson. One important thing to note is that KMeans only works with numerical data, because we can't calculate Euclidean distances on non-numerical data, and that means we're going to do some pre-processing. So, let's go and see what that processing looks like in our notebook.

First, we want to make sure we have our necessary dependencies imported, and then we're going to go ahead and connect to our MongoDB Atlas cluster. We're going to use the UCI Machine Learning Repository again, and here we're going to use the pulsars dataset, which contains various measurements taken by radio telescopes. We'll see in a bit that it contains eight continuous features and one class, which is whether or not the object is a pulsar. Here, we'll just execute a simple find command for everything, projecting away the _id since it's not actually part of the original dataset. Then we're going to go ahead and marshal that into a dataframe and take a peek at our data. You can see that there are nine columns, eight of which are features and one of which is the class, and you can see this data is on a variety of different scales.

Let's go ahead and use seaborn to look at a pair plot of our data. This is pretty interesting: we can see some pretty good groupings and some good clusters. But let's go ahead and take a look at the correlation matrix. We're first going to separate the label from our features and then create a correlation matrix, and here's a helper to make the correlation matrix a lot prettier. Looking at the label column, we can see that there's a strong correlation between the label and ex_kurt_ip, as well as between the label and skew_ip. So, if we used just these two features alone, we would expect an accuracy of around 70 percent for predicting whether or not an object is a pulsar.

So, let's go ahead and perform the initial clustering. But before doing that, we want to determine how many clusters we want, and for that we create an elbow plot. We already know that we really need two clusters, because an object is either a pulsar or not a pulsar, and this data basically backs that up: you can see the sharpest drop happens at two clusters, and after that we have some pretty serious diminishing returns. But if we didn't know how many clusters we needed, we could see where the elbow is in the plot and use that to figure out how many clusters to aim for. Before running the algorithm, we're going to go ahead and split our data into a training and test set.
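To make the data loading and exploration steps concrete, here's a minimal sketch of what they might look like with pymongo, pandas, and seaborn. The connection string, database name, collection name, and the "class" column name are placeholders of my own, not the exact names from the course notebook.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pymongo import MongoClient

# Connect to the Atlas cluster (URI, database, and collection names here
# are placeholders -- substitute your own).
client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
collection = client["uci_datasets"]["pulsars"]

# Find everything, projecting away _id since it isn't part of the original
# dataset, then marshal the cursor into a DataFrame.
df = pd.DataFrame(list(collection.find({}, {"_id": 0})))
print(df.head())

# Pair plot to eyeball potential groupings in the raw features.
sns.pairplot(df, hue="class")
plt.show()

# Separate the label from the features, then look at the correlation matrix.
features = df.drop(columns=["class"])
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```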
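The elbow plot itself is just a loop: fit KMeans for a range of cluster counts and plot the inertia (the within-cluster sum of squared distances) for each. This sketch continues from the DataFrame above; the range of k values and the split parameters are my own choices, not necessarily the notebook's.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Fit KMeans for several values of k and record the inertia of each fit.
inertias = []
ks = range(1, 10)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(features)
    inertias.append(model.inertia_)

# The "elbow" is where the curve bends and further clusters stop helping much.
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

# Hold out a test set; the labels are only ever used for evaluation,
# never for fitting the clusters.
X_train, X_test, y_train, y_test = train_test_split(
    features, df["class"], test_size=0.2, random_state=42
)
```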
And it's pretty straightforward to run the algorithm from sklearn. I do want to point out that since we're not passing in the labels, this is unsupervised learning. So, go ahead and execute that. In this last cell here, we're computing a confusion matrix, and in this cell we're going to make the output very pretty. One of the great things about sklearn is that the documentation is very, very good, and I was able to copy and paste this plotting function directly from the documentation. Now take a look at our confusion matrix. You can see that we do a very good job of determining when something is not a pulsar, but we don't do such a great job when we're trying to classify it as a pulsar. So we're at around 70 percent accuracy, as we kind of expected, but this is really just because we're labeling most things as not a pulsar, and our data shows that most things are not a pulsar. Given how few actual pulsars there are, we should expect about 70 percent accuracy if we simply labeled everything as not a pulsar.

So, how can we improve this? Well, one thing we can do is scale our data, because the input scales we're currently dealing with vary greatly. When we look at the output of describe, this is pretty clear: we can see quite a range of max values across the different features. There are a number of ways to scale our data, and we can look at the distributions before and after. Before, values are all over the place; after, they're much more around the same values. Let's make sure we didn't destroy any information: looking at the correlations, we can see they're the same. Now, let's look at another elbow plot with our scaled data. You can see that our elbow is a little more pronounced now, and again it points us to two clusters, because that's where the elbow is most visible. Now, when we run the same algorithm again, you can see that we have a pretty sizable improvement, almost a 20 percent improvement actually. Looking at the confusion matrix, we can see that we're doing a much better job of classifying whether something is a pulsar, and so our precision and recall scores have gone up considerably.

But the question is, can we do even better? Here, we can try principal component analysis, which allows us to reduce the dimensionality of our problem. Let's see if applying this transformation helps us eke out better results. First, we apply PCA, reducing our data down to just two dimensions. Now, when we look at the elbow plot, we can see a very sharp elbow. And when we go through the same algorithm again, we see a marked difference, with 97 percent accuracy and an even better job of classifying things as not pulsar and pulsar. You can actually see the predicted clusters plotted here, and you can see the two clusters. In reality, though, some of these points bleed over into the other cluster, which is why we had those misclassifications.
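Here's roughly what that unsupervised fit and evaluation look like, continuing from the train/test split above. One caveat: KMeans cluster IDs are arbitrary (cluster 0 isn't guaranteed to mean "not a pulsar"), so a real notebook may need to remap the 0/1 assignment before scoring; this sketch just shows the mechanics.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, accuracy_score

# Unsupervised: the fit sees only the features, never the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X_train)

# Compare predicted cluster IDs against the held-out labels.
predictions = kmeans.predict(X_test)
print(confusion_matrix(y_test, predictions))
print("accuracy:", accuracy_score(y_test, predictions))
```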
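The scaling step might look something like this. Using StandardScaler is my assumption, since the lesson doesn't pin down which scaler the notebook uses; the important detail is fitting the scaler on the training set only and applying the same transform to the test set.

```python
from sklearn.preprocessing import StandardScaler

# Bring every feature onto a comparable scale so no single feature
# dominates the Euclidean distance calculation.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Re-run the same clustering on the scaled data.
kmeans_scaled = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans_scaled.fit(X_train_scaled)
print(confusion_matrix(y_test, kmeans_scaled.predict(X_test_scaled)))
```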
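And the PCA step is similar: project the scaled features down to two components, cluster in that reduced space, and plot. The scatter plot here is a rough stand-in for the cluster plot shown in the lesson, not a reproduction of it.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the scaled features down to two principal components,
# then cluster in the reduced space.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

kmeans_pca = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans_pca.fit_predict(X_train_pca)

# Points near the boundary between the two clusters are the ones most
# likely to bleed over and get misclassified.
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=cluster_ids, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```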
Let's recap what we covered. We saw how clustering works at a high level, and we saw how to quickly use the KMeans algorithm from sklearn. Moreover, we saw that if we're unsure how many clusters we need, we can run it multiple times with different numbers of clusters and plot those results in an elbow plot. We saw the importance of normalizing and scaling our data in order for this algorithm to be effective. And we saw how to make these results even better by applying other algorithms like PCA.