[MUSIC] Let's now dig into a more detailed description of this course content. The course overview that we just talked about is going to be presented in four different modules, so let's step through each of these modules and discuss, at a high level, what we will present.

In our first module, we're going to discuss the document retrieval task, where there's some document that a person is reading and we want to search over all these other documents to find a similar one that they might also be interested in reading. To do this, we're going to talk about nearest neighbor search, where we compute the distance between our query article and every other article out there, and then retrieve the document that's closest, the one with the smallest distance. There are a couple of key components of doing this nearest neighbor search: one is how we're going to represent our documents, and the second is how we're going to compute the similarity, or distance, between documents. Both how we represent our data and how we compute these distances really influence what we end up thinking of as a nearest neighbor, and we're going to go through various choices for each of these components and talk about some of the tradeoffs that are made.

Then we're going to turn to how to scale up nearest neighbor search to really large data sets, because our brute-force approach of computing distances to every single data point in our data set is computationally really intensive for performing just one nearest neighbor query. So instead, we're going to talk about a really cool data structure called KD-trees that, in the context of nearest neighbor search, allows us to prune out a large portion of the search space when we go to compute our nearest neighbor. At least, it can often do that, though you can get unlucky. But KD-trees, although very cool, don't work very well in high-dimensional spaces. So if you think of our documents as having a very large vocabulary, and our document representation as being based on the size of that vocabulary, KD-trees no longer seem like that attractive an approach for doing our nearest neighbor search. Instead, we're going to talk about ways to do approximate nearest neighbor search using something called locality sensitive hashing. One of the concepts we're going to teach in this course is that in a lot of applications we don't actually care about finding the exact nearest neighbor; one that's pretty close is good enough. And when we're willing to accept that type of approximation, we can do things much more efficiently in some cases.
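Before moving on to clustering, here is a minimal sketch of the brute-force retrieval idea described above, assuming a simple word-count (bag-of-words) representation and Euclidean distance; the helper names are illustrative and not anything defined in the course materials.

```python
import numpy as np
from collections import Counter

def bag_of_words(docs):
    """One simple representation choice: raw word counts over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for word, count in Counter(doc.lower().split()).items():
            counts[row, index[word]] = count
    return counts

def nearest_neighbor(counts, query_row):
    """Brute force: measure the distance from the query document to every other one."""
    distances = np.linalg.norm(counts - counts[query_row], axis=1)  # one distance choice
    distances[query_row] = np.inf  # don't return the query document itself
    return int(np.argmin(distances))

docs = ["the physics of black holes",
        "new phone released this week",
        "black holes and quantum physics explained"]
counts = bag_of_words(docs)
print(nearest_neighbor(counts, query_row=0))  # document 2 shares the most words with document 0
```

A scan like this touches every document for every query, which is exactly the per-query cost that KD-trees and locality sensitive hashing aim to cut down, either exactly or approximately.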
Then, in the second module, we're going to turn to this idea of clustering, where our goal is going to be to discover groups of related articles based just on the content of the articles themselves. So this is going to be an example of an unsupervised learning task. The first algorithm we're going to present for performing this type of clustering is k-means, where the goal of k-means is to minimize the sum of the squared distances between data points and the set of learned cluster centers. We can think of this as a partitioning of our data set, where we're making hard assignments of data points to specific clusters. And again, this is an example of an unsupervised algorithm, because we're just taking the inputs themselves and trying to learn this structure from them without having any output labels; we don't know the cluster labels for any of the data points in this data set. Then, later in the second module, we're going to turn to scaling up k-means to large data sets using something called MapReduce. MapReduce is a framework for parallelizing operations over a number of different machines, and we're going to discuss the general MapReduce framework, which can be used in a wide range of different settings, as well as its specific application to k-means.

Then, in our third module, we're going to turn to probabilistic models for performing clustering, and the goal here is to capture uncertainty in the cluster assignments that we're making. So, for example, maybe there's an article where it's not really clear: is it about science, or is it about world news? Or, really, since we don't have these labels: is it more related to the group of articles that have content about science, or to the group of articles that have content about world news? There are some articles where it might not be clear exactly what assignment to make, and we want the algorithm to present us with this uncertainty instead of just providing a hard assignment of a document to a given cluster. One thing we can do with this type of output is help learn a user's preferences over a set of different topics. Because if we have feedback from the user about articles that they liked or didn't like, and we also have the uncertainty about which cluster each article was associated with, we can put both of these pieces of information together to get a better description of what impact the user's feedback should have on our learned topic preferences for that user.

The probabilistic model that we're going to use for clustering is called a mixture model. Here, when we're thinking about assigning data points to clusters, it's not just the cluster center that comes into play, like in k-means, but also the shape of the cluster. And here, again unlike in k-means, we're not going to make hard assignments of data points to clusters; we're going to make what are called soft assignments, where data points can have weights in multiple clusters, and those weights are based on our uncertainty about the assignment of the data point to each cluster. So the goal here is to go from a set of unlabeled data points, like the gray cloud of points shown here, to a set of colored points, but now, instead of making hard assignments, there's going to be an entire spectrum of colors representing the allocation of data points to clusters and the uncertainty in that allocation. For example, if you look at the points lying between the blue and green clusters, there are points whose shading is somewhere in between those two colors, meaning there's uncertainty about whether those observations are assigned to the blue cluster or the green cluster; likewise between the green and fuchsia clusters. The algorithm we're going to go through for inferring these soft assignments of data points to clusters is referred to as expectation maximization, or the EM algorithm.
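As a rough sketch of the contrast being drawn here, written in generic notation rather than anything taken from the later lectures: k-means makes a hard assignment $z_i$ of each data point $x_i$ to one cluster center $\mu_{z_i}$ by minimizing

$$\sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert^2,$$

whereas a mixture model, with a Gaussian mixture as the standard example, replaces the hard assignment with a soft responsibility

$$r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},$$

where the mixture weights $\pi_k$, means $\mu_k$, and covariances $\Sigma_k$ (the "shape" of each cluster) are the parameters being learned. The EM algorithm alternates between computing these responsibilities and using them to update the cluster parameters.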
And finally, in our fourth module, we're going to describe a more intricate probabilistic model for analyzing documents, called Latent Dirichlet Allocation, or LDA. To motivate the ideas behind LDA, let's look at the specific document shown here. What we see is that this document has a bunch of science-related words, so maybe, based on the content of this document, we might group this article with other documents that are also about science-related topics. But this article also has a lot of tech-related words, so maybe a better description is to group this article with the tech articles. Really, though, this document is about both science and technology, and that's exactly what LDA allows us to do: it provides what's called mixed membership, where a given document is allowed to belong to multiple different topics. And not only is a document associated with multiple topics, but LDA provides the relative proportions of these topics in the document. In particular, every topic within the LDA model is described as a distribution over the words in the vocabulary. So what we see is that maybe the science topic has more weight on science-related words, whereas the tech topic has more weight on words related to technology, and likewise for sports and so on. But the crazy thing is, we're going to be doing this in a completely unsupervised manner. We're just going to provide the words of a document, and this for all documents in the corpus. So we have bunches and bunches of words, for every document in the corpus, and just from that alone we're going to infer these topic-specific distributions over the vocabulary, as well as the document-specific proportions of topics in each document. We're going to go through exactly how we can extract this kind of information, but what we see is that this is another example of discovering structure in data in this unsupervised manner. [MUSIC]
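To summarize the structure LDA assumes, here is its standard generative description, written in one common notation (the later lectures may use different symbols): each topic $k$ is a distribution over the vocabulary, $\beta_k \sim \mathrm{Dirichlet}(\eta)$; each document $d$ has its own topic proportions, $\theta_d \sim \mathrm{Dirichlet}(\alpha)$; and each word position $n$ in document $d$ first draws a topic assignment $z_{dn} \sim \mathrm{Categorical}(\theta_d)$ and then a word $w_{dn} \sim \mathrm{Categorical}(\beta_{z_{dn}})$. Only the words $w_{dn}$ are observed; the topic distributions $\beta_k$, the proportions $\theta_d$, and the assignments $z_{dn}$ are all inferred from the corpus, which is what makes this an unsupervised, mixed-membership model.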