Machine Learning: Clustering & Retrieval

This course is part of the Machine Learning Specialization

Taught in English

Instructors: Emily Fox, Carlos Guestrin

97,469 already enrolled

Course

Gain insight into a topic and learn the fundamentals

4.7

(2,344 reviews)

91%

17 hours (approximately)

Flexible schedule

Learn at your own pace

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

15 quizzes

There are 6 modules in this course

Case Studies: Finding Similar Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? Moreover, what if there are millions of other documents? Each time you want to retrieve a new document, do you need to search through all other documents? How do you group similar documents together? How do you discover new, emerging topics that the documents cover?

In this third case study, finding similar documents, you will examine similarity-based algorithms for retrieval. You will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn the document clusterings, and see how to scale the methods using MapReduce.

Learning Outcomes: By the end of this course, you will be able to:
- Create a document retrieval system using k-nearest neighbors.
- Identify various similarity metrics for text data.
- Reduce computations in k-nearest neighbor search by using KD-trees.
- Produce approximate nearest neighbors using locality sensitive hashing.
- Compare and contrast supervised and unsupervised learning tasks.
- Cluster documents by topic using k-means.
- Describe how to parallelize k-means using MapReduce.
- Examine probabilistic clustering approaches using mixture models.
- Fit a mixture of Gaussians model using expectation maximization (EM).
- Perform mixed membership modeling using latent Dirichlet allocation (LDA).
- Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
- Compare and contrast initialization techniques for non-convex optimization objectives.
- Implement these techniques in Python.

Clustering and retrieval are among the most high-impact machine learning tools. Retrieval is used in almost every application and device we interact with, for example to provide a set of products related to the one a shopper is currently considering, or a list of people you might want to connect with on a social media platform. Clustering can be used to aid retrieval, but it is also a more broadly useful tool for automatically discovering structure in data, such as uncovering groups of similar patients.

This introduction to the course provides you with an overview of the topics we will cover and the background knowledge and resources we assume you have.

What's included

4 videos, 5 readings

4 videos (total 24 minutes)

Welcome and introduction to clustering and retrieval tasks (6 minutes)
Course overview (3 minutes)
Module-by-module topics covered (8 minutes)
Assumed background (6 minutes)

5 readings (total 45 minutes)

Important Update regarding the Machine Learning Specialization (10 minutes)
Slides presented in this module (10 minutes)
Software tools you'll need for this course (10 minutes)
A big week ahead! (10 minutes)
Get help and meet other learners. Join your Community! (5 minutes)

We start the course by considering a retrieval task of fetching a document similar to one someone is currently reading. We cast this problem as one of nearest neighbor search, which is a concept we have seen in the Foundations and Regression courses. However, here, you will take a deep dive into two critical components of the algorithms: the data representation and metric for measuring similarity between pairs of datapoints. You will examine the computational burden of the naive nearest neighbor search algorithm, and instead implement scalable alternatives using KD-trees for handling large datasets and locality sensitive hashing (LSH) for providing approximate nearest neighbors, even in high-dimensional spaces. You will explore all of these ideas on a Wikipedia dataset, comparing and contrasting the impact of the various choices you can make on the nearest neighbor results produced.
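To make the retrieval setup concrete, here is a minimal sketch of brute-force nearest neighbor search over tf-idf vectors with cosine distance. It is an illustration under stated assumptions, not the course's assignment code: the four toy documents, the whitespace tokenizer, and the function names `tf_idf`, `cosine_distance`, and `k_nearest` are all hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one sparse {word: weight} vector per document (toy tf-idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({w: c * math.log(n_docs / doc_freq[w])
                        for w, c in counts.items()})
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def k_nearest(query_index, vectors, k):
    """Brute-force search: compare the query against every other document."""
    distances = [(cosine_distance(vectors[query_index], vec), i)
                 for i, vec in enumerate(vectors) if i != query_index]
    return [i for _, i in sorted(distances)[:k]]

# Hypothetical toy corpus, standing in for the Wikipedia dataset.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "the cat chased the dog",
]
vectors = tf_idf(docs)
neighbors = k_nearest(0, vectors, k=2)  # indices of the 2 closest documents
```

Every query here touches every document, which is the computational burden that KD-trees and LSH are meant to reduce.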

What's included

22 videos, 4 readings, 5 quizzes

22 videos (total 136 minutes)

Retrieval as k-nearest neighbor search (2 minutes)
1-NN algorithm (2 minutes)
k-NN algorithm (6 minutes)
Document representation (5 minutes)
Distance metrics: Euclidean and scaled Euclidean (6 minutes)
Writing (scaled) Euclidean distance using (weighted) inner products (4 minutes)
Distance metrics: Cosine similarity (9 minutes)
To normalize or not and other distance considerations (6 minutes)
Complexity of brute force search (1 minute)
KD-tree representation (9 minutes)
NN search with KD-trees (7 minutes)
Complexity of NN search with KD-trees (5 minutes)
Visualizing scaling behavior of KD-trees (4 minutes)
Approximate k-NN search using KD-trees (7 minutes)
Limitations of KD-trees (3 minutes)
LSH as an alternative to KD-trees (4 minutes)
Using random lines to partition points (5 minutes)
Defining more bins (3 minutes)
Searching neighboring bins (8 minutes)
LSH in higher dimensions (4 minutes)
(OPTIONAL) Improving efficiency through multiple tables (22 minutes)
A brief recap (2 minutes)

4 readings (total 40 minutes)

Slides presented in this module (10 minutes)
Choosing features and metrics for nearest neighbor search (10 minutes)
(OPTIONAL) A worked-out example for KD-trees (10 minutes)
Implementing Locality Sensitive Hashing from scratch (10 minutes)

5 quizzes (total 150 minutes)

Representations and metrics (30 minutes)
Choosing features and metrics for nearest neighbor search (30 minutes)
KD-trees (30 minutes)
Locality Sensitive Hashing (30 minutes)
Implementing Locality Sensitive Hashing from scratch (30 minutes)
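
The LSH topics listed above (random lines to partition points, defining more bins, searching neighboring bins) can be sketched roughly as follows. This is a hedged illustration using random hyperplanes through the origin; the function names, the Gaussian sampling of hyperplanes, and the choice to search only bins that differ in one bit are assumptions, not the course's reference implementation.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes through the origin (one normal vector each)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def bin_index(point, hyperplanes):
    """One bit per hyperplane: 1 if the point lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        score = sum(p * x for p, x in zip(plane, point))
        bits = (bits << 1) | (1 if score >= 0 else 0)
    return bits

def build_table(points, hyperplanes):
    """Hash table mapping each bin index to the points that fall in it."""
    table = defaultdict(list)
    for i, point in enumerate(points):
        table[bin_index(point, hyperplanes)].append(i)
    return table

def query(point, points, hyperplanes, table, search_neighbors=True):
    """Approximate 1-NN: scan the query's bin, plus bins one bit flip away."""
    home = bin_index(point, hyperplanes)
    candidate_bins = {home}
    if search_neighbors:
        candidate_bins |= {home ^ (1 << b) for b in range(len(hyperplanes))}
    candidates = [i for b in candidate_bins for i in table.get(b, [])]

    def sq_dist(i):
        return sum((a - c) ** 2 for a, c in zip(points[i], point))

    # None means no candidate was found in the searched bins.
    return min(candidates, key=sq_dist) if candidates else None

# Hypothetical toy data in 3 dimensions with 4 hyperplanes (16 bins).
planes = make_hyperplanes(num_planes=4, dim=3, seed=1)
points = [[1.0, 2.0, 3.0], [-1.0, 0.5, 2.0], [0.3, -2.0, 1.0]]
table = build_table(points, planes)
nearest = query(points[2], points, planes, table)
```

Because only a few bins are scanned, the result is approximate: the true nearest neighbor can sit in a bin that was never searched, and searching more bins trades speed for accuracy.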

In clustering, our goal is to group the datapoints in our dataset into disjoint sets. Motivated by our document analysis case study, you will use clustering to discover thematic groups of articles by "topic". These topics are not provided in this unsupervised learning task; rather, the idea is to output such cluster labels that can be post-facto associated with known topics like "Science", "World News", etc. Even without such post-facto labels, you will examine how the clustering output can provide insights into the relationships between datapoints in the dataset. The first clustering algorithm you will implement is k-means, which is the most widely used clustering algorithm out there. To scale up k-means, you will learn about the general MapReduce framework for parallelizing and distributing computations, and then how the iterates of k-means can utilize this framework. You will show that k-means can provide an interpretable grouping of Wikipedia articles when appropriately tuned.
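
As a rough illustration of the k-means iterations described above, the sketch below alternates between assigning points to the closest center and recomputing each center as the mean of its points. The 2-D toy points, the plain random initialization (not k-means++), and the function names are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, num_iters=100, seed=0):
    rng = random.Random(seed)
    # Assumption: plain random initialization from the data, not k-means++.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(num_iters):
        # Step 1: hard-assign each point to its closest center.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the centers are final
        assignments = new_assignments
        # Step 2: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignments = k_means(points, k=2)
```

On document data the points would be tf-idf vectors rather than 2-D coordinates, and the result can depend on the initialization, which is why the module covers k-means++.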

What's included

13 videos, 2 readings, 3 quizzes

13 videos (total 78 minutes)

The goal of clustering (3 minutes)
An unsupervised task (6 minutes)
Hope for unsupervised learning, and some challenge cases (4 minutes)
The k-means algorithm (7 minutes)
k-means as coordinate descent (6 minutes)
Smart initialization via k-means++ (4 minutes)
Assessing the quality and choosing the number of clusters (9 minutes)
Motivating MapReduce (8 minutes)
The general MapReduce abstraction (5 minutes)
MapReduce execution overview and combiners (6 minutes)
MapReduce for k-means (7 minutes)
Other applications of clustering (7 minutes)
A brief recap (1 minute)

2 readings (total 20 minutes)

Slides presented in this module (10 minutes)
Clustering text data with k-means (10 minutes)

3 quizzes (total 76 minutes)

k-means (30 minutes)
Clustering text data with K-means (16 minutes)
MapReduce for k-means (30 minutes)
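
One way to picture how a k-means iteration fits the MapReduce abstraction is the single-process simulation below. It is a sketch under assumptions: the map step emits a (cluster index, (point, 1)) pair per datapoint and the reduce step sums points and counts per cluster. In a real MapReduce system the map calls would run on different machines and the framework would group the pairs by key; the function names here are hypothetical.

```python
from collections import defaultdict

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_phase(points, centers):
    """Map: emit (closest-center index, (point, 1)) for every datapoint."""
    for p in points:
        j = min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
        yield j, (p, 1)

def reduce_phase(pairs, centers):
    """Reduce: sum points and counts per key, then average into new centers."""
    sums = {}
    counts = defaultdict(int)
    for j, (p, n) in pairs:
        sums[j] = [s + x for s, x in zip(sums.get(j, [0.0] * len(p)), p)]
        counts[j] += n
    new_centers = [list(c) for c in centers]  # empty clusters keep old center
    for j in sums:
        new_centers[j] = [s / counts[j] for s in sums[j]]
    return new_centers

def k_means_iteration(points, centers):
    return reduce_phase(map_phase(points, centers), centers)

# Hypothetical toy data and starting centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = k_means_iteration(points, centers)
```

Because sums and counts can be added in any order, partial sums could also be computed locally before the reduce step, which is the role a combiner plays.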

In k-means, each observation is hard-assigned to a single cluster, and these assignments are based only on the cluster centers, without incorporating shape information. In our second module on clustering, you will perform probabilistic model-based clustering that (1) provides a more descriptive notion of a "cluster" and (2) accounts for uncertainty in the assignment of datapoints to clusters via "soft assignments". You will explore and implement a broadly useful algorithm called expectation maximization (EM) for inferring these soft assignments, as well as the model parameters. To gain intuition, you will first consider a visually appealing image clustering task. You will then cluster Wikipedia articles, handling the high dimensionality of the tf-idf document representation.
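
To illustrate how EM alternates between soft assignments and parameter updates, here is a minimal sketch for a mixture of one-dimensional Gaussians. The toy data, the initial parameters, the fixed iteration count, and the small variance floor are assumptions chosen for the example; the module itself works with images and high-dimensional tf-idf vectors.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, num_iters=50):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(num_iters):
        # E-step: responsibility resp[i][j] = P(cluster j | x_i, current params).
        resp = []
        for x in data:
            likelihoods = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                           for j in range(k)]
            total = sum(likelihoods)
            resp.append([l / total for l in likelihoods])
        # M-step: re-estimate parameters using the soft assignments as weights.
        for j in range(k):
            soft_count = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / soft_count
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / soft_count
            variances[j] = max(var, 1e-6)  # assumption: floor to avoid collapse
            weights[j] = soft_count / len(data)
    return means, variances, weights, resp

# Hypothetical toy data: two groups centered near 1 and 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

Each row of `resp` sums to 1, so a datapoint lying between two clusters would be shared between them rather than forced into one, which is the contrast with the hard assignments of k-means.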

What's included

15 videos, 4 readings, 3 quizzes

15 videos (total 91 minutes)

Motivating probabilistic clustering models (8 minutes)
Aggregating over unknown classes in an image dataset (6 minutes)
Univariate Gaussian distributions (2 minutes)
Bivariate and multivariate Gaussians (7 minutes)
Mixture of Gaussians (6 minutes)
Interpreting the mixture of Gaussian terms (5 minutes)
Scaling mixtures of Gaussians for document clustering (5 minutes)
Computing soft assignments from known cluster parameters (7 minutes)
(OPTIONAL) Responsibilities as Bayes' rule (5 minutes)
Estimating cluster parameters from known cluster assignments (6 minutes)
Estimating cluster parameters from soft assignments (8 minutes)
EM iterates in equations and pictures (6 minutes)
Convergence, initialization, and overfitting of EM (9 minutes)
Relationship to k-means (3 minutes)
A brief recap (1 minute)

4 readings (total 40 minutes)

Slides presented in this module (10 minutes)
(OPTIONAL) A worked-out example for EM (10 minutes)
Implementing EM for Gaussian mixtures (10 minutes)
Clustering text data with Gaussian mixtures (10 minutes)

3 quizzes (total 90 minutes)

EM for Gaussian mixtures (30 minutes)
Implementing EM for Gaussian mixtures (30 minutes)
Clustering text data with Gaussian mixtures (30 minutes)

The clustering model inherently assumes that data divide into disjoint sets, e.g., documents by topic. Often, however, our data objects are better described via memberships in a collection of sets, e.g., multiple topics. In our fourth module, you will explore latent Dirichlet allocation (LDA) as an example of such a mixed membership model that is particularly useful in document analysis. You will interpret the output of LDA and explore various ways that output can be used, such as a set of learned document features. The mixed membership modeling ideas you learn about through LDA for document analysis carry over to many other interesting models and applications, like social network models where people have multiple affiliations.

Throughout this module, we introduce aspects of Bayesian modeling and a Bayesian inference algorithm called Gibbs sampling. You will be able to implement a Gibbs sampler for LDA by the end of the module.
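
As a rough picture of what a Gibbs sampler for LDA does, the sketch below implements collapsed Gibbs sampling on a toy corpus of word ids. It is a minimal illustration, not the course's implementation: the hyperparameter values `alpha` and `beta`, the number of sweeps, the toy documents, and all function names are assumptions.

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                        num_sweeps=200, seed=0):
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k

    def update(d, k, w, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every word token.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, k, w, +1)

    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, z[d][i], w, -1)  # remove this token from the counts
                # p(z = t | all other assignments) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                probs = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                         / (topic_total[t] + vocab_size * beta)
                         for t in range(num_topics)]
                z[d][i] = rng.choices(range(num_topics), weights=probs)[0]
                update(d, z[d][i], w, +1)  # add it back under the new topic
    return doc_topic, topic_word

# Hypothetical toy corpus: each document is a list of word ids in [0, 6).
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0], [3, 5, 4, 4]]
doc_topic, topic_word = collapsed_gibbs_lda(docs, num_topics=2, vocab_size=6)
```

The returned counts are one sample from the sampler. Normalizing a row of `doc_topic` (after adding `alpha`) gives an estimate of that document's topic proportions, which is one way the output can serve as learned document features.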

What's included

12 videos, 2 readings, 3 quizzes

12 videos (total 57 minutes)

Mixed membership models for documents (3 minutes)
An alternative document clustering model (4 minutes)
Components of latent Dirichlet allocation model (2 minutes)
Goal of LDA inference (5 minutes)
The need for Bayesian inference (4 minutes)
Gibbs sampling from 10,000 feet (5 minutes)
A standard Gibbs sampler for LDA (9 minutes)
What is collapsed Gibbs sampling? (3 minutes)
A worked example for LDA: Initial setup (4 minutes)
A worked example for LDA: Deriving the resampling distribution (7 minutes)
Using the output of collapsed Gibbs sampling (4 minutes)
A brief recap (1 minute)

2 readings (total 20 minutes)

Slides presented in this module (10 minutes)
Modeling text topics with Latent Dirichlet Allocation (10 minutes)

3 quizzes (total 84 minutes)

Latent Dirichlet Allocation (30 minutes)
Learning LDA model via Gibbs sampling (30 minutes)
Modeling text topics with Latent Dirichlet Allocation (24 minutes)

In the conclusion of the course, we will recap what we have covered. This includes both techniques specific to clustering and retrieval and foundational machine learning concepts that are more broadly useful.

We provide a quick tour of an alternative clustering approach called hierarchical clustering, which you will experiment with on the Wikipedia dataset. Following this exploration, we discuss how clustering-type ideas can be applied in other areas like segmenting time series. We then briefly outline some important clustering and retrieval ideas that we did not cover in this course.

We conclude with an overview of what's in store for you in the rest of the specialization.
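
For a sense of how hierarchical clustering builds its groups, here is a minimal sketch of agglomerative clustering. It assumes single linkage (cluster distance is the smallest pairwise Euclidean distance) and a stopping rule based on a target number of clusters; both choices, the toy points, and the function name are assumptions for illustration.

```python
import math

def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    def dist(i, j):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))

    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # (merge distance, cluster A, cluster B), in merge order
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical toy data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = single_linkage(points, num_clusters=2)
```

The `merges` list records the distance at which each merge happened, which is the information a dendrogram displays; cutting the dendrogram at a chosen height corresponds to stopping the merging early.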

What's included

12 videos, 2 readings, 1 quiz

12 videos (total 62 minutes)

Module 1 recap (10 minutes)
Module 2 recap (3 minutes)
Module 3 recap (6 minutes)
Module 4 recap (7 minutes)
Why hierarchical clustering? (2 minutes)
Divisive clustering (4 minutes)
Agglomerative clustering (2 minutes)
The dendrogram (4 minutes)
Agglomerative clustering details (7 minutes)
Hidden Markov models (9 minutes)
What we didn't cover (2 minutes)
Thank you! (1 minute)

2 readings (total 20 minutes)

Slides presented in this module (10 minutes)
Modeling text data with a hierarchy of clusters (10 minutes)

1 quiz (total 6 minutes)

Modeling text data with a hierarchy of clusters (6 minutes)

Instructors

Instructor ratings

4.8 (90 ratings)

Emily Fox

University of Washington

6 courses, 471,635 learners

Carlos Guestrin

University of Washington

8 courses, 472,374 learners

Offered by

University of Washington

Learner reviews

4.7 (2,344 reviews)

5 stars: 74.40%
4 stars: 19.11%
3 stars: 4.73%
2 stars: 0.72%
1 star: 1.02%

Frequently asked questions

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.
The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy.

Modules in this course

Welcome
Nearest Neighbor Search
Clustering with k-means
Mixture Models
Mixed Membership Modeling via Latent Dirichlet Allocation
Hierarchical Clustering & Closing Remarks
