Distributed Computing with Spark SQL

This course is part of the Learn SQL Basics for Data Science Specialization

Taught in English

Some content may not be translated

Instructor: Brooke Wenig

45,781 already enrolled

Included with Coursera Plus

Course

Gain insight into a topic and learn the fundamentals

4.5 (661 reviews)

86%

Intermediate level

Some related experience required

13 hours (approximately)

Flexible schedule

Learn at your own pace

What you'll learn

Use the collaborative Databricks workspace to write scalable Spark SQL code that executes against a cluster of machines
Inspect the Spark UI to analyze query performance and identify bottlenecks
Create an end-to-end pipeline that reads data, transforms it, and saves the result
Build a medallion (bronze, silver, gold) lakehouse architecture with Delta Lake to ensure the reliability, scalability, and performance of your data

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

8 quizzes


Build your subject-matter expertise

When you enroll in this course, you'll also be enrolled in the Learn SQL Basics for Data Science Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV

Share it on social media and in your performance review

There are 4 modules in this course

This course is all about big data. It's for students with SQL experience who want to take the next step on their data journey by learning distributed computing using Apache Spark. Students will gain a thorough understanding of this open-source standard for working with large datasets, along with the fundamentals of data analysis using SQL on Spark, setting the foundation for combining data with advanced analytics at scale and in production environments. The four modules build on one another, and by the end of the course you will understand the Spark architecture, queries within Spark, common ways to optimize Spark SQL, and how to build reliable data pipelines.

The first module introduces Spark, Spark SQL, and the Databricks environment, including how Spark distributes computation. Module 2 covers the core concepts of Spark, such as storage vs. compute, caching, partitions, and troubleshooting performance issues via the Spark UI. It also covers new features in Apache Spark 3.x, such as Adaptive Query Execution. The third module focuses on engineering data pipelines, including connecting to databases, schemas and data types, file formats, and writing reliable data. The final module covers data lakes, data warehouses, and lakehouses. Students build production-grade data pipelines by combining Spark with the open-source project Delta Lake. By the end of this course, students will have honed their SQL and distributed computing skills, becoming more adept at advanced analysis and setting the stage for a transition to more advanced analytics as Data Scientists.

Introduction to Spark

In this module, you will learn to discuss the core concepts of distributed computing and to recognize when and where to apply them. You'll be able to identify the basic data structure of Apache Spark™, known as a DataFrame. Additionally, you will use the collaborative Databricks workspace to write SQL code that executes against a cluster of machines.
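As a rough illustration of the distributed-computing idea this module introduces, the sketch below splits a dataset into partitions, aggregates each partition independently, and then merges the partial results. It is plain Python, not Spark code. The function names (`split_into_partitions`, `count_partition`, `merge_counts`, `distributed_count`) and the sample data are hypothetical and do not come from the course materials; threads stand in for the machines of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin across a fixed number of partitions."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count_partition(partition):
    """Partial aggregation: count values within a single partition only."""
    return Counter(partition)


def merge_counts(partials):
    """Final aggregation: combine the per-partition counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def distributed_count(rows, num_partitions=4):
    """Count occurrences of each value using partition-level work plus a merge."""
    partitions = split_into_partitions(rows, num_partitions)
    # Threads stand in for the worker machines of a real cluster (assumption).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_partition, partitions))
    return merge_counts(partials)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    cities = ["Davis", "Oakland", "Davis", "Fresno", "Oakland", "Davis"]
    print(dict(distributed_count(cities, num_partitions=3)))
```

The point of the design is that `count_partition` never needs to see another partition's rows, so only the small partial results have to be brought together in the final merge.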

What's included

6 videos, 3 readings, 2 quizzes, 1 discussion prompt

6 videos (total 43 minutes)

Course Introduction (6 minutes)
Why Distributed Computing? (9 minutes)
Spark DataFrames (7 minutes)
The Databricks Environment (9 minutes)
SQL in Notebooks (6 minutes)
Import Data (4 minutes)

3 readings (total 80 minutes)

A Note From UC Davis (10 minutes)
Readings and Resources (40 minutes)
Assignment #1 - Queries in Spark SQL (30 minutes)

2 quizzes (total 60 minutes)

Assignment #1 Quiz - Queries in Spark SQL (30 minutes)
Module 1 Quiz (30 minutes)

1 discussion prompt (total 10 minutes)

Learning Goals (10 minutes)

Spark Core Concepts

In this module, you will learn to explain the core concepts of Spark. You will learn common ways to increase query performance by caching data and modifying Spark configurations. You will also use the Spark UI to analyze performance and identify bottlenecks, and optimize queries with Adaptive Query Execution.

What's included

6 videos, 2 readings, 2 quizzes

6 videos (total 35 minutes)

Module Introduction (1 minute)
Spark Terminology (3 minutes)
Caching (6 minutes)
Shuffle Partitions (5 minutes)
Spark UI (6 minutes)
Adaptive Query Execution (AQE) (11 minutes)

2 readings (total 60 minutes)

Readings (30 minutes)
Assignment #2 - Spark Internals (30 minutes)

2 quizzes (total 60 minutes)

Assignment #2 Quiz - Spark Internals (30 minutes)
Module 2 Quiz (30 minutes)

Engineering Data Pipelines

In this module, you will learn to identify and discuss the general demands of data applications. You'll access data in a variety of formats and compare the tradeoffs between them. You will examine semi-structured JSON data (common in big data environments) as well as schemas and parallel data writes. You will be able to create an end-to-end pipeline that reads data, transforms it, and saves the result.
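The read, transform, and save steps described above can be sketched in a few lines. This is a plain-Python stand-in using only the standard library, not Spark or the course's own notebooks. The schema, the field names (`user_id`, `country`, `amount`), and every function name are hypothetical and chosen only for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical schema: field name -> Python type used to cast each value.
SCHEMA = {"user_id": int, "country": str, "amount": float}


def read_json_lines(path, schema):
    """Read one JSON object per line, keeping only schema fields and casting types."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            # Fields that are missing or null become None instead of raising.
            rows.append({
                name: cast(record[name]) if record.get(name) is not None else None
                for name, cast in schema.items()
            })
    return rows


def total_amount_by_country(rows):
    """Transform step: drop rows without an amount, then sum amounts per country."""
    totals = {}
    for row in rows:
        if row["amount"] is None:
            continue
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


def write_csv(totals, path):
    """Save step: write one row per country, sorted by country name."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["country", "total_amount"])
        for country in sorted(totals):
            writer.writerow([country, totals[country]])


def run_pipeline(source, target):
    """End-to-end: read with a schema, transform, and save the result."""
    rows = read_json_lines(source, SCHEMA)
    totals = total_amount_by_country(rows)
    write_csv(totals, target)
    return totals


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        source = Path(workdir) / "events.json"
        # Hypothetical input: the second record carries an extra field, the third lacks an amount.
        source.write_text(
            '{"user_id": "1", "country": "US", "amount": "2.5"}\n'
            '{"user_id": 2, "country": "US", "amount": 1.5, "extra": true}\n'
            '{"user_id": 3, "country": "DE"}\n',
            encoding="utf-8",
        )
        print(run_pipeline(source, Path(workdir) / "totals.csv"))
```

Applying the schema at read time is what lets the later steps assume consistent types, even though the raw JSON records differ in shape.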

What's included

7 videos, 2 readings, 2 quizzes

7 videos (total 62 minutes)

Module Introduction (2 minutes)
Spark as a Connector (10 minutes)
Accessing Data (11 minutes)
File Formats (12 minutes)
JSON, Schemas and Types (8 minutes)
Writing Data (8 minutes)
Tables and Views (8 minutes)

2 readings (total 90 minutes)

Readings (60 minutes)
Assignment #3 - Engineering Data Pipelines (30 minutes)

2 quizzes (total 60 minutes)

Assignment #3 Quiz - Engineering Data Pipelines (30 minutes)
Module 3 Quiz (30 minutes)

Data Lakes, Warehouses and Lakehouses

In this module, you will identify the key characteristics of data lakes, data warehouses, and lakehouses. Lakehouses combine the scalability and low-cost storage of data lakes with the speed and ACID transactional guarantees of data warehouses. You will build a production-grade lakehouse by combining Spark with the open-source project Delta Lake. Whoever said time travel isn't possible hasn't been to a lakehouse!
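The "time travel" mentioned above refers to reading a table as it existed at an earlier version. The toy class below illustrates only that idea in plain Python. It is not Delta Lake's API or storage format: Delta Lake tracks versions through a transaction log, whereas this sketch simply keeps a full copy of every version. The class name `VersionedTable` and its methods are hypothetical.

```python
class VersionedTable:
    """Toy table that keeps every committed version so older ones can be re-read."""

    def __init__(self):
        # Version 0 is the empty table; each commit appends a new snapshot.
        self._versions = [[]]

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def append(self, rows):
        """Commit a new version made of the previous rows plus the new rows."""
        snapshot = self._versions[-1] + list(rows)
        self._versions.append(snapshot)
        return self.latest_version

    def read(self, version=None):
        """Read the latest version by default, or an earlier one by number."""
        if version is None:
            version = self.latest_version
        if not 0 <= version <= self.latest_version:
            raise ValueError(f"unknown version {version}")
        # Return a copy so callers cannot alter committed history.
        return list(self._versions[version])


if __name__ == "__main__":
    table = VersionedTable()
    table.append([{"id": 1}])
    table.append([{"id": 2}])
    print(table.read())           # rows as of the latest version
    print(table.read(version=1))  # rows as of version 1
```

Because committed versions are never modified, a reader that asks for an older version gets the same rows every time, regardless of later writes.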

What's included

8 videos, 2 readings, 2 quizzes, 1 peer review, 1 discussion prompt

8 videos (total 51 minutes)

Module Introduction (4 minutes)
Data Lakes vs. Data Warehouses (7 minutes)
What is a Lakehouse? (4 minutes)
Delta Lake (6 minutes)
Delta Lake (Demo) (5 minutes)
Delta Advanced Features (Demo) (6 minutes)
Continuing with Spark and Data Science (12 minutes)
Course Summary (4 minutes)

2 readings (total 70 minutes)

Readings (60 minutes)
Assignment #4 - Lakehouse (10 minutes)

2 quizzes (total 60 minutes)

Assignment #4 Quiz - Lakehouse (30 minutes)
Module 4 Quiz (30 minutes)

1 peer review (total 60 minutes)

Final Notebook Review (60 minutes)

1 discussion prompt (total 10 minutes)

Self-Reflection (10 minutes)

Instructors

Instructor ratings

4.6 (145 ratings)

Brooke Wenig

University of California, Davis

1 course, 45,781 learners

Offered by

University of California, Davis



Learner reviews

4.5 (661 reviews)

5 stars: 65.40%
4 stars: 23.11%
3 stars: 6.64%
2 stars: 2.26%
1 star: 2.56%



Frequently asked questions

When will I have access to the lectures and assignments?

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.
The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

What will I get if I subscribe to this Specialization?

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page. From there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.

What is the refund policy?

If you subscribe, you get a 7-day free trial during which you can cancel at no penalty. After that, we don't give refunds, but you can cancel your subscription at any time. See our full refund policy.



