Welcome to the module on executing Apache Spark on Cloud Dataproc. In this module, we'll review the parts of the Hadoop ecosystem, how to run Hadoop on Cloud Dataproc, why you should consider using GCS instead of HDFS for your storage, and how to optimize Cloud Dataproc. And lastly, there's a lab for you to practice what you've just learned.

It helps to place the services you'll be learning about in a historical context. Before 2006, big data simply meant big databases. Database design came from a time when storage was relatively cheap and processing, the compute power, was rather expensive. So it made sense to copy the data from its storage location to the processor to perform data processing, and the result would then be copied back to storage. Around 2006, distributed processing of that big data became practical with Apache Hadoop. The idea behind Hadoop is to create a cluster of computers and leverage distributed processing. HDFS, Hadoop's distributed file system, stored the data on the machines in the cluster, and MapReduce provided distributed processing, or compute, over all of that data. A whole ecosystem of Hadoop-related software grew up around Hadoop, including Hive, Pig, and Spark.

So what is Hadoop used for these days? Organizations use Hadoop for on-premises big data workloads using distributed data processing via MapReduce. They make use of a range of applications that run on Hadoop clusters, such as Presto. But a lot of customers also use Spark. Spark provides a high-performance analytics engine for processing batch and streaming data. Spark can be up to a hundred times faster than equivalent Hadoop jobs because it leverages in-memory processing. Spark also provides a couple of abstractions for dealing with data, including Resilient Distributed Datasets (RDDs) and DataFrames. Spark in particular is very powerful and expressive, and it's used for a lot of those Hadoop workloads.

So where did Apache Spark come from? Wouldn't it be better to just run MapReduce directly on the Hadoop cluster? Running MapReduce directly on top of the Hadoop cluster is very useful, but it has the complication that the Hadoop system has to be tuned for the kind of job being run in order to make efficient use of the underlying hardware, the resources of the cluster. Imagine a job working on millions of pieces of sensor data coming in from an Internet of Things, or IoT, application. And then imagine a job working on those huge photos from our previous example. Trying to do both things at the same time efficiently is really complicated. One important innovation that's helped out is Spark. A simple explanation of Spark is that it's able to mix different kinds of applications and to adjust how it uses the available resources on your cluster.

You have to learn to program Spark differently from traditional programming, because you don't tell it exactly how to do things. To give Spark the flexibility it needs to determine how to use the resources that are available, you describe what you want to do and let Spark determine how to make it happen. This is called declarative programming, as opposed to imperative programming. In imperative programming, you tell the system exactly what to do and how to do it. In declarative programming, you tell the system what you want, and it figures out how to actually do the implementation. You'll be using these concepts inside of your Spark labs later on in this course. There's even a full SQL implementation on top of Spark.
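To make that distinction concrete, here's a minimal PySpark sketch of the declarative style; the sensor readings, column names, and app name are made up for illustration, and the same aggregation is also expressed through Spark's SQL implementation.

```python
# A minimal sketch of declarative Spark: we describe the result we want
# (filter, group, aggregate), and Spark's engine decides how to execute it.
# The sample data and column names are hypothetical, just for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("declarative-demo").getOrCreate()

# Hypothetical IoT-style sensor readings.
readings = spark.createDataFrame(
    [("sensor-1", 21.5), ("sensor-1", 22.0), ("sensor-2", 35.1)],
    ["sensor_id", "temperature"],
)

# Declarative DataFrame API: what we want, not how to compute it.
hot_counts = (
    readings.filter(readings.temperature > 30)
            .groupBy("sensor_id")
            .count()
)
hot_counts.show()

# The same question expressed through Spark SQL.
readings.createOrReplaceTempView("readings")
spark.sql(
    "SELECT sensor_id, COUNT(*) AS hot_readings FROM readings "
    "WHERE temperature > 30 GROUP BY sensor_id"
).show()
```

In both forms you only declare the result you want; Spark's optimizer chooses how the work is actually distributed and executed across the cluster.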
There's also a common DataFrame model that works across Scala, Java, Python, SQL, and R. And lastly, there's a distributed machine learning library that you may have heard of called Spark MLlib.

So where does a Hadoop cluster store its data? Within the Hadoop Distributed File System, or HDFS. HDFS is the main file system Hadoop uses for distributing work to the nodes on its cluster. It's part of the cluster, which means that even if you're not running jobs that use the compute hardware on the cluster, you still have to keep the cluster powered on to persist all of that storage. This is the disadvantage of tying together compute and storage, and we'll talk about how to address it by running Hadoop in the cloud later on. As a data engineer, if you're using on-premises Hadoop, you're the one responsible for managing cluster resources via the YARN utility that Hadoop provides. If a job is demanding too many resources, or if there are issues with your hardware or software, it's ultimately your responsibility to manage this on premises and keep your cluster going.

There are two common issues with OSS Hadoop, the open-source version: cluster tuning and utilization. A company will typically have several Hadoop clusters that are shared by several organizations and run a wide variety of jobs. Hadoop experts then have to adjust many configuration settings in the collection of underlying open-source project software to optimize the cluster for the varying kinds of work it's being asked to perform. That's what we mean by the tuning problem. Hadoop clusters also tend to have a lot of dedicated hardware, which makes them expensive when they're not being used. That's the utilization problem. Hadoop administrators may find themselves searching out people in the organization with data processing jobs to run so they can increase cluster utilization. If they're successful, the capacity of the cluster will start to be consumed, and it may be time to order more hardware. This cycle of tuning, underutilization, overutilization, and expansion creates a significant overhead for a data engineer running on-premises Hadoop. We'll see how doing Hadoop in the cloud addresses these concerns. Note that there are many other components in the wider Hadoop ecosystem that support big data workloads. If you're currently using Hadoop, a lot of these names might look familiar.

Let's recap some of the limitations of doing Hadoop on-premises. First, those workloads on the cluster are not elastic. This means you're bound to the compute power and storage capacity of your on-premises cluster. If you have a huge reporting and analytics workload to run at the same time another ML team wants to use your cluster, you may run into long processing times for those jobs. On the other hand, if you have no jobs running, your cluster will still need power and just sit there idle. Why, again, is that? Because the compute and the storage are tied together. If you don't power your cluster 24/7, you'll lose the state of your storage.

If you want to run your Hadoop and Spark jobs in the cloud, you can use Cloud Dataproc. It has built-in support for Hadoop, and it's a fully managed service on GCP. That means you don't have to worry about hardware or software updates and installs; that's all done and managed for you. Also, if you need a larger cluster, you don't need to wait and order more machines. You can simply add or remove nodes in your cluster via the Cloud Dataproc UI in just minutes.
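As a hedged sketch of what "adding nodes in minutes" can look like programmatically, here's one way to resize a cluster's primary workers with the google-cloud-dataproc Python client; the project ID, region, cluster name, and worker count below are placeholder assumptions, and the same change can be made from the Cloud Dataproc UI or the gcloud command line.

```python
# A minimal sketch of resizing an existing Dataproc cluster with the
# google-cloud-dataproc Python client. The project ID, region, cluster name,
# and target worker count are placeholders for illustration.
from google.cloud import dataproc_v1
from google.protobuf import field_mask_pb2

project_id, region, cluster_name = "my-project", "us-central1", "my-cluster"

# Point the client at the regional Dataproc endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Describe only the field we want to change: the number of primary workers.
updated_cluster = dataproc_v1.Cluster(
    cluster_name=cluster_name,
    config=dataproc_v1.ClusterConfig(
        worker_config=dataproc_v1.InstanceGroupConfig(num_instances=5)
    ),
)

# update_cluster returns a long-running operation; result() blocks until done.
operation = client.update_cluster(
    project_id=project_id,
    region=region,
    cluster_name=cluster_name,
    cluster=updated_cluster,
    update_mask=field_mask_pb2.FieldMask(
        paths=["config.worker_config.num_instances"]
    ),
)
operation.result()
```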
Dataproc also has simplified version management. Keeping all of your open-source tools up to date and working together is one of the most complex parts of managing a Hadoop cluster. When you use Cloud Dataproc, much of that work is managed for you by Cloud Dataproc's versioning system.

Lastly, Dataproc has flexible job configuration. A typical on-premises Hadoop setup uses a single cluster that serves many purposes. When you move to GCP, you can focus on individual tasks, creating as many clusters as you need. This removes much of the complexity of maintaining a single cluster with growing dependencies and differing software configuration interactions. What that last point really means is that you can spin up a particular cluster just for a given workload or job, run that workload, and then turn down the cluster when you don't need those compute resources anymore. This is especially true when you persist data off-cluster instead of inside HDFS on the cluster, a topic we'll discuss in greater detail soon.
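To give a sense of what persisting data off-cluster looks like inside a job, here's a minimal PySpark sketch; the bucket name, file paths, and column names are assumptions for illustration. On Dataproc, the Cloud Storage connector lets a job read and write gs:// paths anywhere it would otherwise use hdfs:// paths, so the data and results survive deleting the cluster.

```python
# A minimal sketch of reading from and writing to Cloud Storage instead of
# cluster-local HDFS. The bucket, paths, and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-instead-of-hdfs").getOrCreate()

# Read input that lives in a bucket, not on the cluster's own disks.
sales = spark.read.csv(
    "gs://my-bucket/raw/sales.csv", header=True, inferSchema=True
)

# The transformation logic is unchanged; only the storage location differs.
daily_totals = sales.groupBy("sale_date").sum("amount")

# Write results back to the bucket so they outlive the ephemeral cluster.
daily_totals.write.mode("overwrite").parquet(
    "gs://my-bucket/output/daily_totals/"
)
```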