Hi. Julie here again. Earlier in the course, you saw how to do batch data processing with Cloud Dataproc and other methods. Now it's time to introduce you to a key serverless tool that should be in your data engineering toolkit, Cloud Dataflow. This entire module will cover batch Dataflow pipelines and why Dataflow is a commonly used data pipeline tool on GCP. Not to give away too much of the answer, but you can write the same code to do both batch and streaming pipelines with Dataflow. We'll cover streaming pipelines later. So the topics we will address are: how to decide between Cloud Dataflow and Cloud Dataproc, why customers value Dataflow, and pipelines, templates, and how you can now run SQL on Dataflow too. Let's get started.

Cloud Dataflow is the serverless execution service for data-processing pipelines written using Apache Beam. Apache Beam is open source. You author your pipeline and then give it to a runner. Beam supports multiple runners like Flink and Spark, so if you wanted to run your Beam pipeline on-prem or in another cloud, you don't even have to run it on Cloud Dataflow. This means your pipeline code is portable. So why run Beam on Cloud Dataflow? It is the most effective execution environment upon which to run Apache Beam. The Dataflow team actively contributes to the open-source Beam library and builds features around the latest Beam offerings. So why not just stick with running Hadoop or Spark jobs on Dataproc? Well, with Dataflow, you don't have to manage clusters at all. Unlike with Cloud Dataproc, the auto-scaling in Dataflow scales step by step; it's very fine-grained. Plus, as we'll see in the next course, Dataflow allows you to use the same code for both batch and stream. That's where the name Beam comes from, by the way: batch and stream together make Beam.

So how do you decide between the two? When building a new data processing pipeline, we recommend that you use Dataflow. If, on the other hand, you have existing pipelines written using Hadoop technologies, it may not be worthwhile to rewrite everything. Migrate them over to Google Cloud using Dataproc and then modernize as necessary. As a data engineer, we recommend that you learn both Dataflow and Dataproc. This way you can make the choice based on what's best for your specific use case. What are some scenarios in which you might choose Dataproc over Dataflow? If the project has existing Hadoop or Spark dependencies, then it might make sense to use Dataproc. And sometimes the production team is a lot more comfortable with a DevOps approach: they want to provision machines themselves rather than going with the serverless approach where Google does it for you. In this case, Dataproc might be the right choice as it provides greater control. Now, if you don't care about streaming and your primary goal is to move existing workloads, then Dataproc could be fine too. However, Dataflow is our recommended approach for building pipelines, and the rest of this module will explain why.

So to recap what we've covered so far: Cloud Dataflow provides a serverless way to execute pipelines on batch and streaming data. Serverless means there are no servers for you to manage; the service just works. For example, if you have a ton of data to process, Dataflow will intelligently call upon more virtual machines to help. Since Dataflow also supports streaming, this makes your pipeline low-latency, meaning you can process data as soon as it comes in. Let's dive into how this whole streaming versus batch processing got started.
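Before we get into that history, here's a minimal sketch of the "same code, different runners" idea, using the Apache Beam Python SDK. The project, region, and bucket names are placeholders I've made up for illustration, not real resources, and this is not the exact pipeline from the course.

```python
# A minimal sketch of Beam's runner portability, assuming the Apache Beam
# Python SDK is installed (pip install apache-beam[gcp]). The project,
# region, and bucket names below are placeholders, not real resources.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(runner_name):
    # The same pipeline code runs locally (DirectRunner) or on Cloud
    # Dataflow (DataflowRunner); only the pipeline options change.
    options = PipelineOptions(
        runner=runner_name,
        project='my-project-id',             # placeholder
        region='us-central1',                # placeholder
        temp_location='gs://my-bucket/tmp',  # placeholder
    )
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')
         | 'CountLines' >> beam.combiners.Count.Globally()
         | 'Write' >> beam.io.WriteToText('gs://my-bucket/output'))


# run('DirectRunner')    # execute on your local machine
# run('DataflowRunner')  # execute the identical transforms on Cloud Dataflow
```

Notice that the transforms themselves never change; switching from a local test run to a Dataflow run is only a matter of which runner and options you hand the pipeline to.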
This ability to process batch and stream with the same code is rather unique to Apache Beam. For a long time, batch programming and data processing used to be two very separate and different things. Batch programming dates to the 1940s and the early days of computing, when it was realized that you can think of two separate concepts: code and data. Use code to process data. Of course, both of these were on punch cards, so that's what we were processing. A box of punch cards was called a batch. It was a job that started and ended when the data was fully processed. Stream processing, on the other hand, is more fluid. It arose in the 1970s with the idea that data processing is something ongoing, like a stream of water in a pipe. The idea is that data keeps coming in and you process the data. The processing itself tended to be done in micro-batches.

So what about today? How do these two distinct processes get combined? This is the genius of Apache Beam. It provides abstractions that unify traditional batch programming concepts and traditional data processing concepts. Unifying programming and processing is a big innovation in data engineering. The four main concepts are PTransforms, PCollections, pipelines, and pipeline runners. Let's drill into each of these concepts in more detail. A pipeline identifies the data to be processed and the actions to be taken on the data. The data is held in a distributed data abstraction called a PCollection. The PCollection is immutable. Any change that happens in a pipeline ingests one PCollection and creates a new one; it does not change the incoming PCollection. The actions, or code, are contained in an abstraction called a PTransform. The PTransform handles input, transformation, and output of the data. The data in a PCollection is passed along the graph from one PTransform to another. Pipeline runners are analogous to container hosts, such as Google Kubernetes Engine. The identical pipeline can be run on a local computer, a data center VM, or a service such as Cloud Dataflow in the cloud. The only difference is scale and access to platform-specific services. The services the runner uses to execute the code are called the backend system.

Immutable data is one of the key differences between batch programming and data processing. The assumption in the von Neumann architecture was that data would be operated on and changed in place, which was very memory efficient. That made sense when memory was very expensive and scarce. Immutable data, where each transform results in a new copy, means that there is no need to coordinate access control or sharing of the original input data. So it enables, or at least simplifies, distributed processing.

The shape of a pipeline is not actually just a single linear progression, but rather a directed graph with branches and aggregations. For historical reasons, we refer to it as a pipeline, but a data graph or dataflow might be a more accurate description. Now what happens in each of these green boxes? They represent transforms on PCollections. A PCollection represents both streaming data and batch data. There is no size limit; a PCollection can be either bounded or unbounded. That's why it's called a PCollection, or parallel collection: the more data there is, the more it's simply distributed in parallel across more workers. For streaming data, the PCollection is simply without bounds. It has no end. Each element inside a PCollection can be individually accessed and processed.
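To make those four concepts concrete, here's a small sketch using the Apache Beam Python SDK. The element values and transform names are made up for illustration; they aren't from the course.

```python
# A minimal sketch of the four Beam concepts discussed above, using the
# Apache Beam Python SDK. The element values are invented for illustration.
import apache_beam as beam

with beam.Pipeline() as pipeline:  # Pipeline: the whole graph of work
    # PCollection: an immutable, distributed collection of elements.
    temps = pipeline | 'Create' >> beam.Create([21, 35, 18, 40])

    # Each PTransform reads a PCollection and produces a new one; the
    # graph can branch, so one PCollection can feed several transforms.
    in_fahrenheit = temps | 'ToF' >> beam.Map(lambda c: c * 9 / 5 + 32)
    hot_days = temps | 'FilterHot' >> beam.Filter(lambda c: c > 30)

    in_fahrenheit | 'PrintF' >> beam.Map(print)
    hot_days | 'PrintHot' >> beam.Map(print)

# With no runner specified, Beam's local DirectRunner executes this graph;
# on Cloud Dataflow, only the pipeline options would change.
```

Notice that `temps` feeds two different transforms. Because PCollections are immutable, each branch works from the same unchanged input and produces its own new PCollection.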
This is how distributed processing of the PCollection is implemented. You define the pipeline and the transforms on the PCollection, and the runner handles implementing the transforms on each element, distributing the work as needed for scale and with all available resources. Elements represent different data types. Once an element is created in a PCollection, it is immutable, so it can never be changed or deleted. In traditional programs, a data type is stored in memory with a format that favors processing. Integers in memory are different from characters, which are different from strings and compound data types. In a PCollection, all data types are stored in a serialized state as byte strings. This way, there is no need to serialize data prior to every network transfer and deserialize it when it is received. Instead, the data moves through the system in a serialized state and is only deserialized when necessary for the actions of a PTransform.
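To make that serialization point a bit more tangible, here's a small sketch that calls the Beam Python SDK's built-in coders directly. In a real pipeline you rarely touch coders yourself, and the exact byte values shown in the comments are just an illustration of the idea that elements travel as byte strings and are only decoded when a value is actually needed.

```python
# A small sketch of the coder idea described above: Beam moves elements
# around as serialized byte strings and only decodes them when a
# PTransform needs the value. We call the SDK's built-in coders directly
# just to make that behavior visible.
from apache_beam import coders

int_coder = coders.VarIntCoder()
str_coder = coders.StrUtf8Coder()

encoded = int_coder.encode(2024)     # the element as a byte string
print(encoded)                       # e.g. b'\xe8\x0f'
print(int_coder.decode(encoded))     # 2024, decoded only when needed

print(str_coder.encode('dataflow'))  # UTF-8 byte string for the element
```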