As you might remember from the demo earlier, separating compute and storage enables cost-effective scaling for your workloads. With the notion of turning down clusters, you may be worried about all the data that's currently stored on disk in HDFS. You may be using HDFS on-premises. What happens to that? In the new Google Cloud Platform architecture, your data is not stored on the cluster. Your data is now stored off-cluster, in Google Cloud Storage buckets. But won't that make things slow, reaching out across a network each time the cluster needs some data? Recall that earlier we mentioned Google's data center bandwidth between compute and storage. We talked about petabit bisectional bandwidth. What that means is that if you draw a line somewhere in the network, the bisectional bandwidth is the rate at which servers on one side of the line can communicate with servers on the other side. With enough bisectional bandwidth, any server can communicate with any other server at full network speed. With petabit bisectional bandwidth, that communication is so fast that it no longer makes sense to transfer files and store them locally. Instead, it makes sense to use the data from where it is stored.

So here's the plan: we'll still use HDFS, but only on the cluster as working storage, that is, storage during processing. We'll store all actual input and output data on Google Cloud Storage. Because the input and output data live off the cluster, the cluster can be created for a single job or type of workload and shut down when not in use.

Changing code that works on-premises so that it works with data on Cloud Storage is easy. Just replace HDFS in your Spark or Pig load with GS: take the HDFS URLs, hdfs://, and replace them with gs://. This makes the Spark or Pig job read from or write to Cloud Storage, and that's it. There's also an HBase connector for Cloud Bigtable, so if your data is in Bigtable, you can use the HBase connector. And there's a BigQuery connector that you can use if your data is in BigQuery, the analytics data warehouse. So when we say off-cluster storage, we're talking about Cloud Storage, BigQuery, or Bigtable.

Cloud Dataproc and Google Cloud Storage are tightly coupled with other GCP products. Storing your data off-cluster means you can process that data not just with Spark on Dataproc, but also with all these other products. Not to mention that storing it off-cluster in GCS is generally cheaper, since, A, disks attached to compute instances are expensive in and of themselves, and, B, if your data is off-cluster, you get to shut down the compute nodes when you're not using them. Bottom line: store your data in Google Cloud Storage. Friends don't let friends use HDFS.

Let's recap what we've covered. You can get a Cloud Dataproc cluster up and running in about 90 seconds, which gives you all the power of Hadoop without having to manage clusters. As you saw, you can lift and shift your existing Hadoop and Spark workloads by simply replacing hdfs:// URLs with gs:// URLs. You can connect Cloud Dataproc to Google Cloud Storage and unlock the benefits of both scale and cloud economics. You get to provision a cluster per job if you want to, and shut the cluster down when you're done. Lastly, your clusters are customizable, which can include autoscaling and preemptible VMs, so you get cost savings.
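
To make the lift-and-shift idea concrete, here is a minimal PySpark sketch of the hdfs:// to gs:// swap described above. The bucket name and paths (my-bucket, data/logs, output/error-lines) are hypothetical placeholders, not anything from the course; on Dataproc the Cloud Storage connector is available to Spark, so only the URL scheme needs to change.

```python
from pyspark.sql import SparkSession

# Minimal sketch: the bucket and paths below are hypothetical placeholders.
spark = SparkSession.builder.appName("lift-and-shift-example").getOrCreate()

# On-premises, this might have been:
#   logs = spark.read.text("hdfs:///data/logs/2024/*.log")
# With off-cluster storage, only the URL scheme changes:
logs = spark.read.text("gs://my-bucket/data/logs/2024/*.log")

# Example transformation: keep only lines containing the word "ERROR".
errors = logs.filter(logs.value.contains("ERROR"))

# Write the results back to Cloud Storage instead of HDFS.
errors.write.mode("overwrite").text("gs://my-bucket/output/error-lines/")

spark.stop()
```

The rest of the job logic stays exactly as it was on-premises; the storage location is the only thing that moves off the cluster.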
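
The BigQuery connector mentioned above can be used from Spark in a similar way. The following is a sketch only, assuming the spark-bigquery-connector is available on the cluster and using a hypothetical project, dataset, and table name.

```python
from pyspark.sql import SparkSession

# Sketch: assumes the spark-bigquery-connector is on the cluster's classpath
# and that "my-project.my_dataset.my_table" is a hypothetical placeholder.
spark = SparkSession.builder.appName("bigquery-read-example").getOrCreate()

# Read a BigQuery table directly into a Spark DataFrame.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .load()
)

df.printSchema()
print(df.count())

spark.stop()
```

As with Cloud Storage, the data stays in BigQuery; Spark on Dataproc simply reads it from where it lives, so the cluster itself can remain ephemeral.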