Designing data processing systems includes designing flexible data representations, designing data pipelines, and designing data processing infrastructure. You're going to see that these three items show up in the first part of the exam with similar but not identical considerations. The same questions or interests show up in different contexts: data representation, pipelines, processing, and infrastructure. For example, innovations in technology could make the data representation of a chosen solution outdated. The data processing pipeline might have been implemented using very involved transformations that are now available as a single efficient command. The infrastructure could be replaced by a service with more desirable qualities. However, as you'll see, there are additional concerns with each part. For example, system availability is important to pipeline processing but not to data representation. Capacity is important to processing but not to the abstract pipeline or the representation.

Think about data engineering in Google Cloud as a platform consisting of components that can be assembled into solutions. Let's review the elements of GCP that form the data engineering platform. Storage and databases: services that enable storing and retrieving data, with different storage and retrieval methods that make them more efficient for specific use cases. Server-based processing: services that enable application code and software to run, and that can make use of stored data to perform operations, actions, and transformations, producing results. Integrated services: these combine storage and scalable processing in a framework designed to process data rather than general applications, and they're more efficient and flexible than isolated server and database solutions. Artificial intelligence: methods to help identify, tag, categorize, and predict, actions that are very hard or impossible to accomplish in data processing without machine learning. Pre- and post-processing services: working with data and pipelines before processing, such as data cleanup, or after processing, such as data visualization. Pre- and post-processing are important parts of a data processing solution. Infrastructure services: all the framework services that connect and integrate data processing and IT elements into a complete solution, such as messaging systems, data import and export, security, and monitoring.

Storage and database systems are designed and optimized for storing and retrieving data. They are not really built to do data transformation; it's assumed in their design that the computing power necessary to perform transformations on the data is external to the storage or database. The organization method and access method of each of these services make it efficient for specific cases. For example, a Cloud SQL database is very good at storing consistent individual transactions, but it's not really optimized for storing large amounts of unstructured data like video files. Database services perform minimal operations on the data within the context of the access method; for example, SQL queries can aggregate, accumulate, count, and summarize the results of a search. Here's an exam tip: note the differences between Cloud SQL and Cloud Spanner, and when to use each. Service differentiators include access methods, the cost or speed of specific actions, the sizes of data handled, and how the data is organized and stored. Details and differences between the data technologies are discussed later in this course.
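To make that point about the access method concrete, here is a minimal sketch of pushing aggregation into a Cloud SQL (MySQL) database so that only the summarized results leave it. The `orders` table, the `sales` database, the user, and the connection details are hypothetical, and the sketch assumes the Cloud SQL Auth Proxy is running locally on the default MySQL port.

```python
import os
import pymysql

# Connect through the Cloud SQL Auth Proxy listening locally (assumption).
conn = pymysql.connect(
    host="127.0.0.1",
    user="report_user",                  # hypothetical read-only user
    password=os.environ["DB_PASSWORD"],  # avoid hard-coding credentials
    database="sales",                    # hypothetical database
)
try:
    with conn.cursor() as cur:
        # Aggregate, count, and summarize inside the database engine,
        # within the context of the SQL access method.
        cur.execute(
            """
            SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS revenue
            FROM orders
            GROUP BY customer_id
            ORDER BY revenue DESC
            LIMIT 10
            """
        )
        for customer_id, order_count, revenue in cur.fetchall():
            print(customer_id, order_count, revenue)
finally:
    conn.close()
```

The design point is that the counting and summing happen inside the database itself; anything more involved than this kind of query belongs in an external processing service.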
An exam tip: know how to identify technologies backwards from their properties. For example, which data technology offers the fastest ingest of data? Which one might you use for ingesting streaming data?

Managed services are ones where you can see the individual instance or cluster. Exam tip: a managed service still has some IT overhead; it doesn't completely eliminate the overhead or manual procedures, but it minimizes them compared with on-premises solutions. Serverless services remove more of the IT responsibility, so managing the underlying servers is not part of your overhead and the individual instances are not visible.

A more recent addition to this list is Cloud Firestore, a NoSQL document database built for automatic scaling. It offers high performance and ease of application development, and it includes a Datastore compatibility mode.

As mentioned, storage and databases provide limited processing capabilities, and what they do offer is in the context of search and retrieval. But if you need to perform more sophisticated actions and transformations on the data, you'll need data processing software and computing power. So where do you get these resources? You could use any of these computing platforms to write your own application, or parts of an application, that uses storage or database services. You could install open-source software, such as MySQL, an open-source database, or Hadoop, an open-source data processing platform, on Compute Engine. Build-your-own solutions are driven mostly by business requirements, and they generally involve more IT overhead than using a cloud platform service.

These three data processing services, Cloud Dataproc, Cloud Dataflow, and BigQuery, feature in almost every data engineering solution. Each overlaps with the others, meaning that some work could be accomplished in any two or in all three of these services, and advanced solutions may use one, two, or all three. Data processing services combine storage and compute, and they automate the storage and compute aspects of data processing through abstractions. For example, in Cloud Dataproc, the data abstraction with Spark is a Resilient Distributed Dataset, or RDD, and the processing abstraction is a Directed Acyclic Graph, or DAG. Implementing storage and processing as abstractions enables the underlying systems to adapt to the workload and the data engineer to focus on the data and business problems they're trying to solve.

There's great potential value in product or process innovation using machine learning. Machine learning can make unstructured data, such as logs, useful by identifying or categorizing the data and thereby enabling business intelligence. Recognizing an instance of something that exists is closely related to predicting future instances based on past experience. Machine learning is used for identifying, categorizing, and predicting; it can make unstructured data useful. Your exam tip is to understand the array of machine learning technologies offered on GCP and when you might want to use each.

A data engineering solution involves data ingest, management during processing, analysis, and visualization, and these elements can be critical to the business requirements. Here are a few services that you should be generally familiar with. Data transfer services operate online, and a Transfer Appliance is a shippable device that's used for synchronizing data in the Cloud with an external source. Data Studio is used for visualization of data after it has been processed.
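As a concrete illustration of the Firestore description above, here is a minimal sketch of writing and querying documents. The `devices` collection, the document contents, and the use of default application credentials are assumptions for the example, not part of the course material.

```python
from google.cloud import firestore

# Uses default application credentials for the current project (assumption).
db = firestore.Client()

# Write a schemaless document; the ID "sensor-42" is hypothetical.
db.collection("devices").document("sensor-42").set(
    {"type": "thermostat", "firmware": "1.2.0", "active": True}
)

# Query by field value; Firestore scales these reads automatically.
for doc in db.collection("devices").where("active", "==", True).stream():
    print(doc.id, doc.to_dict())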
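The RDD and DAG abstractions mentioned in the Cloud Dataproc discussion above can be seen directly in a few lines of PySpark. This is a minimal local sketch rather than a real Dataproc job: the transformations only record steps into the DAG, and nothing executes until an action is called.

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-dag-demo")

# An RDD is the data abstraction: a distributed, immutable collection.
lines = sc.parallelize(["error: disk full", "ok", "error: timeout", "ok"])

# Transformations extend the DAG lazily; no work happens yet.
errors = lines.filter(lambda line: line.startswith("error"))
pairs = errors.map(lambda line: (line.split(":")[1].strip(), 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# An action triggers Spark to optimize and execute the whole DAG.
print(counts.collect())  # e.g. [('disk full', 1), ('timeout', 1)]

sc.stop()
```

On Cloud Dataproc the same script would be submitted with `gcloud dataproc jobs submit pyspark`; the abstractions are identical, but the cluster supplies the storage and compute underneath them.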
Cloud Dataprep is used to prepare or condition data and to prepare pipelines before processing the data. Cloud Datalab is a notebook, that is, a self-contained workspace that holds code, executes the code, and displays results. Dialogflow is a service for creating chatbots; it uses AI to provide a method for direct human interaction with data.

Your exam tip here is to familiarize yourself with the infrastructure services that show up commonly in data engineering solutions. Often they're employed because of key features they provide. For example, Cloud Pub/Sub can hold a message for up to seven days, providing resiliency to data engineering solutions that would otherwise be very difficult to implement.

Every service in Google Cloud Platform could be used in a data engineering solution, but some of the most common and important services are shown here. Cloud Pub/Sub, a messaging service, features in virtually all live or streaming data solutions because it decouples data arrival from data ingest. Cloud VPN, Partner Interconnect, or Dedicated Interconnect play a role whenever there's on-premises data that must be transmitted to services in the Cloud. Cloud IAM, firewall rules, and key management are critical to some verticals, such as the healthcare and financial industries. And every solution needs to be monitored and managed, which usually involves panels displayed in Cloud Console and Stackdriver Monitoring.

It's a good idea to examine sample solutions that use data processing or data engineering technologies and to pay attention to the infrastructure components of the solution. It's important to know what the services contribute to the data solutions and to be familiar with key features and options.

There are a lot of details that I wouldn't memorize. For example, the exact number of IOPS supported by a specific instance is something I would expect to look up, not know, and the same goes for the cost of one instance type compared with another. The actual values are not something I would expect to need to know as a data engineer; I would look these details up if I needed them. However, the fact that an n1-standard-4 instance has higher IOPS than an n1-standard-1 instance, or that the n1-standard-4 costs more than the n1-standard-1, are concepts I would need to know as a data engineer.
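To illustrate the decoupling of data arrival from data ingest described above, here is a minimal Cloud Pub/Sub publisher sketch. The project ID, topic name, and message contents are hypothetical; the point is that publishing returns a future immediately, so the producer never waits on whichever pipeline eventually consumes the message.

```python
from google.cloud import pubsub_v1

project_id = "my-project"    # hypothetical project
topic_id = "sensor-events"   # hypothetical, pre-created topic

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# publish() returns immediately with a future; the data has "arrived"
# even though no subscriber has ingested it yet.
future = publisher.publish(topic_path, data=b'{"temp": 21.5}', sensor="42")
print("Published message ID:", future.result())
```

Because the topic retains undelivered messages (up to the seven-day retention mentioned above), a subscribing pipeline that goes offline briefly can catch up without losing data.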