Welcome to this course on resilient streaming applications on Google Cloud Platform. The first chapter is on the architecture of streaming analytics pipelines. This course discusses what stream processing is, how it fits into a big data architecture, when stream processing makes sense, and which Google Cloud technologies and products you can choose from to build a resilient streaming data processing solution. We'll also discuss the challenges associated with streaming data processing. There are three key challenges: handling variable data volumes, dealing with unordered or late data, and deriving insights from data even as it's streaming in. In this chapter, we'll also do a lab, a pen-and-paper one, where you'll fit some typical streaming scenarios into this streaming architecture.

So, what does streaming mean in this context? Essentially, we're talking about doing data processing not on bounded data, but on unbounded data. Unbounded is the key word here, and typical datasets are bounded, meaning that they're complete. At the very least, you process the data as if it were complete. Realistically, we know that there will always be new data, but as far as data processing is concerned, we treat it as if it were a complete dataset. Another way to think about bounded data processing is that we'll be done analyzing the data before new data comes in. On the other hand, if you have an unbounded dataset, it's never complete; there's always new data coming in, typically even as you're analyzing it. So, we tend to think of analysis on unbounded datasets as a temporary thing, carried out many times, valid only at a particular point in time.

So, streaming is essentially data processing on unbounded data. Bounded data is data at rest; stream processing is how you deal with data that's not at rest, with unbounded data. But more broadly, people often talk about streaming as an execution engine: the system, the service, the runner, the thing that you're using to process unbounded data. If you design it correctly, such a stream processing engine can give you low latency; partial, speculative results that you can revise later; the ability to reason about time to control for the correctness of the data; and the power to perform lots of complex analysis even as the data are coming in. That's what we're going to learn in this course: designing stream processing systems that can do all of these things.

That's all well and good, but how common are unbounded datasets? Why should you care? Well, unbounded datasets are actually quite common, and they're becoming more common as sensors get cheaper and network connectivity keeps improving. For example, traffic events: you may have sensors laid out along thousands of miles of highway. In this course, we'll actually use traffic data from San Diego, a city in Southern California, collected over a year. So in some sense this is bounded data; it was collected, it's historical data, but we're going to simulate it as if it were streaming in real time. Or take, for example, usage information on some cloud component by every user within a Google Cloud Platform project. For monitoring purposes, this will be streaming data. This is a very common theme, by the way: you want to derive insights fast, and in order to derive insights fast, you're going to be monitoring data as it's coming in.
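To make the bounded-versus-unbounded distinction concrete, here is a minimal sketch in Apache Beam's Python SDK, the programming model behind Dataflow, which we'll use later in the course. The bucket path and Pub/Sub topic are hypothetical placeholders, not anything from the course materials:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Bounded: the dataset is treated as complete, so there is one final answer.
with beam.Pipeline() as p:
    (p
     | 'ReadFile' >> beam.io.ReadFromText('gs://my-bucket/traffic.csv')  # hypothetical path
     | 'CountAll' >> beam.combiners.Count.Globally()
     | 'Print' >> beam.Map(print))

# Unbounded: data never stops arriving, so results only make sense per window.
opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | 'ReadTopic' >> beam.io.ReadFromPubSub(
           topic='projects/my-project/topics/traffic')   # hypothetical topic
     | 'Window' >> beam.WindowInto(FixedWindows(60))     # one-minute windows
     | 'CountPerWindow' >> beam.CombineGlobally(
           beam.combiners.CountCombineFn()).without_defaults()
     | 'Print' >> beam.Map(print))
```

The point of the contrast: the bounded pipeline can emit one final count because the data are complete, while the unbounded pipeline never finishes, so its counts are scoped to windows of time and remain valid only for that window.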
Or take, for example, credit card transactions. If you're considering the credit card transactions of every card member, ever since they opened the account and as they're using their cards, then you're talking about streaming data, and this is necessary for things like fraud detection. This illustrates two other common themes. One is the need for fast decisions: the idea that you have to stop the fraud before the person actually makes the purchase. That's the reason you need to process the data as it's coming in. It's not enough to collect the credit card data and process it once a day, for example; you need to process it as it arrives, and that need for a fast decision is what leads to streaming. The second common theme is that you have lots of data from a variety of sources, and it keeps growing over time. And it's not only about the new data. Think about a fraud situation: it's not just about the purchase being made right now, it's also about this user's history of purchases. So you're not just talking about new data, you're talking about comparing the new data with all the data you've collected so far.

Or take the final example: the moves of every user in a multi-user online game. That's a streaming dataset. This is the sort of use case that probably didn't exist a few years ago. A few years ago, games were static, but now games are highly personalized, and things like character design and in-app purchases are hugely dependent on knowing up-to-the-minute user activity. When you're thinking about up-to-the-minute user activity, you're thinking about an unbounded dataset; you're thinking about streaming.

So, to summarize the three themes we saw in the four examples on the previous slide. Number one, massive data from a variety of sources that keeps growing over time. Secondly, the need to derive insights immediately so that you can display them in the form of dashboards, for example in the cloud monitoring situation. And finally, the need to make decisions fast, to interact with users at the right time, to make timely decisions.

So, the demand for stream processing is increasing a lot these days. It's not enough to process big data; you have to process it fast, so that a firm can react to changing business conditions in real time. Real-world stream processing use cases include things like trading, fraud detection, system monitoring, order routing, transaction cost analysis; the list goes on and on. If you want to look for fraudulent transactions historically, that's batch, but if you want to catch them as they happen, that's streaming. You need stream processing to answer questions like: how many sales did I make in the last hour due to advertising conversions? That's the last hour, it's new data, it's streaming, it's monitoring. Which version of my web page do people like better? If you want to show this as a dashboard to your user interface designers, that's streaming. If you want to do it after the fact, running an experiment and coming back a week later to analyze it, that's batch. So streaming is about the need to make faster decisions. Or, for example, which transactions look fraudulent? You want to stop the scammer before the purchase goes through, so that's streaming, as we've seen.
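As a hedged illustration of the "sales in the last hour" kind of question, here is what such a continuous query might look like in Beam's Python SDK. The Pub/Sub topic and the JSON field name are assumptions made up for this sketch:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import SlidingWindows

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | 'ReadSales' >> beam.io.ReadFromPubSub(
           topic='projects/my-project/topics/sales')      # hypothetical topic
     | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
     | 'Amount' >> beam.Map(lambda sale: sale['amount'])  # assumed field name
     # Each window spans the last hour; a fresh total every five minutes.
     | 'LastHour' >> beam.WindowInto(SlidingWindows(size=3600, period=300))
     | 'Total' >> beam.CombineGlobally(sum).without_defaults()
     | 'ToDashboard' >> beam.Map(print))  # stand-in for a real dashboard sink
```

The sliding window is what turns a one-off batch question into a continuous one: instead of computing the total once, the pipeline keeps re-computing it over the most recent hour as new sales stream in.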
So, when we talk about big data, when you have terabytes and petabytes of data, we covered this in the course on serverless data analysis: the way you solve for large amounts of data is to do MapReduce, to break up your data and do autoscaling analysis on it. We looked at BigQuery and we looked at Dataflow, but we looked at them as batch processing solutions over large amounts of data. The cool thing is that on Google Cloud Platform, both of these products, BigQuery and Dataflow, also handle streaming. The same tools you use for batch can be used for streaming.

Another aspect of big data is variety: audio, video, images, unstructured text, blog posts, etc. The hard problem with variety is dealing with unstructured data. If your data were all structured, you would just throw them into a relational table and use joins, but unstructured data is a little harder. In the courses on unstructured data, we talked about machine learning APIs to make sense of variety. The third aspect of big data is near real-time data processing: data coming in so fast that you need to process it just to keep up. That's what we're going to be talking about in this course when we talk about streaming.

So if you think about a big data architecture, it will have several parts. Often there are masses of structured and semi-structured historical data that you may have stored in Cloud Storage, Pub/Sub, or BigQuery; that's volume and variety. On the other side, stream processing is what's used for fast data requirements, for velocity. The two complement each other. If you think about the Internet of Things, for example, you're talking about increased data volume and probably increased variety and velocity of data, and this leads to a dramatically increased need for things like stream processing technologies.

So let's now move on to the challenges associated with streaming data processing. Stream processing is what makes it possible to derive real-time insights from growing amounts of data. A good stream processing solution needs to address three key challenges. Number one, it needs to be able to scale. Think of credit cards: there will be hundreds of thousands of transactions happening all the time, and the volume is not going to be constant; it's not going to be exactly the same at all times. Around holidays the volume will be higher; late at night it will be lower. The second challenge: you want to be able to use continuous queries. Lots of times we're interested in querying the latest arriving data, for example to compute moving averages or look for unusual spikes. But to do that, we have to continuously calculate some kind of mathematical or statistical analysis, on the fly, on the stream, while at the same time accounting for things like late data, out-of-order data, et cetera. The third challenge: we can't just stop the ingest whenever we want to analyze the data. We need to be able to derive insights even as the data are coming in. In other words, we want to be able to run SQL-like queries that operate over time windows on that data.

So when you think about stream processing systems, they've been around a while. They were born nearly a decade ago out of the need for low-latency processing of large volumes of dynamic, time-continuous streams from sensors, monitoring devices, and so on.
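To sketch how a pipeline can address the second challenge, handling late and out-of-order data while still producing speculative results, here is one way to express it with Beam's windowing and trigger API. The topic and the field names (sensor_id, speed) are assumptions for illustration, loosely modeled on the San Diego traffic data mentioned earlier:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AccumulationMode)

opts = PipelineOptions(streaming=True)
with beam.Pipeline(options=opts) as p:
    (p
     | 'ReadSensors' >> beam.io.ReadFromPubSub(
           topic='projects/my-project/topics/traffic')    # hypothetical topic
     | 'Parse' >> beam.Map(lambda m: json.loads(m.decode('utf-8')))
     | 'KeyBySensor' >> beam.Map(
           lambda r: (r['sensor_id'], r['speed']))        # assumed fields
     | 'Window' >> beam.WindowInto(
           FixedWindows(60),                    # one-minute windows
           trigger=AfterWatermark(
               early=AfterProcessingTime(10),   # speculative results every ~10s
               late=AfterProcessingTime(10)),   # re-fire as late data arrives
           allowed_lateness=300,                # accept data up to 5 minutes late
           accumulation_mode=AccumulationMode.ACCUMULATING)
     | 'AvgSpeed' >> beam.combiners.Mean.PerKey()
     | 'Emit' >> beam.Map(print))
```

This one sketch touches all three challenges: a managed runner like Dataflow handles the variable volume, the windowed mean is a continuous query over the latest data, and the trigger plus allowed lateness let the pipeline emit early answers and refine them as stragglers show up, without ever stopping the ingest.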
So, they've been around, and probably the first users were in the financial industry, with stock market data, but today you see streaming everywhere. Stream data is being generated through human activities, through social media, through machine data, through sensor data. There's a lot of stream data out there, so we need to know how to process it.