Hello. This week you will learn about Apache Spark, a modern distributed, fault-tolerant computation platform. The week is structured into three parts. First, we will go over the history of Spark and its core concepts, abstractions, and operations; you will learn about RDDs, transformations, actions, and resiliency. In the second lesson, we will continue with more advanced topics: how Spark executes your applications, how to tune persistence and caching, and what broadcast variables and accumulators are. Then, in the third lesson, we will walk through example Spark applications in Python, so you can learn how to use the framework in your own tasks.

Note that this week focuses on the fundamentals. It is important to grasp them in order to understand how the higher-level tools work. In the next course of the specialization we will cover analytics and machine learning on Spark in much greater detail, but for now let's focus on understanding and internalizing the fundamentals.

A brief overview of Spark's history. The project started in 2009 in the AMPLab at UC Berkeley. Initially, Spark was a research project focused on building a fast in-memory computing framework. Three years later, in 2012, Spark had its first public release. As Spark started to gain traction in the industry, it was no longer just a research project, and to facilitate a better development model and community engagement, Spark moved to the Apache Software Foundation, becoming a top-level project in 2014. In the same year Spark reached version 1.0, and two years later, in 2016, version 2.0.

I find it helpful to think of Spark's development in terms of epochs, where every epoch has its own key ideas to explore and its own goals to achieve.

The first epoch is the inception of Spark. The original development was motivated by several key observations: for many use cases, it was questionable whether MapReduce was an efficient model for computation. First, cluster memory is usually underutilized. Some data sets are small enough to fit completely in cluster memory, while others are within a small factor of it. Given that memory prices decrease year over year, it is economically efficient to buy extra memory and fit the entire data set in memory. Second, there are redundant input and output operations in MapReduce. For ad-hoc tasks it is more important to reduce completion time than to provide durable storage, because ad-hoc queries generate many temporary, one-off data sets that can be quickly disposed of. Third, the framework is not as composable as developers would like it to be. For example, it is tedious to reimplement joins over and over again, and code reuse is complicated, requiring some engineering discipline.

Spark addresses these issues. Many design shortcomings were fixed by introducing an appropriate composable abstraction called the RDD. The RDD abstraction also allowed more flexibility in the implementation and the execution layer, thus addressing the performance issues.
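To make the RDD idea a bit more concrete before the lessons, here is a minimal PySpark sketch of an RDD with one transformation and one action. This is my own illustrative example rather than code from the course; it assumes a local pyspark installation, and the application name and variable names are arbitrary.

from pyspark import SparkContext

# Illustrative settings: run locally on all available cores.
sc = SparkContext("local[*]", "rdd-preview")

# An RDD built from an in-memory collection.
numbers = sc.parallelize(range(10))

# Transformations such as filter and map are lazy: they only describe the computation.
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# An action such as collect triggers the actual execution and returns the result.
print(squares_of_evens.collect())  # [0, 4, 16, 36, 64]

sc.stop()

The distinction between lazy transformations and actions that trigger execution is exactly what the first two lessons will unpack.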
The second development epoch was about integration. The key observation was that users typically had several frameworks installed on their clusters, each used for its own purpose: for example, MapReduce for batch processing, Storm for stream processing, and Elasticsearch for interactive exploration. Spark developers tried to build a unified computation framework suitable for batch processing, stream processing, graph computations, and large-scale machine learning. The effort resulted in the separation of a Spark core layer, consisting of the basic abstractions and functions, and a set of Spark applications on top of the core.

The third development epoch, which is still ongoing, is driven by the wide adoption of Spark in the data science community. Many data scientists use specialized libraries and languages like R or Julia in their everyday work. These tools adopt the relational data model, which we covered in the first week of the course. Spark has embraced the same model in the form of Spark DataFrames, thus enabling smooth and efficient integration with data scientists' tools.

As I mentioned earlier, this week we will focus on the fundamentals. Spark DataFrames and Spark MLlib will be covered in the next course of the specialization, and Spark Streaming will be covered in the course on real-time applications. In the first lesson video of the week, we are going to work through the RDD abstraction. Please continue whenever you feel ready. See you.