The main goal is this: how do you write large-scale parallel and distributed applications that have scalable performance? By scalable performance, what we mean is that the application could be running on a single multi-core machine, or it could be running on a cluster with thousands of machines. The other thing that you have to remember is that the programmers writing these applications are not experts in distributed systems or parallel programming, right? They're domain experts; they could be a machine learning person, they could be a vision person. They are not necessarily experts in writing distributed programs. Even for us it's hard [LAUGH], right, writing distributed programs. So the programming model has to be very simple.

And the key thing is this: when you take an operating systems course from me, for instance, I'll teach you how to build fine-grained concurrency control mechanisms, synchronization mechanisms, communication mechanisms, and so on. But when you are developing applications on these large clusters, the developer has to be freed from worrying about fine-grained concurrency control of the application components. For instance, things like threads and locks and barriers are not something you want to burden the developer with.

So the model that has become very powerful and is being used quite a bit is a dataflow graph model of the application. The application components, which are expressed as subroutines, are the vertices of a dataflow graph, and the edges denote the communication among the components. And we want to exploit data parallelism, meaning that you are doing similar operations, or the same operation, on a large corpus of data, so the same component can be applied to the different data sets in parallel. What we want to do is require that the programmer be explicit about the data dependencies in the computation, and let the system, that is, you guys, right, we are the system builders, so let the system worry about distribution and scheduling of the computation, respecting the data dependencies that have been put forth by the developer.

You've all probably heard of the term automatic parallelization, right? Automatic parallelization is taking a sequential program and trying to parallelize it automatically, and that's very hard. In the beginning of computing, you wrote sequential programs. And then when parallelism came about, we wanted to say, well, you write your sequential program and we'll parallelize it for you. It's easier said than done, right? Automatic parallelization of arbitrary sequential programs is still not something we know how to do well. And so therefore, what we do instead is make the developer-supplied application component the unit of scheduling and distribution. We are not trying to parallelize within that unit; what we are saying is that we can have multiple copies of this unit running in parallel. And this is sometimes referred to as embarrassingly parallel, right? You are running similar computations on a whole bunch of data sets. So that's sort of the philosophy behind how you develop programming models that will run on large-scale data centers.
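To make that division of labor concrete, here is a minimal sketch, not the API of any particular framework. The function names parse and count_words are made up for illustration: they stand in for developer-supplied components (the vertices of the dataflow graph), the ordering of the two map calls encodes the data dependency between them (the edge), and Python's multiprocessing pool stands in for the cluster runtime that handles distribution and scheduling:

```python
from multiprocessing import Pool

def parse(record):
    """Developer-supplied component: a vertex of the dataflow graph."""
    return record.strip().lower()

def count_words(line):
    """Second component; consumes the output of parse (the edge between them)."""
    return len(line.split())

if __name__ == "__main__":
    data = ["Hello World", "  MapReduce at scale  ", "dataflow graphs"]
    with Pool() as pool:
        # The runtime, not the developer, runs copies of each component
        # in parallel across the data items: the "multiple copies of the
        # unit" idea. No threads, locks, or barriers appear in the
        # developer's code.
        parsed = pool.map(parse, data)
        counts = pool.map(count_words, parsed)
    print(counts)  # [2, 3, 2]
```

The point of the sketch is that the developer only writes the pure components and the dependency between them; replicating each component across the data partitions is entirely the runtime's job.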
And one of the things that you have to worry about in such programming models is that failure is a fundamental feature, if you will, of data center applications, right? It's not a question of if a failure will happen; it's only a question of when. And failures can come in many different forms. Despite these failures, whether they are in the computational elements or in the networking fabric, you want the ability to achieve a deterministic computation for the applications that have been developed by domain experts (the sketch at the end of this section illustrates why determinism makes recovery simple). So that sets the stage. And what we're going to do in this lecture is look at some of the programming frameworks that have been proposed: MapReduce, Dryad, Spark, Pig Latin, Hive, and Apache Tez. The first three we'll look at in some detail, and the others I will just mention in passing.
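Coming back to that determinism point: here is a minimal sketch of why it matters for fault tolerance. Because each task is a pure function of its input partition, the system can recover from a worker failure simply by re-running the task, and the final answer comes out the same. All the names here (WorkerFailure, execute_on_some_worker, run_with_retries) are hypothetical, invented for illustration; real frameworks build this idea into their schedulers.

```python
class WorkerFailure(Exception):
    """Stands in for a crashed machine or a lost network connection."""

_calls = 0

def execute_on_some_worker(task_fn, partition):
    """Simulated worker pool: the first execution fails, later ones succeed."""
    global _calls
    _calls += 1
    if _calls == 1:
        raise WorkerFailure("worker died mid-task")
    return task_fn(partition)

def run_with_retries(task_fn, partition, max_retries=3):
    """Recover by re-executing the task: safe because the task is a
    deterministic function of its input, so running it again on the
    same partition yields the same output."""
    for _ in range(max_retries):
        try:
            return execute_on_some_worker(task_fn, partition)
        except WorkerFailure:
            continue  # just reschedule the task on another worker
    raise RuntimeError("task failed on every retry")

print(run_with_retries(sum, [1, 2, 3, 4]))  # prints 10 despite the failure
```

If tasks had side effects, or produced different results on re-execution, this simple retry strategy would not give a deterministic answer; that is exactly why these programming models insist on explicit data dependencies and side-effect-free components.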