So if you consider the programming challenges for data centers, there are two things you have to worry about. The first is fault tolerance, because as we've said, it's not a question of if a failure will occur, but when it is going to occur. MapReduce's approach is to use stable storage for intermediate results, so that even if a failure occurs, you can always go back to stable storage and retrieve them. It also makes the computations idempotent. What does that mean? It means that if you repeat the same computation, it produces the same result, so you can discard any partial result and use the one that was just recomputed. These are the two tricks MapReduce uses to provide fault tolerance. The idea is that each mapper writes its intermediate results to disk and the reducers take them from there. Even if a mapper fails in the middle, you spawn a new one, it writes to disk, and the reducers pick it up from there. The mappers write to distinct local disks, and that way the master knows where to find each piece of the computation result generated by map.

Now, the cons of this approach. First, disk I/O is expensive; we all know that. As you can see in this picture, each of the reducers has to reach out to disk to pull the intermediate results, and it has to do that for every instance of the map program from which it needs results. That's one deficiency. The other is that it inhibits efficient reuse of intermediate results across different computations. What that means is that the data is being pulled from disk into memory, and if it were already available in memory, it would be much cheaper to get it from there. After all, all of these reducers are reading the same thing; if it were in memory, it would be much faster. So reuse is not being taken advantage of when you always pull things from disk. These are the two cons of this particular approach to fault tolerance.

Now, the design principle in Spark is to shoot for both performance and fault tolerance. There is always this desire to have both: you want fault tolerance, but you also need performance. If you want performance, you want to keep the intermediate results in memory. That way, first of all, it is fast, because you are writing to memory, and there is an opportunity for reuse, because if the result is in memory, then all the consumers can pull it from memory rather than going to the I/O subsystem. But if you do that, then when there is a failure you have to worry about how to address fault tolerance, so you need some efficient way of providing fault tolerance for in-memory intermediate results. Those are the two things Spark's design sets out to provide. The secret sauce of Spark is a data structure called the Resilient Distributed Dataset, or RDD for short. And just as in the MapReduce framework, the intermediate results are immutable: everything is written exactly once, but it can be read by several consumers, any number of times.
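To make the reuse point concrete, here is a minimal PySpark sketch; the input file name ("pageviews.txt") and its record format are hypothetical, not from the lecture. The intermediate RDD is cached in memory once and then consumed by two different downstream computations, instead of each consumer going back to disk the way every reducer re-reads map output in MapReduce.

```python
# Hypothetical sketch: an in-memory intermediate result reused by two
# consumers. Assumes a local Spark installation and a text file
# "pageviews.txt" with lines of the form "<url> <count>".
from pyspark import SparkContext

sc = SparkContext("local[*]", "reuse-sketch")

# Map-side work: parse each line into a (url, count) pair.
pairs = (sc.textFile("pageviews.txt")
           .map(lambda line: line.split())
           .map(lambda f: (f[0], int(f[1]))))

# Keep the intermediate result in memory so both consumers below
# can reuse it without another trip to stable storage.
pairs.cache()

# Consumer 1: total views per url.
totals = pairs.reduceByKey(lambda a, b: a + b).collect()

# Consumer 2: number of distinct urls.
distinct_urls = pairs.keys().distinct().count()

print(totals[:5], distinct_urls)
```

Without the cache() call, the second consumer would force the textFile and map steps to be re-executed from disk, which is exactly the reuse cost the lecture is pointing at.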
So immutable intermediate results are exactly the same as in the MapReduce framework. The trick to fault tolerance is logging the lineage of the resilient distributed datasets. What do I mean by that? Well, say the original data is in stable storage, and a sequence of transformations takes you from there to an intermediate result. What we record is that the new RDD, RDD2, is generated by applying two transformations: T1 on RDD1, and then T2 on that result. That is what we record in stable storage: how you generate new RDDs from old RDDs. We are not checkpointing the data itself, only the transformations that take you from one stable version to an intermediate version. Once you do that, if there is a failure, you regenerate the RDD using this lineage. Remember, this is a distributed computation, so maybe most of the pieces are fine and only one piece is missing, because the processor that was executing T1 or T2 failed. If that happens, you rerun T1 and T2 to regenerate just the piece of RDD2 that was missing. So everything stays in memory, the only thing kept in stable storage is the lineage, the transformations you apply to go from one RDD to another, and only the missing portion needs regeneration. That way you get both performance and fault tolerance. You also get reuse, because the same RDD can be consumed by multiple consumers, and that is typical of these computations: the intermediate results are used by several subsequent stages. Take the MapReduce framework: the output of the mapper is needed by all the reducers, so everybody reuses it. That is the idea behind the Resilient Distributed Dataset.

The nice thing about Spark is that it has a generality which may not be readily apparent, and I want you to look at some of the resources I've provided to get more information on Spark, because it unifies many current programming models. The data flow models we know about, like MapReduce, Dryad, and SQL, can all be reprogrammed as Spark instances. There have also been several specialized models proposed for specific applications, such as batched stream processing, iterative MapReduce, and iterative graph applications; all of those can easily be programmed in the Spark model. That's the nice thing about the Spark approach. Spark was developed originally at UC Berkeley, and a lot of new data center applications use the Spark framework rather than the MapReduce framework. Of course, MapReduce has been around for a long time and there are a lot of legacy applications that use it, but Spark is the new winner in terms of delivering both fault tolerance and performance.
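Here is a minimal sketch of what that lineage bookkeeping looks like in PySpark; the file name "events.txt" and the particular transformations standing in for T1 and T2 are hypothetical. Spark records that rdd2 is obtained from rdd1 by applying T1 and then T2, and toDebugString() prints that recorded recipe; if a partition of rdd2 is lost, only that partition is recomputed from rdd1 using the lineage.

```python
# Hypothetical sketch of RDD lineage: rdd2 is defined by two
# transformations (T1, then T2) over rdd1, which is backed by stable
# storage. Spark logs this recipe rather than checkpointing rdd2.
from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-sketch")

rdd1 = sc.textFile("events.txt")          # RDD1: lives in stable storage

rdd2 = (rdd1
        .map(lambda s: s.lower())         # T1: normalize case
        .filter(lambda s: "error" in s))  # T2: keep only error lines

# The lineage (the T1/T2 recipe) is what gets recorded; a lost
# partition of rdd2 is rebuilt by replaying that recipe on rdd1.
print(rdd2.toDebugString())

print(rdd2.count())                       # action: triggers the computation
```

Nothing is computed until the count() action runs; up to that point rdd2 is only a description of how to get from RDD1 to RDD2, which is what makes it possible to regenerate just the missing piece after a failure.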