There are four systems we will look at, not in any great detail, but just to give you the salient points of each one. There are many more details in the papers that I've put in the resources. First we'll start with Mesos, which is a fine-grained resource scheduler. When you look at resource sharing in the cloud, the challenge comes from the fact that applications are not all written in one particular programming framework; they use a variety of frameworks. We've seen a variety of programming frameworks in this class a couple of lectures back: MapReduce, Dryad, Spark, and Pregel, which is a graph processing engine. So these are all different programming frameworks in which you may be developing applications, but the reality is that these apps, written in different frameworks, need to share the data center resources simultaneously. They're all running at the same time, and they're all asking for dibs on the machines. A data center, we know, has thousands of nodes if not more. These days data centers start at about 100,000 nodes. So that's the number of nodes that you have in a data center. There are hundreds of jobs running at any point in time, and each job has a number of tasks. So we're talking about millions of short-duration tasks that have to be mapped onto the resources of a data center. That's the job of the resource manager at a data center. So the key, from a data center's perspective, is this: you as an individual may worry about the completion time of your program, but from the point of view of the data center operator, they want to maximize utilization while meeting the quality-of-service demands of the applications. You want to make sure that the customers are happy, but at the same time, you want to run your data center at 100 percent utilization if you could. The traditional approach to resource sharing is along the lines of what I just showed you in the fair-share scheduling that Hadoop uses.
One of the things that is typically done in a data center is statically partitioning the cluster. So if I have nine machines, for instance, and three applications that want to use the resources (a Hadoop application, a Pregel application, and an MPI application), I might say, okay, each one of them gets three of these nine machines. If you look along the time axis over the course of the execution, depending on what is going on in the application, it may be using the resources very efficiently for some amount of time, and maybe there's a dead time when there is no activity at all because the application is doing I/O. There may be another situation where there is a spurt of usage. The same pattern may repeat for each one of these applications. If you look at the utilization, it may never go past a certain level. This is a cartoon; it's not real data or anything like that, but it's a cartoon I picked up from the Mesos paper, and it says that at any point in time the resource usage for each of these applications may peak at about 33 percent and never more than that. If that is the case, then a better idea would be to ask: what are the requirements of all these applications, and can we do a global schedule over all of those requirements? We know that each one of these applications has a bunch of tasks; if I know all of those tasks and what their requirements are, I can do better scheduling. So one thought is: can we have a global schedule where the inputs are the requirements of the frameworks (how many tasks there are, what their durations are, what their resource requirements are, and so on), the resource availability that we have, and the policy we want to use, meaning what kind of fair-sharing policy, so that we maximize resource utilization as well as meet the quality-of-service requirements, and come up with a global schedule of all the tasks.
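To make the global-schedule idea concrete, here is a minimal sketch in Python. The `Task`, `Node`, and `global_schedule` names are hypothetical, invented for illustration; this is not a real scheduler, just a toy first-fit packer over the inputs the lecture lists (task requirements and resource availability, with the policy reduced to "first fit"):

```python
from dataclasses import dataclass

# Hypothetical, simplified types for illustration only.
# A task needs some CPUs and memory; a node has some CPUs and memory free.
@dataclass
class Task:
    framework: str
    cpus: int
    mem_gb: int

@dataclass
class Node:
    cpus: int
    mem_gb: int

def global_schedule(tasks, nodes):
    """Toy 'global scheduler': first-fit every task onto a node.

    Even this naive version scans (task, node) pairs, and an optimal
    placement is NP-complete, which hints at why a single global schedule
    does not scale to ~100,000 nodes and millions of tasks.
    """
    placement = {}  # task index -> node index
    for i, t in enumerate(tasks):
        for j, n in enumerate(nodes):
            if n.cpus >= t.cpus and n.mem_gb >= t.mem_gb:
                n.cpus -= t.cpus       # reserve the resources on that node
                n.mem_gb -= t.mem_gb
                placement[i] = j
                break
    return placement

tasks = [Task("hadoop", 2, 1), Task("mpi", 1, 2), Task("pregel", 2, 2)]
nodes = [Node(4, 4), Node(2, 2)]
print(global_schedule(tasks, nodes))  # -> {0: 0, 1: 0, 2: 1}
```

The point of the sketch is only to show what the inputs to such a global scheduler would look like; the lecture's next point is that computing a good placement at data-center scale is exactly what makes this approach impractical.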
So this could be one approach, but I already mentioned that optimal scheduling is an NP-complete problem even for small numbers of tasks and machines. Doing a global schedule like this is very hard. Now we're talking about a scale of 100,000 machines and thousands of jobs at the same time, so it is not scalable to the size of the cluster and the number of tasks. It's also difficult to implement and monitor in the presence of failures, which are a given in data centers. So given all of this, Mesos' goal can be stated in this fashion: they want to do dynamic partitioning. You want to think of this as a shared cluster instead of a dedicated set of machines for each one of these applications, and dynamic partitioning will lead to high utilization. The idea is that if you look at these cartoon charts, there are gaps when not all of the resources are being used, and that's a good time to give the resources to some other application that needs more. So ideally, what we want to get to is a graph where resource utilization is close to 100 percent. What you're doing is using the gaps in the resource usage of these different applications so that we can maximize the computation time, or the dominant-resource time, for each application. An auxiliary goal is also to see whether we can actually reduce the running time of each of these applications. So those are some of the goals, from a data center's point of view, and I have to be honest about all these resource managers: their focus is not so much on the application running to completion first; their focus is more on trying to maximize utilization, because that is paramount from the point of view of a data center. So let's look at what the Mesos approach is. I'll give you a very simple analogy.
I have 12 colleagues that I'm taking to a restaurant, and I want to get three tables to seat all of them. I can tell the restaurant owner I have 12 people and let them seat us any way they want, or the restaurant owner can tell me, "Look, here are three tables. You decide who goes where." I might know which colleagues get along with which others, and I can make the assignment of which colleague sits at which table. So those are two different approaches. In some sense, the Mesos approach is exactly the second one. Mesos is going to say, "Here are resources that are available. I don't know how you want to use them; I'm going to give them to you, and then you figure out how to use them." That is what is called an offer. In some sense, Mesos is like a simple, thin micro-layer that sits between the resources at the bottom and the applications at the top. One of the things they want to do is recognize that all applications are not the same: you have a Hadoop application, an MPI application, and so on. Each application is different in terms of how its tasks interact with one another, what data locality they have, and how much sharing there is. Those things are known to the frameworks' own schedulers and not to the resource manager itself. Therefore, they want to push the allocation of specific tasks to the computational elements up to each one of these frameworks. So that's the idea behind it. So we've got the resources here, and the allocation module is getting information from the slaves. When I say slave, what I mean is a machine with a certain amount of resources, and it is saying, "Well, I'm done with whatever I was running, and these are the resources that are available in terms of CPU, memory, and so on." There is a thin layer, which is the Mesos layer, and it has an allocation module.
What the allocation module is going to do is say, "Well, these are the resources that became available; now who do I give them to?" It's like the restaurant manager deciding, if both Gustav and I bring a party and each of us needs a certain amount of resources, that this time the tables go to Gustav, and another time they go to [inaudible], and so on. So the allocation module's job is to pick the framework to offer the resources to. It's going to say, "Well, this is the resource offer I have, and I may give it to, let's say, this guy." That framework might then decide how to map the available resources to its computations. I'll give you an example of that. The other thing that can happen, once again going back to my analogy: maybe I go there at rush hour, when the restaurant is really busy. They cannot give me three tables all at the same time. They may decide to give me one table at a time and say, "I have one table, do you want to use it?" I could say, "Yes, these four guys, I can put them at this table," or they may say, "Well, I have another table which seats only two people," and I might say, "Oh no, that's not going to work, because I need room for at least a group of four." So when an offer comes up to these frameworks, they have the opportunity to decide: yes, I want to take it, or no, that doesn't suit my needs right now. So that is something you can do at the level of the schedulers. So here is an example. The slave reports that it has four CPUs and four gigabytes of memory available. The allocation module looks at this and says, "Okay, this particular resource, I'm going to give to this framework." So that is what is happening here. An offer is made to framework 1, and framework 1 says, "Okay, I'll take that offer, which has four CPUs and four gigabytes of memory.
I'm going to say that two CPUs and one gigabyte of memory are going to be given to one task, and for the second task, I'm going to give one CPU and two gigabytes of memory." So that's the allocation it gives back to the Mesos master, which can then take that and map it onto the slave.
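The offer cycle just described can be sketched in a few lines of Python. This is a hypothetical sketch, not the real Mesos API; the `Framework` and `allocation_module` names are invented for illustration. A slave reports free resources, the allocation module offers them to a framework, and the framework either rejects the offer or carves it into per-task allocations:

```python
# Minimal sketch of the Mesos-style offer cycle (hypothetical names,
# not the real Mesos API).

class Framework:
    def __init__(self, name, pending_tasks):
        self.name = name
        # Each pending task is a (cpus, mem_gb) requirement.
        self.pending_tasks = pending_tasks

    def on_offer(self, cpus, mem_gb):
        """Accept the offer by packing tasks into it, or reject (return [])."""
        launched = []
        for t_cpus, t_mem in list(self.pending_tasks):
            if t_cpus <= cpus and t_mem <= mem_gb:
                cpus -= t_cpus
                mem_gb -= t_mem
                launched.append((t_cpus, t_mem))
                self.pending_tasks.remove((t_cpus, t_mem))
        return launched  # empty list means "this offer doesn't suit me"

def allocation_module(offer, frameworks):
    """Pick a framework for the offer (here, simply the first one that accepts)."""
    for fw in frameworks:
        launched = fw.on_offer(*offer)
        if launched:
            return fw.name, launched
    return None, []  # nobody wanted the offer

# The example from the lecture: the slave reports 4 CPUs and 4 GB free,
# and framework 1 carves the offer into two tasks:
# (2 CPUs, 1 GB) and (1 CPU, 2 GB).
fw1 = Framework("framework1", [(2, 1), (1, 2)])
print(allocation_module((4, 4), [fw1]))
# -> ('framework1', [(2, 1), (1, 2)])
```

Note that the decision of *which framework* gets the offer lives in `allocation_module` (the Mesos layer), while the decision of *which tasks* consume the resources lives in `on_offer` (the framework's own scheduler), which is exactly the split the lecture describes.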