Basically, in the MapReduce programming model, the input and output of each of the two functions, map and reduce, appear as key-value pairs. So for instance, let's take a simple, fun example. Let's say I have a corpus of documents, and in that corpus I'm looking for specific names. Let's say I want to see how many times the word Kishore appears in the documents, how many times Arun appears, how many times Drew appears. That's the program. You can immediately see that this is an embarrassingly parallel computation, because you have a corpus of documents and you're looking for specific things in each of them.

We want to express the input to the functions as key-value pairs. So the input dataset consists of a keyspace, which is the file name, and the value, which is the contents of the file. That's the input. You have several instances of this key-value pair: file 1 with its contents, file 2 with its contents, and so on.

What the map function is doing is looking for the specific names that I identified. So for instance, one instance of map takes f_1 and the contents of f_1 and looks for how many times each of these three names occurs in it. The same thing is being done by every other map instance. They are all instances of a simple procedure that the programmer has written, and he or she doesn't know that it is going to be parallelized in any shape or form. What we're doing is farming out several instances of map that take these contents and then emit key-value pairs as the output.

So what is the key-value pair being put out as the output? Well, the key is the unique name being looked for, and the value is the count. So the interaction between every component is in terms of key-value pairs; it is just that the input key-value pair is different from the output key-value pair that comes out of the map function. What the map function is saying is, "I found Kishore so many times, I found Arun so many times, and I found Drew so many times." Those are the outputs coming out of each of the mappers. That's the function of the mapper in this simple example: looking for these unique names.

The purpose of the reduce function is to act as an aggregator for the values produced by the mappers. So for instance, one reducer gets the inputs corresponding to all the Kishores found by the different mappers, another gets all the Aruns, and another gets all the Drews. Each one aggregates those counts and produces the total as its output. You can see that the key-value pair generated as the output of the reduce function has the same form as the output of the map function.

So that's essentially what the MapReduce programming framework is asking the programmer to do. Why MapReduce? Well, several processing steps in giant-scale services can be expressed as MapReduce computations, and all the domain expert is asked to write are two functions: a map function and a reduce function. They know exactly what the semantics of those functions are, and they know that the input to the map function is a key-value pair and the output is a key-value pair, which serves as the intermediate result for the reduce function to work on. Here is another example that you may all be familiar with: computing the rank for pages.
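To make the shape of the two functions concrete, here is a minimal Python sketch of the name-counting example. The function names, the shuffle step, and the driver loop are my own illustration, not the actual MapReduce API; the point is only that map consumes (file name, contents) pairs and reduce aggregates (name, count) pairs.

```python
from collections import defaultdict

TARGET_NAMES = ["Kishore", "Arun", "Drew"]

def map_fn(filename, contents):
    """Map: (file name, file contents) -> list of (name, count) pairs."""
    words = contents.split()
    return [(name, words.count(name)) for name in TARGET_NAMES]

def reduce_fn(name, counts):
    """Reduce: (name, [counts from all mappers]) -> (name, total count)."""
    return (name, sum(counts))

def run(dataset):
    """Illustrative stand-in for the runtime: run mappers, group the
    intermediate pairs by key (the shuffle), then run the reducers."""
    intermediate = defaultdict(list)
    for filename, contents in dataset.items():        # one map task per file
        for name, count in map_fn(filename, contents):
            intermediate[name].append(count)
    return dict(reduce_fn(name, counts) for name, counts in intermediate.items())

if __name__ == "__main__":
    corpus = {
        "f1": "Kishore met Arun and Drew here. Kishore spoke first.",
        "f2": "Arun replied to Kishore later.",
    }
    print(run(corpus))   # {'Kishore': 3, 'Arun': 2, 'Drew': 1}
```

In a real deployment the driver loop is replaced by the runtime, which runs the map tasks on different machines and moves the intermediate pairs to the reducers; the programmer still only writes the two small functions.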
Page ranking is a very important algorithm that runs on web content. In this case, the keyspace is a set of URLs and the value is, of course, the content of a particular URL. What the mapper is doing is looking for a specific target. Maybe I'm interested in knowing how many webpages link to my website, so that's the target. Maybe Vishnu is interested in knowing how many webpages link to his website. So the target can be my webpage, the target can be Gustav's webpage, and so on. The mapper analyzes the content of a source URL and says, "This particular target appeared in the content of this source URL," and emits that as a hit.

What the reducer is doing is gathering, for a given target, all of the hits found across the corpus of documents that have that URL embedded in them, and producing the final source list: "For this particular target, these are all the sources that had this URL pointing out of them." That is the basis for generating the ranking of pages. Again, you can see that it is embarrassingly parallel.

All that the domain expert has to do is write these two simple functions, map and reduce. All the details of scheduling and plumbing, instantiating the right number of mappers and reducers, and effecting the data movement between the mappers and the reducers, all of that is going to be done by the runtime. That's the MapReduce model. So when you're thinking about the developer versus the system builder, we're liberating the developer from having to worry about all the distribution and the plumbing. Now you've learned the model, and as a system developer you're going to go off and figure out how to effect the orchestration of it. That's what I'm going to talk about next.

In summary, in the MapReduce model the developer's responsibility is providing the input dataset and the map and reduce functions. That's it. The system runtime responsibility, which is what you have to take on, is to take this input dataset and shard it. "Shard" is a term that has come up in data center applications: sharding the data means taking a dataset and breaking it into slices that you can hand to different mappers. That's something the system has to worry about. Then you may be using a distributed file system for communication between the mappers and the reducers. That's part of what goes on. This is your system responsibility in terms of making the MapReduce model work correctly.

One question is, obviously, the input dataset is something that the developer has to give, so whose responsibility is it to decide how to shard it? There could be a default policy implemented in the system for how to shard the data. In principle, any sharding technique is perfectly fine, because all you have is a corpus of data provided as key-value pairs, so I can break it up however I like. If I have 10,000 key-value pairs, I can break them into 10 chunks of 1,000 each, or 100 chunks of 100 each. All of that is up to me, meaning the system developer, if you want. Or you can give the user, the developer, the ability to say, "If you have some way you want to shard the data, tell me and I'll apply that." Both of these are possible in data center applications.
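Here is the same idea for the reverse-link example, again as a hedged Python sketch rather than any real system's API. The target URLs, page contents, and helper names are made up for illustration; the mapper emits a (target, source) pair for every hit, and the reducer collapses all hits for a target into its source list.

```python
from collections import defaultdict

# Hypothetical target URLs whose incoming links we want to find.
TARGETS = ["http://kishore.example.edu", "http://gustav.example.edu"]

def map_fn(source_url, contents):
    """Map: (source URL, page contents) -> (target URL, source URL) for every hit."""
    return [(target, source_url) for target in TARGETS if target in contents]

def reduce_fn(target, sources):
    """Reduce: (target URL, [source URLs from all mappers]) -> (target, source list)."""
    return (target, sorted(set(sources)))

def run(pages):
    intermediate = defaultdict(list)
    for source_url, contents in pages.items():         # one map task per page
        for target, source in map_fn(source_url, contents):
            intermediate[target].append(source)
    return dict(reduce_fn(t, srcs) for t, srcs in intermediate.items())

if __name__ == "__main__":
    pages = {
        "http://a.example.com": 'See <a href="http://kishore.example.edu">Kishore</a>.',
        "http://b.example.com": "http://kishore.example.edu http://gustav.example.edu",
    }
    print(run(pages))
    # {'http://kishore.example.edu': ['http://a.example.com', 'http://b.example.com'],
    #  'http://gustav.example.edu': ['http://b.example.com']}
```

The source list that each reducer produces is the raw material a ranking step would then consume; the ranking computation itself is outside this sketch.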
It also depends on how many resources you've got to run your computation. If I shard a 10,000-element dataset into 10 chunks, maybe my intention is to have 10 mappers, each working on one of the shards. These are the kinds of considerations that go into how you shard the data.
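As a rough illustration of a default sharding policy the system side might implement, here is a small sketch, assuming the simple chunk-per-mapper split described above; it is one plausible default, not the policy of any particular framework.

```python
def shard(pairs, num_mappers):
    """Split a list of (key, value) pairs into num_mappers roughly equal chunks."""
    chunk = (len(pairs) + num_mappers - 1) // num_mappers   # ceiling division
    return [pairs[i:i + chunk] for i in range(0, len(pairs), chunk)]

# 10,000 key-value pairs and 10 mappers -> 10 shards of 1,000 pairs each.
dataset = [(f"f{i}", f"contents of file {i}") for i in range(10_000)]
shards = shard(dataset, num_mappers=10)
print(len(shards), len(shards[0]))   # 10 1000
```

A system could expose this as the default and still let the developer supply their own partitioning function when they know something about the data that makes a different split better.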