[MUSIC] Hello, my name is Mike and I welcome you to Eindhoven University of Technology and to this MOOC on I/O efficient algorithms. I/O efficient algorithms are designed to operate efficiently on very large data sets. And in this MOOC, you will learn some of the sentimental theory behind such algorithms. [MUSIC] Are you coming? [MUSIC] Welcome to this MOOC on I/O efficient algorithms. In this MOOC, we're going to learn how to analyze the I/O behavior of algorithms. But before we get into that, I want to start with an example that illustrates why this is actually very important. Suppose that we want to do something very simple as computing the average elevation in a mountain region. Let's say the Alps. And suppose we have, as is the case nowadays, a very detailed model of the Alps. So here you see a nice picture. And the way these models often work is as follows. You simply have a 2-dimensional array. And for each cell in the array, that corresponds to a small region in your mountain area. So for each cell in the array, you have stored the height of the region at that particular location. Okay, so now suppose we have such a model for the Alps and we want to compute the average elevation. Well, pretty simple, right? So here's an algorithm. We simply go over all the cells in our array one-by-one, and we add all the elevations, and in the end, we divide by the total number of cells. And this would give us the average elevation. And the way the algorithm that I've written here works is that it goes over all the cells row-by-row, right? So first we look at all the cells in the first row, second row and so on, and as we go over the cells, we add their elevations. Okay, pretty simple. So now let's change the algorithm a little bit, and instead of going over the array row-by-row, let's go over the array column-by-column. Okay, so it means that we simply have to switch these two loops in steps two and three. Okay, so instead of row by row, we go over the array like this. Again, you simply add all the values and in the end divided by the total number of cells. Well, both correctly compute the average elevation and you would say well, it doesn't really make any difference whether you the first algorithm or the second one, right? Wrong, if the number of cells is very large, then it can actually make a huge difference which algorithm you use. And after taking a few lessons in this MOOC, you will understand why this is the case. But before we do that, let me first explain a little bit more about in which particular situations this can make a big difference. So here we have our digital elevation model of let's say the Alps. And maybe let's say that we have an area which is around 200 by 100 kilometers wide. Okay, and let's say that our elevation model is pretty precise. We have per square meter for each square meter. We know the elevation. Okay, so we have a grid whose resolution is 1 meter. And that's indeed the resolution that you get nowadays from techniques for measuring elevation like lidar or even satellites nowadays have very high precision. And let's also assume that to store the elevation of one particular cell, we're going to use 8 bytes. Also a standard number. Then what does it mean? It means that this data set has size 160 gigabyte. Okay, and that's not going to fit in to the main memory of let's say your laptop. Okay, so instead it's going to be stored on the hard disk of your laptop. Well, what does it mean if the data is stored on the hard disk? Then actually most of the time spent by your algorithm is not on, well, adding elevations and things like this, but it's actually on fetching the data from the hard disk that you need and maybe writing the data back after you've used it. So the time is dominated by the number of I/Os rather than by the CPU computation time. And that's why you need to take these I/Os into account. And in the example we just saw, it could be that if you compute it row by row, the whole computation runs in a few seconds, but if you do it column by column, then it will take several hours. Well, pretty surprising maybe, but also it clearly shows how when your data set is so large that it does not fit into the computer's main memory, but they're stored on disk, why the I/O behavior is really crucial. And you really need to understand how the design of your algorithms influences the number of I/O. And that's what we're going to see in this MOOC. So what we're going to do in particular is well, first of all, we're going to give, you could say an abstract model of a machine, so of the memory and the computing unit. And we're going to see how to analyze the number of I/Os in this particular model. And then after this introduction to the model, we're going to look at the number of basic results and algorithms in this area. So how can you for, well, relatively simple problems design algorithms that are I/O efficient, that do relatively few I/Os. So that's the plan for this course. [SOUND]