In 2010, Google published a paper outlining a system that they had been working on. This publication, about a system called Pregel, has been one of the most influential publications on large-scale graph computing. The Pregel system essentially implemented the BSP model that we covered in the last lecture.

To show how the Pregel system is programmed, we present the published code of Google's most famous algorithm, PageRank. Recall that PageRank's task is to compute node centrality. The basic philosophy of PageRank is that a node that is connected to a more important node gains more importance. In the figure on the right, B is a very important node with a high PageRank because a lot of other nodes directly or indirectly point to it. So C, being a direct neighbor that receives an edge from B, also has a high PageRank, although C itself does not have many incident edges.

On the left side we have Google's published C++ code that implements the PageRank method on the Pregel infrastructure. You don't have to know C++ to understand the basic essence of the code. We will explain the basic elements of the vertex program in the next few slides. Remember from the last lecture that the BSP technique considers a vertex to be a process. Every vertex implements a method called compute. This method implements the logic of what a vertex should do during the supersteps.

The program starts by creating a special kind of vertex called the PageRankVertex, for which this compute method is specified. You will notice that the compute method starts by saying what happens when the superstep is one or more. But what happens in superstep zero? Usually, superstep 0 is used for initialization. Every vertex initializes its PageRank value to the same number, which is one divided by the total number of vertices in the graph. Computationally, the PageRank of a vertex is a number calculated by adding two terms.
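As a concrete illustration, here is a minimal, single-machine Python sketch of such a vertex program. This is not Google's actual C++ code: the class, the tiny synchronous driver, and the 0.15/0.85 split of the two terms follow the PageRank example from the Pregel paper, but all names here are illustrative, and vertex halting is omitted for brevity.

```python
class PageRankVertex:
    """Illustrative Pregel-style vertex; not the real Pregel C++ API."""

    def __init__(self, vertex_id, out_neighbors, num_vertices):
        self.id = vertex_id
        self.out_neighbors = out_neighbors   # ids this vertex points to
        self.num_vertices = num_vertices
        self.value = 0.0
        self.outbox = []                     # (target, message) pairs

    def compute(self, superstep, messages):
        if superstep == 0:
            # Superstep 0: initialize every vertex to 1 / |V|.
            self.value = 1.0 / self.num_vertices
        else:
            # First term depends only on |V|; second term sums the
            # PageRank contributions received as messages.
            self.value = 0.15 / self.num_vertices + 0.85 * sum(messages)
        # Propagate: send value / out-degree along each outgoing edge.
        if self.out_neighbors:
            share = self.value / len(self.out_neighbors)
            self.outbox = [(n, share) for n in self.out_neighbors]


def run_pregel(vertices, num_supersteps):
    """Tiny synchronous driver: run compute, then deliver all messages."""
    inbox = {v.id: [] for v in vertices}
    for step in range(num_supersteps):
        for v in vertices:
            v.outbox = []
            v.compute(step, inbox[v.id])
        inbox = {v.id: [] for v in vertices}
        for v in vertices:
            for target, msg in v.outbox:
                inbox[target].append(msg)
    return {v.id: v.value for v in vertices}


# Toy graph: three vertices in a cycle, A -> B -> C -> A.
vertices = [PageRankVertex(name, outs, 3)
            for name, outs in {"A": ["B"], "B": ["C"], "C": ["A"]}.items()]
ranks = run_pregel(vertices, num_supersteps=5)
```

On this symmetric cycle every vertex keeps a PageRank of 1/3, which makes the fixed point easy to check by hand.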
The first term depends on the total number of vertices in the graph, and is therefore the same every time. The second term depends on the PageRank of the neighbors of the vertex. How does the vertex compute the second term? It gets the PageRank values of the neighbors in its messages and adds them up. As we saw for SSSP, after a vertex computes its value, it goes to the propagate step and sends a message along its outgoing edges. For PageRank, the message it sends out is just its computed value divided by the number of outgoing edges. When this is done, the node halts for the next superstep and waits for some other node to wake it up.

At this point, we have seen two examples of graph analytic operations executed on the BSP programming model. Now we'll look at the same problem from a slightly different viewpoint. GraphLab, originally a project from Carnegie Mellon University, now turned into a company called Dato, took a similar yet different approach to the problem. In GraphLab, any kind of data can be associated with a vertex or an edge. This information is stored in what is called a data graph.

Let's look at the same PageRank algorithm again. The syntax is a bit different, but the logical blocks are identical. These are the same blocks that we highlighted before. GraphLab breaks up these blocks into three different user-specified functions called Gather, Apply, Scatter, or GAS for short.

Okay, so what's different? Let's mention a few important differences. First, let's consider gather. Rather than adopting a message-passing or dataflow model like Pregel, GraphLab allows the user-defined update function complete freedom to read and modify any of the data on adjacent vertices and edges. In GraphLab, a receiving vertex has access to the data on adjacent vertices even if the adjacent vertices did not schedule the current update. In contrast, in Pregel the control is with the sending nodes: an update can happen only when a node sends out messages.
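The GAS decomposition of PageRank can be sketched in Python as three user-supplied functions. These names and signatures are illustrative only, not GraphLab's real API; note that gather reads neighbor data directly rather than receiving messages.

```python
def gather(in_neighbor_values, in_neighbor_degrees):
    # Gather: read the adjacent vertices' data directly (no message
    # passing) and sum each neighbor's rank / out-degree contribution.
    return sum(val / deg
               for val, deg in zip(in_neighbor_values, in_neighbor_degrees))


def apply(gathered, num_vertices, damping=0.85):
    # Apply: combine the constant term with the gathered sum,
    # using the usual 0.15 / 0.85 damping split.
    return (1 - damping) / num_vertices + damping * gathered


def scatter(old_value, new_value, tolerance=1e-4):
    # Scatter: decide whether the change is large enough that
    # neighbors should be signaled to recompute.
    return abs(new_value - old_value) > tolerance
```

For example, a vertex with two in-neighbors of rank 1/3 and out-degrees 1 and 2 gathers 1/3 + 1/6 = 0.5; in a 4-vertex graph, apply then yields 0.15/4 + 0.85 x 0.5 = 0.4625.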
For some graph analytics operations, like a dynamic version of PageRank, this is very important. Further, a vertex can update its value asynchronously, that is, as soon as it receives an update, without having to wait for all nodes as we do in BSP. This often helps iterative algorithms like PageRank converge faster.

When the graph is large and must be split across machines, BSP cuts the graph along edges, as we see on this slide. So every cut edge results in a machine-to-machine communication. If we increase the number of across-machine edges, we increase the communication cost. This has an interesting practical consequence. As we have mentioned, many graph applications have communities and central nodes with high degree; many young people today have over 500 Facebook friends. So for an analytical operation like PageRank, which goes through multiple iterations until convergence, the communication cost on such graphs can become very high.

In the cluster graph shown here, every color represents a different machine. Let's look at that red vertex. What would happen if we split it like this instead? The red node gets split, and different machines work with their own copy of the red vertex. In the diagram, the primary red vertex is marked zero and the copies are marked one through five. Now the gather phase happens for each copy, followed by a second operation from the copies to the primary red vertex. This is followed by a new user-defined operation called merge that combines the partial results from the copies at the primary vertex. So for PageRank, the merge operation boils down to summing the partial sums of the edge contributions computed at the copies.

Thus, in this lesson we have seen two of the most influential paradigms of large-scale graph computation, where the numbers of nodes and edges run into tens of millions and more.
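To recap the vertex-cut idea in code: each machine runs a partial gather over only the in-edges of its local copy of the split vertex, and a user-defined merge sums the partials at the primary copy before apply runs. This is a hypothetical Python sketch, not GraphLab's actual API; the edge values and graph size are made up for illustration.

```python
def local_gather(local_in_edges):
    # Partial gather on one machine: sum rank / out-degree over
    # only the in-edges stored locally with this copy.
    return sum(rank / out_degree for rank, out_degree in local_in_edges)


def merge(partial_sums):
    # Merge at the primary copy: for PageRank this is just
    # summing the partial sums from all copies.
    return sum(partial_sums)


def apply_rank(merged_sum, num_vertices, damping=0.85):
    # Apply runs once, at the primary, on the merged total.
    return (1 - damping) / num_vertices + damping * merged_sum


# Three machines each hold a copy of the split red vertex and gather
# over their local (neighbor_rank, neighbor_out_degree) edges.
partials = [local_gather(edges) for edges in [
    [(0.2, 2), (0.1, 1)],   # machine 0 (primary copy)
    [(0.3, 3)],             # machine 1 (copy)
    [(0.4, 4)],             # machine 2 (copy)
]]
red_rank = apply_rank(merge(partials), num_vertices=100)
```

Here the partial sums are 0.2, 0.1, and 0.1, so merge yields 0.4 and the red vertex's new rank is 0.15/100 + 0.85 x 0.4 = 0.3415; only the small partial sums cross machine boundaries, not one message per cut edge.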