Now that we've studied the details of point-to-point communication with send, receive, isend, and irecv, we will step up to collective communication, which involves multiple ranks, as opposed to point-to-point communication, which involves only a pair of ranks doing a send and a receive. Again, the model is that we have ranks, which are MPI processes executing in a data center connected by a network. What if R0, for example, wants to send a value X to all ranks? Given what we've learned about send and receive, we could figure out a way to do it: we could create a loop in which R0 repeatedly sends X to ranks R1, R2, R3, and so on, and place a single receive in each of R1, R2, R3. This may be acceptable with a very small number of ranks, but with 1,000, 10,000, or 100,000 ranks we would create a sequential bottleneck in R0, because R0 would process each send one at a time, even though the network connecting the ranks is capable of far more communication bandwidth than that.

So instead we use collective communication. We still have the SPMD model, so all the ranks execute the same main program and perform the same initialization. Say each rank allocates an int array X, and if the rank is 0, X[0] is initialized to some value that we want to communicate to all the others, say 99. Now, how do we avoid iterating sequentially through multiple sends? MPI provides a collective communication call for this; the one we want is broadcast, abbreviated BCAST. As with other MPI calls, we specify the buffer we want to send, the starting offset, and the length: X, 0, 1. We also specify the data type, MPI.INT, since these are integers. One very important parameter for many of these collective calls is the root. By passing 0 as the root, we are saying that the value in X[0] at rank 0 is to be sent to all the other ranks. The call to BCAST must be made in all the ranks, somewhat like a barrier, and once it completes, ranks R1, R2, R3 will all have received the value of X[0]. So if every rank printed its rank and X[0], you would see one print statement from each rank, and each would show the value 99 because of the broadcast. This is not only convenient for the programmer, it is also much more efficient than doing the sends sequentially yourself, because the broadcast call is implemented to exploit the full parallelism of the machine and the network.
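To make this concrete, here is a minimal sketch of the broadcast example, assuming an mpiJava-style Java binding such as MPJ Express, where Bcast takes the buffer, offset, count, datatype, and root. The class name is illustrative, and the initialization to 99 matches the value used above.

    import mpi.*;

    public class BcastExample {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);                          // start MPI; all ranks run this same main
            int rank = MPI.COMM_WORLD.Rank();        // each rank learns its own id

            int[] x = new int[1];                    // every rank allocates the buffer
            if (rank == 0) {
                x[0] = 99;                           // only the root has the value initially
            }

            // Broadcast x[0] from root rank 0 to all ranks: (buffer, offset, count, datatype, root)
            MPI.COMM_WORLD.Bcast(x, 0, 1, MPI.INT, 0);

            // Every rank now prints 99
            System.out.println("rank " + rank + ": x[0] = " + x[0]);

            MPI.Finalize();
        }
    }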
There are other examples of collective communication as well. Suppose R2 wants to compute the sum of all the Ys and store it into Z. To do this, we can use the MPI call named REDUCE. In REDUCE, we specify the input, Y, with its offset 0, as well as the destination, Z, with its offset 0. The number of elements, 1, is specified only once, because it is the same for both the send buffer and the receive buffer. Then we have the data type, say MPI.INT, and, very importantly again, the root. Since R2 is the one that wants to collect the sum, we pass 2 as the root in this case. What happens is that the values of Y[0] across all the ranks are added up, and this is done automatically for you inside the call to REDUCE. Only in R2 will Z[0] contain the sum; so if you had a print statement that said "if rank equals 2, print Z[0]", the value printed would be the sum of all the Ys, and it would be available only in rank 2.

I think you're getting the idea now of how these collective calls work, and seeing the call to REDUCE should remind you of MapReduce, another paradigm that we studied earlier. In fact, distributed-memory parallelism with collective communication feels a lot like MapReduce: all the local computations that occur before a reduce are essentially map operations, performed in parallel across all the ranks, and the reduce is a point of coordination that all the ranks call to combine their results; a sketch of this map-then-reduce pattern follows below. Indeed, many programmers who started with data analytics frameworks such as Spark are moving to frameworks such as MPI for increased performance, because they find that the MapReduce paradigm is also applicable here in certain cases.

With that, you have a good overview of distributed-memory parallelism, ranging from point-to-point communication all the way to collective communication. I wish you a lot of luck in writing very interesting applications that can run across multiple nodes in data centers.
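Here is a minimal sketch of the reduce example and the map-then-reduce pattern, again assuming an mpiJava-style binding such as MPJ Express, where Reduce takes the send buffer and offset, the receive buffer and offset, the count, datatype, operation, and root. The local computation shown (squaring the rank) and the assumption that the program is launched with at least three ranks, so that root rank 2 exists, are illustrative and not part of the original example.

    import mpi.*;

    public class ReduceExample {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();

            // "Map" step: each rank computes a local value in parallel (illustrative computation).
            int[] y = new int[1];
            y[0] = rank * rank;

            // "Reduce" step: sum y[0] across all ranks into z[0], but only at root rank 2:
            // (sendbuf, sendoffset, recvbuf, recvoffset, count, datatype, op, root)
            int[] z = new int[1];
            MPI.COMM_WORLD.Reduce(y, 0, z, 0, 1, MPI.INT, MPI.SUM, 2);

            // Only rank 2 holds the result; other ranks' z[0] is unchanged.
            if (rank == 2) {
                System.out.println("sum of all y[0] values = " + z[0]);
            }

            MPI.Finalize();
        }
    }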