Hello, the subject of this video is distributed computational systems. When people talk about distributed systems, they often mention that such systems are built from unreliable components. In a few moments, I will explain what that means.

Let us imagine that you have a distributed file system like the HDFS from the first week. As you remember, big files in this file system are distributed over different machines. In this picture, you can see an example of such a small cluster. Let us assume that you have a huge text file, for instance, all of Wikipedia stored in one file, with one article per line, and you have to find the most popular words in it. A small spoiler: right at the end of this lesson, you'll be able to do it by yourself with Hadoop MapReduce. If you would like to find an answer to this problem, you first need to read data from local disks, do some computations, and aggregate the results over the network.

What components can break in this system? To be honest, each and every one. Cluster nodes can break at any time because of power supply problems, disk damage, overheated CPUs, and so on. From the distributed systems perspective, there are three types of node failures: fail-stop, fail-recovery, and Byzantine failure.

Fail-stop failure doesn't mean that if a machine crashes, it crashes for good and all. It means that if machines go out of service during a computation, then you need external intervention to bring the system back to a working state. For instance, a system administrator should either fix the node and reboot the whole system or part of it, or retire the broken machine and reconfigure the distributed system. Therefore, such distributed systems are not robust to node crashes.

Fail-recovery failure means that during computations, nodes can arbitrarily crash and return to service. A process full of surprises; your life will never be boring again.
What is interesting is that this behavior doesn't influence the correctness and success of computations. That is, no external intervention is necessary to reconfigure the system after such events. For instance, if a hard drive was damaged, then a system administrator can physically replace the hard drive, and no other steps are necessary to return this node to service. After reconnection, this node will be automatically picked up by the distributed system, and it will even be able to participate in current computations.

The last and the most interesting type of failure is Byzantine failure. A distributed system is robust to Byzantine failures if it can work correctly despite some of the nodes behaving out of protocol. In other words, you have nodes that are going to lie through their little digital teeth to destabilize the system. If you are developing a financial system, then you are likely required to deal with these types of failures to protect your customers and your business.

The definition of Byzantine failure is widely known because of the article The Byzantine Generals Problem, written by Leslie Lamport, Robert Shostak, and Marshall Pease, and published in 1982. Let's make the example clear with the help of a brilliant metaphor. The problem can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be a traitor who is trying to confuse the others. It is shown that, using only unsigned messages, you have to have more than two-thirds of the generals loyal. For instance, in a system of three generals with one traitor, there is no chance for the loyal ones to come to an agreement about the battle plan.

It can be shown that this problem is equivalent to the problem where you have only one commanding general and the others are lieutenants. In the case of one general and two lieutenants, the hand-waving proof is easy.
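As a quick aside before the proof, the two-thirds condition stated above can be written down as simple arithmetic. This is only a minimal sketch, not code from the paper: the function names `agreement_possible` and `max_traitors_tolerated` are made up for illustration, and the only fact used is that, with unsigned messages, more than two-thirds of the generals must be loyal, which is the same as saying the number of generals must be greater than three times the number of traitors.

```python
def agreement_possible(n_generals: int, n_traitors: int) -> bool:
    """True if loyal generals outnumber two-thirds of the army.

    Loyal generals = n_generals - n_traitors. Requiring
    loyal > (2/3) * n_generals is equivalent to n_generals > 3 * n_traitors.
    (Hypothetical helper, named here for illustration only.)
    """
    return n_generals > 3 * n_traitors


def max_traitors_tolerated(n_generals: int) -> int:
    """Largest number of traitors m that still satisfies n_generals > 3 * m."""
    return (n_generals - 1) // 3


if __name__ == "__main__":
    for n in (3, 4, 7):
        print(n, "generals can tolerate", max_traitors_tolerated(n), "traitor(s)")
```

With three generals the tolerated number of traitors comes out as zero, which matches the statement that three generals with one traitor among them cannot reach agreement.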
In the first image, the second lieutenant is the traitor. The first lieutenant got a command, "attack", from the commander and a message, "the commander ordered to retreat", from the second lieutenant. In the second image, the commander is the traitor. Without loss of generality, the first lieutenant again got a command, "attack", from the commander and a message, "the commander ordered to retreat", from the second lieutenant. In both images, the first lieutenant sees the same messages in different situations, so he cannot tell which of the other two is the traitor. And yet the loyal participants should come to different plans: attack in the first scenario, and retreat in the second one.

So, to summarize: in this video, you have found out that a distributed system can have unreliable components, for instance, physical machines. From this perspective, distributed systems can be classified into systems that can deal with fail-stop nodes, fail-recovery nodes, and Byzantine ones.
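The two images from the hand-waving proof can also be played out in a few lines of code. This is a minimal sketch under stated assumptions, not the algorithm from the paper: the function `lieutenant1_view` is hypothetical, a loyal commander is assumed to order "attack", a traitorous commander is assumed to send "attack" to the first lieutenant and "retreat" to the second, and a traitorous second lieutenant is assumed to always claim that the order was "retreat".

```python
ATTACK, RETREAT = "attack", "retreat"


def lieutenant1_view(commander_loyal: bool, lieutenant2_loyal: bool) -> tuple:
    """Return the pair of messages the first lieutenant receives:
    (order from the commander, order as relayed by the second lieutenant).
    """
    if commander_loyal:
        # A loyal commander sends the same order to both lieutenants.
        to_l1 = to_l2 = ATTACK
    else:
        # Assumed traitor behavior: conflicting orders to the two lieutenants.
        to_l1, to_l2 = ATTACK, RETREAT

    if lieutenant2_loyal:
        # A loyal lieutenant relays exactly what he was told.
        relayed = to_l2
    else:
        # Assumed traitor behavior: always claim the order was "retreat".
        relayed = RETREAT

    return (to_l1, relayed)


# First image: the second lieutenant is the traitor.
scenario_1 = lieutenant1_view(commander_loyal=True, lieutenant2_loyal=False)
# Second image: the commander is the traitor.
scenario_2 = lieutenant1_view(commander_loyal=False, lieutenant2_loyal=True)

if __name__ == "__main__":
    print("first image: ", scenario_1)
    print("second image:", scenario_2)
    print("indistinguishable:", scenario_1 == scenario_2)
```

In both calls the first lieutenant ends up with the same pair of messages, so any rule he follows must give the same decision in both situations. That is the whole point of the argument for three participants.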