Problem. In a distributed system, whenever a server takes any action, it needs to ensure that in the event of a crash the results of the actions are available to the clients. This can be achieved by replicating the result to other servers in the cluster. But that leads to the question: how many other servers need to confirm the replication before the original server can be confident that the update is fully recognized. If the original server waits for too many replications, then it will respond slowly - reducing liveness. But if it doesn't have enough replications, then the update could be lost - a failure of safety. It's critical to balance between the overall system performance and system continuity.
Solution. A cluster agrees that it's received an update when a majority of the nodes in the cluster have acknowledged the update. We call this number a quorum. So if we have a cluster of five nodes, we need a quorum of three. (For a cluster of n nodes, the quorum is n/2 + 1.)
The need for a quorum indicates how many failures can be tolerated - which is the size of the cluster minus the quorum. A cluster of five nodes can tolerate two of them failing. In general, if we want to tolerate 'f' failures we need a cluster size of 2f + 1
The cluster can function only if majority of servers are up and running. In systems doing data replication, there are two things to consider:
Number of Servers | Quorum | Number Of Tolerated Failures | Representative Throughput |
---|---|---|---|
1 | 1 | 0 | 100 |
2 | 2 | 0 | 85 |
3 | 2 | 1 | 82 |
4 | 3 | 1 | 57 |
5 | 3 | 2 | 48 |
6 | 4 | 2 | 41 |
7 | 5 | 3 | 36 |
More at https://martinfowler.com/articles/patterns-of-distributed-systems/quorum.html