System Design Notes

Quorum Variations

So far we have examined what is called a strict quorum, in contrast to a sloppy quorum, which we’ll discuss next. Consider a system composed of several dozen nodes. It may not make sense to replicate each data value to every node, so instead we may choose a subset of nodes in the cluster to replicate each value. Going back to our inequality R + W > N, the N in this scenario is less than the number of nodes in the cluster.
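To make the arithmetic concrete, here is a minimal Python sketch, with hypothetical numbers, of the overlap check for a cluster where each value is replicated to only N of the nodes:

```python
def overlaps(n: int, w: int, r: int) -> bool:
    """Return True if every R-node read is guaranteed to see at least
    one node from every W-node write (the strict quorum condition)."""
    return r + w > n

# Hypothetical cluster of 12 nodes, but each value lives on only N = 5.
CLUSTER_SIZE = 12
N, W, R = 5, 3, 3

assert N < CLUSTER_SIZE        # N is a subset of the cluster
assert overlaps(N, W, R)       # 3 + 3 > 5: reads overlap the latest write
assert not overlaps(N, 2, 3)   # 2 + 3 = 5: overlap is NOT guaranteed
```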

When a network partition takes place, some of the nodes designated to replicate a given value may become unreachable, while other nodes in the cluster that aren’t designated to hold that value remain reachable. At this juncture, if a write request is received, the system has two options: it can decline the request since enough designated nodes aren’t available to record it, or it can accept the request and temporarily write the value to reachable nodes that aren’t part of the designated set. In the latter case the value is recorded at a subset of the designated nodes plus at other cluster nodes standing in for the unreachable ones. This is known as a sloppy quorum and helps improve the write availability of a system. Cassandra, Dynamo, and Voldemort all offer a sloppy quorum feature.
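The sketch below illustrates this write decision. It is only a sketch: `is_reachable` and `store` are hypothetical stand-ins for a real cluster’s membership and storage calls, not any particular database’s API.

```python
def sloppy_quorum_write(key, value, designated, cluster, w):
    """Write to W reachable nodes, preferring the designated replicas
    and falling back to other cluster nodes when some are down."""
    up = [n for n in designated if is_reachable(n)]
    down = [n for n in designated if not is_reachable(n)]
    fallbacks = [n for n in cluster
                 if n not in designated and is_reachable(n)]

    # Pair each unreachable designated node with a reachable fallback;
    # the fallback records a hint naming the node the data belongs to.
    targets, hints = up[:], {}
    for missing, fallback in zip(down, fallbacks):
        targets.append(fallback)
        hints[fallback] = missing

    if len(targets) < w:
        raise RuntimeError("fewer than W reachable nodes; write declined")

    for node in targets[:w]:
        store(node, key, value, hint=hints.get(node))
```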

Later on, when the designated nodes that should replicate the value become reachable again, the value is copied over to them from the nodes that temporarily held it. This is known as hinted handoff.
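A sketch of the handoff step, under the same assumptions as above (`hinted_values`, `is_reachable`, `store`, and `delete_hint` are hypothetical helpers):

```python
def hinted_handoff(local_node):
    """Replay temporarily held values to their designated nodes
    once those nodes become reachable again."""
    for key, value, designated_node in hinted_values(local_node):
        if is_reachable(designated_node):
            store(designated_node, key, value)  # copy to the real owner
            delete_hint(local_node, key)        # the hint is now served
```

In practice a node would run this periodically in the background, so hints drain soon after the partition heals.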

Increasing write availability comes at the cost of an increased probability of reading stale values, since a write may not be present on the required W designated nodes. A successful write indicates only that the value is present on some W nodes in the cluster, not necessarily on the designated ones. Sloppy quorums therefore make sense for applications that can tolerate reading stale values.

The leaderless replication model can also be extended to multi-datacenter replication. Usually, a value is written to N nodes within the local datacenter before an acknowledgment is sent to the writer, and thereafter the value is asynchronously copied to nodes in the remote datacenters. There are tweaks to this model: for instance, Cassandra sends the write request to all the nodes in all the datacenters but waits to hear only from nodes in the local datacenter. In contrast, Riak confines communication between clients and nodes to the local datacenter and replicates across datacenters asynchronously behind the scenes.
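A rough sketch of the Cassandra-style variant, assuming hypothetical `send_write` and `wait_for_acks` helpers (fire the write everywhere, block only on the local datacenter):

```python
def multi_dc_write(key, value, replicas_by_dc, local_dc, w_local):
    """Send the write to replicas in every datacenter, but acknowledge
    the client once W nodes in the local datacenter have confirmed."""
    pending = []
    for dc, replicas in replicas_by_dc.items():
        for node in replicas:
            pending.append((dc, send_write(node, key, value)))

    # Block only on acknowledgments from the local datacenter;
    # remote datacenters catch up asynchronously.
    local_futures = [f for dc, f in pending if dc == local_dc]
    wait_for_acks(local_futures, count=w_local)
```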

Read repair

One way of updating stale values on a replica is read repair. The idea is that a client making a read request receives responses from several nodes and can determine from those responses which value is the latest. The client then writes the latest value back to any replica that sent it a stale one.
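A minimal sketch of the idea, assuming each response carries a version number (a timestamp or vector-clock stand-in) and hypothetical `read_from`/`write_to` client calls:

```python
def read_with_repair(key, replicas):
    """Read from all replicas, return the newest value, and write it
    back to any replica that answered with a stale version."""
    responses = [(node, read_from(node, key)) for node in replicas]

    # Pick the response with the highest version as the latest value.
    _, (latest_version, latest_value) = max(
        responses, key=lambda item: item[1][0])

    # Repair: push the latest value to every replica that is behind.
    for node, (version, _) in responses:
        if version < latest_version:
            write_to(node, key, latest_value, latest_version)

    return latest_value
```

Note that read repair only fixes values that are actually read; rarely-read keys can stay stale, which is where the anti-entropy process below comes in.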

Anti-entropy process

Some systems run a background process that constantly scans the replicas for differences and copies any missing data from one replica to another. The copies may not be made in any particular order, and there may be no SLA governing how long it takes to copy missing data.
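A minimal sketch of one such background pass, assuming hypothetical `all_entries` and `write_to` accessors and a simple per-key version number:

```python
def anti_entropy_pass(replica_a, replica_b):
    """One background sweep: copy any missing or stale entries in each
    direction, with no ordering guarantee and no deadline."""
    for src, dst in ((replica_a, replica_b), (replica_b, replica_a)):
        dst_entries = all_entries(dst)
        for key, (version, value) in all_entries(src).items():
            entry = dst_entries.get(key)
            if entry is None or entry[0] < version:
                write_to(dst, key, value, version)
```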