System Design Notes
Don’t forget to get your copy of Designing Data Intensive Applications the single most important book to read for system design interview prep!

Leaderless Replication

So far we have discussed leader-based replication but there also exists another strategy without any nodes in the system acting as leaders. Some of the databases that use leaderless replication include:

  • Amazon Dynamo DB
  • Riak
  • Voldemort
  • Cassandra

The idea powering leaderless replication is to always write or read from a majority (more than half) of the number of nodes in the system. This practice ensures that when a client reads a value from a node, at least one of the nodes in the system has the latest value and vice versa for writes, i.e. the latest value is written to at least one of the nodes in the system.

Note that when making read/write requests either the client can send requests to the desired number of replicas or a coordinator node can be responsible for forwarding client requests to the replicas.

Example

Let’s consider a concrete example of a system with three replicas labelled as A, B and C as shown below:

(image coming soon)

Furthermore, let’s assume a client wants to update a key key/value pair (x, 5) that is already stored in the system. In the above set-up the client will write to or read from a quorum of nodes. The quorum is half of the nodes in the system which is 2. In general the quorum is equal to n/2 rounded up, when the number of nodes in the system is odd and 1+(n/2) when the number of nodes is even.

Writes and Reads

Say the client sends the write request to a single node in the system. The node receiving the request successfully records the write and sends an acknowledgement to the client. Immediately thereafter, the node recording the write experiences a hardware failure and goes offline without getting the chance to replicate the write to other nodes in the system. Now even though the client has received the write confirmation, the system will serve the stale value stored on the remaining two nodes in the system.

(image coming soon)

Now consider the case when the client is sent a write acknowledgement only when a write has been recorded by at least two nodes, say node A and node C. When a read request is received at the system, the following outcomes are possible:

  1. All nodes are down: In this case the system declines the read request since at least two nodes must respond to the read request.
  2. Nodes A & C are up: In this case both nodes return the latest value for the key.
  3. Nodes A & B are up: Node A responds with the latest value while node B responds with the stale value. The client receives both the values and picks the latest value out of the two. We’ll shortly discuss how the client determines which is the latest value out of the two received. Also, in this case, we assume that the latest value hasn’t yet replicated to node B.
  4. Nodes C & B are up: The client picks the latest value received from node C and ignores the value received from node B.
  5. Single node is up: Read request is declined and the system is unavailable for reading as at least two nodes must respond to a read request.
  6. All nodes are up: In this case, if the latest value hasn’t been replicated to node B yet the client will receive the latest value from nodes A and C and a stale value from node B. The client picks the latest value and ignores the value from node B.

We can see from the above outcomes that as long as we always write to and read from at least two nodes in a three node system, our writes will be durable and reads will return the most up to date values. We’ll see a generalization of this relationship next.