System Design Notes

Multi-Leader Replication

In contrast to single-leader replication, allowing multiple nodes to accept writes is known as multi-leader replication (also called active/active or master-master replication). Each node that processes a write must forward the change to all the other nodes, so a leader in this scheme simultaneously acts as a follower of the other leaders.
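A minimal sketch can make this dual role concrete. The toy Leader class below (all names hypothetical, and no real replication protocol implied) commits a write locally and queues the change for its peers, and separately drains the changes that peer leaders queued for it:

```python
import queue

class Leader:
    """Toy multi-leader node: an illustrative sketch, not a real protocol."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.data = {}               # this node's local replica
        self.peers = []              # the other leaders
        self.inbox = queue.Queue()   # changes received from peer leaders

    def write(self, key, value):
        # Act as a leader: commit locally and acknowledge right away ...
        self.data[key] = value
        # ... then hand the change to every peer for asynchronous delivery.
        for peer in self.peers:
            peer.inbox.put((self.node_id, key, value))

    def catch_up(self):
        # Act as a follower: apply changes that originated at other leaders.
        while not self.inbox.empty():
            _origin, key, value = self.inbox.get()
            self.data[key] = value   # naively apply; conflicts ignored here

# Two leaders replicating to each other:
a, b = Leader("dc-a"), Leader("dc-b")
a.peers, b.peers = [b], [a]
a.write("user:42", "alice")   # accepted at a immediately
b.catch_up()                  # b applies the change some time later
```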

One use case for multi-leader replication is a distributed system whose data is spread across multiple datacenters that are geographically distant from one another. Such an architecture is usually employed to move data closer to users or to withstand the failure of an entire datacenter. With single leader-based replication, the leader resides in one of the datacenters and all writes must be routed through that datacenter to reach it. In a multi-leader replication architecture, each datacenter has its own leader, and replication within a datacenter follows the single-leader mechanism; across datacenters, each leader asynchronously replicates its changes to the leaders of all the other datacenters.
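As a rough illustration (the datacenter names and layout below are made up), this snippet enumerates the replication links such an architecture implies: single-leader replication from each datacenter's leader to its local followers, plus an asynchronous all-to-all mesh between the leaders themselves:

```python
# Hypothetical three-datacenter topology.
datacenters = {
    "us-east":  {"leader": "us-east-1",  "followers": ["us-east-2", "us-east-3"]},
    "eu-west":  {"leader": "eu-west-1",  "followers": ["eu-west-2"]},
    "ap-south": {"leader": "ap-south-1", "followers": ["ap-south-2"]},
}

def replication_links(dcs):
    links = []
    for name, dc in dcs.items():
        # Intra-datacenter: the local leader streams to its own followers.
        links += [(dc["leader"], f, "intra-dc") for f in dc["followers"]]
        # Inter-datacenter: each leader replicates asynchronously to every
        # other datacenter's leader.
        links += [(dc["leader"], other["leader"], "inter-dc")
                  for other_name, other in dcs.items() if other_name != name]
    return links

for src, dst, kind in replication_links(datacenters):
    print(f"{src} -> {dst} ({kind})")
```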

You can also think of systems such as Google Calendar, Git, Perforce, etc., that let users work on a copy of the data offline, as examples of multi-leader replication. For instance, you can be disconnected from the internet on your desktop and make changes to your calendar while offline. Once your desktop comes back online, the changes you made there should be reflected correctly on your mobile device. This is multi-leader replication at a small scale: writes (changes) can be accepted on either device and are replicated to the other. Each device acts as a leader and has its own local database where changes are recorded. In effect, each device plays the role of a datacenter, connected to the other devices by a highly unreliable network. Replication among the device leaders takes place asynchronously, and the replication lag can range wildly from a few milliseconds to days, depending on when a device gets back on the internet.
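The sketch below models that calendar scenario under simplifying assumptions (string-valued events, wall-clock timestamps, and a crude merge that just replays all known writes in timestamp order; a real client would need proper conflict resolution):

```python
import time

class Device:
    """Hypothetical offline-capable device: a leader with its own write log."""

    def __init__(self, name):
        self.name = name
        self.log = []        # writes this device has seen, as (ts, who, id, details)
        self.calendar = {}   # the device's local copy of the data
        self.online = False

    def edit(self, event_id, details):
        # Writes are always accepted locally, even while offline.
        entry = (time.time(), self.name, event_id, details)
        self.log.append(entry)
        self.calendar[event_id] = details

    def sync_with(self, other):
        # Replication can only happen while both devices are reachable;
        # until then, the replication lag simply keeps growing.
        if not (self.online and other.online):
            return
        merged = sorted(set(self.log) | set(other.log))
        for _ts, _who, event_id, details in merged:
            self.calendar[event_id] = details   # last write (by timestamp) wins
            other.calendar[event_id] = details
        self.log = other.log = list(merged)

desktop, phone = Device("desktop"), Device("phone")
desktop.edit("standup", "moved to 10:00")   # change made while offline
desktop.online = phone.online = True        # desktop comes back online
desktop.sync_with(phone)                    # phone now reflects the change
```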

Another example of multi-leader replication is online collaborative editing software, such as Google Docs, which allows multiple users to work on a document at the same time. When a user makes changes to their copy of the document, the changes are committed locally and then asynchronously propagated to the server, and from there to the rest of the users. There are several commonalities between database replication and collaborative editing software. Naturally, conflicts are expected when multiple users edit the same document, and there are conflict-resolution strategies that we'll delve into later.
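To make the local-commit-then-propagate flow concrete, here is a deliberately naive editor client (entirely hypothetical; real systems such as Google Docs rely on techniques like operational transformation, which this sketch ignores):

```python
class EditorClient:
    """A user's copy of the document: edits apply locally, propagate later."""

    def __init__(self, doc=""):
        self.doc = doc
        self.outbox = []   # edits awaiting asynchronous delivery to the server

    def insert(self, pos, text):
        # The edit is committed to the local copy immediately ...
        self.doc = self.doc[:pos] + text + self.doc[pos:]
        # ... and only queued for the server, not sent synchronously.
        self.outbox.append(("insert", pos, text))

    def apply_remote(self, op):
        # An edit relayed by the server from another user's copy.
        kind, pos, text = op
        if kind == "insert":
            self.doc = self.doc[:pos] + text + self.doc[pos:]
        # Concurrent edits at overlapping positions conflict; resolving
        # them is the subject of the conflict-resolution strategies later.

alice, bob = EditorClient("hello"), EditorClient("hello")
alice.insert(5, " world")    # committed on Alice's copy at once
for op in alice.outbox:      # later, the server relays Alice's edits
    bob.apply_remote(op)
```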

Multi-Leader vs. Single-Leader Replication

  • Performance: With single-leader replication, all writes have to be routed to one datacenter. If that datacenter is in the USA and a user in Asia makes a write, the request has to travel all the way to the USA, which increases write latency and hurts the user's experience. If instead the user's write is processed at a datacenter in Asia and then asynchronously replicated to datacenters across the globe, the user's experience is far more pleasant and the perceived performance much better. We say perceived performance because, for the write to be durably committed everywhere, it must still be replicated to all the datacenters; the user simply doesn't have to wait for that. The user receives a success response as soon as the local datacenter has processed the write, so the inter-datacenter network delay is hidden from the end user (a back-of-the-envelope sketch follows this list). There is, however, a risk of data loss if an entire datacenter fails before it has had a chance to replicate its changes to the other datacenters.
  • Datacenter Failure: In a multi-leader replication scheme, each datacenter operates independently of the others and is unaffected by their failures; a failed datacenter can always catch up on replication once it comes back online. With single leader-based replication, by contrast, failure of the datacenter hosting the leader forces a failover: a follower in another datacenter must be promoted to be the new leader.
  • Network Problems: Inter-datacenter traffic usually flows over the public internet, which is less reliable than the network within a datacenter and more prone to failures. With single-leader replication, write requests forwarded from peer datacenters can be delayed by networking problems, and since the write is committed synchronously, the end user can experience significant latency. Multi-leader replication handles network issues between datacenters much better, since writes from peer datacenters are replicated asynchronously. A network partition between two datacenters doesn't stop write requests from being processed at either of them, which is not true of single-leader replication: if the datacenter not hosting the leader has its network link to the leader-hosting datacenter temporarily severed, a write request at the former datacenter simply has to wait.
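Here is the back-of-the-envelope sketch of the perceived-latency point from the Performance item above. The round-trip times are illustrative assumptions, not measurements:

```python
# Assumed round-trip times (illustrative only).
LOCAL_RTT_MS = 5       # user in Asia <-> nearby datacenter in Asia
CROSS_DC_RTT_MS = 200  # Asia <-> USA over the public internet

# Single leader in the USA: the Asian user's write must reach that leader
# before it can be acknowledged.
single_leader_ms = CROSS_DC_RTT_MS

# Multi-leader: the local leader in Asia acknowledges the write; replication
# to the USA still happens, but asynchronously, off the user's critical path.
multi_leader_ms = LOCAL_RTT_MS

print(f"single-leader perceived write latency: ~{single_leader_ms} ms")
print(f"multi-leader  perceived write latency: ~{multi_leader_ms} ms")
```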