System Design Notes

Followers in Replication

The topology of a distributed system can change over time. New follower nodes can join the system, old ones can fail, and the leader can experience outages. Let’s see how these changes are handled.

New follower

The data on the system is continually changing as write requests keep flowing from the clients. If a new follower joins the network, it must be brought up to speed with the current state of the data the leader hosts and then be sent the new changes as they arrive at the leader.

The leader can’t pause write requests, take a snapshot of the data, and send it to the new follower. This would defeat the goal of being a highly available system. Instead, the new follower receives a consistent snapshot of the leader’s data taken at some point in time, followed by all the changes that have taken place since the snapshot was taken. The leader maintains a log called the replication log, and the snapshot is associated with an exact position within that log. The follower can then be sent all the changes recorded after the snapshot’s position in the log. The position of the snapshot within the log has various names in different systems: for instance, PostgreSQL calls it the log sequence number, and MySQL calls it the binlog coordinates.
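
As a rough sketch of this procedure (with hypothetical names like fetch_snapshot and stream_changes rather than any real database API), a new follower’s bootstrap might look like this:

```python
# Hypothetical sketch of a new follower bootstrapping from a leader snapshot.
# All names (LeaderClient, fetch_snapshot, stream_changes, ...) are illustrative
# assumptions, not a real replication API.

class NewFollower:
    def __init__(self, leader_client, local_store):
        self.leader = leader_client    # RPC client for talking to the leader
        self.store = local_store       # this follower's local storage engine

    def bootstrap(self):
        # 1. Receive a consistent snapshot plus the log position it corresponds to
        #    (PostgreSQL: log sequence number, MySQL: binlog coordinates).
        snapshot, snapshot_position = self.leader.fetch_snapshot()
        self.store.load_snapshot(snapshot)

        # 2. Ask the leader for every change recorded after that position
        #    and keep applying new changes as they arrive.
        for change in self.leader.stream_changes(since=snapshot_position):
            self.store.apply(change)
            self.store.record_position(change.position)  # remember how far we got
```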

Handling follower failure

A follower can fail for a number of reasons, such as a disk failure, a network partition, or a power outage, or it can be brought down intentionally for routine maintenance. When the follower comes back online, it must catch up to the current state of the data held by the leader. In order to do so, the follower must know the last transaction it successfully completed before going offline. The follower maintains a log of the changes received from the leader on its local disk, so it can easily look up the last successful transaction it processed and request from the leader all the transactions that occurred after it. In this way the follower catches up to the leader.
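
A rough sketch of this recovery path (again with assumed helper names, not a real replication API) differs from the new-follower case only in where the starting position comes from: it is read from the follower’s own local log rather than taken alongside a fresh snapshot.

```python
# Hypothetical sketch of a follower catching up after a restart.
# last_applied_position() and stream_changes() are assumed helpers, not a real API.

def recover(follower_store, leader_client):
    # The follower logged every change it applied, so it knows exactly
    # where it stopped before going offline.
    start = follower_store.last_applied_position()

    # Request only the changes the follower missed while it was down.
    for change in leader_client.stream_changes(since=start):
        follower_store.apply(change)
        follower_store.record_position(change.position)
```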

Some of the issues that can come up in this scheme are:

  • On a very busy system, a follower that comes back online after a significantly long time may be perpetually behind the leader and never catch up, due to the sheer volume of changes happening in the system.
  • The leader may roll over its replication log after a few days and no longer have the changes requested by the follower. For instance, if the leader keeps replication log history for the last 15 days and the follower comes back online after twenty days, the follower will not be able to get the changes for the earliest five days. One way a leader might detect this case is sketched after this list.
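
The check below is a minimal, hypothetical sketch of that detection; oldest_retained_position and the full-resync fallback are illustrative assumptions rather than behavior prescribed by any particular database.

```python
# Hypothetical leader-side handling of a catch-up request when the
# replication log may have been rolled over. Names are illustrative.

class StaleFollowerError(Exception):
    """The requested position is older than anything the leader still retains."""

def handle_catch_up_request(leader_log, requested_position):
    if requested_position < leader_log.oldest_retained_position():
        # The follower was offline longer than the log retention window
        # (e.g. 20 days offline vs. 15 days of history), so an incremental
        # catch-up is impossible; it must fall back to a full resync
        # (fresh snapshot plus streaming, as in the new-follower procedure).
        raise StaleFollowerError("position rolled over; full resync required")
    return leader_log.changes_since(requested_position)
```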

Handling leader failure

The leader is also prone to failures and can go down for a number of reasons. A leader failure is more complex than a follower failure and has a lot of nuances. When the leader fails, the following actions must take place:

  • Detecting that the leader has failed. If the failover isn’t manual, a timeout value is usually used to determine whether the leader is down. If the leader doesn’t send out or respond to a heartbeat message within the timeout, it is assumed to be dead.
  • Promoting one of the followers to be the new leader. The follower with the most up-to-date changes from the old leader is usually the preferred choice, as it results in minimal loss of data. Leaders can be elected in a variety of ways, and getting all the followers to agree on a leader is a consensus problem. The new leader can be chosen by an *election* process where a majority decides on the new leader, or it can be appointed by a previously elected controller node (see the sketch after this list).
  • Configuring clients to direct their write requests to the new leader.
  • Configuring followers to receive changes from the new leader.
  • Ensuring that if the old leader comes back up, it assumes the role of a follower and no longer considers itself the leader.
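
The promotion step, in isolation, could be sketched as below. This is a simplified illustration that assumes a controller already knows each follower’s last replicated log position; in reality, getting every node to agree on the outcome is itself a consensus problem, so treat this as a picture of the bookkeeping rather than a safe election algorithm.

```python
# Simplified sketch of a controller-driven failover. The data structures and
# method names are assumptions for illustration; a real system must solve
# leader election as a consensus problem rather than a simple max().

def fail_over(controller, followers, clients):
    # Promote the follower that has replicated the most changes from the old
    # leader, since that minimizes the amount of lost data.
    new_leader = max(followers, key=lambda f: f.last_replicated_position())

    # Reconfigure the remaining followers to replicate from the new leader,
    # and point clients' write requests at it.
    for follower in followers:
        if follower is not new_leader:
            follower.set_leader(new_leader)
    for client in clients:
        client.set_write_endpoint(new_leader.address())

    # Record the decision so that if the old leader comes back, it can be told
    # to step down and rejoin as a follower.
    controller.record_new_leader(new_leader)
```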

All the above steps are collectively referred to as the leader *failover*. A number of issues can arise during a failover, including the following:

Data loss

The newly elected leader may not have received all the changes from the old leader before the latter went offline. This is a possibility when replication takes place asynchronously, and it can cause data loss, which may violate clients’ durability expectations. If external systems depend on the data and certain writes are missing, those systems can fail. Another twist is the scenario in which the old leader rejoins the cluster: it holds writes that were never propagated to the new leader and are now suddenly available again. Usually, in such a scenario these unreplicated writes on the old leader are simply discarded.
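
To illustrate that last point, a rejoining old leader might truncate its own log back to the position the new leader confirms having received, discarding anything it accepted but never replicated. The helper names below (last_confirmed_position, truncate_after) are hypothetical.

```python
# Hypothetical sketch of an old leader rejoining as a follower after a failover.
# The method names are illustrative assumptions, not a real API.

def rejoin_as_follower(old_leader_store, new_leader_client):
    # Find the last position the new leader actually received from the old leader.
    confirmed = new_leader_client.last_confirmed_position()

    # Any writes the old leader accepted after that point were never replicated.
    # The common (and lossy) resolution is to simply discard them.
    old_leader_store.truncate_after(confirmed)

    # Continue as an ordinary follower from the agreed-upon position.
    for change in new_leader_client.stream_changes(since=confirmed):
        old_leader_store.apply(change)
```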

Split brain

The split brain scenario occurs when two nodes each believe they are the leader and advertise themselves as such to the clients. This can cause both nodes to accept writes from clients, resulting in conflicts that lead to loss or corruption of data. Hadoop’s distributed file system, also known as HDFS, goes to great lengths to prevent its *NameNodes* from entering a split brain scenario during a failover.

Timeout period

Determining a suitable timeout value before declaring a leader dead is crucial. If the timeout is too long, the system can remain unavailable for that period and recovery is prolonged. Similarly, if the timeout is too short, it’ll result in unnecessary failovers. For instance, if there are spikes in write requests or the network slows down for some reason, the leader may respond to health messages with a delay, triggering an unneeded failover.
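
A minimal failure detector makes the tradeoff concrete: the single timeout parameter is exactly the knob being tuned. The sketch below is illustrative; the 10-second default is an arbitrary choice, not a recommendation.

```python
import time

# Minimal sketch of a heartbeat-based failure detector. The timeout is the knob
# discussed above: too long delays failover, too short triggers spurious
# failovers when the leader is merely slow.

class LeaderFailureDetector:
    def __init__(self, timeout_seconds=10.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # Called whenever the leader sends or responds to a heartbeat message.
        self.last_heartbeat = time.monotonic()

    def leader_presumed_dead(self):
        # A delayed heartbeat (e.g. during a load spike or a slow network)
        # looks exactly like a dead leader once the timeout elapses.
        return time.monotonic() - self.last_heartbeat > self.timeout
```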

There is no silver bullet for any of the problems we discussed above; rather, there are tradeoffs to be made when tuning for the replica consistency, durability, availability, and latency characteristics of a distributed system.