The topology of a distributed system can change over time. New follower nodes can join the system, old ones can fail, and the leader itself can experience outages. Let’s see how these changes are handled.
The data on the system is continually changing as write requests keep flowing from the clients. If a new follower joins the network, it must be brought up to speed with the current state of the data the leader hosts and then be sent the new changes as they arrive at the leader.
The leader can’t simply pause write requests, take a snapshot of the data, and ship it to the new follower; that would defeat the goal of being a highly available system. Instead, the new follower receives a consistent snapshot of the leader’s data taken at some point in time, followed by all the changes that have occurred since the snapshot was taken. The leader maintains a log of changes called the replication log, and the snapshot is associated with a position within that log, so the follower can be sent every change recorded after that position. The position of the snapshot within the log goes by different names in different systems: PostgreSQL calls it the *log sequence number*, and MySQL calls it the *binlog coordinates*.
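As a rough illustration, the sketch below models this handshake with a hypothetical `Leader`/`Follower` pair in Python; the class and method names are made up for this example and do not correspond to any real database’s API. The snapshot carries the log position it was taken at, and the follower replays every log entry recorded after that position.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)        # ordered record of every change

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))               # append to the replication log

    def snapshot(self):
        # A consistent copy of the data together with the log position it
        # corresponds to (cf. PostgreSQL's log sequence number, MySQL's
        # binlog coordinates).
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


@dataclass
class Follower:
    data: dict = field(default_factory=dict)
    position: int = 0                               # last log position applied locally

    def apply(self, changes):
        for key, value in changes:
            self.data[key] = value
            self.position += 1

    def bootstrap(self, leader):
        # 1. copy the snapshot, 2. remember its log position, 3. replay later changes
        self.data, self.position = leader.snapshot()
        self.apply(leader.changes_since(self.position))
```

Because new writes keep landing in the log after the recorded position, the follower ends up consistent with the leader without the leader ever having to pause writes.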
A follower can fail for a number of reasons, such as disk failure, a network partition, or a power outage, or it can be brought down intentionally for routine maintenance. When the follower comes back online it must catch up to the leader’s current state. To do so, the follower must know the last transaction it successfully processed before going offline. Because it keeps a log of the changes received from the leader on its local disk, the follower can look up the last transaction it applied and request from the leader all the transactions that occurred after it. In this way the follower catches up to the leader.
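Building on the same hypothetical sketch, follower recovery then amounts to persisting the last applied log position and, on restart, requesting everything after it. (Real systems persist the received change log itself rather than re-serialising the whole dataset; this is simplified for illustration.)

```python
import json
import os


class DurableFollower(Follower):
    """Hypothetical follower that records its replication progress on local disk."""

    STATE_FILE = "follower_state.json"

    def apply(self, changes):
        super().apply(changes)
        # Persist the data and the last applied log position so that, after a
        # crash or restart, we know exactly where to resume from.
        with open(self.STATE_FILE, "w") as f:
            json.dump({"position": self.position, "data": self.data}, f)

    def recover(self, leader):
        # On coming back online, reload local state and request from the leader
        # every change that occurred after the last position we applied.
        if os.path.exists(self.STATE_FILE):
            with open(self.STATE_FILE) as f:
                state = json.load(f)
            self.data, self.position = state["data"], state["position"]
        self.apply(leader.changes_since(self.position))
```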
There are, however, some issues that can come up in this scheme.
The leader is also prone to failures and can go down for a number of reasons. A leader failure is more complex than a follower failure and has many more nuances. When the leader fails, the following actions must take place:

- The failure must be detected, typically because the leader has stopped responding to health checks within some timeout.
- One of the followers must be promoted to be the new leader.
- Clients must be reconfigured to send their writes to the new leader, and the remaining followers must start consuming changes from it.
All the above steps are collectively referred to as leader *failover*. A number of issues can arise during a failover, some of which include the following:
The newly elected leader may not have received all the changes from the old leader before the latter went offline. This is possible when replication is asynchronous and can cause data loss, which may violate clients’ durability expectations. If external systems depend on the data, missing writes can cause those systems to fail. A further twist is the scenario in which the old leader rejoins the cluster: it holds writes that were never propagated to the new leader and are now suddenly available again. Usually these unreplicated writes on the old leader are simply discarded.
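A minimal sketch of that resolution, assuming the log-position model from the earlier sketch: when the old leader rejoins as a follower, any entries beyond the position the new leader had replicated are discarded, and the node then adopts the new leader’s history. The function and parameter names here are illustrative, not taken from any particular system.

```python
def rejoin_as_follower(old_leader_log, replicated_upto, new_leader):
    """Hypothetical rejoin step: the old leader's unreplicated writes are discarded.

    old_leader_log  -- the rejoining node's local replication log
    replicated_upto -- log position the new leader had received before the failover
    new_leader      -- object exposing changes_since(position), as in the sketch above
    """
    discarded = old_leader_log[replicated_upto:]    # writes the new leader never saw
    kept = list(old_leader_log[:replicated_upto])
    # Adopt the new leader's history from the divergence point onwards.
    kept.extend(new_leader.changes_since(replicated_upto))
    return kept, discarded
```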
The split brain scenario occurs when two nodes each believe they are the leader and advertise themselves as such to the clients. Both nodes may then accept writes from clients, resulting in conflicts and potentially in loss or corruption of data. Hadoop’s distributed file system (HDFS), for example, goes to great lengths to prevent its *NameNodes* from entering a split brain scenario during a failover.
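The details of HDFS’s safeguards are beyond the scope of this article, but one widely used ingredient for avoiding split brain is a monotonically increasing epoch (or fencing) number: storage nodes remember the highest epoch they have seen and reject writes stamped with an older one, so a deposed leader that still believes it is in charge cannot corrupt data. The sketch below is illustrative only, not HDFS’s actual protocol.

```python
class StorageNode:
    """Epoch-based fencing sketch; illustrative only, not HDFS's actual protocol."""

    def __init__(self):
        self.highest_epoch = 0
        self.records = []

    def write(self, epoch, record):
        # Reject writes stamped with an epoch older than one already seen, so a
        # deposed leader that still thinks it is in charge cannot sneak writes in.
        if epoch < self.highest_epoch:
            raise PermissionError(f"stale leader epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch
        self.records.append(record)


node = StorageNode()
node.write(epoch=1, record="x=1")       # accepted: old leader, epoch 1
node.write(epoch=2, record="x=2")       # accepted: new leader elected with epoch 2
try:
    node.write(epoch=1, record="x=3")   # rejected: the deposed leader's late write
except PermissionError as err:
    print(err)
```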
Determining a suitable timeout before declaring the leader dead is crucial. If the timeout is too long, the system can remain unavailable for that period and recovery is prolonged. If the timeout is too short, it will result in unnecessary failovers: for instance, during a spike in write requests or a network slowdown, the leader may respond to health checks with a delay, triggering an unneeded failover.
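A minimal sketch of where the timeout enters the picture: each follower records when it last heard a heartbeat from the leader and only declares the leader dead once the configured timeout has elapsed. The names and the default value below are assumptions for illustration; the threshold is exactly the knob being tuned in the paragraph above.

```python
import time


class FailureDetector:
    """Heartbeat-timeout detector; the timeout value is the tuning knob discussed above."""

    def __init__(self, timeout_seconds=10.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        # Called whenever a heartbeat or health-check response arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def leader_presumed_dead(self):
        # Too short a timeout causes spurious failovers under load or slow networks;
        # too long a timeout leaves the system without a working leader for longer.
        return time.monotonic() - self.last_heartbeat > self.timeout
```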
There is no silver bullet for any of the problems discussed above; rather, there are tradeoffs to be made when tuning the replica consistency, durability, availability, and latency characteristics of a distributed system.