System Design Notes

Single leader replication issues

Leader-based replication is an ideal choice for read-scaling scenarios, where a distributed system processes far more read requests than write requests. This is often true of internet applications. The number of followers can be increased as the read load on the system grows.

However, leader-based replication has its limitations.

  • All writes pass through the leader, which becomes a bottleneck as the number of writes increases.
  • If the leader goes down for any reason, e.g. a network outage, writes to the system are halted.
  • The leader-based replication architecture scales for read requests when asynchronous replication is used. As the number of followers in the system increases, so does the probability that at least one of them is failing at any given moment. Synchronous replication can therefore make the system unavailable for writes if any one of the followers experiences a failure or a network outage hits the system.
  • Another limitation surfaces when an application reads from an asynchronous follower that has fallen behind. Running the same query on the leader versus a lagging follower will yield different results, because the follower’s state hasn’t caught up with the leader’s. The delay between a write happening on the leader and that write being reflected on a follower is known as the replication lag. The follower does eventually catch up with the leader, a property known as eventual consistency. However, “eventual” doesn’t specify when the follower catches up: is it a few seconds or minutes? The term is deliberately vague about how long the follower may take to catch up to the leader; in practice it is often a fraction of a second. A small simulation of this lag appears after this list.
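As a rough illustration, here is a minimal, in-memory sketch of asynchronous single-leader replication, where each follower applies a write after an artificial delay that stands in for the replication lag. The class and variable names are made up for illustration; this is not how any real database implements replication.

```python
import threading
import time

class Follower:
    """Applies replicated writes after a simulated delay (the replication lag)."""
    def __init__(self, lag_seconds):
        self.lag_seconds = lag_seconds
        self.data = {}

    def replicate(self, key, value):
        def apply():
            time.sleep(self.lag_seconds)    # simulate network/apply latency
            self.data[key] = value
        threading.Thread(target=apply, daemon=True).start()

class Leader:
    """Commits writes locally, then ships them to followers asynchronously."""
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value              # the leader is always up to date
        for follower in self.followers:
            follower.replicate(key, value)  # don't wait for acknowledgement

follower = Follower(lag_seconds=0.5)
leader = Leader([follower])
leader.write("tweet:1", "hello world")

print(leader.data.get("tweet:1"))    # 'hello world'
print(follower.data.get("tweet:1"))  # None -- the follower is still behind
time.sleep(1)
print(follower.data.get("tweet:1"))  # 'hello world' -- eventual consistency
```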

Read-after-write consistency

Consider the scenario of a Twitter user making a new tweet. In a leader-based replication setup, the tweet goes to the leader to be committed to the underlying data store. If the user refreshes her timeline, it is possible that her read request goes to a follower that is behind the leader and doesn’t return her latest tweet. The user may erroneously believe that her tweet has been lost. For such scenarios, we want read-after-write consistency, also known as read-your-own-writes consistency. This is a guarantee that a user is always able to immediately read the writes she has submitted, or in the case of our Twitter user, that she sees her tweet on her timeline immediately after submitting it and forcing a browser refresh. Note that read-after-write consistency doesn’t imply that a different user can immediately see the writes made by the first user, since the other user’s reads may be routed to a read replica that is behind the leader. Eventually, however, all users see each other’s writes because of eventual consistency.

We can tweak the leader-based replication in some ways to achieve read-after-write consistency. These are:

  • Direct all read queries for entities that can only be modified by the user herself to the leader, and all other reads to followers. A user can only modify her own Twitter account, so any data she can change, such as her tweets and profile details, is always read from the leader. Since the writes were routed through the leader, reads from the leader are always fresh, ensuring read-after-write consistency.
  • Routing reads to the leader quickly falters when a user can potentially edit much of the dataset being replicated, because more and more reads end up directed to the leader and overwhelm it. One mitigation is to note the time of a user’s last write and, for a certain period after it, say 30 seconds, direct all of that user’s reads to the leader. Thereafter the reads can be directed to the followers. Similarly, the replication lag on the followers can be measured and reads can be withheld from any follower that is more than 30 seconds behind the leader. The assumption in this scheme is obviously that after 30 seconds, or whatever duration you choose, all the followers have caught up with the leader.
  • Another option is to note the timestamp of the user’s last write and then serve her reads only from replicas that have replicated data at least up to that timestamp. Using the actual system clock as the timestamp poses clock-synchronization challenges, so a monotonically increasing sequence number can be used instead as a logical timestamp. A sketch of these routing ideas follows this list.
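The following is a minimal sketch, under assumed interfaces, of the two routing ideas above: read from the leader for a short window after a user’s last write, or use a logical timestamp and read only from replicas that have applied at least the user’s last write. The helpers `lag_behind_leader()` and `applied_seq()` are hypothetical; a real system would back them with its replication metadata.

```python
import time

RECENT_WRITE_WINDOW = 30  # seconds of leader-only reads after a user's last write

last_write_at = {}  # user_id -> wall-clock time of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def choose_replica_by_time(user_id, leader, followers):
    """Read from the leader if the user wrote recently; otherwise pick any
    follower lagging less than the window (lag_behind_leader() is hypothetical)."""
    if time.time() - last_write_at.get(user_id, 0.0) < RECENT_WRITE_WINDOW:
        return leader
    fresh = [f for f in followers if f.lag_behind_leader() < RECENT_WRITE_WINDOW]
    return fresh[0] if fresh else leader

def choose_replica_by_timestamp(user_last_write_seq, leader, followers):
    """Logical-timestamp variant: read only from replicas that have applied
    at least the user's last write (applied_seq() is a hypothetical helper)."""
    for f in followers:
        if f.applied_seq() >= user_last_write_seq:
            return f
    return leader
```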

Cross-device read-after-write consistency

You can tweet from your desktop and then refresh your Twitter timeline on your phone. A correctly designed system should show you your tweet on your mobile device even though it was sent from your desktop. Read-after-write consistency across devices adds complexity: the schemes that remember the timestamp of the last write may no longer work, because the mobile device can’t possibly know when a write was made from the desktop browser. This information now has to be stored on the server side so that all devices in use by a user can know when the user last made an update.
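One way to make this concrete, sketched below with illustrative names, is to keep the user’s last-write marker in a server-side store keyed by user ID, so that the read path of any device (desktop, phone) consults the same record.

```python
# Server-side record of each user's last write, shared by all of that user's
# devices. The in-memory dict stands in for a real metadata store.
last_write_seq = {}  # user_id -> logical timestamp of the user's last write

def on_write(user_id, seq):
    # Called on the write path, no matter which device issued the write.
    last_write_seq[user_id] = max(last_write_seq.get(user_id, 0), seq)

def min_required_seq(user_id):
    # Any device's read path asks the server how fresh a replica must be
    # before it is allowed to serve this user's reads.
    return last_write_seq.get(user_id, 0)
```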

Geographically distributed replicas

Modern architectures replicate data across geographies to move data closer to users and to mitigate the loss of data from continent-wide calamities. If the replicas live in geographically distributed datacenters and data needs to be read from the leader, then it must be ensured that read requests from all of the user’s devices can reach the datacenter that hosts the leader.

Monotonic reads

Suppose you follow the president of the United States (POTUS) on Twitter. POTUS posts two tweets in quick succession, the first one at timestamp T1 and the second one at T2. The writes go through the leader and are then replicated to followers F1 and F2. F1 almost immediately catches up to the leader and records both tweets. However, F2 experiences a network partition for the time being and only replicates the first tweet.

Now, if you log into your Twitter account and navigate to POTUS’s timeline, your request may be routed to follower F1, which is in sync with the leader and returns both tweets. Say you then refresh the browser and this time your request is routed to follower F2, which has yet to copy the second POTUS tweet. Suddenly the POTUS timeline is missing data that you had already seen the first time you loaded it.

In other words, you are reading older data after you already read newer data in the first query. Note that it would have been fine if the first query had returned the older, incomplete result, because then the second query wouldn’t appear to be going backwards in time. The monotonic reads guarantee ensures that one never observes stale data after already having observed newer data. Without monotonic reads, one can observe the state of the data at an earlier point in time after having seen its state at a later point in time.

One possible solution is to direct all reads from a particular user to the same replica. In our example, if all requests from our account were directed to F2, we’d never see the Twitter timeline go backwards in time. We might not see the latest data right away, but because of eventual consistency we should eventually receive all the updates. Monotonic reads is a weaker guarantee than strong consistency but stronger than eventual consistency.
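Here is a minimal sketch of the pin-a-user-to-a-replica idea, assuming a static list of replicas: hash the user ID to pick a replica deterministically, so successive reads from the same user always hit the same follower and never appear to go backwards in time. A real system would also have to handle replica failure and rebalancing.

```python
import hashlib

def replica_for_user(user_id, replicas):
    """Deterministically map a user to one replica so all of their reads hit it."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["follower-1", "follower-2", "follower-3"]
print(replica_for_user("alice", replicas))  # always the same follower for "alice"
```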

Consistent prefix reads

Imagine a scenario where the North Korean dictator tweets “Successful missile tests!” and in response POTUS tweets “Sanctions imposed on North Korea”. The timeline of tweets should reveal that sanctions were imposed by the US after North Korea test-fired missiles, but if the order of the tweets is flipped, users could conclude that North Korea test-fired missiles after the US imposed sanctions.

Taking this example a step further, there’s a causal relationship between the two tweets, which translates into writes on the backend database. It could happen that the tweet from the North Korean dictator is replicated on a set of servers in Asia while the one from POTUS is replicated on a set of servers in North America. Eventually the tweets get copied to all the servers, but for a short period of time the regional servers may not have the cross-continental tweets. In such a scenario, a user retrieving tweets in Brazil may connect to the North American servers, see the tweet from POTUS first and then the tweet from North Korea, and conclude that North Korea test-fired missiles in retaliation for US sanctions. Because of the lag in replicating tweets globally, the order of the writes appears altered.

The consistent prefix reads guarantee says that if a sequence of writes happens in a certain order, then the writes are served to the reader in the same order. This is especially important for distributed databases with multiple partitions, where there’s no global ordering of writes. One mitigation is to direct all related writes to the same partition. Dependency-tracking software is used in situations where writing to the same partition isn’t possible.

Going back to our tweet example, a tweet and its responses can all be written to the same partition so that the reader always sees the related tweets in order.
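A minimal sketch of that idea, with illustrative field names: partition each tweet by the ID of the conversation it belongs to (the root tweet), so a tweet and its replies land on the same partition and are read back in the order they were written.

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative partition count

def partition_for(conversation_id):
    """Hash the conversation id to a stable partition number."""
    digest = hashlib.sha256(str(conversation_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def route_write(tweet):
    # A reply carries the conversation id of the tweet it responds to,
    # so both writes hash to the same partition.
    return partition_for(tweet["conversation_id"])

original = {"id": 1, "conversation_id": 1, "text": "Successful missile tests!"}
reply    = {"id": 2, "conversation_id": 1, "text": "Sanctions imposed on North Korea"}
assert route_write(original) == route_write(reply)
```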