System Design Notes
Don’t forget to get your copy of Designing Data-Intensive Applications, the single most important book to read for system design interview prep!

Log Replication

In master-slave replication, the master is responsible for shipping the changes it receives via clients’ write requests to its followers. These changes are first recorded in a log file on the master. We’ll examine the various ways this replication log can be implemented.

Statement based replication

Relational databases such as MySQL or Oracle have associated query languages such as SQL. These languages consist of statements that read, write, or modify records in the database. One way to maintain a replication log on the master node is to simply record every statement, e.g. INSERT, UPDATE, or DELETE, received from clients. Note that we don’t need to record statements that perform reads, since they don’t mutate the database. The logged SQL statements can then be sent to the followers, which execute them on their replica of the data to stay in sync with the master.
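
To make the idea concrete, here is a minimal sketch of statement-based replication in Python, using sqlite3 as a stand-in for the master’s and a follower’s storage. The Master and Follower classes and the statement_log attribute are illustrative names, not part of any real database.

```python
# A minimal sketch of statement-based replication, using sqlite3 as a
# stand-in for the master's and follower's storage. Class and attribute
# names (Master, Follower, statement_log) are illustrative only.
import sqlite3

class Master:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.statement_log = []          # replication log of write statements

    def execute(self, sql, params=()):
        self.db.execute(sql, params)
        self.db.commit()
        # Only mutating statements are logged; reads are skipped.
        if not sql.lstrip().upper().startswith("SELECT"):
            self.statement_log.append((sql, params))

class Follower:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")

    def apply(self, log_entries):
        # Replay the master's statements in the same order they were logged.
        for sql, params in log_entries:
            self.db.execute(sql, params)
        self.db.commit()

master, follower = Master(), Follower()
master.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
master.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
follower.apply(master.statement_log)
print(follower.db.execute("SELECT * FROM users").fetchall())  # [(1, 'alice')]
```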

Problems

This approach may sound simple and effective, but it comes with its own set of problems. Some of these are:

  • The SQL statements can contain non-deterministic functions such as NOW(), which returns the current time, or RAND(), which returns a random number. These functions are likely to evaluate to different values on different nodes. This issue can be overcome by having the master replace the call to the non-deterministic function with the fixed value it evaluated before passing the statement to its followers (a sketch of this rewrite follows the list).
  • If a SQL statement uses an auto-incrementing column, or depends on data already present in the database, then it must be executed in exactly the same order as it was on the master. This can restrict the concurrent execution of transactions on the system.
  • Database constructs such as triggers, stored procedures, or user-defined functions can have different outcomes when executed on each of the follower nodes. Care must be taken to ensure that statements with side effects are deterministic.
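
As a rough illustration of the first problem’s workaround, the hypothetical rewrite_now() helper below shows how a master might substitute the concrete timestamp it evaluated in place of a NOW() call before logging the statement. Real databases do this inside the server rather than with a regex; this is only a sketch.

```python
# Illustrative sketch: before logging a statement, the master replaces a
# non-deterministic NOW() call with the concrete timestamp it evaluated,
# so followers replaying the statement produce identical rows. The
# rewrite_now() helper is hypothetical, not part of any real database.
import re
from datetime import datetime, timezone

def rewrite_now(sql: str) -> str:
    fixed = datetime.now(timezone.utc).isoformat()
    # Replace every NOW() call with the literal timestamp evaluated on the master.
    return re.sub(r"NOW\(\)", f"'{fixed}'", sql, flags=re.IGNORECASE)

original = "INSERT INTO logins (user_id, login_at) VALUES (42, NOW())"
print(rewrite_now(original))
# e.g. INSERT INTO logins (user_id, login_at) VALUES (42, '2024-01-01T12:00:00+00:00')
```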

Write Ahead Log (WAL)

The write-ahead log is an append-only file that records all the changes that take place on the master node; in the context of a database, it is a sequence of bytes containing every write. Depending on the database’s storage engine, this log can be one of the following (a minimal append-only log sketch follows the list):

  • The main place where data is stored, in the case of log-structured storage engines (SSTables and LSM-trees).
  • A log that records every change before it is applied, when the storage engine is a B-tree. In this case individual disk blocks are overwritten in place, but each change is first recorded in the log so that the index can be restored to a consistent state after a crash.
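
The following is a minimal sketch of an append-only log of the kind described above: records are appended and fsync’d before the change is considered durable, and can be replayed in order on recovery or on a follower. The length-prefixed record format is an assumption made for this example; real storage engines use their own, much richer formats.

```python
# A minimal append-only WAL sketch. The byte-level record format
# (length-prefixed key/value pairs) is made up for illustration.
import os, struct

def wal_append(path, key: bytes, value: bytes):
    record = struct.pack(">II", len(key), len(value)) + key + value
    with open(path, "ab") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())   # durability: the change hits disk before it is applied

def wal_replay(path):
    # On recovery (or on a follower), read the records back in order.
    with open(path, "rb") as f:
        data = f.read()
    offset = 0
    while offset < len(data):
        klen, vlen = struct.unpack_from(">II", data, offset)
        offset += 8
        key = data[offset:offset + klen]; offset += klen
        value = data[offset:offset + vlen]; offset += vlen
        yield key, value

wal_append("example.wal", b"user:42", b"alice")
print(list(wal_replay("example.wal")))   # [(b'user:42', b'alice')]
```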

Outside the context of databases, a WAL or some form of it is commonly used. For instance, in the Hadoop Distributed File System, metadata about the files is maintained by an entity called the Namenode. Namenodes follow a master-slave model: the master or active Namenode maintains up-to-date metadata about the files. As changes to the filesystem occur, they are recorded in an edit log. The slave/passive Namenode observes the changes showing up in the edit log and applies them locally, keeping its data in sync with that of the active Namenode. If the active Namenode crashes, the passive one can take its place.

Another popular distributed system, Kafka, is also based on a write-ahead log.

Issues

Some of the issues this approach suffers from:

  • The size of the WAL can grow very large. In Hadoop, for instance, the edit log is checkpointed: a snapshot of the metadata, called the FsImage, is created, and the edit log starts afresh from the checkpoint. This reins in the size of the edit log (a sketch of this checkpointing pattern follows the list).
  • In the context of databases, the WAL is tightly coupled to the choice of storage engine. It contains fine-grained details such as which bytes were changed on which disk blocks, which becomes problematic if the storage format changes. Consider a scenario where the storage format changes in the next version of the database software running on the master and follower nodes. In this situation it is not possible to upgrade the system without bringing it down, because the leader and followers must all run the same software version and use the same storage format. Were that not the case, we could upgrade the followers to the new version, initiate a failover to elect one of them as the new leader, and then upgrade the old leader. This zero-downtime upgrade is not possible when using WAL shipping as the replication strategy.
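
Here is an illustrative sketch of the checkpointing pattern mentioned in the first issue, in the spirit of HDFS’s edit log and FsImage: periodically snapshot the full state and truncate the edit log, so the log only needs to hold edits made after the checkpoint. The file names and the JSON encoding are assumptions made for the example.

```python
# Illustrative checkpointing sketch: snapshot the state (an "FsImage"),
# then start the edit log afresh. File names and format are assumptions.
import json, os

STATE_FILE, EDIT_LOG = "fsimage.json", "editlog.jsonl"

def apply_edit(state: dict, edit: dict) -> None:
    state[edit["key"]] = edit["value"]

def record_edit(state: dict, edit: dict) -> None:
    apply_edit(state, edit)
    with open(EDIT_LOG, "a") as f:
        f.write(json.dumps(edit) + "\n")

def checkpoint(state: dict) -> None:
    # Snapshot the full state, then truncate the edit log so it only needs
    # to hold edits made after this checkpoint.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    open(EDIT_LOG, "w").close()

def recover() -> dict:
    # Rebuild state from the latest snapshot plus the edits made after it.
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    if os.path.exists(EDIT_LOG):
        with open(EDIT_LOG) as f:
            for line in f:
                apply_edit(state, json.loads(line))
    return state

state = {}
record_edit(state, {"key": "/data/a", "value": "block-1"})
checkpoint(state)
record_edit(state, {"key": "/data/b", "value": "block-2"})
print(recover())   # {'/data/a': 'block-1', '/data/b': 'block-2'}
```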

Logical or row-based replication

This method of replication applies in the context of databases. In contrast to WAL shipping, it decouples the replication log from the storage engine’s physical format. The log in this case is called a logical log, to distinguish it from the log created in WAL shipping, which is a representation of the physical data in the storage engine. Since the logical log is decoupled from the storage engine’s internals, it can more easily be kept backwards compatible, allowing the leader and follower to run different versions of the database software. In fact, the two parties can run different storage engines and still work with each other.

The logical log captures the changes happening in the database at the row level; in other words, it captures the delta in the database’s state caused by the execution of each statement on the master node. For instance (a sketch of such row-level records follows the list):

  • When a new row is inserted in a table, all the column values for that row are logged.
  • When a row is updated, the new values of the changed columns are logged, along with information that allows the row to be identified uniquely.
  • When a row is deleted, the primary key that uniquely identifies the row is logged. If the table does not define a primary key, then the values of all columns are logged.
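
To illustrate, the sketch below encodes row-level log records as plain Python dicts and applies them to an in-memory replica. The field names are made up for the example; real systems (for instance MySQL’s row-based binlog) use their own binary formats.

```python
# Illustrative row-level (logical) log records, encoded as plain dicts.
logical_log = [
    # Insert: all column values of the new row are logged.
    {"op": "insert", "table": "users",
     "row": {"id": 42, "name": "alice", "email": "alice@example.com"}},
    # Update: the changed columns plus enough information to identify the row.
    {"op": "update", "table": "users",
     "key": {"id": 42}, "changes": {"email": "alice@new.example.com"}},
    # Delete: the primary key that uniquely identifies the row.
    {"op": "delete", "table": "users", "key": {"id": 42}},
]

def apply_to_replica(replica: dict, record: dict) -> None:
    table = replica.setdefault(record["table"], {})
    if record["op"] == "insert":
        table[record["row"]["id"]] = dict(record["row"])
    elif record["op"] == "update":
        table[record["key"]["id"]].update(record["changes"])
    elif record["op"] == "delete":
        del table[record["key"]["id"]]

replica = {}
for record in logical_log:
    apply_to_replica(replica, record)
print(replica)   # {'users': {}} -- the row was inserted, updated, then deleted
```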

This technique is also called change data capture. A benefit of this method of replication is that the logical log can be exported to external systems that can easily parse it, such as data warehouses, custom indexes, or caches.

Trigger-based replication

The previous replication methods are implemented by the database itself; however, there are use cases where the application decides whether to replicate the data, replicate only a portion of it, or replicate it to another system. For such scenarios, triggers and stored procedures, a commonly found feature in relational databases, can be used. Application-specific code is registered with a trigger and executed automatically whenever a change occurs in the database. Generally, the trigger’s action involves copying the change into a separate table, which an external process can read and then apply application-specific logic to (see the sketch below).
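
Below is a small sketch of this pattern using sqlite3, which supports triggers: an AFTER INSERT trigger copies each change into a separate changes table that an external replication process could poll and forward. The table and trigger names are illustrative.

```python
# Sketch of trigger-based replication with sqlite3: a trigger copies each
# change into a changes table that an external process can poll and ship
# elsewhere. Table and trigger names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE changes (change_id INTEGER PRIMARY KEY AUTOINCREMENT,
                      op TEXT, user_id INTEGER, name TEXT);

-- The trigger fires on every insert and records the change.
CREATE TRIGGER users_after_insert AFTER INSERT ON users
BEGIN
    INSERT INTO changes (op, user_id, name) VALUES ('insert', NEW.id, NEW.name);
END;
""")

db.execute("INSERT INTO users (name) VALUES ('alice')")
db.commit()

# An external replication process would read the changes table, apply any
# application-specific filtering, and ship the rows to another system.
print(db.execute("SELECT op, user_id, name FROM changes").fetchall())
# [('insert', 1, 'alice')]
```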

Trigger-based replication has more overhead than other replication mechanisms and may also be limited to what triggers can do. Nevertheless, it is still an option that may be suitable for certain use cases.