The data access layer of a million-dollar idea starts out as a single-server relational database. You’re hardly worried about scale issues – you have an application to write! However, if your million-dollar idea ends up being worth even $100K, you’ll likely find your database struggling to keep up with the scale. “No problem,” you say, “I’ll just add a bunch of read-only replicas. After all, my workload is read-heavy and distributing all of my reads between the replicas will significantly ease the load on my trusty old database.” Unfortunately, soon after deploying your read replica strategy you realize that you inadvertently stepped into the mucky marshlands of eventual consistency.
The loss of read-after-write consistency when employing asynchronous replication can be illustrated by the following example:
- A user performs a modifying operation on their account via your application. Your data access layer acknowledges the operation and the user is notified of its success.
- The user fetches the state of their account. Because the modifying operation has only been applied to the master database and hasn’t yet made its way to the replicas, the user is confused to find that their reportedly successful modifying operation has had no effect.
- The modifying operation propagates to replicas. Subsequent reads will observe the modification performed by the user.
This can cause serious issues if your application stores sensitive, mission-critical information, especially for platform use-cases where your users are actually other software applications. What now?
Common Approaches for Dealing with Asynchronous Replication
The read-after-write consistency problem has been very well known for a long time, however, no silver bullet exists for handling it. Common approaches to dealing with it boil down to one of two high-level strategies: wait for writes to make their way to replicas before acknowledging them or delay user reads until replicas are caught up to the master. Both of these strategies are explored vigorously in the industry. The former is usually attempted as synchronous replication at the database layer, which, while sounding good at first glance, is massively complex operationally and rarely works well in practice. At Box, we employ a variant of the latter and that’s what I’m going to focus on in this story.
Years back, when we first started scaling our relational data access, we separated our reads into those that required read-after-write consistency and those that didn’t. Indeed, there are use-cases that truly embrace eventual consistency and don’t mind stale values. If we’re rendering the name of the enterprise a user is part of, it’s really not that big a deal if in the incredibly rare occasion that an admin changes it, it takes a minute to have the change propagate to the enterprise’s users. Such reads are no-brainers in terms of migrating to use read-only replicas.
However, lots of our reads were not tolerant of replication lag. We further divided them into use-cases that were latency sensitive and those that were not. For example, if an asynchronous processing job takes an additional second to complete, no harm is done, while delaying information required to render a page a user requested significantly degrades their experience. For use-cases tolerant of higher latencies, we ended up starting with a common solution. Upon receiving a replication-lag-sensitive, but latency-tolerant read, we followed this algorithm:
- Get the replication position from the master. Relational databases generally offer ways to interrogate replication. In our case, we used the MySQL binlog position.
- Poll positions of replica databases until one passes the position we got from the master or a timeout is reached.
- If a replica caught up with the master is found, query the replica, otherwise fail the read request.
This approach remedies the read-after-write consistency issues, but it does so at the cost of higher latency. After all, the read is performed against a replica that observes all writes performed before the client requested the read. In addition to higher latency, however, this approach creates a more subtle, but nevertheless a critical problem.
Let’s consider how the system reacts to replication lag. The algorithm described above will fail requests if replication lag prevents a replica from observing the desired replication position within a reasonable timeframe. It’s important to note that failing reads over to the master database in case of replication lag may be tempting, but is very dangerous. Replication generally lags due to heavy writes and thus coincides with heavy load on the master. Reacting to replication lag by directing replica-bound reads to the master creates a thundering herd problem on the already overloaded master and is a recipe for a serious site outage.
With that in mind, simply retrying the operation is the only reasonable course of action available to application code. However, subsequent retries are not any more likely to succeed during persistent replication lag. The diagram below illustrates this phenomenon. In the diagram, each arrow represents a timeline of a database server applying writes and each bubble represents an individual write applied.
To summarize, the common approach to dealing with read-after-write inconsistencies is restricted by the following factors:
- High latency in times of reasonable replication lag.
- Persistent failures in times of severe replication lag.
An Alternate Approach
In order to address the downsides of serving replication lag sensitive reads described above, at Box we came up with an alternate design pattern. The pattern is based on the insight that in a multitenant environment, observing all writes is usually not necessary. In other words, if all writes relevant to a user’s read have been propagated to a replica, there is no sense in delaying the read even if database replication is lagged. Indeed, in the above diagram, suppose only the teal bubbles (a and s) are relevant to the user performing reads. Although the data access layer was not aware of this, even the original read could have safely used the replica, as it observed a at the time the user issued their read.
Our modified approach allows application code to identify the specific writes that their reads need to observe. We accomplish this by empowering application code to follow significant writes with retrieval of the master’s replication position. The retrieved replication position can then be passed in to the data access layer together with subsequent reads that need to observe it.
This approach yields some important advantages:
- In times of reasonable replication lag no additional latency is incurred. In practice, by the time a read is issued, usually the significant writes have already made their way to at least one of the replicas.
- In times of severe replication lag, each subsequent retry is more likely to succeed.
Box’s relational data access service provides primitives for easy implementation of this design pattern. It is used extensively at Box to send as much as 75% of our read traffic to read-only replicas without sacrificing either latency or read-after-write consistency. We haven’t yet needed to take this approach to its ultimate extent, as our attention was pulled towards other bottlenecks related to write traffic and storage. That said, there is no reason why this strategy couldn’t be used to send all user read traffic to read-only replicas. Indeed, if it’s possible to identify all writes relevant to a user, it’s possible to identify a precise position in your replication stream that must be observed by a replica in order for you to safely send any of the user’s reads to it.
Now that you know how to free your app developers from worrying about replication lag, you just need to convince them it’s important to shed load off the master in the first place :). If they’re anything like ours though, it should be an easy conversation.