In the previous part, we covered the endpoint and schema mappings along with the client endpoint switches. In this part, we’ll discuss how to execute dual-write reliably. I hope this post provides valuable insights into how to design an online migration.
- Part I: Background of the migration and current state of the balance service
- Part II: Challenges of the migration and my approach to address them
- Part III: Mappings of the endpoints and the schema, client endpoint switches
- Part IV: How to execute dual-write reliably (this article)
- Part V: Architecture transitions, rollback plans, and the overall migration steps
Dual-Write
Requirements
For online data migration, the functional requirement for dual-write is to support both reading and writing to v1 and v2 data. Specifically, a dual-write component will select both source and target data; if the target data does not exist, it will write the data to the target database. If it does exist, it will update the record.
The main non-functional requirement for dual-write is to minimize performance degradation, which is a challenge we need to tackle since some drop in performance is unavoidable when executing dual-write.
Dual-Write Component
Before we delve into the component responsible for executing dual-write, some readers may have questions about how we plan to implement it. This aspect will be detailed in the next Dual-Write Logic section, so for now, please assume that we can achieve dual-write through any suitable method.
Which component will execute the dual-write functionality? We have the following three options:
- v1 balance service
- v2 balance service
- A new service
What if we consider using the v1 balance service as the component responsible for dual-write? It would work as follows:
At first glance, this approach seems reasonable. However, it actually introduces two types of race conditions as follows.
Race Condition A refers to a scenario where a CreateExchange request is processed on the v2 balance service before a CreateUserBalanceConsumption request is executed on the v1 balance service, both targeting the same balance account with an amount of 1000.
It’s important to note that CreateUserBalanceConsumption is a v1 API that deducts values from the credit side, in contrast to the CreateUserBalanceAddition logic discussed earlier. Additionally, while the v2 CreateExchange API operates with double-entry bookkeeping, we will concentrate on the credit side for this explanation.
In this race condition, because the dual-write occurs from the v1 balance service to the v2 balance service (but not the other way around), any changes made on the v2 side won’t be reflected in the v1 data. As a result, the v1 balance service will detect a discrepancy between its data (Amount = 1000) and the v2 data (Amount = 0), ultimately leading to an inconsistent data error being returned to the client.
Race Condition B presents a variation of Race Condition A, where there is no dual-write involved. Even though the dual-write isn’t happening here, a similar situation can still arise. In this case, the consequences could be more severe than in Race Condition A, as the v1 balance service (which is supposed to handle the dual-write) would be unable to identify the differences between its data (Amount = 1000) and the v2 data (Amount = 0). This could allow the v1 CreateUserBalanceConsumption request to succeed, leading to further inconsistencies.
Could these race conditions occur in our environment? Yes, they can happen due to our canary deployment strategy, which allows us to test new images by deploying them as a single Kubernetes pod for a limited time. During this testing phase, some requests may be routed to the canary pod, while most requests will continue to be directed to the pods with the latest stable image.
What about the third option: using a new service? If we implement a new service that handles the dual-write instead of relying on the v1 and v2 services separately, the architecture would look like this:
With this option, client services would need to change their endpoints twice: first from v1 to the intermediate state (for the new service) and then again to v2. As mentioned earlier, we have two write clients and over 20 read clients, meaning the time required for all clients to make these endpoint changes would be considerable. Switching twice would take even longer due to high-priority tasks that may suddenly occupy the attention of those client service teams.
Considering all the options we’ve discussed, I believe the v2 balance service is the best fit for the dual-write component. However, we need to address one more important point regarding the timing of when v1 write clients should switch their endpoints to v2. Let’s explore this in more detail.
Race Condition C describes a situation similar to Race Condition A, with the primary difference being the direction of the dual-write (from the v2 balance service to the v1 balance service in Race Condition C). This means that similar issues could occur regardless of the choices made concerning the dual-write component.
As a result, v1 clients will need to switch their endpoints before executing the dual-write. This leads to a pre-transition period during which the v2 balance service internally calls the v1 endpoints for original v1 requests without executing any of the v2 logic. For more details, please refer to the upcoming Process Overview section.
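To make this pre-transition period more concrete, here is a minimal sketch of how the v2 balance service could proxy an original v1 request to the v1 balance service, written in Go. The type, method, and field names (CreateExchangeRequest, balanceV1Client, CreateUserBalanceAddition, and so on) are hypothetical placeholders rather than our actual proto definitions, and the request/response mapping is reduced to a single field.

```go
package balancev2

import (
	"context"
	"fmt"
)

// Hypothetical request/response shapes; the real ones are generated from proto files.
type CreateExchangeRequest struct {
	AccountID string
	Amount    int64
}

type CreateExchangeResponse struct {
	ExchangeID string
}

type CreateUserBalanceAdditionRequest struct {
	UserID string
	Amount int64
}

type CreateUserBalanceAdditionResponse struct {
	TransactionID string
}

// balanceV1Client abstracts the gRPC client for the v1 balance service.
type balanceV1Client interface {
	CreateUserBalanceAddition(ctx context.Context, req *CreateUserBalanceAdditionRequest) (*CreateUserBalanceAdditionResponse, error)
}

// v2Server handles v2 proto requests during the pre-transition period.
type v2Server struct {
	v1 balanceV1Client
}

// CreateExchange accepts a v2-shaped request but delegates the original processing
// to the v1 balance service; no v2 logic or v2 data is involved yet.
func (s *v2Server) CreateExchange(ctx context.Context, req *CreateExchangeRequest) (*CreateExchangeResponse, error) {
	// Map the v2 request onto the corresponding v1 request (the endpoint/schema mappings from Part III).
	v1Req := &CreateUserBalanceAdditionRequest{
		UserID: req.AccountID,
		Amount: req.Amount,
	}
	v1Resp, err := s.v1.CreateUserBalanceAddition(ctx, v1Req)
	if err != nil {
		return nil, fmt.Errorf("proxy to v1 balance service: %w", err)
	}
	// Map the v1 response back into the v2 response shape expected by the client.
	return &CreateExchangeResponse{ExchangeID: v1Resp.TransactionID}, nil
}
```

Because no v2 data is touched at this stage, clients that revert their endpoint switch do not create any data consistency issues.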
Dual-Write Logic
In the previous section, I concluded that the v2 balance service is the most suitable choice for executing dual-write. In this section, I will discuss reliable methods for implementing dual-write, considering the following three options:
- Google Cloud Datastream with Dataflow (CDC)
- Single database transaction
- Transactional outbox + worker
First, let’s examine the Google Cloud Datastream with Dataflow (CDC) approach. Google Cloud provides change data capture (CDC) through Datastream and data processing capabilities via Dataflow. Below are some important notes about Datastream, quoted from its documentation:
Question: How does Datastream handle uncommitted transactions in the database log files?
Answer: When database log files contain uncommitted transactions, if any transactions are rolled back, then the database reflects this in the log files as "reverse" data manipulation language (DML) operations. For example, a rolled-back INSERT operation will have a corresponding DELETE operation. Datastream reads these operations from the log files.
Question: Does Datastream guarantee ordering?
Answer: Although Datastream doesn’t guarantee ordering, it provides additional metadata for each event. This metadata can be used to ensure eventual consistency in the destination. Depending on the source, rate and frequency of changes, and other parameters, eventual consistency can generally be achieved within a 1-hour window.
https://cloud.google.com/datastream/docs/faq
Based on the above FAQ, Datastream supports only eventual consistency rather than strong consistency. Consequently, I concluded that it is not suitable for executing dual-write.
Next, let’s discuss the approach of utilizing a single database transaction for dual-write. By performing all database operations within a single database transaction, we can prevent any inconsistencies between the v1 schema and the v2 schema.
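As a rough illustration of this approach, the sketch below assumes a Go service using the Cloud Spanner client library: a single ReadWriteTransaction reads the v1 row, buffers mutations for both schemas, and commits them atomically, so either both are updated or neither is. The table and column names (UserBalances for v1, Accounts for v2) are hypothetical, and the balance logic is heavily simplified.

```go
package balancev2

import (
	"context"

	"cloud.google.com/go/spanner"
	"google.golang.org/grpc/codes"
)

// addBalanceDualWrite applies one balance addition to both the v1 and v2 schemas
// inside a single Spanner read-write transaction.
func addBalanceDualWrite(ctx context.Context, client *spanner.Client, userID string, amount int64) error {
	_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		// Read the current v1 balance, which remains the source of truth during dual-write.
		row, err := txn.ReadRow(ctx, "UserBalances", spanner.Key{userID}, []string{"Amount"})
		if err != nil {
			return err
		}
		var v1Amount int64
		if err := row.Columns(&v1Amount); err != nil {
			return err
		}
		newAmount := v1Amount + amount

		muts := []*spanner.Mutation{
			// v1 schema: update the existing balance row.
			spanner.Update("UserBalances", []string{"UserID", "Amount"}, []interface{}{userID, newAmount}),
		}

		// v2 schema: insert the account if it has not been replicated yet, otherwise update it.
		_, err = txn.ReadRow(ctx, "Accounts", spanner.Key{userID}, []string{"Amount"})
		switch {
		case spanner.ErrCode(err) == codes.NotFound:
			muts = append(muts, spanner.Insert("Accounts", []string{"AccountID", "Amount"}, []interface{}{userID, newAmount}))
		case err != nil:
			return err
		default:
			muts = append(muts, spanner.Update("Accounts", []string{"AccountID", "Amount"}, []interface{}{userID, newAmount}))
		}

		// All buffered mutations are committed atomically; every written column value
		// counts toward Spanner's per-commit mutation limit discussed below.
		return txn.BufferWrite(muts)
	})
	return err
}
```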
Let’s revisit the non-functional requirements. Before we considered the single database transaction solution, our primary goal was to minimize API performance degradation. With the introduction of the Cloud Spanner database, we’ve identified an additional requirement, which can be summarized as follows:
- Minimal API performance degradation
- Compliance with the mutation count limit in Cloud Spanner
Regarding API latency, the v2 API latencies are likely to be worse than the current ones due to the extra database operations needed for the v1 schema. However, we’re uncertain about the degree of this degradation at the design phase, so we’ll assess the performance metrics before moving forward with this approach.
The mutation count limit in Cloud Spanner caps the number of mutations (Spanner’s term for the individual changes made within one transaction) that a single transaction may contain, with the limit set by Google Cloud. In other words, the more data we manipulate in one transaction, the more mutations we create, which can push us over the limit. If we surpass this limit, the transaction cannot be committed. We’ll address this topic in more detail in the dedicated Spanner Mutation Count Estimation section in Part V.
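To give a rough sense of the arithmetic (illustrative numbers, not our actual figures): Spanner counts roughly one mutation per column value inserted or updated, plus additional mutations for any affected secondary index entries, against a documented per-commit limit on the order of tens of thousands of mutations. If a dual-write transaction writes, say, 8 columns in a v1 table and 8 columns in a v2 table per resource, that is about 16 mutations per resource, so a single transaction touching a few thousand resources already accumulates tens of thousands of mutations before indexes are counted.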
Finally, let’s consider the transactional outbox + worker approach. For a detailed explanation of the transactional outbox pattern, please refer to Pattern: Transactional outbox on microservices.io. In our case, its primary purpose is not to publish messages atomically, but to allow atomic updates across different schemas.
In this approach, the v2 balance service reads the master data from the v1 schema and inserts a record as an asynchronous request into that schema. A newly introduced dual-write worker then retrieves this record and attempts to update the master data within the v1 schema. For this discussion, we will focus solely on the scenario after the v1 balance clients have successfully switched their endpoints, as concluded in the previous section.
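For reference, here is a minimal sketch of what this variant could look like under the same hypothetical tables as the earlier sketch, with an assumed DualWriteRequests outbox table: the v2 balance service commits its own mutations together with an asynchronous request record, and a separate worker later applies the recorded change to the v1 master data and marks it processed. The read of the v1 master data described above, as well as retries and error handling, is omitted for brevity.

```go
package balancev2

import (
	"context"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

// enqueueV1Update is called by the v2 balance service: it commits the prepared v2
// mutations together with an outbox record describing the v1 change still to be applied.
func enqueueV1Update(ctx context.Context, client *spanner.Client, v2Muts []*spanner.Mutation, requestID, userID string, delta int64) error {
	_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		outbox := spanner.Insert("DualWriteRequests",
			[]string{"RequestID", "UserID", "AmountDelta", "Processed"},
			[]interface{}{requestID, userID, delta, false})
		return txn.BufferWrite(append(v2Muts, outbox))
	})
	return err
}

// processOutbox is run by the dual-write worker: it picks up unprocessed request records,
// applies each delta to the v1 master data, and marks the records as processed.
func processOutbox(ctx context.Context, client *spanner.Client) error {
	_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		type pending struct {
			requestID, userID string
			delta             int64
		}
		// Collect a batch of unprocessed asynchronous request records.
		var reqs []pending
		iter := txn.Query(ctx, spanner.Statement{
			SQL: `SELECT RequestID, UserID, AmountDelta FROM DualWriteRequests WHERE Processed = FALSE LIMIT 100`,
		})
		defer iter.Stop()
		for {
			row, err := iter.Next()
			if err == iterator.Done {
				break
			}
			if err != nil {
				return err
			}
			var p pending
			if err := row.Columns(&p.requestID, &p.userID, &p.delta); err != nil {
				return err
			}
			reqs = append(reqs, p)
		}
		// Apply each pending delta to the v1 master data and mark the record as done.
		var muts []*spanner.Mutation
		for _, p := range reqs {
			row, err := txn.ReadRow(ctx, "UserBalances", spanner.Key{p.userID}, []string{"Amount"})
			if err != nil {
				return err
			}
			var current int64
			if err := row.Columns(&current); err != nil {
				return err
			}
			muts = append(muts,
				spanner.Update("UserBalances", []string{"UserID", "Amount"}, []interface{}{p.userID, current + p.delta}),
				spanner.Update("DualWriteRequests", []string{"RequestID", "Processed"}, []interface{}{p.requestID, true}),
			)
		}
		return txn.BufferWrite(muts)
	})
	return err
}
```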
If we encounter the issues mentioned above, such as API performance degradation and/or exceeding the mutation count limit, it may be worthwhile to consider the transactional outbox + worker approach. This would allow us to reduce the number of database operations, helping to mitigate those issues. However, an important trade-off with this approach is that we must accept the possibility of inconsistent data between v1 and v2 as long as there are unprocessed asynchronous request records in the v1 schema.
Consequently, I would like to propose the single database transaction approach as a dual-write solution. The subsequent sections are written with this single database transaction solution in mind.
Process Overview
In this section, I will explain how the balance client handles requests and responses, as well as how the v2 balance service executes its logic and database operations in conjunction with the single database transaction dual-write solution.
To summarize, the following outlines the process.
- Current state
- Proto interface
- Request: v1
- Response: v1
- Database
- v1 balance service reads/writes only v1 data
This phase is consistent with the current state described in the Current State section in Part I. One important point to note is that the request proxy logic for the v2 balance service is developed in advance.
- State while migrating v1 endpoints to v2
- Proto interface
- Request: v2
- Response: v2
- Database
- v2 balance service reads/writes only v1 data for v1 requests
This phase describes the scenario where v1 balance clients switch their endpoints to v2 to call the v2 balance service APIs. With the request proxy logic implemented in the previous phase, v2 balance clients continue to manage their data in the v1 schema through the v2 balance service. At this stage, the request proxy logic invokes the v1 balance service logic to delegate the original processing and does not yet manipulate data in v2.
Starting from this phase, the client endpoint switch, including any necessary mappings with wrapper APIs and the v1/v2 endpoint mappings, will be applied: the v2 balance service needs to accept v1 balance requests through v2 proto request interfaces, while the v1 balance clients must receive v1 balance responses through v2 proto response interfaces.
As previously mentioned, this phase is necessary to transition v1 balance endpoints to v2 without any significant impact, facilitating an easy rollback if needed. Even if some balance clients revert their endpoint switch, their data will have been managed solely by the v1 balance service logic, thereby avoiding any data consistency issues.
- State in dual-write
- Proto interface
- Request: v2
- Response: v2
- Database
- v2 balance service reads/writes both v1 and v2 data for v1 requests
This phase marks the beginning of dual-write functionality by the v2 balance service. After releasing the dual-write logic in the v2 balance service, it will start duplicating data from the v1 schema to the v2 schema based on the established v1/v2 schema mappings.
The v2 balance service attempts to fetch data from the v1 schema; if the corresponding data does not yet exist in the v2 schema, it inserts it there, and if it already exists, the v2 balance service reads and updates it.
- Final state (after dual-write)
- Proto interface
- Request: v2
- Response: v2
- Database
- v2 balance service reads/writes only v2 data for all requests
In this final phase, the v2 balance service completely transitions away from the dual-write logic and processes requests just as it did prior to this series of steps. At this stage, both v1 and v2 requests are managed seamlessly and without distinction.
Data Backfill
Data backfill refers to the migration of data from the source database to the destination database. In this context, it specifically involves the transfer of data from the v1 schema to the v2 schema.
Let’s consider the scenario without data backfill. For instance, if some users have used our payment functionalities prior to the implementation of dual-write and do not take any action during the dual-write phase, they may encounter a NotFound error when they later attempt to make a payment. This occurs because dual-write has not replicated the users’ data to the v2 schema, resulting in no corresponding data being available in v2 at that time. Therefore, executing data backfill is essential for a successful system migration.
An important requirement for data backfill is to address existing inconsistent data. This presents a valuable opportunity to identify critical inconsistencies that we may not have previously detected. We must enforce this requirement, as we assume that the data inconsistency check batch, which I will explain in the next section, will run for both the v1 and v2 schemas before initiating dual-write. However, it is still possible that we might inadvertently migrate inconsistent data to the destination database, and I will consider how to handle such cases in the future if necessary. Moreover, since the v2 balance service will continue referencing v1 data, we would also need to address the increased database load and latencies that may occur during the data backfill period.
The total number of records to be migrated to the v2 schema could be up to hundreds of billions, raising the question of how to reduce the volume of data backfill. Fortunately, dual-write can significantly reduce the need for data backfill, as it replicates v1 data to the v2 schema in real-time. We can benefit the most by performing the data backfill after running dual-write for a while, because by that point, we hope that most active user data will have already been migrated to the v2 schema.
We should execute data backfill during the dual-write phase rather than at other times for the following reasons:
- If we execute data backfill before dual-write, some migrated data could become outdated when we start dual-write, as the migrated data would not be updated thereafter
- If we execute data backfill after dual-write, the source data would likely be outdated since it would not be updated after finishing dual-write
We will execute data backfill using a dedicated batch application. Given that both the v1 and v2 schemas reside in the same database, the batch application will perform the following operations for each identical pair of resources within a single database transaction (see the sketch after this list):
- Select the v1 resource
- Select the v2 resource
- If the v2 resource exists, do nothing (as it has already been replicated via dual-write)
- Otherwise, insert the identical data into the v2 schema
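Under the same assumptions as the earlier sketches (Go, the Cloud Spanner client, and the hypothetical UserBalances and Accounts tables), the per-pair step could look roughly like this; the actual batch application would iterate over all resource pairs and add the scheduling, throttling, and error handling omitted here.

```go
package backfill

import (
	"context"

	"cloud.google.com/go/spanner"
	"google.golang.org/grpc/codes"
)

// backfillAccount copies a single v1 resource into the v2 schema only if dual-write
// has not already replicated it.
func backfillAccount(ctx context.Context, client *spanner.Client, userID string) error {
	_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		// Select the v1 resource.
		v1Row, err := txn.ReadRow(ctx, "UserBalances", spanner.Key{userID}, []string{"Amount"})
		if err != nil {
			return err
		}
		var amount int64
		if err := v1Row.Columns(&amount); err != nil {
			return err
		}
		// Select the v2 resource; if it already exists, it was replicated via dual-write.
		_, err = txn.ReadRow(ctx, "Accounts", spanner.Key{userID}, []string{"AccountID"})
		if err == nil {
			return nil // already present in v2: do nothing
		}
		if spanner.ErrCode(err) != codes.NotFound {
			return err
		}
		// Otherwise, insert the identical data into the v2 schema.
		return txn.BufferWrite([]*spanner.Mutation{
			spanner.Insert("Accounts", []string{"AccountID", "Amount"}, []interface{}{userID, amount}),
		})
	})
	return err
}
```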
Note: In the following figure, both the v1 and v2 schemas actually reside within the same database; however, they are depicted as separate databases for the sake of clarity and ease of understanding.
When considering which data to backfill, it is easier to identify the data that will not be backfilled. While I will elaborate on this in the Development Tasks section in Part V, some data will remain untransferred since the v1 balance service will continue to manage it even after dual-write is complete. Conversely, we will definitely backfill the v1 master data, also known as v1 resource data.
For non-master data, such as logs and snapshots of specific resources, the decision to backfill depends on whether the v2 balance service logic references this data. If there are no corresponding records in v2, the v2 logic may not function properly.
More specifically, if no requests are made during dual-write for certain data (meaning it isn’t migrated to the v2 schema via dual-write), the v2 balance service may successfully locate the master data migrated from v1 through backfill, but it may not find the dependent data, such as logs and snapshots. If any v2 logic relies on this non-master data, the v2 balance service could return a data loss error or an inconsistent data error due to the absence of those records.
I plan to revisit this point in the future to clarify the exact targets for backfill.
We will consider continuing dual-write for a longer period than initially planned as a future fallback option. This is based on the premise that, in theory, all source data will eventually be migrated to the destination database as long as we keep executing dual-write.
Another option is to forgo both dual-write and data backfill, allowing the v1 data to remain in the v1 schema. It’s important to note that this differs from continuing dual-write: neither option involves data backfill, but the distinction lies in whether we keep executing dual-write. Specifically, this option means that the v2 balance service does not replicate v1 data to v2, but instead manages the v1 data directly.
I’ve considered this approach because it has the advantage of eliminating the need for data migration. If we opt for this path, the situation would be as follows:
- v1 balance clients switch their endpoints to v2
- The v2 balance service manages v1 data for v1 requests via the v1 balance service, while handling v2 requests for v2 data
In this scenario, we would have migrated only the client endpoints from v1 to v2, while each service’s logic would continue to operate in its original location. This means that the v1 balance service and the v2 balance service would operate independently rather than interchangeably. As a result, the v1 balance service logic and data would still reside in v1, which means we would not realize the benefits of the migration. Additionally, we might still need to address any issues in the v1 logic if they arise, so total costs would not be reduced.
Data Inconsistency Check
Using a single database transaction helps us minimize the risk of inconsistent data that could be introduced by dual-write operations. However, in the event that inconsistencies do occur, it is essential that we detect and resolve them as quickly as possible. To achieve this, we will develop a batch application that verifies the consistency of the data using Cloud Spanner’s ReadOnlyTransaction, which does not lock any rows or tables. I won’t go into the specifics of each consistency check here.
When verifying the consistency of bulk data, one important aspect is ensuring that the data is consistent at a specific point in time. I initially considered using BigQuery, which replicates data from our production databases. However, I realized that we cannot completely avoid inconsistent data there because each table is replicated on its own schedule.
There are three types of inconsistent data:
- Inconsistencies within the v1 schema
- Inconsistencies within the v2 schema
- Inconsistencies between the v1 and v2 schemas
The first two types are relatively straightforward; for instance, the Amount value in the Accounts table should match the corresponding value in the latest AccountSnapshots table at the same point in time. The third type, on the other hand, is more complex.
It’s important to note that we will be matching the primary keys of v1 resources with those of v2 resources. Fortunately, since both v1 and v2 data reside in the same Spanner database, we can take advantage of this setup by selecting and comparing both resource types in a single query. While the schemas differ, there are certain consistencies between them that we will verify through the batch application.
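As an illustration of this cross-schema check, the sketch below runs one query inside a Cloud Spanner ReadOnlyTransaction, which provides a consistent snapshot without locking rows, and reports the v1/v2 pairs whose amounts disagree. It reuses the hypothetical table names from the earlier sketches and represents only one of the many checks the batch application would perform.

```go
package consistencycheck

import (
	"context"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

// findAmountMismatches compares v1 and v2 balances at a single snapshot and returns
// the IDs whose amounts disagree.
func findAmountMismatches(ctx context.Context, client *spanner.Client) ([]string, error) {
	// A ReadOnlyTransaction gives a consistent view across both schemas without taking locks.
	txn := client.ReadOnlyTransaction()
	defer txn.Close()

	// Because the v1 and v2 schemas live in the same database, one query can join and
	// compare them directly. This join only covers pairs that exist on both sides.
	iter := txn.Query(ctx, spanner.Statement{
		SQL: `SELECT ub.UserID
		        FROM UserBalances AS ub
		        JOIN Accounts AS a ON a.AccountID = ub.UserID
		       WHERE ub.Amount != a.Amount`,
	})
	defer iter.Stop()

	var mismatched []string
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			return mismatched, nil
		}
		if err != nil {
			return nil, err
		}
		var userID string
		if err := row.Columns(&userID); err != nil {
			return nil, err
		}
		mismatched = append(mismatched, userID)
	}
}
```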
Furthermore, we will ensure that the results of each read and write operation for both the v1 and v2 schemas are identical during the dual-write process. Although this approach is more ad hoc, it is essential for facilitating immediate verification without having to wait for the next execution of the data inconsistency check batch process.
In this part, we covered how we are going to execute dual-write reliably. In the final Part V, we’ll discuss architecture transitions, rollback plans, and the overall migration steps.