Designing a Zero Downtime Migration Solution with Strong Data Consistency – Part V

In the previous part, we covered how we plan to execute dual-write reliably. In this final part, we’ll discuss the architecture transitions, rollback plans, and the overall migration steps. I hope this post offers useful insight into how we keep each phase reversible.

Development Tasks

Here, I’d like to discuss the development tasks required to transition to the post-dual-write state. The topics we will cover include:

  • v1 batch applications, including accounting event processing
  • Accounting code processing
  • Historical data processing
  • Switching the database client in the bookkeeping service
  • Rewriting queries for BigQuery

Let’s begin with v1 batch applications. While I have previously covered the endpoint mappings between v1 and v2 APIs, I have not yet explained the mappings of batch applications. Currently, we have three kinds of v1 batch applications:

  • Batch applications with v1-specific logic, which can be further categorized into:
    • Those based on business requirements, like the point expiration batch
    • Those that don’t depend on business requirements, like the v1 data inconsistency validation batch
  • Batch applications without v1-specific logic, which are ad-hoc batch applications created for specific incidents

We won’t need to migrate batch applications that don’t have v1-specific logic. However, for those that do include v1-specific logic—regardless of whether they’re tied to business requirements or not—we need to create equivalent batch applications on the v2 side.

As I mentioned in the Accounting Event Processing section in Part I, we’ll still need to interact with the accounting service for event processing after dual-write is finished. Since the accounting event-related APIs guarantee idempotency, we’ll develop a batch application on v2 that replicates the logic of the existing v1 batches for sending and reconciling accounting events. During the transition, both batches will run in parallel. Once dual-write is nearing completion, we’ll phase out the v1 batch and use the v2 batch alone to confirm, through reconciliation, that the accounting service has successfully processed every accounting event.
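
To make the parallel run concrete, here is a minimal sketch of how such a v2 batch could submit events. The accountingClient interface and event shape are assumptions for illustration, not the actual service API; the key property is that retries and double submissions are safe because the accounting service deduplicates on the event ID.

```go
package accountingbatch

import (
	"context"
	"log"
)

// AccountingEvent is a hypothetical event payload; the real schema is richer.
type AccountingEvent struct {
	EventID string // deterministic ID, so retries are idempotent
	Amount  int64
}

// accountingClient is an assumed interface; SendEvent stands in for the
// accounting service's idempotent event-related API.
type accountingClient interface {
	SendEvent(ctx context.Context, ev AccountingEvent) error
}

// sendPendingEvents can run in parallel with the v1 batch: because the
// accounting service deduplicates on the event ID, double submission during
// the transition is harmless. Failed events are left for the next run and
// for reconciliation.
func sendPendingEvents(ctx context.Context, c accountingClient, events []AccountingEvent) {
	for _, ev := range events {
		if err := c.SendEvent(ctx, ev); err != nil {
			log.Printf("send event %s failed: %v", ev.EventID, err)
		}
	}
}
```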

Now, regarding accounting code processing, the v1 balance service will continue to handle it even after dual-write is completed. To ensure backward compatibility, the v2 balance service will need to read accounting codes from the v1 schema.

When it comes to processing historical data, we’re aware that it has developed without a well-defined ownership structure, and we plan to re-architect this area soon. As we move through this transition, we’ll need to modify how we write historical data during and after the dual-write phase.

In particular, the v1 balance service will be dedicated solely to reading historical data, while the v2 balance service will take over all write operations once the dual-write process is concluded. Now, let’s take a closer look at how the v2 balance service will manage the writing process for historical data.

While the accounting service guarantees idempotency for processing accounting events, no such guarantee exists for the historical data managed by the v1 schema. Unfortunately, with mutations we can neither read results after a write nor insert the same record multiple times within the same database transaction (for more details, see the Spanner Mutation Count Estimation section later in this post). As a result, once dual-write execution is finished, the v2 balance service will need its own logic for inserting historical data into the v1 schema; until then, the v1 balance service will continue to insert historical data.
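
As a rough sketch of what that logic could look like, assuming both schemas live in the same Spanner database and using hypothetical table and column names, the v2 balance service would buffer the v1 historical-data insert together with its own v2 write in a single read-write transaction, so both land (or fail) atomically:

```go
package balancev2

import (
	"context"

	"cloud.google.com/go/spanner"
)

// applyBalanceChange buffers the v2 write together with the v1 historical
// record in one read-write transaction. Table and column names are made up.
func applyBalanceChange(ctx context.Context, client *spanner.Client, userID string, amount int64) error {
	_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		return txn.BufferWrite([]*spanner.Mutation{
			// v2 schema write (in practice derived from reads in the same transaction).
			spanner.Update("V2Balances", []string{"UserID", "Amount"}, []interface{}{userID, amount}),
			// Historical record in the v1 schema; mutations give no read-after-write,
			// so the history row must be fully constructed before it is buffered.
			spanner.Insert("V1BalanceHistories", []string{"UserID", "Amount"}, []interface{}{userID, amount}),
		})
	})
	return err
}
```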

For the bookkeeping service, which currently connects directly to the v1 balance database, we’ll need to update its logic after the data backfill and before we complete the dual-write phase. This change will enable us to switch its single source of truth (SSOT) from the v1 schema to the v2 schema.
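
One way to make this switch, and a rollback of it, cheap is to hide both schemas behind a single interface and select the source of truth through configuration. The sketch below is illustrative only; the interface and flag name are assumptions, not the bookkeeping service’s actual code.

```go
package bookkeeping

// BalanceReader abstracts reads from either schema; the interface and flag
// below are assumptions for illustration only.
type BalanceReader interface {
	ReadBalance(userID string) (int64, error)
}

type Config struct {
	UseV2Schema bool // flipped after the data backfill, before dual-write ends
}

// newBalanceReader selects the single source of truth. Keeping both schemas
// behind one interface turns the SSOT switch, and a rollback of that switch,
// into a configuration change rather than a code change.
func newBalanceReader(cfg Config, v1, v2 BalanceReader) BalanceReader {
	if cfg.UseV2Schema {
		return v2
	}
	return v1
}
```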

As for BigQuery, we’ll need to update all existing queries to focus exclusively on v2 data after the data backfill is complete. Considering that there are over 500 queries to modify, this task will take some time, so we will start it even before beginning the dual-write phase.

The following diagrams illustrate these changes:

  • Arrow A becomes A’, representing the revised logic for sending accounting events.
  • Arrow B becomes B’, indicating the updated reconciliation process for accounting events.
  • Arrow C becomes C’, signifying the bookkeeping service’s transition from the v1 schema to the v2 schema.
  • Arrow D marks the moment when we stop the dual-write logic.
  • Arrow E shows that the v2 balance service will start reading accounting codes from the v1 schema while simultaneously inserting historical data into the v1 schema.

Fig. 25: Architecture during dual-write phase

The following figure illustrates the final architecture once the dual-write process is complete:

Fig. 26: Final architecture after completing dual-write phase

Rollback Plans

Let’s walk through the architecture transitions from phase A to phase E, shown in the figures below, and look at whether a rollback is possible at each stage.

Transition from Phase A to Phase C (Request Proxy Phase)

In this transition, we can roll back without any additional effort since v1 requests will continue to be processed by the v1 balance service, aided by the request proxy implemented on the v2 balance service.

Transition from Phase C to Phase D (Dual-Write Phase)

Rolling back from the dual-write phase to the pre-dual-write phase would require us to remove the migrated data from the v2 schema, because after the rollback that data would no longer receive updates. When we later resume dual-write, the latest data would have to be selected and replicated again from the v1 schema to the v2 schema. In other words, if we don’t remove the outdated data from the v2 schema, subsequent requests could be processed against it, potentially leading to errors or, worse, to successful processing that produces data inconsistencies.

Removing the migrated data from the v2 schema is safe in principle, but we should have a mechanism in place to do it quickly and reliably.
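
One candidate for that mechanism, sketched below with assumed table and column names, is Spanner’s partitioned DML, which can delete the migrated rows in bulk without being constrained by the per-transaction mutation limit. The real cleanup would also need to account for interleaved tables and run only after v2 writes have been stopped.

```go
package rollback

import (
	"context"

	"cloud.google.com/go/spanner"
)

// deleteMigratedV2Rows wipes migrated rows from a v2 table (hypothetical names)
// using partitioned DML, which runs the DELETE in partitions and is therefore
// not bound by the per-transaction mutation limit.
func deleteMigratedV2Rows(ctx context.Context, client *spanner.Client, migrationID string) (int64, error) {
	stmt := spanner.Statement{
		SQL:    "DELETE FROM V2CustomerBalances WHERE MigrationID = @id",
		Params: map[string]interface{}{"id": migrationID},
	}
	return client.PartitionedUpdate(ctx, stmt)
}
```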

Transition from Phase D’’ to Phase E (Post Dual-Write Phase)

Once we transition to the post-dual-write phase, rolling back is no longer straightforward. Executing a rollback at this stage would require downtime, because the data in the v1 schema becomes outdated soon after dual-write is completed.

Therefore, we must allocate time for synchronization to update the outdated v1 data with the latest information from the v2 schema. Only after this synchronization can a rollback be executed, if necessary.

Fig. 27: Initial state while developing the request proxy logic on the v2 balance service (A)

Fig. 28: Write client endpoint switch while initiating the request proxy (B)

Fig. 29: State when proxying requests (C)

Fig. 30: State during dual-write operations (D)

Fig. 31: State during dual-write operations and data backfill (D’)

Fig. 32: State before completing the dual-write (D’’)

Fig. 33: Final state after the dual-write process (E)

Spanner Mutation Count Estimation

When using Cloud Spanner, one key aspect we need to consider is the concept of mutation and its upper limit count.

Let’s revisit the definition of a mutation:

A mutation represents a sequence of inserts, updates, and deletes that Spanner applies atomically to different rows and tables in a database. You can include operations that apply to different rows, or different tables, in a mutation. After you define one or more mutations that contain one or more writes, you must apply the mutation to commit the write(s). Each change is applied in the order in which they were added to the mutation.

https://cloud.google.com/spanner/docs/dml-versus-mutations#mutations-concept

In Cloud Spanner, the mutation count reflects how much data a single database transaction changes, expressed as a value that Spanner calculates. Although there is no single published formula, the documentation provides guidelines on how mutations are counted for each insert, update, and delete operation.
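
As a small illustration of the definition above, the following Go snippet (with made-up table and column names) buffers an insert and an update against different tables and commits them atomically; every column written in these mutations contributes to the transaction’s mutation count.

```go
package example

import (
	"context"

	"cloud.google.com/go/spanner"
)

// applyExample commits an insert and an update against different tables as a
// single atomic transaction. Every column written here counts toward the
// transaction's mutation total. Table and column names are made up.
func applyExample(ctx context.Context, client *spanner.Client) error {
	ms := []*spanner.Mutation{
		spanner.Insert("CustomerBalances",
			[]string{"UserID", "Amount"}, []interface{}{"user-a", int64(200)}),
		spanner.Update("CustomerBalanceComponents",
			[]string{"UserID", "ComponentID", "Amount"}, []interface{}{"user-a", "component-1", int64(100)}),
	}
	_, err := client.Apply(ctx, ms)
	return err
}
```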

Initially, Cloud Spanner supported a maximum of 20,000 mutations per database transaction. During that time, we faced significant challenges in avoiding the “Mutation limit exceeded” error. Fortunately, this limit increased to 40,000 and has now been raised to 80,000, alleviating our concerns about exceeding the limit in our processes.

With a dual-write solution, in general, we would be executing approximately twice as many database operations compared to those performed on either the v1 schema or the v2 schema. This will lead to a significantly higher total mutation count. As a result, it’s important for us to monitor the mutation count closely, particularly during dual-write operations, to ensure that we remain within the limit.

We have two options for measuring these counts:

  • Measuring them using the Go Spanner library
  • Estimating them based on database operations for each logic pathway

I would like to utilize both methods for measuring mutations. When measuring mutations using the library, we will need to prepare all the necessary test data to execute a specific logic path in the API. During the design phase, I dedicated one or two days to estimating mutation counts for all mappings of v1 and v2 APIs.
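
Before turning to the estimation, here is a sketch of what the library-based measurement could look like: a wrapper that runs the API logic path against the prepared test data and logs the count Spanner reports when commit statistics are requested.

```go
package measure

import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
)

// measureMutations runs the given transaction body (an API logic path driven
// by prepared test data) and logs the mutation count Spanner reports in the
// commit statistics.
func measureMutations(ctx context.Context, client *spanner.Client,
	run func(context.Context, *spanner.ReadWriteTransaction) error) error {
	resp, err := client.ReadWriteTransactionWithOptions(ctx, run, spanner.TransactionOptions{
		CommitOptions: spanner.CommitOptions{ReturnCommitStats: true},
	})
	if err != nil {
		return err
	}
	// This is the number checked against the per-transaction limit (currently 80,000).
	log.Printf("mutation count: %d", resp.CommitStats.GetMutationCount())
	return nil
}
```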

To estimate the mutation counts, I used formulas that incorporated variables representing the number of affected rows in specific tables. Since each API can have multiple execution paths, I focused on the paths that seemed most likely to result in the highest mutation counts.

To illustrate this process, let me provide a simplified example for easier understanding.

Consider an API called AuthorizeBalance, where user balances are represented as sums of individual BalanceComponents. For example, user A has a total balance of 200, consisting of four components: 100 + 50 + 30 + 20.

Now, if we update the Amount column in 1 row of the CustomerBalances table (which has 10 columns) and the Amount column in 4 rows of the CustomerBalanceComponents table (which has 15 columns), the initial mutation count could be calculated as 1 + 4 * 1 = 5. However, it’s important to highlight that when we perform these updates, we actually modify all columns—not just the ones being changed, but also any other columns that were selected during the read operations prior to the write.

In this case, we have:

Mutation count = 10 + 4 * 15 = 70

In reality, the total number of mutations could be significantly higher due to additional insertions and updates. Furthermore, as I explained in the example with just four balance components, the number of affected records can vary from user to user. Therefore, I represented this as a variable in the formula:

Mutation count = 10 + CustomerBalanceComponents * 15

With this formula, we can calculate the total mutation counts by substituting a specific number into the variable. I also analyzed how many rows could realistically be assigned to these variables based on results obtained in BigQuery. By querying how many resources were involved in a single request, I calculated the total mutation counts for each mapping and summarized how high they could be during dual-write execution. Fortunately, based on my estimation, the probability of exceeding the mutation count limit is nearly 0%.
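
For completeness, here is the same estimation written as a toy Go function, using the numbers from the simplified AuthorizeBalance example above and a rough doubling for dual-write; the real formulas had more terms per API and separate column counts per schema.

```go
package estimate

// Column counts from the simplified AuthorizeBalance example above; the real
// formulas had more terms per API and separate counts per schema.
const (
	customerBalancesColumns          = 10
	customerBalanceComponentsColumns = 15
)

// estimateAuthorizeBalance returns the estimated mutation count for one call,
// with the number of affected balance components as the variable.
func estimateAuthorizeBalance(components int64) int64 {
	return customerBalancesColumns + components*customerBalanceComponentsColumns
}

// estimateDualWrite doubles the single-schema estimate as a rough upper bound
// for executing the v1 and v2 writes in the same transaction.
func estimateDualWrite(components int64) int64 {
	return 2 * estimateAuthorizeBalance(components)
}
```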

Migration Steps

Let me summarize what we have discussed so far by presenting the migration steps as follows.

  1. Bottom layer: The lowest square arrows represent the phases of the migration.
  2. Second layer: The layer above indicates when the read and write v1 balance clients switch their endpoints to v2.
  3. Third layer: This layer shows when the data backfill and the data-inconsistency check batch will be running.
  4. Fourth layer: This layer details the quality assurance (QA) performed before each new phase begins.
  5. Top layer: The topmost squared ovals cover all development tasks needed to transition to the subsequent phases.

One important thing to consider is how we approach this migration project as a whole. Looking into the rollback options for each phase, we found that, in theory, we can move to the next phase and still roll back to the previous one without major issues, except for the final rollback from the post-dual-write phase. To be more cautious, however, we can first validate the entire migration process in a proof of concept (PoC) environment. Once everything is validated there, we can follow the same procedures in the production environment.

The key benefit of starting the migration in a PoC environment is that it allows us to make progress gradually, so I’d like to adopt this approach.

Fig. 34: Rough migration steps

Future Work

We have several tasks to complete before we can move forward with this migration. However, we currently have higher-priority work and are understaffed (we’re hiring!).

Given this situation, we’ll start with the pre-migration tasks when we can.

Key Takeaways

1. Focus on Minimal Goals

The saying "Those who chase two hares will catch neither" applies well to a project of this scale. By minimizing the scope early and keeping it small, we increase our chances of success. External factors could disrupt the migration and force additional fixes before completion, so narrowing our goals to the bare minimum is essential.

2. Importance of Research

At the outset of the project, I had no specific knowledge about system and data migration. However, after reading blog posts and articles, I’ve gained valuable insights into best practices and various perspectives that need to be considered.

3. Value of Thorough Investigations

We conducted a detailed investigation of the specifications for the v1 balance service. This investigation was crucial in designing a clear, well-informed solution. Even if the migration does not go as planned, the insights gained will be invaluable for managing the services.

4. Understanding the Details Accurately

Given the scale and complexity of this project, even small details matter, and one minor misunderstanding can lead to disastrous consequences. That’s why I focused on following the logic accurately, especially when colleagues provided new insights on each topic.

5. Evaluating Options and Trade-offs

Exploring various solutions and their trade-offs is essential, especially when preparing for unexpected situations. This approach helps identify critical issues and design the most suitable solutions.

6. Taking Calculated Risks

System and data migration is a substantial project, with some degree of risk being unavoidable. However, by breaking down the issues into manageable units, we can minimize these risks. For example, I estimated the Spanner Mutation counts for all v1 and v2 endpoint mappings.

7. Considering Reversible and Irreversible Actions

As we proceed, we must consider the rollback steps for every action. This is crucial for system and data migration, where an easy rollback process is essential for addressing issues. If we identify some irreversible actions during the design phase, those options may not be feasible or will require more careful consideration.

8. Example-Driven Communications

System and data migration is complex. Therefore, architects must provide clear and detailed diagrams to ensure other engineers understand the concepts without ambiguity.

Conclusion

In this series of posts, I have outlined the background of the migration and explained how I designed the solution for the system and data migration. I hope this information serves as a valuable reference for anyone considering various types of system and data migration.

Thanks for reading this far. Lead the future with these insights!
