Designing a Zero Downtime Migration Solution with Strong Data Consistency – Part II

In the previous part, we covered the background of the migration and the current state of the balance service. In this part, we’ll discuss the challenges of the migration and my proposed approach to addressing them. I hope this post provides valuable insights about how to prepare for a massive migration project.

Challenges

We must satisfy several requirements during the migration, including:

  • Zero downtime
  • No data loss
  • Strong data consistency (i.e., no eventual consistency)
  • Availability
  • Performance
  • Reliability (ensuring that no bugs are introduced)

The most challenging constraint is zero downtime, which prompts us to consider an online migration approach. However, adhering to other constraints makes the entire migration process significantly more complex than it would be if we were able to compromise on some of them.

As previously discussed, the v1 balance service has the following dependencies:

  • Accounting event processing
  • Accounting code processing
  • Historical data processing
  • Bookkeeping (which directly connects to the v1 balance database)
  • BigQuery (for querying v1 data)

More specifically, even during the migration, we need to ensure the following:

  • Continued sending and reconciling of accounting events to the accounting service
  • Ongoing reading and writing of accounting codes
  • Continuous reading and writing of historical data
  • Ensuring the bookkeeping service can execute its logic using up-to-date balance data
  • Guaranteeing that each query reads up-to-date balance data

Additionally, we must address the following concerns:

  • What range of data needs to be migrated
  • Only specific data, which may still require the complete v1 dataset
    • All data
  • The timing and method by which read/write v1 balance clients will switch their endpoints to v2
    • How read/write v1 balance clients will handle mixed logic for both v1 and v2 API calls
    • How read/write v1 balance clients will be informed about the version in which their data exists (see the sketch after this list)
  • The ease of rolling back individual migration phases or even the entire migration after migrating certain v1 behaviors and their corresponding data to v2
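
To make the last two client-side concerns more concrete, here is a minimal sketch in Go of the mixed v1/v2 logic a read client might need to carry during the migration. Every name in it (balanceReader, migratedToV2, and so on) is a hypothetical placeholder rather than our actual client code, and how migratedToV2 is answered is exactly the open question raised above.

```go
// Hypothetical sketch: the mixed v1/v2 logic a client could carry while both
// versions are live. None of these types exist in the real codebase.
package client

import "context"

// Balance is a simplified read model; the real v1/v2 responses differ.
type Balance struct {
	AccountID string
	Amount    int64
}

// balanceReader is the subset of the balance API this sketch cares about.
type balanceReader interface {
	GetBalance(ctx context.Context, accountID string) (Balance, error)
}

// migrationAwareClient dispatches each call to the version that currently
// owns the account's data.
type migrationAwareClient struct {
	v1, v2 balanceReader

	// migratedToV2 answers "where does this account's data live right now?".
	// How it is answered (feature flag, routing table, a hint returned by the
	// balance service itself) is precisely the open question listed above.
	migratedToV2 func(ctx context.Context, accountID string) (bool, error)
}

func (c *migrationAwareClient) GetBalance(ctx context.Context, accountID string) (Balance, error) {
	onV2, err := c.migratedToV2(ctx, accountID)
	if err != nil {
		return Balance{}, err
	}
	if onV2 {
		return c.v2.GetBalance(ctx, accountID)
	}
	return c.v1.GetBalance(ctx, accountID)
}
```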

These are not all of our challenges. An additional implicit challenge looms: the ongoing changes happening in both systems until we complete the migration.

What if we need to update the v1 schema in the midst of the data migration? Any changes made to the v1 schema will also have to be reflected in the v2 schema. Otherwise, even after completing the migration, some behaviors or data may be lost.

In essence, the longer the migration period, the more we need to migrate. This is particularly significant for a large-scale migration project like ours. We essentially need to track the types of behaviors and/or data introduced to the v1 system until we finish the migration. As you can imagine, this will be a substantial effort.

Approach

So far, I’ve covered the assumptions for the migration and provided an overview of the system. Now, let’s dive into our migration approach.

Learning Best Practices

We don’t need to reinvent the wheel. Before diving into the design, I focused on learning best practices for both system and data migration by reading over 80 articles. This gave me a comprehensive understanding of the migration process, including common approaches such as online migration, as well as typical pitfalls to watch out for:

  • Whether each phase can be rolled back
  • Whether to guarantee strong consistency or accept eventual consistency
  • How to detect and handle inconsistent data
  • How clients know where their data is located

For a list of the articles I read, please see the References section at the end of this post.

Migration Roadmap

How many months or years will this work require? I couldn’t answer this question with reasonable accuracy at the beginning of the project, but I can provide a more informed estimate now that I have developed a migration roadmap and a design doc.

Early in the project, I created a migration task list that outlines a range of specific tasks, presented as bullet points, which must be completed throughout the migration process. There are two main reasons for creating this list:

  • To identify essential tasks for the migration
  • To understand the scale of the migration based on those tasks

With insights gained from best practices in system and data migration, I was able to identify the necessary tasks for the entire migration even before designing the solution. All identified tasks are listed below; note, however, that I have not yet completed all of the Phase 1 tasks.

  • Phase 1. Investigation
    • Assess migration feasibility
      • Determine API migration granularity
      • Investigate compatibility between v1 and v2 APIs
      • Implement new v2 APIs
      • Check existing database logic such as stored procedures, triggers, and views
      • Verify compatibility between v1 and v2 schema/data models
      • Validate compatibility between v1 and v2 batch applications
      • Review PubSub-related logic
      • Identify dependent services
      • Identify deprecated v1 APIs
      • Read and understand v1 API code
      • Investigate and resolve issues
    • Clarify dependencies
      • Application dependencies
        • Go version
        • Library/package version
        • Environment variables
        • Estimate Spanner mutation limit
      • Assess network limitations
        • Allowed ingress/egress namespaces
      • Review IAM/privilege limitations
        • Request validations
      • Upstream services analysis
        • Review v1 request parameters
        • Review v1 response parameters
        • Identify v1 API use cases
      • Evaluate subscribed topic/message (PubSub)
      • Downstream services analysis
      • Infrastructure
      • Environment setup
        • sandbox
        • test
      • DB clients
        • Bookkeeping service
      • Manual operations (e.g., queries for BigQuery)
      • Monitoring setup
        • SLOs
        • Availability
      • Tools
        • Slack Bot
        • CI
          • GitHub Actions
          • CI software
        • Linter (golangci-lint)
      • Stakeholder identification
        • Payment team
        • Accounting team
        • Compliance team
      • Compliance adherence
        • JSOX
    • Documentation
      • Design document
      • v1 change log
      • v1 inventory
      • Migration schedule
      • Criteria for deleting PoC and production v1 environments
      • Cloud cost estimation
      • Risk assessment
      • Production migration instructions
      • Post-migration operation manual
      • Technical debt summary
      • Upgrade task list
      • QA test instructions
      • Rollback test instructions
      • Operation test instructions
      • Data backfill test instructions
      • Performance test instructions
      • Client team onboarding document
      • Balance team onboarding document
      • v2 playbooks for each alert
  • Phase 2. PoC
    • Set up PoC environment
    • Fix balance service
      • Update v2 proto interface
      • Implement request proxy logic
      • Develop data consistency validation batch (see the sketch after this task list)
      • Migrate v1 test code to v2
    • Fix client logic
    • Set up tools
      • Datadog dashboard
    • Conduct QA
    • Conduct performance tests
    • Conduct rollback tests
    • Conduct operation tests
    • Conduct tool tests
    • Conduct data backfill tests
    • Monitor data migration, performance, and Spanner mutation count
  • Phase 3. Migration on production environment
    • Switch client endpoints
    • Set up monitoring
    • Fix v1 data to pass data consistency checks
    • Perform data backfill
    • Monitor data migration, performance, and Spanner mutation count
    • Backup data
    • Discontinue PoC environment
    • Discontinue production environment
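
To give one example of what these tasks involve, below is a minimal sketch of the kind of check the Phase 2 “data consistency validation batch” could run. The row shape, the fetcher signatures, and the comparison rule are assumptions made for illustration; a real batch would page through Spanner rather than receive account IDs as a slice, and would have to tolerate in-flight writes.

```go
// Minimal sketch of a data consistency validation batch: fetch the same
// account's balance from v1 and v2 and report mismatches. The types and
// fetcher functions are hypothetical placeholders, not the real batch.
package validation

import (
	"context"
	"fmt"
)

// BalanceRow is a simplified common shape for a v1 or v2 balance record.
type BalanceRow struct {
	AccountID string
	Amount    int64
}

// fetcher abstracts "read this account's row from one of the two databases".
type fetcher func(ctx context.Context, accountID string) (BalanceRow, error)

// Validate compares v1 and v2 rows for the given accounts and returns the IDs
// whose data does not match. A real check would compare every migrated column,
// not just the amount.
func Validate(ctx context.Context, ids []string, fetchV1, fetchV2 fetcher) ([]string, error) {
	var mismatched []string
	for _, id := range ids {
		v1Row, err := fetchV1(ctx, id)
		if err != nil {
			return nil, fmt.Errorf("fetch v1 row for %s: %w", id, err)
		}
		v2Row, err := fetchV2(ctx, id)
		if err != nil {
			return nil, fmt.Errorf("fetch v2 row for %s: %w", id, err)
		}
		if v1Row.Amount != v2Row.Amount {
			mismatched = append(mismatched, id)
		}
	}
	return mismatched, nil
}
```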

Furthermore, I organized these tasks by their dependencies and created a roadmap to provide a rough timeline. I estimated each task based on my experience, though I acknowledge these estimates may not be entirely reliable. Ultimately, this process indicated that the overall timeline could range from two to four years. However, this estimate lacks precision due to the absence of a detailed design and additional supporting resources.

Fig. 8: Roadmap based on the migration task list

In our case, we didn’t need to provide a strict estimate for the schedule at the start of the project. If you’re required to estimate the overall timeline, you can create a roadmap as described above. Once you prepare a design document, you can then refine and support each estimate based on the detailed design.

I admit this is not the most polished format for a migration roadmap. However, I believe it works effectively for estimating the schedule, identifying dependencies, and designing a solution for the migration.

Investigations

With significant assistance from @mosakapi, we gathered almost all the necessary information on the following topics:

  • The request/response parameter mappings between v1 and v2 APIs
  • The schema mappings between v1 and v2 tables
  • The locations where v1 APIs are invoked by all read/write clients
  • v1 API specifications
  • v1 batch specifications
  • Dependent services
  • PubSub messages and their subscribers
  • Spanner DB clients (bookkeeping service)
  • Queries for v1 data (BigQuery)

Since the v2 balance service was released in February of this year and is still relatively new, we were able to collect information about the v2 specifications efficiently, without consuming a significant amount of time.

Alignment

Before designing the solution, I reviewed documents outlining the future roadmap of the payment platform to which my team belongs. It is essential to align the post-migration architecture with the vision described in the future roadmap.

However, it’s also important to acknowledge that we cannot achieve the architecture described in the future roadmap through a single, comprehensive system migration. Therefore, as we proceed with any type of migration, we need to clearly define the migration scope and plan for the subsequent steps following the initial migration.

In fact, we have a roadmap for migrating the accounting service to a newer version, as outlined in the future roadmap document. Initially, I included this migration in the project’s goals. However, I’ve come to realize that completing the accounting system migration in this phase is not feasible due to the additional effort and timeline required. The migration involves extra tasks, such as replicating the functionalities currently offered by the existing accounting service in the new version and ensuring their reliability and performance.

Design Direction

Are you familiar with the book Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith? It’s an excellent resource. The book advocates for the Strangler Fig application pattern, where developers gradually break down a large monolithic application into smaller microservices.
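
For readers who haven’t seen the pattern, its core idea is a facade that intercepts incoming calls and routes already migrated functionality to the new system while everything else still reaches the old one. The sketch below is a hypothetical illustration of such a facade, not our design: the backend URLs and paths are made up, and it uses plain HTTP only to stay short, whereas our services are defined with proto interfaces.

```go
// Minimal sketch of the Strangler Fig idea: a facade that routes migrated
// endpoints to the new service and everything else to the old one.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical backends, used only to illustrate the routing.
	v1, _ := url.Parse("http://balance-v1.internal")
	v2, _ := url.Parse("http://balance-v2.internal")
	toV1 := httputil.NewSingleHostReverseProxy(v1)
	toV2 := httputil.NewSingleHostReverseProxy(v2)

	// Endpoints already "strangled" out of the old service.
	migrated := map[string]bool{
		"/balances/get": true, // hypothetical path
	}

	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if migrated[r.URL.Path] {
			toV2.ServeHTTP(w, r)
			return
		}
		toV1.ServeHTTP(w, r)
	})))
}
```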

We initially considered this approach as the foundation for our migration, intending to migrate smaller parts of v1 behaviors and data into v2 one by one. However, during the design process, I discovered that this gradual migration strategy could be significantly challenging given the dependencies and concerns outlined in the earlier Challenges section.

Take a look at the figure below, which illustrates the API dependency graph. Some APIs are used exclusively by specific resources, while others are accessed by many resources. There are also loosely grouped API suites called by certain sets of resources. However, this loose grouping—with some APIs being accessed by other resources—makes it challenging to gradually migrate smaller parts of the v1 balance service.

Fig. 9: API dependency graph

To be honest, designing a gradual migration plan that properly resolves these dependencies and concerns would have taken me much longer than six months.

Therefore, I prioritized reversible actions over gradual migration, particularly regarding the ease of rollback. In some situations rollback is impossible, and any issue we encounter could then lead to downtime. Reversible actions can be tried out more rapidly than irreversible ones, allowing quicker iteration through trial and error. In the following sections, I will explain the solution based on this principle.

As I mentioned in the Challenges section, the most critical constraint is achieving zero downtime while simultaneously managing other constraints. To address this, we plan to execute an online migration with data backfill, enabling us to migrate data without incurring any downtime. I will explain how we implement online migration while also addressing various other concerns. For more details, please refer to the Dual-Write section in Part IV.
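
As a rough illustration of what dual-write implies for the write path, here is one common shape of the pattern, not necessarily the exact design we adopted (that is the subject of Part IV): each write goes to v1, which remains the source of truth, and is then mirrored to v2, with the backfill and the consistency validation batch repairing any gaps.

```go
// Hypothetical sketch of a dual-write write path. The real ordering, failure
// handling, and choice of source of truth are discussed in Part IV.
package migration

import (
	"context"
	"fmt"
	"log"
)

// BalanceStore is the minimal write interface assumed for this sketch.
type BalanceStore interface {
	ApplyTransaction(ctx context.Context, accountID string, amount int64) error
}

type dualWriter struct {
	v1 BalanceStore // current source of truth
	v2 BalanceStore // migration target
}

// ApplyTransaction writes to v1 first, then mirrors the write to v2 on a
// best-effort basis. A failed v2 write is not returned to the caller: v1 is
// still authoritative, and the gap is repairable by the backfill and the
// consistency validation batch.
func (d *dualWriter) ApplyTransaction(ctx context.Context, accountID string, amount int64) error {
	if err := d.v1.ApplyTransaction(ctx, accountID, amount); err != nil {
		return fmt.Errorf("v1 write failed: %w", err)
	}
	if err := d.v2.ApplyTransaction(ctx, accountID, amount); err != nil {
		// Record the discrepancy for later reconciliation instead of failing.
		log.Printf("v2 mirror write failed for account %s: %v", accountID, err)
	}
	return nil
}
```

Keeping the v2 write best-effort is also what keeps the action reversible: the mirroring can be stopped at any point without affecting v1.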

In Part III, we’ll discuss the endpoint and schema mappings between v1 and v2, along with how clients switch their endpoints.

References
