At our company, we have a payment platform that provides various payment functionalities for our users. One key component of this platform is a balance microservice that currently operates in two versions: v1 and v2.
The v1 balance service is designed as a single-entry bookkeeping system, while v2 is designed as a double-entry bookkeeping system. The two versions are not directly compatible with each other today, but achieving compatibility is not impossible.
Over the past six months, we’ve been investigating how to migrate from the v1 service to the v2 service. The main reason for this migration is that v2 is built with more modern and organized code, which could significantly reduce development costs when fixing bugs and adding new features.
Another motivation for using the newer version of the balance service (v2) lies in the power of double-entry bookkeeping. One key aspect of double-entry bookkeeping is its ability to handle two sets of accounting data as a single transaction: credit (the provision side) and debit (the receiving side). In contrast, single-entry bookkeeping only allows us to track one side of a transaction, which can leave us uncertain about the source or target of that transaction. However, double-entry bookkeeping provides a complete view, enabling us to validate whether the combinations of credit and debit are valid.
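To make the difference concrete, here is a minimal Go sketch of the double-entry idea. The types, field names, and account names are illustrative assumptions, not our actual schema: each transaction carries both a credit leg and a debit leg, so we can validate that the two sides balance.

```go
package main

import (
	"errors"
	"fmt"
)

// Entry is one leg of a transaction. In double-entry bookkeeping every
// transaction is recorded as at least one credit (the provision side)
// and one debit (the receiving side).
type Entry struct {
	Account string
	Credit  int64 // amount provided by this account
	Debit   int64 // amount received by this account
}

// Transaction groups the credit and debit legs so they can be validated
// and persisted together.
type Transaction struct {
	ID      string
	Entries []Entry
}

// Validate checks the fundamental double-entry invariant:
// total credits must equal total debits.
func (t Transaction) Validate() error {
	var credit, debit int64
	for _, e := range t.Entries {
		credit += e.Credit
		debit += e.Debit
	}
	if credit != debit {
		return fmt.Errorf("unbalanced transaction %s: credit=%d debit=%d", t.ID, credit, debit)
	}
	if credit == 0 {
		return errors.New("empty transaction")
	}
	return nil
}

func main() {
	// A user pays 500 from their balance to a merchant account.
	tx := Transaction{
		ID: "tx-001",
		Entries: []Entry{
			{Account: "user:balance", Credit: 500},
			{Account: "merchant:sales", Debit: 500},
		},
	}
	fmt.Println(tx.Validate()) // <nil>
}
```

A single-entry system, by contrast, would record only one of these legs, leaving the counterpart account implicit.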
The goal of this migration is to transition nearly all functionalities from the v1 balance service to the v2 balance service. While we aim to migrate most features, we recognize that there may be exceptions where some functions might still need to be managed by the v1 balance service. The scope of the migration encompasses all components that are impacted by this transition.
Disclaimer:
Please note that we have NOT yet gone through the actual migration process, and the design might change after this series of posts goes live. Even without having experienced the migration myself, I am publishing this series because I believe the considerations and design methods behind a system and data migration of this scale and complexity can offer valuable insights to the industry.
I will cover the following topics to give you a clearer understanding of our system and data migration solution:
- Details of the solution we intend to execute
- My design approach for the solution
What I won’t be discussing includes:
- Our experiences with system migration
- Proven best practices for system migration
- Specific domain knowledge related to accounting, bookkeeping, and payment transactions
This blog is divided into 5 parts as follows:
- Part I: Background of the migration and current state of the balance service (this article)
- Part II: Challenges of the migration and my approach to address them
- Part III: Mappings of the endpoints and the schema, client endpoint switches
- Part IV: How to execute dual-write reliably
- Part V: Architecture transitions, rollback plans, and the overall migration steps
I hope this series of posts provides valuable insights for anyone involved in migration projects.
Acknowledgments
I extend my heartfelt gratitude to @mosakapi, @foghost, and @susho for their invaluable assistance. Special thanks also go to all teams involved for their continuous support.
Current State
Let’s outline the tech stack and current architecture of the balance service first.
The tech stack is as follows:
- Go
- Kubernetes
- gRPC (with protocol buffers)
- Google Cloud Platform
- Cloud Spanner
- Cloud PubSub
Both v1 and v2 have their own gRPC services with distinct APIs (proto interfaces) and batch applications, but they are managed by a single Kubernetes deployment. Additionally, we use canary deployments when rolling out new images.
Likewise, each version has its own database schema (data model), and both schemas are managed in a single Cloud Spanner database. There are no (materialized) views, triggers, or stored procedures in either version.
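As a rough sketch of what "two gRPC services, one deployment" can look like, the snippet below registers hypothetical v1 and v2 services on the same gRPC server. The generated packages, service names, and server types are placeholders, not our actual protos; this is only one possible arrangement.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"

	// Hypothetical generated packages; our actual proto packages differ.
	balancev1 "example.com/balance/gen/v1"
	balancev2 "example.com/balance/gen/v2"
)

// Placeholder implementations of the generated service interfaces.
type v1Server struct{ balancev1.UnimplementedBalanceServiceServer }
type v2Server struct{ balancev2.UnimplementedBalanceServiceServer }

func main() {
	lis, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}

	s := grpc.NewServer()

	// Each version keeps its own proto interface and implementation,
	// but both are served from the same process and hence the same
	// Kubernetes deployment.
	balancev1.RegisterBalanceServiceServer(s, &v1Server{})
	balancev2.RegisterBalanceServiceServer(s, &v2Server{})

	log.Fatal(s.Serve(lis))
}
```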
The following figure illustrates the architecture more clearly:
Next, let's explore the architecture of the components related to the balance service.
Accounting Event Processing
When Mercari awards points to users, we need to keep track of their addition, subtraction, expiration, and consumption. To handle this, we have a dedicated accounting microservice, and the v1 balance service delegates these accounting tasks to it.
Right now, the accounting service functions as a single-entry bookkeeping system, just like the v1 balance service. Client services must perform two key actions: sending accounting events and reconciling those events afterward. The accounting service accepts events through Pub/Sub and provides an API for reconciliation. To ensure accounting events are published promptly, multiple services take part in publishing and reconciling them, and the payment service also sends and reconciles accounting events on its own.
Currently, the accounting team relies entirely on the accounting service for their operations. Therefore, even after we migrate to the new system, it’s essential that the v2 balance service continues to publish accounting events to the Pub/Sub topic and also handles reconciling those events.
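For illustration, here is a minimal Go sketch of the publishing side. The topic name "accounting-events", the project ID, and the event fields are all placeholder assumptions, not the accounting service's real contract.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

// accountingEvent is a hypothetical event payload; the accounting
// service's real event schema is different and not shown here.
type accountingEvent struct {
	EventID string `json:"event_id"`
	UserID  string `json:"user_id"`
	Type    string `json:"type"`   // e.g. "addition", "subtraction", "expiration", "consumption"
	Amount  int64  `json:"amount"` // points
}

func publishAccountingEvent(ctx context.Context, client *pubsub.Client, ev accountingEvent) error {
	data, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	// "accounting-events" is a placeholder topic name.
	topic := client.Topic("accounting-events")
	defer topic.Stop()

	// Publish is asynchronous; Get blocks until the server acknowledges
	// the message, so the caller knows the event was durably published.
	res := topic.Publish(ctx, &pubsub.Message{Data: data})
	_, err = res.Get(ctx)
	return err
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-gcp-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ev := accountingEvent{EventID: "ev-123", UserID: "u-42", Type: "addition", Amount: 100}
	if err := publishAccountingEvent(ctx, client, ev); err != nil {
		log.Fatal(err)
	}
}
```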
Accounting Code Processing
Along with processing accounting events, there’s another internal concept related to accounting called “accounting code”. This is a string value that indicates the purpose of payment actions.
The payment service calls the v1 balance APIs using the accounting code, and the v1 balance service checks the validity of the request by verifying whether the specified accounting code exists in the balance database.
Registering a new accounting code can be done through Slack using a slash command. This command triggers a webhook to the Slack bot server, which then publishes messages for the accounting code registration, allowing the v1 balance service to subscribe to them and insert the specified code.
Additionally, the v1 balance service offers a GetAccountingCode API for GET requests, enabling client services to verify whether an accounting code exists before submitting their requests.
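Below is a rough Go sketch of the subscription side of this registration flow, assuming placeholder subscription, database, table, and column names: it receives a registration message and upserts the code into the balance database.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	"cloud.google.com/go/spanner"
)

// handleRegistration inserts the accounting code carried by the Pub/Sub
// message into the balance database. Table and column names are placeholders.
func handleRegistration(ctx context.Context, db *spanner.Client, msg *pubsub.Message) {
	code := string(msg.Data) // placeholder payload: the accounting code itself
	_, err := db.Apply(ctx, []*spanner.Mutation{
		spanner.InsertOrUpdate("AccountingCodes", []string{"Code"}, []interface{}{code}),
	})
	if err != nil {
		// Nack so Pub/Sub redelivers the message later.
		msg.Nack()
		return
	}
	msg.Ack()
}

func main() {
	ctx := context.Background()

	db, err := spanner.NewClient(ctx, "projects/p/instances/i/databases/balance") // placeholder
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ps, err := pubsub.NewClient(ctx, "my-gcp-project") // placeholder
	if err != nil {
		log.Fatal(err)
	}
	defer ps.Close()

	// "accounting-code-registration" is a placeholder subscription name.
	sub := ps.Subscription("accounting-code-registration")
	if err := sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		handleRegistration(ctx, db, msg)
	}); err != nil {
		log.Fatal(err)
	}
}
```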
Historical Data Processing
The v1 balance service not only manages the latest values of user funds, points, and sales, but also maintains historical data for them.
When users initiate specific payment actions, the payment service calls the v1 balance APIs and includes relevant historical information as metadata. The v1 balance service processes this request and saves the provided metadata.
To access historical data, the v1 balance service offers GET APIs. When these APIs are called, they return a history entity along with the metadata in the response.
The history service uses these APIs to construct the finalized historical record from the returned information and then provides it to the client. It may also call other services' APIs to retrieve details about the original payment information.
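As a purely illustrative sketch, the snippet below shows one possible shape of a history entity with attached metadata and how a downstream service could turn it into a finalized record. Every type and field name here is an assumption, not our actual model.

```go
package main

import "fmt"

// BalanceHistory is a hypothetical shape of the history entity returned by
// the v1 balance GET APIs: the balance change itself plus the metadata that
// the payment service attached when it called the balance APIs.
type BalanceHistory struct {
	HistoryID string
	UserID    string
	Amount    int64
	Metadata  map[string]string // e.g. {"payment_id": "...", "reason": "..."}
}

// HistoryRecord is a hypothetical finalized record that the history service
// builds from the balance history (and, if needed, other services' APIs).
type HistoryRecord struct {
	Title  string
	Amount int64
}

func buildRecord(h BalanceHistory) HistoryRecord {
	// The history service interprets the metadata to produce a
	// user-facing record; the real logic is far richer than this.
	return HistoryRecord{
		Title:  h.Metadata["reason"],
		Amount: h.Amount,
	}
}

func main() {
	h := BalanceHistory{
		HistoryID: "h-1",
		UserID:    "u-42",
		Amount:    -300,
		Metadata:  map[string]string{"payment_id": "p-9", "reason": "purchase"},
	}
	fmt.Printf("%+v\n", buildRecord(h))
}
```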
Bookkeeping
We have a bookkeeping service that functions as a legal ledger component and consists entirely of batch applications.
Ideally, each microservice should maintain its own database and access information from other services via API calls. However, since the bookkeeping process demands a significant amount of balance data, the bookkeeping service connects directly to the v1 balance database so that it can carry out its operations efficiently.
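Here is a hedged Go sketch of what such a direct read could look like: a batch job opening a read-only Spanner transaction against the v1 balance database and scanning balance rows in bulk. The database path, table, and column names are placeholders, not our actual schema.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	// The bookkeeping batch connects straight to the v1 balance database;
	// the database path below is a placeholder.
	db, err := spanner.NewClient(ctx, "projects/p/instances/i/databases/balance-v1")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A read-only transaction gives the batch a consistent snapshot of the
	// large volume of balance rows it needs.
	ro := db.ReadOnlyTransaction()
	defer ro.Close()

	stmt := spanner.Statement{SQL: `SELECT UserID, Amount FROM Balances`}
	iter := ro.Query(ctx, stmt)
	defer iter.Stop()

	var total int64
	for {
		row, err := iter.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		var userID string
		var amount int64
		if err := row.Columns(&userID, &amount); err != nil {
			log.Fatal(err)
		}
		total += amount
	}
	log.Printf("total balance across all users: %d", total)
}
```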
BigQuery
Certain business operations rely on queries against the v1 schema in BigQuery, meaning there are dependencies on v1 data managed by the v1 balance service. In fact, there are more than 500 queries that utilize this v1 data.
The following figure summarizes all the related components described so far, serving as a blueprint that I created for designing the solution. Please note that for convenience, I have split the v1 and v2 balance services and their databases (schemas) into two distinct components.
In this article, we covered the background of the migration and the current state of the balance service. In Part II, we’ll discuss challenges of the migration and my proposed approach to addressing them.