Migrating user payouts to Spanner

The 17th day’s post of the Merpay Advent Calendar 2020 is brought to you by Robert, a Backend Engineer/Engineering Manager at Merpay.

Hope December is treating you well 🙂

I briefly touched upon how as part of Cash I/O we migrated user payouts from a monolith to a microservice in my previous blog post.

This time I will write a bit about how we are migrating to Cloud Spanner.

Cloud Spanner

I’m sure by now you have already heard of Cloud Spanner.

In Google’s own words, it is a:

fully managed relational database with unlimited scale, strong consistency, and up to 99.999% availability.

Here at Mercari and Merpay we work closely with Google to take advantage of the latest and greatest technologies available. Google also has always provided excellent support, since even before we launched Merpay.

Merpay believes in Spanner so much that it has become the default database for any new microservice, not just because it gives us easy scaling and high availability guarantees, but it also makes management easy with access controls as well as audit logging.

We also believe in open source so we released Spanner Autoscaler and wrench which we are using to make the experience even better.

Of course nothing is perfect and I will share some considerations you might want to take when choosing your database.

Migrating from what?

You’re wondering what we’re migrating from?

Well, that depends, as so many things do.

The monolith I mentioned in my previous blog post was, and still is being, migrated from an on-premises MySQL database, some services migrated from Cloud SQL, and some even from Datastore.

During the course of development different decisions are made, which at that time might have been the best ones, but that due to different reasons might prove to no longer be as great.

This post is mainly about migrating user payouts so some parts are specific to migrating from MySQL, but I will also share other general advice.

Migrating how?

To repeat myself, it depends 🙂

Usually there are two main strategies you can take when switching databases:

  1. Downtime and switch
  2. Double-write and switch

There are pros and cons to both strategies, and we’ve actually used both on different services.

When we migrated user payouts from the on-premises MySQL to Spanner we decided to go with the first approach, primarily due to not requiring high availability.

To start at the end, we made a lot of requests to SRE to help us execute the data migration in batches looking like this:

for i in {1..9000..1000}
do
    /php /batch user_payout:migrate --start=$i ~-end=$((i+1000)) --go 2>&1 > out$i.txt &
done

Of course the numbers were a bit bigger 😉

As you can imagine though, before we were able to do that, we had to do a lot of testing.

First on our local environment with dummy records, then on our test environment with QA records, and lastly also using an anonymized production dump to ensure data consistency and prevent any issues.

Due to a lot of legacy data there were many cases where special or old values were being handled in the code, so we also took this opportunity to clean it up.

There were many discussions both before we even started writing a single piece of code, to many more while implementing the logic. Even just discussing whether we should migrate to Spanner took some time 😛

Many people were involved in this undertaking, some that are even no longer here at Merpay, but everyone truly gave it their best.

Kudos to everyone!

Lessons learned

We learned a lot!

Let me first just write some thoughts I have.

Building your data model around a specific database allows you take advantage of everything the database has to offer.

This can be both good and bad.

You can create custom queries that leverage all the capabilities the database has to offer, providing your users with fancy new features, giving them immediate results and so much more.

On the other hand though, if your database choice influences your API design, it can cause a lot of issues.

Your API clients start building their internal data models around the database you choose, making any potential migration to a different database a painstakingly long process.

If you can afford to version your APIs and not support older API versions, then maybe you won’t have any issues. Just make sure that your APIs and clients support deprecating APIs properly 😉

On the topic of API design, make your APIs idempotent! This will make your life much easier, both for normal usage and also for migrating data.

During our migration we saw, at random times, several aborted transactions, which of course you should retry, but then by making your APIs idempotent you can also just retry the batch to migrate the data. Regardless, handle retries idempotently.

If you’re using allow_commit_timestamp=true, always use spanner.commit_timestamp(), don’t use time.Now(), or you might get interesting errors like Cannot write timestamps in the future.

In case you have timestamps, it’s very likely you will have to support paging and ordering so do think about it, because if you migrate to Spanner, then you will have to think about it.

Also pay attention to your primary key and hotspots.

This last one caused us a lot of struggles, because, as the choice of a monotonic increasing primary key was propagated into the API design, one of our clients implemented a certain caching logic based on it.

We updated our APIs to return UUIDs already a long time ago and also changed our clients to use them when calling other APIs. However, on one client, the old IDs were still being used for caching, which resulted in old records behaving inconsistently on that client.

That was a lot of fun to debug and fix. /s

To summarize:

  1. decouple your APIs from your database choice (as much as possible)
  2. handle API deprecation
  3. make your APIs idempotent
  4. think of paging and ordering
  5. choose your primary key wisely
  6. ???
  7. profit!

Lastly

When you’re building an MVP or just starting a company, it makes little sense to think about what might happen in 5 or 10 years.

Once your company starts growing though, at some point you should consider your technical roadmap and prepare well in advance for any potential migration you might want to do, because, as I hope I gave you a glimpse, migrations take time 🙂

I said that this was not the only service that is migrating to Spanner. If you enjoy challenges, believe in our mission, and share our values, do talk to us, we’re looking for engineers!

Tomorrow’s blog post —the 18th in the Merpay Advent Calendar 2020 — will be written by miisan, a QA Engineer. Hope you are looking forward to it!