This post is for the Merpay & Mercoin Advent Calendar 2024, brought to you by @timo from the Merpay Balance team.
This article discusses the race condition that can occur when using multiple database (DB) transactions in one API request, and shares some insight into how we overcame it.
Background
The Balance team is responsible for storing the balances and debts of Merpay users, as well as the related accounting books.
When a user buys something from Mercari or pays in a store using the Merpay Smart Payments (メルペイのあと払い) option, our service creates records to track the user's debts, which need to be repaid to Merpay before the deadline.
It is common for users to repay all of their debts at once.
We do not limit the number of debts a user can repay, so a request might look like the following:
Request: {
  idempotencyKey: "foo",
  CustomerID: 123,
  repayingDebts: {
    {amount: 100, ID: "AAA"},
    {amount: 200, ID: "BBB"},
    {amount: 300, ID: "CCC"},
    ….
    // the number of repayingDebts is not limited
  },
}
It’s common to use a DB transaction to ensure consistency when executing write operations. However, database engines usually limit how much can be inserted or updated in a single transaction.
Cloud Spanner, which Merpay uses as its database service, has a limit on mutations per commit (only 20,000 mutations were allowed as of May 2021). Since many records need to be inserted or updated for a single debt, it was very easy to hit the limit and get an error.
Multiple DB transactions in one request
To work around the mutation limitation, we tried to break down a single DB transaction into multiple ones. For example:
- 1st transaction: Insert the received repayingDebts into tables
- 2nd transaction: Execute the repaying
- 3rd transaction: Mark the status as repaid
Note that the “4. Create & update associated records” step in the 2nd transaction creates an independent DB transaction in each loop iteration to handle each debt, and these transactions are executed in parallel. Without doing that, the performance would not fulfill our service level objective (SLO).
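To illustrate the idea, below is a minimal sketch of such a parallel step using the Cloud Spanner Go client and errgroup. The types and function names (Debt, parallelProcess, repayOneDebt) are simplified assumptions for illustration, not our production code.

package repay

import (
	"context"

	"cloud.google.com/go/spanner"
	"golang.org/x/sync/errgroup"
)

// Debt is a simplified view of one debt to repay (hypothetical type).
type Debt struct {
	ID     string
	Amount int64
}

// parallelProcess runs one independent read-write transaction per debt,
// so no single commit exceeds the mutation limit.
func parallelProcess(ctx context.Context, client *spanner.Client, debts []Debt) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, d := range debts {
		d := d // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			// Each debt gets its own transaction, keeping mutations per commit small.
			_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
				return repayOneDebt(ctx, txn, d) // "create & update associated records"
			})
			return err
		})
	}
	return g.Wait() // fails if any per-debt transaction fails
}

// repayOneDebt is a hypothetical placeholder for the real per-debt logic.
func repayOneDebt(ctx context.Context, txn *spanner.ReadWriteTransaction, d Debt) error {
	// ... buffer the mutations for this debt here ...
	return nil
}

Running the debts in independent transactions is also what allows two concurrent requests to interleave, which leads to the race condition described in the next section.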
One problem fixed, but another came – Race condition
Everything looked good at the beginning, but a new problem appeared later: our system detected inconsistencies in our data. It happens when two requests (A and B) try to repay the same debts.
In the following example, request A and request B run the parallelProcess at almost the same time: request A finishes debts 1 ~ 3, while request B finishes debts 4 ~ 6. Request A can no longer repay debts 4 ~ 6 because those amounts have already been repaid by request B, so it returns INVALID_AMOUNT. Request B is in the same situation, and in the end it leads to a deadlock where neither request can complete.
The race condition happened once or twice a month; it triggered the inconsistency alert, and our on-call engineer needed to recover the data manually. Since the related records might be updated at any time, the recovery queries become even more complicated. The manual operation took about half a day, which affected our team's performance.
Possible Solutions
To solve the race condition, we considered the following solutions:
Rollback mechanism
When the race condition happens and is detected, roll back the status and amount to the values they had before repaying. It can be thought of as the manual recovery operation, but executed programmatically.
Lock mechanism
Since the race condition occurs when two requests repay the same debts, it can be guarded against by allowing only one request to process the repayment at a time, blocking the others until that request finishes.
Merge into 1 DB transaction
Going back to one DB transaction can also prevent the race condition from happening. The root cause is the mutation limitation, so one approach is to find the operations that consume the most mutations and execute them asynchronously, keeping the total number of mutations in the transaction under the limit.
We evaluated the pros and cons of each solution. Considering our database schema design and business requirements, we chose the lock mechanism.
Challenges of the lock mechanism
Challenge 1: Design the key of the lock
How to design the lock key depends on what you want to protect.
In our case, the goal is that only one repaying request for the same debts can be processed at a time for the same customer; other requests with different idempotency keys will be rejected.
So the schema to store lock information is designed as below:
CREATE TABLE Locks (
HashKey STRING(100) NOT NULL,
CustomerId INT64 NOT NULL,
IsLocked BOOL NOT NULL,
IdempotencyKey STRING(100) NOT NULL,
CreatedAt TIMESTAMP NOT NULL,
UpdatedAt TIMESTAMP NOT NULL,
) PRIMARY KEY(HashKey, CustomerId);
Note that HashKey (the same set of debt IDs generates the same HashKey) and CustomerId are used as the PRIMARY KEY to ensure that only one request can hold the lock at a time.
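To make this concrete, here is a minimal sketch of how such a lock could be acquired with the Cloud Spanner Go client. The function names (hashKey, acquireLock), the error value, and the error handling are simplified assumptions, not our actual implementation.

package repay

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"sort"
	"strings"
	"time"

	"cloud.google.com/go/spanner"
	"google.golang.org/grpc/codes"
)

// ErrAlreadyLocked signals that another request currently holds the lock.
var ErrAlreadyLocked = errors.New("another repaying request holds the lock")

// hashKey derives a deterministic key: the same set of debt IDs always
// produces the same HashKey, regardless of their order in the request.
func hashKey(debtIDs []string) string {
	ids := append([]string(nil), debtIDs...)
	sort.Strings(ids)
	sum := sha256.Sum256([]byte(strings.Join(ids, ",")))
	return hex.EncodeToString(sum[:])
}

// acquireLock writes a Locks row inside a read-write transaction. Because
// (HashKey, CustomerId) is the primary key, two requests for the same debts
// compete for the same row instead of locking them independently.
func acquireLock(ctx context.Context, client *spanner.Client, customerID int64, debtIDs []string, idemKey string) error {
	key := hashKey(debtIDs)
	_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		row, err := txn.ReadRow(ctx, "Locks", spanner.Key{key, customerID},
			[]string{"IsLocked", "IdempotencyKey"})
		if err != nil && spanner.ErrCode(err) != codes.NotFound {
			return err
		}
		if row != nil {
			var isLocked bool
			var existingKey string
			if err := row.Columns(&isLocked, &existingKey); err != nil {
				return err
			}
			// Reject a request with a different idempotency key that targets
			// the same debts while the lock is held.
			if isLocked && existingKey != idemKey {
				return ErrAlreadyLocked
			}
		}
		now := time.Now()
		return txn.BufferWrite([]*spanner.Mutation{
			spanner.InsertOrUpdate("Locks",
				[]string{"HashKey", "CustomerId", "IsLocked", "IdempotencyKey", "CreatedAt", "UpdatedAt"},
				[]interface{}{key, customerID, true, idemKey, now, now}),
		})
	})
	return err
}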
Challenge 2: When and how should the lock be unlocked
All of the use cases should be considered because it’s dangerous if any of the records are locked or unlocked unexpectedly.
For example, a request that fails in the middle of repaying is an edge case.
Should it be unlocked or not?
=> If all the target records have been repaid, it can be unlocked. Otherwise, it cannot be unlocked.
(List all the use cases and check that each one behaves as expected.)
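Continuing the earlier sketch (same package and imports), the unlock decision could look roughly like this. The remainingAmount helper and the way a debt's unpaid amount is looked up are hypothetical.

// releaseLockIfRepaid unlocks only when every target debt is fully repaid;
// otherwise the lock is kept and the row stays IsLocked = true.
func releaseLockIfRepaid(ctx context.Context, client *spanner.Client, customerID int64, debtIDs []string) error {
	key := hashKey(debtIDs)
	_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
		for _, id := range debtIDs {
			remaining, err := remainingAmount(ctx, txn, id)
			if err != nil {
				return err
			}
			if remaining > 0 {
				return nil // not fully repaid yet: keep the lock
			}
		}
		return txn.BufferWrite([]*spanner.Mutation{
			spanner.Update("Locks",
				[]string{"HashKey", "CustomerId", "IsLocked", "UpdatedAt"},
				[]interface{}{key, customerID, false, time.Now()}),
		})
	})
	return err
}

// remainingAmount is a hypothetical helper that reads how much of the debt
// is still unpaid, e.g. from a Debts table.
func remainingAmount(ctx context.Context, txn *spanner.ReadWriteTransaction, debtID string) (int64, error) {
	// ... read the debt record and compute the unpaid amount ...
	return 0, nil
}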
Challenge 3: Do we need to record all the locking operations?
The lock key is created from the customer ID and all repaying debt IDs, and there is a use case where a debt can be repaid partially with a different idempotency key. That means the IdempotencyKey column can be overwritten. We considered storing all the locking operations in the database for debugging and investigation. However, we found that it was not really helpful for investigation, and outputting the minimum information to our logging service is enough for debugging.
Other perspectives
Keep in mind the responsibility of your service
During the design phase, we also considered the parameters passed by our clients and tried to detect the specific repayment scenario that caused the race condition, handling it as an exception. However, that would make the service more complicated and harder to maintain. In our case, we just need to ensure that the debts given by clients exist, and repay them if the amount is sufficient.
The performance of parallelProcess
As mentioned above, the parallelProcess loops over all the debts to repay. The more records there are, the slower the request becomes. Our next goal is to figure out how to break through this limit.
Summary
A race condition is a common issue when processes run in parallel. It is easy to introduce but painful to remove.
Our solution has been in production for half a year, and everything has been stable and safe so far.
It took one year to solve this problem, from design and discussion to implementation. But now our team is no longer bothered by the race condition, and we save the time we used to spend on manual recovery operations.
This article shared our experience with the race condition and introduced some possible solutions. I hope one of them inspires something new for you! 🙂
The next article will be by @siyuan. Look forward to it!