* This article is a translation of the Japanese article written on June 7, 2019.
—
This article is the 15th day of MERPAY TECH OPENNESS MONTH 2019.
Hi. I’m @foghost, an engineer of PaymentService development in the Payment Platform team at Merpay.
Merpay is developing a payment system with a microservice architecture. Within this architecture, PaymentService is the central service for payment transaction management. It uses the various payment methods provided by downstream services (including external services) to provide the necessary payment flows to upstream services (Mercari, NFC, QR code payment, etc.) as a common API. The payment flows provided by PaymentService need to manage transactions of money across multiple downstream services. Since we started building PaymentService, payment transaction management has been one of our most important issues. For this reason, we have built a mechanism in PaymentService that can ensure transaction consistency across multiple services.
In this article, I will illustrate the challenges of managing payment transactions in microservices with a simple example. Then, I would like to briefly introduce some approaches we have been practicing in PaymentService.
- Challenges in Ensuring Transaction Consistency in Distributed Systems
- Known Approaches
- The approaches we have been practicing at PaymentService
- Conclusion
Challenges in Ensuring Transaction Consistency in Distributed Systems
To illustrate the challenges of transaction management in distributed systems such as microservices and SOA, I will introduce an example.
Consider a case where a customer pays a merchant 1,000 JPY using two payment methods: 600 JPY in points managed by an internal service and 400 JPY by credit card via an external service. The payment process then has the following steps.
- Create transaction data for payment processing
- Consume 600 JPY in points
- Charge 400 JPY to the credit card
- Add 1,000 JPY to the merchant’s sales receipt
- Send a notification of the settlement result
The above is a sequence diagram for the case where the system does no special handling. If every step completes successfully, the payment process runs through #7 as shown in the diagram and returns "success". In the real world, however, network failures and incidents in dependent services do happen. If a problem occurs in the middle of the payment process, guaranteeing the consistency of the payment transaction across multiple services becomes a major issue.
The following list is by no means exhaustive, but if an error occurs partway through the payment process, the system may end up in states such as these.
- There was a timeout at #1
  - The 600 JPY in points may or may not have been consumed in InternalServiceA.
- The request timed out at #2 (or later) because the processing time limit set by the upstream service was exceeded
  - The upstream service does not know whether the 600 JPY in points have been consumed in InternalServiceA.
- There was a timeout at #3
  - 600 JPY in points have been consumed in InternalServiceA.
  - It is unknown whether the 400 JPY credit line has been consumed in ExternalService.
- At #3, ExternalService returned an application error such as insufficient balance
  - 600 JPY in points have been consumed in InternalServiceA.
- There was a timeout at #4
  - 600 JPY in points have been consumed in InternalServiceA.
  - A credit line of 400 JPY has been consumed in ExternalService.
  - It is unknown whether the merchant has been granted 1,000 JPY in InternalServiceB.
- There was a timeout at #5
  - 600 JPY in points have been consumed in InternalServiceA.
  - A credit line of 400 JPY has been consumed in ExternalService.
  - 1,000 JPY has been granted to the merchant in InternalServiceB.
  - It is unknown whether an event has been sent to MessageQueue.
- The local DB commit failed at #6
  - 600 JPY in points have been consumed in InternalServiceA.
  - A credit line of 400 JPY has been consumed in ExternalService.
  - 1,000 JPY has been granted to the merchant in InternalServiceB.
  - An event has been sent to MessageQueue.
  - Because the DB commit failed, the transaction data and the transaction ID used in the steps above are lost.
When you consider these cases and how you can accurately process the customer’s money in each case, you will understand the challenges of ensuring transaction consistency in distributed systems.
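To make these cases concrete, here is a minimal Python sketch of a flow with no special handling. The `Service` class, the `timeout_after_apply` flag, and the operation names are hypothetical stand-ins for illustration, not PaymentService code. It simulates the timeout at #3, where ExternalService applies the charge but the response never reaches the caller.

```python
class Timeout(Exception):
    """Raised when a downstream call does not answer in time."""


class Service:
    """Hypothetical downstream service that records the requests it applied."""

    def __init__(self, name, timeout_after_apply=False):
        self.name = name
        self.timeout_after_apply = timeout_after_apply
        self.applied = []

    def call(self, operation, amount):
        self.applied.append((operation, amount))  # the side effect happens first
        if self.timeout_after_apply:
            # The work was done, but the response never reached the caller.
            raise Timeout(self.name)
        return "ok"


def naive_pay(points, card, sales):
    """Run the steps in order with no record of progress and no recovery."""
    points.call("consume_points", 600)
    card.call("charge_card", 400)
    sales.call("add_sales", 1000)
    return "success"


def run_demo():
    points = Service("InternalServiceA")
    card = Service("ExternalService", timeout_after_apply=True)
    sales = Service("InternalServiceB")
    try:
        result = naive_pay(points, card, sales)
    except Timeout:
        result = "unknown"  # the caller cannot tell how far the payment got
    return result, points.applied, card.applied, sales.applied
```

After `run_demo()`, the points and the card charge have both been applied, nothing has reached InternalServiceB, and the caller holds no record of which of these happened.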
Known Approaches
The XA protocol has long been advocated as a distributed transaction model, and [2PC](https://en.wikipedia.org/wiki/Two-phase_commit_protocol) (two-phase commit) using XA is well known. MySQL also supports XA transactions based on 2PC. However, 2PC raises performance and availability concerns, such as blocking all participants during execution.
Around 2000, Eric Brewer proposed the "CAP theorem" together with "BASE", a set of principles for distributed systems that emphasizes availability and performance instead of guaranteeing the ACID properties through a global transaction as XA does.
The "CAP theorem" states that a distributed system can satisfy at most two of the following three properties at the same time.
- C: Consistency
- A: Availability
- P: Tolerance to network Partitions
"BASE" (Basically Available, Soft state, Eventual consistency) describes a system that is basically available at all times and tolerates temporary inconsistency, with results becoming consistent eventually rather than in real time. In other words, it is a system that emphasizes "AP" and relaxes "C" a little.
For a concrete picture of how BASE is applied in actual service development, please read "Base: An Acid Alternative", written by Mr. Pritchett of eBay in 2008.
Also, a keynote published around 2008 by Alipay, a popular mobile payment service in China (Alipay is used by 1 billion users as of the time of writing in June 2019), shows an example of BASE application in a large-scale SOA system. They have been developing a payment system that follows BASE characteristics since that time, using methods such as TCC (Try/Confirm/Cancel) and compensating transactions.
TCC is a compensating transaction model in which each dependent service provides three APIs, Try, Confirm, and Cancel, for a single operation. When the main service performs an operation (e.g., reducing a balance), it acts as the coordinator and uses TCC to ensure overall data consistency (eventual consistency).
- Try: Reserve the resources, state, etc. required for the operation on the dependent service side. It is important that once a reservation is made in Try, the Confirm process always succeeds, or the reservation can be rolled back via the Cancel process.
- If even one Try process of a dependent service fails (e.g., times out), execute the Cancel process of the services whose Try succeeded to roll back.
- After all Try processes have succeeded, execute the Confirm processes of the dependent services.
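The TCC flow above can be sketched roughly as follows. This is a minimal in-memory illustration, not Alipay's or Merpay's implementation: `Participant`, `run_tcc`, and the state names are hypothetical. One assumption goes slightly beyond the description above: the sketch also calls Cancel on the participant whose Try raised, because a Try that timed out may still have made its reservation.

```python
class Participant:
    """Hypothetical dependent service exposing Try/Confirm/Cancel."""

    def __init__(self, name, fail_on_try=False):
        self.name = name
        self.fail_on_try = fail_on_try
        self.state = {}  # tx_id -> "reserved" | "confirmed" | "cancelled"

    def try_(self, tx_id, amount):
        if self.fail_on_try:
            raise RuntimeError(f"{self.name}: try failed")
        self.state[tx_id] = "reserved"

    def confirm(self, tx_id):
        self.state[tx_id] = "confirmed"

    def cancel(self, tx_id):
        # Cancel must be safe even if Try never took effect on this participant.
        self.state[tx_id] = "cancelled"


def run_tcc(tx_id, steps):
    """Coordinate Try/Confirm/Cancel over (participant, amount) pairs.

    Returns True when every participant was confirmed, False when the
    transaction was cancelled.
    """
    attempted = []
    try:
        for participant, amount in steps:
            # Record the attempt before calling: a failed Try is ambiguous.
            attempted.append(participant)
            participant.try_(tx_id, amount)
    except Exception:
        for participant in reversed(attempted):
            participant.cancel(tx_id)
        return False
    for participant, _ in steps:
        participant.confirm(tx_id)
    return True
```

In a real system Confirm and Cancel can fail as well, so the coordinator would have to keep retrying them until they succeed. That part is omitted here.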
The approaches we have been practicing at PaymentService
As I have already written, PaymentService must process customers' money accurately. For this reason, we have been tackling transaction management since the early stages of development, searching for a solution that ensures the consistency of the entire payment transaction even when multiple services are involved.
In our payment system, the storage used by each microservice is different. It is also necessary to ensure consistency with external services such as credit cards. So, from the beginning of the development of PaymentService, we have adopted a mechanism to ensure transaction consistency at the application layer, without relying on distributed transaction management functions such as XA (2PC) provided by databases.
The result is a system that ensures the "eventual consistency" of payment transactions while emphasizing availability and performance, as eBay and Alipay have practiced. I will now introduce some of the approaches we have practiced.
Segmentalize payment transaction process
When an error occurs during payment processing, the error handling that should be done depends on both the type of error that occurs and how far the processing has proceeded.
- If the problem can be fixed by retrying, perform the necessary retry process.
- If an immediate retry is not possible, we can also retry the process later with a batch job.
- If the problem cannot be fixed by retrying, execute the necessary rollback process of the dependent services, and then complete the whole process.
To decide whether to execute a retry or a rollback, we also need to know how far the transaction process has progressed.
The example we used earlier to explain the challenges of distributed transaction management runs as a single large transaction process. Therefore, if it fails midway, it is difficult to know where the failure occurred, and recovery is difficult.
In PaymentService, we segmentalize a single payment transaction process into multiple phases and execute them one by one.
- When a payment transaction is accepted, the internal transaction data and ID are finalized as a single phase before any further processing.
- The phases of the payment transaction are recorded.
- The granularity of a phase is determined by how easy it makes the retry and rollback processes.
- Basically, we prefer to make each operation that depends on another service its own phase, and we also save the necessary journal logs for the operation.
By segmentalizing the payment transaction process, we can refer to the recorded phases and narrow down the scope of the retry or rollback process to be performed, even if the process fails midway. This makes it much easier to recover from failures.
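As a rough illustration of recording phases, the sketch below keeps a journal of completed phases and skips them when the same transaction is run again. `PhaseJournal` and `run_phases` are hypothetical names, and the in-memory dictionary stands in for whatever durable storage a real service would use.

```python
class PhaseJournal:
    """In-memory stand-in for the storage where completed phases are recorded."""

    def __init__(self):
        self._done = {}  # tx_id -> list of completed phase names

    def completed(self, tx_id):
        return list(self._done.get(tx_id, []))

    def mark_done(self, tx_id, phase):
        self._done.setdefault(tx_id, []).append(phase)


def run_phases(tx_id, phases, journal):
    """Run (name, action) pairs in order, skipping phases already recorded."""
    done = set(journal.completed(tx_id))
    for name, action in phases:
        if name in done:
            continue
        action(tx_id)  # may raise; the journal then shows where we stopped
        journal.mark_done(tx_id, name)
```

If a phase raises, the journal still lists every phase that finished, so a later call to `run_phases` with the same `tx_id` resumes from the failed phase instead of starting over. A crash between `action` and `mark_done` would cause that one phase to run again, which is one reason the operations themselves need to be idempotent.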
Transaction Coordinator implemented with state machine
The above figure is an example of a state transition in the payment process.
- Created: Acceptance of payment
- Paid: Successful payment
- Failed: Failed Payment
- Refunded: Refunded payment, if there is a refund
After the payment process is segmentalized as explained above, each processing phase corresponds to some intermediate state. For each payment transaction, PaymentService defines the states, the necessary processing, and the state transitions of each processing phase. We use this state machine mechanism to implement a transaction coordinator for payment transaction processing.
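A state machine of this kind can be sketched as a transition table. The sketch below is only an illustration under assumptions: `Created`, `Paid`, and `Failed` come from the figure, while the intermediate states (`PointsConsumed`, `CardCharged`, `RollingBack`) and all event names are hypothetical, and `Refunded` is left out for brevity.

```python
from enum import Enum


class State(Enum):
    CREATED = "Created"
    POINTS_CONSUMED = "PointsConsumed"  # hypothetical intermediate state
    CARD_CHARGED = "CardCharged"        # hypothetical intermediate state
    ROLLING_BACK = "RollingBack"        # hypothetical intermediate state
    PAID = "Paid"
    FAILED = "Failed"


# (current state, event) -> next state. Anything not listed is rejected.
TRANSITIONS = {
    (State.CREATED, "points_ok"): State.POINTS_CONSUMED,
    (State.CREATED, "error"): State.FAILED,
    (State.POINTS_CONSUMED, "card_ok"): State.CARD_CHARGED,
    (State.POINTS_CONSUMED, "error"): State.ROLLING_BACK,
    (State.CARD_CHARGED, "sales_ok"): State.PAID,
    (State.ROLLING_BACK, "points_returned"): State.FAILED,
}


class Payment:
    """Payment transaction whose progress is tracked as a state."""

    def __init__(self, tx_id):
        self.tx_id = tx_id
        self.state = State.CREATED
        self.history = [State.CREATED]

    def fire(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"event {event!r} is not allowed in state {self.state.value}"
            )
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state
```

Because rollback is just another set of transitions, a failure after the points were consumed moves the payment to `RollingBack` and only then to `Failed`, and an event that does not fit the current state is rejected instead of being silently applied.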
The advantages of implementing a transaction coordinator in a state machine are as follows.
- Since the process is given a fixed framework, developers have to think about segmentation from the beginning when implementing payment processes.
- When an error occurs in a phase, retry or rollback processes can be executed as transitions defined for that phase.
- The scope of the retry process becomes easier to narrow down, and only necessary processes need to be executed.
- Rollback can also be implemented by adding the necessary rollback processing for each phase to the overall state transitions.
- The execution state of each segmentalized phase is managed by a common mechanism in the state machine, so it is managed naturally and uniformly without any special attention.
- Since the management is largely unified, it is easier to develop a common automatic recovery mechanism.
- It also makes phases easier to reuse.
There are also a few disadvantages.
- Programs are not written in order from top to bottom as in ordinary programming, which makes programming more difficult.
- When a new member joins, onboarding takes longer.
- Because the processes connected by state transitions may take different routes depending on the trigger, it takes effort to write a program whose behavior is easy to predict.
- Writing tests to confirm the final result becomes very important.
- Debugging becomes more difficult. It is important to keep proper logs for debugging.
- If the phases are segmentalized too finely, performance may be affected.
Idempotency
Without special care, when a retry occurs during the payment process, the operation sent to the dependent service will be executed multiple times, and the customer's balance may be deducted multiple times.
The property that the result does not change even after multiple executions (here, the balance is deducted only once) is called idempotency.
There are two main ways to ensure idempotency in payment transaction processing.
- Provide an I/F on the dependent service side for looking up the final result, and have the caller always check the result before re-executing the operation.
- Design the I/F provided by dependent services to guarantee idempotency, so that the same result is obtained no matter how many times it is executed.
With the first approach, performance may be affected because the system has to make an extra request to look up the result whenever the process is rerun. Furthermore, if someone forgets to implement the result check, it will lead to an incident. For this reason, at Merpay we strongly require that each service provide idempotent APIs, basically following the second approach.
- The API provider should always receive the Idempotency Key as a parameter of the request and use it to ensure idempotency
- The API caller generates a unique Idempotency Key and calls the API to prevent multiple processing.
In the case of PaymentService, when a payment transaction is accepted, the internal transaction ID is finalized before the operation to reduce the customer's balance is performed. Then, as long as the finalized internal transaction ID is passed as the Idempotency Key to the balance manipulation API provided by the dependent service (an API that is guaranteed to be idempotent), the balance is deducted only once, no matter how many times the call is executed.
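The provider side of such an API might look like the following minimal sketch. `BalanceService`, its `consume` method, and the result dictionaries are hypothetical and kept in memory. A real service would have to store the key and apply the deduction atomically, for example in one database transaction.

```python
class BalanceService:
    """Hypothetical balance API that is idempotent per Idempotency Key."""

    def __init__(self, balance):
        self.balance = balance
        self._results = {}  # idempotency_key -> result of the first execution

    def consume(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Replay of a request already handled: return the stored result
            # without touching the balance again.
            return self._results[idempotency_key]
        if amount > self.balance:
            result = {"status": "insufficient_balance", "balance": self.balance}
        else:
            self.balance -= amount
            result = {"status": "ok", "balance": self.balance}
        self._results[idempotency_key] = result
        return result
```

Calling `consume("tx-1", 600)` twice on a balance of 1,000 deducts 600 once and returns the same result both times, so the caller can retry after a timeout without checking first.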
Compensating transaction
When a program is in the middle of a payment process and cannot proceed to the next step, it needs to roll back all the operations it has performed so far and then return a failure result. This process is called a compensating transaction.
Let's use the example from the beginning of this article. Suppose the credit card transaction of an external service (ExternalService) fails after the payment process has confirmed the consumption of points managed by the internal service (InternalServiceA). If the process simply terminates and returns a failure, then from the customer's point of view the payment has failed but the points have been consumed, which is an inconsistency.
In PaymentService, along with the normal transaction process, the compensating transaction process is also managed by the state machine, and the rollback process is executed as necessary to ensure the consistency of payment transaction processing.
When we want a more complete rollback as a compensating transaction, we also divide one operation into two steps, a temporary reservation and the real execution, similar to the TCC approach introduced above.
For example, if the balance consumption process includes both consuming the amount and recording the history, compensating both together in a single process does restore the correct balance after the rollback. However, one side effect is that the history records become dirty.
On the other hand, if we divide the process into two steps, balance reservation and real execution, we get the following.
- In the reservation stage, the balance is just held as a temporary reserved status.
- Once the execution process is called, the actual balance consumption and history recording process will be run.
Rolling back after the reservation step allows a more complete compensation with fewer side effects such as dirty history records.
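The difference between the two styles can be seen in a small sketch. `PointAccount` and its methods are hypothetical names chosen for illustration. `consume` and `compensate` model the one-step style, while `reserve`, `execute`, and `release` model the two-step style.

```python
class PointAccount:
    """Hypothetical point balance with a customer-visible history."""

    def __init__(self, balance):
        self.balance = balance  # spendable balance
        self.reserved = {}      # tx_id -> amount currently held
        self.history = []       # records shown to the customer

    # One-step style: consume immediately, compensate afterwards.
    def consume(self, tx_id, amount):
        if amount > self.balance:
            raise ValueError("insufficient balance")
        self.balance -= amount
        self.history.append(("consume", tx_id, amount))

    def compensate(self, tx_id, amount):
        self.balance += amount
        self.history.append(("refund", tx_id, amount))

    # Two-step style: hold the amount first, record history only on execute.
    def reserve(self, tx_id, amount):
        if amount > self.balance:
            raise ValueError("insufficient balance")
        self.balance -= amount
        self.reserved[tx_id] = amount

    def execute(self, tx_id):
        amount = self.reserved.pop(tx_id)
        self.history.append(("consume", tx_id, amount))

    def release(self, tx_id):
        # Rollback of a reservation; a no-op if nothing is held.
        self.balance += self.reserved.pop(tx_id, 0)
```

Rolling back with `compensate` restores the balance but leaves two history records for a payment that never completed. Rolling back with `release` restores the balance and leaves the history empty.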
Reconciliation
So far, I have discussed how to use a state machine based coordinator to ensure the transaction consistency of payment processing at the application layer. However, we also need to check or monitor whether the transaction consistency of the payment processing is really ensured after it is actually integrated into the system and released. At Merpay, we reconcile the final transaction data between each service. If any inconsistency is detected, we need to take immediate action to recover the data and take fundamental measures for the future.
PaymentService performs inconsistency detection and automatic recovery for its internal transaction processing, and also runs batch processing to reconcile with the accounting data and the balance management system. In addition, the upstream services that depend on PaymentService reconcile with PaymentService to ensure the consistency of the entire Merpay payment system.
There are two main reconciliation strategies.
- A synchronous reconciliation batch calls the transaction data lookup APIs provided by dependent services with the locally determined Idempotency Keys and transaction IDs.
- Asynchronous reconciliation workers compare the local transaction data with result events sent asynchronously by the dependent service.
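At its core, either strategy compares two views of the same transactions. The function below is a minimal sketch of that comparison, assuming both sides can be reduced to a mapping from transaction ID to a `(status, amount)` pair. The function name, the record shape, and the reason labels are hypothetical.

```python
def reconcile(local, remote):
    """Compare two views of the same transactions.

    local, remote: dict mapping tx_id -> (status, amount).
    Returns a list of (tx_id, reason) pairs sorted by tx_id.
    """
    issues = []
    for tx_id in sorted(set(local) | set(remote)):
        if tx_id not in remote:
            issues.append((tx_id, "missing_in_remote"))
        elif tx_id not in local:
            issues.append((tx_id, "missing_in_local"))
        elif local[tx_id] != remote[tx_id]:
            issues.append((tx_id, "mismatch"))
    return issues
```

Each reported pair would then feed the recovery step: retrying, rolling back, or escalating to an operator, depending on the reason.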
Fault Injection Testing
When implementing a distributed system like our payment system, exception handling is much more difficult to implement than normal processing.
Ideally, if we had a list of every anomaly case that could possibly occur, we could use all the techniques described so far to implement all the exception handling before release and guarantee transaction consistency automatically.
In reality, however, no such list exists. Anomaly cases caused by internal factors in the application's business logic are still fairly easy to list, but exception cases caused by external factors, such as network or dependent service failures, are hard to list and predict.
PaymentService has a Fault Injection Testing mechanism that randomly generates abnormal cases caused by external factors during execution. For example, network timeouts, exception errors from the DB, and unstable connections to dependent external services are prepared as common fault injection cases. In addition to unit tests and e2e tests, all payment flows are required to pass Fault Injection Tests, which verify how well the implemented handling covers anomaly cases.
With this mechanism, a newly developed payment API can be tested prior to release against anomaly cases that are normally difficult to predict. Data inconsistencies and idempotency issues can then be found and resolved before release, so the API can be released with higher quality.
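The idea can be illustrated with a small sketch. This is not the PaymentService mechanism itself: `FaultInjector`, `Wallet`, and `pay_with_retry` are hypothetical, and the only faults modeled are a lost request and a lost response. The random generator is seeded so that a failing run can be replayed.

```python
import random


class InjectedFault(Exception):
    """Fault raised by the test harness, not by the wrapped service."""


class FaultInjector:
    def __init__(self, rate, seed):
        self.rate = rate
        self.rng = random.Random(seed)  # seeded so a failing run can be replayed

    def call(self, func, *args):
        # Fail before the call: the request never reaches the service.
        if self.rng.random() < self.rate:
            raise InjectedFault("request lost")
        result = func(*args)
        # Fail after the call: the work was done but the response is lost.
        if self.rng.random() < self.rate:
            raise InjectedFault("response lost")
        return result


class Wallet:
    """Hypothetical idempotent balance service used as the system under test."""

    def __init__(self, balance):
        self.balance = balance
        self.seen = set()

    def consume(self, key, amount):
        if key not in self.seen:
            self.seen.add(key)
            self.balance -= amount
        return self.balance


def pay_with_retry(injector, wallet, key, amount, max_attempts=100):
    """Retry the same idempotent request until a response gets through."""
    for _ in range(max_attempts):
        try:
            return injector.call(wallet.consume, key, amount)
        except InjectedFault:
            continue
    raise RuntimeError("gave up after repeated injected faults")
```

A test would run the flow under many seeds and assert an invariant, for example that the balance was deducted exactly once. If `Wallet.consume` ignored the key, the "response lost" fault would make the retries deduct the amount more than once and the invariant would fail.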
Conclusion
This has been the story of transaction management, one of the most important issues when adopting a microservice architecture. I am sure there are many approaches other than the ones I introduced here. As Merpay is a service that has only just started, we will continue to improve the payment transaction management system as the product evolves, and we are always looking for better ways to do it.