Chaos engineering with Chaos Mesh in payment-service

This post is for Day 12 of Merpay Advent Calendar 2021, brought to you by @Po-An from the Merpay Payment Platform team.

Chaos engineering and Chaos Mesh

Since Netflix created Chaos Monkey in 2010, chaos engineering has become a practice that helps teams foresee and verify system reliability under extreme circumstances. Teams artificially inject failures into the system and experiment to see how the system behaves in such situations.

At Merpay, we have been introducing this concept to further strengthen our systems, using the tool Chaos Mesh, which works well with our Kubernetes clusters. With Chaos Mesh, we can inject network instability such as additional latency, introduce random network failures, or manipulate the clock on a service pod.

Why we are doing chaos engineering

Before we get into the details of how we set up chaos engineering for one of our core services, payment-service, let us first discuss why we are doing this. We started chaos engineering because we suffered incidents and availability degradation when one of the third-party payment gateway services we use had a latency spike on its APIs. Unfortunately, the lack of isolation between different payments and naive timeout settings for external calls amplified the impact of the latency spike on our payment service.

In our payment service, we have an in-memory job queue that processes payment requests asynchronously. However, the service shared a single queue for all "payments." So even payments that did not need the third-party service were delayed, because they had to wait for the originally impacted payments to time out and fail before the next payment could be processed.

After making some changes to mitigate the impact (e.g., fine-tuning the timeouts to fail faster), we still needed a way to verify that our fix actually limited the impact of such incidents. As a result, we started to use chaos engineering! As you can imagine, one of our earlier experiments was to artificially add network latency to the external API calls and verify whether our mitigation could reduce the impact of a latency spike in a dependency service.

latency spike on our dependency service

error rate spiked in the payment service

the error rate was caused by an increasing number of requests piling up in the queue; several of them were even dropped

How we have applied chaos engineering so far

If you have some familiarity with Kubernetes configs (yes, the YAML files 😅), configuring Chaos Mesh is pretty much just adding another Kubernetes manifest that describes the chaos. We will not go into the details of installing Chaos Mesh into the cluster, but we will share some examples of how we set up the chaos for our experiments.

Network latency experiment

As mentioned earlier, one of our goals for chaos engineering is to simulate the situation where a dependency service gets a latency spike and see how our service reacts to it. As a result, the first chaos we tried was adding network latency to see what would happen.

The following is the setup (we have changed a few variables for display purposes) that we used for our experiment:

 apiVersion: chaos-mesh.org/v1alpha1
 kind: NetworkChaos
 metadata:
   name: network-delay
   namespace: payment-service
 spec:
   action: delay
   mode: all
   selector:
     namespaces:
       - payment-service
     labelSelectors:
       "app": "payment-service"
   direction: to
   externalTargets:
     - "external.api.endpoint.url"  # the external API endpoint
   delay:
     latency: "1s" # <-- set this to the latency that you would like to test

With this, we can easily add latency to all calls to the external service. We then check the metrics for both the original service implementation and the updated version to verify that the change indeed mitigates the impact of the latency increase.

Chaos Mesh + Spinnaker pipeline

Although it is pretty easy to run kubectl apply -f network-delay.yaml to inject the chaos into the cluster and kubectl delete -f network-delay.yaml to stop it, you still need to run the commands manually and prepare the Kubernetes manifest each time. As a result, we came up with the idea of using a Spinnaker pipeline to inject the chaos, let it run for a certain period, and delete it after the experiment time has passed.

Since Spinnaker can deploy and delete Kubernetes manifests in the cluster, the setup is pretty straightforward. To start the chaos, we use a Deploy (Manifest) stage as the first stage of the pipeline to deploy the chaos manifest to the cluster. The second stage is simply a Wait stage that waits for the period specified in the pipeline parameters. After letting the chaos run for that period, the last stage cleans it up with a Delete (Manifest) stage.
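As a rough sketch, the three stages can be wired together in a pipeline definition like the following. This is an illustration rather than our actual configuration: the account name, stage names, and parameter name are assumptions, and the manifests array (left empty here) would hold the NetworkChaos manifest shown earlier.

```json
{
  "parameterConfig": [
    { "name": "experimentSeconds", "default": "600" }
  ],
  "stages": [
    {
      "refId": "1",
      "type": "deployManifest",
      "name": "Inject chaos",
      "account": "payment-cluster",
      "cloudProvider": "kubernetes",
      "source": "text",
      "manifests": []
    },
    {
      "refId": "2",
      "requisiteStageRefIds": ["1"],
      "type": "wait",
      "name": "Let the chaos run",
      "waitTime": "${parameters.experimentSeconds}"
    },
    {
      "refId": "3",
      "requisiteStageRefIds": ["2"],
      "type": "deleteManifest",
      "name": "Clean up chaos",
      "account": "payment-cluster",
      "cloudProvider": "kubernetes",
      "location": "payment-service",
      "mode": "static",
      "manifestName": "networkchaos network-delay"
    }
  ]
}
```

The Delete (Manifest) stage targets the chaos resource by kind and name, so the chaos is removed even if the pipeline is re-run with a different manifest body.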

Different types of chaos that we have tried

We have tried a few types of chaos within our team:

  1. additional network latency to a downstream service
    • this was used for the original experiment, to test how our system would react if an external endpoint has a latency spike
  2. random network failure from an upstream service
    • this lets upstream services test for themselves how they would react if payment-service degrades in availability
    • upstream service QA can trigger the chaos before running their tests
  3. time chaos to shift the clock back
    • this changes the time for the payment service. We tried to use this for QA, as we have cases that require testing how a payment would look after a month, due to monthly recurring events such as the monthly repayment for "smart-pay".
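For the random network failure chaos (type 2 above), the manifest looks much like the latency one, swapping the delay action for packet loss. This is a sketch: the 25% loss rate and correlation are illustrative values, not our production settings.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
  namespace: payment-service
spec:
  action: loss
  mode: all
  selector:
    namespaces:
      - payment-service
    labelSelectors:
      "app": "payment-service"
  loss:
    loss: "25%"          # drop roughly a quarter of the packets
    correlation: "25%"   # make consecutive drops somewhat bursty
```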

In our team, we have set up these generalized chaos experiments as Spinnaker pipelines, so other engineers and QA in our company can easily use the pipelines to inject faults into the payment service without manually writing and applying chaos manifests.

Issues and problems with Chaos Mesh so far

Despite the fun of playing around with Chaos Mesh, we did encounter some unexpected bumps in the road.

Extra handshake latency

In our experiment adding latency to the network, we consistently observed that Chaos Mesh delayed our API calls by four times the configured value. For instance, a config that adds 1 second of latency ended up adding 4 seconds. We soon realized the extra latency came from the connection handshake: the delay applies to every packet, so each round trip of establishing a fresh connection is delayed as well, and unfortunately the dependency service could not tune its keep-alive value to reuse connections. As a result, we got 4x our configured latency.

Time chaos is fantastic, but not really for QA

At Merpay, we have a scenario where a "smart-pay" payment will be repaid the next month (similar to a credit card). As a result, the services have time validation on dates, and for QA we need to either create test data "in the past" or "fast-forward to the future." Chaos Mesh supports time chaos, which manipulates the time in the pod by a specified offset. For instance, we tried to run the service "in the past month," create the test data (payments), and then run the repayment QA.
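A TimeChaos manifest for this looks roughly like the following. This is a sketch with illustrative names and offset; the offset string shifts the clock seen by the targeted pods.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: shift-clock-back
  namespace: payment-service
spec:
  mode: all
  selector:
    namespaces:
      - payment-service
    labelSelectors:
      "app": "payment-service"
  timeOffset: "-720h"   # shift the pod clock roughly one month into the past
```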

Unfortunately, although Chaos Mesh works well for running the service with the clock in the past, it also breaks the authentication token. GCP authentication puts the current time into the JWT token, and the authentication then fails because a token issued "a month in the past" is lagging from GCP's point of view. So our service would run for a while (10–30 minutes) and then crash once the authentication token was refreshed.

Not working out of the box for injecting faults into Spanner

With our success in injecting network latency into the dependency service, we were also thinking of injecting faults into one of our most critical dependencies – Spanner. Unfortunately, we have not found a good way to inject that chaos yet.

We tried to add latency or random network failure to the Spanner URL as we did in other experiments. Chaos Mesh first resolves the IPs of the domain name, then injects faults for those IPs at the Kubernetes network layer. However, Spanner appears to have a robust DNS setup: it automatically rotates to new IPs for the URL, so our chaos injection cannot keep up. As a result, our faults rarely took effect during the experiment, and we are still working on finding a solution.

Summary and future work

We have applied chaos engineering with Chaos Mesh to Merpay's payment service. It helped us reproduce scenarios for experiments such as latency spikes and random network failures. We also ran into some trouble along the way, such as when manipulating the time for QA.

In the future, we would like to continue spreading the knowledge and usage of chaos engineering inside Merpay. Also, within our team, we would like to add chaos experiments to our standard release pipeline, so that every release automatically goes through some chaos testing and we can always ship reliably.