Promote Zero Touch Production – further features of Carrier

Author: Morito Ikeda (@moricho), Platform DX team

In our previous blog post, we gave a basic introduction to Carrier, our first step towards Zero Touch Production (ZTP). In a nutshell, Carrier is a mechanism for granting temporary privileges to developers. For more details, please see Shifting to Zero Touch Production.

In this article, I will build on the previous one and introduce a couple more features and improvements that we have added to Carrier to further promote ZTP at Mercari.

Recap

Let me give you a quick recap. The two key components of the system are:

  1. Carrier. Carrier itself is a custom Kubernetes controller that handles the logic of a permission "request" and "review".
  2. Clutch. Clutch is an open source frontend platform from Lyft. Its main purpose is to do infrastructure operations. For Carrier, it is the primary interface for creating requests and reviews.

When requesting GCP or Kubernetes permissions, developers create a request via Clutch with the following fields:

  • Service ID: ID of the target microservice
  • Period: How long you need the permissions for
  • Reason: Why you need the permissions
  • GCP Roles: List of GCP permissions you want
  • Kubernetes Roles: List of Kubernetes permissions you want

Clutch will then create a RoleBindingRequest object for the user. A reviewer can approve or deny the request. Then Clutch will create a RoleBindingRequestReview corresponding to the review. When a request is approved, Carrier creates an IAMPolicyMember object for GCP permissions and RoleBindings for Kubernetes.
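
The flow above can be sketched roughly as follows. This is only an illustrative model in Python, not Carrier's actual code: the object kinds (RoleBindingRequest, RoleBindingRequestReview, IAMPolicyMember, RoleBinding) come from the description above, while every attribute name and the objects_to_create helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RoleBindingRequest:
    # Fields mirror the Clutch form; the attribute names are assumptions.
    requester: str
    service_id: str
    period_minutes: int
    reason: str
    gcp_roles: List[str] = field(default_factory=list)
    kubernetes_roles: List[str] = field(default_factory=list)


@dataclass
class RoleBindingRequestReview:
    # Hypothetical shape of a review: who reviewed and what they decided.
    reviewer: str
    approved: bool


def objects_to_create(
    request: RoleBindingRequest, review: RoleBindingRequestReview
) -> List[Tuple[str, str]]:
    """Return (kind, role) pairs for the objects created after a review."""
    if not review.approved:
        # A denied request grants nothing.
        return []
    created = [("IAMPolicyMember", role) for role in request.gcp_roles]
    created += [("RoleBinding", role) for role in request.kubernetes_roles]
    return created
```

In this sketch an approved request yields one IAMPolicyMember per GCP role and one RoleBinding per Kubernetes role, and a denied request yields nothing.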

In the following sections, we will take a look at more features of Carrier.

Period Extension

When requesting permissions in Clutch, developers have to set a “Period” in addition to the target service and the roles they want. Once this period has elapsed, Carrier automatically deletes the GCP permission (IAMPolicyMember) and the Kubernetes permission (RoleBindings).
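
As a minimal sketch of the expiry rule described above, the check could look like the following. The function names and the idea of comparing against a "granted at" timestamp are assumptions for illustration; the article does not describe how Carrier tracks time internally.

```python
from datetime import datetime, timedelta


def expires_at(granted_at: datetime, period: timedelta) -> datetime:
    """Moment at which the temporary permission should be removed."""
    return granted_at + period


def is_expired(granted_at: datetime, period: timedelta, now: datetime) -> bool:
    """True once the requested period has fully elapsed."""
    return now >= expires_at(granted_at, period)
```

A controller following this idea would delete the IAMPolicyMember and RoleBindings the first time it sees is_expired return True for a request.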

In practice, however, developers responding to incidents often had to make additional requests to obtain permissions for a longer period. Every request must be approved by a reviewer, so the developer’s work was suspended until the new request was approved.

To remedy this, we added a feature for extending the Period. Developers can extend the period of their request from the request extension page in Clutch.

Extension requests are applied without requiring anyone's approval, since the permission request itself has already been approved. However, only the person who created the permission request can request an extension.

Technically, an extension request is a RoleBindingRequestReview object, and Carrier handles it like any other approval or denial review. When an extension request is created, Carrier updates the status of the original permission request with the requested period.
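
The extension rules above can be illustrated with a small sketch. It is not Carrier's implementation: RequestStatus, apply_extension, and their fields are hypothetical names, and the sketch assumes that the requested period simply replaces the period stored in the request status.

```python
from dataclasses import dataclass


@dataclass
class RequestStatus:
    # Hypothetical view of the original request's status.
    requester: str
    approved: bool
    period_minutes: int


def apply_extension(
    status: RequestStatus, extended_by: str, requested_period_minutes: int
) -> RequestStatus:
    """Apply an extension without a reviewer, as described above."""
    if not status.approved:
        # Assumption: only an already approved request can be extended.
        raise ValueError("only approved requests can be extended")
    if extended_by != status.requester:
        # Only the person who created the request may extend it.
        raise PermissionError("only the original requester can extend")
    # Assumption: the requested period replaces the stored one.
    status.period_minutes = requested_period_minutes
    return status
```

Note that no reviewer appears anywhere in this function; the only checks are that the request was approved and that the extension comes from its creator.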

While it is important to promote ZTP, it must not slow down the response when an incident occurs. With this feature, developers can now obtain temporary permissions more smoothly.

Self Configuration

There are some hard-coded configurations in the Carrier code. BreakGlass used to be one example. BreakGlass is a way for developers to get permissions without reviewer approval in emergency situations. It is intended to be used, for example, when responding to an incident in the middle of the night, when no one with the authority to review the permission request is available. Since a BreakGlass request is very security-critical, Carrier sends an alert to a special Slack channel whenever such a request is created, and we constantly audit its usage.

When creating a request in Clutch, developers can make it a BreakGlass request by ticking the BreakGlass checkbox.

However, since this is a very dangerous kind of request, it is not enabled by default. Previously, when the developers of a service decided that BreakGlass was necessary, they asked us on Slack to enable it, and we added the service to Carrier's allowlist by hand each time. This was a burden for both the developers and us, and it slowed down the development process. So we introduced a self-configuration feature to improve Carrier's opt-in/opt-out configuration, replacing the hard-coded configuration with a file-based one.

Specifically, we have made it possible for developers to configure Carrier on a per-service basis with the microservice-starter-kit. This is our in-house Terraform module, which bootstraps the infrastructure required for a microservice (e.g., Kubernetes namespace, GCP project, SaaS accounts, and so on). Developers only need to add a few settings to the Terraform file, and the microservice-starter-kit creates those resources. With this, developers can now enable BreakGlass on their own as follows.

carrier = {
    enable_breakglass = true
}
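
To illustrate how such a per-service setting could drive the BreakGlass behavior described earlier, here is a rough Python sketch. Only the carrier / enable_breakglass keys come from the snippet above; the function names, the outcome strings, and the choice to reject a BreakGlass request for a service that has not opted in are assumptions, and the article does not say how Carrier reads this configuration.

```python
from typing import Tuple


def breakglass_enabled(service_config: dict) -> bool:
    """BreakGlass is opt-in, so a missing setting means disabled."""
    carrier = service_config.get("carrier", {})
    return bool(carrier.get("enable_breakglass", False))


def decide(service_config: dict, is_breakglass: bool) -> Tuple[str, bool]:
    """Return (outcome, send_alert) for a newly created request."""
    if not is_breakglass:
        # Normal requests always wait for a reviewer.
        return ("pending_review", False)
    if not breakglass_enabled(service_config):
        # Assumption: services that have not opted in cannot use BreakGlass.
        return ("rejected", False)
    # BreakGlass skips the reviewer but always raises an alert.
    return ("approved_without_review", True)
```

The default of False in breakglass_enabled reflects the opt-in design: a service that says nothing about BreakGlass does not get it.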

We, the Platform team, have to tackle numerous support requests from developers every day, in addition to our own daily development. Among the requests that came to the DX team, the request to enable BreakGlass was very common. Not only did this feature allow developers to change settings flexibly, but it also significantly reduced the amount of work needed to respond to these requests.

Currently, BreakGlass is the only opt-in option that is supported, but we plan to keep adding more configurable options. For instance, we are planning to let users set a service-specific maximum request period, specify who is mentioned in the notification for a request, and so on.

For more information about microservice-starter-kit and other components that we, the Platform team, provide to improve developer experience, please refer to Developer Experience at Mercari.

Conclusion

Across this article and the previous one, we have introduced the concept of Zero Touch Production and taken a closer look at Carrier, the temporary permission granting system that makes it possible.

However, the uses of Clutch and Carrier are not limited to the functions introduced here. By extending them, we can reduce many of the manual operations in production (e.g., restarting pods, updating HPA, and so on) that occur in daily development. By providing developers with such automated workflows, we will make sure that Zero Touch Production (ZTP) is more widely adopted, and we will make our platform more secure and robust.