Shifting to Zero Touch Production

Author: Dylan Lau (@aidiruu), Platform DX Team

Zero Touch Production (ZTP) is a concept where all changes made to production are done by automation, safe proxies or audited break-glass systems. There are many kinds of production outages that stem from human error, such as:

  • Configuration errors
  • Script errors
  • Running commands in the wrong environment

ZTP can mitigate the risk of outages from these errors. At Mercari, we are working on shifting to a ZTP environment. Our first step is implementing Carrier, our temporary role granting system.

In this post, we cover:

  • The importance of ZTP
  • The process of implementing ZTP and why we started with Carrier
  • The implementation of Carrier

Why is ZTP valuable?

Consider a production environment where engineers and SREs have write access so that they can perform production operations when needed. During these operations, there are many places where things can go wrong. To list a few examples:

  • Typo in a script
  • Cut and paste error
  • Forgetting --dry-run on what was meant to be a test run

This environment has room for improvement as a whole. Since engineers and SREs have write access, they are trusted to execute arbitrary scripts on their own, with no formal review or approval process. In addition, if their credentials are compromised, a malicious user could exploit that write access. ZTP aims to solve these issues, make production safer and prevent outages.

Every change in production must be either made by automation, prevalidated by Software or made via audited break-glass mechanism. – Seth Hettich, Former Production TL, Google

With automation, all production changes are made through a system rather than by hand. That system can follow the principle of least privilege, with permissions limited to only what its tasks require, and its code can be reviewed and audited.

An approval or audited break-glass system should be used for manual changes. This is a similar concept to code reviews. We require reviews for code to enter production; we should also require reviews for operations that affect production.

Implementing ZTP

Before the transition, microservices were configured in our Terraform monorepo in GitHub. All members of a team, or a subset of them, were given write permissions in production so that they could respond to incidents quickly.

Our first goal with ZTP is to ensure that we use the principle of least privilege for engineers and SREs. In ZTP, read-only permissions should be sufficient, so all engineers and SREs should have read-only permissions in production. However, all production tasks should still be possible after permissions are reduced. With this in mind, let’s look at our end goal of ZTP:

The end goal of ZTP is for users to have view permissions by default, and to require automation or an approval system to get edit permissions.

Covering every potential production operation with automation in the initial implementation is very difficult. Plus, even after automating operations, there will still be times when manual operation is required. Thus, an approval system for manual operations is the first priority. Engineers can use the system to get elevated permissions to perform required tasks, with approval. Our Terraform monorepo can be used for this purpose: approvals are required to merge, and write permissions can be granted through it. However, this approach has a few issues:

  1. Slow. Engineers need to create a PR and merge it, and CI needs to run both on the PR and again after it is merged.
  2. Easy to forget. After granting write permissions, an engineer may forget to revert their PR.
  3. Single point of failure. If GitHub or our CI is down, we cannot get write permissions.

The GitHub process can serve as the approval system on paper, but these issues make it less than ideal. Therefore, we created an approval system that is fast, revokes permissions automatically and is independent of GitHub. At Mercari, this temporary role granting system is named "Carrier".

Carrier

Carrier, named after an aircraft carrier, is our system for quickly getting temporary elevated permissions. The system itself consists of four main components:

  1. Carrier. Carrier itself is a custom Kubernetes controller that handles the logic of a permission "request" and "review".
  2. Clutch. Clutch is an open source frontend platform from Lyft. Its main purpose is to do infrastructure operations. For Carrier, it is the primary interface for creating requests and reviews.
  3. Config Connector. Config Connector is a GKE add-on from Google which creates GCP resources declaratively in our Kubernetes cluster. Config Connector has different Custom Resource Definitions (CRDs) for every supported GCP resource. In this case, only IAMPolicyMember is required.
  4. Service Catalog. We have an internal service catalog that collects and normalizes data about microservices. It exposes the information in a GraphQL API and is used to determine owners and contact links for a certain microservice.
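
To make the Config Connector piece more concrete, here is a minimal sketch of what an IAMPolicyMember manifest could look like when built in Python. The field layout follows Config Connector's public IAMPolicyMember resource, but the function name, object name, namespace, project ID, member and role are all placeholder assumptions; the post does not describe how Carrier actually constructs these objects.

```python
import json


def build_iam_policy_member(name, namespace, project_id, member, role):
    """Return a Config Connector IAMPolicyMember manifest as a plain dict.

    All argument values are supplied by the caller; nothing here reflects
    Carrier's real naming scheme, which is not described in the post.
    """
    return {
        "apiVersion": "iam.cnrm.cloud.google.com/v1beta1",
        "kind": "IAMPolicyMember",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # IAM member string, e.g. "user:<email>"
            "member": member,
            # IAM role to grant on the target project
            "role": role,
            # Points at an existing GCP project by its external reference
            "resourceRef": {
                "kind": "Project",
                "external": f"projects/{project_id}",
            },
        },
    }


# Hypothetical values, for illustration only.
manifest = build_iam_policy_member(
    name="alice-temporary-editor",
    namespace="example-service-prod",
    project_id="example-service-prod",
    member="user:alice@example.com",
    role="roles/editor",
)
print(json.dumps(manifest, indent=2))
```

Because the grant is just a Kubernetes object, deleting the object is enough for Config Connector to remove the IAM binding, which is what makes automatic revocation straightforward.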

Request Lifecycle

These components work together to function as our temporary role grant system. Let’s go through the lifecycle of a request. First, a user will use Clutch to create a request.

Clutch will then create a RoleBindingRequest object for the user. Carrier will then check the service catalog to confirm that the service exists. Carrier also retrieves the service's Slack alert channel from the catalog and sends a notification to it.
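
The validation and notification step above can be sketched roughly as follows. This is an illustrative model only: the RoleBindingRequest fields, the in-memory catalog standing in for the GraphQL service catalog, the status strings and the notify callback are all assumptions made for this example, not Carrier's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceInfo:
    """Hypothetical subset of what the service catalog knows about a service."""
    owners: tuple
    slack_channel: str


@dataclass(frozen=True)
class RoleBindingRequest:
    """Hypothetical request fields; the real CRD schema is not shown in the post."""
    requester: str
    service: str
    role: str
    reason: str


def handle_new_request(request, catalog, notify):
    """Validate a new request against the catalog and notify the service's channel.

    Returns "Rejected" if the service is unknown, otherwise "Pending".
    """
    info = catalog.get(request.service)
    if info is None:
        return "Rejected"
    notify(
        info.slack_channel,
        f"{request.requester} requests {request.role} on "
        f"{request.service}: {request.reason}",
    )
    return "Pending"


# In-memory stand-in for the service catalog (assumed data).
catalog = {
    "example-service": ServiceInfo(
        owners=("alice", "bob"), slack_channel="#example-service-alerts"
    )
}

sent = []
status = handle_new_request(
    RoleBindingRequest("alice", "example-service", "roles/editor", "incident response"),
    catalog,
    notify=lambda channel, text: sent.append((channel, text)),
)
print(status, sent)
```

Passing the notifier in as a callback keeps the sketch free of any real Slack client and makes the flow easy to exercise in isolation.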

A reviewer can then log into Clutch and approve or deny the request. Clutch will create a RoleBindingRequestReview corresponding to the review. When a request is approved, Carrier creates an IAMPolicyMember object for GCP permissions and RoleBindings for Kubernetes. Kubernetes permissions are handled natively, but Config Connector handles GCP permissions. Config Connector will detect the new IAMPolicyMember object and create the IAM binding in the target GCP project. When a request expires or is rejected, the IAMPolicyMember and RoleBinding objects are deleted if they exist.
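
The approve, reject and expire behaviour can be modelled as a small reconcile loop, in the style of a Kubernetes controller. The sketch below is a simplification under stated assumptions: the phase names, the Request fields and the use of a plain set to represent existing IAMPolicyMember and RoleBinding objects are invented for illustration and do not reflect Carrier's real implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Phase(Enum):
    PENDING = "Pending"
    APPROVED = "Approved"
    REJECTED = "Rejected"
    EXPIRED = "Expired"


@dataclass
class Request:
    requester: str
    ttl: timedelta  # how long an approved grant stays valid
    phase: Phase = Phase.PENDING
    approved_at: Optional[datetime] = None


def review(request, approve, now):
    """Record a reviewer's decision on a pending request."""
    if request.phase is not Phase.PENDING:
        raise ValueError("only pending requests can be reviewed")
    if approve:
        request.phase = Phase.APPROVED
        request.approved_at = now
    else:
        request.phase = Phase.REJECTED


def reconcile(request, bindings, now):
    """Make `bindings` match the request: grant while approved, revoke otherwise.

    `bindings` is a set of requesters who currently hold a grant; it stands in
    for the IAMPolicyMember and RoleBinding objects.
    """
    if request.phase is Phase.APPROVED and now >= request.approved_at + request.ttl:
        request.phase = Phase.EXPIRED
    if request.phase is Phase.APPROVED:
        bindings.add(request.requester)
    else:
        # Safe even when no binding exists ("deleted if they exist").
        bindings.discard(request.requester)


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
req = Request(requester="alice", ttl=timedelta(minutes=30))
active = set()

review(req, approve=True, now=start)
reconcile(req, active, now=start + timedelta(minutes=10))
print(req.phase, active)  # grant is present while the request is approved

reconcile(req, active, now=start + timedelta(minutes=31))
print(req.phase, active)  # grant is removed once the TTL has passed
```

Expressing revocation as "the binding should not exist unless the request is approved and unexpired" means the same code path handles rejection, expiry and repeated reconciles.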

Migration

Services were migrated to Carrier gradually, starting with some platform components. The default write permissions were replaced with read permissions, and we used Carrier to get write permissions when needed. After using Carrier for platform components for some time, we began migrating all other microservices. All microservices have now been migrated to Carrier, and we have updated Carrier based on user feedback. Some new features will be described in a later blog post.

Conclusion

Zero Touch Production (ZTP) is a concept where automation, safe proxies or audited break-glass systems perform all production operations. Successful implementation can make the production environment safer and prevent outages. The key components are automation systems and a system for manual operations that requires approvals and can be audited.

We implemented a system for manual operations named Carrier as our first step to ZTP. It is used to obtain elevated permissions within a microservice for a short period. It is by no means complete, but it is an important first step towards full ZTP. Our next steps are to start building our automation system and to keep improving Carrier based on feedback. A later blog post will cover the improvements made to Carrier in response to user feedback.