Defense Against Novel Threats: Redesigning CI at Mercari

This article is part of the Developer Productivity Engineering Camp blog series, brought to you by Michael Findlater (@michaelfindlater) from the CI/CD team.

Introduction

This article discusses the effort to build Mercari’s next generation CI system and some of our engineering solutions towards this effort. It also explores supply chain security as an increasingly important area of focus for CI/CD engineers.

Background

While threats to the CI/CD pipeline are nothing new, the proliferation of attacks in recent years is a cause for concern. Demand for open source software is growing explosively: as much as 73% YoY growth in developer downloads across the top four ecosystems (Java, JavaScript, Python, .NET). This growth has made the supply chain a new hot spot for attackers.

While more traditional attack vectors focus on exploiting known vulnerabilities in the wild, recent supply chain attacks are starkly different. Attackers are no longer waiting for public disclosure of vulnerabilities; they are creating the vulnerabilities themselves. Bad actors are aggressively implanting malicious code directly into open source projects, aiming to infiltrate systems and extract sensitive data such as PII, or credentials that might enable further lateral intrusion.

The trend toward supply chain vulnerability injection is being referred to as “next generation” because of this key difference. In just the past year or so there have been a number of prominent examples: SolarWinds Orion, Codecov, Microsoft’s WinGet and Kaseya. Another widely known instance is the Hypocrite Commits incident (2020), which targeted the Linux kernel. The E.U. (ENISA) has estimated that up to 66% of all cybersecurity attacks focus on the supply chain, and they’re increasing at an alarming rate, observed to be over 400% in the past year. That’s a big concern in the CI/CD world.

These attacks are making waves globally and many private sector enterprises are bolstering their security as a result. The United Nations has released new cybersecurity guidelines for states, the culmination of a two-year effort, recommending steps to ensure the integrity of supply chains. Also in the public sector, frameworks and guidelines are being developed or improved to help guide the strengthening of software supply chains (C-SCRM & SSDF in the U.S., The Supply Chain Security Training Act in the U.K., Understanding the increase in Supply Chain Security Attacks by ENISA in the E.U.).

Threats and attack vectors are constantly evolving and, therefore, so should the requirements and mitigation measures of the systems we build. Finally, we’ve found it important to make the requirements that we set measurable, to help pragmatically steer development towards the needs it seeks to address.

Motivation

Last year, we kicked off a new project to re-design and build the CI platform used by developers at Mercari. This new system is aimed at offering enhanced protection against novel threats, such as next-gen attacks like Codecov.

In addition to concerns over new threats, there were several specific areas in which we hoped to introduce improvements.

Limit Modification of CI

We aimed to avoid the threat of bad actors modifying CI workflows on their branch and injecting custom changes. To counter this, we sought to separate CI configuration or put more restrictions on its modification.

Figure: CI job code injection attack. Example of an attack that modifies CI.

Improved Secret Management

We hoped to improve the way credentials are used in CI. For example:

  • Control of secrets based on environment
  • Require review before jobs have access to secrets
  • Automatic secret rotation

Improved Security Controls

We also planned to introduce greater security controls. Some key areas for improvement were:

  • Network restrictions
  • Better security monitoring capability
  • Better audit logging capability

Figure: Next-gen supply chain attack. Lack of egress restrictions: example of a next-gen supply chain attack that extracts data.

Vision

Overall, our goal was to provide a solution that offers better protection against known threats and is flexible enough to be adapted for future unknowns. Security needed to be a key pillar of our design. As mentioned earlier, it was important for us to define measurable requirements from the beginning. The Platform CI/CD and Security teams have been working in close collaboration throughout this project.

Although we intended not to make compromises in security, the system had to be highly usable for developers. In an effort to achieve this, the Platform Developer Experience (DX) team worked throughout the design phase to assess the developer impact of our plans.

A key aspect of our design was to deliver the system to developers via a higher-level abstraction. This way, the underlying architecture and components would not be exposed and could be changed more easily in the future. This is done with a Terraform module called microservice-starter-kit, provided by our Platform DX team.

Threat Modeling

Given the security-centric motivation for this project, we began by examining common CI/CD pipeline attack scenarios in order to develop a more comprehensive set of requirements. Again, this was a joint effort with Mercari’s Security team.

The MITRE ATT&CK framework was used as a starting point for research. From there, a threat model specific to CI/CD pipelines was developed. The model was shared in October 2021 at the Code Blue security conference in Tokyo (slides). It is available for public use on GitHub at rung/threat-matrix-cicd. Please feel welcome to try it out.

After we had assessed what the threats were, we planned mitigation strategies for each threat. We also noted any limitations our mitigation strategies had.

Figure: Design feedback loop. We followed a feedback loop to aid this process and ensure the end product had received ample critical review.

Requirements

Based on the threat modeling, we created a set of user and security/compliance requirements. Afterwards, we transformed these into actionable user stories. These requirements were reviewed again and then transitioned into our design doc.

Below is a selection of some of the specific requirements we came up with:

User

  • CI should be fast so it doesn’t block users
  • Workflows need to be customizable, reusable and independently configurable
  • Log output should be easily viewable so users can independently debug issues
  • Environment controls and protection rules are a must

Security & Compliance

  • GitHub access should be restricted by IP address
  • Limit egress from CI to prevent exfiltration of data
  • Security monitoring via IDS/IPS/EDR
  • Don’t allow untrusted tools and dependencies
  • Each job should run in a clean environment
  • CI should run in its own cluster to prevent lateral movement in the case of vulnerabilities
  • Users shouldn’t have access to the CI cluster or its components
  • Disallow CI config modification without review
  • Require additional review before allowing jobs to use secrets
  • Integration with secret management tools so users don’t need to worry about management and rotation
  • Audit logging is a must

Measuring Supply Chain Security

Given the sharp rise in supply chain attacks, there has recently been much positive effort made towards developing or improving frameworks applicable to CI/CD security.

In June of 2021, Google’s Security Blog introduced SLSA, an end-to-end framework for assessing the integrity of software artifacts in the supply chain. The framework is inspired by “Binary Authorization for Borg”, which has already been battle-tested through over eight years of production use at Google.

The great thing about SLSA is it provides a framework for improving security in a gradual way with measurable intermediary milestones.

We have referred to SLSA numerous times in our internal design and are working to ultimately provide a CI solution that is compatible with Level 4.

Selecting a Platform

Once we had all the requirements in place, we compiled a list of available platform options. A matrix was then built to compare the features of each option vs. our requirements so that we could select the best fit.

We selected GitHub Actions as the candidate for a proof of concept (PoC) because it was the closest fit to our specific user and security requirements. Ultimately, the PoC went well. In the next few sections, let’s look at our design in more detail.

GitHub Actions on GKE

GitHub provides support for self-hosted runners, which meant we could more easily satisfy requirements such as IP and egress restrictions. Self-hosting runners also means we can customize the resources we allocate to each runner, giving us additional control in meeting the needs of specific workloads.

We developed a plan to self-host runners on GKE using actions-runner-controller, which enables deploying GitHub Actions runners on Kubernetes.

What particularly appealed to us was that we could scale each runner to and from zero replicas. Runners could also be ephemeral to satisfy our security requirements, an option provided by actions-runner-controller. After a job completes, the pod is destroyed, providing a clean environment for each new invocation.
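As a rough illustration of the ephemeral option, a runner definition for actions-runner-controller might look like the sketch below. The resource name, repository and label are hypothetical placeholders rather than Mercari's actual configuration, and the field layout assumes the controller's `actions.summerwind.dev/v1alpha1` RunnerDeployment resource, so it should be checked against the controller version in use.

```yaml
# Sketch only: names and values are illustrative assumptions.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner                        # hypothetical name
spec:
  template:
    spec:
      repository: example-org/repository-name # hypothetical repository
      ephemeral: true                         # runner handles one job, then the pod is replaced
      labels:
        - high-cpu                            # label that workflows can target via runs-on
```

The replica count is left out here on the assumption that an autoscaler manages it.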

Figure: GitHub Actions on GKE.

There are additional security benefits to running on GKE, such as GKE Sandbox (gVisor), Network Policy and Workload Identity.

GKE Sandbox is a managed service that handles the internals of running gVisor. For our use case, gVisor is a crucial component because it adds an extra layer of security, isolating containers running in our CI cluster from one another.
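To make this concrete, a pod opts into GKE Sandbox by requesting the `gvisor` runtime class, and it must be scheduled onto a node pool that has the sandbox enabled. The manifest below is a minimal sketch with a hypothetical pod name and a placeholder image; it is not how the runner pods in this article are actually defined.

```yaml
# Sketch only: pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-example       # hypothetical name
spec:
  runtimeClassName: gvisor      # run this pod's containers under gVisor (GKE Sandbox)
  containers:
    - name: job
      image: ubuntu:22.04       # placeholder image
      command: ["sleep", "3600"]
```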

Kubernetes Network Policy using Dataplane V2 allows us to lock down ingress/egress for each actions runner.
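A policy of this kind could be sketched as follows. The namespace, policy name and CIDR are hypothetical (203.0.113.0/24 is a documentation range), and the DNS rule assumes cluster DNS runs in `kube-system`. Note that a standard NetworkPolicy matches on IP blocks, pods, namespaces and ports, not on hostnames.

```yaml
# Sketch only: namespace, name and CIDR are illustrative assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runner-egress           # hypothetical name
  namespace: example-repo-ci    # hypothetical namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes:
    - Egress                    # anything not listed below is denied
  egress:
    # Allow DNS lookups against cluster DNS (assumed to live in kube-system).
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow HTTPS to one approved address range.
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24
      ports:
        - protocol: TCP
          port: 443
```

Because `policyTypes` lists `Egress` and the selector is empty, every pod in the namespace is denied outbound traffic except for the two rules shown.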

With Workload Identity, we can assign unique identities to each runner. This allows access to other GCP components without service account keys.
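In practice, the link between a runner and its Google identity is an annotation on the Kubernetes service account the runner pod uses. The sketch below uses hypothetical account, namespace and project names. The Google service account must also grant this Kubernetes service account the Workload Identity User role, which is configured on the IAM side and not shown here.

```yaml
# Sketch only: all names are hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner                  # hypothetical Kubernetes service account
  namespace: example-repo-ci    # hypothetical namespace
  annotations:
    # Google service account this Kubernetes service account acts as.
    iam.gke.io/gcp-service-account: runner@example-project.iam.gserviceaccount.com
```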

Architecture

Figure: Self-hosted GitHub Actions runners on GKE.

Clusters

Each of our environments has its own dedicated private GKE cluster. Users are not assigned access to the Kubernetes cluster. CI/CD team members also do not need access, because all of the components are deployed through infrastructure-as-code via Terraform with microservice-starter-kit (examined later in this article).

Scalability

Each runner can scale to and from zero replicas. The benefit of this is that we can deploy several runner types, for example with different resource allocations (high-cpu, high-memory, etc.), with little overhead. This gives developers more flexibility for their workloads.

Runners are scaled with actions-runner-controller’s HorizontalRunnerAutoscaler. This is done in near real time, by listening to workflow_run events sent by GitHub to a webhook server.

That said, in the longer term we’re interested to see whether the Kubernetes Vertical Pod Autoscaler (VPA) might be able to help scale runners.

Private Assets

For each repository runner, we deploy a storage bucket that can be used to store build artifacts. An Artifact Registry repository is also provisioned where container images built in CI can be pushed. Each runner has exclusive access to its own set of secrets (which can be automatically rotated) via Secret Manager.

GitHub environment secrets can also be used to control how secrets are used in each environment and to require review, if necessary, before CI jobs can access them. We’re currently working on a solution that allows us to use Secret Manager and GitHub secrets/environments in harmony. More on that in a future post.
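For readers unfamiliar with GitHub environments, the sketch below shows the general shape of a job that is tied to one. The environment name, secret name, runner labels and deploy script are all hypothetical. Required reviewers are configured on the environment in the repository settings, not in the workflow file; when they are set, the job waits for approval before it starts and before the environment's secrets become available to it.

```yaml
# Sketch only: environment, secret, labels and script are hypothetical.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: [self-hosted, high-cpu]
    environment: production                       # protection rules of this environment apply
    steps:
      - uses: actions/checkout@v4
      - name: Deploy
        run: ./scripts/deploy.sh                  # hypothetical script
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }} # environment-scoped secret
```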

Egress Restrictions

Egress restrictions are defined in a shared whitelist which all runners are subject to. We may also introduce the ability to add to or subtract from the whitelist on a per-runner basis if needed. Our team has spent considerable time examining different approaches to enforcing restrictions, for instance using Istio and the istio-proxy sidecar or, alternatively, a standalone proxy. This subject in its own right warrants another blog post. Stay tuned for that too!

Microservice-starter-kit

At Mercari, new microservices are being created all the time. To make this manageable, the Platform Developer Experience team provides and maintains a Terraform module called microservice-starter-kit. This module bootstraps all the configuration needed for creating new microservices.

We have integrated our self-hosted GitHub Actions service as an option that developers can enable either when creating a microservice, or afterwards. When a user enables runners, the kit integration ensures that all of the necessary components are deployed via Terraform.

Some of these components include:

  • Service accounts
  • Workload identity configuration
  • Namespaces in our CI clusters
  • Artifact registry access for workloads
  • Runner deployments

To enable self-hosted Actions via a given service’s microservice-starter-kit configuration:

module "service-name" {
  ...
  enable_github_actions = true
  ...
}

Then workflows are modified as follows to use our self-hosted runners:

...
  runs-on: ['repository-name', 'prod', 'high-cpu']
...

The above configuration would execute on production environment runners tailored for compute-intensive tasks.

Rolling out runners with microservice-starter-kit allows us to avoid exposing unnecessary configuration to developers, such as how and where runners are deployed. This leaves us free to change much of the underlying architecture without requiring users to update their configuration.

Our service aims to provide the same experience as GitHub’s own managed Actions, while meeting our specific requirements.

Towards Improved Supply Chain Security

This new CI system aims to be a solid foundation for achieving our security goals. We already do a lot to ensure supply chain security, but there’s still a way to go before we fully reach security nirvana, if it exists.

Many current efforts already meet general SLSA requirements and there are additional systems that have been commonplace for a long time, such as measures to ensure build provenance and two-person codeowner based review.

Adding to this, our new CI design aims to fill in some of the gaps and provide a solid foundation for the future. Some examples of the enhancements it offers are:

  • An ephemeral build environment
  • Egress restrictions
  • Isolated build service

Wrapping Up

In the CI/CD world, supply chain security is a rising concern and, for the foreseeable future, it will become even more crucial. This article aims to raise awareness and hopefully give some ideas for a better future.

For example, consider conducting regular threat assessments of your CI/CD systems, since attack vectors are constantly changing. To consistently meet security requirements, treat them as a pillar of design (and revisit them regularly).

Regardless of the current state of your supply chain security, or the scale at which it operates, there are frameworks (like SLSA) available to help achieve your goals in a gradual and measurable way.

Could you improve the way your CI/CD system mitigates new or existing threats? Don’t wait until tomorrow; do it today.

We are hiring, especially if you are passionate about providing CI/CD and other platforms that assist developers and improve their productivity.

Please consider applying for these positions:
