Automation of Terraform for AWS

*This article is a translation of the Japanese article published on January 24th, 2022.

This article is part of the DPE Camp blog series.

Hello. This is Kenichi Sasaki (@siroken3) from the Platform Infrastructure Team. At Mercari, I’m mostly responsible for AWS management work. In this article, I discuss building a secure CI/CD environment for AWS configuration management repositories.

Background

The role of AWS in Mercari

Mercari has a long history with AWS, with S3 being used to store product images from the very start of the service. We also use S3 as a backup location for our MySQL databases, and as the backend for AWS Transfer Family for linking data with partner companies. When the Mercari US service was launched in 2014, the main infrastructure was on AWS.

More recently, we’ve introduced Amazon Connect as a tool for managing the operations of our customer phone support representatives, and Amazon AppStream 2.0 as a Virtual Desktop Infrastructure (VDI).

The importance of data varies for these services, so we typically create a separate AWS account for each company and project in order to keep access paths separate. We also use separate AWS accounts for each stage of the software life cycle, such as production and development environments. Finally, individual developers can also request their own AWS accounts to do PoC work. All of these accounts are managed using AWS Organizations.

Issues

Until now, we’ve used the Management Console to manually create AWS accounts whenever one was requested for a project or company at Mercari. After an account was created, we managed its AWS resources as IaC using HashiCorp Terraform.

Unlike our microservices, AWS accounts are not created at a fine granularity, so the number of new accounts hadn’t grown much and the manual process hadn’t affected operations.

However, as our business has expanded, we’ve encountered some issues with manual AWS account creation serving as a bottleneck when multiple requests are received around the same time. A couple of the issues we have encountered are:

  • Increased lead time up until service launch
  • Increased cost of maintaining and managing pipelines for gathering security audit logs

There’s also the issue of manual tasks like this becoming dependent on specific individuals. We’ve also had to increase the number of members with high-level access (or who could have high-level access) so that they can continue to manually create accounts without issue. However, more members with high-level access means more security concerns as people transfer to different teams or leave the company, and more cost involved in handling those changes.

One way to resolve this would be to automate account management. Automation has matured since the release of AWS Organizations, with solutions such as AWS Landing Zone and AWS Control Tower provided by AWS itself. However, for various reasons, these solutions couldn’t be used with existing AWS Organizations accounts when they were first released, and we missed our chance to incorporate them.

Meanwhile, at Mercari we had already built a CI/CD platform on GCP (our microservice infrastructure), using Terraform as IaC for infrastructure work performed by engineers in each Camp, and Cloud Build on GCP for CI/CD.

Our AWS IaC is also partially built using Terraform, so we passed on directly adopting an AWS solution that requires AWS CloudFormation (AWS's own IaC service), as we thought it would be difficult to get people to use it internally.

Secure CI for AWS in Mercari

With these issues in mind, we built a secure CI/CD environment ("Secure CI" below) with the following objectives:

  • Automate AWS account creation
    • Triggered by pull requests against Terraform code
  • Automate baseline configuration during account creation
    • Set up the security audit log pipeline
    • Enable AWS Single Sign-On (SSO) + authentication and authorization using internal IDP
  • Reduce the risk of elevated permissions being hijacked, and localize scope of impact
    • Eliminate risk of hijacks by attackers, by setting time limits on elevated permission access keys
    • Reduce scope of impact during security incidents, by restricting permissions as a multilayer defense

Secure CI structure

Overview

Let’s start by taking a look at the overall structure of Secure CI for AWS at Mercari (Figure 1). The upper half of this diagram shows that the same structure from Secure CI for GCP is reused here. YAML code for build pipelines implemented on Secure CI for GCP is isolated from the Terraform code repositories, and YAML code is obtained and built on-demand. This is the same mechanism implemented for Secure CI for AWS.

A Cloud Build project and an admin account are created for each environment (production, development, and laboratory). Admin accounts are dedicated AWS accounts used for provisioning. Each one holds an S3 bucket (to store the Terraform state files for that environment) and a cross-account-tfm-apply IAM role (for running Terraform). The Cloud Build service account is trusted by cross-account-tfm-apply through an OpenID Connect (OIDC) IdP.

Having this trust allows the Cloud Build build pipeline to call the AWS Security Token Service (STS) AssumeRoleWithWebIdentity action and obtain temporary credentials. With these, it can use the IAM role in the AWS management account to provision AWS resources, without us having to store a permanent access key in the pipeline.

Therefore, even though we use Cloud Build for GCP, the Terraform repositories defining AWS resources depend only on the AWS provider, so we can also use them for build pipelines outside of Cloud Build.
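
The OIDC trust described above can be sketched in Terraform roughly as follows. This is a minimal illustration and not our actual configuration: the variable name, the session duration, and the use of the accounts.google.com:sub condition key (assumed here to carry the unique ID of the Cloud Build service account) are all assumptions.

```hcl
# Hypothetical sketch: an IAM role in an admin account that a GCP service
# account can assume with sts:AssumeRoleWithWebIdentity.
variable "cloud_build_service_account_unique_id" {
  description = "Unique ID of the Cloud Build service account (assumed input)"
  type        = string
}

data "aws_iam_policy_document" "trust_cloud_build" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    # Google is referenced as a web identity provider.
    principals {
      type        = "Federated"
      identifiers = ["accounts.google.com"]
    }

    # Only tokens issued for this one service account are accepted.
    condition {
      test     = "StringEquals"
      variable = "accounts.google.com:sub"
      values   = [var.cloud_build_service_account_unique_id]
    }
  }
}

resource "aws_iam_role" "cross_account_tfm_apply" {
  name               = "cross-account-tfm-apply"
  assume_role_policy = data.aws_iam_policy_document.trust_cloud_build.json

  # Temporary credentials expire after at most one hour (assumed value).
  max_session_duration = 3600
}
```

Because the trust policy names a single service account, a token issued to any other GCP identity would be rejected by STS.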

Figure 1. Overview

Our solution is composed of the following AWS components.

  • Management account
  • SecurityAudit account
  • Organization Unit (OU)
    • (Although only one diagram is shown above, we have a separate configuration for each of the production, development, and laboratory environments)
    • Admin account
    • New account
  • AWS SSO
  • CloudFormation StackSets

Next, I’ll explain the role of these components, as well as how we can accomplish our goals.

AWS account creation automation

In order to create a new account within AWS Organizations, a CreateAccount API request must be issued by an IAM identity (IAM role, etc.) in the organization's single management account. In Secure CI for AWS, an IAM role called cross-account-tfm-apply serves as this identity.

Note that cross-account-tfm-apply maintains only the minimum permissions required to create accounts (such as the CreateAccount API), and does not have the STS AssumeRole permission. I’ll discuss why later on.
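
A minimal sketch of what such a restricted role and an account definition might look like is shown below. The action list, policy name, account name, email address, and variable are illustrative assumptions, and the exact set of Organizations actions that Terraform needs may differ.

```hcl
# Hypothetical sketch: permissions for cross-account-tfm-apply in the
# management account. Note that sts:AssumeRole is deliberately absent.
variable "target_ou_id" {
  description = "ID of the OU that receives the new account (assumed input)"
  type        = string
}

data "aws_iam_policy_document" "create_account_only" {
  statement {
    effect = "Allow"
    actions = [
      "organizations:CreateAccount",
      "organizations:DescribeCreateAccountStatus",
      "organizations:DescribeAccount",
      "organizations:ListParents",
      "organizations:MoveAccount",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "create_account_only" {
  name   = "create-account-only"
  role   = "cross-account-tfm-apply"
  policy = data.aws_iam_policy_document.create_account_only.json
}

# A new account is then requested by adding a resource like this one
# in a pull request.
resource "aws_organizations_account" "example" {
  name      = "example-project-dev"
  email     = "aws-example-project-dev@example.com"
  parent_id = var.target_ou_id
}
```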

Once a new account is created, it must have an IAM role that Terraform can use to provision resources.

In AWS, if an account is created within Organizations, an administrative IAM role (with a default name of OrganizationAccountAccessRole) is automatically created. We also considered using this. However, this IAM role trusts only the management account. In order to trust the Cloud Build for GCP service account, we would need to establish trust through the management account ahead of time. The management account is an important account and we would use this Secure CI for purposes other than account creation, so we wanted to minimize the use of it.

We ultimately decided to use CloudFormation StackSets (a means provided by AWS to customize account creation) to create the IAM role (terraform-apply) that would normally be used by CI/CD, instead of using OrganizationAccountAccessRole.

CloudFormation StackSets allows you to specify an organization unit (OU) as a deployment target. We decided to use this feature so that the stack is deployed whenever an account is created in the OU. As the name suggests, CloudFormation StackSets is a CloudFormation feature (not Terraform). However, it is a component within the build pipeline that is not visible to internal users, so we judged that it wouldn’t have any negative impact on internal use.

Doing this allowed us to configure our system so that the terraform-apply IAM role (granted to new accounts) would automatically trust the admin account representing the OU, whenever a new account was created within an OU. Note that the admin account already trusts the corresponding GCP project as an OIDC IdP (Figure 2).

Figure 2. Automatic creation of IAM roles that trust the admin account representing the OU
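
The mechanism in Figure 2 could be expressed along the following lines. This is a sketch under several assumptions: that the StackSet itself is managed with Terraform, that the role trusts the admin account as a whole, and that it carries the AdministratorAccess managed policy. The variable names are hypothetical.

```hcl
# Hypothetical sketch: a service-managed StackSet that creates the
# terraform-apply role in every account added to the target OU.
variable "admin_account_id" {
  type = string
}

variable "target_ou_id" {
  type = string
}

resource "aws_cloudformation_stack_set" "terraform_apply_role" {
  name             = "terraform-apply-role"
  permission_model = "SERVICE_MANAGED"
  capabilities     = ["CAPABILITY_NAMED_IAM"]

  # Deploy automatically when an account joins the OU.
  auto_deployment {
    enabled                          = true
    retain_stacks_on_account_removal = false
  }

  template_body = jsonencode({
    AWSTemplateFormatVersion = "2010-09-09"
    Resources = {
      TerraformApplyRole = {
        Type = "AWS::IAM::Role"
        Properties = {
          RoleName = "terraform-apply"
          AssumeRolePolicyDocument = {
            Version = "2012-10-17"
            Statement = [{
              Effect    = "Allow"
              Action    = "sts:AssumeRole"
              Principal = { AWS = "arn:aws:iam::${var.admin_account_id}:root" }
            }]
          }
          # Assumed permission level for provisioning.
          ManagedPolicyArns = ["arn:aws:iam::aws:policy/AdministratorAccess"]
        }
      }
    }
  })

  # With service-managed permissions this attribute is set by AWS.
  lifecycle {
    ignore_changes = [administration_role_arn]
  }
}

resource "aws_cloudformation_stack_set_instance" "ou" {
  stack_set_name = aws_cloudformation_stack_set.terraform_apply_role.name

  deployment_targets {
    organizational_unit_ids = [var.target_ou_id]
  }
}
```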

This meant we could localize the scope of management AWS API requests involved in running GCP build pipelines within the individual organization unit, without having to directly expose the management account externally. I’ll discuss the significance of trust linking and localization for the management account later.

Security audit log pipeline construction automation

At Mercari, we’ve built a system for gathering security audit logs for each AWS account. This is provided as a baseline immediately after an AWS account is created. We mainly use the following AWS managed services as components for gathering audit logs.

These were originally created for internal use as a single Terraform module. These cannot be used unless a new account is created, so the Cloud Build pipeline waits for an account to be created and for CloudFormation StackSets to create the terraform-apply IAM role, and then calls this Terraform module.
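
As a rough illustration of this last step, the module call might look like the sketch below. The module path and variable name are hypothetical, and the waiting itself is assumed to happen in the build pipeline rather than in Terraform.

```hcl
# Hypothetical sketch: apply the audit log baseline inside the new account
# once the terraform-apply role exists there.
variable "new_account_id" {
  type = string
}

provider "aws" {
  alias = "new_account"

  assume_role {
    role_arn = "arn:aws:iam::${var.new_account_id}:role/terraform-apply"
  }
}

module "security_audit_baseline" {
  # Placeholder path, not the real module location.
  source = "./modules/security-audit-baseline"

  providers = {
    aws = aws.new_account
  }
}
```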

AWS SSO + Okta

We’ve been using federation access with Okta as our SAML identity provider for Mercari AWS Management Console authentication, and have been using saml2aws (with Okta as IdP) to gain access using temporary credentials for aws-cli authentication.

User allocation needed to be handled by Okta, and we were using the Okta management console to manually create AWS accounts. We decided to automate this, too.

We initially considered using the Terraform Okta provider. However, we would have had to employ some means of passing Okta API credentials to the provider, and this would have violated our Secure CI implementation policy of minimizing the use of permanent credentials.

Integration between Okta and AWS SSO had improved, and the System for Cross-domain Identity Management (SCIM) v2.0 protocol could now be used to have AWS SSO automatically import and synchronize users from Okta (the IdP) (reference). We decided to use this solution.

Figure 3 provides an overview of this mechanism. The User on the right side of the diagram is automatically synchronized as an IdentityStoreUser by the SCIM v2.0 protocol. The orange squares on the left represent Terraform resources, etc. PermissionSet and Account represent permissions and AWS accounts, respectively. AccountAssignment controls which PermissionSet an IdentityStoreUser is granted for an Account.

Figure 3. Link between Okta and AWS SSO

AccountAssignment is an AWS resource. This means we can describe it with the same IaC we use for AWS, which I mentioned earlier. It also means that we don’t need Okta API credentials in Secure CI.
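
A minimal sketch of such an assignment using the Terraform AWS provider follows. The user name, permission set name, attached policy, session duration, and variable are illustrative assumptions. The user is only looked up here, since it is assumed to have already been synchronized from Okta via SCIM.

```hcl
# Hypothetical sketch: grant a synchronized user a permission set on one account.
variable "account_id" {
  description = "ID of the target AWS account (assumed input)"
  type        = string
}

data "aws_ssoadmin_instances" "this" {}

locals {
  sso_instance_arn  = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  identity_store_id = tolist(data.aws_ssoadmin_instances.this.identity_store_ids)[0]
}

# IdentityStoreUser: created by SCIM synchronization, not by Terraform.
data "aws_identitystore_user" "developer" {
  identity_store_id = local.identity_store_id

  alternate_identifier {
    unique_attribute {
      attribute_path  = "UserName"
      attribute_value = "developer@example.com"
    }
  }
}

# PermissionSet
resource "aws_ssoadmin_permission_set" "read_only" {
  name             = "ReadOnly"
  instance_arn     = local.sso_instance_arn
  session_duration = "PT1H"
}

resource "aws_ssoadmin_managed_policy_attachment" "read_only" {
  instance_arn       = local.sso_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.read_only.arn
  managed_policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# AccountAssignment
resource "aws_ssoadmin_account_assignment" "developer_read_only" {
  instance_arn       = local.sso_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.read_only.arn
  principal_id       = data.aws_identitystore_user.developer.user_id
  principal_type     = "USER"
  target_id          = var.account_id
  target_type        = "AWS_ACCOUNT"
}
```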

Restrict permissions as a multilayer defense, and reduce scope of impact during security incidents

My explanation thus far assumes that Terraform provisions resources within new accounts by performing an AssumeRole action on the terraform-apply IAM role, which is specified in the Terraform provider. The Terraform code is shown below.

provider "aws" {
  assume_role {
    # ARN of the terraform-apply IAM role in the new account
    role_arn = "(NewAccount's terraform-apply ARN)"
    # (...)
  }
}

We considered the risk of high-level access being hijacked, and then investigated which IAM roles should be able to run Terraform. I’ll discuss this in this section.

As I previously explained in the "AWS account creation automation" section, IAM roles with administrative permissions are automatically created for all accounts created within Organizations. By default, these IAM roles trust IAM identities (IAM users, IAM roles) held by the management account. We therefore need to minimize the permissions held by these IAM identities.

Imagine that an IAM identity has STS AssumeRole permissions, and that a malicious intruder hijacks this IAM identity. The intruder could then use the AssumeRole action to assume an IAM role with administrative permissions for any account within Organizations (Figure 4).


Figure 4. Situation where the management account contains a vulnerable IAM identity

We therefore need to restrict the management account's IAM role (cross-account-tfm-apply) to only those permissions required for creating accounts. We decided not to grant it the AssumeRole permission, which means it cannot run the Terraform provider code shown at the beginning of this section.

In contrast with the GCP projects supporting Cloud Build, which are created separately for each environment, there is only a single management account within Organizations that is shared among all environments. Therefore, if this management account directly trusts the GCP projects for each environment as an OIDC IdP, the risk of a security incident occurring if the management account is hijacked (either unintentionally or for nefarious purposes) would increase as more environments are added.

We decided instead to prepare separate admin accounts (representing organization units) for each environment. We could then have the cross-account-tfm-apply IAM role of the management account indirectly trust Cloud Build through these admin accounts (Figure 5).

Figure 5. Admin accounts in each environment
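
One way to express the indirect trust in Figure 5 is a trust policy on the management account role that names only the admin account roles. The sketch below assumes this mechanism, which the article does not spell out. The variable and resource names are hypothetical.

```hcl
# Hypothetical sketch: the management account role can be assumed only by
# the cross-account-tfm-apply role of each environment's admin account.
variable "admin_account_ids" {
  description = "Map of environment name to admin account ID (assumed input)"
  type        = map(string)
}

data "aws_iam_policy_document" "trust_admin_accounts" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type = "AWS"
      identifiers = [
        for id in values(var.admin_account_ids) :
        "arn:aws:iam::${id}:role/cross-account-tfm-apply"
      ]
    }
  }
}

resource "aws_iam_role" "management_cross_account_tfm_apply" {
  name               = "cross-account-tfm-apply"
  assume_role_policy = data.aws_iam_policy_document.trust_admin_accounts.json
}
```

With this arrangement the management account never lists a GCP identity directly, so adding an environment means adding one admin account entry rather than another external trust.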

The cross-account-tfm-apply IAM role of each AWS admin account trusts only the service account of the GCP project that corresponds to the admin account's OU. For example, cross-account-tfm-apply in the production admin account trusts only the production GCP CI project.

If the CI laboratory environment is hijacked, we should be able to localize the impact to just the laboratory environment.

Going Forward

AWS provides an account management automation solution called AWS Control Tower. This solution continues to improve, and it is now possible to use it with existing Organizations. It can also now be used even in the Tokyo region. As for the issue with IaC code, AWS released a Terraform module compatible with Control Tower, making it possible to link the process from creating accounts to constructing security audit pipelines with Terraform.

We can now realistically use AWS Control Tower to replace Secure CI for AWS, with no change to workflows as far as users are concerned. Doing this all with AWS managed services should reduce the amount of operations work, and make maintenance easier.

Also, Secure CI for AWS relies on GCP and cannot support use cases where we would like to operate solely on AWS Organizations, so we plan to implement this independently on AWS only.

Summary

In this article, I covered the following topics.

  • AWS use cases in Mercari
  • Requirements and issues with regard to our AWS account management process until now
  • Overview of constructing a CI/CD environment using a hybrid GCP and AWS structure
  • Trends in AWS Control Tower (AWS managed service for automation), and our goal to use this service to improve Secure CI
