A smooth CDN provider migration and future initiatives

Introduction

Hello! I’m hatappi from the Microservices Platform Network team.

Since 2023, Mercari has been gradually migrating our content delivery network (CDN) provider from Fastly to Cloudflare. We have completed the traffic migration for almost all existing services, and all new services are now using Cloudflare.

This article focuses on the migration process itself rather than on comparing CDN providers, and explains the approach we took to keep the migration smooth. I will also introduce our internal "CDN as a Service" model, which is the ultimate goal of our CDN efforts.

Background

At Mercari, our network team has managed hundreds of Fastly services across both development and production environments. The team also maintains cloud networking, such as GCP Virtual Private Cloud (VPC), and data center networking. Alongside that work, we needed a way to complete the migration smoothly within the time available.

Migration Steps

Preparation

Though both Fastly and Cloudflare are CDN providers, they do not behave in exactly the same way. For example, Fastly stores separate cache entries according to the origin’s Vary header, whereas Cloudflare currently supports this only for images. We needed to investigate which features were being used in Fastly and how to implement them in Cloudflare.
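To make the Vary difference concrete, here is a minimal Go sketch of how a cache could derive its key from the request headers named in the origin's Vary header. It is a generic illustration of HTTP caching behavior, not Fastly's or Cloudflare's implementation, and `varyCacheKey` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// varyCacheKey is a hypothetical helper: it builds a cache key from the URL
// plus the values of the request headers named in the origin's Vary header.
// reqHeaders is keyed by lower-case header name. "Vary: *" is not handled.
func varyCacheKey(url string, reqHeaders map[string]string, vary string) string {
	var names []string
	for _, n := range strings.Split(vary, ",") {
		n = strings.ToLower(strings.TrimSpace(n))
		if n != "" {
			names = append(names, n)
		}
	}
	sort.Strings(names) // the key must not depend on the order of names in Vary

	parts := []string{url}
	for _, n := range names {
		parts = append(parts, n+"="+reqHeaders[n])
	}
	return strings.Join(parts, "|")
}

func main() {
	gzip := map[string]string{"accept-encoding": "gzip"}
	br := map[string]string{"accept-encoding": "br"}

	// With "Vary: Accept-Encoding", the two requests get separate entries.
	fmt.Println(varyCacheKey("/item/1", gzip, "Accept-Encoding"))
	fmt.Println(varyCacheKey("/item/1", br, "Accept-Encoding"))

	// If Vary is not taken into account, both requests share one entry.
	fmt.Println(varyCacheKey("/item/1", gzip, "") == varyCacheKey("/item/1", br, ""))
}
```

A service that relies on the first behavior needs an explicit alternative on a CDN that behaves like the second, which is why this kind of difference had to be found before migrating.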

When deciding how to migrate each feature, we focused on not significantly altering the current behavior. A migration is a tempting occasion to add improvements or try new features. That may be manageable for a few services, but applying it to hundreds of services would make the migration endless. Keeping the migration scope narrow was therefore crucial for a smooth migration, and this philosophy helped in the subsequent steps as well.

Implementation

We use the official Terraform provider to manage Cloudflare. Instead of defining Terraform resources individually for each service, we created a Terraform module that bundles the necessary functionality so that it can be reused in subsequent service migrations.

In Fastly, the logic we implemented and Fastly’s own logic are compiled into a single VCL (Varnish Configuration Language) file. Initially, we manually checked each VCL file and implemented the equivalent settings in Cloudflare’s Terraform resources, which took more than 30 minutes per implementation.

However, as we migrated more services, we found that the VCL logic fell into certain classes, such as logic that had to be migrated and logic that could be ignored. In the later stages, we therefore developed migration scripts in Go that generate the Terraform module settings from the VCL. Any logic that could not be configured automatically was reported in the script output. This allowed us to complete implementations for simple services in just a few minutes.
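The classification idea can be sketched in Go as follows. This is a simplified illustration, not the actual migration script: the prefix rules, the category names, and the `classify` and `triage` functions are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

type category int

const (
	ignorable category = iota // boilerplate with nothing to migrate
	necessary                 // maps to a Terraform module setting
	unhandled                 // reported for manual migration
)

// rule assigns a category to VCL lines that start with prefix.
type rule struct {
	prefix string
	cat    category
}

// Illustrative patterns only; a real rule set would come from reviewing VCLs.
var rules = []rule{
	{"#FASTLY", ignorable},
	{"sub ", ignorable},
	{"}", ignorable},
	{"set beresp.ttl", necessary},
	{"set req.backend", necessary},
}

// classify returns the category of a single VCL line.
func classify(line string) category {
	line = strings.TrimSpace(line)
	if line == "" {
		return ignorable
	}
	for _, r := range rules {
		if strings.HasPrefix(line, r.prefix) {
			return r.cat
		}
	}
	return unhandled
}

// triage splits a VCL into lines to migrate automatically and lines to report.
func triage(vcl string) (migrate, manual []string) {
	for _, line := range strings.Split(vcl, "\n") {
		line = strings.TrimSpace(line)
		switch classify(line) {
		case necessary:
			migrate = append(migrate, line)
		case unhandled:
			manual = append(manual, line)
		}
	}
	return migrate, manual
}

func main() {
	vcl := `sub vcl_fetch {
#FASTLY fetch
  set beresp.ttl = 3600s;
  set beresp.http.X-Example = "1";
}`
	migrate, manual := triage(vcl)
	fmt.Println("migrate automatically:", migrate)
	fmt.Println("needs manual review:", manual)
}
```

Treating anything unrecognized as `unhandled` rather than ignoring it keeps the automation safe: unknown logic is surfaced to a person instead of being silently dropped.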

Testing

Most services have both development and production environments, so we tested in the development environment before migrating production. For services with high traffic or mission-critical features, we wrote code to test behavior beforehand. Because we did not drastically change behavior from Fastly, we could write tests that compared the new behavior against the existing Fastly service, which let us begin traffic migration with confidence.
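A comparison test of this kind might look like the following Go sketch. It assumes that status code, a few selected headers, and the body are what must match. The header list, the `snapshot` type, and the function names are hypothetical, and local test servers stand in for the two CDN endpoints so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// snapshot holds the parts of a response expected to match on both CDNs.
type snapshot struct {
	Status  int
	Headers map[string]string
	Body    string
}

// comparedHeaders is an assumed list; pick the headers your clients rely on.
var comparedHeaders = []string{"Content-Type", "Cache-Control"}

// fetch requests baseURL+path and records the fields to compare.
func fetch(baseURL, path string) (snapshot, error) {
	resp, err := http.Get(baseURL + path)
	if err != nil {
		return snapshot{}, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return snapshot{}, err
	}
	s := snapshot{Status: resp.StatusCode, Headers: map[string]string{}, Body: string(body)}
	for _, h := range comparedHeaders {
		s.Headers[h] = resp.Header.Get(h)
	}
	return s, nil
}

// diff lists the differences between the reference and the candidate.
func diff(want, got snapshot) []string {
	var out []string
	if want.Status != got.Status {
		out = append(out, fmt.Sprintf("status: %d != %d", want.Status, got.Status))
	}
	for _, h := range comparedHeaders {
		if want.Headers[h] != got.Headers[h] {
			out = append(out, fmt.Sprintf("%s: %q != %q", h, want.Headers[h], got.Headers[h]))
		}
	}
	if want.Body != got.Body {
		out = append(out, "body differs")
	}
	return out
}

func main() {
	// Local servers stand in for the Fastly and Cloudflare endpoints.
	handler := func(cacheControl string) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Header().Set("Cache-Control", cacheControl)
			fmt.Fprint(w, "ok")
		})
	}
	fastly := httptest.NewServer(handler("max-age=60"))
	defer fastly.Close()
	cloudflare := httptest.NewServer(handler("max-age=30"))
	defer cloudflare.Close()

	want, err := fetch(fastly.URL, "/health")
	if err != nil {
		panic(err)
	}
	got, err := fetch(cloudflare.URL, "/health")
	if err != nil {
		panic(err)
	}
	for _, d := range diff(want, got) {
		fmt.Println(d)
	}
}
```

Using the existing Fastly response as the reference means the test needs no hand-written expectations, which only works because the migration deliberately preserved behavior.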

Traffic Migration

No matter how many tests are conducted, the actual traffic migration requires caution. In particular, we need to be able to roll back smoothly if issues arise.

We adopted an approach to meet these requirements at the domain name system (DNS) layer. Mercari uses Amazon Route 53 and Google Cloud DNS, both of which support weighted routing. This allows us to gradually migrate traffic from Fastly to Cloudflare. In case of issues, setting Cloudflare’s weight to 0% enables a simple rollback.
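The arithmetic behind a weighted rollout is small enough to sketch in Go. The example assumes the usual weighted-routing model in which a record's expected share of DNS answers is its weight divided by the sum of all weights. The step percentages and function names are hypothetical, and no DNS API is called.

```go
package main

import "fmt"

// weights returns record weights for a desired Cloudflare percentage.
// Passing 0 sends everything back to Fastly, which is the rollback case.
func weights(cloudflarePercent int) (fastly, cloudflare int, err error) {
	if cloudflarePercent < 0 || cloudflarePercent > 100 {
		return 0, 0, fmt.Errorf("percent out of range: %d", cloudflarePercent)
	}
	return 100 - cloudflarePercent, cloudflarePercent, nil
}

// share is the expected fraction of DNS answers for a record:
// its weight divided by the sum of all weights.
func share(weight, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(weight) / float64(total)
}

func main() {
	// Hypothetical rollout steps; the final 0 demonstrates a rollback.
	for _, p := range []int{1, 10, 50, 100, 0} {
		f, c, err := weights(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("fastly=%d cloudflare=%d expected Cloudflare share=%.2f\n", f, c, share(c, f+c))
	}
}
```

Because resolvers cache answers for the record's TTL, a weight change (including a rollback) takes effect gradually rather than instantly, so the observed traffic ratio should be checked after each step.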

We used Datadog to monitor traffic during migration, checking several metrics.

First, we monitored whether traffic rates were as intended. The following image shows traffic rates visualized from the request ratios between Fastly and Cloudflare.

Cloudflare Traffic Rate

Next, the image below shows the ratio of requests with non-2xx status codes out of all Cloudflare requests. Monitoring these metrics during traffic increases is important.

Cloudflare Non 2xx Rate
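Both ratios are simple to compute from request counts. The Go sketch below shows the calculations as described above. The function names and sample counts are made up, and in practice the values would come from the monitoring system's queries rather than from code like this.

```go
package main

import "fmt"

// trafficRate is the fraction of all requests that were served by Cloudflare.
func trafficRate(fastlyReqs, cloudflareReqs int) float64 {
	total := fastlyReqs + cloudflareReqs
	if total == 0 {
		return 0
	}
	return float64(cloudflareReqs) / float64(total)
}

// non2xxRate is the fraction of Cloudflare requests whose status code is
// outside the 200-299 range. byStatus maps a status code to a request count.
func non2xxRate(byStatus map[int]int) float64 {
	total, bad := 0, 0
	for status, n := range byStatus {
		total += n
		if status < 200 || status > 299 {
			bad += n
		}
	}
	if total == 0 {
		return 0
	}
	return float64(bad) / float64(total)
}

func main() {
	// Sample counts for illustration only.
	fmt.Printf("Cloudflare traffic rate: %.2f\n", trafficRate(7500, 2500))
	fmt.Printf("Cloudflare non-2xx rate: %.3f\n",
		non2xxRate(map[int]int{200: 2400, 304: 60, 404: 30, 503: 10}))
}
```

Comparing the first ratio with the configured DNS weights confirms the rollout step took effect, while a rise in the second ratio after a step is a signal to pause or roll back.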

Because the switch from Fastly to Cloudflare should cause no major visible change from the client’s perspective, we also compared cache hit rates, request counts, and bandwidth usage between the two.

Not every service migration was incident-free, but these approaches helped us avoid major incidents and minimize the impact when incidents did occur.

CDN as a Service

As the next step after migration, we aim to enable developer self-service by moving from CDN services centrally managed by the Network team to "CDN as a Service."

Here, I’ll introduce two initiatives toward "CDN as a Service".

CDN Kit

We named the Terraform module created during the migration process "CDN Kit." By using CDN Kit, developers can easily achieve their goals without needing to define several Terraform resources. The Platform team can also provide best practices in one place instead of requiring changes to individual service configuration files.

For example, if the requirement is simply to access the origin via Cloudflare, a developer can use CDN Kit as follows:

module "cdn_kit" {
  source = "..."

  company     = "mercari"
  environment = "development"
  domain      = "example.mercari.com"

  endpoints = {
    "@" = {
      backend = "example.com"
    }
  }
}

Though simple from a developer’s perspective, using CDN Kit automatically creates various resources. Examples:

  • Automated logging to BigQuery
    • Normally, Cloud Functions are used to log Cloudflare data into BigQuery (see Cloudflare’s documentation). However, creating these for each service is cumbersome, so CDN Kit creates the necessary resources automatically.
  • Creation of Datadog monitors
  • Issuance of auto-updated SSL/TLS certificates

Permission Granting System

Cloudflare’s dashboard is a powerful tool for interactive access analysis. However, several challenges needed resolution to make the dashboard accessible to developers:

  • Managing access for employees who have left the company
  • Automating permission grants

We solved the first challenge by enabling SSO on Cloudflare’s dashboard with Okta as the identity provider (see Cloudflare’s documentation). Mercari already uses Okta, and the IT team manages the accounts of employees who leave. Removing a departed employee’s account from Okta therefore also removes their access to Cloudflare’s dashboard automatically, with no direct involvement from the Network team.

For the second challenge, we created a system that operates in conjunction with our existing internal system. The following is an overview diagram:

Note: Team Kit is a Terraform module for managing developer groups.

Cloudflare SSO

The Terraform modules for managing developer teams (Team Kit) and managing Cloudflare (CDN Kit) are managed in a GitHub repository. We created a GitHub Actions Workflow to automatically detect module updates. Upon detection, it generates permission management manifest files and commits them to the GitHub repository, as shown below:

account_id: [Cloudflare Account ID]
zone_id: [Cloudflare Zone ID]
zone_name: [Cloudflare Zone Name]
teams:
- team_id: [ID of Team Kit]
  roles:
  - Domain Administrator Read Only
users:
- email: [email address]
  roles:
  - Domain Administrator Read Only

On detecting changes in the manifest files, another GitHub Actions Workflow runs, setting appropriate permissions in Cloudflare based on the manifest files.

We chose to manage Cloudflare permissions declaratively through manifest files instead of changing them directly from a GitHub Actions Workflow. Because the manifest is the source of truth, the system can restore the correct state even after someone changes permissions manually.
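The reconciliation step behind this declarative approach can be sketched in Go as a diff between the roles the manifest declares and the roles that currently exist. This is an illustration under assumptions, not the actual workflow: the `change` type and `reconcile` function are hypothetical, the email addresses and the "Administrator" role are sample data, and the calls that would apply the changes to Cloudflare are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// change is one action needed to make the actual state match the manifest.
type change struct {
	Email  string
	Role   string
	Action string // "grant" or "revoke"
}

// toSet converts a list of role names into a set for fast lookups.
func toSet(items []string) map[string]bool {
	s := make(map[string]bool, len(items))
	for _, it := range items {
		s[it] = true
	}
	return s
}

// reconcile compares desired roles (from the manifest) with actual roles
// (from the provider); both map a member email to a list of role names.
func reconcile(desired, actual map[string][]string) []change {
	var out []change
	for email, want := range desired {
		have := toSet(actual[email])
		for _, r := range want {
			if !have[r] {
				out = append(out, change{Email: email, Role: r, Action: "grant"})
			}
		}
	}
	for email, have := range actual {
		want := toSet(desired[email])
		for _, r := range have {
			if !want[r] {
				out = append(out, change{Email: email, Role: r, Action: "revoke"})
			}
		}
	}
	// Sort so the plan is stable regardless of map iteration order.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Email != out[j].Email {
			return out[i].Email < out[j].Email
		}
		if out[i].Role != out[j].Role {
			return out[i].Role < out[j].Role
		}
		return out[i].Action < out[j].Action
	})
	return out
}

func main() {
	desired := map[string][]string{
		"dev@example.com": {"Domain Administrator Read Only"},
	}
	// Someone manually added an extra role in the dashboard.
	actual := map[string][]string{
		"dev@example.com": {"Domain Administrator Read Only", "Administrator"},
	}
	for _, c := range reconcile(desired, actual) {
		fmt.Printf("%s %q for %s\n", c.Action, c.Role, c.Email)
	}
}
```

Because the plan is computed from the full desired state each time, running it again after a manual change produces exactly the revocations or grants needed to undo the drift.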

The permission granting system allows developers to view the dashboard without requesting access from the Network team. Developers have independently identified and resolved issues using the dashboard, affirming the effectiveness of our "CDN as a Service" initiative.

Conclusion

In this article, I introduced our approach to CDN provider migration and described our initiatives toward "CDN as a Service," namely the Terraform module CDN Kit and the permission granting system.
