When Caching Hides the Truth: A VPC Service Controls & Artifact Registry Tale

Hello, I am South from the Mercari Platform Security team.

To mitigate the potential impact of Docker Hub rate limits and improve supply chain security, Mercari has undertaken a project to launch an in-house Docker registry and migrate our production infrastructure to pull from it. This project mainly involved Google Artifact Registry and VPC Service Controls.

This post covers the reason behind the project, the solution we chose, an outage caused during the rollout, and the lessons we learned.

Impetus: The Docker Rate Limit Announcement

This project began in response to the announcement of new Docker Hub rate limits. The announcement, giving about one week’s notice, set an initial effective date of March 1, 2025.

We promptly started investigating systems in our company infrastructure that pull from Docker Hub without authentication and drafted plans to ensure that these systems pull with credentials. While Mercari primarily builds and uses in-house containers, a small number were pulled from official upstream sources, including some base images from Docker Hub.

Later, we noticed that the new restriction had been delayed by a month to April 1, 2025, and we continued our planning.

Deciding on a Solution: the Registry Part

We evaluated several potential solutions. Google hosts a Docker Hub mirror at mirror.gcr.io, which caches "frequently-accessed public Docker Hub images". For images not cached by mirror.gcr.io, Google recommends using an Artifact Registry remote repository. (While our tests indicated direct pulls of uncached images via mirror.gcr.io might sometimes work, we followed the official guidance.) An Artifact Registry remote repository allows configuring Docker Hub credentials, ensuring reliable upstream image fetching without hitting rate limits. Alternatively, we could have configured Docker Hub credentials individually wherever image pulls occur, but this approach was deemed too labor-intensive and error-prone.
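
For illustration, this kind of remote repository can be expressed in Terraform roughly as follows. The project, location, and repository names here are placeholders, not our actual configuration:

```
# Artifact Registry remote repository acting as a pull-through cache
# for Docker Hub (illustrative names).
resource "google_artifact_registry_repository" "dockerhub_remote" {
  project       = "example-registry-project"
  location      = "asia-northeast1"
  repository_id = "dockerhub"
  format        = "DOCKER"
  mode          = "REMOTE_REPOSITORY"

  remote_repository_config {
    docker_repository {
      public_repository = "DOCKER_HUB"
    }
  }
}
```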

Considering critical use cases like our production cluster and CI/CD infrastructure, alongside the need for developers to pull images, we opted for the Artifact Registry route. Having chosen Artifact Registry, we started considering how to handle authentication between the image puller and the remote repository to prevent running a public Docker registry and potentially incurring substantial costs.

Setting the Stage: What are VPC Service Controls?

Before we dive into our solution for the authentication, let’s set the stage with a quick primer on VPC Service Controls.

VPC Service Controls (VPC-SC) is a Google Cloud feature for defining a service perimeter around specified resources. It controls both ingress (access from outside the perimeter to resources inside) and egress (access from inside the perimeter to resources outside). While ‘VPC’ is in the name, these perimeters can secure access to resources based on the project they reside in, which was key for our Artifact Registry setup.

Note: VPC-SC is tightly coupled with Access Context Manager (ACM): all VPC-SC APIs are under the accesscontextmanager.googleapis.com domain, and many VPC-SC resources (for example, ingress rules) can refer to ACM resources (for example, access levels). In this article, we will use VPC-SC to refer to both VPC-SC and ACM, since we are unlikely to use one without the other.

A service perimeter in VPC-SC typically contains Google Cloud projects and can restrict access to specific services within those projects. Conceptually, VPC-SC establishes this security perimeter around the specified resources. By default, this perimeter blocks network communication crossing its boundary.

To allow approved communication, administrators configure ingress and egress rules. These rules define specific exceptions, permitting authorized traffic through the perimeter under defined conditions. Crucially, ingress and egress refer to where the principal accessing the resource and the resource being accessed sit relative to the perimeter, not necessarily the direction of data flow. For example, allowing a user outside the perimeter to download a sensitive file from a bucket inside it requires an ingress rule, even though the sensitive data flows outward.

Rather than detailing all rule configurations, let’s consider a concrete example relevant to our use case. Suppose we want to allow users from a specific corporate IP range to access images from an Artifact Registry instance within a specific project. To achieve this:

  1. An access level must be created defining the specific IP range.

  2. An ingress rule must be configured for the perimeter, specifying this access level, the intended users (or service accounts), the target project, and the artifactregistry.googleapis.com service.

This configuration permits users from the specified IP range to access the registry, while access from other locations remains blocked by the perimeter.
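
As a rough Terraform sketch of the two steps above (the access policy ID, IP range, project number, and resource names below are illustrative, not our actual values):

```
# 1. Access level matching the corporate IP range.
resource "google_access_context_manager_access_level" "corp_ip" {
  parent = "accessPolicies/0123456789"
  name   = "accessPolicies/0123456789/accessLevels/corp_ip"
  title  = "corp_ip"

  basic {
    conditions {
      ip_subnetworks = ["203.0.113.0/24"]
    }
  }
}

# 2. Perimeter around the registry project with an ingress rule that lets
#    principals matching the access level reach Artifact Registry.
resource "google_access_context_manager_service_perimeter" "registry" {
  parent = "accessPolicies/0123456789"
  name   = "accessPolicies/0123456789/servicePerimeters/registry"
  title  = "registry"

  status {
    resources           = ["projects/123456789012"]
    restricted_services = ["artifactregistry.googleapis.com"]

    ingress_policies {
      ingress_from {
        sources {
          access_level = google_access_context_manager_access_level.corp_ip.name
        }
        identity_type = "ANY_IDENTITY"
      }
      ingress_to {
        resources = ["projects/123456789012"]
        operations {
          service_name = "artifactregistry.googleapis.com"
          method_selectors {
            method = "*"
          }
        }
      }
    }
  }
}
```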

Deciding on a Solution: the Authentication Part

Both IAM permissions and VPC-SC can manage access to Artifact Registry. However, certain internal workloads required the ability to pull images from specific IP ranges without easily configurable authentication mechanisms. Standard IAM role bindings alone could not satisfy this requirement.

IAM supports various principal identifiers. The allUsers identifier grants access to any principal, including unauthenticated users, whereas allAuthenticatedUsers restricts access to authenticated Google accounts. A notable consequence of using either principal identifier is the disabling of data access audit logs for the registry.

Given that this registry mirrors only public images, confidentiality was not a requirement. This allowed us to deviate from our usual identity-first approach and instead use network controls (IP filtering) to efficiently prevent costly, unauthorized external access. Implementing IP-based restrictions without altering numerous client applications necessitated using the allUsers binding on the Artifact Registry repository, thereby shifting the burden of access control entirely to the VPC-SC perimeter’s IP filtering rules.

This approach, using allUsers on the registry and relying on the VPC-SC perimeter for the actual IP-based filtering, was necessary to meet our requirement of allowing pulls from specific internal systems without embedding authentication credentials into each one. While we configured the IAM policy with reference to the relevant IAM documentation, the side effect of allUsers inhibiting data access logs was not apparent, as this detail resides mainly in the separate audit logging documentation. The significance of this logging behavior emerged during the subsequent incident response.
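
In Terraform, the binding itself is a single IAM member resource on the repository, roughly as below (the names are illustrative and refer back to the earlier remote repository sketch):

```
# Anonymous read access to the remote repository. Note that allUsers
# disables Data Access audit logs for this resource, which later
# complicated our incident investigation.
resource "google_artifact_registry_repository_iam_member" "public_reader" {
  project    = "example-registry-project"
  location   = "asia-northeast1"
  repository = google_artifact_registry_repository.dockerhub_remote.name
  role       = "roles/artifactregistry.reader"
  member     = "allUsers"
}
```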

Rolling Out: Dry-Running & Going Live

To validate our configuration safely, we utilized VPC-SC’s valuable dry-run mode. This feature logs potential policy violations that would occur if the policy were active, without actually blocking traffic, sending details of these potential denials to the audit logs. In Terraform, dry-run mode can be enabled using the use_explicit_dry_run_spec flag and specifying the intended policy within the spec block.
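
A simplified sketch of the perimeter in dry-run mode, using the same placeholder values as before:

```
# Dry-run mode: the intended policy lives under spec and is only evaluated
# for logging, not enforcement.
resource "google_access_context_manager_service_perimeter" "registry" {
  parent = "accessPolicies/0123456789"
  name   = "accessPolicies/0123456789/servicePerimeters/registry"
  title  = "registry"

  use_explicit_dry_run_spec = true

  spec {
    resources           = ["projects/123456789012"]
    restricted_services = ["artifactregistry.googleapis.com"]
    # ingress_policies / egress_policies as needed ...
  }
}
```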

After enabling dry-run mode for several days, we analyzed the audit logs to identify any legitimate traffic that would be inadvertently blocked and prepared the necessary additional ingress rules. The audit log provides details on the request, source identity and IP address, and destination service, enabling us to refine the policy.

Following the dry-run period and necessary rule adjustments, we enabled the VPC-SC restrictions in active mode. In Terraform, this involved disabling use_explicit_dry_run_spec and moving the policy definition from the spec block (for dry-run configuration) to the status block (for active configuration). Initially, registry operations continued without apparent issues.
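
Concretely, going live amounts to a change along these lines on the same resource (sketch):

```
# Active enforcement: the policy now lives under status and is enforced.
resource "google_access_context_manager_service_perimeter" "registry" {
  parent = "accessPolicies/0123456789"
  name   = "accessPolicies/0123456789/servicePerimeters/registry"
  title  = "registry"

  use_explicit_dry_run_spec = false

  status {
    resources           = ["projects/123456789012"]
    restricted_services = ["artifactregistry.googleapis.com"]
    # ingress_policies / egress_policies carried over from the dry-run spec ...
  }
}
```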

When Things Go Wrong: The Incident Unfolds

Several days after enablement, a planned update was required for the registry’s Docker Hub credentials. Originally, the registry pulled upstream images anonymously, but to avoid potential rate limits, we configured it through Terraform (this part will come into play later) to use an API token stored in Secret Manager.
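
The change was conceptually the addition of upstream credentials to the remote repository resource, roughly as sketched below; the username and secret path are placeholders:

```
# Authenticate upstream pulls to Docker Hub with a token stored in
# Secret Manager (extends the earlier remote repository sketch).
resource "google_artifact_registry_repository" "dockerhub_remote" {
  project       = "example-registry-project"
  location      = "asia-northeast1"
  repository_id = "dockerhub"
  format        = "DOCKER"
  mode          = "REMOTE_REPOSITORY"

  remote_repository_config {
    docker_repository {
      public_repository = "DOCKER_HUB"
    }
    upstream_credentials {
      username_password_credentials {
        username                = "example-dockerhub-user"
        password_secret_version = "projects/example-registry-project/secrets/dockerhub-token/versions/latest"
      }
    }
  }
}
```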

This update unexpectedly led to image pull failures for end-users. We began an investigation into the cause. The investigation faced challenges: data access logs were unavailable (a consequence of the allUsers setting), standard VPC-SC violation logs were not being generated for this failure mode, and the client error message provided only a generic "caller does not have permission". The recently enabled VPC-SC perimeter was identified as a likely factor. To restore service quickly while continuing the investigation, we decided to temporarily revert the VPC-SC enablement, which mitigated the issue 68 minutes after it began.

Digging Deeper: The Incident Investigation Process

Once the revert was complete and image pulls were functional again, we continued the investigation.

The investigation revealed that the root cause actually predated the credential switch. A required VPC-SC config had been missing since enablement, but its effect was masked by Artifact Registry’s image caching mechanism. When we switched the credentials using Terraform, the Artifact Registry repository resource was unnecessarily recreated due to a Terraform provider bug, clearing the cache. While we noted the planned recreation of the repository, we didn’t anticipate issues, assuming images could simply be re-fetched from the upstream source. However, the cache clearing exposed the underlying VPC-SC configuration gap: Artifact Registry now needed to pull images directly from Docker Hub but was unable to do so.

The core technical issue was that Artifact Registry required network egress to reach Docker Hub, and this path was blocked by the VPC-SC perimeter. Allowing this traffic requires a dedicated VPC-SC config (google_artifact_registry_vpcsc_config in Terraform) specific to Artifact Registry remote repositories. Crucially, this is not managed via standard egress rules: no egress rule, not even one that permits all egress, would allow this traffic; only the dedicated configuration lets remote repositories cross the perimeter for upstream fetches. This configuration was missing from our initial setup.
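
A minimal sketch of the configuration we were missing (placeholder project and location; at the time of writing this resource is provided by the google-beta provider):

```
# Allow Artifact Registry remote repositories in this project and location
# to fetch from upstream sources outside the perimeter.
resource "google_artifact_registry_vpcsc_config" "allow_upstream" {
  provider     = google-beta
  project      = "example-registry-project"
  location     = "asia-northeast1"
  vpcsc_policy = "ALLOW"
}
```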

Regarding the absence of VPC-SC violation logs for this failure, Google Cloud Support confirmed this is the expected behavior for this specific Artifact Registry egress scenario.

Furthermore, we discovered a limitation in the dry-run mode’s coverage: it did not generate violation logs for this specific scenario (blocked upstream pulls by a remote repository due to the missing google_artifact_registry_vpcsc_config), even though the active policy would block the traffic. We only identified the cause because Google Cloud Support was able to point out the issue based on the information we had provided. Fortunately, despite anticipating no disruption, our deployment plan included performing the rollout during hours when the team was available for immediate incident response, which proved essential.

After creating the necessary VPC-SC config for the remote repository, we re-enabled the restriction. This time, image pulls functioned correctly, even with an empty cache.

Learning from Experience: Retrospective Findings

Our post-incident review confirmed the missing VPC-SC config as the direct cause. The review also highlighted related areas for improvement:

  • Lack of visibility into the failure: early in the incident response, the absence of relevant logs made it difficult to determine the cause of the failure, forcing us to rely primarily on available Artifact Registry metrics and deductive reasoning to identify the root cause of the image pull failures.
    • Remediation: We now understand that using the allUsers binding inhibits data access audit log generation for certain events. This finding has been shared within our team and with other relevant teams. Going forward, we will explicitly consider this logging limitation as a known trade-off when evaluating the use of allUsers.
  • Lack of a comparable staging environment: while we had a testing environment and ran tests before applying the same changes to production, the testing environment was not similar enough to production; notably, it lacked the same downstream pullers, so it could not surface the problems that did not appear during testing but occurred during the incident.
    • Remediation: even though we have no immediate plans to change the registry, we have started building a staging environment parallel to production, with consumers of the registry pulling images from it, so that we can catch as many problems as possible during the next change.
  • Insufficient breakglass access: during the incident response, we tried to speed up the changes by bypassing CI and making changes with our breakglass access. While we were able to approve the breakglass request quickly, we discovered that the breakglass access role did not grant sufficient access to perform the changes.
    • Remediation: we made a change to the breakglass access role after the incident response. In addition, we are planning additional incident response training and tabletop exercises to catch similar issues.

We have since taken action to address some identified hazards and continue to work on others.

Final Thoughts: On VPC-SC and Third-Party Dependencies

VPC Service Controls is powerful, but its complexity demands careful configuration and a deep understanding, sometimes making alternative solutions preferable. If implementing VPC-SC, a thorough grasp of its mechanisms combined with rigorous testing (including dry runs) is essential for a successful and secure deployment.

In addition, learning from this experience, we recognize the risks associated with free third-party services, particularly how their terms can change unexpectedly. Consequently, we are adopting a more cautious stance moving forward. We will prioritize the stability and predictability offered by in-house solutions or paid services with explicit agreements, thereby minimizing our reliance on free external services wherever possible.
