Hi everyone, are you enjoying the Advent Calendar so far? This is the 13th post of the Mercari Advent Calendar and Christmas is approaching at a crazy speed, isn’t it?
My name is Raphael Fraysse, Engineering Manager for the Network team; please check out our team introduction when you get time!
Introduction
I will explain today how we managed to migrate ‘fast-secure-lb’, an Nginx-based critical network infrastructure component, from on-premises to our cloud-based platform without any client-side action or downtime. This approach saved us hundreds of hours by making a significant migration seamless for our developers. Sounds too good to be true? Let us show you how it happened!
Migration context
Here at Mercari, our infrastructure ran entirely in a private datacenter until 2018. That year, we embraced microservices to accelerate our business growth and began gradually offloading our monolithic application to a cloud-based microservices platform on Google Cloud Platform (GCP).
Since then, we have been conducting the microservices migration between our DCs and our platform built upon Google Kubernetes Engine (GKE). During the migration, the newly created platform and its microservices needed to constantly communicate with the DC monolithic service via a secure path as traffic flows over the Internet.
The primary component responsible for ensuring the secure path between them was “fast-secure-lb”, a load balancer based on Nginx terminating the TLS connections initiated by our microservices running in GKE.
As you can see, this was a critical component of our infrastructure as it controlled all north-south traffic between microservices and the monolith.
In early 2022, we started migrating this critical component to GKE to reduce our DCs’ footprint gradually.
Lift-and-shift
To minimize the migration cost, we adopted a lift-and-shift approach to containerizing the Nginx proxy in GKE. We focused on porting the existing solution, Nginx, directly into the cloud environment by transforming the Ansible-managed configuration from our bare metal servers in the datacenter to a Kubernetes Deployment based on the same containerized Nginx build.
As Nginx and its configuration are portable, this was easy to achieve. In Kubernetes, we could easily scale it out according to the incoming load and compensate for the performance loss compared with bare metal servers.
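As a rough sketch of that lift-and-shift (not our actual manifests), the containerized Nginx build can be wrapped in a Deployment plus a HorizontalPodAutoscaler to absorb load variations; the image, replica counts and CPU threshold below are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: fast-secure-lb
  namespace: fast-secure-lb-namespace
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fast-secure-lb
  template:
    metadata:
      labels:
        app: fast-secure-lb
    spec:
      containers:
        - name: nginx
          image: REGISTRY/fast-secure-lb-nginx:TAG   # same containerized Nginx build as in the DC
          ports:
            - containerPort: 443                     # TLS terminated by Nginx
          volumeMounts:
            - name: nginx-conf
              mountPath: /etc/nginx/conf.d
      volumes:
        - name: nginx-conf
          configMap:
            name: fast-secure-lb-nginx-conf          # configuration ported from the Ansible templates
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fast-secure-lb
  namespace: fast-secure-lb-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fast-secure-lb
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60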
From north-south to east-west and its benefits
Note: From this point on, we will use the present tense so that you can follow along as if performing the migration with us.
As fast-secure-lb is running in our datacenter, it is considered external to our cloud platform, so the traffic is treated as north-south. Crossing that boundary over the Internet requires a public DNS domain to call it from GKE. To achieve this, all clients running in the GKE platform call fast-secure-lb using a DNS A record hosted in a public zone in Amazon Route 53: ‘public-mercari.jp’. The record resolves to the public IP addresses of our fast-secure-lb servers in the datacenter.
Below is a simplified breakdown of the DNS hierarchy related to fast-secure-lb:
Under the ‘public-mercari.jp’ DNS zone, we have one DNS A record ‘fast-secure-lb.public-mercari.jp’ exposing 3 IP addresses corresponding to each fast-secure-lb server in the datacenter.
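For illustration, in zone-file notation the record looks roughly like this; the IP addresses are documentation-range placeholders (203.0.113.0/24), not our real datacenter addresses:

; 'public-mercari.jp' zone hosted in Route 53 (illustrative values)
fast-secure-lb.public-mercari.jp.    30    IN    A    203.0.113.10
fast-secure-lb.public-mercari.jp.    30    IN    A    203.0.113.11
fast-secure-lb.public-mercari.jp.    30    IN    A    203.0.113.12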
By migrating fast-secure-lb into GKE, the client services do not have to call it over the Internet via north-south communication anymore. Instead, they can directly call it via east-west communication, internally to the platform, bringing many benefits along:
- Latency optimization through reduced physical distance and fewer network hops
  - Data path for a request before migration: Services on GKE → Internet → Datacenter fast-secure-lb servers
  - Data path after migration (requests stay in the same VPC network): Services on GKE → fast-secure-lb on GKE
- Security strengthening
  - DNS exposure is internal only, preventing DNS spoofing or discovery by malicious actors
  - Traffic stays within the same internal network, protecting it from external threats
- Reliability improvement
  - DNS data path before migration: Services on GKE → Cluster DNS resolver (kube-dns) → Cloud DNS → Route 53
  - DNS data path after migration: Services on GKE → Node-local metadata server (Note: this will change later in this article)
- Certificate cost reduction
  - By switching to self-signed internal certificates, we no longer need to pay for public certificates (a minimal openssl sketch follows below).
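To illustrate that last point, a self-signed certificate for the Nginx proxy can be produced with openssl; a minimal sketch, assuming clients keep requesting the existing ‘fast-secure-lb.public-mercari.jp’ name and using an arbitrary validity period:

# Generate a self-signed certificate and key for the internal Nginx endpoint (example values)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout fast-secure-lb.key -out fast-secure-lb.crt \
  -subj "/CN=fast-secure-lb.public-mercari.jp" \
  -addext "subjectAltName=DNS:fast-secure-lb.public-mercari.jp"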
GKE DNS internals
Let’s dive into how DNS resolution works in GKE and GCP to understand how we could call fast-secure-lb using an internal DNS A record.
Source: https://cloud.google.com/kubernetes-engine/docs/how-to/kube-dns
In GKE, DNS resolution is by default provided by the cluster DNS agent, in our case kube-dns (we don’t use CoreDNS, which has been the upstream Kubernetes default since version 1.13).
When queried, kube-dns has two behaviours: one for domains it has authority over (i.e. ‘*.cluster.local’) and one for everything else. For the cluster’s internal domain, kube-dns checks the records directly in its own registry. For other domains, it calls the GCE (Google Compute Engine) node-local metadata server, which in turn queries Cloud DNS.
In Kubernetes, when a Service is created, an A record pointing to its virtual IP is automatically generated and shared with kube-dns. From a technical perspective, this is the simplest way to call fast-secure-lb from services within the cluster, but it requires developers to change their service configuration to use this new DNS A record. We’ll see in the next section why this is a problem for us.
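For illustration, a minimal Service for the containerized Nginx could look like the following; the port is an assumption, and only the naming pattern matters here:

apiVersion: v1
kind: Service
metadata:
  name: fast-secure-lb
  namespace: fast-secure-lb-namespace
spec:
  selector:
    app: fast-secure-lb        # matches the Nginx Deployment's pod labels
  ports:
    - name: https
      port: 443                # TLS port terminated by Nginx
      targetPort: 443

Kubernetes then automatically registers the ‘fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local’ A record, pointing to the Service’s ClusterIP, in the cluster DNS.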
Core problem
We know that client configuration changes take a very long time to complete, especially in our case where 50+ services used fast-secure-lb, each with its own priorities and business focus. Migrations requiring action from service developers are very costly, both for those developers and for the platform team coordinating the migration across all services.
Ideally, we don’t want to impact the developers at all while keeping the migration schedule and limiting our costs.
Does such a solution exist? This is the kind of migration a platform engineer dreams of, and the one we are aiming for: transparent, safe, easy to perform and seamless for users!
Let’s formulate our requirements for the migration to stay laser-focused on appropriate solutions:
- Gradual DNS migration without client action
- No impact on existing DNS resolution
- DNS resolution and data path must be internal (within the same VPC network)
Formulating (ideal) solutions
We know that fulfilling all requirements forces us to:
- Keep the same domain configuration on the client services.
- Avoid reusing the same public domain in Route 53, as we cannot host internal A records from GCP there.
- Use an internal DNS A record.
These are almost contradictory requirements: keeping the client-facing domain name unchanged while changing the destination domain is impossible if we stick to the standards of RFC 1035. RFC 6672’s DNAME resource record does not help either, as it would still require touching the Route 53-hosted record.
We should think more generally of DNS as a simple application protocol and map approaches from other application protocols, such as HTTP, in altering requests.
In the HTTP protocol, it is relatively standard to use URL (Uniform Resource Locator) rewrites for various reasons.
What if we could do the same thing with DNS? This is not recommended for public DNS communication or critical systems because of the added complexity, maintenance cost and potential compatibility issues with third-party DNS servers. Nevertheless, this can be an option in a closed, simple and controlled environment.
We could achieve this with an L7 proxy, say Envoy, redirecting client traffic whenever it matches our Route 53 DNS endpoint as the destination. However, it is tricky to set up, as we would need Envoy in the data path of all services, either as a sidecar (service mesh approach) or as an egress gateway, and neither option is seamless for clients. It is also not native to the DNS protocol, since the rewrite happens at the HTTP layer.
Here enters the savior, CoreDNS
Fortunately, the Kubernetes ecosystem is blessed to have a great piece of DNS software: CoreDNS (yes, the default DNS server we talked about earlier!)
CoreDNS has so many plugins that it is hard to understand or keep track of everything, but there was a feature that caught our attention: the rewrite plugin.
This is a (DNS protocol native!) request/response rewrite engine that is pretty versatile, able to perform complex regex-based rewrites as well as simple ones. It is so versatile that it’s a little scary, and we prefer to avoid complex logic when possible to prevent performance bottlenecks in networking.
We didn’t thoroughly test the performance impact of a simple rewrite vs a complex regex-based rewrite, but other past experiences with HTTP convinced us to avoid using it on a critical path. (Remember, though, YMMV)
Thankfully, we don’t need anything more than a simple request rewrite to achieve our goal!
Whenever a client request to resolve the ‘fast-secure-lb.public-mercari.jp’ domain name goes through CoreDNS, we want CoreDNS to rewrite the request to ‘fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local’.
It is as simple as writing this line in CoreDNS’s Corefile:
rewrite name fast-secure-lb.public-mercari.jp fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local
Great! Now, we know how to get our new fast-secure-lb called without changing the client configuration! We can also call it using an internal DNS record. However, we still need to figure out two more things:
- How do we get CoreDNS to receive all client services’ DNS requests?
- How do we perform a gradual rollout to ensure we don’t face issues during migration?
Hijacking the cluster DNS with CoreDNS
This section only applies to our case, where kube-dns is the cluster DNS server. If your cluster already runs CoreDNS as its DNS server, you won’t have this issue, so feel free to skip this section.
Kube-dns receives all DNS requests from the cluster pods thanks to the dnsPolicy in the workload manifests being set to ‘ClusterFirst’ (the default in Kubernetes, although a ‘Default’ dnsPolicy also exists). Unfortunately, kube-dns cannot do a fraction of what CoreDNS does, so we need to find a way to get DNS traffic flowing to CoreDNS.
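For reference, this default is what most workload manifests rely on implicitly; spelled out in a Pod spec, it would look like this (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: example-client
spec:
  dnsPolicy: ClusterFirst      # Kubernetes default: resolve via the cluster DNS service (kube-dns here)
  containers:
    - name: app
      image: nginx:1.25        # placeholder image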
Happily, kube-dns has a feature called ‘stub domains’ (RFC 1123), allowing us to specify alternative nameservers for defined DNS domains and forward the DNS resolution to them.
By using this ConfigMap, we make kube-dns a stub resolver for the ‘public-mercari.jp’ domain, forwarding the DNS resolution to the CoreDNS Service IP address, which is then sent to the CoreDNS pods in a round-robin fashion:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  stubDomains: |
    {"public-mercari.jp" : ["COREDNS_IP_ADDRESS"]}
Once applied, it takes effect immediately. Thus, CoreDNS needs to be running and battle-tested beforehand to serve all the DNS traffic for the entire ‘public-mercari.jp’ domain. A stub domain is an all-or-nothing switch, either the whole domain is forwarded or it is not, so the change needs to be planned carefully, as you wouldn’t want your production queries to be impacted by the CoreDNS Deployment.
On the other hand, this option is cheaper than migrating kube-dns to CoreDNS as the sole DNS server for the GKE cluster. CoreDNS, as a standalone Deployment, only receives traffic for this domain, which should be small compared to the cluster’s internal DNS traffic.
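For completeness, here is a minimal sketch of what such a standalone CoreDNS Deployment and Service could look like; the image tag, replica count and names are assumptions, the ‘coredns-standalone’ ConfigMap holding the Corefile is shown in the next section, and the Service’s ClusterIP is the COREDNS_IP_ADDRESS referenced in the stub domain configuration above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns-standalone
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: coredns-standalone
  template:
    metadata:
      labels:
        app: coredns-standalone
    spec:
      containers:
        - name: coredns
          image: coredns/coredns:1.9.3             # example tag
          args: ["-conf", "/etc/coredns/Corefile"]
          ports:
            - name: dns
              containerPort: 53
              protocol: UDP
            - name: dns-tcp
              containerPort: 53
              protocol: TCP
            - name: metrics
              containerPort: 9153
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/coredns              # exposes the Corefile from the ConfigMap
      volumes:
        - name: config
          configMap:
            name: coredns-standalone
---
apiVersion: v1
kind: Service
metadata:
  name: coredns-standalone
  namespace: kube-system
spec:
  selector:
    app: coredns-standalone
  ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: 53
    - name: dns-tcp
      port: 53
      protocol: TCP
      targetPort: 53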
Kubernetes DNS limitations
Now that we have CoreDNS ready to process and rewrite the DNS traffic, let’s think about how to roll it out in a safe way.
First, here is what your CoreDNS Corefile could look like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-standalone
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        log . {combined} {
            class denial error
        }
        ready
        rewrite name fast-secure-lb.public-mercari.jp fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local
        forward . KUBE_DNS_SERVICE_IP
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
    }
Attention: The above zone logic will apply to any DNS request flowing through CoreDNS, so it only works if the same logic can apply to all stub domains forwarded to CoreDNS. If you use multiple stub domains and need different configurations per zone (= domain), you should add separate zone blocks.
When using kube-dns as the cluster DNS server, one issue is that resolution of the rewritten ‘cluster.local’ name falls back to kube-dns, so the DNS data path looks like the following, effectively doubling the DNS traffic for ‘fast-secure-lb.public-mercari.jp’ in kube-dns. Fortunately, caching the DNS results in CoreDNS (the ‘cache 30’ directive above) mitigates this.
client pod → kube-dns → CoreDNS → kube-dns → client pod
With this approach, once we apply the CoreDNS configuration, the Route 53 DNS records returned to clients are replaced by the fast-secure-lb GKE Service IP all at once, leading to a sudden, complete shift in traffic. This is dangerous because we cannot test the migration properly: if we make a mistake, our entire production goes down with nothing to stop it, so going with the current approach is not reasonable.
Gradual migration with Cloud DNS
Let’s think about a better approach that lets us perform the migration gradually, test ramp-up phases, and roll back without significant impact in case of migration issues.
Gradual migrations are done using weighted routing, with a proxy capable of controlling the amount of traffic between different endpoints based on a weight configuration. Having several phases, let’s say 1%, 10%, 25%, 50%, and then 100%, helps identify potential issues early in the migration process. As we modify the DNS resolution, we can only have the weighted routing happen at that layer. Route 53 can do that, but we don’t use it for the new domain. CoreDNS cannot do that, leaving us with one option: using GCP’s Cloud DNS.
Cloud DNS supports the weighted round-robin DNS routing policy, so if we use it for our DNS record, we should be able to achieve that safe migration!
The issue is that only kube-dns/CoreDNS knows our ‘fast-secure-lb.fast-secure-lb-namespace.svc.cluster.local’ record, so we cannot have it in Cloud DNS…
Therefore, we need to give up on using the fast-secure-lb Service DNS record and find a way to use Cloud DNS instead. Let’s define an internal zone in Cloud DNS within the same VPC network as our GKE cluster.
We’ll create the ‘private.public-mercari.jp’ sub-domain to keep consistency with our external domain, so that we don’t break domain parity unnecessarily for our production traffic.
When calling our datacenter endpoint, we use the ‘public-mercari.jp’ domain hosted by Route 53, which is the zone’s SOA (Start of Authority) and primary nameserver. If we tried to add new records for this domain from another DNS service (i.e. Cloud DNS), they would never be served, because Cloud DNS is not the authoritative nameserver for the zone unless we delegate authority to it. We don’t want to do that, as the domain should keep being managed by Route 53.
Source: https://ns1.com/resources/split-horizon-or-multiview-dns
Split-horizon DNS is a nice feature security-wise, allowing us to reuse the same domain name across internal and external scopes. Sadly, GCP and AWS do not interoperate for split-horizon DNS, as both the internal and external zones must be hosted with the same cloud provider, so we cannot use it either.
Instead, we can create a new sub-domain of the same domain and manage it as a private zone. Its visibility is limited to the VPC network it is attached to, and it does not collide with the parent domain.
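With the gcloud CLI, creating such a private zone could look roughly like this; the zone name, description and VPC network name are placeholders:

# Create a private Cloud DNS zone visible only from our VPC network (example names)
gcloud dns managed-zones create fast-secure-lb-private \
  --dns-name="private.public-mercari.jp." \
  --description="Private sub-domain for the fast-secure-lb migration" \
  --visibility=private \
  --networks="our-vpc-network"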
Below is a breakdown of the sub-domain of ‘public-mercari.jp’: ‘private.public-mercari.jp’
It has one DNS A record, ‘fast-secure-lb.private.public-mercari.jp’, with two sets of RRDATA. In each set, a weight is defined alongside a list of IP addresses, controlling how much traffic each set receives behind the same DNS A record.
With this, we can use the weighted round-robin policy between both instances of fast-secure-lb.
Our DNS A record looks like the following:
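As a sketch, such a weighted record could be created, and later adjusted during the ramp-up, with the gcloud CLI as follows; the weights, datacenter IP addresses, GKE_FAST_SECURE_LB_IP placeholder and zone name are illustrative, and the exact routing-policy flag syntax may vary between gcloud versions:

# Create the weighted A record: all traffic to the datacenter set, none to GKE yet (example values)
gcloud dns record-sets create fast-secure-lb.private.public-mercari.jp. \
  --zone="fast-secure-lb-private" --type="A" --ttl="30" \
  --routing-policy-type="WRR" \
  --routing-policy-data="100=203.0.113.10,203.0.113.11,203.0.113.12;0=GKE_FAST_SECURE_LB_IP"

# Later, shift a small fraction of the traffic (here 1%) to the GKE instance
gcloud dns record-sets update fast-secure-lb.private.public-mercari.jp. \
  --zone="fast-secure-lb-private" --type="A" --ttl="30" \
  --routing-policy-type="WRR" \
  --routing-policy-data="99=203.0.113.10,203.0.113.11,203.0.113.12;1=GKE_FAST_SECURE_LB_IP"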
Let’s reflect it in our CoreDNS configuration by modifying the rewrite statement:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-standalone
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        log . {combined} {
            class denial error
        }
        ready
        rewrite name fast-secure-lb.public-mercari.jp fast-secure-lb.private.public-mercari.jp
        forward . 169.254.169.254
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
    }
We forward to the node-local metadata server (169.254.169.254) because it is the entry point to Cloud DNS, which hosts our internal DNS zone.
With this applied, we can confirm that the datacenter fast-secure-lb is still reachable, without any endpoint change on the client side, by querying kube-dns as the resolver:
dnstools# dig @KUBE_DNS_SERVICE_IP fast-secure-lb.public-mercari.jp
; <<>> DiG 9.11.3 <<>> @KUBE_DNS_SERVICE_IP fast-secure-lb.public-mercari.jp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8973
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 794e4f1d1cc49fc4 (echoed)
;; QUESTION SECTION:
;fast-secure-lb.public-mercari.jp. IN A
;; ANSWER SECTION:
fast-secure-lb.public-mercari.jp. 30 IN A 203.0.113.10
fast-secure-lb.public-mercari.jp. 30 IN A 203.0.113.11
fast-secure-lb.public-mercari.jp. 30 IN A 203.0.113.12
;; Query time: 7 msec
;; SERVER: KUBE_DNS_SERVICE_IP:53(KUBE_DNS_SERVICE_IP)
;; WHEN: Wed Dec 22 23:18:41 UTC 2021
;; MSG SIZE rcvd: 209
By proceeding with the gradual rollout of our weights, we migrated 100% of the traffic to fast-secure-lb in GKE without any downtime, impact or involvement from our developers. CoreDNS ran in this configuration without a single incident in production for eight months, until we decommissioned fast-secure-lb because it was no longer required in an east-west configuration.
Conclusion
In this article, we explained how we migrated fast-secure-lb, a critical component for the production traffic between our microservices running in the cloud in GCP and our monolith on-premises. We defined the problems arising from the migration and investigated how to solve them in the most seamless way possible.
We understood that a seamless migration requires a transparent, man-in-the-middle-style approach, in our case DNS rewriting. By diving into the DNS internals of GCP and GKE, we found a way to hijack the cluster DNS, kube-dns, with stub domains and hand the relevant traffic to a more capable DNS server, CoreDNS. We then identified the limitations of Kubernetes DNS for gradual migrations involving a DNS domain change, and integrated Cloud DNS’s powerful weighted round-robin routing policy to allow a safe and testable migration.
Finally, we shared how this seamless migration saved us hundreds of hours of engineering time by keeping 50+ services out of the migration, motivating us to keep looking for seamless options, when they exist, in future migrations.
If you are interested in this work and would like to discuss it with me and my team, please don’t hesitate to apply or contact me on Linkedin / Twitter!
Tomorrow’s article will be by @hatappi from my team! Look forward to it!