Capturing Network Packets in Kubernetes

This post is for Day 18 of Mercari Advent Calendar 2025, brought to you by @mshibuya from the Mercari Platform Network and SRE team.

Today, I’m going to talk about capturing network packets in a Kubernetes environment. As mentioned above, I’m currently part of the Network team, where we build and operate the network-related pieces of the platform that supports product development at Mercari.

At Mercari, we operate over several hundred microservices, and the network communication both within and between these services is complex and diverse. Due to the nature of our work, the Network team is often asked to investigate network-related issues and problems that arise in this environment. Of course, sometimes the cause is a simple misconfiguration, but in situations where the problem is complex and we’re struggling to find a starting point, we need a means for deep analysis. This is where packet capturing comes in.

A procedure like this is only useful in a real incident, especially a high-urgency one, if it is clearly defined before the problem occurs. The method we’ve established might not be directly applicable to your environment as-is. However, I’m publishing this article because I believe that a stable, repeatable investigation procedure like ours can serve as a reference when you create similar procedures in your own organizations.

Why is Capturing Packets in Kubernetes Difficult?

Kubernetes provides abstractions over layers such as the hardware and the OS, offering an environment where developers can run workloads without dealing with those raw resources. For security reasons, users such as developers generally do not have access to the underlying nodes. Furthermore, the Pods running on them are isolated from each other in a multi-tenant fashion. So it’s not as simple as in the old days, when you could just run tcpdump on a server and call it a day.

There’s also the difficulty introduced by the service mesh. At Mercari, we have adopted Istio, and communication within the cluster is encrypted with mTLS by default, so you can’t simply read the contents of the traffic as-is. We needed to establish a method that takes this into account.

Furthermore, our standpoint as a Platform team is to provide a complete set of tools for developers to deliver features easily and quickly, including these Kubernetes clusters. It’s impossible to predict when the need for such deep network troubleshooting will arise. A crucial requirement was to enable developers to perform this kind of investigation themselves via self-service, without needing special Platform-specific privileges.

Pod-Level Capture Using Ephemeral Containers

The method we established to meet these conditions is one that utilizes Kubernetes’ Ephemeral Containers feature.

Ephemeral Containers became generally available (GA) in Kubernetes v1.25. They let you attach a temporary debugging container to a running Pod, sharing resources such as its network namespace, without needing access to the entire node. This is perfect for packet capturing, as it eliminates the need to bundle debugging tools like tcpdump into the application container. Another significant advantage is that it requires no special privileges on the Node or the Cluster as a whole, allowing both Platform members and developers to investigate using the same method.

The specific procedure is as follows.

Step 1. Getting Necessary Permissions

At Mercari, we use an in-house tool called Carrier to temporarily grant permissions, achieving Zero Touch Production where we normally do not have operational privileges in the production environment.
Therefore, when performing packet captures to investigate problems in production, we first need to obtain a Role that has operational permissions for the target Pod.

This Role is pre-configured with the necessary permissions to operate Ephemeral Containers.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-role
rules:
# ...
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["create", "delete", "deletecollection", "patch", "update"]
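For completeness, granting this ClusterRole to a user would use a standard binding. The following is a minimal sketch with hypothetical binding and subject names; in our case, Carrier manages this grant automatically.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-role-binding   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: example-role
subjects:
- kind: User
  name: developer@example.com  # hypothetical subject
  apiGroup: rbac.authorization.k8s.io
```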

Step 2. Launching the Ephemeral Container

Once you have the permissions, attach an Ephemeral Container to the target Pod. Here, we use netshoot, which comes with a rich set of tools for all kinds of network troubleshooting, including packet capturing.

kubectl debug -it -n <your-namespace> <target-pod> \
  --image=nicolaka/netshoot \
  --custom=./root.yaml --container=netshoot-debug

Here, we prepare a file ./root.yaml with the following content beforehand.

securityContext:
  runAsUser: 0
  runAsNonRoot: false

This fulfills the requirement of "running the netshoot container as root," which is necessary to execute tcpdump inside the container. The content is short enough that I’d love to write it inline in the command, but for now, it seems that kubectl debug can only take a file as an argument…

Step 3. Performing the Capture

Once the netshoot container’s shell opens successfully, you can start the capture. Here, we’re writing to the file /tmp/capture.pcap.

tcpdump -i any -w /tmp/capture.pcap

In an Istio-enabled environment, this -i any is the key point. Traffic passes not only through eth0 but also through virtual interfaces, because iptables redirects it to the sidecar proxy. To avoid missing any of it, we capture on all interfaces. If you only capture on eth0, you’ll likely see only the mTLS-encrypted traffic, which is usually insufficient for the investigation.

Capturing all traffic can result in a massive amount of data. I won’t go into details here, but you can filter the packets you capture using tcpdump options. It’s easier for later analysis if you narrow down the capture as much as possible to packets related to the problem you’re investigating. Of course, there’s a trade-off: if you filter too much, you might find that you "didn’t capture the necessary data."
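As a sketch, some common capture filters look like this. The addresses and ports are hypothetical; adjust them to the traffic you’re actually investigating.

```shell
# Only traffic to/from a specific peer (e.g., the other service's Pod IP)
tcpdump -i any -w /tmp/capture.pcap host 10.1.2.3

# Only DNS traffic
tcpdump -i any -w /tmp/capture.pcap port 53

# Traffic between two specific hosts on one port, excluding everything else
tcpdump -i any -w /tmp/capture.pcap 'host 10.1.2.3 and host 10.4.5.6 and port 8080'
```

These are standard BPF capture filter expressions, so anything accepted by tcpdump’s filter syntax works here.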

Step 4. Retrieving the File

The above step creates a file in the Ephemeral Container. You can then copy it to your local machine using kubectl cp to complete the process. Don’t forget to specify the container name you assigned in Step 2.
Now you can move on to analyzing the captured data.

kubectl cp -n <your-namespace> <target-pod>:/tmp/capture.pcap ./capture.pcap -c netshoot-debug
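As an optional sanity check before analysis, you can confirm the copied file is a readable capture. This assumes tcpdump (or Wireshark) is installed on your local machine.

```shell
# Print the first five packets to confirm the file copied intact;
# reading a pcap file does not require root privileges
tcpdump -r ./capture.pcap -c 5
```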

Once you get comfortable with the process, you might want to perform Steps 2-4 in a single line. It would look like this. The -iq flags keep stdin open and suppress kubectl’s informational output so nothing extraneous is mixed into the file, and standard error is discarded as well. The -G 10 option rotates the capture file every 10 seconds, and combined with -W 1 it makes tcpdump exit after the first rotation, effectively limiting the capture to 10 seconds.

kubectl -n <your-namespace> debug <target-pod> -iq --image=nicolaka/netshoot --custom=./root.yaml -- bash -c 'tcpdump -i any -G 10 -W 1 -s0 -w - 2>/dev/null' > tcpdump.pcap

Node-Level Capture

In addition to the Pod-level capture method above, we have also prepared a procedure for performing packet captures by SSH-ing into a Google Kubernetes Engine (GKE) Node and using the CoreOS Toolbox. However, this is considered a supplementary method because it requires privileges to SSH into the Node and, as mentioned earlier, it can only capture the encrypted Istio traffic. It is mainly intended for Platform members to use for troubleshooting issues that can only be observed at the node level.

Step 1. Getting Necessary Permissions

At Mercari, we build and operate our Kubernetes clusters with Google Kubernetes Engine. First, you need to obtain the necessary permissions to SSH into the GKE nodes using the aforementioned Carrier. The following permissions should be sufficient.

  • roles/compute.instanceAdmin.v1
  • roles/iam.serviceAccountUser
  • roles/iap.tunnelResourceAccessor

Step 2. Identifying the Node

Use the kubectl get pod command to check the name of the node where the target Pod is hosted.

$ kubectl get pod -n <your-namespace> your-app-pod-7f5b7f7d9f-abcde -o wide
NAME                           READY   STATUS    RESTARTS   AGE    IP           NODE                                NOMINATED NODE   READINESS GATES
your-app-pod-7f5b7f7d9f-abcde   2/2     Running   0          2d1h   10.1.2.3     gke-cluster-1-node-pool-1-a1b2c3d4   <none>           <none>

Step 3. Entering the Toolbox Environment

Use gcloud compute ssh to SSH into the node, and then use the toolbox command to enter a shell environment equipped with debugging tools.

gcloud compute ssh --project <your-project> gke-cluster-1-node-pool-1-a1b2c3d4
# On the GKE node
$ toolbox

Step 4. Performing the Capture

Run tcpdump inside the toolbox shell. The host’s root filesystem is mounted at /media/root, so save the capture file to /media/root/tmp/, which corresponds to the node’s /tmp. Use -i any to specify capturing from all interfaces and use the Pod’s IP address, confirmed in Step 2, as a filter.

# Inside the toolbox shell
$ tcpdump -i any -w /media/root/tmp/node_capture.pcap host 10.1.2.3

Step 5. Retrieving the File

Exit the toolbox shell (exit) and then the SSH session (exit), and copy the file to your local machine using gcloud compute scp.

gcloud compute scp --project <your-project> gke-cluster-1-node-pool-1-a1b2c3d4:/tmp/node_capture.pcap ./node_capture.pcap

We haven’t had a chance to use this node-level capture in a real investigation yet, but by having the procedure established like this, we can begin investigating calmly when a problem does occur.

Summary

In this article, I introduced the practices for Kubernetes packet capturing at Mercari. Particularly at the Pod level, by leveraging Ephemeral Containers, we have established a procedure that allows developers to troubleshoot on their own while balancing security and convenience.

Comparing Pod-Level capture (Ephemeral Containers) with Node-Level capture (Toolbox):

  • Primary Use Case: Pod-Level is for investigating application-specific issues and inspecting mTLS traffic; Node-Level is for investigating node-wide network issues (e.g., CNI, iptables rules)
  • Required Permissions: Pod-Level is relatively low (Pod-level permissions); Node-Level is high (Node SSH access)
  • Traffic Visibility in Istio Environments: Pod-Level can capture unencrypted, plain-text traffic; Node-Level can only capture encrypted traffic
  • Ease of Targeting: Pod-Level makes it easy to target traffic by attaching directly to the Pod; Node-Level makes it relatively difficult to isolate a single Pod’s traffic among many
  • Recommended User: Pod-Level suits application developers and SREs; Node-Level suits Platform teams and SREs
  • Self-Service Suitability: Pod-Level is high (developers can investigate on their own); Node-Level is low (limited by the need for high privileges)

I am also pleased to announce that I will be presenting a deeper dive into this subject at SRECon26 Americas next March. My session is titled "It’s Not Always the Network (But Here’s How to Prove It): Kubernetes Packet Capture for SREs," and I hope to see some of you there in Seattle.

The next step after capturing packets is the phase of actually analyzing the captured data. Due to space constraints, and also because I’m still learning in that area, I didn’t touch on it this time, but I hope to share some knowledge on that in the future.

Thank you for reading to the end. Tomorrow’s article will be "Accelerating AI-Native Development with the Introduction of AWS Kiro and Automating Account Management with Okta" by amenbo-san and siroken3-san! Please continue to enjoy the series.
