Master these 5 common Kubernetes troubleshooting tasks
Not enough nodes? Have some noisy neighbors? Plenty of things can cause containers to underperform. Here's how to chase down and resolve five common Kubernetes problems.
Kubernetes is a complex system, and it can be a challenge to troubleshoot. It's one thing to recognize that a container cluster is unavailable or that a pod doesn't respond as expected. But how do you figure out the cause of those issues and fix them?
Several scenarios can cause problems that require troubleshooting, such as unavailable nodes, containers that won't respond and problems with the control plane or network connectivity. You can address these problems by assigning enough nodes to a cluster, using namespaces and resources quotas, running commands to troubleshoot non-responsive pods and more.
Look over these five common Kubernetes troubleshooting scenarios you might encounter and learn how to address them.
Unavailable nodes
Kubernetes distributes applications across worker nodes, the physical or virtual servers that host a Kubernetes cluster, and relies on control-plane nodes to host the services that manage the cluster. If a cluster lacks enough nodes to support the control plane and all the applications you want to run, you're likely to experience application performance degradation or failures.
To troubleshoot this problem, first ensure that enough nodes are assigned to the cluster. The minimum number of nodes a cluster needs depends on factors such as how much memory and CPU each node contains and what your applications require. For smaller-scale clusters, have at least one worker node for every two applications you want to run. If you want high availability, run multiple control-plane nodes; three is the commonly recommended minimum, because the etcd datastore needs a majority of its members available to maintain quorum.
If you think you have enough nodes but still experience issues, use an infrastructure monitoring tool to observe the servers that function as nodes. Look at how much memory and CPU each node consumes. If total utilization exceeds about 85% of the available resources, your nodes are not up to the task of hosting your workloads. Add more nodes to the cluster, or increase the nodes' resource allocations, to accommodate your workloads.
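If you don't have a full monitoring tool in place, kubectl offers a quick view of node health and usage. The commands below are a minimal sketch; kubectl top requires the Metrics Server add-on to be installed in the cluster.

# List all nodes and confirm each one reports a Ready status
kubectl get nodes

# Show current CPU and memory consumption per node (requires Metrics Server)
kubectl top nodes

# Show how much of each node's capacity is already requested by scheduled pods
kubectl describe nodes | grep -A 5 "Allocated resources"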
Having enough nodes doesn't guarantee ongoing stability, however. Nodes can fail after you've set them up and joined them to a cluster. Most cloud providers and on-premises VM platforms offer auto-recovery features that automatically restart a failed machine.
If all your nodes are VMs hosted on the same physical server, your entire cluster will fail if that server goes down. Spread nodes across more physical servers to limit the harm done to your cluster by a physical server failure.
Noisy neighbors
The noisy neighbor problem, in which one workload consumes a disproportionate share of resources and starves the workloads that share its nodes, is a common challenge in multi-tenant Kubernetes clusters. To troubleshoot it, give Kubernetes the information it needs to assign the right amount of resources to each workload. You can do this at the level of individual containers or pods using Limit Ranges, a capability that specifies the maximum resources a container or pod can consume.
Use namespaces to divide a single cluster into different virtual spaces. Then, use Kubernetes's Resource Quotas feature to place limits on the amount of resources that a single namespace can consume. This helps prevent one namespace from using more than its fair share of resources. Resource Quotas apply to an entire namespace and can't control the behavior of individual applications within the same namespace. To do that, use Limit Ranges.
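As an illustration, the manifests below sketch both objects. The namespace name and the specific CPU and memory values are assumptions; adjust them to match your workloads.

# LimitRange: per-container defaults and ceilings within one namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: team-a          # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container sets no request
      cpu: 250m
      memory: 128Mi
    default:                 # applied when a container sets no limit
      cpu: 500m
      memory: 256Mi
    max:                     # no single container may exceed these values
      cpu: "1"
      memory: 1Gi
---
# ResourceQuota: a ceiling on the namespace's total resource consumption
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a          # hypothetical namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"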
Non-responsive containers
Even if you configure proper Limit Ranges and Resource Quotas for your containers, pods and namespaces, you may discover that your containers or pods are not responding.
To troubleshoot non-responsive containers or pods, run this command.
kubectl get pods --all-namespaces
Specify a particular pod's name and/or a specific namespace to refine this command. The output lists all your pods and their status in a format such as the following.
NAME                          READY   STATUS              RESTARTS           AGE
ml-pipeline-685b7b74d-6pvdr   0/1     ContainerCreating   0                  294d
minio-65dff76b66-xk8jt        0/1     Pending             0                  13s
mysql-67f7987d45-fqmw8        0/1     Pending             0                  20s
proxy-agent-bff474798-sqc8g   0/1     CrashLoopBackOff    1545 (2m13s ago)   294d
In this scenario, one pod is in the ContainerCreating state, which means it's still launching its containers. Two other pods are Pending, which means they haven't been scheduled on a node yet. The last pod is in the CrashLoopBackOff state, which typically means the pod keeps crashing each time Kubernetes restarts it.
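To narrow the output to a single namespace or a single pod, use commands like the following; the namespace and pod name here are taken from the sample output above.

# List only the pods in the kubeflow namespace
kubectl get pods -n kubeflow

# Check a single pod by name
kubectl get pod proxy-agent-bff474798-sqc8g -n kubeflow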
Get additional details about the status of a pod and the containers within it using the following command.
kubectl describe pods --all-namespaces
The output tells you the detailed status of the pods. For instance, here are the details for the pod stuck in the CrashLoopBackOff state.
Name:             proxy-agent-bff474798-sqc8g
Namespace:        kubeflow
Priority:         0
Service Account:  proxy-agent-runner
Node:             chris-gazelle/192.168.0.107
Start Time:       Sun, 12 Nov 2023 12:37:32 -0500
Labels:           app=proxy-agent
                  application-crd-id=kubeflow-pipelines
                  pod-template-hash=bff474798
Annotations:      <none>
Status:           Running
IP:               192.168.0.107
IPs:
  IP:  192.168.0.107
Controlled By:  ReplicaSet/proxy-agent-bff474798
Containers:
  proxy-agent:
    Container ID:   containerd://dae867f6ff4434d50e68161fd1602889e09d0bd17486968bbee41a9fac32d7bb
    Image:          gcr.io/ml-pipeline/inverse-proxy-agent:1.8.5
    Image ID:       gcr.io/ml-pipeline/inverse-proxy-agent@sha256:a16f99d14ae724b94b403cb9d7ac8d8f4adce33ef6a24dd91306048c0c02c7e4
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    6
      Started:      Mon, 15 Jan 2024 12:58:09 -0500
      Finished:     Mon, 15 Jan 2024 12:58:09 -0500
    Ready:          False
    Restart Count:  1546
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fsjrz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-fsjrz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  44s (x21022 over 41d)  kubelet  Back-off restarting failed container
This information provides more clues about how the pod is configured and why it continues to crash. To troubleshoot the failure further, try the following:
- Ensure that the container image specified in the kubectl describe pods output is available and is not corrupted.
- Try to run the container directly from the CLI to see if it starts there without an issue (see the sketch after this list). If it does, the problem probably has to do with the way the pod is configured rather than the container itself.
- Verify the storage and secrets configuration to make sure they are not causing the pod to fail.
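As a rough sketch of the first two steps, assuming Docker is installed on your workstation, you could pull and run the image listed in the describe output directly. Some images expect environment-specific configuration and won't run cleanly outside the cluster, so treat this as a quick check rather than a definitive test.

# Pull the exact image the pod references to confirm it exists and isn't corrupted
docker pull gcr.io/ml-pipeline/inverse-proxy-agent:1.8.5

# Run it outside Kubernetes; if it starts cleanly here, the problem likely lies
# in the pod's configuration rather than in the container image itself
docker run --rm gcr.io/ml-pipeline/inverse-proxy-agent:1.8.5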
You may need to look at the container or application logs to get further information about the status of non-responsive pods and containers. The location of those logs may vary depending on where your containers and pods write log data.
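For containers that write to stdout and stderr, kubectl can retrieve those logs directly. The pod name and namespace below come from the earlier example; the --previous flag returns the logs of the last crashed container instance, which is usually the most useful view for a pod stuck in CrashLoopBackOff.

# Logs from the current (or most recent) container in the pod
kubectl logs proxy-agent-bff474798-sqc8g -n kubeflow

# Logs from the previous, crashed instance of the container
kubectl logs proxy-agent-bff474798-sqc8g -n kubeflow --previous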
Poorly configured readiness or liveness probes can also cause issues. Kubernetes uses liveness probes to check whether a container is responsive. If it's not, Kubernetes restarts the container. Readiness probes determine whether a container or set of containers are up and ready to accept traffic.
These probes are good safeguards against situations where you would otherwise need to manually restart a failed container, or where containers are not yet fully initialized and therefore not ready for traffic. But readiness and liveness probes that are too aggressive can make containers unavailable. For example, consider a liveness probe that checks a container every second and restarts it if the check fails. In some situations, network congestion or latency will cause the liveness check to take longer than one second to complete, even though the container is running without issue. Kubernetes will then restart the container over and over, making it unavailable.
To prevent this, configure readiness and liveness probes in ways that make sense for each container and its environment. Avoid one-size-fits-all configurations; each container should have its own probe settings.
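As a minimal sketch, the probe settings below give a container time to start and tolerate brief slowdowns before Kubernetes acts. The pod name, image and endpoint paths are illustrative assumptions.

apiVersion: v1
kind: Pod
metadata:
  name: web-app                              # hypothetical pod
spec:
  containers:
  - name: web
    image: registry.example.com/web-app:1.0  # hypothetical image
    livenessProbe:
      httpGet:
        path: /healthz              # assumed health endpoint
        port: 8080
      initialDelaySeconds: 15       # give the app time to start before probing
      periodSeconds: 10             # probe every 10 seconds, not every second
      timeoutSeconds: 5             # tolerate brief latency before counting a failure
      failureThreshold: 3           # require several consecutive failures before a restart
    readinessProbe:
      httpGet:
        path: /ready                # assumed readiness endpoint
        port: 8080
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3           # stop routing traffic only after repeated failures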
Control-plane problems
The Kubernetes control plane is the set of services and agents that manage the overall cluster, assign pods to nodes and so on. If your control plane fails, your cluster may behave in strange ways or fail entirely. Software bugs or insufficient resources on control plane nodes can cause a control-plane failure.
Check the health of your control plane using the following command.
kubectl cluster-info
The output looks like the following.
Kubernetes control plane is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy
This information tells you the status of the control plane and related services, such as CoreDNS in this example. If any services are not running as expected, run the following command to get details about their status.
kubectl cluster-info dump
If it's not obvious what the problem is from the cluster-info dump command, look at control plane logs. In most cases, you can find these in the /var/log directory of your control-plane nodes.
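On clusters where the control-plane components run as static pods in the kube-system namespace, which is the default for kubeadm-based installations, you can also inspect them with kubectl. The node name below is a placeholder.

# Check whether the control-plane components are running
kubectl get pods -n kube-system

# View the API server's log; replace <node-name> with the name of a control-plane node
kubectl logs -n kube-system kube-apiserver-<node-name>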
Network connectivity problems
Problems with your Kubernetes networking configuration can leave a running application unreachable by other applications or services.
There is no simple process for troubleshooting Kubernetes networks. However, basic steps include the following:
- Log into an application's host node directly using SSH or a similar protocol and attempt to ping or connect to the application from there. If you can reach the app from the node but not from outside, a NAT or firewall issue is probably the root cause of the connectivity problem.
- Check the syslog files of the nodes in your cluster for events and errors related to networking.
- Run the tcpdump command on a node to capture traffic flowing between pods or containers. Then, inspect the packets with a tool such as Wireshark to check for problems such as checksum errors.
- If an external load balancer is in use, make sure that the external load balancing service is up and running and that your load balancer configuration properly points to it.
- If you're experiencing connectivity issues, redeploy the application with simpler networking settings. For example, use a ClusterIP service instead of a load balancer (see the sketch after this list) to check whether the issue persists under that configuration. If it doesn't, modify networking configuration variables one at a time until you identify which one makes your application unreachable.
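For the last step, a plain ClusterIP Service is about the simplest networking configuration to fall back to. The service name, label and ports below are illustrative assumptions.

apiVersion: v1
kind: Service
metadata:
  name: my-app                # hypothetical service name
spec:
  type: ClusterIP             # internal-only; no external load balancer involved
  selector:
    app: my-app               # must match the labels on the application's pods
  ports:
  - port: 80                  # port the Service exposes inside the cluster
    targetPort: 8080          # port the application's containers listen on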
Chris Tozzi is a freelance writer, research advisor, and professor of IT and society who has previously worked as a journalist and Linux systems administrator.