
Resolving Kubernetes Production Pod Failure Due to EFS Mount & Memory Exhaustion

Introduction

In production Kubernetes environments, storage-related issues combined with resource exhaustion can lead to cascading pod failures. In this incident, a critical JMSQ backend pod reached 100% memory utilization, and subsequent pod recreation attempts failed. An associated efs-start pod was also impacted, resulting in Persistent Volume mount failures. This post walks through the real-world troubleshooting approach, the commands used, the root cause analysis, and the final fix applied in an Amazon EKS cluster.

Problem Statement

Symptoms Observed

  1. Application pod stuck in Pending / ContainerCreating state
  2. Continuous FailedMount errors
  3. Dependent pods unable to start
  4. Heap dump PVC backed by AWS EFS failing to mount

Error Message:

Warning FailedMount kubelet Unable to attach or mount volumes:
unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]:
timed out waiting for the condition

This clearly indicated a Persistent Volume mount issue, most likely related to EFS connectivity from worker nodes.
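
Before walking through the fix, it helps to see how widespread the mount failures are. A quick check I would start with (the prdq namespace comes from the pod shown in Step 1; adjust as needed):

# List all FailedMount events in the application namespace, newest last
kubectl get events -n prdq --field-selector reason=FailedMount --sort-by=.lastTimestamp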

Failure Flow (What Went Wrong)

JMSQ Pod Memory Spike
│
▼
Pod Restart Triggered
│
▼
Scheduler selects OLD Node
│
▼
Node has STALE EFS mount
│
▼
EFS Volume Mount Timeout
│
▼
Pod stuck in ContainerCreating
│
▼
FailedMount errors flood Events

Step-by-Step Troubleshooting & Fix

Step 1: Identify the Impacted Pod

Command used:

kubectl describe pod jmsq-deployment-34253e -n prdq

Output:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  110s (x30 over 67m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]: timed out waiting for the condition


Why this step?
kubectl describe provides detailed pod-level events, container states, volume mounts, and scheduling information.

Key Findings:

  1. Pod stuck in ContainerCreating
  2. Volume heapdump-volume failed to mount
  3. Storage backend: PersistentVolumeClaim (PVC)
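
To pull out exactly which PVCs the pod references without scanning the full describe output, a small jsonpath query works (pod and namespace names as above):

# Print the PVC names mounted by the pod
kubectl get pod jmsq-deployment-34253e -n prdq \
  -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'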

Step 2: Verify efs-start Pod & Storage Health

Command:

kubectl exec -it efs-start-23sed-34e -n prdefs -- /bin/sh

# Inside the pod:
df -h

Output:
Filesystem                Size      Used Available Use% Mounted on
overlay                 128.0G      5.2G    122.8G   4% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     7.6G         0      7.6G   0% /sys/fs/cgroup
fs-xxxxx.efs.us-west-2.amazonaws.com:/
                          8.0E      5.3G      8.0E   0% /persistentvolumes

Why this step?
To ensure that EFS itself was reachable and mounted correctly.

Observation:

  1. The EFS filesystem was mounted and reachable from the efs-start pod
  2. No disk space exhaustion
  3. This pointed to a node-level EFS connectivity issue (stale mounts on individual workers), not an EFS service failure
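
To go one step further and test EFS reachability from a worker node itself rather than from inside the efs-start pod, checks like the following can be run over SSH or SSM. These are generic NFS checks, not commands captured during the original incident, and nc may need to be installed on the node:

# Is the EFS NFS port reachable from the node?
nc -zv fs-xxxxx.efs.us-west-2.amazonaws.com 2049

# Any leftover EFS mounts or NFS errors that would suggest a stale mount?
mount | grep -i efs
dmesg | grep -iE 'nfs|stale'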

Step 3: Inspect Persistent Volumes (PV)

Command:

kubectl get pv
Output:
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                              STORAGECLASS    
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   101Gi      RWX            Delete           Bound    prd/jmsq-volume                      aws-efs      
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/solr-index-volume                slow-local   
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/heapdump-volume                  aws-efs 

Purpose:
Validate whether the PV associated with the heapdump PVC is in a healthy Bound state.

Result:

  1. heapdump-volume PV was Bound
  2. Backed by aws-efs storage class
  3. This ruled out PVC misconfiguration.
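
To double-check which EFS filesystem and path the PV actually points at, the PV behind the claim can be resolved and described (the prd namespace is taken from the CLAIM column above):

# Resolve the PV bound to the heapdump claim, then inspect its EFS source details
PV_NAME=$(kubectl get pvc heapdump-volume -n prd -o jsonpath='{.spec.volumeName}')
kubectl describe pv "$PV_NAME"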

Step 4: Verify Storage Classes

Command:

kubectl get storageclass
Output:
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE
aws-efs         example.com/aws-efs     Delete          Immediate        
fast-local      kubernetes.io/aws-ebs   Delete          Immediate                
slow-local      kubernetes.io/aws-ebs   Delete          Immediate        

Why this matters:
Confirms dynamic provisioning behavior and backend type.

Key Storage Class:

  1. aws-efs → RWX, dynamic provisioning, Immediate binding
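
If the provisioner or mount parameters need verifying, the full class definition can be dumped as well:

# Show the complete aws-efs StorageClass, including provisioner and parameters
kubectl get storageclass aws-efs -o yaml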

Step 5: Check Persistent Volume Claims (PVC)

Command:

kubectl get pvc
Output:
NAME           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   
Cloudnexus     Bound    pvc-XXXXXXXXXX   256Gi      RWO            slow-local     
Cloudjenkin    Bound    pvc-xxxxxxxxxx   64Gi       RWO            slow-local
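
Since the claim that matters here lives in an application namespace, a query across all namespaces avoids missing it:

# Confirm the heapdump claim is Bound, wherever it lives
kubectl get pvc --all-namespaces | grep -i heapdump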

Step 6: Inspect Worker Nodes

Command:

kubectl get nodes

Observation:

  1. All nodes appeared Ready
  2. However, older nodes likely had stale or broken EFS mount connections
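
Because node age was the main suspicion, sorting nodes by creation time makes the long-running workers easy to spot:

# List nodes oldest-first to identify long-running workers
kubectl get nodes --sort-by=.metadata.creationTimestamp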

Step 7: Cordon Existing Nodes

Command:

for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do
  kubectl cordon "$X"
done
Output:
[ec2-user@ip-xxxxxxxx ~]$ for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do kubectl cordon $X;done
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned

Why cordon nodes?

  1. Prevents new workloads from scheduling on problematic nodes
  2. Allows isolation of faulty infrastructure

Verify the nodes after cordoning:

kubectl get nodes
Output:
[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0

Step 8: Scale Auto Scaling Group to Add New Nodes

  1. New EKS worker nodes were provisioned via the Auto Scaling Group (ASG); an AWS CLI sketch follows this list
  2. The new nodes established fresh EFS mount connections
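
If the group does not scale out on its own, the AWS CLI can nudge it. This is a sketch with a placeholder group name and capacity, not the exact commands used during the incident:

# Identify the node group's ASG and its current desired capacity
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[].[AutoScalingGroupName,DesiredCapacity]' --output table

# Raise the desired capacity to bring up fresh nodes
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name <eks-nodegroup-asg-name> \
  --desired-capacity 10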

Verification:

kubectl get nodes
Output:
[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      <none>   50s   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      <none>   50s   v1.22.14-eks-fb459a0

Result:

  1. New nodes joined the cluster in a Ready state
  2. Healthy EFS connectivity from the new nodes

Step 9: Restart Impacted Deployments

Command:

kubectl rollout restart deployment <deployment-name> -n <namespace>

Outcome:

  1. Pods scheduled only on new healthy nodes
  2. EFS volumes mounted successfully
  3. Application recovered without data loss
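
To confirm the outcome above, it is enough to watch the rollout finish and check which nodes the pods landed on:

# Wait for the rollout to complete, then verify pod placement on the new nodes
kubectl rollout status deployment <deployment-name> -n <namespace>
kubectl get pods -n <namespace> -o wide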

Root Cause Analysis

  1. Long-running worker nodes developed stale EFS mount connections
  2. JMSQ pod memory exhaustion triggered repeated restarts
  3. Kubernetes attempted to reuse unhealthy nodes
  4. EFS mount timeout prevented PVC attachment

Final Fix Applied

✔ Cordoned impacted nodes
✔ Added fresh worker nodes via Auto Scaling Group
✔ Restarted deployments to force rescheduling
✔ Restored healthy EFS mounts and pod stability

Recovery Flow (Fix Applied)

Detect FailedMount Errors
│
▼
Cordon Old Worker Nodes
│
▼
Auto Scaling Group adds NEW Node
│
▼
New Node establishes fresh EFS mount
│
▼
Rollout Restart Deployment
│
▼
Pods scheduled on healthy node
│
▼
Application fully recovered

Key Learnings & Best Practices

  1. Always check Events section in kubectl describe pod
  2. PVC Bound ≠ storage is usable (node-level issues matter)
  3. Periodically rotate EKS worker nodes
  4. Monitor memory usage to avoid JVM heap exhaustion (see the resource-limit sketch after this list)
  5. Use rollout restart instead of deleting pods manually
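
For learning 4, one low-effort guardrail is to give the JMSQ containers explicit memory requests and limits, so a leaking JVM gets OOM-killed and restarted instead of starving the whole node. The sizes below are illustrative, not the values from this cluster:

# Cap container memory on the deployment (example sizes)
kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=memory=2Gi --limits=memory=4Gi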

Summary

This case study demonstrates how storage + node health issues can silently break Kubernetes workloads even when cluster objects appear healthy. A structured troubleshooting approach — starting from pod events, moving to storage, and finally node isolation — helped resolve the production outage efficiently with minimal risk.

If you are running stateful workloads on EKS with EFS, proactive node lifecycle management and monitoring are critical to avoid similar failures.

Happy Learning & Reliable Kubernetes! 🚀

Follow me on LinkedIn: www.linkedin.com/in/alok-shankar-55b94826
