
Resolving Kubernetes Production Pod Failure Due to EFS Mount & Memory Exhaustion

Introduction

In production Kubernetes environments, storage-related issues combined with resource exhaustion can lead to cascading pod failures. In this incident, a critical JMSQ backend pod reached 100% memory utilization, and subsequent pod recreation attempts failed. An associated efs-start pod was also impacted, resulting in Persistent Volume mount failures. This post walks through the real-world troubleshooting approach, the commands used, the root cause analysis, and the final fix applied in an Amazon EKS cluster.

Problem Statement

Symptoms Observed

  1. Application pod stuck in Pending / ContainerCreating state
  2. Continuous FailedMount errors
  3. Dependent pods unable to start
  4. Heap dump PVC backed by AWS EFS failing to mount

Error Message:

Warning FailedMount kubelet Unable to attach or mount volumes:
unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]:
timed out waiting for the condition

This clearly indicated a Persistent Volume mount issue, most likely related to EFS connectivity from worker nodes.
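
Before walking through the fix, it helps to see how widespread the mount failures are. A quick check I would start with (the prdq namespace comes from the pod shown in Step 1; adjust as needed):

# List all FailedMount events in the application namespace, newest last
kubectl get events -n prdq --field-selector reason=FailedMount --sort-by=.lastTimestamp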

Failure Flow (What Went Wrong)

JMSQ Pod Memory Spike
│
▼
Pod Restart Triggered
│
▼
Scheduler selects OLD Node
│
▼
Node has STALE EFS mount
│
▼
EFS Volume Mount Timeout
│
▼
Pod stuck in ContainerCreating
│
▼
FailedMount errors flood Events

Step-by-Step Troubleshooting & Fix

Step 1: Identify the Impacted Pod

Command used:

kubectl describe pod jmsq-deployment-34253e -n prdq

Output:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  110s (x30 over 67m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]: timed out waiting for the condition


Why this step?
kubectl describe provides detailed pod-level events, container states, volume mounts, and scheduling information.

Key Findings:

  1. Pod stuck in ContainerCreating
  2. Volume heapdump-volume failed to mount
  3. Storage backend: PersistentVolumeClaim (PVC)
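
To pull out exactly which PVCs the pod references without scanning the full describe output, a small jsonpath query works (pod and namespace names as above):

# Print the PVC names mounted by the pod
kubectl get pod jmsq-deployment-34253e -n prdq \
  -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'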

Step 2: Verify efs-start Pod & Storage Health

Command:

kubectl exec -it efs-start-23sed-34e -n prdefs -- /bin/sh

# Inside the pod:
df -h

Output:
Filesystem                Size      Used Available Use% Mounted on
overlay                 128.0G      5.2G    122.8G   4% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     7.6G         0      7.6G   0% /sys/fs/cgroup
fs-xxxxx.efs.us-west-2.amazonaws.com:/
                          8.0E      5.3G      8.0E   0% /persistentvolumes

Why this step?
To ensure that EFS itself was reachable and mounted correctly.

Observation:

  1. The EFS filesystem was mounted and reachable from the efs-start pod
  2. No disk space exhaustion
  3. This pointed to a node-level EFS connectivity issue (stale mounts on individual workers), not an EFS service failure
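
To go one step further and test EFS reachability from a worker node itself rather than from inside the efs-start pod, checks like the following can be run over SSH or SSM. These are generic NFS checks, not commands captured during the original incident, and nc may need to be installed on the node:

# Is the EFS NFS port reachable from the node?
nc -zv fs-xxxxx.efs.us-west-2.amazonaws.com 2049

# Any leftover EFS mounts or NFS errors that would suggest a stale mount?
mount | grep -i efs
dmesg | grep -iE 'nfs|stale'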

Step 3: Inspect Persistent Volumes (PV)

Command:

kubectl get pv
Output:
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                              STORAGECLASS    
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   101Gi      RWX            Delete           Bound    prd/jmsq-volume                      aws-efs      
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/solr-index-volume                slow-local   
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx   100Gi      RWO            Delete           Bound    prd/heapdump-volume                  aws-efs 

Purpose:
Validate whether the PV associated with the heapdump PVC is in a healthy Bound state.

Result:

  1. heapdump-volume PV was Bound
  2. Backed by aws-efs storage class
  3. This ruled out PVC misconfiguration.
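
To double-check which EFS filesystem and path the PV actually points at, the PV behind the claim can be resolved and described (the prd namespace is taken from the CLAIM column above):

# Resolve the PV bound to the heapdump claim, then inspect its EFS source details
PV_NAME=$(kubectl get pvc heapdump-volume -n prd -o jsonpath='{.spec.volumeName}')
kubectl describe pv "$PV_NAME"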

Step 4: Verify Storage Classes

Command:

kubectl get storageclass
Output:
NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE
aws-efs         example.com/aws-efs     Delete          Immediate        
fast-local      kubernetes.io/aws-ebs   Delete          Immediate                
slow-local      kubernetes.io/aws-ebs   Delete          Immediate        

Why this matters:
Confirms dynamic provisioning behavior and backend type.

Key Storage Class:

  1. aws-efs → RWX, dynamic provisioning, Immediate binding
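
If the provisioner or mount parameters need verifying, the full class definition can be dumped as well:

# Show the complete aws-efs StorageClass, including provisioner and parameters
kubectl get storageclass aws-efs -o yaml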

Step 5: Check Persistent Volume Claims (PVC)

Command:

kubectl get pvc
Output:
NAME           STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   
Cloudnexus     Bound    pvc-XXXXXXXXXX   256Gi      RWO            slow-local     
Cloudjenkin    Bound    pvc-xxxxxxxxxx   64Gi       RWO            slow-local
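
Since the claim that matters here lives in an application namespace, a query across all namespaces avoids missing it:

# Confirm the heapdump claim is Bound, wherever it lives
kubectl get pvc --all-namespaces | grep -i heapdump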

Step 6: Inspect Worker Nodes

Command:

kubectl get nodes

Observation:

  1. All nodes appeared Ready
  2. However, older nodes likely had stale or broken EFS mount connections
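
Because node age was the main suspicion, sorting nodes by creation time makes the long-running workers easy to spot:

# List nodes oldest-first to identify long-running workers
kubectl get nodes --sort-by=.metadata.creationTimestamp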

Step 7: Cordon Existing Nodes

Command:

for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do
  kubectl cordon "$X"
done
Output:
[ec2-user@ip-xxxxxxxx ~]$ for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do kubectl cordon $X;done
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned

Why cordon nodes?

  1. Prevents new workloads from scheduling on problematic nodes
  2. Allows isolation of faulty infrastructure

Verify the nodes after cordoning:

kubectl get nodes
Output:
[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0

Step 8: Scale Auto Scaling Group to Add New Nodes

  1. New EKS worker nodes were provisioned via the Auto Scaling Group (ASG); an AWS CLI sketch follows this list
  2. The new nodes established fresh EFS mount connections
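
If the group does not scale out on its own, the AWS CLI can nudge it. This is a sketch with a placeholder group name and capacity, not the exact commands used during the incident:

# Identify the node group's ASG and its current desired capacity
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[].[AutoScalingGroupName,DesiredCapacity]' --output table

# Raise the desired capacity to bring up fresh nodes
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name <eks-nodegroup-asg-name> \
  --desired-capacity 10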

Verification:

kubectl get nodes
Output:
[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME                                          STATUS                     ROLES    AGE    VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   33d   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      <none>   50s   v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal   Ready                      <none>   50s   v1.22.14-eks-fb459a0

Result:

  1. New nodes joined the cluster in a Ready state
  2. Healthy EFS connectivity from the new nodes

Step 9: Restart Impacted Deployments

Command:

kubectl rollout restart deployment <deployment-name> -n <namespace>

Outcome:

  1. Pods scheduled only on new healthy nodes
  2. EFS volumes mounted successfully
  3. Application recovered without data loss
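
To confirm the outcome above, it is enough to watch the rollout finish and check which nodes the pods landed on:

# Wait for the rollout to complete, then verify pod placement on the new nodes
kubectl rollout status deployment <deployment-name> -n <namespace>
kubectl get pods -n <namespace> -o wide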

Root Cause Analysis

  1. Long-running worker nodes developed stale EFS mount connections
  2. JMSQ pod memory exhaustion triggered repeated restarts
  3. Kubernetes attempted to reuse unhealthy nodes
  4. EFS mount timeout prevented PVC attachment

Final Fix Applied

✔ Cordoned impacted nodes
✔ Added fresh worker nodes via Auto Scaling Group
✔ Restarted deployments to force rescheduling
✔ Restored healthy EFS mounts and pod stability

Recovery Flow (Fix Applied)

Detect FailedMount Errors
│
▼
Cordon Old Worker Nodes
│
▼
Auto Scaling Group adds NEW Node
│
▼
New Node establishes fresh EFS mount
│
▼
Rollout Restart Deployment
│
▼
Pods scheduled on healthy node
│
▼
Application fully recovered

Key Learnings & Best Practices

  1. Always check Events section in kubectl describe pod
  2. PVC Bound ≠ storage is usable (node-level issues matter)
  3. Periodically rotate EKS worker nodes
  4. Monitor memory usage to avoid JVM heap exhaustion (see the resource-limit sketch after this list)
  5. Use rollout restart instead of deleting pods manually
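
For learning 4, one low-effort guardrail is to give the JMSQ containers explicit memory requests and limits, so a leaking JVM gets OOM-killed and restarted instead of starving the whole node. The sizes below are illustrative, not the values from this cluster:

# Cap container memory on the deployment (example sizes)
kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=memory=2Gi --limits=memory=4Gi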

Summary

This case study demonstrates how storage + node health issues can silently break Kubernetes workloads even when cluster objects appear healthy. A structured troubleshooting approach — starting from pod events, moving to storage, and finally node isolation — helped resolve the production outage efficiently with minimal risk.

If you are running stateful workloads on EKS with EFS, proactive node lifecycle management and monitoring are critical to avoid similar failures.

Happy Learning & Reliable Kubernetes! 🚀

Follow me on LinkedIn: www.linkedin.com/in/alok-shankar-55b94826
