Introduction
In production Kubernetes environments, storage issues combined with resource exhaustion can lead to cascading pod failures. In this incident, a critical JMSQ backend pod reached 100% memory utilization and subsequent pod recreation attempts failed. The associated efs-start pod was also impacted, resulting in Persistent Volume mount failures. This blog walks through the real-world troubleshooting approach, the commands used, the root cause analysis, and the final fix applied in an Amazon EKS cluster.
Problem Statement
Symptoms Observed
- Application pod stuck in Pending / ContainerCreating state
- Continuous FailedMount errors
- Dependent pods unable to start
- Heap dump PVC backed by AWS EFS failing to mount
Error Message:
Warning FailedMount kubelet Unable to attach or mount volumes:
unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]:
timed out waiting for the condition
This clearly indicated a Persistent Volume mount issue, most likely related to EFS connectivity from worker nodes.
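A quick way to surface these events across the whole namespace is shown below; the namespace name prdq is taken from the steps later in this post, and the field selector is a standard kubectl option rather than anything specific to this cluster:
kubectl get events -n prdq --field-selector reason=FailedMount --sort-by=.lastTimestamp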
Failure Flow (What Went Wrong)
JMSQ Pod Memory Spike
│
▼
Pod Restart Triggered
│
▼
Scheduler selects OLD Node
│
▼
Node has STALE EFS mount
│
▼
EFS Volume Mount Timeout
│
▼
Pod stuck in ContainerCreating
│
▼
FailedMount errors flood Events
Step-by-Step Troubleshooting & Fix
Step 1: Identify the Impacted Pod
Command used:
kubectl describe pod jmsq-deployment-34253e -n prdq
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 110s (x30 over 67m) kubelet Unable to attach or mount volumes: unmounted volumes=[heapdump-volume], unattached volumes=[heapdump-volume]: timed out waiting for the condition
Why this step?
kubectl describe provides detailed pod-level events, container states, volume mounts, and scheduling information.
Key Findings:
- Pod stuck in ContainerCreating
- Volume heapdump-volume failed to mount
- Storage backend: PersistentVolumeClaim (PVC)
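If it is not yet clear which pods are affected, a rough filter like the one below (namespace assumed from this incident) lists everything that is not running cleanly:
# List pods stuck in Pending or ContainerCreating
kubectl get pods -n prdq | grep -E 'Pending|ContainerCreating'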
Step 2: Verify efs-start Pod & Storage Health
Command:
kubectl exec -it efs-start-23sed-34e -n prdefs -- /bin/sh
# Inside the pod:
df -h
Filesystem Size Used Available Use% Mounted on
overlay 128.0G 5.2G 122.8G 4% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 7.6G 0 7.6G 0% /sys/fs/cgroup
fs-xxxxx.efs.us-west-2.amazonaws.com:/
8.0E 5.3G 8.0E 0% /persistentvolumes
Why this step?
To ensure that EFS itself was reachable and mounted correctly.
Observation:
- EFS filesystem was mounted
- No disk space exhaustion
- Indicates node-level EFS connectivity issue, not EFS service failure
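When the mount looks healthy from inside the pod, the provisioner's logs are the next place to look. A minimal check, using the pod and namespace names from this environment:
# Check recent efs-start logs for mount or NFS errors
kubectl logs efs-start-23sed-34e -n prdefs --tail=100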
Step 3: Inspect Persistent Volumes (PV)
Command:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx 101Gi RWX Delete Bound prd/jmsq-volume aws-efs
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx 100Gi RWO Delete Bound prd/solr-index-volume slow-local
pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx 100Gi RWO Delete Bound prd/heapdump-volume aws-efs
Purpose:
Validate whether the PV associated with the heapdump PVC is in a healthy Bound state.
Result:
- heapdump-volume PV was Bound
- Backed by aws-efs storage class
- This ruled out PVC misconfiguration.
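To double-check which EFS filesystem and export path a PV points at, describing the PV shows its backing source; the PV name below is a placeholder, as in the listing above:
# Show the backing EFS server and path for the heapdump PV
kubectl describe pv pvc-xxxxxxxx-xxxxx-xxxxxxxxx-xxxxxxxxxxx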
Step 4: Verify Storage Classes
Command:
kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
aws-efs example.com/aws-efs Delete Immediate
fast-local kubernetes.io/aws-ebs Delete Immediate
slow-local kubernetes.io/aws-ebs Delete Immediate
Why this matters:
Confirms dynamic provisioning behavior and backend type.
Key Storage Class:
- aws-efs → RWX, dynamic provisioning, Immediate binding
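Describing the storage class confirms the provisioner, reclaim policy, and any mount options in one place:
kubectl describe storageclass aws-efs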
Step 5: Check Persistent Volume Claims (PVC)
Command:
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
Cloudnexus Bound pvc-XXXXXXXXXX 256Gi RWO slow-local
Cloudjenkin Bound pvc-xxxxxxxxxx 64Gi RWO slow-local
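The heapdump claim itself lives in the application namespace (shown as prd in the CLAIM column of the PV listing above), so it is worth describing it directly as well. A sketch assuming that namespace:
# Confirm the heapdump PVC is Bound and review its recent events
kubectl describe pvc heapdump-volume -n prd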
Step 6: Inspect Worker Nodes
Command:
kubectl get nodes
Observation:
- All nodes appeared Ready
- However, older nodes likely had stale or broken EFS mount connections
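Sorting nodes by creation time helps spot the long-running ones that are most likely to be carrying stale mounts:
# Oldest nodes appear first
kubectl get nodes --sort-by=.metadata.creationTimestamp -o wide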
Step 7: Cordon Existing Nodes
Command:
for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do \
kubectl cordon $X; \
done
[ec2-user@ip-xxxxxxxx ~]$ for X in $(kubectl get nodes | grep "^ip" | awk '{print $1}'); do kubectl cordon $X;done
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
node/ip-xxxxxxxxxxx.us-west-2.compute.internal cordoned
Why cordon nodes?
- Prevents new workloads from scheduling on problematic nodes
- Allows isolation of faulty infrastructure
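Cordoning is reversible; if a node later turns out to be healthy, it can be returned to service (node name below is a placeholder):
kubectl uncordon ip-xxxxxxxxxxx.us-west-2.compute.internal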
Verify nodes after cordoning:
kubectl get nodes
[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME STATUS ROLES AGE VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
Step 8: Scale the Auto Scaling Group to Add New Nodes
- New EKS worker nodes were provisioned by scaling up the Auto Scaling Group (ASG)
- The new nodes established fresh EFS mount connections
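If the scale-up is done manually rather than by the cluster autoscaler, it is roughly a one-line AWS CLI call; the ASG name and capacity below are placeholders, not values from this incident:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name <eks-worker-node-asg> \
  --desired-capacity 10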
Verification:
kubectl get nodes
[ec2-user@ip-xxxxxxxx ~]$ kubectl get node
NAME STATUS ROLES AGE VERSION
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready,SchedulingDisabled <none> 33d v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready <none> 50s v1.22.14-eks-fb459a0
ip-xxxxxxxxxxxxx.us-west-2.compute.internal Ready <none> 50s v1.22.14-eks-fb459a0
Result:
- New nodes joined the cluster in Ready state
- Healthy EFS connectivity was established
Step 9: Restart Impacted Deployments
Command:
kubectl rollout restart deployment <deployment-name> -n <namespace>
Outcome:
- Pods scheduled only on new healthy nodes
- EFS volumes mounted successfully
- Application recovered without data loss
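To confirm the restart landed on the new nodes, the rollout can be watched and the pods listed with their node assignments (namespace and deployment names as used earlier; adjust to your environment):
kubectl rollout status deployment <deployment-name> -n <namespace>
# Verify pods are now scheduled on the freshly added nodes
kubectl get pods -n prdq -o wide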
Root Cause Analysis
- Long-running worker nodes developed stale EFS mount connections
- JMSQ pod memory exhaustion triggered repeated restarts
- Kubernetes attempted to reuse unhealthy nodes
- EFS mount timeout prevented PVC attachment
Final Fix Applied
✔ Cordoned impacted nodes
✔ Added fresh worker nodes via Auto Scaling Group
✔ Restarted deployments to force rescheduling
✔ Restored healthy EFS mounts and pod stability
Recovery Flow (Fix Applied)
Detect FailedMount Errors
│
▼
Cordon Old Worker Nodes
│
▼
Auto Scaling Group adds NEW Node
│
▼
New Node establishes fresh EFS mount
│
▼
Rollout Restart Deployment
│
▼
Pods scheduled on healthy node
│
▼
Application fully recovered
Key Learnings & Best Practices
- Always check Events section in kubectl describe pod
- PVC Bound ≠ storage is usable (node-level issues matter)
- Periodically rotate EKS worker nodes
- Monitor memory usage to avoid JVM heap exhaustion
- Use rollout restart instead of deleting pods manually
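For the memory-monitoring point, a quick ad-hoc check (assuming metrics-server is installed in the cluster) looks like this:
# Show current memory and CPU usage per pod, heaviest consumers first
kubectl top pods -n prdq --sort-by=memory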
Summary
This case study demonstrates how storage + node health issues can silently break Kubernetes workloads even when cluster objects appear healthy. A structured troubleshooting approach — starting from pod events, moving to storage, and finally node isolation — helped resolve the production outage efficiently with minimal risk.
If you are running stateful workloads on EKS with EFS, proactive node lifecycle management and monitoring are critical to avoid similar failures.
Happy Learning & Reliable Kubernetes! 🚀
Follow me on LinkedIn: www.linkedin.com/in/alok-shankar-55b94826