Kubernetes Resource Isolation - 10. Real-world Kubernetes Troubleshooting Case Studies
SEGMENT 10 — Real-world Kubernetes Troubleshooting Case Studies
We’ll cover:
- Node meltdown due to page cache pressure
- Java Pod killed despite having high memory limit
- Kubelet OOM-killed → node NotReady
- Redis latency spikes due to CPU throttling
- GPU Pod fails to start due to NUMA misalignment
- PID exhaustion on the node
- High slab usage causing eviction storms
- Multi-container Pod causing mysterious memory OOM
Let’s go through each.
CASE STUDY 1 — Node Meltdown Due to Page Cache Pressure
Symptoms:
- Node frequently enters MemoryPressure
- Pods evicted even though memory.current is low
- kubectl describe node shows eviction events
- Logs show: kubelet: eviction manager: pods evicted
- PSI memory increases (/proc/pressure/memory)
- free -h shows huge “cached” memory, small “available”
Root Cause:
Page cache explosion. Common scenarios:
- heavy IO workloads (Spark, ML feature loading, scanning)
- large image downloads
- container image GC
- CSI drivers caching data
- heavy logging to file
Much of this page cache lives outside Pod cgroups (image layers, log files, CSI caches), yet active file pages still count toward the node's working set. The kubelet's memory.available signal therefore drops and eviction triggers even though each Pod's own memory.current looks low.
Diagnosis:
- Check page cache:
cat /proc/meminfo | grep -i cache
- Check working set vs memory.current:
cat /sys/fs/cgroup/<pod>/memory.current
cat /sys/fs/cgroup/<pod>/memory.stat
- Check PSI:
cat /proc/pressure/memory
Fix:
- Enable MemoryQoS → memory.high protects working set
- Use local SSD with tuned readahead
- Reduce file logging
- Tune eviction thresholds (raise the soft threshold; see the kubelet config sketch after this list)
- Isolate IO-heavy workloads on dedicated nodes
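As a concrete starting point, here is a minimal kubelet configuration sketch (threshold values are illustrative, not recommendations) that raises the soft eviction threshold and turns on the MemoryQoS feature gate:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true              # lets the kubelet set memory.high to protect working sets
evictionSoft:
  memory.available: "1Gi"      # start soft eviction well before the hard threshold
evictionSoftGracePeriod:
  memory.available: "90s"
evictionHard:
  memory.available: "500Mi"
The soft threshold gives page cache reclaim a chance to run before Pods are actually killed.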
CASE STUDY 2 — Java Pod OOMKilled Despite High Memory Limit
Symptoms:
- Pod dies with OOMKilled
- memory.limit = 4Gi
- JVM heap = 3Gi
- No apparent memory leak
Root Cause:
Java uses:
- heap
- metaspace
- thread stacks
- direct buffers
- JIT
- compressed class space
- GC regions
- mmap’d libraries
Total RSS often exceeds -Xmx by 30–50%.
Diagnosis:
- Inside the container:
pmap <pid> | grep total
cat /proc/<pid>/status | grep Vm
- Compare RSS, -Xmx, and the container memory limit
Fix:
- Reduce thread count
- Set -XX:MaxDirectMemorySize, -Xss, -XX:MaxMetaspaceSize
- Increase the Pod memory limit
- Enable MemoryQoS
- Use Java’s container-aware options (-XX:+UseContainerSupport), as in the sketch below
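For illustration, a Pod spec fragment along these lines caps the non-heap pools through JAVA_TOOL_OPTIONS (the image name and flag values are placeholders; tune them to the workload):
apiVersion: v1
kind: Pod
metadata:
  name: java-app                       # hypothetical name
spec:
  containers:
  - name: app
    image: eclipse-temurin:21-jre      # example image
    env:
    - name: JAVA_TOOL_OPTIONS          # picked up automatically by the JVM
      value: >-
        -XX:MaxRAMPercentage=75.0
        -XX:MaxDirectMemorySize=256m
        -XX:MaxMetaspaceSize=256m
        -Xss512k
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 4Gi
Sizing the heap as a percentage of the container limit (rather than a fixed -Xmx) leaves headroom for metaspace, thread stacks, and direct buffers.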
CASE STUDY 3 — Kubelet OOM-Killed → Node NotReady
Symptoms:
- Node goes NotReady
- Many Pods rescheduled
- journalctl -u kubelet ends abruptly
- dmesg shows: Out of memory: Killed process 1234 (kubelet)
Root Cause:
No system-reserved or kube-reserved is configured, so user Pods can consume the memory the kubelet itself needs.
When the kernel comes under pressure, the kubelet is a prime OOM victim unless it is protected.
Diagnosis:
dmesg | grep -i oom
journalctl -u kubelet
free -h
Fix:
--system-reserved=cpu=500m,memory=1Gi
--kube-reserved=cpu=1,memory=2Gi
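The same reservations expressed in the kubelet configuration file (values illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: 500m
  memory: 1Gi
kubeReserved:
  cpu: "1"
  memory: 2Gi
enforceNodeAllocatable:
- pods
With reservations in place, Node Allocatable shrinks and user Pods can no longer consume the memory the kubelet and system daemons need.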
CASE STUDY 4 — Redis Latency Spikes Due to CPU Throttling
Symptoms:
- Redis p99/p999 latency spikes
- No memory OOM
- CPU usage < limit
- container_cpu_cfs_throttled_periods_total increasing
- perf record shows scheduler stalls
Root Cause:
Redis is single-threaded and needs consistent CPU cycles. With a CPU limit (even a high one), CFS quota enforcement throttles it periodically → latency jitter.
Diagnosis:
cat /sys/fs/cgroup/<pod>/cpu.stat
Look for:
nr_throttled > 0
Fix:
- Remove CPU limit
- Or use static CPU Manager (exclusive core)
- Or set limit = request exactly (Guaranteed)
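A minimal sketch of the last option (names and sizes are illustrative): a Guaranteed Pod with an integer CPU request equal to its limit, which the static CPU Manager can pin to exclusive cores:
apiVersion: v1
kind: Pod
metadata:
  name: redis                   # hypothetical name
spec:
  containers:
  - name: redis
    image: redis:7
    resources:
      requests:
        cpu: "2"                # integer CPU count
        memory: 2Gi
      limits:
        cpu: "2"                # limits == requests → Guaranteed QoS
        memory: 2Gi
This assumes the node's kubelet runs with cpuManagerPolicy: static; otherwise the Pod is still Guaranteed but gets no exclusive cores.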
CASE STUDY 5 — GPU Pod Fails Due to NUMA Misalignment
Symptoms:
- Pod startup fails with: insufficient resources: nvidia.com/gpu, despite a GPU being present
- Or the Pod starts but inference performance is terrible
Root Cause:
The GPU is attached to NUMA node X while the Pod's CPUs are allocated from NUMA node Y.
With the restricted or single-numa-node TopologyManager policy, the kubelet cannot align the resources and rejects the Pod; with a weaker policy it admits the Pod, but cross-NUMA traffic degrades performance.
Diagnosis:
Check GPU NUMA node:
nvidia-smi topo -m
Check CPU→NUMA mapping:
lscpu
Fix:
- Ensure sufficient CPUs available on same NUMA node
- Use node affinity to force workload onto specific node type
- Increase CPU reservations in proper NUMA socket
- Enable topology-aware scheduling
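If the kubelet policies themselves need adjusting, the relevant settings live in the kubelet configuration (a sketch; pick the policy that matches your latency and packing requirements):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node   # or "restricted" / "best-effort"
topologyManagerScope: pod                 # align all containers of the Pod together
single-numa-node fails Pod admission when alignment is impossible, which surfaces the problem at scheduling time instead of as degraded inference performance.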
CASE STUDY 6 — PID Exhaustion on Node
Symptoms:
- New pods fail
- Node becomes unresponsive
- “cannot fork” errors
- pids.current near the kernel limit
Root Cause:
Containers spawning too many threads (Python thread pools, Java thread leak).
Diagnosis:
cat /proc/sys/kernel/pid_max
cat /sys/fs/cgroup/pids/kubepods.slice/pids.current
Fix:
Set the kubelet flag:
--pod-max-pids=1024
This prevents a single runaway Pod from consuming all PIDs on the node; the config-file equivalent is shown below.
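The same cap in the kubelet configuration file (limit value illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024        # per-Pod cap enforced via the pids cgroup controller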
CASE STUDY 7 — High Slab Usage Causes Eviction Storm
Symptoms:
- memory.available low
- Pods evicted
- But memory.current of pods looks normal
- slabtop shows huge kernel slab usage
Common on:
- heavy NFS
- massive logging
- many open file handles
Root Cause:
Kernel slabs are not fully reclaimable. High slab → memory pressure → eviction soft/hard triggers.
Diagnosis:
slabtop
cat /proc/meminfo | grep Slab
Fix:
- Reduce inode/dentry creation
- Avoid massive directory trees
- Tune vm.vfs_cache_pressure
- Reduce log volume
- Move workloads to local SSD
CASE STUDY 8 — Multi-container Pod Causes Unexpected OOM
Symptoms:
- Pod OOMKilled
- But container memory limit was not reached
- Sidecar appears unrelated
Root Cause:
Pod-level memory cgroup limit = sum(container limits). All containers share memory.max at the pod level.
Example:
- app limit: 1Gi
- sidecar limit: 500Mi
Pod-level memory.max = 1.5Gi.
When combined usage presses against that 1.5Gi, the OOM kill is reported against the pod-level cgroup and the kernel picks a victim process, so the container that gets killed can be one that looks unrelated.
Diagnosis:
Check pod-level cgroup:
/sys/fs/cgroup/.../kubepods-burstable-podUID.slice/memory.current
Compare with sum(container limits).
Fix:
- Increase Pod memory limit
- Reduce sidecar memory usage
- Separate heavy workloads into different Pods
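To see how the pod-level limit is derived, consider a sketch like the following (names and images are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar                       # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0        # placeholder image
    resources:
      limits:
        memory: 1Gi
  - name: log-sidecar
    image: registry.example.com/sidecar:1.0    # placeholder image
    resources:
      limits:
        memory: 500Mi
# Resulting pod-level cgroup: memory.max = 1Gi + 500Mi = 1.5Gi, shared by both containers.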
BONUS CASE — CNI or CSI plugin causes hidden memory leak
Symptoms:
- Node memory slowly increases
- No pod shows increased memory.current
- Only occurs after running for days/weeks
Root Cause:
CNI plugin (Calico, Cilium) or CSI plugin stores kernel objects:
- conntrack entries
- BPF maps
- page cache for logs
- CSI FUSE mount buffers
This is kernel/slab usage outside Pod cgroups.
Diagnosis:
slabtop
cat /proc/net/nf_conntrack | wc -l
bpftool map show
Fix:
- Set the conntrack max (see the kube-proxy sketch below)
- Tune Cilium map sizes
- Update CSI driver
- Reboot node if slab fragmentation is extreme
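If kube-proxy manages the node's conntrack sizing, the ceiling can be raised in its configuration (values illustrative; kube-proxy-less Cilium setups use Cilium's own map-size settings instead):
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
conntrack:
  maxPerCore: 32768       # nf_conntrack_max ≈ maxPerCore × number of cores
  min: 131072             # floor regardless of core count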
SEGMENT 10 SUMMARY
You now understand real-world failure scenarios across:
CPU
- throttling
- noisy neighbors
- lack of pinning causing jitter
Memory
- page cache
- slab
- kernel OOM
- eviction
- pod-level vs container-level mismatch
System
- kubelet starvation
- PID exhaustion
- NUMA mismatches
- CNI/CSI component memory leaks
Root cause patterns
- page cache growth
- memory overcommit
- reclaim delays
- control plane starvation
Segment 10 is the practical SRE playbook that complements Segments 1–9.