Kubernetes Resource Isolation - 10. Real-world Kubernetes Troubleshooting Case Studies
SEGMENT 10 — Real-world Kubernetes Troubleshooting Case Studies
We’ll cover:
- Node meltdown due to page cache pressure
- Java Pod killed despite having high memory limit
- Kubelet OOM-killed → node NotReady
- Redis latency spikes due to CPU throttling
- GPU Pod fails to start due to NUMA misalignment
- PID exhaustion on the node
- High slab usage causing eviction storms
- Multi-container Pod causing mysterious memory OOM
Let’s go through each.
CASE STUDY 1 — Node Meltdown Due to Page Cache Pressure
Symptoms:
- Node frequently enters MemoryPressure
- Pods evicted even though memory.current is low
- kubectl describe node shows eviction events
- Logs show: kubelet: eviction manager: pods evicted
- PSI memory increases (/proc/pressure/memory)
- free -h shows huge “cached” memory, small “available”
Root Cause:
Page cache explosion. Common scenarios:
- heavy IO workloads (Spark, ML feature loading, scanning)
- large image downloads
- container image GC
- CSI drivers caching data
- heavy logging to file
Much of this page cache lives outside Pod cgroups (image layers, log files, CSI caches), yet active file pages still count toward the node's working set. The kubelet's memory.available signal therefore drops and eviction triggers even though each Pod's own memory.current looks low.
Diagnosis:
- Check page cache:
cat /proc/meminfo | grep -i cache
- Check working set vs memory.current:
cat /sys/fs/cgroup/<pod>/memory.current
cat /sys/fs/cgroup/<pod>/memory.stat
- Check PSI:
cat /proc/pressure/memory
Fix:
- Enable MemoryQoS → memory.high protects working set
- Use local SSD with tuned readahead
- Reduce file logging
- Tune eviction thresholds (raise the soft threshold; see the kubelet config sketch after this list)
- Isolate IO-heavy workloads on dedicated nodes
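As a concrete starting point, here is a minimal kubelet configuration sketch (threshold values are illustrative, not recommendations) that raises the soft eviction threshold and turns on the MemoryQoS feature gate:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true              # lets the kubelet set memory.high to protect working sets
evictionSoft:
  memory.available: "1Gi"      # start soft eviction well before the hard threshold
evictionSoftGracePeriod:
  memory.available: "90s"
evictionHard:
  memory.available: "500Mi"
The soft threshold gives page cache reclaim a chance to run before Pods are actually killed.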
CASE STUDY 2 — Java Pod OOMKilled Despite High Memory Limit
Symptoms:
- Pod dies with OOMKilled
- memory.limit = 4Gi
- JVM heap = 3Gi
- No apparent memory leak
Root Cause:
Java uses:
- heap
- metaspace
- thread stacks
- direct buffers
- JIT
- compressed class space
- GC regions
- mmap’d libraries
Total RSS often exceeds -Xmx by 30–50%.
Diagnosis:
- Inside the container:
pmap <pid> | grep total
cat /proc/<pid>/status | grep Vm
- Compare RSS, -Xmx, and the container memory limit
Fix:
- Reduce thread count
- Set -XX:MaxDirectMemorySize, -Xss, -XX:MaxMetaspaceSize
- Increase the Pod memory limit
- Enable MemoryQoS
- Use Java’s container-aware options (-XX:+UseContainerSupport), as in the sketch below
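For illustration, a Pod spec fragment along these lines caps the non-heap pools through JAVA_TOOL_OPTIONS (the image name and flag values are placeholders; tune them to the workload):
apiVersion: v1
kind: Pod
metadata:
  name: java-app                       # hypothetical name
spec:
  containers:
  - name: app
    image: eclipse-temurin:21-jre      # example image
    env:
    - name: JAVA_TOOL_OPTIONS          # picked up automatically by the JVM
      value: >-
        -XX:MaxRAMPercentage=75.0
        -XX:MaxDirectMemorySize=256m
        -XX:MaxMetaspaceSize=256m
        -Xss512k
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 4Gi
Sizing the heap as a percentage of the container limit (rather than a fixed -Xmx) leaves headroom for metaspace, thread stacks, and direct buffers.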
CASE STUDY 3 — Kubelet OOM-Killed → Node NotReady
Symptoms:
- Node goes NotReady
- Many Pods rescheduled
- journalctl -u kubelet ends abruptly
- dmesg shows: Out of memory: Killed process 1234 (kubelet)
Root Cause:
No system-reserved or kube-reserved is configured, so user Pods can consume the memory the kubelet itself needs.
When the kernel comes under pressure, the kubelet is a prime OOM victim unless it is protected.
Diagnosis:
dmesg | grep -i oom
journalctl -u kubelet
free -h
Fix:
--system-reserved=cpu=500m,memory=1Gi
--kube-reserved=cpu=1,memory=2Gi
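The same reservations expressed in the kubelet configuration file (values illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: 500m
  memory: 1Gi
kubeReserved:
  cpu: "1"
  memory: 2Gi
enforceNodeAllocatable:
- pods
With reservations in place, Node Allocatable shrinks and user Pods can no longer consume the memory the kubelet and system daemons need.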
CASE STUDY 4 — Redis Latency Spikes Due to CPU Throttling
Symptoms:
- Redis p99/p999 latency spikes
- No memory OOM
- CPU usage < limit
- container_cpu_cfs_throttled_periods_total increasing
- perf record shows scheduler stalls
Root Cause:
Redis is single-threaded and needs consistent CPU cycles. With a CPU limit (even a high one), CFS quota enforcement throttles it periodically → latency jitter.
Diagnosis:
cat /sys/fs/cgroup/<pod>/cpu.stat
Look for:
nr_throttled > 0
Fix:
- Remove CPU limit
- Or use static CPU Manager (exclusive core)
- Or set limit = request exactly (Guaranteed)
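A minimal sketch of the last option (names and sizes are illustrative): a Guaranteed Pod with an integer CPU request equal to its limit, which the static CPU Manager can pin to exclusive cores:
apiVersion: v1
kind: Pod
metadata:
  name: redis                   # hypothetical name
spec:
  containers:
  - name: redis
    image: redis:7
    resources:
      requests:
        cpu: "2"                # integer CPU count
        memory: 2Gi
      limits:
        cpu: "2"                # limits == requests → Guaranteed QoS
        memory: 2Gi
This assumes the node's kubelet runs with cpuManagerPolicy: static; otherwise the Pod is still Guaranteed but gets no exclusive cores.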
CASE STUDY 5 — GPU Pod Fails Due to NUMA Misalignment
Symptoms:
- Pod startup fails with: insufficient resources: nvidia.com/gpu, despite a GPU being present
- Or the Pod starts but inference performance is terrible
Root Cause:
The GPU is attached to NUMA node X while the Pod's CPUs are allocated from NUMA node Y.
With the restricted or single-numa-node TopologyManager policy, the kubelet cannot align the resources and rejects the Pod; with a weaker policy it admits the Pod, but cross-NUMA traffic degrades performance.
Diagnosis:
Check GPU NUMA node:
nvidia-smi topo -m
Check CPU→NUMA mapping:
lscpu
Fix:
- Ensure sufficient CPUs available on same NUMA node
- Use node affinity to force workload onto specific node type
- Increase CPU reservations in proper NUMA socket
- Enable topology-aware scheduling
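If the kubelet policies themselves need adjusting, the relevant settings live in the kubelet configuration (a sketch; pick the policy that matches your latency and packing requirements):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node   # or "restricted" / "best-effort"
topologyManagerScope: pod                 # align all containers of the Pod together
single-numa-node fails Pod admission when alignment is impossible, which surfaces the problem at scheduling time instead of as degraded inference performance.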
CASE STUDY 6 — PID Exhaustion on Node
Symptoms:
- New pods fail
- Node becomes unresponsive
- “cannot fork” errors
- pids.current near the kernel limit
Root Cause:
Containers spawning too many threads (Python thread pools, Java thread leak).
Diagnosis:
cat /proc/sys/kernel/pid_max
cat /sys/fs/cgroup/pids/kubepods.slice/pids.current
Fix:
Set the kubelet flag:
--pod-max-pids=1024
This prevents a single runaway Pod from consuming all PIDs on the node; the config-file equivalent is shown below.
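The same cap in the kubelet configuration file (limit value illustrative):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024        # per-Pod cap enforced via the pids cgroup controller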
CASE STUDY 7 — High Slab Usage Causes Eviction Storm
Symptoms:
- memory.available low
- Pods evicted
- But memory.current of pods looks normal
- slabtop shows huge kernel slab usage
Common on:
- heavy NFS
- massive logging
- many open file handles
Root Cause:
Kernel slabs are not fully reclaimable. High slab → memory pressure → eviction soft/hard triggers.
Diagnosis:
slabtop
cat /proc/meminfo | grep Slab
Fix:
- Reduce inode/dentry creation
- Avoid massive directory trees
- Tune vm.vfs_cache_pressure
- Reduce log volume
- Move workloads to local SSD
CASE STUDY 8 — Multi-container Pod Causes Unexpected OOM
Symptoms:
- Pod OOMKilled
- But container memory limit was not reached
- Sidecar appears unrelated
Root Cause:
Pod-level memory cgroup limit = sum(container limits). All containers share memory.max at the pod level.
Example:
- app limit: 1Gi
- sidecar limit: 500Mi
Pod-level memory.max = 1.5Gi.
When combined usage presses against that 1.5Gi, the OOM kill is reported against the pod-level cgroup and the kernel picks a victim process, so the container that gets killed can be one that looks unrelated.
Diagnosis:
Check pod-level cgroup:
/sys/fs/cgroup/.../kubepods-burstable-podUID.slice/memory.current
Compare with sum(container limits).
Fix:
- Increase Pod memory limit
- Reduce sidecar memory usage
- Separate heavy workloads into different Pods
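To see how the pod-level limit is derived, consider a sketch like the following (names and images are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar                       # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0        # placeholder image
    resources:
      limits:
        memory: 1Gi
  - name: log-sidecar
    image: registry.example.com/sidecar:1.0    # placeholder image
    resources:
      limits:
        memory: 500Mi
# Resulting pod-level cgroup: memory.max = 1Gi + 500Mi = 1.5Gi, shared by both containers.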
BONUS CASE — CNI or CSI plugin causes hidden memory leak
Symptoms:
- Node memory slowly increases
- No pod shows increased memory.current
- Only occurs after running for days/weeks
Root Cause:
CNI plugin (Calico, Cilium) or CSI plugin stores kernel objects:
- conntrack entries
- BPF maps
- page cache for logs
- CSI FUSE mount buffers
This is kernel/slab usage outside Pod cgroups.
Diagnosis:
slabtop
cat /proc/net/nf_conntrack | wc -l
bpftool map show
Fix:
- Set the conntrack max (see the kube-proxy sketch below)
- Tune Cilium map sizes
- Update CSI driver
- Reboot node if slab fragmentation is extreme
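If kube-proxy manages the node's conntrack sizing, the ceiling can be raised in its configuration (values illustrative; kube-proxy-less Cilium setups use Cilium's own map-size settings instead):
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
conntrack:
  maxPerCore: 32768       # nf_conntrack_max ≈ maxPerCore × number of cores
  min: 131072             # floor regardless of core count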
SEGMENT 10 SUMMARY
You now understand real-world failure scenarios across:
CPU
- throttling
- noisy neighbors
- lack of pinning causing jitter
Memory
- page cache
- slab
- kernel OOM
- eviction
- pod-level vs container-level mismatch
System
- kubelet starvation
- PID exhaustion
- NUMA mismatches
- CNI/CSI component memory leaks
Root cause patterns
- page cache growth
- memory overcommit
- reclaim delays
- control plane starvation
Segment 10 is the practical SRE playbook that complements Segments 1–9.