Kubernetes Resource Isolation - 10. Real-world Kubernetes Troubleshooting Case Studies

October 14, 2025  4 minute read  

SEGMENT 10 — Real-world Kubernetes Troubleshooting Case Studies

We’ll cover:

  1. Node meltdown due to page cache pressure
  2. Java Pod killed despite having high memory limit
  3. Kubelet OOM-killed → node NotReady
  4. Redis latency spikes due to CPU throttling
  5. GPU Pod fails to start due to NUMA misalignment
  6. PID exhaustion on the node
  7. High slab usage causing eviction storms
  8. Multi-container Pod causing mysterious memory OOM

Let’s go through each.


CASE STUDY 1 — Node Meltdown Due to Page Cache Pressure

Symptoms:

  • Node frequently enters MemoryPressure
  • Pods evicted even though memory.current is low
  • kubectl describe node shows eviction events
  • Logs show:

    kubelet: eviction manager: pods evicted
    
  • PSI memory increases (/proc/pressure/memory)
  • free -h shows huge “cached” memory, small “available”

Root Cause:

Page cache explosion. Common scenarios:

  • heavy IO workloads (Spark, ML feature loading, scanning)
  • large image downloads
  • container image GC
  • CSI drivers caching data
  • heavy logging to file

Page cache counts toward the node's memory usage, so kubelet's memory.available signal drops even though much of the cache is reclaimable and is not part of the Pods' actual working sets.

Diagnosis:

  1. Check page cache:

    cat /proc/meminfo | grep -i cache
    
  2. Check working set vs memory.current:

    cat /sys/fs/cgroup/<pod>/memory.current
    cat /sys/fs/cgroup/<pod>/memory.stat
    
  3. PSI:

    cat /proc/pressure/memory
    

Fix:

  • Enable MemoryQoS → memory.high protects working set
  • Use local SSD with tuned readahead
  • Reduce file logging
  • Tune eviction thresholds (higher soft threshold)
  • Isolate IO-heavy workloads on dedicated nodes
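
A minimal sketch of the eviction-threshold tuning above, using kubelet flags (the values are illustrative; size them to the node):

    --eviction-soft=memory.available<1Gi
    --eviction-soft-grace-period=memory.available=2m
    --eviction-hard=memory.available<500Mi

The same settings can also live in the kubelet config file as evictionSoft, evictionSoftGracePeriod, and evictionHard.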

CASE STUDY 2 — Java Pod OOMKilled Despite High Memory Limit

Symptoms:

  • Pod dies with:

    OOMKilled
    
  • memory.limit = 4Gi
  • JVM heap = 3Gi
  • No apparent memory leak

Root Cause:

Java uses:

  • heap
  • metaspace
  • thread stacks
  • direct buffers
  • JIT
  • compressed class space
  • GC regions
  • mmap’d libraries

Total RSS often exceeds -Xmx by 30–50%.

Diagnosis:

  1. Inside container:

    pmap <pid> | grep total
    cat /proc/<pid>/status | grep Vm
    
  2. Compare:

    • RSS
    • Xmx
    • total limit
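
If the JVM was started with Native Memory Tracking enabled (it adds some overhead), jcmd can break the off-heap usage down by category; a sketch:

    # Requires the JVM to be launched with -XX:NativeMemoryTracking=summary
    jcmd <pid> VM.native_memory summary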

Fix:

  • Reduce thread count
  • Set:

    -XX:MaxDirectMemorySize
    -Xss
    -XX:MaxMetaspaceSize
    
  • Increase Pod memory limit
  • Enable MemoryQoS
  • Use Java’s container-aware options (-XX:+UseContainerSupport)
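
A hedged example of a container-aware JVM command line that bounds the off-heap pools (flag values are illustrative; app.jar is a placeholder):

    java -XX:+UseContainerSupport \
         -XX:MaxRAMPercentage=75.0 \
         -XX:MaxMetaspaceSize=256m \
         -XX:MaxDirectMemorySize=256m \
         -Xss512k \
         -jar app.jar

With MaxRAMPercentage the heap is sized relative to the container's memory limit, which leaves headroom for metaspace, thread stacks, and direct buffers.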

CASE STUDY 3 — Kubelet OOM-Killed → Node NotReady

Symptoms:

  • Node goes NotReady
  • Many Pods rescheduled
  • journalctl -u kubelet ends abruptly
  • dmesg shows:

    Out of memory: Killed process 1234 (kubelet)
    

Root Cause:

Neither system-reserved nor kube-reserved is configured.

User Pods starve the kubelet of memory.

When the kernel comes under memory pressure, the kubelet is a prime OOM victim unless it is protected.

Diagnosis:

dmesg | grep -i oom
journalctl -u kubelet
free -h

Fix:

--system-reserved=cpu=500m,memory=1Gi
--kube-reserved=cpu=1,memory=2Gi
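
After restarting the kubelet with these flags, Allocatable should shrink relative to Capacity; a quick check (node name is a placeholder):

    kubectl describe node <node> | grep -A 7 -E 'Capacity|Allocatable'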

CASE STUDY 4 — Redis Latency Spikes Due to CPU Throttling

Symptoms:

  • Redis p99/p999 latency spikes
  • No memory OOM
  • CPU usage < limit
  • container_cpu_cfs_throttled_periods_total increasing
  • perf record shows scheduler stalls

Root Cause:

Redis is single-threaded and needs consistent CPU time. Even a high CPU limit leads to periodic CFS throttling, which shows up as latency jitter.

Diagnosis:

cat /sys/fs/cgroup/<pod>/cpu.stat

Look for:

nr_throttled > 0

Fix:

  • Remove CPU limit
  • Or use static CPU Manager (exclusive core)
  • Or set limit = request exactly (Guaranteed)
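
A sketch of the exclusive-core route, assuming the node runs the static CPU Manager and the workload is a Deployment named redis (both are assumptions here): requests must equal limits and the CPU count must be a whole number for the container to receive pinned cores.

    kubectl set resources deployment redis \
      --requests=cpu=2,memory=4Gi \
      --limits=cpu=2,memory=4Gi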

CASE STUDY 5 — GPU Pod Fails Due to NUMA Misalignment

Symptoms:

  • Pod startup fails with:

    insufficient resources: nvidia.com/gpu
    
  • Despite GPU being present.
  • Or Pod starts but inference performance is terrible.

Root Cause:

The GPU is attached to NUMA node X, but the Pod's CPUs are allocated from NUMA node Y. With the Topology Manager policy set to restricted or single-numa-node, the kubelet rejects the Pod because the resources cannot be aligned.

Diagnosis:

Check GPU NUMA node:

nvidia-smi topo -m

Check CPU→NUMA mapping:

lscpu
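
It also helps to confirm the per-NUMA CPU layout and the kubelet's Topology Manager policy; a sketch, assuming the common kubelet config path:

    numactl --hardware
    grep -i topologyManagerPolicy /var/lib/kubelet/config.yaml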

Fix:

  • Ensure sufficient CPUs available on same NUMA node
  • Use node affinity to force workload onto specific node type
  • Increase CPU reservations in proper NUMA socket
  • Enable topology-aware scheduling
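
For reference, the kubelet settings that enable this alignment in the first place look roughly like this (whether single-numa-node fits depends on the workload mix):

    --cpu-manager-policy=static
    --topology-manager-policy=single-numa-node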

CASE STUDY 6 — PID Exhaustion on Node

Symptoms:

  • New pods fail
  • Node becomes unresponsive
  • cannot fork errors
  • pids.current near kernel limit

Root Cause:

Containers spawning too many processes or threads (Python thread pools, Java thread leaks); every thread consumes a PID.

Diagnosis:

cat /proc/sys/kernel/pid_max
cat /sys/fs/cgroup/kubepods.slice/pids.current
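
To find which Pod is burning PIDs, rank the pod cgroups by pids.current; a sketch assuming the systemd/cgroup-v2 layout used above:

    # Highest PID counts first
    for f in $(find /sys/fs/cgroup/kubepods.slice -name pids.current); do
      printf '%8s  %s\n' "$(cat "$f")" "$f"
    done | sort -rn | head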

Fix:

Set:

--pod-max-pids=1024

This prevents a runaway Pod from consuming all PIDs on the node.


CASE STUDY 7 — High Slab Usage Causes Eviction Storm

Symptoms:

  • memory.available low
  • Pods evicted
  • But memory.current of pods looks normal
  • slabtop shows huge kernel slab usage
  • Common on:

    • heavy NFS
    • massive logging
    • many open file handles

Root Cause:

Kernel slabs are not fully reclaimable. High slab → memory pressure → eviction soft/hard triggers.

Diagnosis:

slabtop
cat /proc/meminfo | grep Slab

Fix:

  • Reduce inode/dentry creation
  • Avoid massive directory trees
  • Tune vm.vfs_cache_pressure
  • Reduce log volume
  • Move workloads to local SSD
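
For the vm.vfs_cache_pressure tuning above, a minimal sketch (the value is illustrative; anything above the default of 100 makes the kernel reclaim dentry/inode caches more aggressively, so test before rolling out):

    sysctl vm.vfs_cache_pressure          # current value, default 100
    sysctl -w vm.vfs_cache_pressure=200   # reclaim dentries/inodes more aggressively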

CASE STUDY 8 — Multi-container Pod Causes Unexpected OOM

Symptoms:

  • Pod OOMKilled
  • But container memory limit was not reached
  • Sidecar appears unrelated

Root Cause:

The pod-level memory cgroup limit is the sum of the container limits, and all containers share that pod-level memory.max.

Example:

  • app limit: 1Gi
  • sidecar limit: 500Mi

Pod-level memory.max = 1.5Gi.

If combined usage (app plus sidecar plus any shared tmpfs or page cache charged to the pod) reaches 1.5Gi, the kernel OOM-kills a process at the pod level, often in a container that never hit its own limit.

Diagnosis:

Check pod-level cgroup:

/sys/fs/cgroup/.../kubepods-burstable-podUID.slice/memory.current

Compare with sum(container limits).
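
If metrics-server is installed, the per-container view makes the comparison easier; a sketch (pod name is a placeholder):

    kubectl top pod <pod> --containers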

Fix:

  • Increase Pod memory limit
  • Reduce sidecar memory usage
  • Separate heavy workloads into different Pods

BONUS CASE — CNI or CSI plugin causes hidden memory leak

Symptoms:

  • Node memory slowly increases
  • No pod shows increased memory.current
  • Only occurs after running for days/weeks

Root Cause:

A CNI plugin (Calico, Cilium) or CSI plugin accumulates kernel-side objects:

  • conntrack entries
  • BPF maps
  • page cache for logs
  • CSI FUSE mount buffers

This is kernel/slab usage outside Pod cgroups.

Diagnosis:

slabtop
cat /proc/net/nf_conntrack | wc -l
bpftool map show
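
If /proc/net/nf_conntrack is not exposed on the node, the conntrack counters are also available as sysctls:

    sysctl net.netfilter.nf_conntrack_count
    sysctl net.netfilter.nf_conntrack_max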

Fix:

  • Set conntrack max
  • Tune Cilium map sizes
  • Update CSI driver
  • Reboot node if slab fragmentation is extreme
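
A minimal sketch of raising the conntrack ceiling (the value is illustrative; persist it via a sysctl.d drop-in or, for Cilium/Calico, via the plugin's own configuration):

    sysctl -w net.netfilter.nf_conntrack_max=262144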

SEGMENT 10 SUMMARY

You now understand real-world failure scenarios across:

CPU

  • throttling
  • noisy neighbors
  • lack of pinning causing jitter

Memory

  • page cache
  • slab
  • kernel OOM
  • eviction
  • pod-level vs container-level mismatch

System

  • kubelet starvation
  • PID exhaustion
  • NUMA mismatches
  • CNI/CSI component memory leaks

Root cause patterns

  • page cache growth
  • memory overcommit
  • reclaim delay
  • control plane starvation

Segment 10 is the practical SRE playbook that complements Segments 1–9.
