Kubernetes Resource Isolation - 08. Linux Kernel Memory Internals for Kubernetes
This is where we look under the hood at what Linux is doing while Kubernetes is trying to keep the node alive.
If you understand Segment 8, you can troubleshoot ANY Kubernetes memory issue — Java apps, page cache pressure, kernel OOMs, eviction storms, and mysterious node reboots.
SEGMENT 8 — Linux Kernel Memory Internals for Kubernetes
This segment covers:
- memory.current vs working set
- Page cache behavior
- Slab memory & kernel accounting
- Reclaim mechanism (kswapd, pressure, aging)
- PSI (Pressure Stall Information)
- OOM scoring, oom_score_adj, and the kernel’s OOM killer logic
- Why Kubernetes eviction fights with kernel OOM
- How MemoryQoS interacts with kernel reclaim
- Real-world troubleshooting patterns
Let’s go step-by-step.
PART 1 — memory.current vs “working set”
In cgroup v2, the two files that matter are memory.current (the cgroup v1 equivalent was memory.usage_in_bytes) and memory.stat.
memory.current includes:
- file cache (page cache)
- anonymous memory (heap, stack)
- shared memory
- kernel memory (some parts)
- tmpfs
Kubernetes usually cares about working set, not memory.current.
Working set approximates:
“The actively used memory that cannot be easily reclaimed.”
Kubernetes (via cAdvisor) calculates:
- working set = usage (memory.current) - inactive_file
Relevant fields from memory.stat (cgroup v1 and v2):
file
file_mapped
file_dirty
inactive_file
active_anon
inactive_anon
Working set excludes inactive page cache (inactive_file) because it can be reclaimed safely; active file cache still counts toward it.
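To see the difference on a live node, here is a minimal sketch assuming cgroup v2 (the pod slice path is illustrative and varies by QoS class and runtime):
CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/<pod>.slice
usage=$(cat "$CG/memory.current")
inactive_file=$(awk '$1 == "inactive_file" {print $2}' "$CG/memory.stat")
# Roughly what cAdvisor exports as container_memory_working_set_bytes:
echo "working set: $(( usage - inactive_file )) bytes"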
PART 2 — Page Cache Behavior (critical for eviction)
Linux uses unused RAM for:
- disk caching
- read-ahead
- buffering I/O
This is NOT wasteful. It improves performance.
But:
Page cache counts toward memory.current, and its active portion also counts toward the working set, shrinking memory.available → triggers eviction.
This is the #1 cause of unexpected evictions.
How reclaim works:
Linux scans these LRU lists:
- active_file
- inactive_file
- active_anon
- inactive_anon
Priority:
- inactive_file (cold page cache — reclaimable)
- inactive_anon (swappable anon memory)
- active_file/active_anon (least preferred)
If nothing can be reclaimed → memory pressure escalates.
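You can watch these lists node-wide in /proc/meminfo (values in kB):
grep -E '^(Active|Inactive)\((anon|file)\)' /proc/meminfo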
PART 3 — Slab Memory (kernel memory)
Slab usage is visible in:
/proc/slabinfo
memory.stat → slab (per cgroup)
Slab includes:
- inodes
- dentries
- network buffers
- kernel objects
- cgroup metadata
Slab can grow under:
- high file operations
- high network throughput
- many pods/containers (K8s itself uses lots of slab)
- massive directory listings
- logging & journald
Slab is partially reclaimable but not fully.
When slab grows too large:
- memory.available shrinks
- Kubelet triggers eviction
- OR kernel OOM kills things
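A quick look at slab, node-wide and for the kubepods cgroup (cgroup v2, path illustrative):
grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
grep '^slab' /sys/fs/cgroup/kubepods.slice/memory.stat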
PART 4 — Reclaim Mechanism (kswapd & direct reclaim)
Two main reclaim paths:
1. kswapd (background reclaim)
Triggered when:
free memory in a zone falls below the kernel's low watermark
kswapd tries to free:
- page cache
- clean file-backed memory
- inactive anonymous memory
If kswapd can keep up, node survives.
2. Direct reclaim (synchronous reclaim)
If memory pressure grows faster than kswapd can handle:
The allocating process itself is forced into reclaim:
- it scans and frees pages inline
- may sleep waiting on writeback
- causes latency spikes
- can stall entire workloads
Direct reclaim failures → system goes into:
3. OOM killer
When all reclaim paths fail.
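You can tell which path is doing the work from /proc/vmstat; rising pgscan_direct and allocstall counters mean processes are reclaiming inline:
grep -E '^(pgscan_kswapd|pgscan_direct|pgsteal_kswapd|pgsteal_direct|allocstall)' /proc/vmstat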
PART 5 — PSI (Pressure Stall Information)
One of the best modern kernel features for diagnosing resource pressure.
Located in:
/proc/pressure/memory
/proc/pressure/cpu
/proc/pressure/io
Output example:
some avg10=5.00 avg60=2.00 avg300=1.00 total=12000
full avg10=0.50 avg60=0.20 avg300=0.10 total=500
Meaning:
- some avg10=5.00 → over the last 10 seconds, tasks were stalled on memory reclaim about 5% of the time
- full → the share of time when all non-idle tasks were stalled at once (nothing made progress)
High PSI memory → system is under severe reclaim pressure.
K8s doesn’t use PSI yet, but containerd + systemd do.
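On cgroup v2 the same data is exposed per cgroup, so you can check pressure for a specific slice (path illustrative):
cat /sys/fs/cgroup/kubepods.slice/memory.pressure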
PART 6 — OOM Killer Internals (oom_score & badness)
When reclaim is exhausted:
The kernel OOM killer chooses a victim based on:
- Process RSS
- oom_score
- oom_score_adj
- badness heuristic
- memory cgroup violations
You can see a process’s score:
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj
Kubernetes sets oom_score_adj for containers:
- Guaranteed pods: low (harder to kill)
- Burstable: medium
- BestEffort: very high
Actual values:
- Guaranteed: -997
- Burstable: computed from the memory request relative to node capacity (roughly between 2 and 999; larger requests get lower scores)
- BestEffort: 1000
These match eviction priorities.
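To see how this looks on a node, an illustrative one-liner that lists the highest oom_score_adj values (BestEffort containers should float to the top):
for pid in /proc/[0-9]*; do
  printf '%s %s\n' "$(cat $pid/oom_score_adj 2>/dev/null)" "$(cat $pid/comm 2>/dev/null)"
done | sort -n | tail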
PART 7 — Kubernetes Eviction vs Kernel OOM: who wins?
Kubelet Eviction:
- Works at pod level
- Evicts Pods to prevent node collapse
- Slow compared to kernel OOM
- Depends on periodic stats updates
- Reacts to memory.available
Kernel OOM:
- Instant
- Works at process level
- Will kill inside containers
- Can kill kubelet or containerd (!)
Real world:
Kernel OOM often fires before kubelet can react.
This is why MemoryQoS was introduced.
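A quick way to confirm it was the kernel, not kubelet eviction, is the kernel log:
dmesg -T | grep -iE 'oom-kill|killed process'
journalctl -k | grep -i 'out of memory'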
PART 8 — MemoryQoS & cgroup v2 (memory.min & memory.high)
memory.min (reserve)
Memory below this amount (set from the container's request) is protected from kernel reclaim.
Example:
memory.min = 500Mi
memory.high (throttling)
When usage exceeds memory.high, the kernel:
- throttles allocations in that cgroup
- reclaims its pages more aggressively
- prevents sudden OOM kills
- effectively acts as a “soft limit”
memory.max (hard limit)
Container cannot exceed this.
MemoryQoS dramatically reduces:
- sudden OOMs
- reclaim storms
- eviction noise
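A minimal sketch of how this maps onto cgroup files when the MemoryQoS feature gate is enabled (path illustrative; the exact memory.high value depends on kubelet's memoryThrottlingFactor setting):
CG=/sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/cri-containerd-<id>.scope
cat "$CG/memory.min"   # from the container's memory request
cat "$CG/memory.high"  # roughly memoryThrottlingFactor * memory limit
cat "$CG/memory.max"   # from the container's memory limit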
PART 9 — Real World Troubleshooting Scenarios
Scenario 1: Node pressure with high page cache
Symptoms:
- memory.current high
- working_set low
- PSI memory rises
- eviction-soft triggers
Fix:
- use MemoryQoS
- reduce file I/O
- isolate workloads
- reduce log rate
Scenario 2: JVM with high RSS but small heap
Common cause:
- thread stacks
- metaspace
- direct buffers
- ZGC (its multi-mapped heap can inflate reported RSS)
Fix:
- increase memory limit
- set upper bounds on thread count
- reduce container page cache via mount flags
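For illustration, these are the kinds of JVM flags that put upper bounds on the non-heap contributors (values are placeholders, not recommendations):
java -Xmx512m -Xss512k -XX:MaxMetaspaceSize=128m -XX:MaxDirectMemorySize=64m -jar app.jar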
Scenario 3: Kubelet or containerd killed by kernel OOM
Fix:
- increase system-reserved/kube-reserved
- reduce Pod overcommit
- MemoryQoS
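Illustrative kubelet flags that keep headroom for system daemons and Kubernetes components (sizes depend on the node):
--system-reserved=memory=1Gi
--kube-reserved=memory=1Gi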
Scenario 4: High slab (inode/dentry leaks)
Fix:
- tune VFS caches
- reduce directories in hostPath
- prevent massive logging to files
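One illustrative knob: raising vm.vfs_cache_pressure makes the kernel reclaim dentry and inode caches more aggressively (default is 100):
sysctl -w vm.vfs_cache_pressure=200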
Scenario 5: Direct reclaim storms
Fix:
- increase eviction-soft threshold
- reduce memory pressure via scaling
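Illustrative kubelet flags that raise the soft eviction threshold so pods are evicted gracefully before direct reclaim kicks in (values depend on the node):
--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=memory.available=90s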
SEGMENT 8 SUMMARY
You now deeply understand core Linux internals relevant to Kubernetes memory behavior:
memory.current vs working set
- working set excludes inactive_file
- memory.current includes all usage
Page cache
- essential but causes eviction if too large
Slab memory
- kernel internal memory
- growth can exhaust memory
Reclaim (kswapd/direct reclaim)
- how kernel frees memory
- when system stalls
PSI (pressure stall information)
- best indicator of memory pressure
Kernel OOM
- picks victims based on badness score
- can kill kubelet or containerd
- races with kubelet eviction
MemoryQoS
- memory.min (reservation)
- memory.high (soft limit)
- memory.max (hard limit)
This is the deepest technical layer behind Kubernetes memory management.