Kubernetes Resource Isolation - 08. Linux Kernel Memory Internals for Kubernetes

October 12, 2025  4 minute read  

This is where we look under the hood at what Linux is doing while Kubernetes is trying to keep the node alive.

If you understand Segment 8, you can troubleshoot ANY Kubernetes memory issue — Java apps, page cache pressure, kernel OOMs, eviction storms, and mysterious node reboots.

SEGMENT 8 — Linux Kernel Memory Internals for Kubernetes

This segment covers:

  1. memory.current vs working set
  2. Page cache behavior
  3. Slab memory & kernel accounting
  4. Reclaim mechanism (kswapd, pressure, aging)
  5. PSI (Pressure Stall Information)
  6. OOM scoring, oom_score_adj, and the kernel’s OOM killer logic
  7. Why Kubernetes eviction fights with kernel OOM
  8. How MemoryQoS interacts with kernel reclaim
  9. Real-world troubleshooting patterns

Let’s go step-by-step.


PART 1 — memory.current vs “working set”

In cgroup v2:

memory.current
memory.stat

memory.current (the cgroup v2 successor to v1’s memory.usage_in_bytes)

Includes:

  • file cache (page cache)
  • anonymous memory (heap, stack)
  • shared memory
  • kernel memory (some parts)
  • tmpfs

Kubernetes usually cares about working set, not memory.current.

Working set approximates:

“The actively used memory that cannot be easily reclaimed.”

Kubernetes (via cAdvisor) calculates the working set as:

  working_set = usage - inactive_file

From cgroup v1/v2 stats:

memory.stat:
  file
  file_mapped
  file_dirty
  inactive_file
  active_anon
  inactive_anon

Working set excludes the inactive page cache because those pages can be reclaimed safely; note that active page cache still counts.
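
You can check the math by hand on a cgroup v2 node. The cgroup path below is illustrative (it depends on your cgroup driver and pod UID), so substitute your container’s real path:

# Illustrative path; find the real one under /sys/fs/cgroup/kubepods.slice
CG=/sys/fs/cgroup/kubepods.slice/<pod-slice>/<container-scope>

usage=$(cat "$CG/memory.current")
inactive_file=$(awk '$1 == "inactive_file" {print $2}' "$CG/memory.stat")

# This is the number kubelet eviction actually reacts to
echo "working_set: $((usage - inactive_file)) bytes"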


PART 2 — Page Cache Behavior (critical for eviction)

Linux uses unused RAM for:

  • disk caching
  • read-ahead
  • buffering I/O

This is NOT wasteful. It improves performance.

But:

Page cache counts toward memory.current, and its active portion counts toward the working set, so heavy file I/O shrinks memory.available → triggers eviction.

This is the #1 cause of unexpected evictions.
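
A minimal demonstration, run inside a container on a cgroup v2 node (this assumes a private cgroup namespace, the default with modern runtimes, and that /tmp sits on the container’s writable layer rather than a tmpfs mount):

cat /sys/fs/cgroup/memory.current      # baseline usage for this container
dd if=/dev/zero of=/tmp/blob bs=1M count=1024    # write 1 GiB through the page cache
cat /sys/fs/cgroup/memory.current      # jumps by ~1 GiB: almost all of it is page cache
grep -E '^(file|inactive_file) ' /sys/fs/cgroup/memory.stat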


How reclaim works:

Linux scans these LRU lists:

  • active_file
  • inactive_file
  • active_anon
  • inactive_anon

Priority:

  1. inactive_file (cold page cache — reclaimable)
  2. inactive_anon (swappable anon memory)
  3. active_file/active_anon (least preferred)

If nothing can be reclaimed → memory pressure escalates.
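
You can watch these lists directly; under reclaim, inactive_file shrinks first (the kubepods.slice path is illustrative and depends on your cgroup driver):

# Node-wide LRU list sizes
grep -E '^(Active|Inactive)' /proc/meminfo
# Per-cgroup equivalents
grep -E '^(active|inactive)_(anon|file) ' /sys/fs/cgroup/kubepods.slice/memory.stat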


PART 3 — Slab Memory (kernel memory)

Visible node-wide in /proc/slabinfo, or per-cgroup via the slab fields in memory.stat.

Slab includes:

  • inodes
  • dentries
  • network buffers
  • kernel objects
  • cgroup metadata

Slab can grow under:

  • high file operations
  • high network throughput
  • many pods/containers (K8s itself uses lots of slab)
  • massive directory listings
  • logging & journald

Slab is partially reclaimable but not fully (see the inspection commands below).

When slab grows too large:

  • memory.available shrinks
  • Kubelet triggers eviction
  • OR kernel OOM kills things
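
A few ways to see where slab is going. slabtop ships with procps on most distros, and the cgroup path is again illustrative:

# Top slab consumers; dentry and inode caches are usually the biggest
slabtop -o | head -n 15
# Per-cgroup slab accounting
grep '^slab' /sys/fs/cgroup/kubepods.slice/memory.stat
# How aggressively the kernel reclaims VFS caches (default 100; higher = more aggressive)
sysctl vm.vfs_cache_pressure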

PART 4 — Reclaim Mechanism (kswapd & direct reclaim)

Two main reclaim paths:

1. kswapd (background reclaim)

Triggered when:

free memory falls below the zone’s low watermark (see /proc/zoneinfo)

kswapd tries to free:

  • page cache
  • clean file-backed memory
  • inactive anonymous memory

If kswapd can keep up, node survives.


2. Direct reclaim (synchronous reclaim)

If memory pressure grows faster than kswapd can handle, the allocating process itself is forced into reclaim:

  • it sleeps
  • scans pages
  • causes latency spikes
  • can stall entire workloads

Direct reclaim failures → system goes into:

3. OOM killer

When all reclaim paths fail.
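
You can tell the two reclaim paths apart from /proc/vmstat; direct reclaim counters climbing is the warning sign:

# Background reclaim done by kswapd (normal, cheap)
grep -E '^(pgscan|pgsteal)_kswapd' /proc/vmstat
# Direct reclaim: processes paying the cost themselves (allocstall = stalled allocations)
grep -E '^(pgscan_direct|pgsteal_direct|allocstall)' /proc/vmstat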


PART 5 — PSI (Pressure Stall Information)

One of the best modern kernel features for diagnosing resource pressure.

Located in:

/proc/pressure/memory
/proc/pressure/cpu
/proc/pressure/io

Output example:

some avg10=5.00 avg60=2.00 avg300=1.00 total=12000
full avg10=0.50 avg60=0.20 avg300=0.10 total=500

Meaning:

  • some avg10=5.00 → over the last 10 seconds, at least one task was stalled on memory reclaim 5% of the time
  • full → all non-idle tasks were stalled at once, i.e. nothing was making progress
  • total → cumulative stall time in microseconds

High PSI memory → system is under severe reclaim pressure.

Kubernetes itself is only beginning to adopt PSI, but systemd already acts on it (systemd-oomd kills cgroups under sustained pressure).
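
PSI is also exported per cgroup on v2, which makes it easy to see which slice is hurting (the kubepods.slice path is illustrative):

# Node-wide
cat /proc/pressure/memory
# Just the Kubernetes pods slice
cat /sys/fs/cgroup/kubepods.slice/memory.pressure
# Poll during an incident to catch reclaim storms live
watch -n1 cat /proc/pressure/memory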


PART 6 — OOM Killer Internals (oom_score & badness)

When reclaim is exhausted:

The kernel OOM killer chooses a victim by computing a badness score for each process:

  1. roughly proportional to the task’s memory footprint (RSS, swap, page tables)
  2. adjusted by oom_score_adj (range -1000 to +1000)
  3. scoped to the offending cgroup when a memory cgroup limit is breached

You can see a process’s score:

cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj

Kubernetes sets oom_score_adj for containers:

  • Guaranteed pods: low (harder to kill)
  • Burstable: medium
  • BestEffort: very high

Actual values (from kubelet’s QoS policy):

  • Guaranteed: -997
  • Burstable: 1000 - (1000 × memoryRequest / nodeMemoryCapacity), clamped to stay between the other two classes
  • BestEffort: 1000

These match eviction priorities.
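
You can verify what kubelet applied. The pipeline below assumes a containerd-style runtime where crictl’s inspect output carries .info.pid; adjust for your runtime, and substitute a real container ID:

PID=$(crictl inspect <container-id> | jq '.info.pid')
cat /proc/$PID/oom_score_adj     # -997, a computed burstable value, or 1000
cat /proc/$PID/oom_score         # the live badness score
# Recent kernel OOM kills on the node
dmesg -T | grep -i 'killed process'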


PART 7 — Kubernetes Eviction vs Kernel OOM: who wins?

Kubelet Eviction:

  • Works at pod level
  • Evicts Pods to prevent node collapse
  • Slow compared to kernel OOM
  • Depends on periodic stats updates
  • Reacts to memory.available

Kernel OOM:

  • Instant
  • Works at process level
  • Will kill inside containers
  • Can kill kubelet or containerd (!)

Real world:

Kernel OOM often fires before kubelet can react.

This is why MemoryQoS was introduced.
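
After the fact, you can tell which mechanism fired. A minimal triage, assuming standard tooling on the node:

# Kernel OOM kills leave a trail in the kernel log
dmesg -T | grep -iE 'out of memory|oom-kill'
# Kubelet evictions leave Kubernetes events instead
kubectl get events -A --field-selector reason=Evicted
# OOM-killed containers also show up in pod status as OOMKilled
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'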


PART 8 — MemoryQoS & cgroup v2 (memory.min & memory.high)

memory.min (reserve)

Memory protected by memory.min is exempt from reclaim. MemoryQoS sets it from the container’s memory request, so a Pod cannot be reclaimed below its request.

Example:

memory.min = 500Mi

memory.high (throttling)

When a cgroup exceeds memory.high:

  • the kernel throttles its allocations
  • reclaims its pages more aggressively
  • prevents sudden OOM kills
  • serves as a “soft limit”

memory.max (hard limit)

The container cannot exceed this; breaching it triggers an OOM kill inside the cgroup.
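
With MemoryQoS enabled on a cgroup v2 node, you can read all three knobs straight off the container’s cgroup (path illustrative, systemd cgroup driver assumed):

CG=/sys/fs/cgroup/kubepods.slice/<pod-slice>/<container-scope>
cat "$CG/memory.min"    # derived from the memory request
cat "$CG/memory.high"   # the soft throttling threshold
cat "$CG/memory.max"    # the hard limit ("max" if no limit is set)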

MemoryQoS dramatically reduces:

  • sudden OOMs
  • reclaim storms
  • eviction noise

PART 9 — Real World Troubleshooting Scenarios

Scenario 1: Node pressure with high page cache

Symptoms:

  • memory.current high
  • working_set much lower (the gap is page cache)
  • PSI memory rises
  • eviction-soft triggers as active file cache inflates the working set

Fix:

  • use MemoryQoS
  • reduce file I/O
  • isolate workloads
  • reduce log rate

Scenario 2: JVM with high RSS but small heap

Common causes:

  • thread stacks
  • metaspace
  • direct buffers
  • ZGC (its multi-mapping of the heap can inflate reported RSS)
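
To confirm which of these is growing, JVM Native Memory Tracking helps; it assumes the process was started with -XX:NativeMemoryTracking=summary:

# Per-category native memory: heap, thread stacks, metaspace, direct buffers, ...
jcmd <pid> VM.native_memory summary
# Compare "Total committed" against the container's RSS to locate the gap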

Fix:

  • increase memory limit
  • set upper bounds on thread count
  • reduce page cache churn (e.g. direct I/O where the application supports it)

Scenario 3: Kubelet or containerd killed by kernel OOM

Fix:

  • increase system-reserved/kube-reserved
  • reduce Pod overcommit
  • MemoryQoS

Scenario 4: High slab (inode/dentry leaks)

Fix:

  • tune VFS caches
  • reduce directories in hostPath
  • prevent massive logging to files

Scenario 5: Direct reclaim storms

Fix:

  • increase eviction-soft threshold
  • reduce memory pressure via scaling

SEGMENT 8 SUMMARY

You now deeply understand core Linux internals relevant to Kubernetes memory behavior:

memory.current vs working set

  • working set excludes inactive_file
  • memory.current includes all usage

Page cache

  • essential but causes eviction if too large

Slab memory

  • kernel internal memory
  • growth can exhaust memory

Reclaim (kswapd/direct reclaim)

  • how kernel frees memory
  • when system stalls

PSI (pressure stall information)

  • best indicator of memory pressure

Kernel OOM

  • picks victims based on badness score
  • can kill kubelet or containerd
  • races with kubelet eviction

MemoryQoS

  • memory.min (reservation)
  • memory.high (soft limit)
  • memory.max (hard limit)

This is the deepest technical layer behind Kubernetes memory management.
