Kubernetes Resource Isolation - 08. Linux Kernel Memory Internals for Kubernetes

October 12, 2025  4 minute read  

This is where we look under the hood at what Linux is doing while Kubernetes is trying to keep the node alive.

If you understand Segment 8, you can troubleshoot ANY Kubernetes memory issue — Java apps, page cache pressure, kernel OOMs, eviction storms, and mysterious node reboots.

SEGMENT 8 — Linux Kernel Memory Internals for Kubernetes

This segment covers:

  1. memory.current vs working set
  2. Page cache behavior
  3. Slab memory & kernel accounting
  4. Reclaim mechanism (kswapd, pressure, aging)
  5. PSI (Pressure Stall Information)
  6. OOM scoring, oom_score_adj, and the kernel’s OOM killer logic
  7. Why Kubernetes eviction fights with kernel OOM
  8. How MemoryQoS interacts with kernel reclaim
  9. Real-world troubleshooting patterns

Let’s go step-by-step.


PART 1 — memory.current vs “working set”

In cgroup v2:

memory.current
memory.stat

memory.current (the cgroup v2 successor to v1’s memory.usage_in_bytes)

Includes:

  • file cache (page cache)
  • anonymous memory (heap, stack)
  • shared memory
  • kernel memory (some parts)
  • tmpfs

Kubernetes usually cares about working set, not memory.current.

Working set approximates:

“The actively used memory that cannot be easily reclaimed.”

Kubernetes (via cAdvisor) calculates the working set as:

  working_set = usage - inactive_file

From cgroup v1/v2 stats:

memory.stat:
  file
  file_mapped
  file_dirty
  inactive_file
  active_anon
  inactive_anon

Working set excludes the inactive page cache because those pages can be reclaimed safely; note that active page cache still counts.
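
You can check the math by hand on a cgroup v2 node. The cgroup path below is illustrative (it depends on your cgroup driver and pod UID), so substitute your container’s real path:

# Illustrative path; find the real one under /sys/fs/cgroup/kubepods.slice
CG=/sys/fs/cgroup/kubepods.slice/<pod-slice>/<container-scope>

usage=$(cat "$CG/memory.current")
inactive_file=$(awk '$1 == "inactive_file" {print $2}' "$CG/memory.stat")

# This is the number kubelet eviction actually reacts to
echo "working_set: $((usage - inactive_file)) bytes"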


PART 2 — Page Cache Behavior (critical for eviction)

Linux uses unused RAM for:

  • disk caching
  • read-ahead
  • buffering I/O

This is NOT wasteful. It improves performance.

But:

Page cache counts toward memory.current, and its active portion counts toward the working set, so heavy file I/O shrinks memory.available → triggers eviction.

This is the #1 cause of unexpected evictions.
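
A minimal demonstration, run inside a container on a cgroup v2 node (this assumes a private cgroup namespace, the default with modern runtimes, and that /tmp sits on the container’s writable layer rather than a tmpfs mount):

cat /sys/fs/cgroup/memory.current      # baseline usage for this container
dd if=/dev/zero of=/tmp/blob bs=1M count=1024    # write 1 GiB through the page cache
cat /sys/fs/cgroup/memory.current      # jumps by ~1 GiB: almost all of it is page cache
grep -E '^(file|inactive_file) ' /sys/fs/cgroup/memory.stat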


How reclaim works:

Linux scans these LRU lists:

  • active_file
  • inactive_file
  • active_anon
  • inactive_anon

Priority:

  1. inactive_file (cold page cache — reclaimable)
  2. inactive_anon (swappable anon memory)
  3. active_file/active_anon (least preferred)

If nothing can be reclaimed → memory pressure escalates.
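
You can watch these lists directly; under reclaim, inactive_file shrinks first (the kubepods.slice path is illustrative and depends on your cgroup driver):

# Node-wide LRU list sizes
grep -E '^(Active|Inactive)' /proc/meminfo
# Per-cgroup equivalents
grep -E '^(active|inactive)_(anon|file) ' /sys/fs/cgroup/kubepods.slice/memory.stat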


PART 3 — Slab Memory (kernel memory)

Visible node-wide in /proc/slabinfo, or per-cgroup via the slab fields in memory.stat.

Slab includes:

  • inodes
  • dentries
  • network buffers
  • kernel objects
  • cgroup metadata

Slab can grow under:

  • high file operations
  • high network throughput
  • many pods/containers (K8s itself uses lots of slab)
  • massive directory listings
  • logging & journald

Slab is partially reclaimable but not fully (see the inspection commands below).

When slab grows too large:

  • memory.available shrinks
  • Kubelet triggers eviction
  • OR kernel OOM kills things
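
A few ways to see where slab is going. slabtop ships with procps on most distros, and the cgroup path is again illustrative:

# Top slab consumers; dentry and inode caches are usually the biggest
slabtop -o | head -n 15
# Per-cgroup slab accounting
grep '^slab' /sys/fs/cgroup/kubepods.slice/memory.stat
# How aggressively the kernel reclaims VFS caches (default 100; higher = more aggressive)
sysctl vm.vfs_cache_pressure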

PART 4 — Reclaim Mechanism (kswapd & direct reclaim)

Two main reclaim paths:

1. kswapd (background reclaim)

Triggered when:

free memory falls below the zone’s low watermark (see /proc/zoneinfo)

kswapd tries to free:

  • page cache
  • clean file-backed memory
  • inactive anonymous memory

If kswapd can keep up, node survives.


2. Direct reclaim (synchronous reclaim)

If memory pressure grows faster than kswapd can handle, the allocating process itself is forced into reclaim:

  • it sleeps
  • scans pages
  • causes latency spikes
  • can stall entire workloads

Direct reclaim failures → system goes into:

3. OOM killer

When all reclaim paths fail.
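
You can tell the two reclaim paths apart from /proc/vmstat; direct reclaim counters climbing is the warning sign:

# Background reclaim done by kswapd (normal, cheap)
grep -E '^(pgscan|pgsteal)_kswapd' /proc/vmstat
# Direct reclaim: processes paying the cost themselves (allocstall = stalled allocations)
grep -E '^(pgscan_direct|pgsteal_direct|allocstall)' /proc/vmstat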


PART 5 — PSI (Pressure Stall Information)

One of the best modern kernel features for diagnosing resource pressure.

Located in:

/proc/pressure/memory
/proc/pressure/cpu
/proc/pressure/io

Output example:

some avg10=5.00 avg60=2.00 avg300=1.00 total=12000
full avg10=0.50 avg60=0.20 avg300=0.10 total=500

Meaning:

  • some avg10=5.00 → over the last 10 seconds, at least one task was stalled on memory reclaim 5% of the time
  • full → all non-idle tasks were stalled at once, i.e. nothing was making progress
  • total → cumulative stall time in microseconds

High PSI memory → system is under severe reclaim pressure.

Kubernetes itself is only beginning to adopt PSI, but systemd already acts on it (systemd-oomd kills cgroups under sustained pressure).
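
PSI is also exported per cgroup on v2, which makes it easy to see which slice is hurting (the kubepods.slice path is illustrative):

# Node-wide
cat /proc/pressure/memory
# Just the Kubernetes pods slice
cat /sys/fs/cgroup/kubepods.slice/memory.pressure
# Poll during an incident to catch reclaim storms live
watch -n1 cat /proc/pressure/memory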


PART 6 — OOM Killer Internals (oom_score & badness)

When reclaim is exhausted:

The kernel OOM killer chooses a victim by computing a badness score for each process:

  1. roughly proportional to the task’s memory footprint (RSS, swap, page tables)
  2. adjusted by oom_score_adj (range -1000 to +1000)
  3. scoped to the offending cgroup when a memory cgroup limit is breached

You can see a process’s score:

cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj

Kubernetes sets oom_score_adj for containers:

  • Guaranteed pods: low (harder to kill)
  • Burstable: medium
  • BestEffort: very high

Actual values (from kubelet’s QoS policy):

  • Guaranteed: -997
  • Burstable: 1000 - (1000 × memoryRequest / nodeMemoryCapacity), clamped to stay between the other two classes
  • BestEffort: 1000

These match eviction priorities.
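
You can verify what kubelet applied. The pipeline below assumes a containerd-style runtime where crictl’s inspect output carries .info.pid; adjust for your runtime, and substitute a real container ID:

PID=$(crictl inspect <container-id> | jq '.info.pid')
cat /proc/$PID/oom_score_adj     # -997, a computed burstable value, or 1000
cat /proc/$PID/oom_score         # the live badness score
# Recent kernel OOM kills on the node
dmesg -T | grep -i 'killed process'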


PART 7 — Kubernetes Eviction vs Kernel OOM: who wins?

Kubelet Eviction:

  • Works at pod level
  • Evicts Pods to prevent node collapse
  • Slow compared to kernel OOM
  • Depends on periodic stats updates
  • Reacts to memory.available

Kernel OOM:

  • Instant
  • Works at process level
  • Will kill inside containers
  • Can kill kubelet or containerd (!)

Real world:

Kernel OOM often fires before kubelet can react.

This is why MemoryQoS was introduced.
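
After the fact, you can tell which mechanism fired. A minimal triage, assuming standard tooling on the node:

# Kernel OOM kills leave a trail in the kernel log
dmesg -T | grep -iE 'out of memory|oom-kill'
# Kubelet evictions leave Kubernetes events instead
kubectl get events -A --field-selector reason=Evicted
# OOM-killed containers also show up in pod status as OOMKilled
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'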


PART 8 — MemoryQoS & cgroup v2 (memory.min & memory.high)

memory.min (reserve)

Memory protected by memory.min is exempt from reclaim. MemoryQoS sets it from the container’s memory request, so a Pod cannot be reclaimed below its request.

Example:

memory.min = 500Mi

memory.high (throttling)

When a cgroup exceeds memory.high:

  • the kernel throttles its allocations
  • reclaims its pages more aggressively
  • prevents sudden OOM kills
  • serves as a “soft limit”

memory.max (hard limit)

The container cannot exceed this; breaching it triggers an OOM kill inside the cgroup.
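
With MemoryQoS enabled on a cgroup v2 node, you can read all three knobs straight off the container’s cgroup (path illustrative, systemd cgroup driver assumed):

CG=/sys/fs/cgroup/kubepods.slice/<pod-slice>/<container-scope>
cat "$CG/memory.min"    # derived from the memory request
cat "$CG/memory.high"   # the soft throttling threshold
cat "$CG/memory.max"    # the hard limit ("max" if no limit is set)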

MemoryQoS dramatically reduces:

  • sudden OOMs
  • reclaim storms
  • eviction noise

PART 9 — Real World Troubleshooting Scenarios

Scenario 1: Node pressure with high page cache

Symptoms:

  • memory.current high
  • working_set much lower (the gap is page cache)
  • PSI memory rises
  • eviction-soft triggers as active file cache inflates the working set

Fix:

  • use MemoryQoS
  • reduce file I/O
  • isolate workloads
  • reduce log rate

Scenario 2: JVM with high RSS but small heap

Common causes:

  • thread stacks
  • metaspace
  • direct buffers
  • ZGC (its multi-mapping of the heap can inflate reported RSS)
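
To confirm which of these is growing, JVM Native Memory Tracking helps; it assumes the process was started with -XX:NativeMemoryTracking=summary:

# Per-category native memory: heap, thread stacks, metaspace, direct buffers, ...
jcmd <pid> VM.native_memory summary
# Compare "Total committed" against the container's RSS to locate the gap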

Fix:

  • increase memory limit
  • set upper bounds on thread count
  • reduce page cache churn (e.g. direct I/O where the application supports it)

Scenario 3: Kubelet or containerd killed by kernel OOM

Fix:

  • increase system-reserved/kube-reserved
  • reduce Pod overcommit
  • MemoryQoS

Scenario 4: High slab (inode/dentry leaks)

Fix:

  • tune VFS caches
  • reduce directories in hostPath
  • prevent massive logging to files

Scenario 5: Direct reclaim storms

Fix:

  • increase eviction-soft threshold
  • reduce memory pressure via scaling

SEGMENT 8 SUMMARY

You now deeply understand core Linux internals relevant to Kubernetes memory behavior:

memory.current vs working set

  • working set excludes inactive_file
  • memory.current includes all usage

Page cache

  • essential but causes eviction if too large

Slab memory

  • kernel internal memory
  • growth can exhaust memory

Reclaim (kswapd/direct reclaim)

  • how kernel frees memory
  • when system stalls

PSI (pressure stall information)

  • best indicator of memory pressure

Kernel OOM

  • picks victims based on badness score
  • can kill kubelet or containerd
  • races with kubelet eviction

MemoryQoS

  • memory.min (reservation)
  • memory.high (soft limit)
  • memory.max (hard limit)

This is the deepest technical layer behind Kubernetes memory management.
