Kubernetes Resource Isolation - 09. Full Node Resource Isolation Architecture

October 13, 2025  5 minute read  

Segment 9 is the “grand unification” segment, where we combine everything from Segments 1–8 into a single mental model:

How a Kubernetes node actually works internally as a resource-managed system, from hardware → kernel → cgroups → kubelet → Pods. This is the architecture senior SREs, kernel engineers, and AI infrastructure teams use to design production-grade clusters with predictable performance and stability.

SEGMENT 9 — Full Node Resource Isolation Architecture

We will integrate:

  • CPU (requests, limits, throttling, shares, CPUManager, cpuset)
  • Memory (limits, working set, eviction, MemoryQoS, OOM)
  • I/O (blkio/io fairness)
  • PIDs (pid exhaustion)
  • HugePages
  • NUMA alignment (TopologyManager)
  • Node Allocatable + Reservations
  • Kubelet, container runtime, systemd
  • Linux kernel internals (page cache, slabs, PSI)
  • Device plugins (GPU, SR-IOV, RDMA)

And build the definitive model of how everything interacts.


PART 1 — The Node Stack Overview

We start with the stack diagram:

+------------------------------------------------------+
|                  USER WORKLOAD PODs                  |
|    - Containers (cgroups-v2 isolation)               |
|    - CPU quota/shares/pinning                        |
|    - Memory limits/requests/QoS                      |
+-------------------- kubelet -------------------------+
|    - Applies Pod cgroup settings                     |
|    - Enforces QoS, eviction, CPUManager, NodeAlloc   |
|    - container runtime API (CRI)                     |
+---------------- container runtime -------------------+
|    - Manages namespaces, seccomp, mounts             |
|    - Talks to systemd for cgroup changes             |
+---------------------- systemd ------------------------+
|    - Manages cgroup hierarchy (system.slice, pods)   |
+---------------------- Linux Kernel -------------------+
|    - cgroups: cpu, cpuset, memory, io, pids          |
|    - reclaim, PSI, page cache, slab, OOM killer      |
|    - NUMA topology                                   |
+----------------------- Hardware ----------------------+
|    CPU cores, memory nodes, disks, NICs              |
+------------------------------------------------------+

Everything below the kubelet is Linux kernel machinery.

Everything above is Kubernetes policy.


PART 2 — CPU Isolation Architecture (Integrated Model)

CPU in Kubernetes is controlled at 4 layers.

Layer 1 — CPU Requests → cpu.weight

Sets fairness during contention.

Pods “compete” proportionally based on their CPU requests. The kubelet converts the request into CPU shares, which the container runtime maps onto cgroup v2 cpu.weight:

shares = cpu.request (millicores) * 1024 / 1000
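
As a rough sketch (the cgroup path and Pod UID below are illustrative, and the exact shares-to-weight conversion is a runtime/systemd implementation detail), a 500m request works out to roughly:

# requests.cpu: 500m  ->  shares = 500 * 1024 / 1000 = 512
# cgroup v2 weight is approximately 1 + (512 - 2) * 9999 / 262142, i.e. about 20
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cpu.weight
# 20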

Layer 2 — CPU Limits → CFS quota

Sets the absolute ceiling per CFS period (100 ms by default):

cpu.cfs_quota_us = limit (cores) × cfs_period_us (100000 µs)
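
On cgroup v2 the same setting lands in the pod cgroup's cpu.max file as "<quota> <period>"; a quick sanity check (path illustrative):

# limits.cpu: 500m -> quota of 50000 us per 100000 us period
cat /sys/fs/cgroup/kubepods.slice/.../cpu.max
# 50000 100000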

Layer 3 — CPU Manager

If enabled (static):

  • whole-integer CPU requests → exclusive CPUs
  • cpuset.cpus = “X-Y”
  • no throttling (no CFS quota)
  • no noisy neighbor interference
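
A minimal sketch of a Pod that qualifies for exclusive cores under the static policy (name and image are placeholders): it must be Guaranteed QoS with integer CPU values.

apiVersion: v1
kind: Pod
metadata:
  name: pinned-worker              # illustrative name
spec:
  containers:
  - name: app
    image: example.com/app:latest  # placeholder image
    resources:
      requests:
        cpu: "4"        # integer CPU, requests == limits -> Guaranteed QoS
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi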

Layer 4 — Topology Manager

Ensures that:

  • CPU cores
  • hugepages
  • devices (GPU/NIC queues)

are all allocated from the same NUMA node.

This layer prevents:

  • cross-socket memory latency
  • inconsistent performance
  • jitter in AI and network workloads.
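
To verify alignment on a live node, you can compare the NUMA layout with what a container's cgroup was actually given (paths are illustrative, cgroup v2):

lscpu | grep -i numa                                          # NUMA node count and CPU ranges
cat /sys/fs/cgroup/kubepods.slice/.../cpuset.cpus.effective   # CPUs the container may run on
cat /sys/fs/cgroup/kubepods.slice/.../cpuset.mems.effective   # NUMA memory nodes it may allocate from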

PART 3 — Memory Isolation Architecture

Memory has more interacting parts than CPU.

Kubernetes uses:

1. memory.max

Hard limit — violation → container OOMKilled.

2. memory.min (MemoryQoS)

Protects each Pod's requested memory: the kernel reclaims it only after unprotected memory is exhausted.

3. memory.high (soft limit)

Triggers reclaim and allocation throttling once usage crosses memory.high, well before memory.max is reached.
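
Roughly, with MemoryQoS enabled the kubelet derives these files from the Pod spec; the throttling factor is configurable and version-dependent, so treat this mapping as a sketch:

# memory.min  ≈ container memory request
# memory.high ≈ container memory limit * memoryThrottlingFactor (e.g. 0.9); not set for Guaranteed Pods
# memory.max  = container memory limit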

4. Dynamic working set

kubelet uses:

memory.current - inactive_file

to determine working memory.
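
You can reproduce that calculation by hand against a Pod's cgroup (path and UID are illustrative):

CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice
CUR=$(cat "$CG/memory.current")
INACTIVE=$(awk '/^inactive_file /{print $2}' "$CG/memory.stat")
echo $(( CUR - INACTIVE ))    # approximate working set, in bytes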

5. Eviction thresholds

Use the node-level memory.available signal (node capacity minus the root cgroup's working set), not per-Pod cgroup limits, to decide on pre-OOM eviction.

6. Node Allocatable

Prevents workloads from starving system daemons (kubelet, container runtime, systemd, sshd).

7. Kernel OOM

Kills processes when reclaim fails.

All of these layers interact:

Pod Memory Limit
      |
      V
memory.max     → hard OOM
      |
      V
memory.high    → kernel throttles/reclaims
      |
      V
memory.min     → reserves working set
      |
      V
Node Eviction  (memory.available < threshold)
      |
      V
Kernel OOM     (when reclaim fails)

This is the complete memory isolation flow.


PART 4 — I/O, PIDs, HugePages

These supplement CPU/memory.

I/O (blkio/io controller)

Kubernetes does not expose I/O as a schedulable resource; the io controller is used mainly for fairness (io.weight) rather than per-Pod caps.

PIDs

Prevents PID exhaustion (fork bombs, runaway process trees) from melting the node:

pids.max = <limit>

HugePages

Separate cgroup controller:

hugetlb.2MB.max = X
hugetlb.1GB.max = Y

No overcommit allowed.

Used heavily in:

  • DPDK
  • Redis
  • ML inference
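
A container consumes hugepages through the hugepages-<size> resource; a minimal sketch (sizes and values illustrative), where hugepages requests must equal limits and cpu/memory must also be set:

resources:
  requests:
    hugepages-2Mi: 1Gi
    cpu: "1"
    memory: 2Gi
  limits:
    hugepages-2Mi: 1Gi     # hugepages requests must equal limits (no overcommit)
    cpu: "1"
    memory: 2Gi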

PART 5 — Node Allocatable Model

Node resources are carved:

                        TOTAL NODE MEMORY
  +---------------------------------------------------------+
  |   system-reserved                                       |
  +---------------------------------------------------------+
  |   kube-reserved                                          |
  +---------------------------------------------------------+
  |   eviction thresholds (memory.available must stay above) |
  +---------------------------------------------------------+
  |   Node Allocatable (scheduler target)                    |
  +---------------------------------------------------------+

Pod usage should never push memory.available below the eviction thresholds.

If they do:

  • kubelet evicts Pods
  • or kernel OOM kills things

Node Allocatable ensures scheduler doesn’t overpack.
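
Roughly, Allocatable = Capacity − system-reserved − kube-reserved − hard eviction threshold. You can compare the two on a live node (node name is a placeholder):

kubectl get node <node-name> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
# Capacity is raw hardware; Allocatable is what the scheduler packs against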


PART 6 — How All Isolation Components Interact

Here is the full lifecycle of a Pod on a node:

Step 1 — Scheduler

Decides based on:

  • requests
  • node allocatable
  • taints/tolerations
  • NUMA hints (Topology Aware Scheduling in future)

Step 2 — Kubelet

Creates:

  • Pod cgroup
  • Container cgroups

Applies:

  • cpu.weight
  • cpu.cfs_quota_us
  • memory.max
  • memory.high/min (MemoryQoS)
  • pids.max
  • cpuset.cpus (CPUManager)
  • cpuset.mems (TopologyManager)

Kubelet does not do enforcement — kernel does.
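
To see exactly what was written for one Pod, inspect its cgroup directly (slice names and UID are illustrative, cgroup v2):

CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice
grep . "$CG"/cpu.weight "$CG"/cpu.max "$CG"/memory.max "$CG"/memory.high "$CG"/memory.min "$CG"/pids.max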


Step 3 — Container runtime

(containerd/CRI-O)

  • Creates namespaces
  • Launches container process inside cgroup
  • systemd applies the cgroup settings (when the systemd cgroup driver is used)

Step 4 — Linux Kernel

Enforces:

  • CPU throttling
  • CPU pinning
  • Memory allocation
  • Memory reclaim
  • Direct reclaim
  • PSI pressure
  • blkio/io fairness
  • PIDs limit
  • HugePages accounting
  • NUMA locality

Step 5 — Kubelet Monitoring

Every ~10 seconds (the default housekeeping/eviction monitoring interval), the kubelet:

  • reads cgroup stats
  • calculates working set
  • checks eviction thresholds
  • updates Node conditions
  • evicts Pods if necessary
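
The outcome of that loop is visible in the node's conditions and in eviction events, for example:

kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
kubectl get events -A --field-selector reason=Evicted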

Step 6 — Kernel OOM Killer

If reclaim fails:

  • kills the process with the highest badness score
  • may kill kubelet or container runtime
  • node may go NotReady
  • Pods reschedule elsewhere

This is where node meltdown happens.


PART 7 — Putting Everything Together (Node Architecture Diagram)

+-------------------------------------------------------------+
|                         USER PODS                           |
|                                                             |
|  - CPU cgroup (weight, quota, cpuset, pinned CPUs)          |
|  - Memory cgroup (max, high, min)                           |
|  - I/O cgroup (io.max/weight via runtime)                   |
|  - PID cgroup (pids.max)                                    |
|  - HugePages cgroup (hugetlb.X.max)                         |
+------------------------------ kubelet ----------------------+
|  Applies QoS, limits, MemoryQoS, CPUManager, topology rules |
|  Calculates working set, eviction, node conditions          |
+------------------------ container runtime ------------------+
|  Creates namespaces & cgroups, delegates to systemd         |
+----------------------------- systemd -----------------------+
|  Manages full cgroup hierarchy (kubepods.slice)             |
+----------------------------- kernel ------------------------+
|  PSI: detects memory pressure                               |
|  kswapd: reclaim pages                                       |
|  OOM killer: last resort                                     |
|  Apply quota, pinning, hugepages, pids                      |
|  NUMA: memory locality, CPU locality                        |
+----------------------------- hardware ----------------------+
|  CPU cores, NUMA nodes, memory, disks, NIC                  |
+-------------------------------------------------------------+

PART 8 — How to Build the Perfect Node Isolation Template (Recommended)

Here is the ideal configuration for a production-grade cluster:

CPU:

cpuManagerPolicy: static
cpuCFSQuota: true
topologyManagerPolicy: single-numa-node

Memory:

featureGates: MemoryQoS=true   (MemoryQoS is a feature gate, not a standalone field)
evictionHard: memory.available<500Mi
evictionSoft: memory.available<1Gi

Reservations:

systemReserved: cpu=500m,memory=1Gi
kubeReserved: cpu=1,memory=2Gi

PIDs:

podPidsLimit: 1024   (flag: --pod-max-pids)
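
Stitched together, those fragments look roughly like the following KubeletConfiguration; values are illustrative and must be sized per node, and evictionSoft needs a matching grace period:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuCFSQuota: true
topologyManagerPolicy: single-numa-node
featureGates:
  MemoryQoS: true              # wires up memory.min / memory.high
evictionHard:
  memory.available: "500Mi"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"    # required whenever evictionSoft is set
systemReserved:
  cpu: 500m
  memory: 1Gi
kubeReserved:
  cpu: "1"
  memory: 2Gi
podPidsLimit: 1024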

OS:

  • Use cgroup v2
  • Use systemd unified hierarchy
  • Use containerd
  • Enable PSI reporting
  • Reserve HugePages if required

Hardware:

  • Prefer NUMA-exposed VMs
  • Local SSD for I/O-sensitive workloads
  • High memory bandwidth for ML workloads

PART 9 — Runbook for Node Pressure Events

What to check on pressure:

  1. cat /proc/pressure/memory
  2. cat /sys/fs/cgroup/kubepods.slice/.../memory.current
  3. cat /sys/fs/cgroup/.../memory.stat
  4. dmesg | grep -i oom
  5. journalctl -u kubelet
  6. free -h / vmstat 1 / slabtop
  7. Identify:

    • page cache overgrowth
    • slab overgrowth
    • bad actors with huge working set
    • memory-thrashing pods
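
A quick way to surface the bad actors is to rank Pod cgroups by working set straight from the cgroup tree (the globs and paths are illustrative; adjust them for your QoS slices):

for d in /sys/fs/cgroup/kubepods.slice/*/*pod*.slice /sys/fs/cgroup/kubepods.slice/kubepods-pod*.slice; do
  [ -d "$d" ] || continue                       # skip unmatched globs
  cur=$(cat "$d/memory.current")
  inactive=$(awk '/^inactive_file /{print $2}' "$d/memory.stat")
  printf '%8d MiB  %s\n' "$(( (cur - inactive) / 1048576 ))" "$d"
done | sort -rn | head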

SEGMENT 9 SUMMARY

You now have the full 360° understanding of Kubernetes resource isolation:

CPU

  • weights, quota, exclusive cpus, NUMA

Memory

  • limits, working set, eviction, reclaim, OOM

Topology

  • single-numa-node alignment of CPU/memory/devices

Node Allocatable

  • the real capacity available to Pods

Kernel Internals

  • PSI, kswapd, direct reclaim, OOM killer

Runtime & systemd

  • how cgroups actually get created

Hardware Constraints

  • NUMA topology, hugepages, local SSD

This integrated model is the mental foundation for building stable, high-performance Kubernetes clusters.

