Kubernetes Resource Isolation - 09. Full Node Resource Isolation Architecture
Segment 9 is the “grand unification” segment, where we combine everything from Segments 1–8 into a single mental model:
How a Kubernetes node actually works internally as a resource-managed system, from hardware → kernel → cgroups → kubelet → Pods. This is the architecture senior SREs, kernel engineers, and AI infrastructure teams use to design production-grade clusters with predictable performance and stability.
SEGMENT 9 — Full Node Resource Isolation Architecture
We will integrate:
- CPU (requests, limits, throttling, shares, CPUManager, cpuset)
- Memory (limits, working set, eviction, MemoryQoS, OOM)
- I/O (blkio/io fairness)
- PIDs (pid exhaustion)
- HugePages
- NUMA alignment (TopologyManager)
- Node Allocatable + Reservations
- Kubelet, container runtime, systemd
- Linux kernel internals (page cache, slabs, PSI)
- Device plugins (GPU, SR-IOV, RDMA)
And build the definitive model of how everything interacts.
PART 1 — The Node Stack Overview
We start with the stack diagram:
+------------------------------------------------------+
| USER WORKLOAD PODs |
| - Containers (cgroups-v2 isolation) |
| - CPU quota/shares/pinning |
| - Memory limits/requests/QoS |
+-------------------- kubelet -------------------------+
| - Applies Pod cgroup settings |
| - Enforces QoS, eviction, CPUManager, NodeAlloc |
| - container runtime API (CRI) |
+---------------- container runtime -------------------+
| - Manages namespaces, seccomp, mounts |
| - Talks to systemd for cgroup changes |
+---------------------- systemd ------------------------+
| - Manages cgroup hierarchy (system.slice, pods) |
+---------------------- Linux Kernel -------------------+
| - cgroups: cpu, cpuset, memory, io, pids |
| - reclaim, PSI, page cache, slab, OOM killer |
| - NUMA topology |
+----------------------- Hardware ----------------------+
| CPU cores, memory nodes, disks, NICs |
+------------------------------------------------------+
Everything below the kubelet is Linux kernel machinery.
Everything above is Kubernetes policy.
PART 2 — CPU Isolation Architecture (Integrated Model)
CPU in Kubernetes is controlled at 4 layers.
Layer 1 — CPU Requests → cpu.weight
Sets fairness during contention.
Pods compete proportionally to their requests: the request is converted to CPU shares (request in cores × 1024), which the runtime maps onto the cpu.weight range (1–10000) on cgroup v2.
Layer 2 — CPU Limits → CFS quota
Sets absolute maximum:
cpu.cfs_quota_us = limit_in_cores * 100000 (the default 100 ms CFS period); on cgroup v2 this is written to cpu.max
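To make layers 1 and 2 concrete, here is a minimal Pod sketch (name and image are hypothetical) with the approximate cgroup values the kubelet derives; the exact cpu.weight depends on the shares-to-weight conversion applied on cgroup v2:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo                 # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        cpu: "500m"   # -> 512 CPU shares -> cpu.weight of roughly 20 on cgroup v2
      limits:
        cpu: "1"      # -> cpu.max = "100000 100000" (one core per 100 ms period)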
Layer 3 — CPU Manager
If enabled (static policy):
- Guaranteed Pods with whole-integer CPU requests get exclusive CPUs
- cpuset.cpus = "X-Y" (dedicated cores)
- no throttling (no CFS quota)
- no noisy neighbor interference
Layer 4 — Topology Manager
Ensures:
- CPU cores
- hugepages
- devices (GPU/NIC queues) are pulled from the same NUMA node.
This layer prevents:
- cross-socket memory latency
- inconsistent performance
- jitter in AI and network workloads.
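As an illustration of layers 3 and 4 together, a Pod like the sketch below (hypothetical names; nvidia.com/gpu stands in for any device-plugin resource) is Guaranteed QoS with whole-integer CPUs, so the static CPU manager pins it to exclusive cores, and with topologyManagerPolicy: single-numa-node those cores and the device are taken from the same NUMA node:

apiVersion: v1
kind: Pod
metadata:
  name: pinned-inference         # hypothetical name
spec:
  containers:
  - name: worker
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      requests:
        cpu: "4"                 # whole integer -> eligible for exclusive CPUs
        memory: "8Gi"
        nvidia.com/gpu: "1"      # example device-plugin resource
      limits:                    # limits == requests -> Guaranteed QoS
        cpu: "4"
        memory: "8Gi"
        nvidia.com/gpu: "1"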
PART 3 — Memory Isolation Architecture
Memory has more interacting parts than CPU.
Kubernetes uses:
1. memory.max
Hard limit — violation → container OOMKilled.
2. memory.min (MemoryQoS)
Guarantees each Pod keeps its requested memory before reclaim.
3. memory.high (soft limit)
Triggers reclaim/throttling if Pod exceeds memory.high.
4. Dynamic working set
kubelet computes:
working_set = memory.current - inactive_file (inactive_file comes from memory.stat)
to determine the working set.
5. Eviction thresholds
Based on node-level memory.available (capacity minus the node's working set), not per-Pod cgroup usage, so the kubelet can act before the kernel OOM killer does.
6. Node Allocatable
Prevents workloads from starving system daemons.
7. Kernel OOM
Kills processes when reclaim fails.
All of these layers interact:
Pod memory request/limit
    |
    V
memory.min → requested memory protected from reclaim
    |
    V
memory.high → kernel throttles/reclaims above this
    |
    V
memory.max → hard limit, container OOMKilled
    |
    V
Node eviction (memory.available < threshold)
    |
    V
Kernel OOM killer (when reclaim fails)
This is the complete memory isolation flow.
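A rough sketch of how one Pod's settings map onto these knobs, assuming the MemoryQoS feature gate is enabled (memory.high is derived from the kubelet's memory throttling factor, so it is shown only approximately; names are hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo              # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    resources:
      requests:
        memory: "1Gi"   # -> memory.min = 1Gi (protected from reclaim)
      limits:
        memory: "2Gi"   # -> memory.max = 2Gi (hard OOM boundary)
                        # -> memory.high lands between request and limit,
                        #    per the kubelet's memory throttling factor
# The kubelet's working-set view of this container:
#   working_set = memory.current - inactive_file (from memory.stat)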
PART 4 — I/O, PIDs, HugePages
These supplement CPU/memory.
I/O (blkio on cgroup v1, io on cgroup v2)
Kubernetes does not expose per-Pod I/O limits directly; the controller mainly provides coarse fairness between cgroups.
PIDs
Prevents fork-bomb style node meltdown; the kubelet writes a per-Pod limit:
pids.max = <limit>
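That limit comes from the kubelet configuration; a minimal fragment (1024 is just an example value):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024   # each Pod's pids cgroup gets pids.max = 1024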
HugePages
Separate cgroup controller:
hugetlb.2MB.max = X
hugetlb.1GB.max = Y
No overcommit allowed.
Used heavily in:
- DPDK
- Redis
- ML inference
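A minimal sketch of a Pod consuming pre-reserved 2Mi HugePages (hypothetical names; the node must already have HugePages reserved, and requests must equal limits because HugePages cannot be overcommitted):

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-demo           # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
    volumeMounts:
    - name: hugepage-2mi
      mountPath: /hugepages      # hugetlbfs mount visible to the application
    resources:
      requests:
        hugepages-2Mi: "256Mi"   # accounted against hugetlb.2MB.max
        memory: "512Mi"
      limits:
        hugepages-2Mi: "256Mi"
        memory: "512Mi"
  volumes:
  - name: hugepage-2mi
    emptyDir:
      medium: HugePages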
PART 5 — Node Allocatable Model
Node resources are carved up as follows:
TOTAL NODE MEMORY
+---------------------------------------------------------+
| system-reserved |
+---------------------------------------------------------+
| kube-reserved |
+---------------------------------------------------------+
| eviction thresholds (memory.available must stay above) |
+---------------------------------------------------------+
| Node Allocatable (scheduler target) |
+---------------------------------------------------------+
Pods should never push the node's available memory below the eviction thresholds.
If they do:
- kubelet evicts Pods
- or kernel OOM kills things
Node Allocatable ensures the scheduler doesn't overpack the node.
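As a worked sketch with made-up numbers: on a 16Gi node with the reservations below, Allocatable memory comes out to roughly 16Gi - 1Gi - 2Gi - 500Mi ≈ 12.5Gi, and that is the figure the scheduler packs against:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "1"
  memory: "2Gi"
evictionHard:
  memory.available: "500Mi"
# Allocatable(memory) = Capacity - systemReserved - kubeReserved - evictionHard(memory.available)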
PART 6 — How All Isolation Components Interact
Here is the full lifecycle of a Pod on a node:
Step 1 — Scheduler
Decides based on:
- requests
- node allocatable
- taints/tolerations
- NUMA hints (topology-aware scheduling is a future enhancement; today NUMA alignment happens on the node)
Step 2 — Kubelet
Creates:
- Pod cgroup
- Container cgroups
Applies:
- cpu.weight
- cpu.cfs_quota_us
- memory.max
- memory.high/min (MemoryQoS)
- pids.max
- cpuset.cpus (CPUManager)
- cpuset.mems (TopologyManager)
Kubelet does not do enforcement — kernel does.
Step 3 — Container runtime
(containerd/CRI-O)
- Creates namespaces
- Launches container process inside cgroup
- systemd creates the cgroup slices/scopes and applies the settings (systemd cgroup driver)
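With the systemd cgroup driver and containerd, the resulting hierarchy looks roughly like this (Burstable Pod shown; Guaranteed Pods sit directly under kubepods.slice, and <pod-uid>/<container-id> are placeholders):

/sys/fs/cgroup/
  kubepods.slice/
    kubepods-burstable.slice/
      kubepods-burstable-pod<pod-uid>.slice/
        cri-containerd-<container-id>.scope/
          cpu.weight, cpu.max, memory.max, memory.high, pids.max, ...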
Step 4 — Linux Kernel
Enforces:
- CPU throttling
- CPU pinning
- Memory allocation
- Memory reclaim
- Direct reclaim
- PSI pressure
- blkio/io fairness
- PIDs limit
- HugePages accounting
- NUMA locality
Step 5 — Kubelet Monitoring
Roughly every 10 seconds (the default housekeeping interval), the kubelet:
- reads cgroup stats
- calculates working set
- checks eviction thresholds
- updates Node conditions
- evicts Pods if necessary
Step 6 — Kernel OOM Killer
If reclaim fails:
- kills the process with the highest badness score
- may kill the kubelet or the container runtime
- node may go NotReady
- Pods reschedule elsewhere
This is where node meltdown happens.
PART 7 — Putting Everything Together (Node Architecture Diagram)
+-------------------------------------------------------------+
| USER PODS |
| |
| - CPU cgroup (weight, quota, cpuset, pinned CPUs) |
| - Memory cgroup (max, high, min) |
| - I/O cgroup (io.max/weight via runtime) |
| - PID cgroup (pids.max) |
| - HugePages cgroup (hugetlb.X.max) |
+------------------------------ kubelet ----------------------+
| Applies QoS, limits, MemoryQoS, CPUManager, topology rules |
| Calculates working set, eviction, node conditions |
+------------------------ container runtime ------------------+
| Creates namespaces & cgroups, delegates to systemd |
+----------------------------- systemd -----------------------+
| Manages full cgroup hierarchy (kubepods.slice) |
+----------------------------- kernel ------------------------+
| PSI: detects memory pressure |
| kswapd: reclaims pages |
| OOM killer: last resort |
| Applies quota, pinning, hugepages, pids |
| NUMA: memory locality, CPU locality |
+----------------------------- hardware ----------------------+
| CPU cores, NUMA nodes, memory, disks, NIC |
+-------------------------------------------------------------+
PART 8 — How to Build the Perfect Node Isolation Template (Recommended)
Here is the ideal configuration for a production-grade cluster:
CPU:
cpuManagerPolicy: static
cpuCFSQuota: true
topologyManagerPolicy: single-numa-node
Memory:
MemoryQoS feature gate: enabled
evictionHard: memory.available<500Mi
evictionSoft: memory.available<1Gi
Reservations:
systemReserved: cpu=500m,memory=1Gi
kubeReserved: cpu=1,memory=2Gi
PIDs:
podPidsLimit: 1024
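Pulled together as a KubeletConfiguration sketch (the values are the examples above, not universal recommendations, and MemoryQoS is still behind a feature gate in current releases):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuCFSQuota: true
topologyManagerPolicy: single-numa-node
featureGates:
  MemoryQoS: true              # wires up memory.min/memory.high
evictionHard:
  memory.available: "500Mi"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"    # soft thresholds require a grace period
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "1"
  memory: "2Gi"
podPidsLimit: 1024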
OS:
- Use cgroup v2
- Use systemd unified hierarchy
- Use containerd
- Enable PSI reporting
- Reserve HugePages if required
Hardware:
- Prefer NUMA-exposed VMs
- Local SSD for I/O-sensitive workloads
- High memory bandwidth for ML workloads
PART 9 — Runbook for Node Pressure Events
What to check on pressure:
cat /proc/pressure/memory
cat /sys/fs/cgroup/kubepods.slice/.../memory.current
cat /sys/fs/cgroup/.../memory.stat
dmesg | grep -i oom
journalctl -u kubelet
free -h
vmstat 1
slabtop
Identify:
- page cache overgrowth
- slab overgrowth
- bad actors with huge working set
- memory-thrashing pods
SEGMENT 9 SUMMARY
You now have the full 360° understanding of Kubernetes resource isolation:
CPU
- weights, quota, exclusive cpus, NUMA
Memory
- limits, working set, eviction, reclaim, OOM
Topology
- single-numa-node alignment of CPU/memory/devices
Node Allocatable
- the real capacity available to Pods
Kernel Internals
- PSI, kswapd, direct reclaim, OOM killer
Runtime & systemd
- how cgroups actually get created
Hardware Constraints
- NUMA topology, hugepages, local SSD
This integrated model is the mental foundation for building stable, high-performance Kubernetes clusters.