Kubernetes Resource Isolation - 09. Full Node Resource Isolation Architecture

October 13, 2025  5 minute read  

Segment 9 is the “grand unification” segment, where we combine everything from Segments 1–8 into a single mental model:

How a Kubernetes node actually works internally as a resource-managed system, from hardware → kernel → cgroups → kubelet → Pods. This is the architecture senior SREs, kernel engineers, and AI infrastructure teams use to design production-grade clusters with predictable performance and stability.

SEGMENT 9 — Full Node Resource Isolation Architecture

We will integrate:

  • CPU (requests, limits, throttling, shares, CPUManager, cpuset)
  • Memory (limits, working set, eviction, MemoryQoS, OOM)
  • I/O (blkio/io fairness)
  • PIDs (pid exhaustion)
  • HugePages
  • NUMA alignment (TopologyManager)
  • Node Allocatable + Reservations
  • Kubelet, container runtime, systemd
  • Linux kernel internals (page cache, slabs, PSI)
  • Device plugins (GPU, SR-IOV, RDMA)

And build the definitive model of how everything interacts.


PART 1 — The Node Stack Overview

We start with the stack diagram:

+------------------------------------------------------+
|                  USER WORKLOAD PODs                  |
|    - Containers (cgroups-v2 isolation)               |
|    - CPU quota/shares/pinning                        |
|    - Memory limits/requests/QoS                      |
+-------------------- kubelet -------------------------+
|    - Applies Pod cgroup settings                     |
|    - Enforces QoS, eviction, CPUManager, NodeAlloc   |
|    - container runtime API (CRI)                     |
+---------------- container runtime -------------------+
|    - Manages namespaces, seccomp, mounts             |
|    - Talks to systemd for cgroup changes             |
+---------------------- systemd ------------------------+
|    - Manages cgroup hierarchy (system.slice, pods)   |
+---------------------- Linux Kernel -------------------+
|    - cgroups: cpu, cpuset, memory, io, pids          |
|    - reclaim, PSI, page cache, slab, OOM killer      |
|    - NUMA topology                                   |
+----------------------- Hardware ----------------------+
|    CPU cores, memory nodes, disks, NICs              |
+------------------------------------------------------+

Everything below the kubelet is Linux kernel machinery.

Everything above is Kubernetes policy.


PART 2 — CPU Isolation Architecture (Integrated Model)

CPU in Kubernetes is controlled at 4 layers.

Layer 1 — CPU Requests → cpu.weight

Sets fairness during contention.

Pods “compete” proportionally based on their CPU requests. The kubelet converts the request into CPU shares, which the container runtime maps onto cgroup v2 cpu.weight:

shares = cpu.request (millicores) * 1024 / 1000
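
As a rough sketch (the cgroup path and Pod UID below are illustrative, and the exact shares-to-weight conversion is a runtime/systemd implementation detail), a 500m request works out to roughly:

# requests.cpu: 500m  ->  shares = 500 * 1024 / 1000 = 512
# cgroup v2 weight is approximately 1 + (512 - 2) * 9999 / 262142, i.e. about 20
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cpu.weight
# 20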

Layer 2 — CPU Limits → CFS quota

Sets the absolute ceiling per CFS period (100 ms by default):

cpu.cfs_quota_us = limit (cores) × cfs_period_us (100000 µs)
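
On cgroup v2 the same setting lands in the pod cgroup's cpu.max file as "<quota> <period>"; a quick sanity check (path illustrative):

# limits.cpu: 500m -> quota of 50000 us per 100000 us period
cat /sys/fs/cgroup/kubepods.slice/.../cpu.max
# 50000 100000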

Layer 3 — CPU Manager

If enabled (static):

  • whole-integer CPU requests → exclusive CPUs
  • cpuset.cpus = “X-Y”
  • no throttling (no CFS quota)
  • no noisy neighbor interference
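
A minimal sketch of a Pod that qualifies for exclusive cores under the static policy (name and image are placeholders): it must be Guaranteed QoS with integer CPU values.

apiVersion: v1
kind: Pod
metadata:
  name: pinned-worker              # illustrative name
spec:
  containers:
  - name: app
    image: example.com/app:latest  # placeholder image
    resources:
      requests:
        cpu: "4"        # integer CPU, requests == limits -> Guaranteed QoS
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi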

Layer 4 — Topology Manager

Ensures that:

  • CPU cores
  • hugepages
  • devices (GPU/NIC queues)

are all allocated from the same NUMA node.

This layer prevents:

  • cross-socket memory latency
  • inconsistent performance
  • jitter in AI and network workloads.
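
To verify alignment on a live node, you can compare the NUMA layout with what a container's cgroup was actually given (paths are illustrative, cgroup v2):

lscpu | grep -i numa                                          # NUMA node count and CPU ranges
cat /sys/fs/cgroup/kubepods.slice/.../cpuset.cpus.effective   # CPUs the container may run on
cat /sys/fs/cgroup/kubepods.slice/.../cpuset.mems.effective   # NUMA memory nodes it may allocate from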

PART 3 — Memory Isolation Architecture

Memory has more interacting parts than CPU.

Kubernetes uses:

1. memory.max

Hard limit — violation → container OOMKilled.

2. memory.min (MemoryQoS)

Protects each Pod's requested memory: the kernel reclaims it only after unprotected memory is exhausted.

3. memory.high (soft limit)

Triggers reclaim and allocation throttling once usage crosses memory.high, well before memory.max is reached.
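
Roughly, with MemoryQoS enabled the kubelet derives these files from the Pod spec; the throttling factor is configurable and version-dependent, so treat this mapping as a sketch:

# memory.min  ≈ container memory request
# memory.high ≈ container memory limit * memoryThrottlingFactor (e.g. 0.9); not set for Guaranteed Pods
# memory.max  = container memory limit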

4. Dynamic working set

kubelet uses:

memory.current - inactive_file

to determine working memory.
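
You can reproduce that calculation by hand against a Pod's cgroup (path and UID are illustrative):

CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice
CUR=$(cat "$CG/memory.current")
INACTIVE=$(awk '/^inactive_file /{print $2}' "$CG/memory.stat")
echo $(( CUR - INACTIVE ))    # approximate working set, in bytes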

5. Eviction thresholds

Use the node-level memory.available signal (node capacity minus the root cgroup's working set), not per-Pod cgroup limits, to decide on pre-OOM eviction.

6. Node Allocatable

Prevents workloads from starving system daemons (kubelet, container runtime, systemd, sshd).

7. Kernel OOM

Kills processes when reclaim fails.

All of these layers interact:

Pod Memory Limit
      |
      V
memory.max     → hard OOM
      |
      V
memory.high    → kernel throttles/reclaims
      |
      V
memory.min     → reserves working set
      |
      V
Node Eviction  (memory.available < threshold)
      |
      V
Kernel OOM     (when reclaim fails)

This is the complete memory isolation flow.


PART 4 — I/O, PIDs, HugePages

These supplement CPU/memory.

I/O (blkio/io controller)

Kubernetes does not expose I/O as a schedulable resource; the io controller is used mainly for fairness (io.weight) rather than per-Pod caps.

PIDs

Prevents PID exhaustion (fork bombs, runaway process trees) from melting the node:

pids.max = <limit>

HugePages

Separate cgroup controller:

hugetlb.2MB.max = X
hugetlb.1GB.max = Y

No overcommit allowed.

Used heavily in:

  • DPDK
  • Redis
  • ML inference
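
A container consumes hugepages through the hugepages-<size> resource; a minimal sketch (sizes and values illustrative), where hugepages requests must equal limits and cpu/memory must also be set:

resources:
  requests:
    hugepages-2Mi: 1Gi
    cpu: "1"
    memory: 2Gi
  limits:
    hugepages-2Mi: 1Gi     # hugepages requests must equal limits (no overcommit)
    cpu: "1"
    memory: 2Gi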

PART 5 — Node Allocatable Model

Node resources are carved:

                        TOTAL NODE MEMORY
  +---------------------------------------------------------+
  |   system-reserved                                       |
  +---------------------------------------------------------+
  |   kube-reserved                                          |
  +---------------------------------------------------------+
  |   eviction thresholds (memory.available must stay above) |
  +---------------------------------------------------------+
  |   Node Allocatable (scheduler target)                    |
  +---------------------------------------------------------+

Pod usage should never push memory.available below the eviction thresholds.

If they do:

  • kubelet evicts Pods
  • or kernel OOM kills things

Node Allocatable ensures scheduler doesn’t overpack.
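
Roughly, Allocatable = Capacity − system-reserved − kube-reserved − hard eviction threshold. You can compare the two on a live node (node name is a placeholder):

kubectl get node <node-name> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
# Capacity is raw hardware; Allocatable is what the scheduler packs against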


PART 6 — How All Isolation Components Interact

Here is the full lifecycle of a Pod on a node:

Step 1 — Scheduler

Decides based on:

  • requests
  • node allocatable
  • taints/tolerations
  • NUMA hints (Topology Aware Scheduling in future)

Step 2 — Kubelet

Creates:

  • Pod cgroup
  • Container cgroups

Applies:

  • cpu.weight
  • cpu.cfs_quota_us
  • memory.max
  • memory.high/min (MemoryQoS)
  • pids.max
  • cpuset.cpus (CPUManager)
  • cpuset.mems (TopologyManager)

Kubelet does not do enforcement — kernel does.
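
To see exactly what was written for one Pod, inspect its cgroup directly (slice names and UID are illustrative, cgroup v2):

CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice
grep . "$CG"/cpu.weight "$CG"/cpu.max "$CG"/memory.max "$CG"/memory.high "$CG"/memory.min "$CG"/pids.max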


Step 3 — Container runtime

(containerd/CRI-O)

  • Creates namespaces
  • Launches container process inside cgroup
  • systemd applies the cgroup settings (when the systemd cgroup driver is used)

Step 4 — Linux Kernel

Enforces:

  • CPU throttling
  • CPU pinning
  • Memory allocation
  • Memory reclaim
  • Direct reclaim
  • PSI pressure
  • blkio/io fairness
  • PIDs limit
  • HugePages accounting
  • NUMA locality

Step 5 — Kubelet Monitoring

Every ~10 seconds (the default housekeeping/eviction monitoring interval), the kubelet:

  • reads cgroup stats
  • calculates working set
  • checks eviction thresholds
  • updates Node conditions
  • evicts Pods if necessary
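
The outcome of that loop is visible in the node's conditions and in eviction events, for example:

kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
kubectl get events -A --field-selector reason=Evicted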

Step 6 — Kernel OOM Killer

If reclaim fails:

  • kills the process with the highest badness score
  • may kill kubelet or container runtime
  • node may go NotReady
  • Pods reschedule elsewhere

This is where node meltdown happens.


PART 7 — Putting Everything Together (Node Architecture Diagram)

+-------------------------------------------------------------+
|                         USER PODS                           |
|                                                             |
|  - CPU cgroup (weight, quota, cpuset, pinned CPUs)          |
|  - Memory cgroup (max, high, min)                           |
|  - I/O cgroup (io.max/weight via runtime)                   |
|  - PID cgroup (pids.max)                                    |
|  - HugePages cgroup (hugetlb.X.max)                         |
+------------------------------ kubelet ----------------------+
|  Applies QoS, limits, MemoryQoS, CPUManager, topology rules |
|  Calculates working set, eviction, node conditions          |
+------------------------ container runtime ------------------+
|  Creates namespaces & cgroups, delegates to systemd         |
+----------------------------- systemd -----------------------+
|  Manages full cgroup hierarchy (kubepods.slice)             |
+----------------------------- kernel ------------------------+
|  PSI: detects memory pressure                               |
|  kswapd: reclaim pages                                       |
|  OOM killer: last resort                                     |
|  Apply quota, pinning, hugepages, pids                      |
|  NUMA: memory locality, CPU locality                        |
+----------------------------- hardware ----------------------+
|  CPU cores, NUMA nodes, memory, disks, NIC                  |
+-------------------------------------------------------------+

PART 8 — How to Build the Perfect Node Isolation Template (Recommended)

Here is the ideal configuration for a production-grade cluster:

CPU:

cpuManagerPolicy: static
cpuCFSQuota: true
topologyManagerPolicy: single-numa-node

Memory:

featureGates: MemoryQoS=true   (MemoryQoS is a feature gate, not a standalone field)
evictionHard: memory.available<500Mi
evictionSoft: memory.available<1Gi

Reservations:

systemReserved: cpu=500m,memory=1Gi
kubeReserved: cpu=1,memory=2Gi

PIDs:

podPidsLimit: 1024   (flag: --pod-max-pids)
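
Stitched together, those fragments look roughly like the following KubeletConfiguration; values are illustrative and must be sized per node, and evictionSoft needs a matching grace period:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuCFSQuota: true
topologyManagerPolicy: single-numa-node
featureGates:
  MemoryQoS: true              # wires up memory.min / memory.high
evictionHard:
  memory.available: "500Mi"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"    # required whenever evictionSoft is set
systemReserved:
  cpu: 500m
  memory: 1Gi
kubeReserved:
  cpu: "1"
  memory: 2Gi
podPidsLimit: 1024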

OS:

  • Use cgroup v2
  • Use systemd unified hierarchy
  • Use containerd
  • Enable PSI reporting
  • Reserve HugePages if required

Hardware:

  • Prefer NUMA-exposed VMs
  • Local SSD for I/O-sensitive workloads
  • High memory bandwidth for ML workloads

PART 9 — Runbook for Node Pressure Events

What to check on pressure:

  1. cat /proc/pressure/memory
  2. cat /sys/fs/cgroup/kubepods.slice/.../memory.current
  3. cat /sys/fs/cgroup/.../memory.stat
  4. dmesg | grep -i oom
  5. journalctl -u kubelet
  6. free -h / vmstat 1 / slabtop
  7. Identify:

    • page cache overgrowth
    • slab overgrowth
    • bad actors with huge working set
    • memory-thrashing pods
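
A quick way to surface the bad actors is to rank Pod cgroups by working set straight from the cgroup tree (the globs and paths are illustrative; adjust them for your QoS slices):

for d in /sys/fs/cgroup/kubepods.slice/*/*pod*.slice /sys/fs/cgroup/kubepods.slice/kubepods-pod*.slice; do
  [ -d "$d" ] || continue                       # skip unmatched globs
  cur=$(cat "$d/memory.current")
  inactive=$(awk '/^inactive_file /{print $2}' "$d/memory.stat")
  printf '%8d MiB  %s\n' "$(( (cur - inactive) / 1048576 ))" "$d"
done | sort -rn | head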

SEGMENT 9 SUMMARY

You now have the full 360° understanding of Kubernetes resource isolation:

CPU

  • weights, quota, exclusive cpus, NUMA

Memory

  • limits, working set, eviction, reclaim, OOM

Topology

  • single-numa-node alignment of CPU/memory/devices

Node Allocatable

  • the real capacity available to Pods

Kernel Internals

  • PSI, kswapd, direct reclaim, OOM killer

Runtime & systemd

  • how cgroups actually get created

Hardware Constraints

  • NUMA topology, hugepages, local SSD

This integrated model is the mental foundation for building stable, high-performance Kubernetes clusters.

