Kubernetes Resource Isolation - 06. CPU Manager + Topology Manager Deep Dive
Segment 6 is one of the most advanced and misunderstood areas of Kubernetes resource isolation. This is where Kubernetes goes beyond “fair CPU time” and enters the world of exclusive CPUs, NUMA alignment, latency-sensitive workloads, and AI/HPC performance tuning.
SEGMENT 6 — CPU Manager + Topology Manager Deep Dive
We cover:
- Why CPU pinning matters
- CPU Manager Policies (none → static)
- How whole-CPU guarantees work
- How Kubernetes allocates exclusive CPUs internally
- Topology Manager Policies
- Combining CPU + HugePages + Device Plugins
- NUMA-level behavior and pitfalls
- Real-world examples (AI, HFT, Envoy, Redis, Java)
Let’s begin.
PART 1 — Why CPU Pinning Exists in Kubernetes
Normally, Kubernetes gives Pods:
- shared CPU time
- no pinning
- threads migrate across cores
- memory allocated across NUMA nodes
This is totally fine for:
- web apps
- API servers
- batch jobs
- most microservices
But NOT okay for:
- network dataplanes (Envoy, Cilium agent)
- AI inference workers
- Redis / Memcached
- JVM apps with GC-sensitive behavior
- High-frequency trading systems
- HPC workloads
- DPDK-based NFV systems
These workloads suffer from:
- CPU cache thrash
- cross-core migration
- NUMA remote memory access
- scheduler jitter
Therefore Kubernetes introduced:
- the CPU Manager
- the Topology Manager
PART 2 — CPU Manager Policies
Enable CPU Manager via kubelet:
```
--cpu-manager-policy=<none|static>
```
1. Policy: none (default)
Behavior:
- cpuset includes all CPUs
- processes float freely
- no exclusivity
- CPU requests → shares
- CPU limits → CFS quota
- lowest isolation
Suitable for 95% of app workloads.
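You can see this mapping directly in the container's cgroup files. A minimal sketch, assuming cgroup v2, an image that ships a shell, and a hypothetical Pod named web with requests.cpu: 500m and limits.cpu: "1"; the outputs are illustrative:

```bash
# The 500m request becomes a relative weight (~20 on cgroup v2), not a hard cap
kubectl exec web -- cat /sys/fs/cgroup/cpu.weight
# 20

# The 1-CPU limit becomes a CFS quota: at most 100ms of CPU time per 100ms period
kubectl exec web -- cat /sys/fs/cgroup/cpu.max
# 100000 100000

# The cpuset spans every node CPU: threads float freely, no pinning
kubectl exec web -- cat /sys/fs/cgroup/cpuset.cpus.effective
# 0-15
```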
2. Policy: static (the important one)
```
--cpu-manager-policy=static
```
Enables:
- Guaranteed Pods with whole-integer CPU request → get dedicated CPU cores
- Kubelet removes those cores from “shared pool”
- Sets cpuset.cpus to the exclusive CPUs
Requirements:
- QoS class = Guaranteed
- CPU request = limit
- CPU request is a whole number (no millis)
Example:
```yaml
resources:
  requests:
    cpu: "2"
    memory: "4Gi"   # memory requests must also equal limits for Guaranteed QoS
  limits:
    cpu: "2"
    memory: "4Gi"
```
Pod will receive:
```
cpuset.cpus = "4-5"   # example CPUs; the kubelet picks the actual cores
```
Threads inside the container can ONLY execute on those 2 CPUs.
No throttling
Because:
- limit = request
- the kubelet applies no CFS quota to containers with exclusive CPUs
- the container gets full, unrestricted use of its cores
This is extremely powerful.
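You can verify the pinning from inside the container. A minimal sketch, assuming cgroup v2, an image that ships a shell, and a hypothetical Pod named pinned-app created with the resources above; outputs are illustrative:

```bash
# The container's effective cpuset is restricted to the exclusive cores
kubectl exec pinned-app -- cat /sys/fs/cgroup/cpuset.cpus.effective
# 4-5

# No CFS quota ("max") -> the container is never throttled, as described above
kubectl exec pinned-app -- cat /sys/fs/cgroup/cpu.max
# max 100000

# The process affinity mask matches the cpuset
kubectl exec pinned-app -- grep Cpus_allowed_list /proc/1/status
# Cpus_allowed_list:   4-5
```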
3. Policy: dynamic (proposed)
A more flexible "dynamic" policy has been discussed upstream to:
- allow more flexible CPU allocations
- share assignable CPU pools
- support fractional CPUs on a best-effort basis
It is not available in the upstream kubelet today (only none and static ship) — static is the industry standard for low-latency and AI.
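In practice the static policy is usually enabled through the kubelet configuration file rather than the command-line flag. A minimal sketch (field names from KubeletConfiguration; the reserved CPUs are an illustrative choice):

```yaml
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# static requires an explicit CPU reservation so the shared pool never
# shrinks to zero; cores 0-1 are an illustrative choice
reservedSystemCPUs: "0,1"
```

Changing the policy on a node that is already running typically requires draining it, deleting /var/lib/kubelet/cpu_manager_state, and restarting the kubelet.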
PART 3 — How Whole-CPU Guarantees Actually Work (internals)
When the scheduler places a Pod, the kubelet:
1. Looks at the node's CPU topology.
2. Maintains two CPU pools:
   - sharedPool
   - exclusivePool
3. For a 2-core request:
   - removes 2 cores from the sharedPool
   - moves them into the exclusivePool and binds the container to them
These cores are marked “allocated” in a checkpoint file:
/var/lib/kubelet/cpu_manager_state
Format example (simplified — entries map pod UID → container name → assigned cpuset):
```json
{
  "policyName": "static",
  "defaultCpuSet": "0-1",
  "entries": {
    "podUID": {
      "app": "2-3"
    }
  }
}
```
This is how kubelet ensures:
- exclusive use
- state persistence across kubelet restarts
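You can inspect both the checkpoint and the resulting cpuset on the node. A rough sketch, assuming root access, jq installed, and a systemd cgroup driver (the slice/scope names are illustrative and vary by runtime):

```bash
# Exclusive assignments and the remaining shared pool
sudo jq . /var/lib/kubelet/cpu_manager_state

# The pinned container's cpuset under the kubepods cgroup hierarchy
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<UID>.slice/*/cpuset.cpus
```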
PART 4 — NUMA Awareness via Topology Manager
Enable via kubelet:
```
--topology-manager-policy=<none|best-effort|restricted|single-numa-node>
```
Why this matters
On multi-socket systems:
- CPU sockets = NUMA nodes
- Memory attached to each node
- Remote memory accesses are slower
- GPUs/NIC queues tied to specific NUMA domains
If you allocate:
- CPUs from socket 0
- HugePages from socket 1
- a GPU from socket 1
your performance tanks, because every memory and device access has to cross the inter-socket interconnect.
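Before tuning alignment, check what the node actually exposes. A quick sketch with standard Linux tools (output is illustrative for a 2-socket machine):

```bash
lscpu | grep -i numa
# NUMA node(s):        2
# NUMA node0 CPU(s):   0-15
# NUMA node1 CPU(s):   16-31

numactl --hardware   # requires the numactl package; shows per-node memory and distances
```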
Topology Manager Policies
1. none (default)
No alignment. Requests can come from anywhere.
2. best-effort
Tries to align resources on a NUMA node when possible, but does NOT reject the Pod if it cannot.
3. restricted
If NUMA alignment cannot be met, the kubelet rejects the Pod at admission on that node (the scheduler has already placed it; the kubelet refuses to run it).
This is for moderately sensitive workloads.
4. single-numa-node (strictest)
All resources must come from one NUMA node:
- all CPUs
- hugepages
- device plugin resources
If this is not possible, the Pod is rejected on that node.
This is the policy typically used by:
- DPDK
- SR-IOV
- high-performance Redis
- AI inference
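A hedged kubelet configuration sketch that enables strict alignment alongside the static CPU policy (field names from KubeletConfiguration; values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
topologyManagerScope: container   # or "pod" to align all containers of a Pod together
reservedSystemCPUs: "0,1"         # illustrative reservation for system daemons
```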
PART 5 — How CPU Manager + Topology Manager Work Together
Flow:
1. The scheduler assigns the Pod to a node.
2. The kubelet:
   - checks the NUMA topology
   - gathers topology hints for the requested CPUs, hugepages, GPUs, and RDMA devices
3. The Topology Manager policy decides whether the node can satisfy NUMA alignment.
4. If it can:
   - the CPU Manager picks physical cores
   - cpuset.cpus / cpuset.mems are set
   - the container runtime starts the container with CPU pinning
   - the Pod now runs in a deterministic NUMA domain
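If alignment fails under a strict policy, the Pod never starts on that node; the kubelet rejects it at admission. A quick way to spot this (the Pod name is hypothetical, and the exact reason string may vary by version — it is typically TopologyAffinityError):

```bash
kubectl get pod latency-app -o jsonpath='{.status.phase} {.status.reason}{"\n"}'
# Failed TopologyAffinityError
```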
PART 6 — Combining CPUs + HugePages + GPUs + NIC queues
This is where Kubernetes becomes HPC/hyper-optimized.
Example: AI inference Pod
- 2 CPUs pinned
- GPU device plugin (NUMA node 1)
- HugePages 2Mi from NUMA node 1
- Topology Manager = single-numa-node
All resources land on the same NUMA node:
```
cpuset.cpus = "8-9"
cpuset.mems = "1"
hugetlb.2MB.limit_in_bytes = 1Gi       # hugepages served from NUMA node 1
GPU0 (device plugin)                   # NUMA node 1
```
The payoff:
- 20–40% higher throughput on some AI models
- massively reduced latency jitter
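A hedged Pod spec sketch that requests all of these together (names and sizes are illustrative; nvidia.com/gpu assumes the NVIDIA device plugin is deployed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker              # hypothetical
spec:
  containers:
  - name: infer
    image: registry.example.com/infer:latest   # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
        hugepages-2Mi: "1Gi"
        nvidia.com/gpu: 1
      limits:
        cpu: "2"
        memory: "8Gi"
        hugepages-2Mi: "1Gi"
        nvidia.com/gpu: 1
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
```

With single-numa-node set on the node, the kubelet only admits this Pod if the CPUs, the hugepages, and the GPU can all be served from the same NUMA node.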
PART 7 — Real-World Example Configurations
1. AI/ML Inference Pod
```yaml
resources:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "8Gi"
    nvidia.com/gpu: 1
```
Node (kubelet settings):
```
--cpu-manager-policy=static
--topology-manager-policy=single-numa-node
--feature-gates=MemoryQoS=true   # Memory QoS sits behind a feature gate (cgroup v2)
```
Outcome:
- pinned CPUs
- pinned NUMA memory
- local HugePages
- GPU aligned
- extremely stable inference latency
2. Low-latency Envoy/Cilium Pod
```yaml
resources:
  requests:
    cpu: "2"
    memory: "1Gi"   # requests must equal limits so the Pod stays Guaranteed
  limits:
    cpu: "2"
    memory: "1Gi"
```
Paired with:
- CPU pinning
- dedicated nodes (taints)
- static CPU Manager
Outcome:
- consistent packet processing
- minimal jitter
- predictable latency
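The "dedicated nodes (taints)" piece above is ordinary Kubernetes scheduling; a sketch with an illustrative taint key and node label:

```yaml
# Node preparation (illustrative key/values):
#   kubectl taint nodes edge-01 dedicated=latency:NoSchedule
#   kubectl label nodes edge-01 workload-class=latency
spec:
  nodeSelector:
    workload-class: latency
  tolerations:
  - key: dedicated
    operator: Equal
    value: latency
    effect: NoSchedule
```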
PART 8 — NUMA Pitfalls & Gotchas
1. Fractional CPUs disable pinning
The CPU request must be a whole number (1, 2, 3, ...), NOT a fractional value such as 500m.
2. Not Guaranteed QoS → no pinning
Meaning:
requests != limits (for CPU or memory) → not Guaranteed → no pinning
3. Pinning is decided per container, not per Pod
Exclusive CPUs are allocated per container. Only containers that individually request whole CPUs (with requests = limits) get pinned cores; sibling containers with fractional requests keep running in the shared pool.
4. Resource fragmentation
If cores get allocated unevenly across NUMA nodes, later Pods may fail NUMA alignment even though the node still has enough free CPUs in total.
5. Restart storm
When kubelet restarts, CPU Manager restores state from checkpoint, but mismatches can cause Pod evictions.
6. Node images (AKS/EKS/GKE) differ
Some cloud providers mask NUMA from the VM. If NUMA is hidden, the Topology Manager has nothing to align and becomes ineffective.
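To check whether a VM image actually exposes NUMA, look at the sysfs node directory (interpretation shown as comments):

```bash
ls -d /sys/devices/system/node/node*
# node0 only               -> NUMA is hidden/flattened on this VM
# node0 and node1 present  -> real NUMA topology is exposed
```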
SEGMENT 6 SUMMARY
You now deeply understand:
CPU Manager
- Enables exclusive CPU allocation
- Requires static policy
- Guarantees whole-CPU cores
- Uses cpuset to pin workloads
- Maintains state in checkpoint file
Topology Manager
- Ensures NUMA alignment for:
  - CPUs
  - HugePages
  - GPUs
  - NIC queues
- Prevents remote (cross-NUMA-node) memory access
- Critical for HPC/AI/DPDK workloads
- Critical for HPC/AI/DPDK workloads
Combined Power
CPU Manager + Topology Manager transform Kubernetes from a “cloud orchestrator” into a high-performance computing platform.
Next Segment Options
Where would you like to go next?
👉 Segment 7 — Node Allocatable, System Reservations, Eviction & Kubelet internals
(critical for production node sizing & stability)
👉 Segment 8 — Linux Kernel Memory Internals for Kubernetes
(page cache, slab, OOM scoring, reclaim behavior)
👉 Segment 9 — Putting it all together: Full Node Resource Isolation Architecture
(Resource pools, QoS, CPU manager, memory QoS, hugepages, eviction, bin packing)
Pick your next deep dive!