Kubernetes Resource Isolation - 06. CPU Manager + Topology Manager Deep Dive
Segment 6 is one of the most advanced and misunderstood areas of Kubernetes resource isolation. This is where Kubernetes goes beyond “fair CPU time” and enters the world of exclusive CPUs, NUMA alignment, latency-sensitive workloads, and AI/HPC performance tuning.
SEGMENT 6 — CPU Manager + Topology Manager Deep Dive
We cover:
- Why CPU pinning matters
- CPU Manager Policies (none → static)
- How whole-CPU guarantees work
- How Kubernetes allocates exclusive CPUs internally
- Topology Manager Policies
- Combining CPU + HugePages + Device Plugins
- NUMA-level behavior and pitfalls
- Real-world examples (AI, HFT, Envoy, Redis, Java)
Let’s begin.
PART 1 — Why CPU Pinning Exists in Kubernetes
Normally, Kubernetes gives Pods:
- shared CPU time
- no pinning
- threads migrate across cores
- memory allocated across NUMA nodes
This is totally fine for:
- web apps
- API servers
- batch jobs
- most microservices
But NOT okay for:
- network dataplanes (Envoy, Cilium agent)
- AI inference workers
- Redis / Memcached
- JVM apps with GC-sensitive behavior
- High-frequency trading systems
- HPC workloads
- DPDK-based NFV systems
These workloads suffer from:
- CPU cache thrash
- cross-core migration
- NUMA remote memory access
- scheduler jitter
Therefore Kubernetes introduced:
- the CPU Manager
- the Topology Manager
PART 2 — CPU Manager Policies
Enable CPU Manager via kubelet:
```
--cpu-manager-policy=<none|static>
```
1. Policy: none (default)
Behavior:
- cpuset includes all CPUs
- processes float freely
- no exclusivity
- CPU requests → shares
- CPU limits → CFS quota
- lowest isolation
Suitable for 95% of app workloads.
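You can see this mapping directly in the container's cgroup files. A minimal sketch, assuming cgroup v2, an image that ships a shell, and a hypothetical Pod named web with requests.cpu: 500m and limits.cpu: "1"; the outputs are illustrative:

```bash
# The 500m request becomes a relative weight (~20 on cgroup v2), not a hard cap
kubectl exec web -- cat /sys/fs/cgroup/cpu.weight
# 20

# The 1-CPU limit becomes a CFS quota: at most 100ms of CPU time per 100ms period
kubectl exec web -- cat /sys/fs/cgroup/cpu.max
# 100000 100000

# The cpuset spans every node CPU: threads float freely, no pinning
kubectl exec web -- cat /sys/fs/cgroup/cpuset.cpus.effective
# 0-15
```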
2. Policy: static (the important one)
```
--cpu-manager-policy=static
```
Enables:
- Guaranteed Pods with whole-integer CPU request → get dedicated CPU cores
- Kubelet removes those cores from “shared pool”
- Sets cpuset.cpus to the exclusive CPUs
Requirements:
- QoS class = Guaranteed
- CPU request = limit
- CPU request is a whole number (no millis)
Example:
```yaml
resources:
  requests:
    cpu: "2"
    memory: "4Gi"   # memory requests must also equal limits for Guaranteed QoS
  limits:
    cpu: "2"
    memory: "4Gi"
```
Pod will receive:
```
cpuset.cpus = "4-5"   # example CPUs; the kubelet picks the actual cores
```
Threads inside the container can ONLY execute on those 2 CPUs.
No throttling
Because:
- limit = request
- the kubelet applies no CFS quota to containers with exclusive CPUs
- the container gets full, unrestricted use of its cores
This is extremely powerful.
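You can verify the pinning from inside the container. A minimal sketch, assuming cgroup v2, an image that ships a shell, and a hypothetical Pod named pinned-app created with the resources above; outputs are illustrative:

```bash
# The container's effective cpuset is restricted to the exclusive cores
kubectl exec pinned-app -- cat /sys/fs/cgroup/cpuset.cpus.effective
# 4-5

# No CFS quota ("max") -> the container is never throttled, as described above
kubectl exec pinned-app -- cat /sys/fs/cgroup/cpu.max
# max 100000

# The process affinity mask matches the cpuset
kubectl exec pinned-app -- grep Cpus_allowed_list /proc/1/status
# Cpus_allowed_list:   4-5
```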
3. Policy: dynamic (proposed)
A more flexible "dynamic" policy has been discussed upstream to:
- allow more flexible CPU allocations
- share assignable CPU pools
- support fractional CPUs on a best-effort basis
It is not available in the upstream kubelet today (only none and static ship) — static is the industry standard for low-latency and AI.
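In practice the static policy is usually enabled through the kubelet configuration file rather than the command-line flag. A minimal sketch (field names from KubeletConfiguration; the reserved CPUs are an illustrative choice):

```yaml
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# static requires an explicit CPU reservation so the shared pool never
# shrinks to zero; cores 0-1 are an illustrative choice
reservedSystemCPUs: "0,1"
```

Changing the policy on a node that is already running typically requires draining it, deleting /var/lib/kubelet/cpu_manager_state, and restarting the kubelet.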
PART 3 — How Whole-CPU Guarantees Actually Work (internals)
When the scheduler places a Pod, the kubelet:
1. Looks at the node's CPU topology.
2. Maintains two CPU pools:
   - sharedPool
   - exclusivePool
3. For a 2-core request:
   - removes 2 cores from the sharedPool
   - moves them into the exclusivePool and binds the container to them
These cores are marked “allocated” in a checkpoint file:
/var/lib/kubelet/cpu_manager_state
Format example (simplified — entries map pod UID → container name → assigned cpuset):
```json
{
  "policyName": "static",
  "defaultCpuSet": "0-1",
  "entries": {
    "podUID": {
      "app": "2-3"
    }
  }
}
```
This is how kubelet ensures:
- exclusive use
- state persistence across kubelet restarts
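You can inspect both the checkpoint and the resulting cpuset on the node. A rough sketch, assuming root access, jq installed, and a systemd cgroup driver (the slice/scope names are illustrative and vary by runtime):

```bash
# Exclusive assignments and the remaining shared pool
sudo jq . /var/lib/kubelet/cpu_manager_state

# The pinned container's cpuset under the kubepods cgroup hierarchy
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<UID>.slice/*/cpuset.cpus
```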
PART 4 — NUMA Awareness via Topology Manager
Enable via kubelet:
```
--topology-manager-policy=<none|best-effort|restricted|single-numa-node>
```
Why this matters
On multi-socket systems:
- CPU sockets = NUMA nodes
- Memory attached to each node
- Remote memory accesses are slower
- GPUs/NIC queues tied to specific NUMA domains
If you allocate:
- CPUs from socket 0
- HugePages from socket 1
- a GPU from socket 1
your performance tanks, because every memory and device access has to cross the inter-socket interconnect.
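Before tuning alignment, check what the node actually exposes. A quick sketch with standard Linux tools (output is illustrative for a 2-socket machine):

```bash
lscpu | grep -i numa
# NUMA node(s):        2
# NUMA node0 CPU(s):   0-15
# NUMA node1 CPU(s):   16-31

numactl --hardware   # requires the numactl package; shows per-node memory and distances
```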
Topology Manager Policies
1. none (default)
No alignment. Requests can come from anywhere.
2. best-effort
Tries to align resources on a NUMA node when possible, but does NOT reject the Pod if it cannot.
3. restricted
If NUMA alignment cannot be met, the kubelet rejects the Pod at admission on that node (the scheduler has already placed it; the kubelet refuses to run it).
This is for moderately sensitive workloads.
4. single-numa-node (strictest)
All resources must come from one NUMA node:
- all CPUs
- hugepages
- device plugin resources
If this is not possible, the Pod is rejected on that node.
This is the policy typically used by:
- DPDK
- SR-IOV
- high-performance Redis
- AI inference
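A hedged kubelet configuration sketch that enables strict alignment alongside the static CPU policy (field names from KubeletConfiguration; values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
topologyManagerScope: container   # or "pod" to align all containers of a Pod together
reservedSystemCPUs: "0,1"         # illustrative reservation for system daemons
```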
PART 5 — How CPU Manager + Topology Manager Work Together
Flow:
1. The scheduler assigns the Pod to a node.
2. The kubelet:
   - checks the NUMA topology
   - gathers topology hints for the requested CPUs, hugepages, GPUs, and RDMA devices
3. The Topology Manager policy decides whether the node can satisfy NUMA alignment.
4. If it can:
   - the CPU Manager picks physical cores
   - cpuset.cpus / cpuset.mems are set
   - the container runtime starts the container with CPU pinning
   - the Pod now runs in a deterministic NUMA domain
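If alignment fails under a strict policy, the Pod never starts on that node; the kubelet rejects it at admission. A quick way to spot this (the Pod name is hypothetical, and the exact reason string may vary by version — it is typically TopologyAffinityError):

```bash
kubectl get pod latency-app -o jsonpath='{.status.phase} {.status.reason}{"\n"}'
# Failed TopologyAffinityError
```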
PART 6 — Combining CPUs + HugePages + GPUs + NIC queues
This is where Kubernetes becomes HPC/hyper-optimized.
Example: AI inference Pod
- 2 CPUs pinned
- GPU device plugin (NUMA node 1)
- HugePages 2Mi from NUMA node 1
- Topology Manager = single-numa-node
All resources land on the same NUMA node:
```
cpuset.cpus = "8-9"
cpuset.mems = "1"
hugetlb.2MB.limit_in_bytes = 1Gi       # hugepages served from NUMA node 1
GPU0 (device plugin)                   # NUMA node 1
```
The payoff:
- 20–40% higher throughput on some AI models
- massively reduced latency jitter
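A hedged Pod spec sketch that requests all of these together (names and sizes are illustrative; nvidia.com/gpu assumes the NVIDIA device plugin is deployed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker              # hypothetical
spec:
  containers:
  - name: infer
    image: registry.example.com/infer:latest   # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
        hugepages-2Mi: "1Gi"
        nvidia.com/gpu: 1
      limits:
        cpu: "2"
        memory: "8Gi"
        hugepages-2Mi: "1Gi"
        nvidia.com/gpu: 1
    volumeMounts:
    - name: hugepages
      mountPath: /dev/hugepages
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
```

With single-numa-node set on the node, the kubelet only admits this Pod if the CPUs, the hugepages, and the GPU can all be served from the same NUMA node.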
PART 7 — Real-World Example Configurations
1. AI/ML Inference Pod
```yaml
resources:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "8Gi"
    nvidia.com/gpu: 1
```
Node (kubelet settings):
```
--cpu-manager-policy=static
--topology-manager-policy=single-numa-node
--feature-gates=MemoryQoS=true   # Memory QoS sits behind a feature gate (cgroup v2)
```
Outcome:
- pinned CPUs
- pinned NUMA memory
- local HugePages
- GPU aligned
- extremely stable inference latency
2. Low-latency Envoy/Cilium Pod
```yaml
resources:
  requests:
    cpu: "2"
    memory: "1Gi"   # requests must equal limits so the Pod stays Guaranteed
  limits:
    cpu: "2"
    memory: "1Gi"
```
Paired with:
- CPU pinning
- dedicated nodes (taints)
- static CPU Manager
Outcome:
- consistent packet processing
- minimal jitter
- predictable latency
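The "dedicated nodes (taints)" piece above is ordinary Kubernetes scheduling; a sketch with an illustrative taint key and node label:

```yaml
# Node preparation (illustrative key/values):
#   kubectl taint nodes edge-01 dedicated=latency:NoSchedule
#   kubectl label nodes edge-01 workload-class=latency
spec:
  nodeSelector:
    workload-class: latency
  tolerations:
  - key: dedicated
    operator: Equal
    value: latency
    effect: NoSchedule
```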
PART 8 — NUMA Pitfalls & Gotchas
1. Fractional CPUs disable pinning
The CPU request must be a whole number (1, 2, 3, ...), NOT a fractional value such as 500m.
2. Not Guaranteed QoS → no pinning
Meaning:
requests != limits (for CPU or memory) → not Guaranteed → no pinning
3. Pinning is decided per container, not per Pod
Exclusive CPUs are allocated per container. Only containers that individually request whole CPUs (with requests = limits) get pinned cores; sibling containers with fractional requests keep running in the shared pool.
4. Resource fragmentation
If cores get allocated unevenly across NUMA nodes, later Pods may fail NUMA alignment even though the node still has enough free CPUs in total.
5. Restart storm
When kubelet restarts, CPU Manager restores state from checkpoint, but mismatches can cause Pod evictions.
6. Node images (AKS/EKS/GKE) differ
Some cloud providers mask NUMA from the VM. If NUMA is hidden, the Topology Manager has nothing to align and becomes ineffective.
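To check whether a VM image actually exposes NUMA, look at the sysfs node directory (interpretation shown as comments):

```bash
ls -d /sys/devices/system/node/node*
# node0 only               -> NUMA is hidden/flattened on this VM
# node0 and node1 present  -> real NUMA topology is exposed
```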
SEGMENT 6 SUMMARY
You now deeply understand:
CPU Manager
- Enables exclusive CPU allocation
- Requires static policy
- Guarantees whole-CPU cores
- Uses cpuset to pin workloads
- Maintains state in checkpoint file
Topology Manager
- Ensures NUMA alignment for:
  - CPUs
  - HugePages
  - GPUs
  - NIC queues
- Prevents remote (cross-NUMA-node) memory access
- Critical for HPC/AI/DPDK workloads
- Critical for HPC/AI/DPDK workloads
Combined Power
CPU Manager + Topology Manager transform Kubernetes from a “cloud orchestrator” into a high-performance computing platform.
Next Segment Options
Where would you like to go next?
👉 Segment 7 — Node Allocatable, System Reservations, Eviction & Kubelet internals
(critical for production node sizing & stability)
👉 Segment 8 — Linux Kernel Memory Internals for Kubernetes
(page cache, slab, OOM scoring, reclaim behavior)
👉 Segment 9 — Putting it all together: Full Node Resource Isolation Architecture
(Resource pools, QoS, CPU manager, memory QoS, hugepages, eviction, bin packing)
Pick your next deep dive!