Kubernetes Resource Isolation - 06. CPU Manager + Topology Manager Deep Dive

October 06, 2025  5 minute read  

Segment 6 is one of the most advanced and misunderstood areas of Kubernetes resource isolation. This is where Kubernetes goes beyond “fair CPU time” and enters the world of exclusive CPUs, NUMA alignment, latency-sensitive workloads, and AI/HPC performance tuning.

SEGMENT 6 — CPU Manager + Topology Manager Deep Dive

We cover:

  1. Why CPU pinning matters
  2. CPU Manager Policies (none → static)
  3. How whole-CPU guarantees work
  4. How Kubernetes allocates exclusive CPUs internally
  5. Topology Manager Policies
  6. Combining CPU + HugePages + Device Plugins
  7. NUMA-level behavior and pitfalls
  8. Real-world examples (AI, HFT, Envoy, Redis, Java)

Let’s begin.


PART 1 — Why CPU Pinning Exists in Kubernetes

Normally, Kubernetes gives Pods:

  • shared CPU time
  • no pinning
  • threads migrate across cores
  • memory allocated across NUMA nodes

This is totally fine for:

  • web apps
  • API servers
  • batch jobs
  • most microservices

But NOT okay for:

  • network dataplanes (Envoy, Cilium agent)
  • AI inference workers
  • Redis / Memcached
  • JVM apps with GC-sensitive behavior
  • High-frequency trading systems
  • HPC workloads
  • DPDK-based NFV systems

These workloads suffer from:

  • CPU cache thrash
  • cross-core migration
  • NUMA remote memory access
  • scheduler jitter

Therefore Kubernetes introduced:

  • CPU Manager
  • Topology Manager


PART 2 — CPU Manager Policies

Enable CPU Manager via kubelet:

--cpu-manager-policy=<none|static>
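The same policy can also be set through the kubelet configuration file instead of a flag. A minimal sketch, assuming the kubeadm default config path; the reservedSystemCPUs value is illustrative (the static policy requires some CPUs to be reserved for the system):

# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# CPUs held back for OS/kubelet daemons; exclusive CPUs are carved out of the remainder
reservedSystemCPUs: "0,1"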

1. Policy: none (default)

Behavior:

  • cpuset includes all CPUs
  • processes float freely
  • no exclusivity
  • CPU requests → shares
  • CPU limits → CFS quota
  • lowest isolation

Suitable for 95% of app workloads.
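Concretely, with the default 100 ms CFS period, the requests → shares and limits → quota translation above works out like this (cgroup v1 names; cgroup v2 uses cpu.weight and cpu.max instead):

requests.cpu: 500m  →  cpu.shares       = 512      # 500/1000 × 1024
limits.cpu:   1     →  cpu.cfs_quota_us = 100000   # 1 CPU × 100000 µs period
limits.cpu:   2     →  cpu.cfs_quota_us = 200000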


2. Policy: static (the important one)

--cpu-manager-policy=static

Enables:

  • Guaranteed Pods with whole-integer CPU request → get dedicated CPU cores
  • Kubelet removes those cores from “shared pool”
  • Sets cpuset.cpus to the exclusive CPUs

Requirements:

  • QoS class = Guaranteed
  • CPU request = limit
  • CPU request is a whole number (no millis)

Example:

resources:
  requests:
    cpu: "2"
  limits:
    cpu: "2"

Pod will receive:

cpuset.cpus = "4-5"   # example cpus

Threads inside the container can ONLY execute on those 2 CPUs.
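You can verify the pinning from inside the container (the pod name is illustrative and the image must contain grep):

$ kubectl exec my-pinned-pod -- grep Cpus_allowed_list /proc/1/status
Cpus_allowed_list:  4-5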

No throttling

Because:

  • limit = request
  • the container owns whole CPUs, so the CFS quota matches everything those CPUs can deliver
  • threads run unrestricted on their dedicated cores

This is extremely powerful.


3. Policy options (fine-tuning static)

There is no third policy value today. Instead, the static policy can be fine-tuned through CPU Manager policy options (for example full-pcpus-only, distribute-cpus-across-numa, align-by-socket), which control how the exclusive CPUs are chosen.

Proposals for more flexible and fractional exclusive allocation are still evolving; static remains the industry standard for low-latency and AI.
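For example, to make the kubelet hand out only full physical cores (so the workload never shares a hyperthread sibling with another Pod), assuming a kubelet version that supports policy options:

--cpu-manager-policy=static
--cpu-manager-policy-options=full-pcpus-only=true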


PART 3 — How Whole-CPU Guarantees Actually Work (internals)

When the scheduler places a Pod, kubelet:

  1. Looks at the node’s CPU topology
  2. Has two CPU pools:

    • sharedPool
    • exclusivePool
  3. For a 2-core request:

    • Removes 2 cores from sharedPool
    • Places Pod in exclusivePool

These cores are marked “allocated” in a checkpoint file:

/var/lib/kubelet/cpu_manager_state

Format example (note that the exclusive CPUs are removed from defaultCpuSet):

{
  "policyName": "static",
  "defaultCpuSet": "0-1",
  "entries": {
    "podUID": {
      "app": "2-3"
    }
  },
  "checksum": 1234567890
}

This is how kubelet ensures:

  • exclusive use
  • state persistence across kubelet restarts
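On a node you can inspect the live allocations directly (root required; jq is optional pretty-printing):

$ sudo cat /var/lib/kubelet/cpu_manager_state | jq .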

PART 4 — NUMA Awareness via Topology Manager

Enable via kubelet:

--topology-manager-policy=<none|best-effort|restricted|single-numa-node>
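Or equivalently in the kubelet configuration file; the scope field controls whether alignment is computed per container or for the whole Pod:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod   # or "container" (the default)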

Why this matters

On multi-socket systems:

  • CPU sockets = NUMA nodes
  • Memory attached to each node
  • Remote memory accesses are slower
  • GPUs/NIC queues tied to specific NUMA domains

If you allocate:

  • CPUs from socket 0
  • HugePages from socket 1
  • GPU from socket 1

Your performance tanks.
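You can see the layout the Topology Manager reasons about directly on the node (values below are illustrative; numactl may need to be installed):

$ lscpu | grep -i numa
NUMA node(s):        2
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31

$ numactl --hardware    # also shows per-node memory sizes and inter-node distances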

Topology Manager Policies

1. none (default)

No alignment. Requests can come from anywhere.


2. best-effort

Tries to align resources within one NUMA node when possible, but does NOT fail the Pod if it can't.


3. restricted

If NUMA alignment cannot be met, the kubelet rejects the Pod on that node (admission failure).

This is for moderately sensitive workloads.


4. single-numa-node (strictest)

All resources must come from one NUMA node:

  • all CPUs
  • hugepages
  • device plugin resources

If not possible → the Pod is rejected on that node.

This is the policy used by:

  • DPDK
  • SR-IOV networking
  • high-performance Redis
  • AI inference workloads

PART 5 — How CPU Manager + Topology Manager Work Together

Flow:

  1. Scheduler assigns Pod to node
  2. Kubelet:

    • checks NUMA topology
    • requests CPUs, hugepages, GPUs, RDMA devices
  3. Policies decide:

    • whether the node can satisfy NUMA alignment
  4. If success:

    • CPU Manager picks physical cores
    • cpuset.cpus / cpuset.mems set
  5. Container runtime starts container with CPU pinning
  6. Pod now runs in deterministic NUMA domain

PART 6 — Combining CPUs + HugePages + GPUs + NIC queues

This is where Kubernetes becomes HPC/hyper-optimized.

Example: AI inference Pod

  • 2 CPUs pinned
  • GPU device plugin (NUMA node 1)
  • HugePages 2Mi from NUMA node 1
  • Topology Manager = single-numa-node

All resources land on the same NUMA node:

cpuset.cpus = "8-9"
cpuset.mems = "1"
hugetlb.2MB.max = 1Gi   (cgroup v2; pages backed by NUMA node 1)
device plugin = GPU0 (NUMA node 1)
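A sketch of the resources stanza that produces this layout (quantities are illustrative; the GPU resource name comes from the NVIDIA device plugin, and hugepages requests must equal limits):

resources:
  requests:
    cpu: "2"
    memory: "4Gi"
    hugepages-2Mi: "1Gi"
    nvidia.com/gpu: 1
  limits:
    cpu: "2"
    memory: "4Gi"
    hugepages-2Mi: "1Gi"
    nvidia.com/gpu: 1

In practice the container also mounts the huge pages via an emptyDir volume with medium: HugePages.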

This alignment can improve throughput by 20–40% on some AI models and dramatically reduces latency jitter.

PART 7 — Real-World Example Configurations

1. AI/ML Inference Pod

resources:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "8Gi"
    nvidia.com/gpu: 1

Node:

--cpu-manager-policy=static
--topology-manager-policy=single-numa-node
--feature-gates=MemoryQoS=true    (alpha; requires cgroup v2)

Outcome:

  • pinned CPUs
  • pinned NUMA memory
  • local HugePages
  • GPU aligned
  • extremely stable inference latency

2. Low-latency Envoy/Cilium Pod

resources:
  requests:
    cpu: "2"
    memory: "1Gi"   # memory request = limit is also required for Guaranteed QoS
  limits:
    cpu: "2"
    memory: "1Gi"

Paired with:

  • CPU pinning
  • dedicated nodes (taints; see the sketch below)
  • static CPU Manager
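A sketch of the "dedicated node" pairing (taint key, label, and values are illustrative):

# On the node pool:  kubectl taint nodes <node> dedicated=dataplane:NoSchedule
# In the Pod spec:
nodeSelector:
  dedicated: dataplane
tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "dataplane"
    effect: "NoSchedule"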

Outcome:

  • consistent packet processing
  • minimal jitter
  • predictable latency

PART 8 — NUMA Pitfalls & Gotchas

1. Fractional CPUs disable pinning

CPU request must be:

1, 2, 3... NOT 500m
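For example, this container stays in the shared pool even inside a Guaranteed Pod, because its request is fractional:

resources:
  requests:
    cpu: "1500m"   # not a whole number → no exclusive CPUs
  limits:
    cpu: "1500m"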

2. Not Guaranteed QoS → no pinning

Meaning:

requests != limits (CPU or memory, in any container) → not Guaranteed → no pinning

3. Exclusive CPUs are per container, not per Pod

Every container must set requests = limits for the Pod to be Guaranteed, but only containers whose CPU request is a whole integer get their own pinned cpuset; the rest run on the shared pool, as sketched below.
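A sketch (names are illustrative): the main container gets its own pinned cpuset, while the sidecar floats on the shared pool, yet the Pod as a whole is still Guaranteed because every container sets requests = limits:

containers:
  - name: app           # whole-integer CPU → exclusive cpuset
    resources:
      requests: { cpu: "2", memory: "2Gi" }
      limits:   { cpu: "2", memory: "2Gi" }
  - name: sidecar       # fractional CPU → shared pool
    resources:
      requests: { cpu: "250m", memory: "256Mi" }
      limits:   { cpu: "250m", memory: "256Mi" }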

4. Resource fragmentation

If cores get allocated unevenly across nodes, Pods may fail NUMA alignment.

5. Restart storm

When the kubelet restarts, the CPU Manager restores its state from the checkpoint file. If the checkpoint no longer matches the configured policy or the node's CPU set, the kubelet can fail to start and affected Pods may be evicted.
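If you change the CPU Manager policy on a node that already has a checkpoint, the usual remediation is to drain the node, remove the stale state file, and restart the kubelet (the node name is illustrative):

$ kubectl drain my-node --ignore-daemonsets
$ sudo systemctl stop kubelet
$ sudo rm /var/lib/kubelet/cpu_manager_state
$ sudo systemctl start kubelet
$ kubectl uncordon my-node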

6. Node images (AKS/EKS/GKE) differ

Some cloud providers mask the NUMA topology from the VM. If NUMA is hidden, the Topology Manager has nothing to align and becomes ineffective.


SEGMENT 6 SUMMARY

You now deeply understand:

CPU Manager

  • Enables exclusive CPU allocation
  • Requires static policy
  • Guarantees whole-CPU cores
  • Uses cpuset to pin workloads
  • Maintains state in checkpoint file

Topology Manager

  • Ensures NUMA alignment for:

    • CPUs
    • HugePages
    • GPUs
    • NIC queues
  • Prevents cross-node memory access
  • Critical for HPC/AI/DPDK workloads

Combined Power

CPU Manager + Topology Manager transform Kubernetes from a “cloud orchestrator” into a high-performance computing platform.


Next Segment Options

Where would you like to go next?

👉 Segment 7 — Node Allocatable, System Reservations, Eviction & Kubelet internals

(critical for production node sizing & stability)

👉 Segment 8 — Linux Kernel Memory Internals for Kubernetes

(page cache, slab, OOM scoring, reclaim behavior)

👉 Segment 9 — Putting it all together: Full Node Resource Isolation Architecture

(Resource pools, QoS, CPU manager, memory QoS, hugepages, eviction, bin packing)

Pick your next deep dive!
