Kubernetes Resource Isolation - 05. I/O, PIDs, CPUSET, HugePages & Other cgroup Controllers

October 06, 2025  5 minute read  

Segment 5 is where we explore the less talked-about but extremely powerful cgroup controllers that Kubernetes uses (or can use) to isolate workloads:

  • I/O (blkio / io controller)
  • PIDs (pids.max)
  • cpuset (CPU pinning & NUMA isolation)
  • hugetlb (HugePages)
  • Miscellaneous controllers (devices, perf_event, freezer, etc.)

This is a very deep systems topic, so we'll keep it realistic, Kubernetes-specific, and implementation-level.


SEGMENT 5 — I/O, PIDs, CPUSET, HugePages & Other cgroup Controllers

We’ll cover each subsystem with:

  1. What cgroup files exist
  2. How Kubernetes uses them
  3. What guarantees/isolation look like
  4. Real examples
  5. Pitfalls and best practices

PART 1 — I/O Isolation: blkio / io controller

I/O isolation exists in both cgroup v1 (blkio) and cgroup v2 (io). In Kubernetes, I/O isolation is not heavily exposed via PodSpec.

But the runtime does apply some defaults.


1. blkio / io controller capabilities

cgroup v1 (blkio):

  • blkio.weight
  • blkio.weight_device
  • blkio.throttle.read_bps_device
  • blkio.throttle.write_bps_device
  • blkio.throttle.read_iops_device
  • blkio.throttle.write_iops_device

cgroup v2 (io controller):

  • io.max → throttle by IOPS/BPS
  • io.weight → relative fairness

Example (cgroup v2; for device 259:0, this caps reads at 2 MiB/s and writes at 1 MiB/s):

echo "259:0 rbps=2097152 wbps=1048576" > io.max

2. How Kubernetes uses I/O

Right now:

  • Kubernetes does NOT expose I/O limits or weights in PodSpec (there are no official API fields such as resources.limits.io)

Why?

  • Storage backends vary greatly (local SSD vs network vs CSI)
  • Hard to apply consistently across distributed runtimes
  • Community discussion exists, but no production API yet

Kubernetes only sets:

  • basic default weights via container runtime
  • cgroups for isolation and accounting, but not throttling

3. Practical I/O isolation reality

In practice:

  • If two pods hammer the disk, blkio/io may enforce fairness (cgroup v2 is more capable)
  • On most cloud VMs (AKS/EKS/GKE), the block device itself enforces IOPS/BPS (e.g., Azure disks throttle)

So Kubernetes’ own blkio isolation is minimal.


4. When I/O isolation matters

Most relevant when:

  • You run stateful workloads with local SSD
  • High I/O sidecars (log collectors)
  • Large database Pods
  • AI workloads streaming data
  • HPC nodes

If needed, you can tune I/O manually at the container runtime level (systemd units or containerd config), as sketched below.
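
For example, with the systemd cgroup driver you can throttle a pod slice directly via systemd resource-control properties (a sketch; the slice name is illustrative and must match an actual pod slice on your node):

sudo systemctl set-property --runtime kubepods-burstable-pod1234.slice \
  IOReadBandwidthMax="/dev/sda 2M" \
  IOWriteBandwidthMax="/dev/sda 1M"
# --runtime applies until reboot; drop it to persist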


PART 2 — PIDs Controller (pids.max)

This is one of the most important cgroups for preventing node meltdown.


1. What it does

Controls how many processes/threads can be created by a Pod or container.

cgroup v1 & v2:

pids.max
pids.current

2. How Kubernetes uses pids.max

Kubernetes supports per-Pod PID limits via the kubelet, but they are not enforced out of the box.

Default kubelet setting:

--pod-max-pids=-1   # default: no dedicated per-pod PID limit

If set, kubelet writes values like:

pids.max = <value>
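
On a node you can read back what the kubelet applied (a sketch; the pod cgroup path is illustrative and depends on your cgroup driver):

cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/pids.max
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/pids.current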

3. Why PID limits matter

Typical meltdown scenario:

  • A container spawns thousands of threads (Java thread leak, fork bomb)
  • Exhausts the kernel PID space (pid_max, at most ~4.2 million)
  • Node becomes unusable

Setting:

pids.max = 1000

protects the node.
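
You can watch the controller do its job outside Kubernetes with a throwaway cgroup (a sketch for cgroup v2; assumes the pids controller is enabled on the root, which systemd distros do by default):

sudo mkdir /sys/fs/cgroup/pidcap                       # create a test cgroup
echo 64 | sudo tee /sys/fs/cgroup/pidcap/pids.max      # cap it at 64 tasks
echo $$ | sudo tee /sys/fs/cgroup/pidcap/cgroup.procs  # move this shell in
for i in $(seq 1 100); do sleep 60 & done              # forks beyond the cap fail with EAGAIN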


4. How to configure PID limits

Kubelet supports:

--pod-max-pids=1000
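
The KubeletConfiguration counterpart of this flag is the podPidsLimit field (a sketch; the config path is assumed and the field must not already be set):

echo "podPidsLimit: 1000" | sudo tee -a /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet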

Container runtimes (CRI-O, containerd) also support per-container PID limits.

Example in CRI-O:

[crio.runtime]
pids_limit = 2048

This is powerful and recommended for security.


PART 3 — cpuset Controller (CPU pinning & NUMA isolation)

The cpuset controller determines:

  • the specific CPUs a container can run on
  • NUMA boundaries (which memory nodes it may allocate from)

Files:

  • cpuset.cpus
  • cpuset.mems

1. Does Kubernetes use cpuset for normal Pods?

Yes, but only in one case: CPU Manager “static” policy

Enable:

--cpu-manager-policy=static

Then a Guaranteed Pod with a whole-integer CPU request equal to its limit:

  • gets exclusive CPUs
  • kubelet writes cpuset.cpus="X-Y" on the container cgroup (see the example below)

This greatly reduces:

  • scheduling jitter
  • noisy neighbor issues
  • context switching
  • NUMA cross-domain latency
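
A minimal Pod that qualifies for exclusive cores under the static policy (Guaranteed QoS: requests equal limits, whole-integer CPUs; name and image are illustrative):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pinned
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
EOF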

2. NUMA-aware placement (Topology Manager)

Enable:

--topology-manager-policy=restricted|best-effort|single-numa-node

When combined with:

  • static CPU manager
  • hugepages
  • devices (GPUs, NIC queues)

Kubernetes ensures resource allocations are NUMA-aligned.

Super important for:

  • AI inference
  • Redis
  • HFT systems
  • Dataplane agents (Cilium, Envoy)

3. Default behavior (without CPU Manager)

cpuset.cpus = <all CPUs on the node>
cpuset.mems = <all NUMA nodes>

So by default:

  • processes bounce across cores
  • memory can be allocated on remote NUMA nodes
  • performance may jitter
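
You can verify what a container actually received by reading its cpuset files on the node (a sketch; cgroup paths vary by cgroup driver and runtime, and the UID/ID placeholders are illustrative):

POD_CG=/sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/cri-containerd-<id>.scope
cat "$POD_CG/cpuset.cpus"             # e.g. "2-3" when pinned; empty means "inherit"
cat "$POD_CG/cpuset.cpus.effective"   # the resolved CPU set actually in effect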

PART 4 — hugetlb (Huge Pages)

HugePages allow:

  • larger memory pages (2Mi, 1Gi)
  • fewer TLB misses
  • lower CPU overhead on memory-heavy apps

Cgroup controller files (cgroup v1):

hugetlb.<size>.limit_in_bytes
hugetlb.<size>.usage_in_bytes

(cgroup v2 uses hugetlb.<size>.max and hugetlb.<size>.current)

1. Kubernetes FULLY supports HugePages

Example (HugePages requests must equal limits; Pods should also set normal cpu/memory requests, since HugePages are accounted separately from regular memory):

resources:
  limits:
    hugepages-2Mi: 1Gi
    memory: 1Gi
  requests:
    hugepages-2Mi: 1Gi
    memory: 1Gi

Important: HugePages cannot be overcommitted and must be pre-reserved at the node OS level:

vm.nr_hugepages=512
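
Reserving and verifying them on a node looks like this (persist the sysctl via /etc/sysctl.d/ if needed):

sudo sysctl -w vm.nr_hugepages=512   # 512 x 2 MiB pages = 1 GiB reserved
grep Huge /proc/meminfo              # check HugePages_Total / HugePages_Free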

2. How hugepage isolation works

  • Kubernetes allocates HugePages at Pod start
  • cgroup hugetlb enforces size-specific limits
  • No memory swapping allowed
  • Very deterministic performance

Critical for:

  • DPDK
  • high-throughput packet processing
  • in-memory DBs (Redis, Memcached)
  • ML inference models

PART 5 — Devices Controller

Not heavily used by Kubernetes except for:

  • Docker/CRI runtimes creating container device nodes
  • GPU devices managed externally via device plugins

cgroup files (cgroup v1; cgroup v2 implements device control with eBPF programs instead):

  • devices.allow
  • devices.deny

Kubernetes (via the runtime) usually sets:

  • deny all devices by default
  • allow only the specific device nodes the container needs

But device management is mostly delegated to:

  • container runtime
  • device plugins (NVIDIA, SR-IOV, RDMA)

PART 6 — freezer, perf_event, rdma, net_prio

These controllers exist but Kubernetes barely uses them:

freezer

  • Can pause cgroups
  • Kubernetes does not use this

perf_event

  • Controls access to performance counters
  • Kubernetes does not manage this

rdma

  • Relevant for RDMA/SR-IOV HPC workloads

net_prio / net_cls (cgroup v1 only; removed in cgroup v2)

  • Used for traffic classification
  • Kubernetes network plugins rarely rely on them now

PART 7 — How These Controllers Combine for Isolation

Here’s a realistic mapping:

Resource Type       Controller                                     Kubernetes Support
CPU time            cpu.cfs_quota_us (v1) / cpu.max (v2)           Yes
CPU fairness        cpu.shares (v1) / cpu.weight (v2)              Yes
CPU pinning         cpuset                                         Yes (CPU Manager static)
Memory limit        memory.limit_in_bytes (v1) / memory.max (v2)   Yes
Memory guarantees   memory.min (MemoryQoS)                         Yes (alpha)
Memory throttling   memory.high (MemoryQoS)                        Yes (alpha)
HugePages           hugetlb                                        Yes
PIDs                pids.max                                       Yes
I/O throttling      blkio (v1) / io (v2)                           No (not exposed)
Network QoS         net_cls / net_prio                             No
Device access       devices                                        Yes (runtime + device plugins)

PART 8 — Practical Best Practices Across Controllers

1. ALWAYS set pids.max (critical)

Prevents node meltdown from:

  • fork bombs
  • thread leaks
  • JVM runaway threads

2. Enable MemoryQoS

Fixes a decade-long Kubernetes memory fairness problem.
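
MemoryQoS is still gated behind a kubelet feature gate and requires cgroup v2 (alpha at the time of writing; a sketch):

--feature-gates=MemoryQoS=true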

3. For CPU-critical workloads:

Enable:

--cpu-manager-policy=static
--topology-manager-policy=single-numa-node
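
The KubeletConfiguration equivalent, with the node-prep caveats that trip people up (a sketch; the config path is assumed, and the static policy also requires a non-zero CPU reservation via kubeReserved, systemReserved, or reservedSystemCpus):

echo "cpuManagerPolicy: static" | sudo tee -a /var/lib/kubelet/config.yaml
echo "topologyManagerPolicy: single-numa-node" | sudo tee -a /var/lib/kubelet/config.yaml
# When changing cpuManagerPolicy on an existing node, drain it and remove
# /var/lib/kubelet/cpu_manager_state before restarting the kubelet.
sudo systemctl restart kubelet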

4. For I/O-sensitive workloads:

Use:

  • local SSD
  • fsGroup for permissions
  • separate disk for DB logs/data
  • dedicated nodes (taints; see the example below)
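
Dedicated nodes are typically carved out with a taint plus matching tolerations (node name and key are illustrative):

kubectl taint nodes storage-node-1 dedicated=io-heavy:NoSchedule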

5. Use HugePages only when known beneficial

If the app isn't written to allocate HugePages (via hugetlbfs, shmget(SHM_HUGETLB), or mmap(MAP_HUGETLB)), it won't use them.

6. For NIC/GPU workloads

Use device plugins, not manual device rules.


SEGMENT 5 SUMMARY

You now fully understand the remaining major cgroup controllers:

I/O

  • Kubernetes does not expose blkio/io throttling
  • Cloud disks often enforce IOPS/BPS instead

PIDs

  • Critical protection via pids.max
  • Prevents node-wide PID exhaustion

cpuset

  • Enables CPU pinning, NUMA alignment
  • Requires CPU Manager static policy

hugetlb

  • Full Kubernetes support
  • Needs pre-reserved HugePages on node

Other controllers

  • Devices used heavily for GPU/SR-IOV
  • Others (freezer/perf_event) mostly unused

These controllers cover the last 20% of Kubernetes isolation behavior that most engineers don’t know exists.
