Kubernetes Resource Isolation - 05. I/O, PIDs, CPUSET, HugePages & Other cgroup Controllers
Segment 5 is where we explore the less talked-about but extremely powerful cgroup controllers that Kubernetes uses (or can use) to isolate workloads:
- I/O (blkio / io controller)
- PIDs (pids.max)
- cpuset (CPU pinning & NUMA isolation)
- hugetlb (HugePages)
- Miscellaneous controllers (devices, perf_event, freezer, etc.)
This is a very deep systems topic — we’ll go realistic, Kubernetes-specific, and implementation-level.
SEGMENT 5 — I/O, PIDs, CPUSET, HugePages & Other cgroup Controllers
We’ll cover each subsystem with:
- What cgroup files exist
- How Kubernetes uses them
- What guarantees/isolation look like
- Real examples
- Pitfalls and best practices
PART 1 — I/O Isolation: blkio / io controller
I/O isolation exists in both cgroup v1 (blkio) and cgroup v2 (io).
In Kubernetes, I/O isolation is not heavily exposed via PodSpec.
But the runtime does apply some defaults.
1. blkio / io controller capabilities
cgroup v1 (blkio):
- blkio.weight
- blkio.weight_device
- blkio.throttle.read_bps_device
- blkio.throttle.write_bps_device
- blkio.throttle.read_iops_device
- blkio.throttle.write_iops_device
cgroup v2 (io controller):
- io.max → throttle by IOPS/BPS
- io.weight → relative fairness
Example (throttle device 259:0 to 2 MiB/s reads and 1 MiB/s writes):
io.max = "259:0 rbps=2097152 wbps=1048576"
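If you want to experiment outside of Kubernetes, io.max can be written by hand on a pod's cgroup. A minimal sketch, assuming cgroup v2 with the systemd cgroup driver (the pod slice path below is illustrative and varies by cgroup driver, QoS class, and runtime):

# Find the major:minor of the block device to throttle
lsblk -o NAME,MAJ:MIN /dev/sda        # e.g. prints 8:0

# Throttle one pod's cgroup to ~2 MiB/s reads and ~1 MiB/s writes on that device
POD_CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
echo "8:0 rbps=2097152 wbps=1048576" > "$POD_CG/io.max"
cat "$POD_CG/io.max"                  # verify the rule was applied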
2. How Kubernetes uses I/O
Right now:
- Kubernetes does NOT expose I/O limits or weights in PodSpec
  (there is no official API field like resources.limits.io)
Why?
- Storage backends vary greatly (local SSD vs network vs CSI)
- Hard to apply consistently across distributed runtimes
- Community discussion exists, but no production API yet
Kubernetes only sets:
- basic default weights via container runtime
- cgroups for isolation and accounting, but not throttling
3. Practical I/O isolation reality
In practice:
- If two pods hammer the disk, blkio/io may enforce fairness (cgroup v2 is more capable)
- On most cloud VMs (AKS/EKS/GKE), the block device itself enforces IOPS/BPS (e.g., Azure disks throttle)
So Kubernetes’ own blkio isolation is minimal.
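Even without throttling, cgroup v2 gives you per-cgroup I/O accounting that you can inspect on the node. A quick check (the path is illustrative):

# Per-device I/O counters for a pod cgroup (cgroup v2)
cat /sys/fs/cgroup/kubepods.slice/<pod-slice>/io.stat
# one line per device, with rbytes / wbytes / rios / wios counters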
4. When I/O isolation matters
Most relevant when:
- You run stateful workloads with local SSD
- High I/O sidecars (log collectors)
- Large database Pods
- AI workloads streaming data
- HPC nodes
We can manually tune via container runtime if needed (systemd units or containerd config).
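For example, with the systemd cgroup driver you could apply an I/O weight or a bandwidth cap directly to a pod slice; the slice name below is an assumption about your node's layout, not something Kubernetes configures for you:

# De-prioritize disk I/O for all burstable pods relative to other slices
systemctl set-property --runtime kubepods-burstable.slice IOWeight=50

# Cap read bandwidth from /dev/sda for that slice
systemctl set-property --runtime kubepods-burstable.slice IOReadBandwidthMax="/dev/sda 10M"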
PART 2 — PIDs Controller (pids.max)
This is one of the most important cgroups for preventing node meltdown.
1. What it does
Controls how many processes/threads can be created by a Pod or container.
cgroup v1 & v2:
pids.max
pids.current
2. Kubernetes uses pids.max
YES — Kubernetes supports per-Pod PID limits via the kubelet, but the default is unlimited.
Default Kubelet setting:
--pod-max-pids=-1   # unlimited (default)
If you set a positive value, the kubelet writes it into the Pod-level cgroup:
pids.max = <value>
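To check the limit and current usage for a running Pod, read the pids files on the node; the path below is illustrative (cgroup v2, systemd driver, Burstable QoS):

POD_CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
cat "$POD_CG/pids.max"       # "max" means unlimited
cat "$POD_CG/pids.current"   # processes/threads currently charged to the Pod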
3. Why PID limits matter
Typical meltdown scenario:
- A container spawns thousands of threads (Java thread leak, fork bomb)
- Exhausts Linux PID namespace (~4 million)
- Node becomes unusable
Setting:
pids.max = 1000
protects the node.
4. How to configure PID limits
Kubelet supports:
--pod-max-pids=1000
ContainerRuntime (CRI-O/containerd) also supports per-container pid limits.
Example in CRI-O:
[crio.runtime]
pids_limit = 2048
This is powerful and recommended for security.
PART 3 — cpuset Controller (CPU pinning & NUMA isolation)
This controller determines:
- specific CPUs a container can run on
- NUMA boundaries (which memory node to use)
Files:
- cpuset.cpus
- cpuset.mems
1. Does Kubernetes use cpuset for normal Pods?
Yes, but only in one case: CPU Manager “static” policy
Enable:
--cpu-manager-policy=static
Then:
- A Guaranteed Pod
- With a whole-integer CPU request/limit
→ gets exclusive CPUs
→ the kubelet sets cpuset.cpus = "X-Y" on the container cgroup
This greatly reduces:
- scheduling jitter
- noisy neighbor issues
- context switching
- NUMA cross-domain latency
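You can verify the pinning on the node: with the static policy, a Guaranteed container with integer CPUs should show a narrow CPU range instead of the whole machine. A sketch (cgroup v2; the slice/scope names depend on your runtime and are only illustrative):

cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<UID>.slice/cri-containerd-<ID>.scope/cpuset.cpus.effective
# e.g. "4-5" for a container that requested 2 exclusive CPUs
# a non-pinned container typically shows every CPU, e.g. "0-63"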
2. NUMA-aware placement (Topology Manager)
Enable:
--topology-manager-policy=restricted|best-effort|single-numa-node
When combined with:
- static CPU manager
- hugepages
- devices (GPUs, NIC queues)
Kubernetes ensures resource allocations are NUMA-aligned.
Super important for:
- AI inference
- Redis
- HFT systems
- Dataplane agents (Cilium, Envoy)
3. Default behavior (without CPU Manager)
cpuset.cpus = <all CPUs on the node>
cpuset.mems = <all NUMA nodes>
So by default:
- processes bounce across cores
- memory can be allocated on remote NUMA nodes
- Performance may jitter
PART 4 — hugetlb (Huge Pages)
HugePages allow:
- larger memory pages (2Mi, 1Gi)
- fewer TLB misses
- lower CPU overhead on memory-heavy apps
Cgroup controller:
hugetlb.<size>.limit_in_bytes (cgroup v1) / hugetlb.<size>.max (cgroup v2)
hugetlb.<size>.usage_in_bytes (cgroup v1) / hugetlb.<size>.current (cgroup v2)
1. Kubernetes FULLY supports HugePages
Example:
resources:
  limits:
    hugepages-2Mi: 1Gi
    memory: 1Gi
  requests:
    memory: 1Gi
Important: HugePages cannot be overcommitted, and they must be reserved at the node OS level:
vm.nr_hugepages=512
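A minimal node-side sketch for reserving 2Mi pages and confirming the kubelet advertises them (the node name is hypothetical; persist the sysctl via /etc/sysctl.d/ for reboots):

sysctl -w vm.nr_hugepages=512
grep HugePages_Total /proc/meminfo
# after restarting the kubelet, the pages appear as an allocatable node resource
kubectl describe node worker-1 | grep hugepages-2Mi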
2. How hugepage isolation works
- Kubernetes allocates HugePages at Pod start
- cgroup hugetlb enforces size-specific limits
- No memory swapping allowed
- Very deterministic performance
Critical for:
- DPDK
- high-throughput packet processing
- in-memory DBs (Redis, Memcached)
- ML inference models
PART 5 — Devices Controller
Not heavily used by Kubernetes except for:
- Docker/CRI runtimes creating container device nodes
- GPU devices managed externally via device plugins
cgroup files:
- devices.allow
- devices.deny
(cgroup v1 only — on cgroup v2 the same policy is enforced via an eBPF device filter rather than control files)
The container runtime (on Kubernetes' behalf) usually sets:
- deny everything by default
- allow only the specific device nodes the container needs
But device management is mostly delegated to:
- container runtime
- device plugins (NVIDIA, SR-IOV, RDMA)
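On a cgroup v1 node you can inspect the resulting whitelist for a container; the path and entries below are only illustrative of what you would typically find:

cat /sys/fs/cgroup/devices/kubepods/pod<UID>/<container-id>/devices.list
# typical entries: "c 1:3 rwm" (/dev/null), "c 1:5 rwm" (/dev/zero), "c 5:0 rwm" (/dev/tty)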
PART 6 — freezer, perf_event, rdma, net_prio
These controllers exist but Kubernetes barely uses them:
freezer
- Can pause cgroups
- Kubernetes does not use this
perf_event
- Controls access to performance counters
- Kubernetes does not manage this
rdma
- Relevant for RDMA/SR-IOV HPC workloads
net_prio / net_cls (deprecated)
- Used for traffic classification
- Kubernetes network plugins rarely rely on them now
PART 7 — How These Controllers Combine for Isolation
Here’s a realistic mapping:
| Resource Type | Controller | Kubernetes Support |
|---|---|---|
| CPU time | cpu.cfs_quota_us (v1) / cpu.max (v2) | Yes |
| CPU fairness | cpu.shares / cpu.weight | Yes |
| CPU pinning | cpuset | Yes (CPU Manager static) |
| Memory | memory.max | Yes |
| Memory guarantees | memory.min (MemoryQoS) | Yes (new) |
| Memory throttling | memory.high | Yes (new) |
| HugePages | hugetlb | Yes |
| PIDs | pids.max | Yes |
| I/O throttle | blkio/io | No (not exposed) |
| Network QoS | net_cls/net_prio | No |
| Device access | devices | Yes (runtime + device plugin) |
PART 8 — Practical Best Practices Across Controllers
1. ALWAYS set pids.max (critical)
Prevents node meltdown from:
- fork bombs
- thread leaks
- JVM runaway threads
2. Enable MemoryQoS
Fixes a decade-long Kubernetes memory fairness problem.
3. For CPU-critical workloads:
Enable:
--cpu-manager-policy=static
--topology-manager-policy=single-numa-node
4. For I/O-sensitive workloads:
Use:
- local SSD
- fsGroup for permissions
- separate disk for DB logs/data
- dedicated nodes (taints)
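For the dedicated-nodes point above, a quick sketch (the node name and taint key are made up):

# Reserve a node for I/O-heavy database Pods
kubectl taint nodes storage-node-1 dedicated=database:NoSchedule
# matching Pods then carry a toleration for dedicated=database and, usually, a nodeSelector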
5. Use HugePages only when they're known to be beneficial
If the application isn't written to allocate HugePages (e.g., via hugetlbfs or mmap with MAP_HUGETLB), it simply won't use them.
6. For NIC/GPU workloads
Use device plugins, not manual device rules.
SEGMENT 5 SUMMARY
You now fully understand the remaining major cgroup controllers:
I/O
- Kubernetes does not expose blkio/io throttling
- Cloud disks often enforce IOPS/BPS instead
PIDs
- Critical protection via pids.max
- Prevents node-wide PID exhaustion
cpuset
- Enables CPU pinning, NUMA alignment
- Requires CPU Manager static policy
hugetlb
- Full Kubernetes support
- Needs pre-reserved HugePages on node
Other controllers
- Devices used heavily for GPU/SR-IOV
- Others (freezer/perf_event) mostly unused
These controllers cover the last 20% of Kubernetes isolation behavior that most engineers don’t know exists.