Kubernetes Resource Isolation - 05. I/O, PIDs, CPUSET, HugePages & Other cgroup Controllers
Segment 5 is where we explore the less talked-about but extremely powerful cgroup controllers that Kubernetes uses (or can use) to isolate workloads:
- I/O (blkio / io controller)
- PIDs (pids.max)
- cpuset (CPU pinning & NUMA isolation)
- hugetlb (HugePages)
- Miscellaneous controllers (devices, perf_event, freezer, etc.)
This is a very deep systems topic — we’ll go realistic, Kubernetes-specific, and implementation-level.
SEGMENT 5 — I/O, PIDs, CPUSET, HugePages & Other cgroup Controllers
We’ll cover each subsystem with:
- What cgroup files exist
- How Kubernetes uses them
- What guarantees/isolation look like
- Real examples
- Pitfalls and best practices
PART 1 — I/O Isolation: blkio / io controller
I/O isolation exists in both cgroup v1 (blkio) and cgroup v2 (io).
In Kubernetes, I/O isolation is not heavily exposed via PodSpec.
But the runtime does apply some defaults.
1. blkio / io controller capabilities
cgroup v1 (blkio):
- blkio.weight
- blkio.weight_device
- blkio.throttle.read_bps_device
- blkio.throttle.write_bps_device
- blkio.throttle.read_iops_device
- blkio.throttle.write_iops_device
cgroup v2 (io controller):
- io.max → throttle by IOPS/BPS
- io.weight → relative fairness
Example (throttle device 259:0 to 2 MiB/s reads and 1 MiB/s writes):
io.max = "259:0 rbps=2097152 wbps=1048576"
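If you want to experiment outside of Kubernetes, io.max can be written by hand on a pod's cgroup. A minimal sketch, assuming cgroup v2 with the systemd cgroup driver (the pod slice path below is illustrative and varies by cgroup driver, QoS class, and runtime):

# Find the major:minor of the block device to throttle
lsblk -o NAME,MAJ:MIN /dev/sda        # e.g. prints 8:0

# Throttle one pod's cgroup to ~2 MiB/s reads and ~1 MiB/s writes on that device
POD_CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
echo "8:0 rbps=2097152 wbps=1048576" > "$POD_CG/io.max"
cat "$POD_CG/io.max"                  # verify the rule was applied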
2. How Kubernetes uses I/O
Right now:
- Kubernetes does NOT expose I/O limits or weights in PodSpec
  (there is no official API field like resources.limits.io)
Why?
- Storage backends vary greatly (local SSD vs network vs CSI)
- Hard to apply consistently across distributed runtimes
- Community discussion exists, but no production API yet
Kubernetes only sets:
- basic default weights via container runtime
- cgroups for isolation and accounting, but not throttling
3. Practical I/O isolation reality
In practice:
- If two pods hammer the disk, blkio/io may enforce fairness (cgroup v2 is more capable)
- On most cloud VMs (AKS/EKS/GKE), the block device itself enforces IOPS/BPS (e.g., Azure disks throttle)
So Kubernetes’ own blkio isolation is minimal.
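Even without throttling, cgroup v2 gives you per-cgroup I/O accounting that you can inspect on the node. A quick check (the path is illustrative):

# Per-device I/O counters for a pod cgroup (cgroup v2)
cat /sys/fs/cgroup/kubepods.slice/<pod-slice>/io.stat
# one line per device, with rbytes / wbytes / rios / wios counters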
4. When I/O isolation matters
Most relevant when:
- You run stateful workloads with local SSD
- High I/O sidecars (log collectors)
- Large database Pods
- AI workloads streaming data
- HPC nodes
We can manually tune via container runtime if needed (systemd units or containerd config).
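For example, with the systemd cgroup driver you could apply an I/O weight or a bandwidth cap directly to a pod slice; the slice name below is an assumption about your node's layout, not something Kubernetes configures for you:

# De-prioritize disk I/O for all burstable pods relative to other slices
systemctl set-property --runtime kubepods-burstable.slice IOWeight=50

# Cap read bandwidth from /dev/sda for that slice
systemctl set-property --runtime kubepods-burstable.slice IOReadBandwidthMax="/dev/sda 10M"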
PART 2 — PIDs Controller (pids.max)
This is one of the most important cgroups for preventing node meltdown.
1. What it does
Controls how many processes/threads can be created by a Pod or container.
cgroup v1 & v2:
pids.max
pids.current
2. Kubernetes uses pids.max
YES — Kubernetes supports per-Pod PID limits via the kubelet, but the default is unlimited.
Default Kubelet setting:
--pod-max-pids=-1   # unlimited (default)
If you set a positive value, the kubelet writes it into the Pod-level cgroup:
pids.max = <value>
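To check the limit and current usage for a running Pod, read the pids files on the node; the path below is illustrative (cgroup v2, systemd driver, Burstable QoS):

POD_CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
cat "$POD_CG/pids.max"       # "max" means unlimited
cat "$POD_CG/pids.current"   # processes/threads currently charged to the Pod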
3. Why PID limits matter
Typical meltdown scenario:
- A container spawns thousands of threads (Java thread leak, fork bomb)
- Exhausts Linux PID namespace (~4 million)
- Node becomes unusable
Setting:
pids.max = 1000
protects the node.
4. How to configure PID limits
Kubelet supports:
--pod-max-pids=1000
ContainerRuntime (CRI-O/containerd) also supports per-container pid limits.
Example in CRI-O:
[crio.runtime]
pids_limit = 2048
This is powerful and recommended for security.
PART 3 — cpuset Controller (CPU pinning & NUMA isolation)
This controller determines:
- specific CPUs a container can run on
- NUMA boundaries (which memory node to use)
Files:
- cpuset.cpus
- cpuset.mems
1. Does Kubernetes use cpuset for normal Pods?
Yes, but only in one case: CPU Manager “static” policy
Enable:
--cpu-manager-policy=static
Then:
- A Guaranteed Pod
- With a whole-integer CPU request/limit
→ gets exclusive CPUs
→ the kubelet sets cpuset.cpus = "X-Y" on the container cgroup
This greatly reduces:
- scheduling jitter
- noisy neighbor issues
- context switching
- NUMA cross-domain latency
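You can verify the pinning on the node: with the static policy, a Guaranteed container with integer CPUs should show a narrow CPU range instead of the whole machine. A sketch (cgroup v2; the slice/scope names depend on your runtime and are only illustrative):

cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<UID>.slice/cri-containerd-<ID>.scope/cpuset.cpus.effective
# e.g. "4-5" for a container that requested 2 exclusive CPUs
# a non-pinned container typically shows every CPU, e.g. "0-63"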
2. NUMA-aware placement (Topology Manager)
Enable:
--topology-manager-policy=restricted|best-effort|single-numa-node
When combined with:
- static CPU manager
- hugepages
- devices (GPUs, NIC queues)
Kubernetes ensures resource allocations are NUMA-aligned.
Super important for:
- AI inference
- Redis
- HFT systems
- Dataplane agents (Cilium, Envoy)
3. Default behavior (without CPU Manager)
cpuset.cpus = <all CPUs on the node>
cpuset.mems = <all NUMA nodes>
So by default:
- processes bounce across cores
- memory can be allocated on remote NUMA nodes
- Performance may jitter
PART 4 — hugetlb (Huge Pages)
HugePages allow:
- larger memory pages (2Mi, 1Gi)
- fewer TLB misses
- lower CPU overhead on memory-heavy apps
Cgroup controller:
hugetlb.<size>.limit_in_bytes (cgroup v1) / hugetlb.<size>.max (cgroup v2)
hugetlb.<size>.usage_in_bytes (cgroup v1) / hugetlb.<size>.current (cgroup v2)
1. Kubernetes FULLY supports HugePages
Example:
resources:
  limits:
    hugepages-2Mi: 1Gi
    memory: 1Gi
  requests:
    memory: 1Gi
Important: HugePages cannot be overcommitted, and they must be reserved at the node OS level:
vm.nr_hugepages=512
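A minimal node-side sketch for reserving 2Mi pages and confirming the kubelet advertises them (the node name is hypothetical; persist the sysctl via /etc/sysctl.d/ for reboots):

sysctl -w vm.nr_hugepages=512
grep HugePages_Total /proc/meminfo
# after restarting the kubelet, the pages appear as an allocatable node resource
kubectl describe node worker-1 | grep hugepages-2Mi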
2. How hugepage isolation works
- Kubernetes allocates HugePages at Pod start
- cgroup hugetlb enforces size-specific limits
- No memory swapping allowed
- Very deterministic performance
Critical for:
- DPDK
- high-throughput packet processing
- in-memory DBs (Redis, Memcached)
- ML inference models
PART 5 — Devices Controller
Not heavily used by Kubernetes except for:
- Docker/CRI runtimes creating container device nodes
- GPU devices managed externally via device plugins
cgroup files:
- devices.allow
- devices.deny
(cgroup v1 only — on cgroup v2 the same policy is enforced via an eBPF device filter rather than control files)
The container runtime (on Kubernetes' behalf) usually sets:
- deny everything by default
- allow only the specific device nodes the container needs
But device management is mostly delegated to:
- container runtime
- device plugins (NVIDIA, SR-IOV, RDMA)
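On a cgroup v1 node you can inspect the resulting whitelist for a container; the path and entries below are only illustrative of what you would typically find:

cat /sys/fs/cgroup/devices/kubepods/pod<UID>/<container-id>/devices.list
# typical entries: "c 1:3 rwm" (/dev/null), "c 1:5 rwm" (/dev/zero), "c 5:0 rwm" (/dev/tty)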
PART 6 — freezer, perf_event, rdma, net_prio
These controllers exist but Kubernetes barely uses them:
freezer
- Can pause cgroups
- Kubernetes does not use this
perf_event
- Controls access to performance counters
- Kubernetes does not manage this
rdma
- Relevant for RDMA/SR-IOV HPC workloads
net_prio / net_cls (deprecated)
- Used for traffic classification
- Kubernetes network plugins rarely rely on them now
PART 7 — How These Controllers Combine for Isolation
Here’s a realistic mapping:
| Resource Type | Controller | Kubernetes Support |
|---|---|---|
| CPU time | cpu.cfs_quota_us (v1) / cpu.max (v2) | Yes |
| CPU fairness | cpu.shares / cpu.weight | Yes |
| CPU pinning | cpuset | Yes (CPU Manager static) |
| Memory | memory.max | Yes |
| Memory guarantees | memory.min (MemoryQoS) | Yes (new) |
| Memory throttling | memory.high | Yes (new) |
| HugePages | hugetlb | Yes |
| PIDs | pids.max | Yes |
| I/O throttle | blkio/io | No (not exposed) |
| Network QoS | net_cls/net_prio | No |
| Device access | devices | Yes (runtime + device plugin) |
PART 8 — Practical Best Practices Across Controllers
1. ALWAYS set pids.max (critical)
Prevents node meltdown from:
- fork bombs
- thread leaks
- JVM runaway threads
2. Enable MemoryQoS
Fixes a decade-long Kubernetes memory fairness problem.
3. For CPU-critical workloads:
Enable:
--cpu-manager-policy=static
--topology-manager-policy=single-numa-node
4. For I/O-sensitive workloads:
Use:
- local SSD
- fsGroup for permissions
- separate disk for DB logs/data
- dedicated nodes (taints)
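For the dedicated-nodes point above, a quick sketch (the node name and taint key are made up):

# Reserve a node for I/O-heavy database Pods
kubectl taint nodes storage-node-1 dedicated=database:NoSchedule
# matching Pods then carry a toleration for dedicated=database and, usually, a nodeSelector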
5. Use HugePages only when they're known to be beneficial
If the application isn't written to allocate HugePages (e.g., via hugetlbfs or mmap with MAP_HUGETLB), it simply won't use them.
6. For NIC/GPU workloads
Use device plugins, not manual device rules.
SEGMENT 5 SUMMARY
You now fully understand the remaining major cgroup controllers:
I/O
- Kubernetes does not expose blkio/io throttling
- Cloud disks often enforce IOPS/BPS instead
PIDs
- Critical protection via pids.max
- Prevents node-wide PID exhaustion
cpuset
- Enables CPU pinning, NUMA alignment
- Requires CPU Manager static policy
hugetlb
- Full Kubernetes support
- Needs pre-reserved HugePages on node
Other controllers
- Devices used heavily for GPU/SR-IOV
- Others (freezer/perf_event) mostly unused
These controllers cover the last 20% of Kubernetes isolation behavior that most engineers don’t know exists.