Kubernetes Resource Isolation - 01. How Kubernetes Uses cgroups
1. What cgroups are (in our context)
At a very practical level, a cgroup is:
“A set of processes whose resource usage is accounted and limited together.”
Linux exposes this via a virtual filesystem like /sys/fs/cgroup/....
Key controllers we care about for Kubernetes:
- cpu – relative CPU share and throttling
- cpuset – pin processes to specific CPUs/NUMA nodes
- memory – per-group memory usage and limits
- pids – max number of processes/threads
- blkio / io – block I/O throttling
Kubernetes does not manipulate cgroups directly with echo ... > /sys/fs/cgroup/...; instead it delegates this to the container runtime (via the CRI) and/or systemd.
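Even so, you can see the end result directly on disk. As a small sketch, on a cgroup v2 node (the common case on modern distros) the kernel lists the controllers in a couple of files:
# Sketch: list the cgroup v2 controllers the kernel exposes on this node
cat /sys/fs/cgroup/cgroup.controllers      # e.g. "cpuset cpu io memory hugetlb pids"
cat /sys/fs/cgroup/cgroup.subtree_control  # controllers enabled for child cgroups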
2. Who actually creates cgroups in Kubernetes?
The main actors on a node:
- kubelet
  - Watches the API server for Pods scheduled to this node
  - Decides which resources to allocate, and with what constraints
  - Talks to the container runtime via the CRI (Container Runtime Interface)
- Container runtime (typically containerd or CRI-O; formerly Docker via dockershim)
  - When asked to start a container, sets up:
    - Namespaces (PID, NET, IPC, UTS, MNT, etc.)
    - cgroups with the appropriate controllers and limits
    - Root filesystem, mounts, etc.
- systemd (if using the systemd cgroup driver)
  - Manages the cgroup tree
  - The runtime asks systemd to create and manage scoped units, and systemd manipulates the cgroup fs
So the flow is approximately:
- You apply a Pod with resources.requests and resources.limits.
- The scheduler places the Pod on a node.
- The kubelet on that node:
  - Validates that the node has enough Node Allocatable capacity
  - Derives the QoS class and runtime config
  - Calls the container runtime (containerd/CRI-O) over the CRI with RunPodSandbox / CreateContainer, including the resource settings
- The runtime:
  - Asks systemd (systemd driver) or writes to cgroupfs (cgroupfs driver) to create the cgroup paths
  - Starts the container processes inside those cgroups
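If you want to confirm which cgroup driver is actually in play on a node, a hedged starting point (file locations and output fields vary by distro and runtime) is:
# Sketch: check the cgroup driver on both sides; typical default paths assumed
grep -i cgroupDriver /var/lib/kubelet/config.yaml   # kubelet's setting
crictl info | grep -iE 'systemd.?cgroup'            # runtime's setting (containerd/CRI-O)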
3. The cgroup hierarchy for Pods & containers
On a typical modern node (systemd driver, cgroups v2), the tree roughly looks like:
/sys/fs/cgroup/
├─ system.slice/
├─ user.slice/
└─ kubepods.slice/
   ├─ kubepods-besteffort.slice/
   │  └─ kubepods-besteffort-pod<uid>.slice/
   │     └─ cri-containerd-<container-id>.scope
   ├─ kubepods-burstable.slice/
   │  └─ kubepods-burstable-pod<uid>.slice/
   │     └─ cri-containerd-<container-id>.scope
   └─ kubepods-pod<uid>.slice/   # Guaranteed Pods: no per-class slice, they sit directly under kubepods.slice
      └─ cri-containerd-<container-id>.scope
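If the node uses the systemd driver, you can render this tree live with systemd-cgls:
# Sketch: show the kubepods slice hierarchy as systemd sees it
systemd-cgls --no-pager /kubepods.slice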
With the cgroupfs driver on cgroups v1, the older per-controller layout might look like:
/sys/fs/cgroup/cpu/
└─ kubepods/
   ├─ besteffort/
   │  └─ pod<uid>/
   │     └─ <container-id>/
   ├─ burstable/
   │  └─ pod<uid>/
   │     └─ <container-id>/
   └─ pod<uid>/   # Guaranteed Pods: directly under kubepods
      └─ <container-id>/
Key ideas:
- Each Pod gets its own cgroup; all containers in the Pod live under it.
- Each container gets its own child cgroup, which inherits from (and may further restrict) the Pod cgroup.
- The QoS class decides where the Pod cgroup is placed: under the besteffort or burstable subtree, or directly under kubepods for Guaranteed Pods.
This hierarchy lets Kubernetes apply:
- Pod-level accounting/limits (e.g., total memory for a Pod)
- Container-level limits (per-container caps)
- Class-level priorities (Guaranteed vs Burstable vs BestEffort)
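A quick way to see this split on a node (systemd driver and cgroup v2 assumed; paths match the tree above):
# Sketch: list Pod-level cgroups per QoS class
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod*.slice 2>/dev/null
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod*.slice 2>/dev/null
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-pod*.slice 2>/dev/null   # Guaranteed Pods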
4. How requests/limits conceptually map to cgroups
We’ll go into much more detail in Segments 2–4, but at a high level:
For CPU:
- resources.requests.cpu → cpu.shares on cgroup v1 (cpu.weight on v2), a relative weight under contention
- resources.limits.cpu → cpu.cfs_quota_us / cpu.cfs_period_us on v1 (cpu.max on v2), a hard cap on CPU time per period
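As a hedged worked example of that conversion (the standard kubelet arithmetic; exact file names depend on the cgroup version on your node):
# Sketch: what requests.cpu=500m, limits.cpu=1 typically becomes on the container cgroup
#
# cgroup v1:
#   cpu.shares        = 500 * 1024 / 1000 = 512
#   cpu.cfs_period_us = 100000                 (100ms default period)
#   cpu.cfs_quota_us  = 1000 * 100000 / 1000 = 100000
#
# cgroup v2:
#   cpu.weight = the v1 shares mapped into the 1..10000 weight range
#   cpu.max    = "100000 100000"               ("<quota_us> <period_us>")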
For Memory:
- resources.limits.memory → hard limit
  - v1: memory.limit_in_bytes
  - v2: memory.max
- resources.requests.memory informs:
  - QoS calculation (Guaranteed vs Burstable)
  - Node scheduling (bin-packing)
  - Sometimes memory.min / memory.high on cgroup v2 for memory QoS (implementation details depend on kubelet version/config)
No limits.memory → the Pod can use up to the node’s memory (within whatever the parent cgroups allow), but BestEffort Pods (no requests or limits at all) are the first to be reclaimed or evicted under memory pressure.
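To sanity-check the memory side on a live container (placeholders follow the tree from section 3; systemd driver and cgroup v2 assumed):
# Sketch: compare a container's memory limit in its Pod spec with the cgroup file
CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
cat "$CG/memory.max"      # "max" when no limit is set, otherwise the limit in bytes
cat "$CG/memory.current"  # current usage in bytes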
5. Resource isolation vs. fairness vs. overcommit
Kubernetes uses cgroups not just to “hard isolate” but to balance:
- Isolation – Ensure Pods with limits cannot exceed them (esp. memory).
- Fairness – Use cpu.shares to give higher relative CPU share to Pods with higher requests when the node is fully utilized.
- Overcommit & bin-packing – You can schedule more total requested CPU than physically exists, knowing that CPU is elastic and shared via cgroups.
So in practice:
- cgroups don’t guarantee performance; they bound and shape it.
- Isolation is strongest for:
  - Memory limits (hitting them = OOM kill)
  - CPU limits (throttling)
But:
- A noisy neighbor with no CPU limit, running full throttle on a mostly empty node, will get lots of CPU; only under contention do shares and quotas bite.
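A quick way to see whether a CPU limit is actually biting is the container cgroup’s cpu.stat (cgroup v2 shown; placeholders as in section 3):
# Sketch: check CPU throttling counters for a container cgroup
CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
cat "$CG/cpu.stat"
# nr_throttled and throttled_usec climbing over time => the quota is limiting the workload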
We’ll look at this more in the CPU and memory segments.
6. Where kubelet configuration comes in
The kubelet has several important knobs that affect how cgroups are arranged and used:
- --cgroup-driver=systemd|cgroupfs
- --kube-reserved=cpu=...,memory=...
- --system-reserved=...
- --eviction-hard=..., --eviction-soft=...
- --cgroups-per-qos (usually true on modern clusters)
Roughly:
- kube-reserved/system-reserved: node-level cgroups created to reserve resources for the kubelet and system daemons so Pods don’t starve them.
- cgroups-per-qos: if enabled, the kubepods hierarchy is split by QoS class, giving different baseline protections/priorities.
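To see how these knobs land on a real node, one hedged check (the config path and field names are the usual KubeletConfiguration defaults; adjust for your distro):
# Sketch: inspect cgroup-related kubelet settings and the resulting Allocatable
grep -iE 'cgroupDriver|cgroupsPerQOS|kubeReserved|systemReserved|evictionHard' -A3 /var/lib/kubelet/config.yaml
kubectl describe node <node-name> | grep -E -A8 '^Capacity:|^Allocatable:'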
We’ll deep-dive these in the segments about Node Allocatable and QoS.
7. Inspecting this on a real node (mental model)
On a real node, you’d typically:
- SSH into the node.
- Find the PID of a container process:
  ps aux | grep <your-app-process>
- Inspect its cgroup membership:
  cat /proc/<pid>/cgroup
On a systemd + cgroup v2 node, you might see something like:
0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
Then:
cd /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
ls
# cpu.max, memory.current, memory.max, pids.max, etc.
Those files (on v2: cpu.max, memory.max, memory.high, etc.) are where the real isolation is enforced by the kernel.
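If grepping ps feels fragile, the CRI CLI can resolve a similar path for you (assuming crictl is installed and pointed at the node’s runtime; <container-name> is a placeholder):
# Sketch: resolve a container's cgroup path via crictl instead of ps/grep
CID=$(crictl ps --name <container-name> -q)
crictl inspect "$CID" | grep -i cgroupsPath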