Kubernetes Resource Isolation - 01. How Kubernetes Uses cgroups
1. What cgroups are (in our context)
At a very practical level, a cgroup is:
“A set of processes whose resource usage is accounted and limited together.”
Linux exposes this via a virtual filesystem like /sys/fs/cgroup/....
Key controllers we care about for Kubernetes:
- cpu – relative CPU share and throttling
- cpuset – pin processes to specific CPUs/NUMA nodes
- memory – per-group memory usage and limits
- pids – max number of processes/threads
- blkio / io – block I/O throttling
Kubernetes does not manipulate cgroups directly with echo ... > /sys/fs/cgroup/...; instead it delegates this to the container runtime (via the CRI) and/or systemd.
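Even so, you can see the end result directly on disk. As a small sketch, on a cgroup v2 node (the common case on modern distros) the kernel lists the controllers in a couple of files:
# Sketch: list the cgroup v2 controllers the kernel exposes on this node
cat /sys/fs/cgroup/cgroup.controllers      # e.g. "cpuset cpu io memory hugetlb pids"
cat /sys/fs/cgroup/cgroup.subtree_control  # controllers enabled for child cgroups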
2. Who actually creates cgroups in Kubernetes?
The main actors on a node:
- kubelet
  - Watches the API server for Pods scheduled to this node
  - Decides which resources to allocate, and with what constraints
  - Talks to the container runtime via the CRI (Container Runtime Interface)
- Container runtime (typically containerd or CRI-O; formerly Docker via dockershim)
  - When asked to start a container, sets up:
    - Namespaces (PID, NET, IPC, UTS, MNT, etc.)
    - cgroups with the appropriate controllers and limits
    - Root filesystem, mounts, etc.
- systemd (if using the systemd cgroup driver)
  - Manages the cgroup tree
  - The runtime asks systemd to create and manage scoped units, and systemd manipulates the cgroup fs
So the flow is approximately:
- You apply a Pod with resources.requests and resources.limits.
- The scheduler places the Pod on a node.
- The kubelet on that node:
  - Validates that the node has enough Node Allocatable capacity
  - Derives the QoS class and runtime config
  - Calls the container runtime (containerd/CRI-O) over the CRI with RunPodSandbox / CreateContainer, including the resource settings
- The runtime:
  - Asks systemd (systemd driver) or writes to cgroupfs (cgroupfs driver) to create the cgroup paths
  - Starts the container processes inside those cgroups
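If you want to confirm which cgroup driver is actually in play on a node, a hedged starting point (file locations and output fields vary by distro and runtime) is:
# Sketch: check the cgroup driver on both sides; typical default paths assumed
grep -i cgroupDriver /var/lib/kubelet/config.yaml   # kubelet's setting
crictl info | grep -iE 'systemd.?cgroup'            # runtime's setting (containerd/CRI-O)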
3. The cgroup hierarchy for Pods & containers
On a typical modern node (systemd driver, cgroups v2), the tree roughly looks like:
/sys/fs/cgroup/
├─ system.slice/
├─ user.slice/
└─ kubepods.slice/
   ├─ kubepods-besteffort.slice/
   │  └─ kubepods-besteffort-pod<uid>.slice/
   │     └─ cri-containerd-<container-id>.scope
   ├─ kubepods-burstable.slice/
   │  └─ kubepods-burstable-pod<uid>.slice/
   │     └─ cri-containerd-<container-id>.scope
   └─ kubepods-pod<uid>.slice/   # Guaranteed Pods: no per-class slice, they sit directly under kubepods.slice
      └─ cri-containerd-<container-id>.scope
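If the node uses the systemd driver, you can render this tree live with systemd-cgls:
# Sketch: show the kubepods slice hierarchy as systemd sees it
systemd-cgls --no-pager /kubepods.slice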
With the cgroupfs driver on cgroups v1, the older per-controller layout might look like:
/sys/fs/cgroup/cpu/
└─ kubepods/
   ├─ besteffort/
   │  └─ pod<uid>/
   │     └─ <container-id>/
   ├─ burstable/
   │  └─ pod<uid>/
   │     └─ <container-id>/
   └─ pod<uid>/   # Guaranteed Pods: directly under kubepods
      └─ <container-id>/
Key ideas:
- Each Pod gets its own cgroup; all containers in the Pod live under it.
- Each container gets its own child cgroup, which inherits from (and may further restrict) the Pod cgroup.
- The QoS class decides where the Pod cgroup is placed: under the besteffort or burstable subtree, or directly under kubepods for Guaranteed Pods.
This hierarchy lets Kubernetes apply:
- Pod-level accounting/limits (e.g., total memory for a Pod)
- Container-level limits (per-container caps)
- Class-level priorities (Guaranteed vs Burstable vs BestEffort)
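A quick way to see this split on a node (systemd driver and cgroup v2 assumed; paths match the tree above):
# Sketch: list Pod-level cgroups per QoS class
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod*.slice 2>/dev/null
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod*.slice 2>/dev/null
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-pod*.slice 2>/dev/null   # Guaranteed Pods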
4. How requests/limits conceptually map to cgroups
We’ll go into much more detail in Segments 2–4, but at a high level:
For CPU:
- resources.requests.cpu → cpu.shares on cgroup v1 (cpu.weight on v2), a relative weight under contention
- resources.limits.cpu → cpu.cfs_quota_us / cpu.cfs_period_us on v1 (cpu.max on v2), a hard cap on CPU time per period
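As a hedged worked example of that conversion (the standard kubelet arithmetic; exact file names depend on the cgroup version on your node):
# Sketch: what requests.cpu=500m, limits.cpu=1 typically becomes on the container cgroup
#
# cgroup v1:
#   cpu.shares        = 500 * 1024 / 1000 = 512
#   cpu.cfs_period_us = 100000                 (100ms default period)
#   cpu.cfs_quota_us  = 1000 * 100000 / 1000 = 100000
#
# cgroup v2:
#   cpu.weight = the v1 shares mapped into the 1..10000 weight range
#   cpu.max    = "100000 100000"               ("<quota_us> <period_us>")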
For Memory:
- resources.limits.memory → hard limit
  - v1: memory.limit_in_bytes
  - v2: memory.max
- resources.requests.memory informs:
  - QoS calculation (Guaranteed vs Burstable)
  - Node scheduling (bin-packing)
  - Sometimes memory.min / memory.high on cgroup v2 for memory QoS (implementation details depend on kubelet version/config)
No limits.memory → the Pod can use up to the node’s memory (within whatever the parent cgroups allow), but BestEffort Pods (no requests or limits at all) are the first to be reclaimed or evicted under memory pressure.
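To sanity-check the memory side on a live container (placeholders follow the tree from section 3; systemd driver and cgroup v2 assumed):
# Sketch: compare a container's memory limit in its Pod spec with the cgroup file
CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
cat "$CG/memory.max"      # "max" when no limit is set, otherwise the limit in bytes
cat "$CG/memory.current"  # current usage in bytes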
5. Resource isolation vs. fairness vs. overcommit
Kubernetes uses cgroups not just to “hard isolate” but to balance:
- Isolation – Ensure Pods with limits cannot exceed them (esp. memory).
- Fairness – Use cpu.shares to give higher relative CPU share to Pods with higher requests when the node is fully utilized.
- Overcommit & bin-packing – You can schedule more total requested CPU than physically exists, knowing that CPU is elastic and shared via cgroups.
So in practice:
- cgroups don’t guarantee performance; they bound and shape it.
- Isolation is strongest for:
  - Memory limits (hitting them = OOM kill)
  - CPU limits (throttling)
But:
- A noisy neighbor with no CPU limit, running full throttle on a mostly empty node, will get lots of CPU; only under contention do shares and quotas bite.
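A quick way to see whether a CPU limit is actually biting is the container cgroup’s cpu.stat (cgroup v2 shown; placeholders as in section 3):
# Sketch: check CPU throttling counters for a container cgroup
CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
cat "$CG/cpu.stat"
# nr_throttled and throttled_usec climbing over time => the quota is limiting the workload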
We’ll look at this more in the CPU and memory segments.
6. Where kubelet configuration comes in
The kubelet has several important knobs that affect how cgroups are arranged and used:
- --cgroup-driver=systemd|cgroupfs
- --kube-reserved=cpu=...,memory=...
- --system-reserved=...
- --eviction-hard=..., --eviction-soft=...
- --cgroups-per-qos (usually true on modern clusters)
Roughly:
- kube-reserved/system-reserved: node-level cgroups created to reserve resources for the kubelet and system daemons so Pods don’t starve them.
- cgroups-per-qos: if enabled, the kubepods hierarchy is split by QoS class, giving different baseline protections/priorities.
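To see how these knobs land on a real node, one hedged check (the config path and field names are the usual KubeletConfiguration defaults; adjust for your distro):
# Sketch: inspect cgroup-related kubelet settings and the resulting Allocatable
grep -iE 'cgroupDriver|cgroupsPerQOS|kubeReserved|systemReserved|evictionHard' -A3 /var/lib/kubelet/config.yaml
kubectl describe node <node-name> | grep -E -A8 '^Capacity:|^Allocatable:'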
We’ll deep-dive these in the segments about Node Allocatable and QoS.
7. Inspecting this on a real node (mental model)
On a real node, you’d typically:
- SSH into the node.
- Find the PID of a container process:
  ps aux | grep <your-app-process>
- Inspect its cgroup membership:
  cat /proc/<pid>/cgroup
On a systemd + cgroup v2 node, you might see something like:
0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
Then:
cd /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/cri-containerd-<container-id>.scope
ls
# cpu.max, memory.current, memory.max, pids.max, etc.
Those files (on v2: cpu.max, memory.max, memory.high, etc.) are where the real isolation is enforced by the kernel.
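If grepping ps feels fragile, the CRI CLI can resolve a similar path for you (assuming crictl is installed and pointed at the node’s runtime; <container-name> is a placeholder):
# Sketch: resolve a container's cgroup path via crictl instead of ps/grep
CID=$(crictl ps --name <container-name> -q)
crictl inspect "$CID" | grep -i cgroupsPath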