Kubernetes Resource Isolation - 02. From PodSpec → QoS → cgroup layout → actual cgroup settings
Part 1 — How Kubernetes Determines QoS Class
Kubernetes evaluates containers in a Pod and then places the Pod as a whole into one of three QoS buckets:
1. Guaranteed
A Pod is Guaranteed only if every container in it:
- has memory limit = memory request
- has cpu limit = cpu request
In other words, every container must explicitly set all four values, and requests must equal limits.
Example:
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Also allowed (as long as limit = request):
    cpu: 1       # limit = request
    memory: 1Gi
2. Burstable
A Pod is Burstable if at least one container sets a CPU or memory request or limit, but the Pod does not meet the Guaranteed criteria. Typical cases:
- requests are set but limits are not
- limits are set higher than requests
- only some containers set resources at all
Example:
resources:
  requests:
    cpu: "200m"
    memory: "512Mi"
(no limits → Burstable)
3. BestEffort
A Pod is BestEffort if:
- No container sets ANY requests or limits for CPU or memory.
Example:
resources: {}
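You don't have to apply these rules by hand: the API server records the result in the Pod status, so you can check it directly (the pod name below is a placeholder):
    kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
    # prints Guaranteed, Burstable, or BestEffort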
Part 2 — How QoS Determines cgroup Placement
On systemd + cgroup v2 (modern systems):
/sys/fs/cgroup/
  kubepods.slice/
    kubepods-burstable.slice/
    kubepods-besteffort.slice/
    kubepods-pod<UID>.slice/   → Guaranteed pods (there is no kubepods-guaranteed.slice; they sit directly under kubepods.slice)
Inside the Burstable and BestEffort slices you get one cgroup per pod, then one cgroup per container; Guaranteed pod cgroups hang directly off kubepods.slice.
Full example path:
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cri-containerd-<containerid>.scope
On cgroupfs (legacy):
/sys/fs/cgroup/cpu/kubepods/burstable/pod<UID>/<container-id>/
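To see this layout on a live node, you can simply list the slices (a quick sketch; the pod UID is a placeholder and the exact entries depend on your cgroup driver and what is running):
    ls /sys/fs/cgroup/kubepods.slice/
    # kubepods-besteffort.slice  kubepods-burstable.slice  kubepods-pod<UID>.slice  ...
    ls /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/
    # kubepods-burstable-pod<UID>.slice  ...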
Why this matters:
- Each QoS class has different baseline protections and different CPU/memory behaviors.
- Guaranteed pods get best isolation.
- BestEffort pods are always the first to be killed under pressure.
Part 3 — How Requests & Limits Map to cgroup Controller Settings
Let’s break it down by resource:
CPU Mapping
Requests → cpu.shares
For cgroup v1:
cpu.shares = request_cpu * 1024
- request 100m → shares = 102 (102.4, truncated to an integer)
- request 1 CPU → shares = 1024
For cgroup v2:
cpu.weight = a value from 1–10000 (kubelet maps shares → weight internally)
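As a rough sketch of the arithmetic (the second formula is the runc-style shares → weight conversion; the exact mapping can vary between runtime versions):
    # 500m request → cgroup v1 shares
    echo $(( 500 * 1024 / 1000 ))              # 512
    # 512 shares → cgroup v2 weight (range 1–10000)
    echo $(( 1 + (512 - 2) * 9999 / 262142 ))  # 20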
Limits → cpu.cfs_quota_us / cpu.cfs_period_us
Defaults:
cpu.cfs_period_us = 100000 (100 ms)
If limit = 2 CPUs:
cpu.cfs_quota_us = 200000 (200ms)
This enforces a hard upper bound on CPU time.
If no CPU limit is set → the container can burst up to the node's full CPU capacity, contended only by shares/weight.
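On cgroup v2 both values live in a single file, cpu.max, which you can read back directly (the path is illustrative):
    cat <container-cgroup>/cpu.max
    # with a 2-CPU limit:  200000 100000
    # with no CPU limit:   max 100000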
Memory Mapping
Memory Limit → memory.max (cgroup v2) or memory.limit_in_bytes (cgroup v1)
For cgroup v2:
memory.max = <limit bytes>
If memory limit is 1Gi:
memory.max = 1073741824
Hitting this causes:
- kernel OOM inside the cgroup
- Kubernetes sees container OOMKilled
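A quick way to confirm the limit and check whether this cgroup has already OOM-killed anything (cgroup v2; the path is illustrative):
    cat <container-cgroup>/memory.max                 # 1073741824 for a 1Gi limit
    grep oom_kill <container-cgroup>/memory.events    # counter increments on every OOM kill in this cgroup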
Memory Request
This is NOT translated to any cgroup value on traditional setups.
However:
Memory Request affects:
- QoS classification
- scheduler bin-packing
- eviction ordering
- the kubelet Memory QoS feature (alpha since v1.22, cgroup v2 only), which maps request → memory.min and limit → memory.high (a throttling threshold below the hard limit); see the sketch below
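With that feature gate on (and cgroup v2), you would expect something like the following at the container cgroup (a sketch; the exact values depend on the kubelet version and its throttling factor):
    cat <container-cgroup>/memory.min    # protected memory, derived from the request
    cat <container-cgroup>/memory.high   # throttling threshold, derived from the limit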
Part 4 — cgroup Settings per QoS Class
| QoS Class | CPU Behavior | Memory Behavior | Eviction Priority | Typical Use |
|---|---|---|---|---|
| Guaranteed | Strong isolation (quota + shares) | Hard memory limit; highest protection | Last to be evicted | Critical workloads |
| Burstable | Shared CPU; may be throttled | Can burst up to limit; request gives some protection | Evicted after BestEffort | Most apps |
| BestEffort | Lowest CPU share | No memory limit → can use all memory, but first killed | First to be evicted | Non-critical, debug jobs |
Part 5 — Pod-level vs Container-level Enforcement
Kubernetes enforces limits at:
1. Container level
- Hard memory limit → the container cannot exceed it
- Hard CPU limit → the container cannot exceed it
2. Pod level
Memory:
- The Pod gets its own cgroup with memory.max, where pod_memory_limit = sum(container_limits) (in practice this is set only when every container defines a memory limit)
CPU:
- CPU quota is applied per container, not to the whole Pod (a historical choice)
- But you can enable pod-level CPU cgroups (kubelet --cpu-cfs-quota plus the relevant feature gates)
When pod-level CPU accounting is enabled, you also get cpu.max at the Pod cgroup.
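You can observe both layers on the node (cgroup v2, systemd driver; pod UID and container ID are placeholders):
    # Pod-level cgroup: aggregate memory limit for the whole Pod
    cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/memory.max
    # Container-level cgroup inside the pod slice: that container's own limit
    cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cri-containerd-<container-id>.scope/memory.max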
Part 6 — Node Allocatable & How Pods Fit Into the Node’s Hierarchy
Before a Pod gets placed, the scheduler (and kubelet admission) check that there is enough room based on Node Allocatable:
Node Allocatable = Node Capacity
                   - kube-reserved
                   - system-reserved
                   - eviction-hard thresholds
Only Node Allocatable is schedulable to Pods.
This prevents user pods from starving system components.
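You can compare the two figures for any node straight from the API (the node name is a placeholder):
    kubectl get node <node-name> -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'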
Node-level cgroups:
/sys/fs/cgroup/system.slice/   → OS daemons
/sys/fs/cgroup/kubelet.slice/  → the kubelet itself (when configured to run in its own slice)
/sys/fs/cgroup/kubepods.slice/ → all pods
These slices are created and managed by systemd.
Part 7 — Putting It All Together (Example)
Example PodSpec:
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: api
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "2"
        memory: "1Gi"
Result:
- limit ≠ request → Burstable
- Pod cgroup under kubepods-burstable.slice
CPU:
- shares = 0.5 CPU * 1024 = 512
- quota = 2 CPUs → cpu.cfs_quota_us = 200000
- period = cpu.cfs_period_us = 100000
Memory:
- memory.max = 1Gi (1073741824 bytes)
- no memory.min unless Memory QoS is enabled
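To verify those numbers on the node itself (cgroup v2, systemd driver; the leading path and IDs are placeholders; note that on v2 the 512 shares surface as a cpu.weight of roughly 20):
    cat .../kubepods-burstable-pod<UID>.slice/cri-containerd-<id>.scope/cpu.weight   # ~20 (from 512 shares)
    cat .../kubepods-burstable-pod<UID>.slice/cri-containerd-<id>.scope/cpu.max      # 200000 100000
    cat .../kubepods-burstable-pod<UID>.slice/cri-containerd-<id>.scope/memory.max   # 1073741824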
Conclusion of Segment 2
After Segment 2, you should have a clear mental model of:
- How requests/limits classify the Pod
- How QoS maps to cgroup hierarchies
- How CPU/memory settings become real cgroup controller values
- Pod vs container vs node-level cgroups