Kubernetes Resource Isolation - 02. From PodSpec → QoS → cgroup layout → actual cgroup settings

October 04, 2025  3 minute read  

Part 1 — How Kubernetes Determines QoS Class

Kubernetes evaluates the resource requests and limits of every container in a Pod and then places the Pod as a whole into one of three QoS classes:

1. Guaranteed

A Pod is Guaranteed if every container:

  • sets both CPU and memory requests AND limits
  • has memory limit = memory request
  • has cpu limit = cpu request

Example:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Also allowed (whole-unit quantities, as long as limit = request):

cpu: 1      # limit = request
memory: 1Gi

2. Burstable

A Pod is Burstable if:

  • at least one container sets a CPU or memory request or limit, AND
  • the Pod does not qualify as Guaranteed (for example, not every container has limit = request, or a container has requests but no limits)

Example:

resources:
  requests:
    cpu: "200m"
    memory: "512Mi"

(no limits → burstable)

3. BestEffort

A Pod is BestEffort if:

  • No container sets ANY requests or limits for CPU or memory.

Example:

resources: {}
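Whichever class applies, Kubernetes records the result in the Pod status, so you can check the decision directly (the pod name below is a placeholder):

kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
# → Guaranteed | Burstable | BestEffort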

Part 2 — How QoS Determines cgroup Placement

On systemd+cgroupv2 (modern systems):

/sys/fs/cgroup/
  kubepods.slice/
    kubepods-pod<UID>.slice/       ← Guaranteed pods (no per-class slice)
    kubepods-burstable.slice/
    kubepods-besteffort.slice/

Inside the Burstable and BestEffort slices you get one cgroup per pod, then one cgroup per container; Guaranteed pod cgroups sit directly under kubepods.slice.

Full example path:

/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cri-containerd-<containerid>.scope

On cgroupfs (legacy):

/sys/fs/cgroup/cpu/kubepods/burstable/pod<UID>/<container-id>/
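A quick way to find the cgroup a running container actually landed in is to read /proc/<pid>/cgroup for one of its processes, or to list the QoS slices directly; the PID and paths below are placeholders:

# From the node: resolve a container process to its cgroup (PID is a placeholder)
cat /proc/12345/cgroup
# 0::/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cri-containerd-<id>.scope

# Or browse the per-QoS slices directly
ls /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/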

Why this matters:

  • Each QoS class has different baseline protections and different CPU/memory behaviors.
  • Guaranteed pods get best isolation.
  • BestEffort pods are always the first to be killed under pressure.

Part 3 — How Requests & Limits Map to cgroup Controller Settings

Let’s break it down by resource:


CPU Mapping

Requests → cpu.shares

For cgroup v1:

cpu.shares = request_cpu * 1024
  • Request: 100m → shares = 102 (102.4, truncated down)
  • Request: 1 CPU → shares = 1024

For cgroup v2:

cpu.weight = a value from 1–10000 (kubelet maps shares → weight internally)
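One way to see the mapping: a 500m request becomes 512 shares on v1, and the kubelet then converts shares to a weight. Below is a rough sketch of that conversion; the formula is an approximation of the kubelet's integer math and may not match every version, and the cgroup path is illustrative:

# shares → weight sketch (assumed formula: 1 + (shares - 2) * 9999 / 262142)
SHARES=512
echo $(( 1 + (SHARES - 2) * 9999 / 262142 ))   # ≈ 20

# Read the value the kubelet actually wrote
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cpu.weight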

Limits → cpu.cfs_quota_us / cpu.cfs_period_us

Defaults:

  • cpu.cfs_period_us = 100000 (100ms)

If limit = 2 CPUs:

cpu.cfs_quota_us = 200000 (200ms)

This enforces hard max CPU.

If no CPU limit is set → no quota is written, and the container can burst across the node's idle CPU.
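On cgroup v2 the quota and period are combined into a single cpu.max file, so a 2-CPU limit shows up like this (path is illustrative):

# cgroup v2: "<quota> <period>" in one file
cat /sys/fs/cgroup/kubepods.slice/.../cpu.max
# 200000 100000   → 200ms of CPU time per 100ms period = 2 CPUs
# with no CPU limit the quota field reads "max"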


Memory Mapping

Memory Limit → memory.max (memory.limit_in_bytes on cgroup v1)

For cgroup v2:

memory.max = <limit bytes>

If memory limit is 1Gi:

memory.max = 1073741824

Hitting this causes:

  • kernel OOM inside the cgroup
  • Kubernetes sees container OOMKilled
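You can watch this from the node: memory.max holds the byte limit, and the oom_kill counter in memory.events increments each time the kernel kills a task in the cgroup (the path below is illustrative):

CG=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/cri-containerd-<id>.scope
cat $CG/memory.max                 # 1073741824 for a 1Gi limit
grep oom_kill $CG/memory.events    # counts OOM kills inside this cgroup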

Memory Request

This is NOT translated to any cgroup value on traditional setups.

However:

Memory Request affects:

  • QoS classification
  • scheduler bin-packing
  • eviction ordering
  • the kubelet MemoryQoS feature (alpha since v1.22), sketched below

    • maps request → memory.min
    • maps limit → memory.high (a throttling threshold below memory.max)
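A hedged sketch of what that would look like on disk for a container with a 512Mi request and a 1Gi limit, assuming the MemoryQoS feature gate is on (the exact memory.high math depends on the kubelet version and throttling factor; path is a placeholder):

CG=/sys/fs/cgroup/kubepods.slice/.../cri-containerd-<id>.scope
cat $CG/memory.min   # 536870912  ← the 512Mi request, protected from reclaim
cat $CG/memory.high  # below memory.max; derived from the limit and a throttling factor
cat $CG/memory.max   # 1073741824 ← the 1Gi limit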

Part 4 — cgroup settings per QoS class

Guaranteed
  • CPU: strong isolation (quota + shares)
  • Memory: hard memory limit; highest protection
  • Eviction: last to be evicted
  • Typical use: critical workloads

Burstable
  • CPU: shared; may be throttled at its limit
  • Memory: can burst up to its limit; the request gives some protection
  • Eviction: evicted after BestEffort
  • Typical use: most apps

BestEffort
  • CPU: lowest CPU share
  • Memory: no limit → can use all node memory, but killed first
  • Eviction: first to be evicted
  • Typical use: non-critical and debug jobs

Part 5 — Pod-level vs Container-level Enforcement

Kubernetes enforces limits at:

1. Container level

  • Hard memory limit → the container cannot exceed it (the kernel OOM-kills it)
  • Hard CPU limit → the container cannot exceed it (it gets throttled)

2. Pod-level

Memory:

  • Pod gets a cgroup with memory.max set, where (as long as every container has a memory limit):

    pod_memory_limit = sum(container_limits)
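For example (numbers assumed for illustration), a Pod with two containers limited to 1Gi and 512Mi gets a pod-level memory.max of 1610612736 bytes (path is illustrative):

cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice/memory.max
# 1610612736 = 1073741824 (1Gi) + 536870912 (512Mi)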
    

CPU:

  • Quota is applied to each container, not to the whole Pod (for historical reasons)
  • But you can enable pod-level CPU cgroups (kubelet --cpu-cfs-quota plus the relevant feature gates)

When Pod-level CPU accounting is enabled, you get:

cpu.max at the Pod cgroup

Part 6 — Node Allocatable & How Pods Fit Into the Node’s Hierarchy

Before a Pod gets placed, kubelet ensures there is enough room based on:

Node Capacity
- kube-reserved
- system-reserved
- eviction-hard thresholds
= Node Allocatable

Only Node Allocatable is schedulable to Pods.

This prevents user pods from starving system components.
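On the kubelet side this is configured with the reservation flags (or the matching fields in the kubelet config file); the values below are purely illustrative, not recommendations:

kubelet \
  --kube-reserved=cpu=500m,memory=1Gi \
  --system-reserved=cpu=500m,memory=1Gi \
  --eviction-hard=memory.available<500Mi,nodefs.available<10%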

Node-level cgroups:

/sys/fs/cgroup/system.slice/     → OS daemons
/sys/fs/cgroup/kubelet.slice/    → kubelet itself
/sys/fs/cgroup/kubepods.slice/   → all pods

These top-level slices are created and managed by systemd.
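You can see this hierarchy on a node with systemd's own tooling:

# Dump the cgroup tree (slices, pod slices, container scopes)
systemd-cgls --no-pager

# Live per-cgroup CPU/memory usage
systemd-cgtop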


Part 7 — Putting It All Together (Example)

Example PodSpec:

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
  - name: api
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "2"
        memory: "1Gi"

Result:

  • limit ≠ request → Burstable
  • Pod cgroup under:

    kubepods-burstable.slice
    

CPU:

shares = 0.5 CPU * 1024 = 512
quota  = 2 CPUs → cpu.cfs_quota_us = 200000
period = cpu.cfs_period_us = 100000

Memory:

memory.max = 1073741824 (1Gi)
No memory.min is set unless MemoryQoS is enabled.
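To verify, you could read the resulting files straight from the node (same placeholders as earlier; this assumes cgroup v2 with the systemd driver):

POD=/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
cat $POD/cri-containerd-<containerid>.scope/cpu.max      # expect: 200000 100000
cat $POD/cri-containerd-<containerid>.scope/cpu.weight   # converted from the 512 shares
cat $POD/cri-containerd-<containerid>.scope/memory.max   # expect: 1073741824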


Conclusion of Segment 2

After Segment 2, you should have a clear mental model of:

  • How requests/limits classify the Pod
  • How QoS maps to cgroup hierarchies
  • How CPU/memory settings become real cgroup controller values
  • Pod vs container vs node-level cgroups
