Kubernetes Resource Isolation - 07. Node Allocatable, System Reservations, Eviction & Kubelet Internals

October 11, 2025  4 minute read  

If you understand Node Allocatable, kubelet system reservations, and eviction thresholds, you can avoid the #1 cause of node instability: node memory pressure and random Pod evictions.

SEGMENT 7 — Node Allocatable, System Reservations, Eviction & Kubelet Internals

We will cover:

  1. What “Node Allocatable” really means
  2. How resources are carved out on a node
  3. kube-reserved & system-reserved
  4. Eviction thresholds (soft/hard)
  5. Eviction policy & Pod priority
  6. Memory pressure timeline (extremely important)
  7. Why nodes become unstable
  8. How Node Allocatable impacts Pod scheduling
  9. Best practices from production clusters

Let’s begin.


PART 1 — What Is Node Allocatable?

Node Allocatable is:

The portion of node resources available for Pods AFTER subtracting system + kubelet + eviction reserves.

Formula:

Allocatable =
   Node Capacity
 - Kube Reserved
 - System Reserved
 - Eviction Thresholds (memory.available)

This determines what the scheduler believes is safe to place on the node.
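
You can check both numbers per node with kubectl. Capacity is the raw hardware; Allocatable is what the scheduler actually works with (the node name below is a placeholder):

kubectl get node <node-name> \
  -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'

# Or the human-readable version: look at the Capacity: and Allocatable: sections
kubectl describe node <node-name>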


Node Capacity Example

Node has:

16 vCPU
32Gi memory
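
With reservation values like the ones used later in this segment (kube-reserved cpu=1,memory=2Gi, system-reserved cpu=1,memory=1Gi, eviction-hard memory.available<500Mi), the allocatable numbers work out roughly as:

Allocatable CPU    = 16 - 1 (kube) - 1 (system)               = 14 vCPU
Allocatable memory = 32Gi - 2Gi - 1Gi - 500Mi (eviction-hard) ≈ 28.5Gi

Eviction thresholds only reduce allocatable memory; there is no CPU eviction threshold.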

PART 2 — CPU/Memory Carving on a Node

This is the most important diagram:

          Node Capacity
    +--------------------------+
    |         kubelet          |
    |   system daemons (OS)    |
    |   container runtime      |  ← system-reserved & kube-reserved
    |--------------------------|
    |    eviction-hard/soft    |  ← eviction thresholds
    |--------------------------|
    |     Node Allocatable     |
    | (Pods can be scheduled)  |
    +--------------------------+

Pods can use ONLY Node Allocatable, not the node's full memory/CPU.


PART 3 — kube-reserved & system-reserved

These flags define how much CPU/memory the node keeps for system components.

Configurable via kubelet flags:

--kube-reserved=cpu=1,memory=2Gi
--system-reserved=cpu=1,memory=1Gi

What they cover:

system-reserved

  • OS services (systemd, journald)
  • kernel overhead
  • network agents
  • background system processes

kube-reserved

  • kubelet
  • kube-proxy
  • container runtime (containerd/CRI-O)
  • CNI plugins
  • CSI drivers
  • device plugins

If you don’t reserve these:

User Pods will starve system processes → node becomes unstable → eviction storms and kubelet heartbeat failures.


PART 4 — Eviction Thresholds (Soft & Hard)

Hard eviction:

--eviction-hard=memory.available<500Mi
  • Immediate
  • Non-negotiable
  • Kubelet starts evicting pods FAST

Soft eviction:

--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=60s
  • Warning threshold
  • Kubelet waits for grace period
  • Then evicts if pressure continues

PART 5 — Eviction Priority & QoS Ordering

Eviction order is:

  1. BestEffort Pods (no requests/limits)
  2. Burstable Pods (those using the most above their requests go first)
  3. Guaranteed Pods (last to be killed)

Kubelet ranks eviction candidates using:

  • Pod QoS class
  • Pod priority (priorityClassName)
  • whether actual usage exceeds requests
  • actual usage relative to requests, from cgroup stats
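
For illustration, a sketch of a Pod that sits at the safe end of this ordering: requests equal to limits (Guaranteed QoS) plus a PriorityClass. Names and values are placeholders and assume the PriorityClass already exists in the cluster:

apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  priorityClassName: critical-services   # assumed to exist in the cluster
  containers:
  - name: app
    image: registry.example.com/payments-api:1.0
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"        # requests == limits → Guaranteed QoS
        memory: "512Mi"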

PART 6 — The Memory Pressure Timeline (critical)

This is where node instability happens.

Let’s walk through the real process:


Time = T0 — Everything OK

memory.available > eviction-soft
memory.working_set low

Pods using less than their limits.


Time = T1 — Pods increase memory usage

If Pods consume memory that pushes:

memory.available < eviction-soft threshold
  • Soft eviction timer starts
  • Kubelet marks node: MemoryPressure=True
  • Scheduler avoids placing NEW Pods on the node
  • No Pods killed yet
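
A quick way to spot this stage (exact output depends on your kubectl and Kubernetes versions):

# Node condition set by the kubelet under memory pressure
kubectl describe node <node-name> | grep -i pressure

# Or query the MemoryPressure condition directly
kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}'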

Time = T2 — Soft threshold exceeded beyond grace period

After grace:

  • Kubelet evaluates Pods to evict
  • Picks:

    1. BestEffort
    2. Burstable
    3. Guaranteed

Based on:

  • usage/request ratio
  • Pod priority

Kubelet evicts pods.
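
Evicted Pods stay behind in Failed phase with reason "Evicted", so you can audit what was hit (a sketch; flag support can vary slightly by version):

# Pods the kubelet evicted
kubectl get pods -A --field-selector=status.phase=Failed

# Eviction events
kubectl get events -A --field-selector=reason=Evicted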


Time = T3 — Node goes below eviction-hard threshold

If memory drops below hard threshold:

memory.available < eviction-hard

Kubelet:

  • instantly kills Pods
  • often multiple Pods
  • ignores grace periods
  • tries to free memory ASAP

This can cause:

  • cascading failures
  • sudden loss of workloads
  • node becoming NotReady
  • kubelet crash/restart loops

Time = T4 — Kernel OOM activates (worse)

If kubelet is too slow:

The Linux kernel OOM killer steps in:

  • randomly kills processes inside Pods
  • may kill kubelet itself (!!)
  • may kill containerd
  • may corrupt Pod state / checkpoint files

This is the worst-case scenario.

In practice, a large share of “random K8s restarts” are the kernel OOM killer terminating something.
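
To tell kubelet evictions apart from kernel OOM kills, check the kernel log on the node itself (this assumes SSH access to a systemd-based Linux node):

# OOM killer activity in the kernel log
journalctl -k | grep -i "out of memory"

# Or via dmesg
dmesg -T | grep -iE "oom|killed process"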


Time = T5 — Node NotReady

If kubelet or container runtime is killed:

  • node becomes NotReady
  • pods get rescheduled elsewhere
  • node may need manual intervention

PART 7 — Why Nodes Become Unstable (real-world root causes)

Most common root causes:

1. No system-reserved or kube-reserved

User Pods starve kubelet/containerd.

2. Memory limits not set on Pods

BestEffort Pods can consume the entire node's memory.

3. Overcommitting memory

Requests too low → scheduler over-packs the node.

4. CSI drivers, CNIs, and monitoring agents consume memory

And you didn’t reserve memory for them.

5. Page cache consuming memory

Active page cache counts toward the node's working set, so it can push memory.available below eviction-hard even though it is not attributed to any single Pod.

6. Bursty workloads (Java, Node.js, ML inference)

They spike well above their memory requests.

7. Using horizontal scaling with lots of small Pods

Fragmentation + overhead = node thrash.


PART 8 — How Node Allocatable Impacts Pod Scheduling

If Node Allocatable is too large:

  • the scheduler believes the node has more memory than it really does → node gets overloaded → eviction storms

If Node Allocatable is too strict:

  • scheduler underutilizes node → wasted resources

You must find the right balance.


PART 9 — Best Practices for Production

1. Always define system-reserved + kube-reserved

Example:

--system-reserved=cpu=500m,memory=1Gi
--kube-reserved=cpu=1,memory=2Gi

2. Always define eviction thresholds

Recommended:

--eviction-hard=memory.available<500Mi
--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=1m
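
If your nodes are driven by a kubelet config file rather than flags, the same recommendations translate roughly into this KubeletConfiguration fragment (a sketch to merge into your existing config, not a drop-in replacement):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "1"
  memory: "2Gi"
evictionHard:
  memory.available: "500Mi"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m"
enforceNodeAllocatable:
- pods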

3. NEVER run BestEffort Pods in production

Unless it’s a throwaway debug pod.

4. Always set memory limits

Requests and limits are needed to keep runaway containers in check. One cluster-side guardrail is shown below.
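
A namespace LimitRange can inject default requests and limits, so a container that omits them never ends up BestEffort (names and values below are placeholders):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container omits requests
      cpu: "100m"
      memory: "256Mi"
    default:               # applied when a container omits limits
      cpu: "500m"
      memory: "512Mi"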

5. Overcommit CPU aggressively, memory cautiously

CPU is elastic. Memory is not.

6. For dense nodes, enable MemoryQoS

It makes page cache behavior fairer and reduces the risk of kernel OOM (requires cgroup v2).

7. Reserve extra memory on nodes with a large page cache

Data-processing nodes that keep a lot of page cache need a larger system-reserved memory buffer.

8. If running AI/ML workloads

Use:

  • static CPU Manager
  • Topology Manager
  • hugepages
  • system reservations
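
These also live in the kubelet configuration. A hedged sketch of the relevant fields (values depend on your hardware; switching cpuManagerPolicy on an existing node usually requires clearing the CPU manager state file):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                 # exclusive CPU pinning for Guaranteed Pods
topologyManagerPolicy: single-numa-node  # align CPU/device placement to one NUMA node
reservedSystemCPUs: "0,1"                # CPUs kept for the OS and kubelet
systemReserved:
  memory: "2Gi"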

9. Use Node Feature Discovery + Node labels

Give special nodes more reservations to avoid instability.


SEGMENT 7 SUMMARY

You now understand:

Node Allocatable

  • what portion of a node Pods can use
  • how kubelet subtracts system-reserved

System & Kube Reservations

  • protect OS + kubelet + container runtime
  • essential for stable nodes

Eviction Thresholds

  • soft = warning
  • hard = immediate
  • protect node from total memory exhaustion

Eviction Order

BestEffort → Burstable → Guaranteed

Memory Pressure Timeline

  • how nodes go NotReady
  • how Pods get killed
  • how kernel OOM fights kubelet

Best Practices

  • always set limits
  • configure reservations
  • avoid BestEffort
  • enable MemoryQoS
  • avoid overcommitting memory

This is the foundation for stable production Kubernetes clusters.

