Kubernetes Resource Isolation - 07. Node Allocatable, System Reservations, Eviction & Kubelet Internals
If you understand Node Allocatable, kubelet system reservations, and eviction thresholds, you can avoid the #1 cause of node instability: memory pressure and random Pod evictions.
SEGMENT 7 — Node Allocatable, System Reservations, Eviction & Kubelet Internals
We will cover:
- What “Node Allocatable” really means
- How resources are carved out on a node
- kube-reserved & system-reserved
- Eviction thresholds (soft/hard)
- Eviction policy & Pod priority
- Memory pressure timeline (extremely important)
- Why nodes become unstable
- Best practices from production clusters
Let’s begin.
PART 1 — What Is Node Allocatable?
Node Allocatable is:
The portion of node resources available for Pods AFTER subtracting system + kubelet + eviction reserves.
Formula:
Allocatable =
  Node Capacity
  - kube-reserved
  - system-reserved
  - hard eviction threshold (memory.available)
This determines what the scheduler believes is safe to place on the node.
Node Capacity Example
Node has:
16 vCPU
32Gi memory
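Plugging in numbers (using the illustrative reservations introduced later in this segment: kube-reserved cpu=1,memory=2Gi, system-reserved cpu=1,memory=1Gi, eviction-hard memory.available<500Mi):
Allocatable CPU    = 16   - 1 (kube-reserved) - 1 (system-reserved)          = 14 vCPU
Allocatable memory = 32Gi - 2Gi               - 1Gi               - 0.5Gi   = 28.5Gi
Eviction thresholds exist only for memory and ephemeral storage, so they do not reduce allocatable CPU.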
PART 2 — CPU/Memory Carving on a Node
This is the most important diagram:
Node Capacity
+----------------------------+
| kubelet                    |
| system daemons (OS)        |  ← system-reserved & kube-reserved
| container runtime          |
+----------------------------+
| eviction-hard / -soft      |  ← eviction thresholds
+----------------------------+
| Node Allocatable           |
| (Pods can be scheduled)    |
+----------------------------+
Pods can use ONLY Node Allocatable, not the full node memory/CPU.
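You can see this split on any running node: kubectl describe node <node-name> prints Capacity and Allocatable, and the same data lives in the Node object's status. An illustrative excerpt for the 16 vCPU / 32Gi node above (values assume the reservations used in this segment; a real node reports memory in Ki):
status:
  capacity:
    cpu: "16"
    memory: "32Gi"
    pods: "110"
  allocatable:
    cpu: "14"
    memory: "29184Mi"   # = 28.5Gi: capacity minus reservations and eviction-hard
    pods: "110"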
PART 3 — kube-reserved & system-reserved
These flags define how much CPU/memory the node keeps for system components.
Configurable via kubelet flags:
--kube-reserved=cpu=1,memory=2Gi
--system-reserved=cpu=1,memory=1Gi
What they cover:
system-reserved
- OS services (systemd, journald)
- kernel overhead
- network agents
- background system processes
kube-reserved
- kubelet
- kube-proxy
- container runtime (containerd/CRI-O)
- CNI plugins
- CSI drivers
- device plugins
If you don’t reserve these:
User Pods will starve system processes → node becomes unstable → eviction storms and kubelet heartbeat failures.
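Flags still work, but reservations are more commonly set in the kubelet's config file. A minimal KubeletConfiguration sketch matching the flag values above (enforceNodeAllocatable is shown with its default value):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "1"
  memory: "2Gi"
systemReserved:
  cpu: "1"
  memory: "1Gi"
# Enforce the Pod allocatable boundary via cgroups (this is the default).
enforceNodeAllocatable:
  - pods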
PART 4 — Eviction Thresholds (Soft & Hard)
Hard eviction:
--eviction-hard=memory.available<500Mi
- Immediate
- No grace period
- Kubelet starts evicting Pods right away
Soft eviction:
--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=60s
- Warning threshold
- Kubelet waits for grace period
- Then evicts if pressure continues
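In KubeletConfiguration form, the same thresholds map to the evictionHard, evictionSoft, and evictionSoftGracePeriod fields; a minimal sketch with the values above:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "60s"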
PART 5 — Eviction Priority & QoS Ordering
Eviction order is:
- BestEffort Pods (no requests/limits)
- Burstable Pods (evict higher actual-usage/requests first)
- Guaranteed Pods (last to be killed)
Kubelet uses:
- Pod QoS class
- Pod priorityClass
- Node resource pressure conditions
- actual Pod usage from cgroup stats
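As a reminder, the QoS class is derived entirely from how requests and limits are declared; the container resources blocks below are illustrative (values are arbitrary):
# Guaranteed — requests equal limits for every container
resources:
  requests: {cpu: "500m", memory: "512Mi"}
  limits:   {cpu: "500m", memory: "512Mi"}

# Burstable — requests set, limits higher (or only partially set)
resources:
  requests: {cpu: "250m", memory: "256Mi"}
  limits:   {cpu: "1",    memory: "1Gi"}

# BestEffort — no requests or limits at all: first in line for eviction
resources: {}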
PART 6 — The Memory Pressure Timeline (critical)
This is where node instability happens.
Let’s walk through the real process:
Time = T0 — Everything OK
memory.available > eviction-soft threshold
memory.working_set low
Pods using less than their limits.
Time = T1 — Pods increase memory usage
If Pods consume memory that pushes:
memory.available < eviction-soft threshold
- Soft eviction timer starts
- Kubelet marks node: MemoryPressure=True
- Scheduler avoids placing NEW Pods on the node (it receives the node.kubernetes.io/memory-pressure taint)
- No Pods killed yet
Time = T2 — Soft threshold exceeded beyond grace period
After grace:
- Kubelet evaluates which Pods to evict
- Picks victims in QoS order:
  - BestEffort first
  - then Burstable
  - Guaranteed last
Based on:
- how far actual usage exceeds requests (for the resource under pressure)
- Pod priority
Kubelet evicts pods.
Time = T3 — Node goes below eviction-hard threshold
If memory drops below hard threshold:
memory.available < eviction-hard
Kubelet:
- kills Pods immediately
- often several Pods at once
- ignores the Pods' termination grace periods
- tries to free memory ASAP
This can cause:
- cascading failures
- sudden loss of workloads
- node becoming NotReady
- kubelet crash/restart loops
Time = T4 — Kernel OOM activates (worse)
If kubelet is too slow:
The Linux kernel OOM killer steps in:
- randomly kills processes inside Pods
- may kill kubelet itself (!!)
- may kill containerd
- may corrupt Pod state / checkpoint files
This is the worst-case scenario.
The vast majority of "random K8s restarts" are the kernel OOM killer taking something out.
Time = T5 — Node NotReady
If kubelet or container runtime is killed:
- node becomes NotReady
- pods get rescheduled elsewhere
- node may need manual intervention
PART 7 — Why Nodes Become Unstable (real-world root causes)
Most common root causes:
1. No system-reserved or kube-reserved
User Pods starve kubelet/containerd.
2. Memory limits not set on Pods
BestEffort Pods can consume the entire node's memory.
3. Overcommitting memory
Requests too low → scheduler over-packs the node.
4. CSI drivers, CNIs, and monitoring agents consume memory
And you didn’t reserve memory for them.
5. Page cache consuming memory
Not accounted as “Pod memory” but still affects eviction-hard.
6. Bursty workloads (Java, Node.js, ML inference)
Spike above their memory request.
7. Using horizontal scaling with lots of small Pods
Fragmentation + overhead = node thrash.
PART 8 — How Node Allocatable Impacts Pod Scheduling
If Node Allocatable is too large:
- the scheduler believes the node has more memory than it really does → node gets overloaded → eviction storms
If Node Allocatable is too strict:
- the scheduler underutilizes the node → wasted resources
You must find the right balance.
PART 9 — Best Practices for Production
1. Always define system-reserved + kube-reserved
Example:
--system-reserved=cpu=500m,memory=1Gi
--kube-reserved=cpu=1,memory=2Gi
2. Always define eviction thresholds
Recommended:
--eviction-hard=memory.available<500Mi
--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=1m
3. NEVER run BestEffort Pods in production
Unless it’s a throwaway debug pod.
4. Always set memory requests and limits
They are what stop a runaway container from taking down the node; a namespace-wide default is sketched below.
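One way to enforce this across a namespace is a LimitRange that injects default requests and limits into containers that omit them; a minimal sketch (the namespace and values are assumptions to adapt):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: production          # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container sets no requests
        cpu: "250m"
        memory: "256Mi"
      default:                   # applied when a container sets no limits
        cpu: "500m"
        memory: "512Mi"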
5. Overcommit CPU aggressively, memory cautiously
CPU is elastic. Memory is not.
6. For dense nodes, enable MemoryQoS
It makes page cache behavior fairer and reduces risk of kernel OOM.
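MemoryQoS is a kubelet feature gate and requires cgroup v2 on the node; whether it is available (and still alpha/beta) depends on your Kubernetes version. A sketch of turning it on in KubeletConfiguration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true   # needs cgroup v2; sets memory.min/memory.high for containers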
7. Reserve extra memory on nodes with heavy page cache usage
For data-processing nodes, increase system-reserved so the page cache has headroom before eviction thresholds are hit.
8. If running AI/ML workloads
Use:
- static CPU Manager
- Topology Manager
- hugepages
- system reservations
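For those workloads, the relevant kubelet settings look roughly like this; a sketch under the assumption of a dedicated, NUMA-sensitive node (policies and reserved CPUs are values to adapt):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # exclusive CPUs for Guaranteed Pods with integer CPU requests
topologyManagerPolicy: single-numa-node   # align CPU, memory, and device placement on one NUMA node
reservedSystemCPUs: "0,1"                 # keep system/kubelet housekeeping off the pinned CPUs
systemReserved:
  memory: "2Gi"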
9. Use Node Feature Discovery + Node labels
Give special nodes more reservations to avoid instability.
SEGMENT 7 SUMMARY
You now understand:
Node Allocatable
- what portion of a node Pods can use
- how the kubelet subtracts reservations and eviction thresholds from capacity
System & Kube Reservations
- protect OS + kubelet + container runtime
- essential for stable nodes
Eviction Thresholds
- soft = warning
- hard = immediate
- protect node from total memory exhaustion
Eviction Order
BestEffort → Burstable → Guaranteed
Memory Pressure Timeline
- how nodes go NotReady
- how Pods get killed
- how kernel OOM fights kubelet
Best Practices
- always set limits
- configure reservations
- avoid BestEffort
- enable MemoryQoS
- avoid overcommitting memory
This is the foundation for stable production Kubernetes clusters.