Kubernetes Resource Isolation - 07. Node Allocatable, System Reservations, Eviction & Kubelet Internals
If you understand Node Allocatable, kubelet system reservations, and eviction thresholds, you can avoid the #1 cause of node instability: memory pressure and random Pod evictions.
SEGMENT 7 — Node Allocatable, System Reservations, Eviction & Kubelet Internals
We will cover:
- What “Node Allocatable” really means
- How resources are carved out on a node
- kube-reserved & system-reserved
- Eviction thresholds (soft/hard)
- Eviction policy & Pod priority
- Memory pressure timeline (extremely important)
- Why nodes become unstable
- Best practices from production clusters
Let’s begin.
PART 1 — What Is Node Allocatable?
Node Allocatable is:
The portion of node resources available for Pods AFTER subtracting system + kubelet + eviction reserves.
Formula:
Allocatable =
  Node Capacity
  - kube-reserved
  - system-reserved
  - hard eviction threshold (memory.available)
This determines what the scheduler believes is safe to place on the node.
Node Capacity Example
Node has:
16 vCPU
32Gi memory
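Plugging in numbers (using the illustrative reservations introduced later in this segment: kube-reserved cpu=1,memory=2Gi, system-reserved cpu=1,memory=1Gi, eviction-hard memory.available<500Mi):
Allocatable CPU    = 16   - 1 (kube-reserved) - 1 (system-reserved)          = 14 vCPU
Allocatable memory = 32Gi - 2Gi               - 1Gi               - 0.5Gi   = 28.5Gi
Eviction thresholds exist only for memory and ephemeral storage, so they do not reduce allocatable CPU.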
PART 2 — CPU/Memory Carving on a Node
This is the most important diagram:
Node Capacity
+----------------------------+
| kubelet                    |
| system daemons (OS)        |  ← system-reserved & kube-reserved
| container runtime          |
+----------------------------+
| eviction-hard / -soft      |  ← eviction thresholds
+----------------------------+
| Node Allocatable           |
| (Pods can be scheduled)    |
+----------------------------+
Pods can use ONLY Node Allocatable, not the full node memory/CPU.
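You can see this split on any running node: kubectl describe node <node-name> prints Capacity and Allocatable, and the same data lives in the Node object's status. An illustrative excerpt for the 16 vCPU / 32Gi node above (values assume the reservations used in this segment; a real node reports memory in Ki):
status:
  capacity:
    cpu: "16"
    memory: "32Gi"
    pods: "110"
  allocatable:
    cpu: "14"
    memory: "29184Mi"   # = 28.5Gi: capacity minus reservations and eviction-hard
    pods: "110"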
PART 3 — kube-reserved & system-reserved
These flags define how much CPU/memory the node keeps for system components.
Configurable via kubelet flags:
--kube-reserved=cpu=1,memory=2Gi
--system-reserved=cpu=1,memory=1Gi
What they cover:
system-reserved
- OS services (systemd, journald)
- kernel overhead
- network agents
- background system processes
kube-reserved
- kubelet
- kube-proxy
- container runtime (containerd/CRI-O)
- CNI plugins
- CSI drivers
- device plugins
If you don’t reserve these:
User Pods will starve system processes → node becomes unstable → eviction storms and kubelet heartbeat failures.
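Flags still work, but reservations are more commonly set in the kubelet's config file. A minimal KubeletConfiguration sketch matching the flag values above (enforceNodeAllocatable is shown with its default value):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "1"
  memory: "2Gi"
systemReserved:
  cpu: "1"
  memory: "1Gi"
# Enforce the Pod allocatable boundary via cgroups (this is the default).
enforceNodeAllocatable:
  - pods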
PART 4 — Eviction Thresholds (Soft & Hard)
Hard eviction:
--eviction-hard=memory.available<500Mi
- Immediate
- No grace period
- Kubelet starts evicting Pods right away
Soft eviction:
--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=60s
- Warning threshold
- Kubelet waits for grace period
- Then evicts if pressure continues
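In KubeletConfiguration form, the same thresholds map to the evictionHard, evictionSoft, and evictionSoftGracePeriod fields; a minimal sketch with the values above:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "60s"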
PART 5 — Eviction Priority & QoS Ordering
Eviction order is:
- BestEffort Pods (no requests/limits)
- Burstable Pods (evict higher actual-usage/requests first)
- Guaranteed Pods (last to be killed)
Kubelet uses:
- Pod QoS class
- Pod priorityClass
- Node resource pressure conditions
- actual Pod usage from cgroup stats
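As a reminder, the QoS class is derived entirely from how requests and limits are declared; the container resources blocks below are illustrative (values are arbitrary):
# Guaranteed — requests equal limits for every container
resources:
  requests: {cpu: "500m", memory: "512Mi"}
  limits:   {cpu: "500m", memory: "512Mi"}

# Burstable — requests set, limits higher (or only partially set)
resources:
  requests: {cpu: "250m", memory: "256Mi"}
  limits:   {cpu: "1",    memory: "1Gi"}

# BestEffort — no requests or limits at all: first in line for eviction
resources: {}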
PART 6 — The Memory Pressure Timeline (critical)
This is where node instability happens.
Let’s walk through the real process:
Time = T0 — Everything OK
memory.available > eviction-soft threshold
memory.working_set low
Pods using less than their limits.
Time = T1 — Pods increase memory usage
If Pods consume memory that pushes:
memory.available < eviction-soft threshold
- Soft eviction timer starts
- Kubelet marks node: MemoryPressure=True
- Scheduler avoids placing NEW Pods on the node (it receives the node.kubernetes.io/memory-pressure taint)
- No Pods killed yet
Time = T2 — Soft threshold exceeded beyond grace period
After grace:
- Kubelet evaluates which Pods to evict
- Picks victims in QoS order:
  - BestEffort first
  - then Burstable
  - Guaranteed last
Based on:
- how far actual usage exceeds requests (for the resource under pressure)
- Pod priority
Kubelet evicts pods.
Time = T3 — Node goes below eviction-hard threshold
If memory drops below hard threshold:
memory.available < eviction-hard
Kubelet:
- kills Pods immediately
- often several Pods at once
- ignores the Pods' termination grace periods
- tries to free memory ASAP
This can cause:
- cascading failures
- sudden loss of workloads
- node becoming NotReady
- kubelet crash/restart loops
Time = T4 — Kernel OOM activates (worse)
If kubelet is too slow:
The Linux kernel OOM killer steps in:
- randomly kills processes inside Pods
- may kill kubelet itself (!!)
- may kill containerd
- may corrupt Pod state / checkpoint files
This is the worst-case scenario.
The vast majority of "random K8s restarts" are the kernel OOM killer taking something out.
Time = T5 — Node NotReady
If kubelet or container runtime is killed:
- node becomes NotReady
- pods get rescheduled elsewhere
- node may need manual intervention
PART 7 — Why Nodes Become Unstable (real-world root causes)
Most common root causes:
1. No system-reserved or kube-reserved
User Pods starve kubelet/containerd.
2. Memory limits not set on Pods
BestEffort Pods can consume the entire node's memory.
3. Overcommitting memory
Requests too low → scheduler over-packs the node.
4. CSI drivers, CNIs, and monitoring agents consume memory
And you didn’t reserve memory for them.
5. Page cache consuming memory
Not accounted as “Pod memory” but still affects eviction-hard.
6. Bursty workloads (Java, Node.js, ML inference)
Spike above their memory request.
7. Using horizontal scaling with lots of small Pods
Fragmentation + overhead = node thrash.
PART 8 — How Node Allocatable Impacts Pod Scheduling
If Node Allocatable is too large:
- the scheduler believes the node has more memory than it really does → node gets overloaded → eviction storms
If Node Allocatable is too strict:
- the scheduler underutilizes the node → wasted resources
You must find the right balance.
PART 9 — Best Practices for Production
1. Always define system-reserved + kube-reserved
Example:
--system-reserved=cpu=500m,memory=1Gi
--kube-reserved=cpu=1,memory=2Gi
2. Always define eviction thresholds
Recommended:
--eviction-hard=memory.available<500Mi
--eviction-soft=memory.available<1Gi
--eviction-soft-grace-period=1m
3. NEVER run BestEffort Pods in production
Unless it’s a throwaway debug pod.
4. Always set memory requests and limits
They are what stop a runaway container from taking down the node; a namespace-wide default is sketched below.
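One way to enforce this across a namespace is a LimitRange that injects default requests and limits into containers that omit them; a minimal sketch (the namespace and values are assumptions to adapt):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: production          # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container sets no requests
        cpu: "250m"
        memory: "256Mi"
      default:                   # applied when a container sets no limits
        cpu: "500m"
        memory: "512Mi"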
5. Overcommit CPU aggressively, memory cautiously
CPU is elastic. Memory is not.
6. For dense nodes, enable MemoryQoS
It makes page cache behavior fairer and reduces risk of kernel OOM.
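MemoryQoS is a kubelet feature gate and requires cgroup v2 on the node; whether it is available (and still alpha/beta) depends on your Kubernetes version. A sketch of turning it on in KubeletConfiguration:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true   # needs cgroup v2; sets memory.min/memory.high for containers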
7. Reserve extra memory on nodes with heavy page cache usage
For data-processing nodes, increase system-reserved so the page cache has headroom before eviction thresholds are hit.
8. If running AI/ML workloads
Use:
- static CPU Manager
- Topology Manager
- hugepages
- system reservations
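For those workloads, the relevant kubelet settings look roughly like this; a sketch under the assumption of a dedicated, NUMA-sensitive node (policies and reserved CPUs are values to adapt):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # exclusive CPUs for Guaranteed Pods with integer CPU requests
topologyManagerPolicy: single-numa-node   # align CPU, memory, and device placement on one NUMA node
reservedSystemCPUs: "0,1"                 # keep system/kubelet housekeeping off the pinned CPUs
systemReserved:
  memory: "2Gi"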
9. Use Node Feature Discovery + Node labels
Give special nodes more reservations to avoid instability.
SEGMENT 7 SUMMARY
You now understand:
Node Allocatable
- what portion of a node Pods can use
- how the kubelet subtracts reservations and eviction thresholds from capacity
System & Kube Reservations
- protect OS + kubelet + container runtime
- essential for stable nodes
Eviction Thresholds
- soft = warning
- hard = immediate
- protect node from total memory exhaustion
Eviction Order
BestEffort → Burstable → Guaranteed
Memory Pressure Timeline
- how nodes go NotReady
- how Pods get killed
- how kernel OOM fights kubelet
Best Practices
- always set limits
- configure reservations
- avoid BestEffort
- enable MemoryQoS
- avoid overcommitting memory
This is the foundation for stable production Kubernetes clusters.