Kubernetes Resource Isolation - 13. Production-ready node & kubelet blueprint
1. High-level goals for the blueprint
We want nodes that:
- Don’t randomly OOM kubelet/containerd
- Don’t thrash with eviction storms
- Use MemoryQoS, Node Allocatable, and QoS classes properly
- Support CPUManager static and TopologyManager for special pools
- Are safe for mixed workloads (general pool) and can be specialized per node pool
We’ll assume:
- cgroup v2 + systemd driver
- containerd
- Kubernetes ≥ 1.26
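Before touching any config, it's worth verifying these assumptions on a node. A quick spot-check, assuming shell access to the node and the default containerd config path:
# cgroup v2: should print "cgroup2fs"
stat -fc %T /sys/fs/cgroup
# containerd using the systemd cgroup driver (default config path assumed)
grep -n "SystemdCgroup" /etc/containerd/config.toml
# kubelet's configured cgroup driver
grep cgroupDriver /var/lib/kubelet/config.yaml
# kubelet versions across the cluster
kubectl get nodes -o wide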
2. Kubelet configuration (config file)
Typically /var/lib/kubelet/config.yaml or via AKS/EKS/GKE node bootstrap.
2.1 General-purpose node pool (default workloads)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"
# ---- Resource Reservations & Node Allocatable ----
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "2Gi"
kubeReserved:
  cpu: "1000m"
  memory: "2Gi"
  ephemeral-storage: "2Gi"
# Note: the hard eviction thresholds below are also subtracted when the
# kubelet computes Node Allocatable.
enforceNodeAllocatable:
  - "pods"
  # Enforcing the next two requires systemReservedCgroup / kubeReservedCgroup
  # to be set and those cgroups to exist; otherwise keep only "pods".
  - "system-reserved"
  - "kube-reserved"
# ---- Eviction Settings ----
evictionHard:
  "memory.available": "500Mi"
  "nodefs.available": "10%"
  "nodefs.inodesFree": "5%"
evictionSoft:
  "memory.available": "1Gi"
evictionSoftGracePeriod:
  "memory.available": "1m"
evictionPressureTransitionPeriod: "30s"
# ---- CPU / cgroup behavior ----
cpuCFSQuota: true
cpuCFSQuotaPeriod: "100ms" # default, explicit for clarity
cgroupsPerQOS: true
serializeImagePulls: false
# ---- Pod PID limit ----
podPidsLimit: 1024
# ---- Topology / CPU Manager (disabled in general pool) ----
topologyManagerPolicy: "none"
cpuManagerPolicy: "none"
# ---- Memory QoS / swap (if supported by your distro & version) ----
# swapBehavior only takes effect with the NodeSwap feature gate and
# failSwapOn: false; with failSwapOn: true below, swap stays off entirely.
memorySwap:
  swapBehavior: "LimitedSwap" # or "NoSwap" in stricter environments
# Memory QoS (cgroup v2 memory.min/memory.high) is a feature gate, not a
# dedicated field: featureGates: { MemoryQoS: true } where supported.
# ---- Misc stability options ----
failSwapOn: true
maxPods: 110
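Once this file is in place and the kubelet has restarted, you can confirm what it actually loaded via the node's configz debug endpoint. A sketch, assuming kubectl access to the node proxy and jq installed:
NODE=<node>
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" \
  | jq '.kubeletconfig | {cgroupDriver, systemReserved, kubeReserved, evictionHard, podPidsLimit, maxPods}'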
How to tune:
- Smaller nodes: you can shrink kubeReserved/systemReserved slightly, but keep at least 10–12% of memory reserved.
- Heavier system agents (CNI, CSI, monitoring): bump kubeReserved.memory to 3–4 GiB on 64 GiB+ nodes.
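The reservations and the hard eviction threshold are all subtracted from node capacity to produce Allocatable, which is what the scheduler actually packs against. For example, on a 16 GiB node with the config above: 16Gi - 1Gi (systemReserved) - 2Gi (kubeReserved) - 500Mi (evictionHard) ≈ 12.5Gi allocatable memory. Quick check:
kubectl get node <node> -o jsonpath='{.status.capacity.memory}{" capacity\n"}{.status.allocatable.memory}{" allocatable\n"}'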
2.2 Performance node pool (CPUManager + TopologyManager)
Use this for Redis, Envoy, ML inference, latency-sensitive workloads.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"
# Similar reservations, maybe slightly higher:
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "1500m"
  memory: "3Gi"
enforceNodeAllocatable:
  - "pods"
  - "system-reserved"
  - "kube-reserved"
# Evictions (you can be more conservative here)
evictionHard:
  "memory.available": "1Gi"
evictionSoft:
  "memory.available": "2Gi"
evictionSoftGracePeriod:
  "memory.available": "30s"
# CPU / topology
cgroupsPerQOS: true
cpuCFSQuota: true
cpuManagerPolicy: "static"
cpuManagerReconcilePeriod: "5s"
topologyManagerPolicy: "single-numa-node"
topologyManagerScope: "container" # or "pod" depending on version
podPidsLimit: 2048
maxPods: 60 # fewer pods per node to keep things predictable
Pair this with:
- Guaranteed pods with integer CPU requests/limits (e.g. cpu: "2"); you can verify the resulting pinning with the check below.
- Node labels/taints so only special workloads land here (see section 5).
3. Systemd unit snippet for kubelet
If you’re not on a managed service or you have access to systemd drop-ins, something like:
/etc/systemd/system/kubelet.service.d/10-custom.conf:
[Service]
Environment="KUBELET_CONFIG_FILE=/var/lib/kubelet/config.yaml"
# Note: command-line flags override the config file. --container-runtime=remote
# and --cgroup-driver are omitted here: the former was removed in Kubernetes 1.27
# (remote was the only valid value anyway), and the latter already lives in config.yaml.
Environment="KUBELET_EXTRA_ARGS=\
--container-runtime-endpoint=unix:///run/containerd/containerd.sock \
--kube-reserved=cpu=1000m,memory=2Gi,ephemeral-storage=2Gi \
--system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=2Gi \
--eviction-hard=memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5% \
--eviction-soft=memory.available<1Gi \
--eviction-soft-grace-period=memory.available=1m \
--pod-max-pids=1024"
# Ensure the kubelet restarts on failure
Restart=always
RestartSec=10
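After dropping this file in, reload systemd and restart the kubelet, then confirm the drop-in was actually merged into the unit:
systemctl daemon-reload
systemctl restart kubelet
# shows the unit plus all drop-ins, and current health
systemctl cat kubelet
systemctl status kubelet --no-pager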
On managed platforms (AKS/EKS/GKE) you don’t directly own this unit, but you can approximate it via:
- EKS: kubeletExtraConfig in the node group, user data, or Bottlerocket TOML.
- AKS: custom node configuration on the node pool, or a custom node image.
- GKE: node pools with custom kubelet config (GKE Autopilot does a lot automatically).
4. OS / sysctl tuning (applied on all worker nodes)
Put these into /etc/sysctl.d/99-k8s-tuning.conf:
# --- Networking ---
net.core.somaxconn = 4096
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
# --- Connection tracking (adjust per-node traffic) ---
net.netfilter.nf_conntrack_max = 262144
# --- VM / Memory ---
vm.swappiness = 1 # prefer reclaim over swap
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5
vm.overcommit_memory = 1 # 0 or 1 depending on DB/Redis tuning
vm.vfs_cache_pressure = 100 # adjust if slab/page cache issues
# Allow a lot of inotify watchers (for controllers, dev workloads)
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 1024
# PID limit (safety)
kernel.pid_max = 4194304
Then:
sysctl --system
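Then spot-check that the values took effect; note that the nf_conntrack keys only exist once the conntrack module is loaded (e.g. after kube-proxy/CNI starts):
sysctl -n net.core.somaxconn vm.swappiness fs.inotify.max_user_watches
sysctl -n net.netfilter.nf_conntrack_max 2>/dev/null || echo "nf_conntrack module not loaded yet"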
Platform-specific notes:
- For Redis/DB nodes you might set vm.overcommit_memory=1 explicitly and tune memory differently.
- For extremely IO-heavy nodes you might dial vm.vfs_cache_pressure up to push out dentries faster.
5. Node pool patterns: labels + taints
To really make this blueprint useful, split the cluster into at least two node pools:
- general-pool – most workloads
- perf-pool – CPUManager + TopologyManager for special workloads
5.1 General node pool
Label & (optionally) taint:
kubectl label node <node> node-pool=general
# usually no taint, everything can run here
5.2 Perf node pool
kubectl label node <node> node-pool=perf
kubectl taint node <node> perf-only=true:NoSchedule
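Quick verification that both pools look right before pointing workloads at them:
# Label column per node, plus any taints
kubectl get nodes -L node-pool
kubectl describe node <node> | grep -i taints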
Then in a Pod spec for latency-critical workloads:
apiVersion: v1
kind: Pod
metadata:
  name: redis-perf
spec:
  nodeSelector:
    node-pool: perf
  tolerations:
    - key: "perf-only"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: redis
      image: redis:7
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
Because:
- Guaranteed QoS (request=limit)
- Integer CPUs
- Scheduled only on perf nodes
→ CPUManager pins it, TopologyManager keeps it NUMA-local.
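You can confirm the pinning from inside the Pod, without node access, by looking at the CPU affinity of the container's main process; a sketch, assuming the redis-perf Pod above is running:
# Expect a small, fixed CPU list (e.g. "2-3"), not every CPU on the node
kubectl exec redis-perf -- grep Cpus_allowed_list /proc/1/status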
6. Per-workload overrides / patterns
6.1 General microservice Pod
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1" # or omit the CPU limit if you’re okay with bursting
    memory: "512Mi"
QoS: Burstable. Good for the default/general node pool.
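If you want to double-check which QoS class a requests/limits combination produced, the API reports it on the running Pod; my-service below is a placeholder name:
# Prints Guaranteed, Burstable, or BestEffort ("my-service" is a placeholder)
kubectl get pod my-service -o jsonpath='{.status.qosClass}{"\n"}'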
6.2 JVM service (general pool)
resources:
  requests:
    cpu: "500m"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "3Gi"
And inside the container set:
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=70
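To sanity-check that the heap is sized from the 3Gi container limit rather than node RAM, you can re-run the JVM with the same flags and inspect the computed value; jvm-service is a placeholder Pod name, and this assumes java is on the image's PATH:
# Expect roughly 70% of 3Gi (~2.1 GiB), not a fraction of node memory
kubectl exec jvm-service -- java -XX:+UseContainerSupport -XX:MaxRAMPercentage=70 -XX:+PrintFlagsFinal -version | grep -w MaxHeapSize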
6.3 ML/Envoy/Redis on perf pool (pinned CPUs)
resources:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "8Gi"
QoS: Guaranteed, integer CPUs → pinned.
7. Checklist before rolling this out
For any cluster/node pool:
- Confirm cgroup v2 + systemd driver
- Set systemReserved and kubeReserved
- Set evictionHard and evictionSoft thresholds
- Set podPidsLimit
- Decide which pools get cpuManagerPolicy=static and topologyManagerPolicy=single-numa-node
- Apply sysctl tuning via cloud-init/DaemonSet
- Label/taint node pools
- Adjust maxPods per node size (don’t go crazy on density)
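A one-liner to eyeball the end result across the fleet (assumes the node-pool label from section 5):
kubectl get nodes -o custom-columns='NAME:.metadata.name,POOL:.metadata.labels.node-pool,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,PODS:.status.allocatable.pods'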