Kubernetes Resource Isolation - 13. Production-ready node & kubelet blueprint

October 16, 2025

SEGMENT 13 — Production-ready node & kubelet blueprint

1. High-level goals for the blueprint

We want nodes that:

  • Don’t randomly OOM kubelet/containerd
  • Don’t thrash with eviction storms
  • Use MemoryQoS, Node Allocatable, and QoS classes properly
  • Support CPUManager static and TopologyManager for special pools
  • Are safe for mixed workloads (general pool) and can be specialized per node pool

We’ll assume:

  • cgroup v2 + systemd driver
  • containerd
  • Kubernetes ≥ 1.26
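
A quick way to confirm the first two assumptions on an existing node (a minimal check; paths assume a default containerd install):

stat -fc %T /sys/fs/cgroup/          # "cgroup2fs" means cgroup v2
grep -i cgroupDriver /var/lib/kubelet/config.yaml
grep -A 3 'runc.options' /etc/containerd/config.toml   # expect SystemdCgroup = true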

2. Kubelet configuration (config file)

Typically /var/lib/kubelet/config.yaml or via AKS/EKS/GKE node bootstrap.

2.1 General-purpose node pool (default workloads)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"

# ---- Resource Reservations & Node Allocatable ----
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "2Gi"

kubeReserved:
  cpu: "1000m"
  memory: "2Gi"
  ephemeral-storage: "2Gi"

# Note: there is no separate "eviction reserved" field; the hard eviction
# threshold below is also subtracted from Allocatable memory, on top of
# systemReserved/kubeReserved. Extra node-level daemons should be folded
# into systemReserved.

# Enforcing system-reserved / kube-reserved additionally requires
# systemReservedCgroup / kubeReservedCgroup to point at pre-created cgroups,
# otherwise kubelet fails config validation; if you can't guarantee that,
# enforce only "pods".
enforceNodeAllocatable:
  - "pods"
  - "system-reserved"
  - "kube-reserved"

# ---- Eviction Settings ----
evictionHard:
  "memory.available": "500Mi"
  "nodefs.available": "10%"
  "nodefs.inodesFree": "5%"

evictionSoft:
  "memory.available": "1Gi"
evictionSoftGracePeriod:
  "memory.available": "1m"

evictionPressureTransitionPeriod: "30s"

# ---- CPU / CGroup Behavior ----
cpuCFSQuota: true
cpuCFSQuotaPeriod: "100ms"   # default, explicit for clarity
cgroupsPerQOS: true
serializeImagePulls: false

# ---- Pod PID limit ----
podPidsLimit: 1024

# ---- Topology / CPU Manager (disabled in general pool) ----
topologyManagerPolicy: "none"
cpuManagerPolicy: "none"

# ---- Swap behavior & Memory QoS (if supported by your distro & version) ----
memorySwap:
  swapBehavior: "LimitedSwap"   # or "NoSwap" in stricter environments
# swapBehavior only takes effect when swap is enabled on the node, failSwapOn
# (below) is false, and the NodeSwap feature gate is on.
# Memory QoS itself is gated behind the MemoryQoS feature gate on cgroup v2
# (featureGates: {MemoryQoS: true}); memoryThrottlingFactor controls how far
# below the memory limit memory.high is set.

# ---- Misc stability options ----
failSwapOn: true
maxPods: 110

How to tune:

  • Smaller nodes: you can shrink kubeReserved/systemReserved slightly, but keep at least 10–12% of memory reserved.
  • Heavier system agents (CNI, CSI, monitoring): bump kubeReserved.memory to 3–4 GiB on 64Gi+ nodes.
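
To sanity-check that the reservations landed, compare Capacity and Allocatable on a node; for memory, Allocatable is roughly Capacity minus systemReserved, kubeReserved, and the hard eviction threshold:

kubectl describe node <node> | grep -A 6 -E 'Capacity|Allocatable'

# or just the memory numbers:
kubectl get node <node> -o jsonpath='{.status.capacity.memory}{"  "}{.status.allocatable.memory}{"\n"}'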

2.2 Performance node pool (CPUManager + TopologyManager)

Use this for Redis, Envoy, ML inference, and other latency-sensitive workloads.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: "systemd"

# Similar reservations, maybe slightly higher:
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "1500m"
  memory: "3Gi"

enforceNodeAllocatable:
  - "pods"
  - "system-reserved"
  - "kube-reserved"

# Evictions (you can be more conservative here)
evictionHard:
  "memory.available": "1Gi"
evictionSoft:
  "memory.available": "2Gi"
evictionSoftGracePeriod:
  "memory.available": "30s"

# CPU / topology
cgroupsPerQOS: true
cpuCFSQuota: true

cpuManagerPolicy: "static"
cpuManagerReconcilePeriod: "5s"

topologyManagerPolicy: "single-numa-node"
topologyManagerScope: "container"  # or "pod" to align all containers in a pod together

podPidsLimit: 2048
maxPods: 60   # fewer pods per node to keep things predictable

Paired with:

  • Guaranteed pods with integer CPU requests/limits (e.g. cpu: "2").
  • Node labels/taints so only special workloads land here (see section 5).
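
One operational gotcha worth calling out: the CPU manager can't change policy on the fly. Converting an existing node from cpuManagerPolicy "none" to "static" means wiping the old state file, roughly like this (assuming direct node access):

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# on the node:
systemctl stop kubelet
rm -f /var/lib/kubelet/cpu_manager_state
# update /var/lib/kubelet/config.yaml, then:
systemctl start kubelet
kubectl uncordon <node>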

3. Systemd unit snippet for kubelet

If you manage your own nodes, or otherwise have access to systemd drop-ins, something like:

/etc/systemd/system/kubelet.service.d/10-custom.conf:

[Service]
# These variables only take effect if the kubelet ExecStart references them
# (kubeadm's default drop-in does, via $KUBELET_EXTRA_ARGS).
Environment="KUBELET_CONFIG_FILE=/var/lib/kubelet/config.yaml"
# Prefer the config file for anything it supports; use flags only where your
# bootstrap tooling can't ship one. (--container-runtime=remote was removed
# in Kubernetes 1.27 and should no longer be passed.)
Environment="KUBELET_EXTRA_ARGS=\
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock \
  --kube-reserved=cpu=1000m,memory=2Gi,ephemeral-storage=2Gi \
  --system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=2Gi \
  --eviction-hard=memory.available<500Mi,nodefs.available<10%,nodefs.inodesFree<5% \
  --eviction-soft=memory.available<1Gi \
  --eviction-soft-grace-period=memory.available=1m \
  --pod-max-pids=1024"

# Restart kubelet if it ever crashes
Restart=always
RestartSec=10

On managed platforms (AKS/EKS/GKE) you don’t directly own this, but you can approximate via:

  • EKS: kubeletExtraConfig in node group, user data, or Bottlerocket TOML.
  • AKS: Node config profile / custom node image.
  • GKE: Node pools with custom kubelet config (GKE Autopilot does a lot automatically).
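
As one concrete example, eksctl exposes most of these knobs per (self-managed) node group through kubeletExtraConfig, which is merged into the generated kubelet config. A rough sketch; the cluster name, region, and instance type are placeholders:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # placeholder
  region: us-east-1       # placeholder
nodeGroups:
  - name: general-pool
    instanceType: m6i.2xlarge   # placeholder
    desiredCapacity: 3
    labels:
      node-pool: general
    kubeletExtraConfig:
      systemReserved:
        cpu: "500m"
        memory: "1Gi"
      kubeReserved:
        cpu: "1000m"
        memory: "2Gi"
        ephemeral-storage: "2Gi"
      evictionHard:
        memory.available: "500Mi"
        nodefs.available: "10%"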

4. OS / sysctl tuning (applied on all worker nodes)

Put these into /etc/sysctl.d/99-k8s-tuning.conf:

# --- Networking ---
net.core.somaxconn = 4096
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1

# --- Connection tracking (adjust per-node traffic) ---
net.netfilter.nf_conntrack_max = 262144

# --- VM / Memory ---
vm.swappiness = 1              # prefer reclaim over swap
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5
vm.overcommit_memory = 1       # 0 or 1 depending on DB/Redis tuning
vm.vfs_cache_pressure = 100    # adjust if slab/page cache issues

# Allow a lot of inotify watchers (for controllers, dev workloads)
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 1024

# PID limit (safety)
kernel.pid_max = 4194304

Then:

sysctl --system

Platform-specific notes:

  • For Redis nodes, vm.overcommit_memory=1 is what Redis itself recommends; other databases may prefer the default heuristic overcommit (0), so tune per pool.
  • For extremely IO-heavy nodes you might dial vm.vfs_cache_pressure up to push out dentries faster.
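
If you can't bake the file in via cloud-init or a custom node image, a privileged DaemonSet is a common workaround (the checklist in section 7 mentions both routes). A rough sketch; the image tags and the exact sysctl list are illustrative, not prescriptive:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuning
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuning
  template:
    metadata:
      labels:
        app: sysctl-tuning
    spec:
      hostNetwork: true            # so net.* sysctls hit the host netns
      tolerations:
        - operator: "Exists"       # run on every node, including tainted pools
      initContainers:
        - name: apply-sysctls
          image: busybox:1.36      # any small image with a sysctl binary works
          securityContext:
            privileged: true       # required for vm.*, fs.*, kernel.* writes
          command:
            - sh
            - -c
            - |
              sysctl -w net.core.somaxconn=4096
              sysctl -w vm.swappiness=1
              sysctl -w fs.inotify.max_user_watches=524288
      containers:
        - name: pause              # keeps the pod alive after tuning
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "5m"
              memory: "16Mi"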

5. Node pool patterns: labels + taints

To really make this blueprint useful, split the cluster into at least two node pools:

  1. general-pool – most workloads
  2. perf-pool – CPUManager + TopologyManager for special workloads

5.1 General node pool

Label & (optionally) taint:

kubectl label node <node> node-pool=general
# usually no taint, everything can run here

5.2 Perf node pool

kubectl label node <node> node-pool=perf
kubectl taint node <node> perf-only=true:NoSchedule

Then in a Pod spec for latency-critical workloads:

apiVersion: v1
kind: Pod
metadata:
  name: redis-perf
spec:
  nodeSelector:
    node-pool: perf
  tolerations:
    - key: "perf-only"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
  - name: redis
    image: redis:7
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu: "2"
        memory: "4Gi"

Because:

  • Guaranteed QoS (request=limit)
  • Integer CPUs
  • Scheduled only on perf nodes

→ CPUManager pins it, TopologyManager keeps it NUMA-local.
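
To verify the pinning actually happened (paths assume the default kubelet root dir):

kubectl get pod redis-perf -o jsonpath='{.status.qosClass}{"\n"}'   # should print Guaranteed

# on the node: static assignments are recorded in the CPU manager state file
cat /var/lib/kubelet/cpu_manager_state

# and the container's main process should be restricted to those CPUs
grep Cpus_allowed_list /proc/<redis-pid>/status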


6. Per-workload overrides / patterns

6.1 General microservice Pod

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "1"        # or omit limit if you’re okay with burst
    memory: "512Mi"

QoS: Burstable. A good fit for the default/general node pool.


6.2 JVM service (general pool)

resources:
  requests:
    cpu: "500m"
    memory: "2Gi"
  limits:
    cpu: "2"
    memory: "3Gi"

And inside the container set:

-XX:+UseContainerSupport
-XX:MaxRAMPercentage=70
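
One way to pass these flags without rebuilding the image is the standard JAVA_TOOL_OPTIONS environment variable, which the JVM picks up automatically; in the container spec:

env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=70.0"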

6.3 ML/Envoy/Redis on perf pool (pinned CPUs)

resources:
  requests:
    cpu: "4"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "8Gi"

QoS: Guaranteed, integer CPUs → pinned.


7. Checklist before rolling this out

For any cluster/node pool:

  • Confirm cgroup v2 + systemd driver
  • Set systemReserved and kubeReserved
  • Set evictionHard and evictionSoft thresholds
  • Set podPidsLimit
  • Decide which pools get cpuManagerPolicy=static and topologyManagerPolicy=single-numa-node
  • Apply sysctl tuning via cloud-init/DaemonSet
  • Label/taint node pools
  • Adjust maxPods per node size (don’t go crazy on density)
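
And two quick commands to confirm the pool labels and taints once nodes are up:

kubectl get nodes -L node-pool
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints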
