Kubernetes Resource Isolation - 11. Kubernetes Performance Tuning Playbooks

October 15, 2025 · 4 minute read

Segment 11 is where we turn everything from the previous deep dives into practical tuning playbooks for real workloads. These are the exact patterns used by:

  • AI/ML infra teams
  • FinTech low-latency clusters
  • Telco NFV teams
  • High-throughput data platforms
  • Large enterprise Kubernetes platforms (AKS/EKS/GKE)

We’ll cover CPU, Memory, GC, NUMA, CFS throttling, working set behavior, eviction safety, request/limit design, and more per workload.


SEGMENT 11 — Kubernetes Performance Tuning Playbooks

We will produce 8 workload-specific tuning playbooks:

  1. Java services (Spring Boot / Micronaut / Kafka / Pega / JVM-based apps)
  2. Go microservices (Envoy, API, controller workloads)
  3. Node.js / Python microservices
  4. High-performance Redis / Memcached / in-memory DBs
  5. AI/ML inference workloads (TensorRT, ONNX, PyTorch Serve)
  6. Dataplane workloads (Envoy, NGINX, Cilium agent, DPDK, NFV)
  7. Databases (Postgres, MySQL, Elasticsearch, Cassandra)
  8. Batch/ETL (Spark, Flink, Ray)

Let’s go through each one with recommended CPU/memory patterns, cgroup settings, kubelet implications, GC tuning, and best practices.


PLAYBOOK 1 — Java Applications

(JVM workloads are among the most common sources of Kubernetes performance problems.)

What to expect from Java:

  • Large and bursty allocations
  • High thread count
  • Off-heap usage (direct buffers, metaspace)
  • Page cache usage (class loading)
  • Predictable memory spikes during GC

CPU Tuning

  1. Do not set a CPU limit unless you are required to. A CPU limit introduces CFS throttling → longer GC pauses → latency spikes. Use:
requests.cpu = <expected>
limits.cpu   = none
  2. If the service is latency-sensitive, enable CPU pinning (kubelet sketch below):
Guaranteed QoS (requests = limits)
integer CPUs (2, 4, 8)
cpuManagerPolicy: static
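
CPU pinning is a node-level setting rather than a pod-level one. A minimal KubeletConfiguration sketch for a node pool that hosts latency-sensitive Java pods might look like this (the reserved CPU/memory values are illustrative assumptions, not recommendations):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Pin Guaranteed pods with integer CPU requests to dedicated cores
cpuManagerPolicy: static
# Keep CPU, memory and devices on a single NUMA node where possible
topologyManagerPolicy: single-numa-node
# CPU/memory held back for system and Kubernetes daemons (illustrative values);
# the static policy carves its reserved-CPU pool out of these reservations
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"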

Memory Tuning

  1. Set the container memory limit higher than the heap. Example:
heap  = 3Gi
limit = 4Gi
  2. Tune the RAM-percentage flags:
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=70
-XX:InitialRAMPercentage=70
  3. Tune metaspace:
-XX:MaxMetaspaceSize=512m
  4. Tune the thread stack size:
-Xss512k

MemoryQoS

Enable MemoryQoS:

  • memory.min = request
  • memory.high ≈ limit × 0.9

This throttles the workload before it reaches its limit and helps prevent sudden OOM kills (kubelet sketch below).
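
MemoryQoS itself is turned on at the kubelet and requires cgroup v2. A minimal sketch, assuming the MemoryQoS feature gate is still behind a flag in your Kubernetes version:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
# memory.high is derived from the limit multiplied by this factor
memoryThrottlingFactor: 0.9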

Pod configuration

requests:
  cpu: "1"
  memory: "3Gi"
limits:
  memory: "4Gi"
  # cpu: "2"      # optional; only if a CPU cap is truly required

PLAYBOOK 2 — Go Microservices

What to expect

  • Very efficient CPU usage
  • Low memory footprint
  • But high concurrency may need CPU cycles
  • GC pauses rare but CPU-intensive under load

CPU Tuning

  1. Remove the CPU limit. Throttling causes significant latency spikes under high-QPS load.
requests.cpu = N
limits.cpu   = none
  2. For HFT or other low-latency services:
  • Pin a dedicated CPU
  • Static CPU Manager policy
  • single-numa-node Topology Manager policy

Memory Tuning

  • Go apps rarely exceed their heap unless misconfigured
  • Set the memory limit ≈ 2x the expected RSS

Example, if the app uses about 400Mi:

requests.memory = 400Mi
limits.memory   = 800Mi

GOMAXPROCS

Keep GOMAXPROCS aligned with the CPU the container can actually use:

GOMAXPROCS = <CPUs usable by the container>

Note that the Go runtime historically defaults GOMAXPROCS to the host core count and ignores the cgroup CPU quota (container-aware defaults only arrived in Go 1.25), so set it explicitly or use uber-go/automaxprocs.
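
One way to keep GOMAXPROCS tied to the pod spec is the downward API. A sketch, assuming an integer CPU request (names and sizes are illustrative):

containers:
  - name: go-api                            # illustrative name
    image: example.com/go-api:1.0           # illustrative image
    env:
      - name: GOMAXPROCS
        valueFrom:
          resourceFieldRef:
            resource: requests.cpu          # exposed as whole cores
            divisor: "1"
    resources:
      requests:
        cpu: "2"
        memory: "400Mi"
      limits:
        memory: "800Mi"                     # no CPU limit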


PLAYBOOK 3 — Node.js & Python Microservices

Node.js

  • Single-threaded by default
  • Sensitive to CPU throttling
  • Memory usage often unstable

Best patterns:

  • Do NOT set CPU limit
  • Set memory limit ≈ 2–3x heap
  • Scale horizontally (HPA sketch below)
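
Since one Node.js process cannot use much more than a single core, CPU-based horizontal scaling is the usual lever. A minimal HPA sketch (the Deployment name, replica bounds, and utilization target are illustrative assumptions):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: node-api                  # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: node-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # percentage of the CPU request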

Python

  • GIL → only one thread runs Python bytecode at a time
  • Heavy on page cache for unpickling/ML models

Best patterns:

  • Do not set CPU limit
  • Favor more CPU requests for concurrency
  • Use MemoryQoS to prevent page cache starvation

PLAYBOOK 4 — Redis / Memcached / In-memory data stores

Characteristics:

  • Extremely sensitive to CPU jitter
  • Memory footprint equals data size
  • Must avoid page cache interference
  • Single-threaded or few-threaded

CPU Tuning

DO THIS:

Use Guaranteed QoS:

requests.cpu = 2
limits.cpu   = 2

CPU pinning is critical:

cpuManagerPolicy: static
topologyManagerPolicy: restricted or single-numa-node

Memory Tuning

The memory limit must include headroom for:

  • object overhead
  • fragmentation
  • AOF buffers
  • replication buffers

Recommended:

limit = dataset_size * 1.3

Apply the overcommit setting Redis recommends on the node, so background saves (fork) do not fail (one way to do this is sketched below):

vm.overcommit_memory = 1
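
vm.overcommit_memory is not a namespaced sysctl, so it cannot be set via the pod securityContext; it has to be applied on the node. One common pattern is a small privileged DaemonSet scoped to the Redis node pool, sketched here (names, labels, and images are illustrative assumptions; a node-tuning operator or machine config works just as well):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: redis-node-sysctls          # illustrative name
spec:
  selector:
    matchLabels:
      app: redis-node-sysctls
  template:
    metadata:
      labels:
        app: redis-node-sysctls
    spec:
      nodeSelector:
        workload: redis             # illustrative label for the Redis node pool
      initContainers:
        - name: set-overcommit
          image: busybox:1.36
          command: ["sysctl", "-w", "vm.overcommit_memory=1"]
          securityContext:
            privileged: true        # required to write /proc/sys on the host
      containers:
        - name: pause               # keeps the DaemonSet pod running
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"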

PLAYBOOK 5 — AI/ML Inference Workloads

Characteristics:

  • Spiky memory and page cache
  • NUMA-sensitive
  • GPU memory bottlenecks
  • High CPU for preprocessing

CPU Tuning:

Use integer CPU Guaranteed pods:

requests.cpu=4
limits.cpu=4

Enable CPUManager:

cpuManagerPolicy=static

NUMA Tuning:

Enable:

topologyManagerPolicy=single-numa-node

This ensures CPU, GPU, and HugePages allocations all come from the same NUMA node → a 20–40% speedup.

Memory Tuning:

ML models produce:

  • page cache pressure
  • pinned memory
  • large temporary tensors

Set:

limit = expected_peak * 1.4

Enable MemoryQoS.


GPU Tuning:

Node should have:

  • MIG profiles (NVIDIA)
  • fixed GPU memory budgets
  • exclusive compute setting
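
Pulling the CPU, NUMA, memory, and GPU guidance together, an inference pod sketch (the image, sizes, and GPU count are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: onnx-inference                      # illustrative name
spec:
  containers:
    - name: model-server
      image: example.com/onnx-server:1.0    # illustrative image
      resources:
        requests:
          cpu: "4"                          # integer CPUs → eligible for static pinning
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"                          # requests == limits → Guaranteed QoS
          memory: "16Gi"                    # ≈ expected peak * 1.4
          nvidia.com/gpu: "1"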

PLAYBOOK 6 — Dataplane Agents (Envoy, Cilium agent, NGINX)

Characteristics:

  • Hot code paths
  • Extremely latency-sensitive
  • Should NEVER be throttled
  • High memory for buffers
  • NUMA-sensitive

CPU Tuning:

Absolute must:

requests.cpu = 2
limits.cpu = none

or Guaranteed integer CPU with static policy.

For serious performance:

  • Pin to CPU cores within a single NUMA node
  • Reserve those cores exclusively

Memory:

  • Set a moderately high memory limit (these agents are buffer-heavy):
requests.memory = 1Gi
limits.memory   = 2Gi
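
As a concrete sketch, a node-local dataplane container following this pattern (the name and image tag are illustrative assumptions):

containers:
  - name: envoy                             # illustrative name
    image: envoyproxy/envoy:v1.31.0         # illustrative tag
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        memory: "2Gi"                       # memory capped, CPU deliberately uncapped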

PLAYBOOK 7 — Databases (Postgres, MySQL, Elasticsearch, Cassandra)

Common issues:

  • page cache interactions
  • fsync stalls
  • stack overflow on huge queries
  • JVM (ES) GC

CPU:

Databases need stable CPU, but moderate throttling is usually tolerable: they are throughput-bound rather than tail-latency-critical.

Use:

requests.cpu = moderate
limits.cpu = moderate

Memory:

Always leave headroom for:

  • page cache
  • background processes

For Postgres:

shared_buffers ≈ 25% memory
effective_cache_size ≈ 60%
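
Worked through for a container with an 8Gi memory limit (an illustrative size), that guidance lands at roughly shared_buffers = 2GB and effective_cache_size = 5GB; as a ConfigMap fragment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-tuning                     # illustrative name
data:
  postgresql.conf: |
    # ~25% of the 8Gi container memory limit
    shared_buffers = 2GB
    # ~60% of the limit; a planner hint, not an allocation
    effective_cache_size = 5GB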

For Elasticsearch:

  • heap = 50% memory
  • limit = heap * 1.3
  • MemoryQoS recommended

PLAYBOOK 8 — Batch/ETL (Spark, Flink, Ray)

Characteristics:

  • Heavy I/O
  • Heavy page cache use
  • Transient memory spikes
  • Multiple executors

CPU:

Executors need plenty of CPU but no strict latency guarantees, so CPU limits are acceptable here:

requests.cpu = 1
limits.cpu = 4

Memory:

Executors have:

  • heap
  • off-heap
  • shuffle buffers
  • page cache

Set:

limit = executor_memory * 1.5
requests.memory = executor_memory
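
For example, with a 4Gi executor (an illustrative size), the 1.5x rule gives:

resources:
  requests:
    cpu: "1"
    memory: "4Gi"       # executor_memory
  limits:
    cpu: "4"
    memory: "6Gi"       # executor_memory * 1.5, covering off-heap and shuffle buffers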

MemoryQoS strongly recommended.


GLOBAL BEST PRACTICES ACROSS ALL WORKLOADS

CPU

  • Do NOT set CPU limits unless required
  • Always set requests
  • Use CPUManager + integer CPUs for low latency workloads

Memory

  • Always set memory limits
  • MemoryQoS greatly reduces sudden OOM kills
  • Overcommit memory very cautiously

QoS

  • Avoid BestEffort
  • Use Burstable for general workloads
  • Use Guaranteed ONLY for latency-sensitive workloads

Eviction

  • Tune eviction thresholds
  • Reserve memory for kube and system daemons (kubelet sketch below)
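
A kubelet fragment sketching both knobs (thresholds and reservations are illustrative assumptions to be sized per node type):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
systemReserved:
  cpu: "500m"
  memory: "1Gi"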

Node Selection

  • NUMA-aware workload placement
  • Local SSD for I/O heavy workloads

SEGMENT 11 SUMMARY

You now have workload-specific tuning playbooks for:

  • Java
  • Go
  • Node/Python
  • Redis
  • AI/ML
  • Dataplane agents
  • Databases
  • Batch/ETL

Each includes:

  • CPU patterns
  • Memory patterns
  • NUMA rules
  • CFS throttling guidance
  • GC tuning
  • MemoryQoS recommendations

This is the actionable knowledge used by senior Kubernetes performance engineers.

