Kubernetes Resource Isolation - 11. Kubernetes Performance Tuning Playbooks
Segment 11 is where we turn everything from the previous deep dives into practical tuning playbooks for real workloads. These are the exact patterns used by:
- AI/ML infra teams
- FinTech low-latency clusters
- Telco NFV teams
- High-throughput data platforms
- Large enterprise Kubernetes platforms (AKS/EKS/GKE)
We’ll cover CPU, Memory, GC, NUMA, CFS throttling, working set behavior, eviction safety, request/limit design, and more per workload.
SEGMENT 11 — Kubernetes Performance Tuning Playbooks
We will produce 8 workload-specific tuning playbooks:
- Java services (Spring Boot / Micronaut / Kafka / Pega / JVM-based apps)
- Go microservices (Envoy, API, controller workloads)
- Node.js / Python microservices
- High-performance Redis / Memcached / in-memory DBs
- AI/ML inference workloads (TensorRT, ONNX, PyTorch Serve)
- Dataplane workloads (Envoy, NGINX, Cilium agent, DPDK, NFV)
- Databases (Postgres, MySQL, Elasticsearch, Cassandra)
- Batch/ETL (Spark, Flink, Ray)
Let’s go through each one with recommended CPU/memory patterns, cgroup settings, kubelet implications, GC tuning, and best practices.
PLAYBOOK 1 — Java Applications
(Java workloads are one of the most common sources of Kubernetes performance issues.)
What to expect from Java:
- Large and bursty allocations
- High thread count
- Off-heap usage (direct buffers, metaspace)
- Page cache usage (class loading)
- Predictable memory spikes during GC
CPU Tuning
- Never set a CPU limit unless it is truly required. A CPU limit introduces throttling → GC pauses → latency spikes. Use:
requests.cpu = <expected>
limits.cpu = none
- If latency-sensitive: Enable CPU pinning:
Guaranteed QoS (requests=limits)
integer CPUs (2,4,8)
cpuManagerPolicy: static
Memory Tuning
- Set the memory limit higher than the heap. Example:
heap = 3Gi
limit = 4Gi
- Tune:
-XX:+UseContainerSupport
-XX:MaxRAMPercentage=70
-XX:InitialRAMPercentage=70
- Tune metaspace:
-XX:MaxMetaspaceSize=512m
- Tune thread stack:
-Xss512k
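A minimal sketch of wiring these flags into the container through the standard JAVA_TOOL_OPTIONS environment variable, which the JVM picks up automatically (the flag values are illustrative, not prescriptive):
env:
- name: JAVA_TOOL_OPTIONS
  value: >-
    -XX:+UseContainerSupport
    -XX:MaxRAMPercentage=70.0
    -XX:InitialRAMPercentage=70.0
    -XX:MaxMetaspaceSize=512m
    -Xss512k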
MemoryQoS
Enable MemoryQoS:
- memory.min = request
- memory.high ≈ limit × 0.9, which reduces the chance of sudden OOM kills.
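MemoryQoS is a kubelet-level feature, not a pod setting. A hedged sketch of the kubelet configuration (MemoryQoS is still an alpha feature gate, requires cgroup v2 nodes, and the memoryThrottlingFactor field assumes a recent kubelet):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true            # alpha feature gate; needs cgroup v2
memoryThrottlingFactor: 0.9  # used by the kubelet to derive memory.high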
Pod configuration:
requests:
  cpu: "1"
  memory: "3Gi"
limits:
  memory: "4Gi"
  cpu: "2"        # optional
PLAYBOOK 2 — Go Microservices
What to expect
- Very efficient CPU usage
- Low memory footprint
- But high concurrency may need CPU cycles
- GC pauses rare but CPU-intensive under load
CPU Tuning
- Remove the CPU limit. Throttling causes significant latency spikes under high-QPS load.
requests.cpu = N
limits.cpu = none
- For HFT or low-latency:
- Pin 1 dedicated CPU
- Static CPU Manager
- single-numa-node
Memory Tuning
- Go apps rarely exceed heap unless misconfigured
- Set memory limit ≈ 2× expected RSS. Example: if the app uses 400Mi:
requests.memory = 400Mi
limits.memory = 800Mi
GOMAXPROCS
Size GOMAXPROCS to the CPUs actually available to the pod rather than to the node. Most Go runtimes in production today do not read the cgroup CPU quota: GOMAXPROCS defaults to the node's core count, so set the GOMAXPROCS environment variable explicitly or use uber-go/automaxprocs (only the newest Go releases derive it from the quota automatically).
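One way to wire this up, sketched below, is to derive GOMAXPROCS from the pod's own CPU request via the Downward API (the divisor of "1" rounds fractional CPUs up to a whole number; if you do set a CPU limit, uber-go/automaxprocs can read the quota directly instead):
env:
- name: GOMAXPROCS
  valueFrom:
    resourceFieldRef:
      resource: requests.cpu   # this playbook leaves limits.cpu unset
      divisor: "1"             # exposes whole CPUs, rounded up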
PLAYBOOK 3 — Node.js & Python Microservices
Node.js
- Single-threaded by default
- Sensitive to CPU throttling
- Memory usage often unstable
Best patterns:
- Do NOT set CPU limit
- Set memory limit ≈ 2–3x heap
- Scale horizontally
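Since Node.js scales by adding processes rather than threads, horizontal scaling is usually driven by an HPA on CPU utilization. A minimal sketch (the Deployment name and thresholds are illustrative):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: node-api               # hypothetical Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: node-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scales on CPU relative to requests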
Python
- GIL → only one thread runs Python bytecode at a time
- Heavy page cache usage when unpickling/loading ML models
Best patterns:
- Do not set CPU limit
- Favor more CPU requests for concurrency
- Use MemoryQoS to prevent page cache starvation
PLAYBOOK 4 — Redis / Memcached / In-memory data stores
Characteristics:
- Extremely sensitive to CPU jitter
- Memory footprint equals data size
- Must avoid page cache interference
- Single-threaded or few-threaded
CPU Tuning
DO THIS:
Use Guaranteed QoS:
requests.cpu = 2
limits.cpu = 2
CPU pinning is critical:
cpuManagerPolicy: static
topologyManagerPolicy: restricted or single-numa-node
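Both settings live in the kubelet configuration, not the pod spec. A minimal sketch (reservedSystemCPUs is illustrative; the static policy requires some CPUs to be reserved for system daemons):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0,1"      # keep kubelet/system daemons off the pinned cores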
Memory Tuning
The memory limit must include:
- object overhead
- fragmentation
- AOF buffers
- replication buffers
Recommended:
limit = dataset_size * 1.3
Follow the standard Redis kernel recommendation on the node:
vm.overcommit_memory = 1
(i.e. always allow overcommit, so fork-based persistence such as RDB snapshots and AOF rewrites does not fail under memory pressure).
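Because vm.overcommit_memory is a node-level (non-namespaced) sysctl, it must be changed on the host. One common, if blunt, pattern is a privileged init container (sketch only; a node-tuning DaemonSet or host configuration works just as well):
initContainers:
- name: set-overcommit          # hypothetical; needs privilege to write host sysctls
  image: busybox:1.36
  securityContext:
    privileged: true
  command: ["sh", "-c", "sysctl -w vm.overcommit_memory=1"]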
PLAYBOOK 5 — AI/ML Inference Workloads
Characteristics:
- Spiky memory and page cache
- NUMA-sensitive
- GPU memory bottlenecks
- High CPU for preprocessing
CPU Tuning:
Use integer CPU Guaranteed pods:
requests.cpu=4
limits.cpu=4
Enable CPUManager:
cpuManagerPolicy=static
NUMA Tuning:
Enable:
topologyManagerPolicy=single-numa-node
Ensures that CPU, GPU, and HugePages allocations all come from the same NUMA socket → 20–40% speedup.
Memory Tuning:
ML models produce:
- page cache pressure
- pinned memory
- large temporary tensors
Set:
limit = expected_peak * 1.4
Enable MemoryQoS.
GPU Tuning:
Node should have:
- MIG profiles (NVIDIA)
- fixed GPU memory budgets
- exclusive compute mode
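A hedged sketch of the pod-level resource shape for an inference container (the nvidia.com/gpu resource name assumes the NVIDIA device plugin; with MIG enabled, the node advertises MIG-profile resource names instead):
resources:
  requests:
    cpu: "4"
    memory: "8Gi"
    nvidia.com/gpu: 1
  limits:
    cpu: "4"              # integer CPUs with requests == limits → Guaranteed, pinnable
    memory: "8Gi"
    nvidia.com/gpu: 1     # extended resources always require requests == limits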
PLAYBOOK 6 — Dataplane Agents (Envoy, Cilium agent, NGINX)
Characteristics:
- Hot code paths
- Extremely latency-sensitive
- Should NEVER be throttled
- High memory for buffers
- NUMA-sensitive
CPU Tuning:
Absolute must:
requests.cpu = 2
limits.cpu = none
or Guaranteed integer CPU with static policy.
For serious performance:
- Pin to CPU cores within a single NUMA socket
- Reserve those cores exclusively
Memory:
- Set a moderately high memory limit (these agents are buffer-heavy)
requests.memory = 1Gi
limits.memory = 2Gi
PLAYBOOK 7 — Databases (Postgres, MySQL, Elasticsearch, Cassandra)
Common issues:
- page cache interactions
- fsync stalls
- stack overflow on huge queries
- JVM (ES) GC
CPU:
Databases need stable CPU, but, unlike dataplane agents, they can tolerate some throttling.
Use:
requests.cpu = moderate
limits.cpu = moderate
Memory:
Always leave headroom for:
- page cache
- background processes
For Postgres:
shared_buffers ≈ 25% memory
effective_cache_size ≈ 60%
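As a worked example (values are illustrative): with an 8Gi memory limit, those ratios come out to roughly 2GB of shared_buffers and 5GB of effective_cache_size. The official postgres image passes extra container args straight to the server, so a sketch looks like:
containers:
- name: postgres
  image: postgres:16
  args:
  - "-c"
  - "shared_buffers=2GB"        # ≈ 25% of the 8Gi limit
  - "-c"
  - "effective_cache_size=5GB"  # ≈ 60% of the 8Gi limit
  resources:
    requests:
      memory: "8Gi"
    limits:
      memory: "8Gi"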
For Elasticsearch:
- heap = 50% memory
- limit = heap * 1.3
- MemoryQoS recommended
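A hedged sketch of the Elasticsearch sizing rule above, assuming a 4g heap (heap is set via the standard ES_JAVA_OPTS variable; the limit follows the 1.3× rule):
env:
- name: ES_JAVA_OPTS
  value: "-Xms4g -Xmx4g"    # fixed heap; Xms == Xmx avoids resize pauses
resources:
  requests:
    memory: "5325Mi"
  limits:
    memory: "5325Mi"        # ≈ 4Gi heap × 1.3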
PLAYBOOK 8 — Batch/ETL (Spark, Flink, Ray)
Characteristics:
- Heavy IO
- heavy page cache
- transient memory spikes
- multiple executors
CPU:
Executors need plenty of CPU but are not latency-sensitive.
Use limits:
requests.cpu = 1
limits.cpu = 4
Memory:
Executors have:
- heap
- off-heap
- shuffle buffers
- page cache
Set:
limit = executor_memory * 1.5
requests.memory = executor_memory
MemoryQoS strongly recommended.
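For example (hedged, assuming a 4g executor heap), the rules above give per-executor pod resources of roughly:
# per-executor pod resources, assuming spark.executor.memory=4g
resources:
  requests:
    cpu: "1"
    memory: "4Gi"
  limits:
    cpu: "4"
    memory: "6Gi"    # ≈ executor_memory * 1.5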
GLOBAL BEST PRACTICES ACROSS ALL WORKLOADS
CPU
- Do NOT set CPU limits unless required
- Always set requests
- Use CPUManager + integer CPUs for low latency workloads
Memory
- Always set memory limits
- MemoryQoS reduces the risk of sudden OOM kills
- Overcommit memory very cautiously
QoS
- Avoid BestEffort
- Use Burstable for general workloads
- Use Guaranteed ONLY for latency-sensitive workloads
Eviction
- Tune eviction thresholds
- Use kube/system reserved memory
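A minimal kubelet sketch combining both points (all thresholds are illustrative and should be sized to the node):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"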
Node Selection
- NUMA-aware workload placement
- Local SSD for I/O heavy workloads
SEGMENT 11 SUMMARY
You now have workload-specific tuning playbooks for:
- Java
- Go
- Node/Python
- Redis
- AI/ML
- Dataplane agents
- Databases
- Batch/ETL
Each includes:
- CPU patterns
- Memory patterns
- NUMA rules
- CFS throttling guidance
- GC tuning
- MemoryQoS recommendations
This is the actionable knowledge used by senior Kubernetes performance engineers.