Kubernetes Resource Isolation - 12. Ultimate Node Sizing Guide for AKS, EKS, and GKE
Segment 12 is where we get extremely practical about selecting the right node sizes and VM shapes in AKS/EKS/GKE. This is one of the most important but least understood aspects of Kubernetes performance engineering.
Choosing the wrong node size leads to:
- Constant evictions
- Memory pressure
- CPU throttling
- NUMA imbalance
- Poor inference latency
- Overpaying for unused cores
- Starved node system components (CNI agents, CSI drivers, monitoring agents)
This guide will help you select the best node types for:
- Microservices
- JVM workloads
- High-throughput services
- Dataplanes (Cilium, Envoy)
- Redis, Postgres
- AI/ML
- Spark
- GPU workloads
Let’s go deep.
SEGMENT 12 — Ultimate Node Sizing Guide for AKS, EKS, and GKE
We will cover:
- The principles for choosing node sizes
- CPU-to-memory ratios that actually work
- Understanding NUMA (critical!)
- Choosing VM families in each cloud (AKS/EKS/GKE)
- Node sizes for different workload types
- When to use large nodes vs many small nodes
- When to use local SSD
- Cost optimization rules
PART 1 — Principles of Good Node Sizing
These are universal across AKS/EKS/GKE.
1. Memory pressure kills nodes — not CPU
Always design node capacity with memory as the primary constraint.
Nodes rarely fail from high CPU usage. Nodes frequently fail from memory exhaustion → eviction → OOM → kubelet death → NotReady.
2. NUMA topology heavily affects performance
Nodes with ≥ 2 sockets or ≥ 2 NUMA nodes require careful placement.
- JVM
- Redis
- AI inference
- network dataplanes
These workloads cannot randomly bounce across NUMA nodes.
Prefer single-NUMA nodes for latency-sensitive workloads.
3. Avoid nodes with > 64 vCPUs unless you run CPU-pinned workloads
Large nodes → more NUMA domains → more cgroup fragmentation → lower efficiency.
4. Prefer more medium nodes over fewer huge nodes
- reduces blast radius
- avoids multi-Pod NUMA fragmentation
- improves bin packing
- reduces eviction chain reactions
5. Always leave space for system daemons
Rule of thumb:
- Reserve 6–12% of node memory
- Reserve 0.5–1.5 vCPU for system/kube daemons
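To see what your provider already reserves, compare each node's capacity against its allocatable resources (allocatable = capacity minus kube/system reserves and the hard eviction threshold). A minimal sketch, assuming a reachable cluster and the official `kubernetes` Python client installed:

```python
# pip install kubernetes
from kubernetes import client, config

def parse_mem_gib(quantity: str) -> float:
    """Convert a Kubernetes memory quantity (e.g. '32927756Ki', '64Gi') to GiB."""
    units = {"Ki": 1 / (1024 ** 2), "Mi": 1 / 1024, "Gi": 1.0, "Ti": 1024.0}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity) / (1024 ** 3)  # plain bytes

def report_reserves() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    for node in client.CoreV1Api().list_node().items:
        cap = parse_mem_gib(node.status.capacity["memory"])
        alloc = parse_mem_gib(node.status.allocatable["memory"])
        reserved = cap - alloc  # kube-reserved + system-reserved + eviction threshold
        print(f"{node.metadata.name}: capacity={cap:.1f} GiB, "
              f"allocatable={alloc:.1f} GiB, reserved={reserved:.1f} GiB "
              f"({reserved / cap:.0%})")

if __name__ == "__main__":
    report_reserves()
```

If the reserved share falls well outside the 6–12% rule of thumb, revisit your kubelet reservation flags or pick a different node size.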
PART 2 — Recommended CPU : Memory Ratios
Use these ratios as starting points:
| Workload Type | Recommended Ratio |
|---|---|
| Stateless microservices (Go, Node, Python) | 1 vCPU : 2–4 GiB |
| JVM microservices (Spring Boot, Micronaut) | 1 vCPU : 3–8 GiB |
| Databases (Redis, Postgres) | 1 vCPU : 4–8 GiB |
| High-throughput dataplane (Envoy, Cilium) | 1 vCPU : 1–2 GiB |
| AI Inference (CPU-heavy) | 1 vCPU : 1–3 GiB |
| AI w/ GPU | CPU not bottleneck → 1 vCPU : 4–16 GiB |
| Spark/Flink executors | 1 vCPU : 2–8 GiB, memory-bound |
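If you want to codify the table, a small helper can flag node shapes whose GiB-per-vCPU falls outside the band for a workload class. The bands below simply mirror the table; tune them with your own profiling data:

```python
# CPU:memory ratio bands (GiB of memory per vCPU), mirroring the table above.
RATIO_BANDS = {
    "stateless-microservice": (2, 4),
    "jvm-microservice": (3, 8),
    "database": (4, 8),
    "dataplane": (1, 2),
    "ai-inference-cpu": (1, 3),
    "ai-gpu": (4, 16),
    "spark-flink": (2, 8),
}

def ratio_fits(vcpus: int, mem_gib: int, workload: str) -> bool:
    """Return True if the node shape's GiB-per-vCPU falls inside the band."""
    low, high = RATIO_BANDS[workload]
    return low <= mem_gib / vcpus <= high

# Example: an 8 vCPU / 32 GiB node (4 GiB per vCPU) suits JVM microservices,
# but is memory-heavy for an Envoy/Cilium dataplane pool.
print(ratio_fits(8, 32, "jvm-microservice"))  # True
print(ratio_fits(8, 32, "dataplane"))         # False
```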
PART 3 — NUMA Topology Explained (Critical Selection Factor)
How to think about NUMA:
- Single NUMA node = predictable, consistent latency
- Multiple NUMA nodes = remote memory access, 20–80% slowdowns for AI/Redis/Envoy, and more complex scheduling
Cloud providers rarely document NUMA boundaries, but here is the approximate mapping:
AWS (EKS) NUMA
- m5 / c5 / r5 → 1 NUMA node up to 24–32 vCPUs
- m6i / c6i / r6i → 1 NUMA until ~32–48 vCPUs
- m5.24xlarge / c5.24xlarge → 2 NUMA nodes
Azure (AKS) NUMA
Azure uses “CPU groups”, but effectively:
- D-series, E-series → 1 NUMA up to ~32 vCPUs
- F-series → 1 NUMA up to ~16 vCPUs
- Lsv2 → 2+ NUMA nodes (local SSD optimized)
GCP (GKE) NUMA
- n2-standard, e2-standard → single NUMA up to 32 vCPUs
- n2-highmem/highcpu → single NUMA up to 48 vCPUs
- a2 / g2 GPU nodes → large, multi-NUMA topologies
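Since providers rarely spell this out, the most reliable option is to check on the node itself: Linux exposes the NUMA layout under `/sys/devices/system/node`. A short sketch you could run on a node (for example from a privileged debug pod with the host filesystem visible):

```python
# Inspect NUMA topology from Linux sysfs (run directly on the node).
from pathlib import Path

def numa_topology(sysfs_root: str = "/sys/devices/system/node") -> None:
    nodes = sorted(Path(sysfs_root).glob("node[0-9]*"))
    for node in nodes:
        cpulist = (node / "cpulist").read_text().strip()
        meminfo = (node / "meminfo").read_text()
        total_kb = int(meminfo.split("MemTotal:")[1].split("kB")[0])
        print(f"{node.name}: CPUs {cpulist}, {total_kb / 1024**2:.1f} GiB")
    if len(nodes) > 1:
        print("Multiple NUMA nodes: pin latency-sensitive Pods with the "
              "static CPU Manager policy and Topology Manager.")

numa_topology()
```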
PART 4 — Recommended VM Families Per Cloud
AKS (Azure)
⭐ Best General Purpose Workload Nodes:
- D4s_v5, D8s_v5, D16s_v5
Balance of:
- memory
- CPU
- no NUMA surprises
⭐ Best Compute Nodes:
- F4s_v2, F8s_v2
Best for:
- Cilium agents
- API gateways
- small services
Avoid > F16 (NUMA segmentation)
⭐ Best Memory-Optimized:
- E8ds_v5, E16ds_v5, E20
Ideal for:
- Java
- Elasticsearch
- Redis
⭐ Best for NVMe-heavy workloads:
- L8s_v3, L16s_v3
For:
- Spark
- batch
- caching
- databases with high random IO
⭐ Best CPU-optimized for AI/DPDK:
- D8as_v5, F8as_v4 (start with 8 cores to keep single NUMA)
EKS (AWS)
⭐ Best general workloads:
- m6i.large / xlarge / 2xlarge / 4xlarge
⭐ Best for high-throughput:
- c6i.xlarge / 2xlarge
⭐ Best memory-heavy:
- r6i.xlarge / 2xlarge / 4xlarge
⭐ Best AI CPU-side pre/post processing:
- c7g (Graviton3) — excellent price/performance
- m7g — best balance
⭐ Best with local SSD:
- i3.xlarge / 2xlarge (best throughput in AWS)
Avoid:
- m5.24xlarge
- c5.18xlarge (NUMA splitting → inconsistent performance)
GKE (Google Cloud)
⭐ Best general workloads:
- n2-standard-4 / 8 / 16
⭐ Best memory workloads:
- n2-highmem-4 / 8 / 16
⭐ Best CPU-heavy:
- c2-standard-4 / 8
⭐ Best local SSD:
- n2-standard-8 w/ Local SSD
Avoid:
- n1 or older instance types
- Very large machine types (> 64 vCPUs)
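If you drive node pool creation from automation (Terraform variables, cluster templates), the recommendations above can be expressed as a simple lookup. This mapping is just the lists from this part restated as data; treat the SKUs as starting points, not hard requirements:

```python
# Starting-point node SKUs per cloud and workload class, taken from the lists above.
# Note: the Azure API expects the "Standard_" prefix (e.g. Standard_D8s_v5).
NODE_SKUS = {
    "aks": {
        "general":   ["D4s_v5", "D8s_v5", "D16s_v5"],
        "compute":   ["F4s_v2", "F8s_v2"],
        "memory":    ["E8ds_v5", "E16ds_v5"],
        "local-ssd": ["L8s_v3", "L16s_v3"],
    },
    "eks": {
        "general":   ["m6i.large", "m6i.xlarge", "m6i.2xlarge", "m6i.4xlarge"],
        "compute":   ["c6i.xlarge", "c6i.2xlarge"],
        "memory":    ["r6i.xlarge", "r6i.2xlarge", "r6i.4xlarge"],
        "local-ssd": ["i3.xlarge", "i3.2xlarge"],
    },
    "gke": {
        "general":   ["n2-standard-4", "n2-standard-8", "n2-standard-16"],
        "compute":   ["c2-standard-4", "c2-standard-8"],
        "memory":    ["n2-highmem-4", "n2-highmem-8", "n2-highmem-16"],
        "local-ssd": ["n2-standard-8"],  # plus attached Local SSD
    },
}

def pick_sku(cloud: str, workload_class: str, size_index: int = 0) -> str:
    """Return a candidate node SKU for a cloud and workload class."""
    return NODE_SKUS[cloud][workload_class][size_index]

print(pick_sku("eks", "memory", 1))  # r6i.2xlarge
```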
PART 5 — Node Sizes Per Workload Type
1. Microservices (Go, Node, Python)
Best sizes:
- 4 vCPU / 16 GiB
- 8 vCPU / 32 GiB
Why:
- Good bin packing
- No NUMA pressure
- Fits 10–25 Pods safely
Avoid:
- Very small nodes (inefficient)
- Very large nodes (blast radius)
2. JVM Apps (Spring Boot, Pega, Kafka clients)
Needs:
- high memory per Pod
- JVM heap + direct buffers
Best sizes:
- 8 vCPU / 64 GiB
- 16 vCPU / 128 GiB
If each Pod needs ~4 GiB of heap (plus roughly 1 GiB of off-heap overhead):
- a 64 GiB node fits roughly 10–12 Pods
- with headroom left for system daemons (see the sketch below)
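The same arithmetic, written out so you can plug in your own reserves and per-Pod requests. The reserve fraction and eviction headroom are the rule-of-thumb numbers from Part 1, and the 5 GiB per-Pod figure assumes 4 GiB of heap plus roughly 1 GiB of off-heap overhead:

```python
def pods_per_node(node_mem_gib: float, pod_mem_gib: float,
                  reserve_fraction: float = 0.10,
                  eviction_headroom_gib: float = 1.0) -> int:
    """Rough count of Pods that fit on a node by memory alone."""
    allocatable = node_mem_gib * (1 - reserve_fraction) - eviction_headroom_gib
    return int(allocatable // pod_mem_gib)

# 64 GiB node, ~10% reserved for system/kube daemons, 1 GiB eviction headroom,
# JVM Pods needing ~5 GiB each (4 GiB heap + off-heap/metaspace/threads):
print(pods_per_node(64, 5.0))   # 11 Pods
```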
3. Redis / Memcached
Needs:
- single NUMA node
- predictable CPU
- local SSD optional
Best sizes:
- 8 vCPU / 64 GiB
- 16 vCPU / 128 GiB
Never deploy Redis on:
- multi-NUMA 32–64 core nodes (unless CPU pinned)
4. Envoy Proxy / API Gateway
Needs:
- stable CPU
- no throttling
- low jitter
Best sizes:
- 4 vCPU / 8 GiB
- 8 vCPU / 16 GiB
Run fewer Pods per node for isolation.
5. AI/ML Inference (CPU-bound)
Needs:
- NUMA alignment
- large memory for models
- predictable batching latency
Best sizes:
- 8 vCPU / 32 GiB
- 16 vCPU / 64 GiB
With CPUManager:
- Pin 4–8 CPUs exclusively to each inference worker (see the sketch below)
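Exclusive pinning with the static CPU Manager policy only applies to Guaranteed-QoS Pods whose containers request integer CPUs with requests equal to limits. A minimal sketch of such a Pod built with the `kubernetes` Python client (the image and names are placeholders, and the node pool must run the kubelet with `cpuManagerPolicy: static`):

```python
from kubernetes import client

# Guaranteed QoS: requests == limits with integer CPUs, so the static CPU Manager
# policy can grant this container exclusive cores on the node.
inference_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-worker"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="inference",
                image="registry.example.com/inference:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "8", "memory": "32Gi"},
                    limits={"cpu": "8", "memory": "32Gi"},
                ),
            )
        ],
    ),
)

# client.CoreV1Api().create_namespaced_pod(namespace="ml", body=inference_pod)
```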
6. AI/ML with GPU
CPU sizing is secondary.
Good rule:
- 4–6 vCPUs per GPU
- 16–32 GiB memory per GPU
Node example:
- A10 GPU node → 8 vCPU / 32 GiB
- A100 GPU node → 32 vCPU / 128–256 GiB
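The same rule as quick arithmetic for a whole node; the per-GPU figures are the rough ratios above, not vendor requirements:

```python
def gpu_node_shape(gpus: int, vcpus_per_gpu: int = 6, mem_gib_per_gpu: int = 32) -> dict:
    """Rough CPU/memory envelope for a GPU node, sized per attached GPU."""
    return {"gpus": gpus,
            "vcpus": gpus * vcpus_per_gpu,
            "memory_gib": gpus * mem_gib_per_gpu}

print(gpu_node_shape(1, vcpus_per_gpu=6, mem_gib_per_gpu=32))  # roughly an A10-class node
print(gpu_node_shape(8, vcpus_per_gpu=4, mem_gib_per_gpu=32))  # 8-GPU A100-class node: 32 vCPU / 256 GiB
```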
7. Databases (Postgres, MySQL, Elasticsearch)
Needs:
- huge page cache
- high memory
- stable IO
Best sizes:
- 8 vCPU / 64 GiB
- 16 vCPU / 128 GiB
With local SSD:
- Lsv2 (AKS)
- i3/i4i (EKS)
- n2-standard w/ local SSD (GKE)
Avoid:
- memory-poor compute nodes
8. Spark / Flink / Ray
Executors need:
- memory
- local SSD
- CPU bursts
Best sizes:
- 16 vCPU / 64 GiB
- 32 vCPU / 128 GiB
- with local SSD
Avoid:
- small nodes (executor fragmentation)
- massive nodes (NUMA issues)
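When sizing these nodes, also account for Spark's executor memory overhead (by default the larger of 384 MiB or 10% of `spark.executor.memory`), which sits on top of the heap you request. A rough executors-per-node estimate, reusing the Part 1 reserve rule of thumb:

```python
def executors_per_node(node_mem_gib: float, executor_mem_gib: float,
                       executor_cores: int, node_vcpus: int,
                       reserve_fraction: float = 0.10) -> int:
    """Executors that fit per node, including Spark's default memory overhead."""
    overhead = max(0.375, 0.10 * executor_mem_gib)   # max(384 MiB, 10% of executor memory)
    per_executor = executor_mem_gib + overhead
    by_memory = int(node_mem_gib * (1 - reserve_fraction) // per_executor)
    by_cpu = node_vcpus // executor_cores
    return min(by_memory, by_cpu)

# 32 vCPU / 128 GiB node, 16 GiB / 4-core executors:
print(executors_per_node(128, 16, 4, 32))   # 6 executors (memory-bound, not CPU-bound)
```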
PART 6 — When to Use Large Nodes vs Small Nodes
Use small/medium nodes (<16 vCPU) for:
- microservices
- latency-sensitive workloads
- Cilium/Envoy
- Redis
- AI inference
- clusters with high Pod churn
Benefits:
- low blast radius
- easier bin packing
- fast autoscaling
Use large nodes (32–64 vCPU) for:
- Spark executors
- Flink task managers
- ETL workloads
- AI training (multi-GPU nodes)
Avoid very large nodes (>64 vCPU) unless:
- you’re doing ML training
- pods are pinned to cores
- you fully understand NUMA management
PART 7 — Local SSD Guidance
Use nodes with local SSD when:
- Redis
- Postgres WAL/logs
- Spark shuffle
- ML preprocessing
- High local IO workloads
Avoid local SSD for:
- general microservices (no benefit)
- workloads using remote storage (EBS/EFS/Azure Disk/Premium)
PART 8 — Cost Optimization Rules
1. Use medium nodes for better bin packing
- 8 vCPU / 32 GiB is the global sweet spot
2. Avoid high-memory SKUs unless necessary
- r-series / E-series carry a cost premium
3. Prefer Graviton (AWS) or Ampere (GKE/Oracle) over x86 where compatible
- 20–40% cheaper
- better price/performance
4. GPU nodes: choose the smallest CPU SKU that meets throughput
- oversizing CPU around GPUs is the #1 cost waste in AI clusters
5. Use autoscaling with Pod Disruption Budgets
- avoids evacuation storms
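A quick way to sanity-check these rules is to compare cost per allocatable GiB (or per vCPU) across candidate SKUs. The hourly prices below are made-up placeholders; substitute your actual regional or committed-use rates:

```python
# Hypothetical hourly prices -- replace with your real rates per region/commitment.
CANDIDATES = [
    # (name, vcpus, mem_gib, hourly_usd)
    ("general-8x32",  8, 32, 0.38),
    ("highmem-8x64",  8, 64, 0.55),
    ("graviton-8x32", 8, 32, 0.31),
]

RESERVE_FRACTION = 0.10   # rule-of-thumb system/kube reserve from Part 1

for name, vcpus, mem_gib, price in CANDIDATES:
    usable_gib = mem_gib * (1 - RESERVE_FRACTION)
    print(f"{name}: ${price / usable_gib:.4f} per allocatable GiB-hour, "
          f"${price / vcpus:.4f} per vCPU-hour")
```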
SEGMENT 12 SUMMARY
You now have a cloud-agnostic, workload-driven node sizing strategy:
Core Principles
- memory > CPU
- avoid NUMA fragmentation
- prefer several medium nodes
- leave room for system daemons
Best VM Families
- Azure: D-series, E-series, F-series, Lsv2 for SSD
- AWS: m6i, c6i, r6i, c7g (Graviton), i3/i4i
- GCP: n2-standard, n2-highmem, c2-standard
Per-Workload Node Size Playbooks
- Microservices → 4–8 vCPU
- JVM → 8–16 vCPU, high-memory
- Redis → 8 vCPU single-NUMA
- AI inference → 8–16 vCPU
- AI GPU → 4–6 CPUs per GPU
- Spark → 16–32 vCPU, local SSD
Cost Optimization
- medium nodes pack best
- avoid big NUMA nodes
- Graviton/Ampere highly efficient
- GPU nodes should minimize CPU