Kubernetes Resource Isolation - 14. A catalog of **cluster design patterns**
Treat segment 14 as a catalog of cluster design patterns you can combine:
- How to slice the cluster into node pools
- How to slice workloads via namespaces, tenants, and QoS
- How to use taints/tolerations, priority classes, PDBs, and topology to control behavior
- When to make more clusters vs fewer clusters
I’ll keep each pattern fairly tight so you can remix them.
1. Node Pool Segmentation Patterns
1.1 General vs Specialized Pools
Pattern:
- `general-pool` for 80–90% of workloads
- One or more specialized pools:
  - `perf` (CPUManager, TopologyManager)
  - `gpu`
  - `batch`
  - `db` or `stateful`
Mechanics:
- Labels:
  kubectl label node node-1 node-pool=general
  kubectl label node node-2 node-pool=perf
- Taints on special pools:
  kubectl taint node node-2 perf-only=true:NoSchedule
- Workload spec:
  nodeSelector:
    node-pool: perf
  tolerations:
    - key: "perf-only"
      operator: "Exists"
      effect: "NoSchedule"
When to use: almost always. This is the baseline pattern.
1.2 Horizontal Isolation by “Noisy Class”
Separate node pools for:
- `system` (CNI, CSI, metrics, logging)
- `user-apps`
- `noisy-batch` (Spark, ETL, big cronjobs)
Idea: Keep noisy, spiky workloads from contaminating general services.
Mechanics:
- System DaemonSets:
  nodeSelector:
    node-role.kubernetes.io/system: "true"
- Batch node pool tainted:
  kubectl taint node batch-pool batch-only=true:NoSchedule
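The batch workloads then opt into the tainted pool the same way as in 1.1. A minimal sketch of the Pod template fragment, assuming the batch nodes also carry a `node-pool=noisy-batch` label (that label value is an assumption):
nodeSelector:
  node-pool: noisy-batch   # assumed label on the batch nodes
tolerations:
  - key: "batch-only"
    operator: "Exists"
    effect: "NoSchedule"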
1.3 Cost/Hardware Pools
Pools by machine type:
- `spot` or `preemptible`
- `standard`
- `high-mem`
- `ssd-local`
Use them like:
- Non-critical workers → `spot`
- Latency-critical → `standard`
- Memory-heavy → `high-mem`
- Spark/Redis → `ssd-local`
Key: every pool has labels & taints; workloads choose via `nodeSelector` / `nodeAffinity` plus tolerations.
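As a sketch of what that selection looks like for a non-critical worker on the `spot` pool (the `node-pool` label and the `spot-only` taint key are illustrative; clouds often apply their own spot/preemptible taints):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-pool
              operator: In
              values: ["spot"]
tolerations:
  - key: "spot-only"        # illustrative taint key
    operator: "Exists"
    effect: "NoSchedule"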
2. Namespace & Tenant Patterns
2.1 Namespace-per-team / namespace-per-product
Pattern:
- `team-a-dev`, `team-a-prod`
- `product-x-dev`, `product-x-prod`
Controls per namespace:
- ResourceQuota
- LimitRange
- NetworkPolicy
- RBAC
Example:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a-prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "80Gi"
    limits.cpu: "80"
    limits.memory: "160Gi"
    pods: "200"
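A companion LimitRange (mentioned above) gives containers sane defaults when teams forget to set requests/limits; this is a sketch, and the values are illustrative rather than a recommendation:
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults     # illustrative name
  namespace: team-a-prod
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container omits requests
        cpu: "100m"
        memory: "128Mi"
      default:              # applied when a container omits limits
        cpu: "500m"
        memory: "512Mi"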
When to use: Multi-team clusters, platform teams serving app teams.
2.2 Soft Multi-Tenancy vs Hard Multi-Tenancy
- Soft: Same cluster, tenants isolated via namespaces, quotas, network policies, RBAC. Most enterprises.
- Hard: Separate clusters per tenant or per BU, sometimes separate accounts/subscriptions.
Rules of thumb:
- If tenants can be semi-trusted & share infra → soft.
- If you need strong isolation, different compliance regimes, or hard security boundaries → multiple clusters.
3. Workload Admission & QoS Patterns
3.1 Enforce Requests & Limits via Policy
Use an admission policy (OPA/Gatekeeper, Kyverno, or built-in ValidatingAdmissionPolicy) to:
- Reject Pods without `resources.requests` & `resources.limits`
- Forbid BestEffort except for `debug` namespaces
- Enforce max/min resource sizes per namespace
Pattern:
- Default: require at least `requests` and `limits.memory`.
- Exception: a special `allow-bursty` namespace (see the sketch below).
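A minimal Kyverno sketch of that default-plus-exception rule, loosely based on the common require-requests-limits policy; field names vary a bit between Kyverno versions, so validate against your release:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds: ["Pod"]
      exclude:
        any:
          - resources:
              namespaces: ["allow-bursty"]   # the exception namespace from the pattern above
      validate:
        message: "CPU/memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"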
3.2 Priority Classes for SLO Layers
Define PriorityClasses like:
- `system-critical` (CNI, kube-dns)
- `platform-critical` (ingress, logging, metrics)
- `business-critical` (user-facing prod services)
- `batch` (ETL, reports)
- `best-effort` (preemptible stuff)
Example:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 900
globalDefault: false
Use in Pod spec:
priorityClassName: business-critical
Behavior:
- On resource pressure, lower-priority Pods get evicted first.
- Scheduler gives high-priority workloads first dibs on resources.
3.3 PodDisruptionBudget (PDB) + Autoscaling
Pattern:
- For every stateful or important stateless workload, define PDB:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb          # illustrative name
spec:
  minAvailable: 2
  selector:
    matchLabels: {app: my-api}
Combine with:
- HPA for scale-out
- Cluster Autoscaler / Karpenter for node scale-out
This gives:
- Safe rollouts
- Safe node drain / spot preemption
- Enough replicas for resilience
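On the autoscaling side, a minimal HPA sketch for the same app (the Deployment name and thresholds are assumptions):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api            # assumed Deployment name
  minReplicas: 3            # keeps headroom above the PDB's minAvailable: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70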
4. Topology & Failure-Domain Patterns
4.1 Spread Across Zones / Nodes
Use topology spread constraints or anti-affinity:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-api
Or simpler:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: my-api
Goal: avoid all replicas landing on the same node or in the same AZ.
4.2 Zone-aware Node Pools
Per cloud:
- Separate node pools per AZ
- Label nodes with zone
- Use `topologySpreadConstraints` to distribute workloads evenly
This prevents:
- All traffic going through a single zone
- Single-AZ outages taking the entire app down
5. Security & Network Isolation Patterns
5.1 Zero-Trust-by-default NetworkPolicy
Base policy in each namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
Then explicit “allow” policies for:
- namespace-local communication
- calls to specific backends (DBs, APIs)
- calls to observability stack
Pattern: No ingress/egress allowed by default → everything opt-in.
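A sketch of one such opt-in policy, allowing namespace-local traffic plus DNS egress to kube-system (selectors and port are illustrative and depend on your DNS setup):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-namespace-local
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector: {}           # any Pod in the same namespace
  egress:
    - to:
        - podSelector: {}           # any Pod in the same namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53                  # DNS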
5.2 Security Boundary Namespaces
For particularly sensitive apps, combine:
- Dedicated namespace
- Dedicated node pool (taints)
- Strict `NetworkPolicy`
- Stricter Pod Security admission (the PSP replacement), using the `restricted` profile
- Separate secrets store (external KMS, Vault, AKV, etc.)
This is a cluster-within-a-cluster pattern.
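A sketch of the Pod Security piece, enforcing the `restricted` standard on a sensitive namespace (the namespace name is illustrative):
apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod                               # illustrative
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted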
6. Multi-Cluster Patterns
6.1 Env-tier Clusters
One of the most common:
- `prod` cluster(s)
- `nonprod` cluster(s) (dev/uat/stage)
Sometimes:
- `prod-us`, `prod-eu` (data residency)
Pros:
- Strong blast-radius isolation
- Simple mental model: “prod is sacred”
Cons:
- More control-plane overhead
- You need a GitOps story that understands multiple clusters (ArgoCD, Flux).
6.2 Function-based Clusters
Patterns like:
- `core-platform` cluster (ingress, observability, shared platform services)
- `app-tenant` clusters for main product lines
- `data` cluster for Kafka/Spark/Cassandra
This is helpful if:
- Data-plane loads are wildly different from API-plane loads
- Observability stack is heavy and you want to isolate it
7. Putting It Together – Example Design
Here’s a concrete cluster design pattern you can adapt:
Clusters
- `corp-nonprod`
- `corp-prod`
Node Pools in each cluster
- `system` (small, stable, for CNI/CSI/monitoring)
- `general` (default microservice nodes, D/E/m6i/n2)
- `perf` (CPUManager + TopologyManager, latency/CPU-critical)
- `batch` (cheaper, spot, larger nodes)
- `db` (memory-heavy, local SSD, tainted)
Namespaces
- `platform-system` (CNI, CSI, logging, metrics, ingress)
- `platform-observability` (Prometheus, Loki, Tempo, etc.)
- `team-a-dev`, `team-a-prod`
- `team-b-dev`, `team-b-prod`
- `shared-services` (auth, messaging, etc.)
Controls
- ResourceQuota + LimitRange per team namespace
- NetworkPolicy default-deny per namespace
- PriorityClasses:
  - `system-critical`
  - `platform-critical`
  - `business-critical`
  - `batch-low`
Scheduling hints
- Platform & observability → `system` & `general` pools
- Latency-critical apps → `perf` pool (Guaranteed, pinned CPUs)
- Spark jobs → `batch` pool (spot, large nodes, local SSD)
- Redis/DB → `db` pool (memory-heavy, local SSD)
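To make the latency-critical case concrete, here is a sketch of a Pod spec that combines the hints above: the `perf` pool, Guaranteed QoS (requests equal to limits, integer CPU for pinning), and the `business-critical` PriorityClass. The container name, image, and sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: latency-critical-api        # illustrative
spec:
  priorityClassName: business-critical
  nodeSelector:
    node-pool: perf
  tolerations:
    - key: "perf-only"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: api
      image: registry.example.com/api:1.0.0   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "2"
          memory: "4Gi"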
8. Quick design checklist
When you design or refactor a cluster, ask:
- Do I have at least two node pools? (general + something else)
- Are system components isolated or competing with apps?
- Do teams have clear namespace boundaries, quotas, and limits?
- Are BestEffort workloads controlled or confined?
- Do I have PriorityClasses & PDBs for production services?
- Are workloads spread across zones and nodes?
- Do sensitive workloads have network & node isolation?
- Do I need multiple clusters for prod vs nonprod or for legal isolation?
If the answer to most of these is “yes”, you’re in serious platform-engineering territory already.