<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://maungsan.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://maungsan.github.io/" rel="alternate" type="text/html" /><updated>2026-03-09T09:38:01-07:00</updated><id>https://maungsan.github.io/feed.xml</id><title type="html">Sharing IT Knowledge and Experience</title><subtitle>Sharing Knowledge.</subtitle><author><name>Maung San</name><email>msan001@live.com</email></author><entry><title type="html">Terraform-Generated Infrastructure Diagrams with draw.io</title><link href="https://maungsan.github.io/terraform/platform%20engineering/2026/03/09/terraform-generated-drawio-diagrams/" rel="alternate" type="text/html" title="Terraform-Generated Infrastructure Diagrams with draw.io" /><published>2026-03-09T00:00:00-07:00</published><updated>2026-03-09T00:00:00-07:00</updated><id>https://maungsan.github.io/terraform/platform%20engineering/2026/03/09/terraform-generated-drawio-diagrams</id><content type="html" xml:base="https://maungsan.github.io/terraform/platform%20engineering/2026/03/09/terraform-generated-drawio-diagrams/"><![CDATA[<h1 id="terraform-generated-infrastructure-diagrams-with-drawio">Terraform-Generated Infrastructure Diagrams with draw.io</h1>

<p>One recurring problem in infrastructure documentation is <strong>diagram drift</strong>.</p>

<p>The architecture diagram in documentation says one thing.
Terraform state says another.
The deployed environment says something else entirely.</p>

<p>Over time diagrams stop being trusted.</p>

<p>Recently I started experimenting with an idea to reduce this drift:</p>

<p><strong>Generate architecture diagrams automatically from Terraform module outputs.</strong></p>

<p>This is still a <strong>work in progress</strong>, but the early prototype is promising and worth sharing.</p>

<hr />

<h1 id="the-core-idea">The Core Idea</h1>

<p>Terraform modules already know what infrastructure they create.</p>

<p>So instead of manually drawing architecture diagrams, we can generate them directly from Terraform outputs.</p>

<p>The flow looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Terraform Apply
      │
      ▼
terraform output -json
      │
      ▼
Renderer Script
      │
      ▼
Generated draw.io diagrams
</code></pre></div></div>

<p>Result:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>generated/
 ├─ overview.drawio
 └─ modules/
      ├─ network.drawio
      └─ aks.drawio
</code></pre></div></div>

<p>Instead of manually updating diagrams, they become <strong>derived artifacts of infrastructure code</strong>.</p>

<hr />

<h1 id="design-goals">Design Goals</h1>

<p>The approach is built around a few principles.</p>

<h2 id="1-eliminate-diagram-drift">1. Eliminate Diagram Drift</h2>

<p>The diagram reflects <strong>actual deployed infrastructure</strong>, because the values come directly from Terraform outputs.</p>

<p>No manual updates required.</p>

<hr />

<h2 id="2-module-ownership">2. Module Ownership</h2>

<p>Each Terraform module owns its own diagram template.</p>

<p>Example module structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>modules/
  network/
    main.tf
    outputs.tf
    diagram/
      overview-layer.xml.tpl
      standalone.xml.tpl
</code></pre></div></div>

<p>This keeps documentation <strong>close to the infrastructure definition</strong>.</p>

<hr />

<h2 id="3-composable-architecture-views">3. Composable Architecture Views</h2>

<p>Each module produces two diagram artifacts:</p>

<ol>
  <li><strong>Standalone module diagram</strong></li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>network.drawio
aks.drawio
</code></pre></div></div>

<ol start="2">
  <li><strong>Layer for a shared platform overview</strong></li>
</ol>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>overview.drawio
</code></pre></div></div>

<p>The overview diagram composes all modules together into a system view.</p>

<hr />

<h2 id="4-reusable-modules">4. Reusable Modules</h2>

<p>Because diagram templates live inside modules, they travel with the module.</p>

<p>If another team reuses a module, they automatically inherit its documentation.</p>

<hr />

<h1 id="terraform-module-diagram-contract">Terraform Module Diagram Contract</h1>

<p>Each module exports structured metadata describing what should appear in the diagram.</p>

<p>Example:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">output</span> <span class="s2">"diagram"</span> <span class="p">{</span>
  <span class="nx">value</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">module_name</span> <span class="p">=</span> <span class="s2">"network"</span>
    <span class="nx">title</span>       <span class="p">=</span> <span class="s2">"Network"</span>

    <span class="nx">layout</span> <span class="p">=</span> <span class="p">{</span>
      <span class="nx">x</span> <span class="p">=</span> <span class="mi">80</span>
      <span class="nx">y</span> <span class="p">=</span> <span class="mi">80</span>
    <span class="p">}</span>

    <span class="nx">resources</span> <span class="p">=</span> <span class="p">{</span>
      <span class="nx">vnet_name</span>   <span class="p">=</span> <span class="nx">azurerm_virtual_network</span><span class="err">.</span><span class="nx">this</span><span class="err">.</span><span class="nx">name</span>
      <span class="nx">subnet_app</span>  <span class="p">=</span> <span class="nx">azurerm_subnet</span><span class="err">.</span><span class="nx">app</span><span class="err">.</span><span class="nx">name</span>
      <span class="nx">subnet_data</span> <span class="p">=</span> <span class="nx">azurerm_subnet</span><span class="err">.</span><span class="nx">data</span><span class="err">.</span><span class="nx">name</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This becomes the <strong>contract between Terraform and the diagram renderer</strong>.</p>

<p>The renderer does not inspect Terraform state directly.
It only consumes these outputs.</p>
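<p>As a sketch of the consumer side of that contract, here is how a renderer might read the <code class="language-plaintext highlighter-rouge">diagram</code> output from <code class="language-plaintext highlighter-rouge">terraform output -json</code>. The JSON sample is illustrative; <code class="language-plaintext highlighter-rouge">terraform output -json</code> wraps each output in a <code class="language-plaintext highlighter-rouge">value</code>/<code class="language-plaintext highlighter-rouge">type</code> envelope:</p>

```python
import json

# Illustrative output of `terraform output -json` for the network module.
sample = """
{
  "diagram": {
    "value": {
      "module_name": "network",
      "title": "Network",
      "layout": {"x": 80, "y": 80},
      "resources": {"vnet_name": "vnet-prod"}
    },
    "type": ["object", {}]
  }
}
"""

outputs = json.loads(sample)
contract = outputs["diagram"]["value"]  # the renderer sees only this structure
print(contract["module_name"], contract["layout"]["x"])
```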

<hr />

<h1 id="template-based-diagrams">Template-Based Diagrams</h1>

<p>Each module contains a draw.io template fragment.</p>

<p>Example file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>modules/network/diagram/overview-layer.xml.tpl
</code></pre></div></div>

<p>Example template snippet:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;mxCell</span> <span class="na">id=</span><span class="s">"_vnet"</span>
        <span class="na">value=</span><span class="s">"VNet: "</span>
        <span class="na">style=</span><span class="s">"rounded=1;whiteSpace=wrap;html=1;"</span>
        <span class="na">vertex=</span><span class="s">"1"</span>
        <span class="na">parent=</span><span class="s">"_group"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;mxGeometry</span> <span class="na">x=</span><span class="s">"20"</span> <span class="na">y=</span><span class="s">"50"</span> <span class="na">width=</span><span class="s">"260"</span> <span class="na">height=</span><span class="s">"60"</span> <span class="na">as=</span><span class="s">"geometry"</span><span class="nt">/&gt;</span>
<span class="nt">&lt;/mxCell&gt;</span>
</code></pre></div></div>

<p>Tokens in the template, shown here as an illustrative <code class="language-plaintext highlighter-rouge">${vnet_name}</code> placeholder:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>${vnet_name}
</code></pre></div></div>

<p>are replaced with Terraform output values during rendering (the exact token syntax depends on the renderer).</p>

<hr />

<h1 id="rendering-the-diagrams">Rendering the Diagrams</h1>

<p>After Terraform runs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terraform output -json &gt; outputs.json
python render_drawio.py outputs.json
</code></pre></div></div>

<p>The renderer performs three simple steps:</p>

<ol>
  <li>Load Terraform outputs</li>
  <li>Replace tokens inside templates</li>
  <li>Assemble the final draw.io document</li>
</ol>
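<p>A minimal Python sketch of those three steps, assuming <code class="language-plaintext highlighter-rouge">${key}</code>-style tokens (the actual <code class="language-plaintext highlighter-rouge">render_drawio.py</code> may use a different token syntax and document skeleton):</p>

```python
import re

def render_template(template: str, values: dict) -> str:
    """Step 2: replace ${token} placeholders with Terraform output values.
    Unknown tokens are left untouched."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: str(values.get(m.group(1), m.group(0))),
                  template)

def assemble_drawio(fragments: list) -> str:
    """Step 3: wrap rendered fragments in a minimal draw.io document."""
    body = "\n".join(fragments)
    return ('<mxfile><diagram name="overview"><mxGraphModel><root>'
            '<mxCell id="0"/><mxCell id="1" parent="0"/>'
            f"{body}</root></mxGraphModel></diagram></mxfile>")

# Step 1: load Terraform outputs (values here are illustrative).
values = {"vnet_name": "vnet-prod"}
fragment = render_template('<mxCell id="_vnet" value="VNet: ${vnet_name}"/>', values)
doc = assemble_drawio([fragment])
```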

<hr />

<h1 id="resulting-architecture-view">Resulting Architecture View</h1>

<p>The generated diagram groups modules visually.</p>

<p>Example conceptual layout:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------------------+      +---------------------+
| Network            |      | AKS                 |
|                    |      |                     |
| VNet               |─────▶| Kubernetes Cluster  |
| Subnets            |      | Node Resource Group |
+--------------------+      +---------------------+
</code></pre></div></div>

<p>Each module becomes a self-contained block that can be reused or moved independently.</p>

<hr />

<h1 id="why-drawio">Why draw.io?</h1>

<p>draw.io (diagrams.net) works well for generated diagrams because:</p>

<ul>
  <li>widely used</li>
  <li>simple XML format</li>
  <li>easy to generate programmatically</li>
  <li>still editable visually when needed</li>
</ul>

<p>This makes it a good compromise between <strong>automation and flexibility</strong>.</p>

<hr />

<h1 id="current-limitations">Current Limitations</h1>

<p>This approach is still experimental and intentionally simple.</p>

<h2 id="xml-templates-are-verbose">XML Templates Are Verbose</h2>

<p>draw.io XML is not particularly pleasant to maintain.</p>

<p>A future version may replace XML templates with a simpler YAML-based specification.</p>

<hr />

<h2 id="layout-is-static">Layout Is Static</h2>

<p>Modules currently define fixed coordinates:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>layout = {
  x = 80
  y = 80
}
</code></pre></div></div>

<p>Eventually layout could be automated.</p>
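<p>One possible direction is computing coordinates from the module list instead of hard-coding them. A naive grid sketch (the function name and cell sizes are made up for illustration):</p>

```python
def grid_layout(module_names, cols=3, cell_w=320, cell_h=220, margin=80):
    """Assign x/y grid coordinates to modules in alphabetical order."""
    layout = {}
    for i, name in enumerate(sorted(module_names)):
        row, col = divmod(i, cols)
        layout[name] = {"x": margin + col * cell_w, "y": margin + row * cell_h}
    return layout

print(grid_layout(["network", "aks"]))
# {'aks': {'x': 80, 'y': 80}, 'network': {'x': 400, 'y': 80}}
```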

<hr />

<h2 id="cross-module-relationships">Cross-Module Relationships</h2>

<p>Modules do not create edges to other modules directly.</p>

<p>Relationships are currently defined separately to avoid tight coupling between modules.</p>

<hr />

<h1 id="future-directions">Future Directions</h1>

<p>If the approach proves useful, several improvements are possible.</p>

<h3 id="shared-style-library">Shared Style Library</h3>

<p>Standard icons and styles for common infrastructure components.</p>

<hr />

<h3 id="yaml-diagram-specification">YAML Diagram Specification</h3>

<p>Instead of raw XML templates:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">nodes</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">vnet</span>
    <span class="na">label</span><span class="pi">:</span> <span class="s2">"</span><span class="s">VNet:</span><span class="nv"> </span><span class="s">"</span>
    <span class="na">x</span><span class="pi">:</span> <span class="m">20</span>
    <span class="na">y</span><span class="pi">:</span> <span class="m">50</span>
</code></pre></div></div>

<hr />

<h3 id="automatic-layout">Automatic Layout</h3>

<p>Modules could declare relationships rather than absolute coordinates.</p>

<hr />

<h3 id="cicd-integration">CI/CD Integration</h3>

<p>Pipelines could automatically publish diagrams after infrastructure deployment.</p>

<hr />

<h1 id="final-thoughts">Final Thoughts</h1>

<p>Infrastructure has increasingly become <strong>fully code-driven</strong>:</p>

<ul>
  <li>infrastructure</li>
  <li>networking</li>
  <li>security policies</li>
  <li>deployment pipelines</li>
</ul>

<p>But diagrams are often still maintained manually.</p>

<p>Generating diagrams directly from Terraform modules could help keep architecture documentation <strong>closer to reality</strong>.</p>

<p>This is still an early prototype, but the idea of <strong>diagrams-as-code owned by Terraform modules</strong> looks promising.</p>

<p>More updates as the experiment evolves.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Terraform" /><category term="Platform Engineering" /><category term="Terraform" /><category term="Draw.io" /><category term="Infrastructure Documentation" /><category term="Diagrams as Code" /><category term="Platform Engineering" /><summary type="html"><![CDATA[Terraform-Generated Infrastructure Diagrams with draw.io]]></summary></entry><entry><title type="html">Beyond Terraform — Using Terragrunt to Manage Infrastructure at Scale</title><link href="https://maungsan.github.io/terraform/infrastructure%20as%20code/2026/03/08/terragrunt-deep-dive/" rel="alternate" type="text/html" title="Beyond Terraform — Using Terragrunt to Manage Infrastructure at Scale" /><published>2026-03-08T00:00:00-08:00</published><updated>2026-03-08T00:00:00-08:00</updated><id>https://maungsan.github.io/terraform/infrastructure%20as%20code/2026/03/08/terragrunt-deep-dive</id><content type="html" xml:base="https://maungsan.github.io/terraform/infrastructure%20as%20code/2026/03/08/terragrunt-deep-dive/"><![CDATA[<h1 id="beyond-terraform--using-terragrunt-to-manage-infrastructure-at-scale">Beyond Terraform — Using Terragrunt to Manage Infrastructure at Scale</h1>

<p>Terraform is excellent at provisioning infrastructure, but large
Terraform codebases tend to accumulate problems:</p>

<ul>
  <li>Copy‑pasted backend configuration</li>
  <li>Repeated provider configuration</li>
  <li>Environment drift across dev/stage/prod</li>
  <li>Dependency ordering between stacks</li>
  <li>Painful module orchestration</li>
</ul>

<p><strong>Terragrunt</strong> solves these operational problems without replacing
Terraform. It wraps Terraform and adds features for <strong>composition, DRY
configuration, and orchestration of infrastructure stacks.</strong></p>

<p>This guide explains <strong>how experienced engineers actually structure
Terragrunt in production environments.</strong></p>

<hr />

<h1 id="what-terragrunt-really-does">What Terragrunt Really Does</h1>

<p>Terragrunt is a thin wrapper around Terraform that adds operational
capabilities Terraform intentionally avoids.</p>

<hr />
<table>
  <thead>
    <tr>
      <th>Capability</th>
      <th>Why It Matters</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DRY Terraform configuration</td>
      <td>Avoid copy‑pasting backend and providers</td>
    </tr>
    <tr>
      <td>Environment hierarchy</td>
      <td>Clean dev/stage/prod structure</td>
    </tr>
    <tr>
      <td>Dependency management</td>
      <td>Apply stacks in correct order</td>
    </tr>
    <tr>
      <td>Remote state automation</td>
      <td>Automatically configure state</td>
    </tr>
    <tr>
      <td>Module orchestration</td>
      <td>Run <code class="language-plaintext highlighter-rouge">apply</code> across multiple modules</td>
    </tr>
  </tbody>
</table>

<p>Terraform itself remains responsible for infrastructure provisioning.
Terragrunt focuses on <strong>codebase management and orchestration.</strong></p>

<hr />

<h1 id="typical-terraform-scaling-problem">Typical Terraform Scaling Problem</h1>

<p>A real infrastructure repository often starts simple and gradually
becomes messy.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terraform/
 ├── dev
 │   ├── vpc
 │   ├── eks
 │   └── rds
 ├── stage
 │   ├── vpc
 │   ├── eks
 │   └── rds
 └── prod
     ├── vpc
     ├── eks
     └── rds
</code></pre></div></div>

<p>Each directory typically contains:</p>

<ul>
  <li>backend configuration</li>
  <li>provider configuration</li>
  <li>repeated variables</li>
  <li>identical module references</li>
</ul>

<p>As the infrastructure grows, <strong>copy‑paste drift becomes inevitable.</strong></p>

<hr />

<h1 id="terragrunt-repository-layout">Terragrunt Repository Layout</h1>

<p>Terragrunt separates <strong>infrastructure modules</strong> from <strong>environment
configuration</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>infrastructure-live/
 ├── dev
 │   ├── vpc
 │   │   └── terragrunt.hcl
 │   ├── eks
 │   │   └── terragrunt.hcl
 │   └── rds
 │       └── terragrunt.hcl
 ├── stage
 └── prod
</code></pre></div></div>

<p>Terraform modules live in a separate repository:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>infrastructure-modules/
 ├── vpc
 ├── eks
 └── rds
</code></pre></div></div>

<p>Terragrunt orchestrates module usage across environments.</p>

<hr />

<h1 id="installing-terragrunt">Installing Terragrunt</h1>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew install terragrunt
</code></pre></div></div>

<p>Or download the binary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://terragrunt.gruntwork.io/docs/getting-started/install/
</code></pre></div></div>

<p>Verify installation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terragrunt --version
</code></pre></div></div>

<hr />

<h1 id="minimal-terragrunt-configuration">Minimal Terragrunt Configuration</h1>

<p>Example <code class="language-plaintext highlighter-rouge">terragrunt.hcl</code>:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">terraform</span> <span class="p">{</span>
  <span class="nx">source</span> <span class="p">=</span> <span class="s2">"git::ssh://git@github.com/company/infrastructure-modules.git//vpc"</span>
<span class="p">}</span>

<span class="nx">inputs</span> <span class="err">=</span> <span class="p">{</span>
  <span class="nx">cidr_block</span> <span class="p">=</span> <span class="s2">"10.0.0.0/16"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Execution:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terragrunt apply
</code></pre></div></div>

<p>Terragrunt downloads the module and executes Terraform internally.</p>

<hr />

<h1 id="the-most-important-feature-include">The Most Important Feature: include</h1>

<p>Large infrastructures require shared configuration. Terragrunt solves
this with <strong>inheritance via <code class="language-plaintext highlighter-rouge">include</code>.</strong></p>

<p>Root configuration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>infrastructure-live/terragrunt.hcl
</code></pre></div></div>

<p>Example:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">remote_state</span> <span class="p">{</span>
  <span class="nx">backend</span> <span class="p">=</span> <span class="s2">"s3"</span>

  <span class="nx">config</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">bucket</span>         <span class="p">=</span> <span class="s2">"company-terraform-state"</span>
    <span class="nx">key</span>            <span class="p">=</span> <span class="s2">"${path_relative_to_include()}/terraform.tfstate"</span>
    <span class="nx">region</span>         <span class="p">=</span> <span class="s2">"us-east-1"</span>
    <span class="nx">encrypt</span>        <span class="p">=</span> <span class="kc">true</span>
    <span class="nx">dynamodb_table</span> <span class="p">=</span> <span class="s2">"terraform-locks"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Child module:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dev/vpc/terragrunt.hcl
</code></pre></div></div>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">include</span> <span class="p">{</span>
  <span class="nx">path</span> <span class="p">=</span> <span class="nx">find_in_parent_folders</span><span class="err">()</span>
<span class="p">}</span>

<span class="nx">terraform</span> <span class="p">{</span>
  <span class="nx">source</span> <span class="p">=</span> <span class="s2">"git::ssh://git@github.com/company/infrastructure-modules.git//vpc"</span>
<span class="p">}</span>

<span class="nx">inputs</span> <span class="err">=</span> <span class="p">{</span>
  <span class="nx">cidr_block</span> <span class="p">=</span> <span class="s2">"10.0.0.0/16"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>All modules inherit shared configuration automatically.</p>

<hr />

<h1 id="remote-state-without-duplication">Remote State Without Duplication</h1>

<p>Terraform normally requires each module to define its backend.</p>

<p>Terragrunt can generate it automatically.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">remote_state</span> <span class="p">{</span>
  <span class="nx">backend</span> <span class="p">=</span> <span class="s2">"s3"</span>

  <span class="nx">generate</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">path</span>      <span class="p">=</span> <span class="s2">"backend.tf"</span>
    <span class="nx">if_exists</span> <span class="p">=</span> <span class="s2">"overwrite"</span>
  <span class="p">}</span>

  <span class="nx">config</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">bucket</span> <span class="p">=</span> <span class="s2">"company-terraform-state"</span>
    <span class="nx">key</span>    <span class="p">=</span> <span class="s2">"${path_relative_to_include()}/terraform.tfstate"</span>
    <span class="nx">region</span> <span class="p">=</span> <span class="s2">"us-east-1"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When Terragrunt runs, <code class="language-plaintext highlighter-rouge">backend.tf</code> is generated dynamically.</p>
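<p>With the configuration above, the generated <code class="language-plaintext highlighter-rouge">backend.tf</code> for the <code class="language-plaintext highlighter-rouge">dev/vpc</code> module would look roughly like this (the <code class="language-plaintext highlighter-rouge">key</code> comes from <code class="language-plaintext highlighter-rouge">path_relative_to_include()</code>):</p>

```hcl
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    key    = "dev/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}
```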

<hr />

<h1 id="environment-configuration">Environment Configuration</h1>

<p>Terragrunt supports environment‑level variables.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>live/
 ├── terragrunt.hcl
 ├── dev
 │   └── env.hcl
 ├── stage
 │   └── env.hcl
 └── prod
     └── env.hcl
</code></pre></div></div>

<p>Example <code class="language-plaintext highlighter-rouge">env.hcl</code>:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">locals</span> <span class="p">{</span>
  <span class="nx">env</span> <span class="p">=</span> <span class="s2">"dev"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Load it in a module:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">locals</span> <span class="p">{</span>
  <span class="nx">env_config</span> <span class="p">=</span> <span class="nx">read_terragrunt_config</span><span class="err">(</span><span class="nx">find_in_parent_folders</span><span class="err">(</span><span class="s2">"env.hcl"</span><span class="err">))</span>
<span class="p">}</span>

<span class="nx">inputs</span> <span class="err">=</span> <span class="p">{</span>
  <span class="nx">environment</span> <span class="p">=</span> <span class="nx">local</span><span class="err">.</span><span class="nx">env_config</span><span class="err">.</span><span class="nx">locals</span><span class="err">.</span><span class="nx">env</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This allows environment‑aware configuration without repeating variables.</p>

<hr />

<h1 id="managing-dependencies">Managing Dependencies</h1>

<p>Infrastructure stacks often depend on one another.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VPC -&gt; EKS -&gt; Application
</code></pre></div></div>

<p>Terragrunt provides a <code class="language-plaintext highlighter-rouge">dependency</code> block.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">dependency</span> <span class="s2">"vpc"</span> <span class="p">{</span>
  <span class="nx">config_path</span> <span class="p">=</span> <span class="s2">"../vpc"</span>
<span class="p">}</span>

<span class="nx">inputs</span> <span class="err">=</span> <span class="p">{</span>
  <span class="nx">vpc_id</span> <span class="p">=</span> <span class="nx">dependency</span><span class="err">.</span><span class="nx">vpc</span><span class="err">.</span><span class="nx">outputs</span><span class="err">.</span><span class="nx">vpc_id</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Terragrunt reads outputs from Terraform state automatically.</p>

<hr />

<h1 id="running-multiple-modules">Running Multiple Modules</h1>

<p>Instead of applying each module individually:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd vpc
terraform apply

cd eks
terraform apply
</code></pre></div></div>

<p>Use Terragrunt orchestration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terragrunt run-all apply
</code></pre></div></div>

<p>Terragrunt:</p>

<ol>
  <li>Builds a dependency graph</li>
  <li>Applies modules in correct order</li>
  <li>Parallelizes independent stacks</li>
</ol>
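<p>Conceptually, <code class="language-plaintext highlighter-rouge">run-all</code> performs a topological sort over the <code class="language-plaintext highlighter-rouge">dependency</code> blocks. A rough Python sketch of that ordering logic (illustrative only, not Terragrunt's actual implementation):</p>

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stack maps to the stacks it depends on (from `dependency` blocks).
deps = {
    "vpc": set(),
    "eks": {"vpc"},
    "rds": {"vpc"},
    "app": {"eks", "rds"},
}

ts = TopologicalSorter(deps)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # stacks whose dependencies are satisfied
    waves.append(ready)             # stacks in one wave can run in parallel
    ts.done(*ready)

print(waves)  # [['vpc'], ['eks', 'rds'], ['app']]
```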

<hr />

<h1 id="module-version-pinning">Module Version Pinning</h1>

<p>Modules are typically referenced from Git.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">terraform</span> <span class="p">{</span>
  <span class="nx">source</span> <span class="p">=</span> <span class="s2">"git::ssh://git@github.com/company/infrastructure-modules.git//eks?ref=v1.4.0"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Benefits:</p>

<ul>
  <li>controlled upgrades</li>
  <li>reproducible infrastructure</li>
  <li>easy rollback</li>
</ul>

<hr />

<h1 id="provider-generation">Provider Generation</h1>

<p>Provider configuration can also be generated.</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">generate</span> <span class="s2">"provider"</span> <span class="p">{</span>
  <span class="nx">path</span>      <span class="p">=</span> <span class="s2">"provider.tf"</span>
  <span class="nx">if_exists</span> <span class="p">=</span> <span class="s2">"overwrite"</span>

  <span class="nx">contents</span> <span class="p">=</span> <span class="o">&lt;&lt;</span><span class="no">EOF</span><span class="sh">
provider "aws" {
  region = "us-east-1"
}
</span><span class="no">EOF
</span><span class="p">}</span>
</code></pre></div></div>

<p>This prevents provider duplication across modules.</p>

<hr />

<h1 id="terragrunt-repository-design-production-layout">Terragrunt Repository Design (Production Layout)</h1>

<p>Most teams eventually converge on a layered layout.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>repo-root
 ├── infrastructure-modules
 │    ├── vpc
 │    ├── eks
 │    └── rds
 │
 └── infrastructure-live
      ├── terragrunt.hcl
      │
      ├── dev
      │   ├── env.hcl
      │   ├── vpc
      │   ├── eks
      │   └── rds
      │
      ├── stage
      │   └── ...
      │
      └── prod
          └── ...
</code></pre></div></div>

<p>Responsibilities:</p>

<p>Modules repository:</p>

<ul>
  <li>reusable Terraform modules</li>
  <li>versioned releases</li>
  <li>no environment values</li>
</ul>

<p>Live repository:</p>

<ul>
  <li>environment configuration</li>
  <li>Terragrunt orchestration</li>
  <li>module version pinning</li>
</ul>

<hr />

<h1 id="multiaccount--multiregion-pattern">Multi‑Account / Multi‑Region Pattern</h1>

<p>Terragrunt scales well to multi‑account architectures.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>live/
 ├── prod
 │   ├── account.hcl
 │   ├── us-east-1
 │   │   ├── vpc
 │   │   └── eks
 │   └── us-west-2
 │       └── vpc
</code></pre></div></div>

<p>Example <code class="language-plaintext highlighter-rouge">account.hcl</code>:</p>

<div class="language-hcl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">locals</span> <span class="p">{</span>
  <span class="nx">account_id</span> <span class="p">=</span> <span class="s2">"123456789012"</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Provider configuration can dynamically assume roles based on this
configuration.</p>

<hr />

<h1 id="cicd-integration">CI/CD Integration</h1>

<p>Terragrunt works well inside pipelines.</p>

<p>Example CI stage:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>terragrunt run-all plan --terragrunt-non-interactive
</code></pre></div></div>

<p>Typical automation tools:</p>

<ul>
  <li>Atlantis</li>
  <li>Spacelift</li>
  <li>GitHub Actions</li>
  <li>GitLab CI</li>
  <li>Azure DevOps</li>
</ul>

<hr />

<h1 id="when-terragrunt-is-worth-using">When Terragrunt Is Worth Using</h1>

<p>Terragrunt becomes valuable when:</p>

<ul>
  <li>Terraform repositories exceed ~10 modules</li>
  <li>multiple environments exist</li>
  <li>backend duplication becomes painful</li>
  <li>stack dependencies grow complex</li>
</ul>

<p>For very small infrastructures, plain Terraform is simpler.</p>

<hr />

<h1 id="key-takeaways">Key Takeaways</h1>

<p>Terragrunt does not replace Terraform.</p>

<p>It solves the <strong>operational scaling problems Terraform intentionally
avoids</strong>.</p>

<p>The most valuable capabilities are:</p>

<ol>
  <li>DRY configuration</li>
  <li>dependency orchestration</li>
  <li>environment hierarchy</li>
  <li>remote state automation</li>
  <li>multi‑module execution</li>
</ol>

<p>These features allow Terraform infrastructures to scale <strong>without
turning into copy‑paste chaos.</strong></p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Terraform" /><category term="Infrastructure as Code" /><category term="Terragrunt" /><category term="Terraform" /><category term="Platform Engineering" /><category term="DevOps" /><category term="Infrastructure Architecture" /><summary type="html"><![CDATA[Beyond Terraform — Using Terragrunt to Manage Infrastructure at Scale]]></summary></entry><entry><title type="html">Infrastructure in 60 Seconds — How to Read a Helm Chart</title><link href="https://maungsan.github.io/kubernetes/2026/03/07/k8s-read-helm-chart-60-seconds/" rel="alternate" type="text/html" title="Infrastructure in 60 Seconds — How to Read a Helm Chart" /><published>2026-03-07T00:00:00-08:00</published><updated>2026-03-07T00:00:00-08:00</updated><id>https://maungsan.github.io/kubernetes/2026/03/07/k8s-read-helm-chart-60-seconds</id><content type="html" xml:base="https://maungsan.github.io/kubernetes/2026/03/07/k8s-read-helm-chart-60-seconds/"><![CDATA[<h2 id="infrastructure-in-60-seconds--how-to-read-a-helm-chart">Infrastructure in 60 Seconds — How to Read a Helm Chart</h2>

<p>When engineers open a Helm chart for the first time, the immediate reaction is often confusion. The repository contains multiple YAML templates, a large <code class="language-plaintext highlighter-rouge">values.yaml</code>, helper files, and sometimes nested subcharts. Reading every template line-by-line is inefficient.</p>

<p>Instead, experienced engineers reconstruct the <strong>deployment model</strong> by scanning a few key signals in a specific order.</p>

<p>The goal is to quickly answer:</p>

<ul>
  <li>What workloads does this chart deploy?</li>
  <li>What parts of the deployment are configurable?</li>
  <li>What dependencies does it include?</li>
  <li>What infrastructure assumptions does it make?</li>
</ul>

<p>Once those answers are clear, the rest of the chart becomes much easier to reason about.</p>

<hr />

<h2 id="step-1--start-with-chartyaml">Step 1 — Start With <code class="language-plaintext highlighter-rouge">Chart.yaml</code></h2>

<p>This file defines the <strong>identity and dependencies</strong> of the chart.</p>

<p>Look for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name
version
appVersion
dependencies
</code></pre></div></div>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>name: payments-api
version: 0.3.2
appVersion: 1.12.0
</code></pre></div></div>

<p>Signals to extract quickly:</p>

<ul>
  <li>Is this an <strong>application chart</strong> or <strong>platform component</strong>?</li>
  <li>Does the chart depend on other charts (databases, ingress controllers, monitoring)?</li>
  <li>Does the chart bundle infrastructure components or assume they already exist?</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">dependencies</code> section is particularly important. It reveals whether the chart deploys additional systems like Redis, PostgreSQL, or Prometheus.</p>

<p>This tells you <strong>how self‑contained the deployment really is</strong>.</p>
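
<p>As an illustration, a <code class="language-plaintext highlighter-rouge">dependencies</code> block might look like this (the name, version constraint, and <code class="language-plaintext highlighter-rouge">redis.enabled</code> condition are hypothetical examples, not from any specific chart):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dependencies:
  - name: redis
    version: 18.x.x
    repository: https://charts.bitnami.com/bitnami
    condition: redis.enabled
</code></pre></div></div>

<p>A <code class="language-plaintext highlighter-rouge">condition</code> field like this signals that the dependency is optional and toggled from <code class="language-plaintext highlighter-rouge">values.yaml</code>.</p>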

<hr />

<h2 id="step-2--scan-valuesyaml-not-the-templates">Step 2 — Scan <code class="language-plaintext highlighter-rouge">values.yaml</code> (Not the Templates)</h2>

<p>Most Helm charts are driven almost entirely by <code class="language-plaintext highlighter-rouge">values.yaml</code>.</p>

<p>The templates simply interpolate those values.</p>

<p>Engineers should scan <code class="language-plaintext highlighter-rouge">values.yaml</code> first because it reveals:</p>

<ul>
  <li>configurable components</li>
  <li>optional features</li>
  <li>scaling behavior</li>
  <li>external integrations</li>
</ul>

<p>Look for sections like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>image
resources
replicaCount
service
ingress
env
</code></pre></div></div>

<p>These usually map directly to Kubernetes constructs.</p>
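
<p>A minimal sketch of what such a values file often looks like (all names and numbers here are hypothetical):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>replicaCount: 2

image:
  repository: myregistry.example.com/payments-api
  tag: 1.12.0

service:
  type: ClusterIP
  port: 8080

ingress:
  enabled: false

resources:
  requests:
    cpu: 100m
    memory: 128Mi
</code></pre></div></div>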

<p>Example mental model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>values.yaml
      ↓
templates/*.yaml
      ↓
Rendered Kubernetes manifests
</code></pre></div></div>

<p>If you understand the values structure, you already understand <strong>how the deployment behaves</strong>.</p>
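
<p>To see the mapping concretely, you can render the chart locally with <code class="language-plaintext highlighter-rouge">helm template</code> and inspect the generated manifests (the release name and chart path are placeholders):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm template my-release ./payments-api --set replicaCount=3 | less
</code></pre></div></div>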

<hr />

<h2 id="step-3--identify-the-workload-type">Step 3 — Identify the Workload Type</h2>

<p>Next locate the core workload in <code class="language-plaintext highlighter-rouge">templates/</code>.</p>

<p>Typical files include:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>deployment.yaml
statefulset.yaml
daemonset.yaml
job.yaml
cronjob.yaml
</code></pre></div></div>

<p>The workload type reveals the <strong>runtime model</strong> of the application.</p>

<p>Example signals:</p>

<p>Deployment<br />
→ stateless service</p>

<p>StatefulSet<br />
→ database or stateful workload</p>

<p>DaemonSet<br />
→ node‑level agent (monitoring, logging, networking)</p>

<p>Understanding this immediately tells you <strong>how the system behaves operationally</strong>.</p>

<hr />

<h2 id="step-4--look-for-external-exposure">Step 4 — Look for External Exposure</h2>

<p>Next determine how the application is exposed.</p>

<p>Search for templates containing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>service.yaml
ingress.yaml
gateway.yaml
route.yaml
</code></pre></div></div>

<p>Signals:</p>

<p>Service type</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ClusterIP
NodePort
LoadBalancer
</code></pre></div></div>

<p>Ingress or Gateway configuration indicates the application expects external traffic.</p>

<p>This reveals <strong>how traffic enters the system</strong>.</p>

<hr />

<h2 id="step-5--check-resource-configuration">Step 5 — Check Resource Configuration</h2>

<p>One of the most important operational signals is how the chart defines resource limits.</p>

<p>Look for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>resources:
  limits:
  requests:
</code></pre></div></div>

<p>These values affect:</p>

<ul>
  <li>pod scheduling</li>
  <li>cluster capacity</li>
  <li>performance characteristics</li>
</ul>

<p>Charts without proper resource definitions often cause production issues.</p>

<p>Experienced engineers always scan this section early.</p>

<hr />

<h2 id="step-6--look-for-environment-injection">Step 6 — Look for Environment Injection</h2>

<p>Next identify how runtime configuration is injected.</p>

<p>Common patterns:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>env
envFrom
configMapRef
secretRef
</code></pre></div></div>

<p>This reveals where application configuration originates.</p>

<p>Example signals:</p>

<p>ConfigMap<br />
→ non‑sensitive configuration</p>

<p>Secret<br />
→ credentials or tokens</p>

<p>External secret systems may also appear through integrations with secret operators.</p>

<p>Understanding this tells you <strong>where configuration lives outside the chart</strong>.</p>

<hr />

<h2 id="step-7--check-helpers-and-template-logic">Step 7 — Check Helpers and Template Logic</h2>

<p>Most mature charts include helper functions in:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>templates/_helpers.tpl
</code></pre></div></div>

<p>These contain reusable template logic such as:</p>

<ul>
  <li>naming conventions</li>
  <li>label generation</li>
  <li>chart metadata</li>
</ul>

<p>You typically do not need to read every helper function, but scanning them reveals:</p>

<ul>
  <li>naming patterns</li>
  <li>resource label structure</li>
  <li>how multiple components are grouped</li>
</ul>

<p>This helps interpret rendered manifests later.</p>

<hr />

<h2 id="step-8--look-for-subcharts">Step 8 — Look for Subcharts</h2>

<p>Some charts embed other charts inside:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>charts/
</code></pre></div></div>

<p>or reference them through dependencies.</p>

<p>This often indicates the chart deploys a <strong>complete stack</strong>, not just an application.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>application
↓
redis
↓
database
↓
metrics stack
</code></pre></div></div>

<p>Subcharts increase operational complexity, so identifying them early is important.</p>
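
<p>One quick way to enumerate them is <code class="language-plaintext highlighter-rouge">helm dependency list</code>, which prints each declared dependency with its version, repository, and status (the chart path is a placeholder):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm dependency list ./payments-api
</code></pre></div></div>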

<hr />

<h2 id="reconstructing-the-deployment-model">Reconstructing the Deployment Model</h2>

<p>After scanning these areas, you should be able to reconstruct the architecture mentally.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Helm Chart
   ↓
Deployment (API)
   ↓
Service (ClusterIP)
   ↓
Ingress (public access)
   ↓
ConfigMaps + Secrets
   ↓
Optional Redis subchart
</code></pre></div></div>

<p>You now understand the <strong>core topology</strong> without reading every template.</p>

<hr />

<h2 id="signals-that-a-helm-chart-is-complex">Signals That a Helm Chart Is Complex</h2>

<p>Experienced engineers also watch for these indicators:</p>

<p>Large <code class="language-plaintext highlighter-rouge">values.yaml</code> (hundreds of lines)</p>

<p>Heavy template logic in <code class="language-plaintext highlighter-rouge">_helpers.tpl</code></p>

<p>Multiple workload types in <code class="language-plaintext highlighter-rouge">templates/</code></p>

<p>Embedded subcharts</p>

<p>Conditional blocks such as:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{{- if .Values.ingress.enabled }}
...
{{- end }}
</code></pre></div></div>

<p>These patterns usually indicate the chart supports <strong>multiple deployment modes</strong>.</p>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>When reading an unfamiliar Helm chart, scan in this order:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Chart.yaml
values.yaml
templates/ workload
service / ingress
resources
environment configuration
subcharts
</code></pre></div></div>

<p>This sequence allows you to reconstruct the deployment model quickly without reading every file.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Kubernetes" /><category term="Infrastructure in 60 Seconds" /><category term="Kubernetes" /><category term="Helm" /><category term="Platform Engineering" /><summary type="html"><![CDATA[Infrastructure in 60 Seconds — How to Read a Helm Chart]]></summary></entry><entry><title type="html">Infrastructure in 60 Seconds — How to Read a Kubernetes Deployment</title><link href="https://maungsan.github.io/kubernetes/2026/03/07/k8s-read-kubernetes-deployment-60-seconds/" rel="alternate" type="text/html" title="Infrastructure in 60 Seconds — How to Read a Kubernetes Deployment" /><published>2026-03-07T00:00:00-08:00</published><updated>2026-03-07T00:00:00-08:00</updated><id>https://maungsan.github.io/kubernetes/2026/03/07/k8s-read-kubernetes-deployment-60-seconds</id><content type="html" xml:base="https://maungsan.github.io/kubernetes/2026/03/07/k8s-read-kubernetes-deployment-60-seconds/"><![CDATA[<h2 id="infrastructure-in-60-seconds--how-to-read-a-kubernetes-deployment">Infrastructure in 60 Seconds — How to Read a Kubernetes Deployment</h2>

<p>When a Deployment becomes part of a production incident, reading it top to bottom is usually too slow. By the time you finish scanning every field, the real question has already shifted: what part of this object actually controls rollout behavior, runtime behavior, or recovery behavior?</p>

<p>Seasoned engineers usually do not read a Deployment as YAML. They read it as an <strong>operational contract</strong> between the application, the scheduler, and the rollout controller.</p>

<p>The fastest way to understand a Deployment is to answer a small set of questions:</p>

<ul>
  <li>What pods is this object trying to keep alive?</li>
  <li>What image is actually being deployed?</li>
  <li>How does rollout happen?</li>
  <li>What makes a pod healthy or unhealthy?</li>
  <li>What scheduling or runtime constraints exist?</li>
  <li>What other objects does this Deployment depend on?</li>
</ul>

<p>Once those answers are clear, most of the remaining YAML becomes supporting detail.</p>

<hr />

<h2 id="step-1--start-with-metadata-only-long-enough-to-establish-context">Step 1 — Start With Metadata Only Long Enough to Establish Context</h2>

<p>Do not get stuck in labels immediately. Start by identifying the basic context:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span>
  <span class="na">namespace</span><span class="pi">:</span>
</code></pre></div></div>

<p>That tells you where this object lives and usually what system or bounded context it belongs to.</p>

<p>Then glance at labels and annotations only for <strong>high-signal clues</strong> such as:</p>

<ul>
  <li>release ownership</li>
  <li>GitOps ownership</li>
  <li>team or service identity</li>
  <li>sidecar injection hints</li>
  <li>restart or checksum annotations</li>
</ul>

<p>Examples of useful signals:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">app.kubernetes.io/name</code></li>
  <li><code class="language-plaintext highlighter-rouge">app.kubernetes.io/part-of</code></li>
  <li><code class="language-plaintext highlighter-rouge">argocd.argoproj.io/instance</code></li>
  <li><code class="language-plaintext highlighter-rouge">sidecar.istio.io/inject</code></li>
  <li>checksum annotations tied to ConfigMaps or Secrets</li>
</ul>

<p>This step is not about detail. It is about understanding what broader system is managing the Deployment.</p>

<hr />

<h2 id="step-2--find-the-pod-template-immediately">Step 2 — Find the Pod Template Immediately</h2>

<p>The most important part of a Deployment is not the Deployment object itself. It is the pod template under:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">spec</span><span class="pi">:</span>
  <span class="na">template</span><span class="pi">:</span>
</code></pre></div></div>

<p>This is the future state the controller keeps trying to realize.</p>

<p>If you understand the pod template, you understand the real workload.</p>

<p>At minimum, scan for:</p>

<ul>
  <li>container images</li>
  <li>ports</li>
  <li>environment injection</li>
  <li>volume mounts</li>
  <li>service account</li>
  <li>resource requests and limits</li>
</ul>

<p>A good mental shortcut is:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Deployment = rollout logic + pod template
</code></pre></div></div>

<p>If the pod template changes, Kubernetes creates a new ReplicaSet and begins rollout behavior.</p>

<p>That is why most operational questions eventually come back to the template.</p>

<hr />

<h2 id="step-3--check-replicas-before-anything-fancy">Step 3 — Check <code class="language-plaintext highlighter-rouge">replicas</code> Before Anything Fancy</h2>

<p>Look at:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">spec</span><span class="pi">:</span>
  <span class="na">replicas</span><span class="pi">:</span>
</code></pre></div></div>

<p>This tells you the intended steady-state pod count.</p>

<p>It sounds obvious, but in practice this answers several important questions immediately:</p>

<ul>
  <li>Is this workload expected to be highly available?</li>
  <li>Is it intentionally single replica?</li>
  <li>Are we dealing with a horizontally scaled service or a singleton process?</li>
</ul>

<p>For example:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">replicas: 1</code> means update strategy and readiness become much more sensitive</li>
  <li><code class="language-plaintext highlighter-rouge">replicas: 2</code> or more suggests some availability expectations</li>
  <li>missing <code class="language-plaintext highlighter-rouge">replicas</code> may indicate HPA-managed behavior or default assumptions</li>
</ul>

<p>For incident response, this single field often explains why a rollout created downtime or why there is no failover behavior.</p>

<hr />

<h2 id="step-4--read-the-selector-carefully">Step 4 — Read the Selector Carefully</h2>

<p>Look at:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">spec</span><span class="pi">:</span>
  <span class="na">selector</span><span class="pi">:</span>
    <span class="na">matchLabels</span><span class="pi">:</span>
</code></pre></div></div>

<p>This is one of the highest-risk parts of the object because it defines <strong>which pods belong to this Deployment</strong>.</p>

<p>Experienced engineers treat the selector as identity, not decoration.</p>

<p>Why it matters:</p>

<ul>
  <li>it determines which ReplicaSets the Deployment manages</li>
  <li>it must align with pod template labels</li>
  <li>bad selector design creates dangerous ownership confusion</li>
</ul>

<p>Then compare it with:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">spec</span><span class="pi">:</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">metadata</span><span class="pi">:</span>
      <span class="na">labels</span><span class="pi">:</span>
</code></pre></div></div>

<p>Those labels must match the selector correctly.</p>
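
<p>A matching pair looks like this (the <code class="language-plaintext highlighter-rouge">app: payments-api</code> label is a hypothetical example):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spec:
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
</code></pre></div></div>

<p>Note that the selector is immutable after creation, so a mismatch introduced later usually surfaces as an apply-time error rather than silent drift.</p>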

<p>When debugging unexpected rollouts or pod ownership issues, this is one of the first places worth checking.</p>

<hr />

<h2 id="step-5--read-the-container-image-like-a-supply-chain-signal">Step 5 — Read the Container Image Like a Supply-Chain Signal</h2>

<p>Inside the pod template, go straight to:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">spec</span><span class="pi">:</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span>
        <span class="na">image</span><span class="pi">:</span>
</code></pre></div></div>

<p>This is not just “what image runs.” It tells you:</p>

<ul>
  <li>what artifact is being deployed</li>
  <li>whether the deployment is pinned or floating</li>
  <li>whether the image naming aligns with environment and registry conventions</li>
</ul>

<p>High-signal things to notice:</p>

<ul>
  <li>specific immutable tag vs generic tag</li>
  <li>internal registry vs public registry</li>
  <li>image naming patterns tied to platform conventions</li>
</ul>

<p>Examples:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">myregistry.azurecr.io/payments/api:1.4.7</code></li>
  <li><code class="language-plaintext highlighter-rouge">repo/service:latest</code></li>
</ul>

<p>Seasoned engineers get nervous when they see mutable tags like <code class="language-plaintext highlighter-rouge">latest</code>, because rollout behavior becomes harder to reason about and recovery becomes less deterministic.</p>

<hr />

<h2 id="step-6--check-rollout-strategy-before-you-check-probes">Step 6 — Check Rollout Strategy Before You Check Probes</h2>

<p>Look at:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">spec</span><span class="pi">:</span>
  <span class="na">strategy</span><span class="pi">:</span>
    <span class="na">type</span><span class="pi">:</span>
    <span class="na">rollingUpdate</span><span class="pi">:</span>
      <span class="na">maxSurge</span><span class="pi">:</span>
      <span class="na">maxUnavailable</span><span class="pi">:</span>
</code></pre></div></div>

<p>This tells you how Kubernetes replaces old pods with new ones.</p>

<p>This is where you determine whether the Deployment is optimized for:</p>

<ul>
  <li>availability</li>
  <li>speed</li>
  <li>conservative rollout</li>
  <li>aggressive replacement</li>
</ul>

<p>Examples:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">maxUnavailable: 0</code> favors continuity</li>
  <li><code class="language-plaintext highlighter-rouge">maxSurge: 0</code> may create tighter capacity behavior</li>
  <li>default RollingUpdate behavior may be acceptable for stateless services but fragile for constrained clusters</li>
</ul>
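
<p>For example, a conservative zero‑downtime configuration often looks like this (values are illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
</code></pre></div></div>

<p>This asks Kubernetes to create one extra pod at a time and never take a healthy pod away before its replacement is ready, at the cost of requiring spare cluster capacity during rollout.</p>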

<p>For experienced engineers, rollout strategy often explains production pain faster than probes do. Many “application issues” are really rollout math issues under limited capacity.</p>

<hr />

<h2 id="step-7--then-read-probes-as-recovery-policy">Step 7 — Then Read Probes as Recovery Policy</h2>

<p>Now inspect:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">livenessProbe</span>
<span class="s">readinessProbe</span>
<span class="s">startupProbe</span>
</code></pre></div></div>

<p>Do not read probes as health checks only. Read them as <strong>traffic control and restart policy signals</strong>.</p>

<p>What each really means operationally:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">readinessProbe</code> controls when the pod is eligible for traffic</li>
  <li><code class="language-plaintext highlighter-rouge">livenessProbe</code> controls when Kubernetes kills and restarts the container</li>
  <li><code class="language-plaintext highlighter-rouge">startupProbe</code> protects slow-starting applications from premature restart loops</li>
</ul>
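
<p>A typical combination for a slow‑starting HTTP service might look like this (paths, ports, and timings are illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
</code></pre></div></div>

<p>Here the startup probe allows up to roughly five minutes of startup time (30 failures × 10 seconds) before the liveness probe takes over.</p>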

<p>This is where you ask:</p>

<ul>
  <li>Can the app start slowly?</li>
  <li>Can it accept traffic before dependencies are ready?</li>
  <li>Can a bad liveness probe create artificial restarts?</li>
  <li>Can readiness failures explain why rollout stalls?</li>
</ul>

<p>In production, many “deployment problems” are actually probe problems.</p>

<hr />

<h2 id="step-8--read-resources-as-scheduling-intent">Step 8 — Read Resources as Scheduling Intent</h2>

<p>Check:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">resources</span><span class="pi">:</span>
  <span class="na">requests</span><span class="pi">:</span>
  <span class="na">limits</span><span class="pi">:</span>
</code></pre></div></div>

<p>This is one of the most important sections for platform engineers because it expresses how the workload negotiates with the scheduler and node capacity.</p>

<p>Read it as:</p>

<ul>
  <li>what minimum capacity the pod requires</li>
  <li>what maximum runtime envelope it may consume</li>
  <li>whether the values seem realistic for the application type</li>
</ul>

<p>Signals to look for:</p>

<ul>
  <li>missing requests</li>
  <li>equal requests and limits</li>
  <li>suspiciously small CPU or memory values</li>
  <li>very high limits relative to requests</li>
</ul>

<p>These values influence:</p>

<ul>
  <li>placement</li>
  <li>eviction pressure</li>
  <li>autoscaling behavior</li>
  <li>noisy-neighbor effects</li>
</ul>

<p>A Deployment without sensible resource settings is often a future incident waiting to happen.</p>
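
<p>A reasonable starting shape for a small API workload might be the following (the numbers are illustrative and always workload‑dependent):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi
</code></pre></div></div>

<p>Setting a memory limit without a CPU limit is a common pattern: exceeding memory is fatal (OOMKill), while leaving CPU unlimited lets the pod burst when spare cycles exist.</p>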

<hr />

<h2 id="step-9--check-environment-and-configuration-injection">Step 9 — Check Environment and Configuration Injection</h2>

<p>Next inspect:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">env</span><span class="pi">:</span>
<span class="na">envFrom</span><span class="pi">:</span>
<span class="na">configMapRef</span><span class="pi">:</span>
<span class="na">secretRef</span><span class="pi">:</span>
<span class="na">volumes</span><span class="pi">:</span>
<span class="na">volumeMounts</span><span class="pi">:</span>
</code></pre></div></div>

<p>This reveals where runtime configuration comes from and what external dependencies the workload assumes.</p>

<p>Important questions:</p>

<ul>
  <li>Does the app require ConfigMaps or Secrets to start?</li>
  <li>Is configuration mounted as files or injected as environment variables?</li>
  <li>Are there external certificates, tokens, or identity bindings involved?</li>
  <li>Is the pod coupled to storage or projected volumes?</li>
</ul>

<p>This step often explains why a Deployment looks correct but pods still fail at runtime.</p>

<p>The Deployment may be syntactically fine while its dependencies are missing, stale, or out of sync.</p>
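
<p>A typical injection pattern looks like this (the object names are placeholders):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>envFrom:
  - configMapRef:
      name: payments-api-config
  - secretRef:
      name: payments-api-secrets
</code></pre></div></div>

<p>If either referenced object is missing, pods fail with <code class="language-plaintext highlighter-rouge">CreateContainerConfigError</code> even though the Deployment itself applies cleanly.</p>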

<hr />

<h2 id="step-10--scan-scheduling-and-identity-constraints">Step 10 — Scan Scheduling and Identity Constraints</h2>

<p>Then inspect high-signal pod spec fields such as:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">serviceAccountName</code></li>
  <li><code class="language-plaintext highlighter-rouge">nodeSelector</code></li>
  <li><code class="language-plaintext highlighter-rouge">tolerations</code></li>
  <li><code class="language-plaintext highlighter-rouge">affinity</code></li>
  <li><code class="language-plaintext highlighter-rouge">topologySpreadConstraints</code></li>
  <li>security context fields</li>
</ul>

<p>These fields reveal <strong>where the pod is allowed to run and under what identity</strong>.</p>

<p>This is operationally important because many production issues come from scheduling constraints rather than application logic.</p>

<p>Examples:</p>

<ul>
  <li>wrong service account → cloud identity failures</li>
  <li>strict node selectors → unschedulable pods</li>
  <li>missing tolerations → pods never land on intended node pools</li>
  <li>topology constraints → rollout stalls in small clusters</li>
</ul>

<p>For seasoned engineers, this section often explains “why pods are Pending” faster than events do.</p>

<hr />

<h2 id="step-11--understand-what-the-deployment-does-not-tell-you">Step 11 — Understand What the Deployment Does <em>Not</em> Tell You</h2>

<p>A Deployment alone does not fully explain a running service.</p>

<p>It usually depends on surrounding objects:</p>

<ul>
  <li>Service</li>
  <li>Ingress / Gateway</li>
  <li>ConfigMaps</li>
  <li>Secrets</li>
  <li>HPA</li>
  <li>PDB</li>
  <li>NetworkPolicy</li>
  <li>ServiceAccount and RBAC</li>
  <li>external secret or identity systems</li>
</ul>

<p>One of the fastest ways to avoid misdiagnosis is to treat a Deployment as one part of a workload bundle, not the full application definition.</p>

<p>A Deployment may be valid while the real failure lives in one of those adjacent objects.</p>

<hr />

<h2 id="reconstruct-the-operational-model">Reconstruct the Operational Model</h2>

<p>After scanning those sections, you should be able to build a mental model quickly.</p>

<p>Example:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Deployment
  ↓
3 replicas of an API pod
  ↓
Rolling update with no downtime target
  ↓
Traffic gated by readiness probe
  ↓
Restart policy driven by liveness probe
  ↓
Config from Secret + ConfigMap
  ↓
Scheduled only on workload nodes
  ↓
Uses cloud identity via service account
</code></pre></div></div>

<p>That is the point of the exercise. You are not memorizing YAML. You are reconstructing the workload’s operational behavior.</p>

<hr />

<h2 id="signals-that-a-deployment-deserves-extra-attention">Signals That a Deployment Deserves Extra Attention</h2>

<p>Experienced engineers usually slow down when they see patterns like these:</p>

<ul>
  <li>mutable image tags</li>
  <li>no resource requests</li>
  <li>liveness probe without startup probe on slow apps</li>
  <li>strict affinity combined with small clusters</li>
  <li>single replica plus aggressive rollout settings</li>
  <li>heavy use of annotations from multiple controllers</li>
  <li>environment injection spread across many sources</li>
  <li>checksum annotations implying config-driven restarts</li>
</ul>

<p>These are not always wrong, but they usually indicate higher operational sensitivity.</p>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>To understand a Kubernetes Deployment quickly, scan in this order:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>metadata context
pod template
replicas
selector and pod labels
image
rollout strategy
probes
resources
configuration injection
scheduling and identity constraints
adjacent dependencies
</code></pre></div></div>

<p>That sequence helps you reconstruct how the workload behaves in production, which is far more useful than simply knowing what the YAML syntax means.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Kubernetes" /><category term="Infrastructure in 60 Seconds" /><category term="Kubernetes" /><category term="Deployment" /><category term="Platform Engineering" /><summary type="html"><![CDATA[Infrastructure in 60 Seconds — How to Read a Kubernetes Deployment]]></summary></entry><entry><title type="html">Infrastructure in 60 Seconds — How to Read a Dockerfile</title><link href="https://maungsan.github.io/containers/2026/03/06/containers-read-dockerfile-60-seconds-v2/" rel="alternate" type="text/html" title="Infrastructure in 60 Seconds — How to Read a Dockerfile" /><published>2026-03-06T00:00:00-08:00</published><updated>2026-03-06T00:00:00-08:00</updated><id>https://maungsan.github.io/containers/2026/03/06/containers-read-dockerfile-60-seconds-v2</id><content type="html" xml:base="https://maungsan.github.io/containers/2026/03/06/containers-read-dockerfile-60-seconds-v2/"><![CDATA[<h2 id="infrastructure-in-60-seconds--how-to-read-a-dockerfile">Infrastructure in 60 Seconds — How to Read a Dockerfile</h2>

<p>Dockerfiles are often opened during incidents, security reviews, or performance investigations. When that happens, reading them line‑by‑line is rarely the fastest way to understand what is going on.</p>

<p>Experienced engineers read Dockerfiles as <strong>image construction pipelines</strong>. The goal is to reconstruct how the runtime environment is built and identify signals that affect reproducibility, security posture, build speed, and runtime behavior.</p>

<p>Instead of parsing every instruction, focus on a small number of signals that reveal the entire container lifecycle.</p>

<p>The fastest way to understand a Dockerfile is to answer these questions:</p>

<p>• What base image defines the runtime environment?<br />
• Is this a multi‑stage build?<br />
• What dependencies are installed?<br />
• What application artifacts are copied into the image?<br />
• What process actually runs in the container?</p>

<p>Once those answers are clear, the rest of the file usually becomes predictable.</p>

<hr />

<h2 id="-step-1--identify-the-base-image-the-supply-chain-root">🧱 Step 1 — Identify the Base Image (The Supply Chain Root)</h2>

<p>Start with the first <code class="language-plaintext highlighter-rouge">FROM</code> instruction.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM node:20-alpine
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM mcr.microsoft.com/dotnet/aspnet:8.0
</code></pre></div></div>

<p>This line defines:</p>

<p>• the operating system layer<br />
• the runtime environment<br />
• the security patch source<br />
• the expected base image size</p>

<p>Seasoned engineers immediately look for these signals:</p>

<ul>
  <li>pinned version vs floating tag</li>
  <li><code class="language-plaintext highlighter-rouge">alpine</code>, <code class="language-plaintext highlighter-rouge">slim</code>, or minimal variants</li>
  <li>internal registry vs public registry</li>
</ul>

<p>Examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM node:20-alpine
FROM python:3.11-slim
FROM mycompany.azurecr.io/platform/base-runtime:2.4
</code></pre></div></div>

<p>This step answers a fundamental question:</p>

<p><strong>What environment does every container instance ultimately inherit from?</strong></p>

<hr />

<h2 id="-step-2--detect-multistage-builds">🧩 Step 2 — Detect Multi‑Stage Builds</h2>

<p>Next scan for <strong>multiple <code class="language-plaintext highlighter-rouge">FROM</code> statements</strong>.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM node:20 AS build
FROM nginx:alpine
</code></pre></div></div>

<p>Multiple stages usually mean:</p>

<p>build image<br />
↓<br />
compile or bundle artifacts<br />
↓<br />
copy only runtime artifacts<br />
↓<br />
create smaller final image</p>

<p>Mental model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Build Stage
   ↓
Compile / package application
   ↓
Runtime Stage
   ↓
Minimal production container
</code></pre></div></div>

<p>This pattern reduces:</p>

<ul>
  <li>final image size</li>
  <li>attack surface</li>
  <li>unnecessary toolchains in runtime containers</li>
</ul>

<p>When investigating performance or security issues, multi‑stage builds are a strong signal of <strong>image optimization maturity</strong>.</p>
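<p>The two-stage pattern above can be sketched end to end. This is a minimal illustration, not taken from any real file — the <code class="language-plaintext highlighter-rouge">WORKDIR</code>, paths, and npm scripts are assumptions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Build stage: full toolchain available
FROM node:20 AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY src/ ./src
RUN npm run build

# Runtime stage: only the built artifacts are copied over
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">COPY --from=build</code> instruction is the seam between the stages: nothing from the build toolchain survives into the runtime image unless it is copied explicitly.</p>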

<hr />

<h2 id="-step-3--locate-dependency-installation">📦 Step 3 — Locate Dependency Installation</h2>

<p>Next scan <code class="language-plaintext highlighter-rouge">RUN</code> instructions that install dependencies.</p>

<p>Common patterns:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN apt-get install
RUN apk add
RUN pip install
RUN npm install
RUN dotnet restore
</code></pre></div></div>

<p>These instructions reveal:</p>

<p>• language ecosystem<br />
• system library dependencies<br />
• build-time toolchains<br />
• potential security exposure surface</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN apt-get update &amp;&amp; apt-get install -y \
    curl \
    ca-certificates \
    libpq-dev
</code></pre></div></div>

<p>Large dependency blocks often explain:</p>

<ul>
  <li>slow container builds</li>
  <li>large image sizes</li>
  <li>expanded vulnerability surface</li>
</ul>

<p>Experienced engineers quickly check whether <strong>build dependencies accidentally remain in the runtime image</strong>.</p>
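<p>A common remediation pattern keeps the package cache out of the final layer. This is a hedged sketch — the package list is illustrative:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \
    curl \
    ca-certificates \
 &amp;&amp; rm -rf /var/lib/apt/lists/*
</code></pre></div></div>

<p>Because each <code class="language-plaintext highlighter-rouge">RUN</code> produces one layer, the cleanup must happen in the same instruction as the install — a separate <code class="language-plaintext highlighter-rouge">RUN rm</code> would not shrink the earlier layer.</p>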

<hr />

<h2 id="-step-4--understand-file-copy-strategy">📁 Step 4 — Understand File Copy Strategy</h2>

<p>Next inspect how the application code enters the container.</p>

<p>Typical instructions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>COPY
ADD
</code></pre></div></div>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>COPY package.json ./
COPY package-lock.json ./
RUN npm install

COPY src/ ./src
</code></pre></div></div>

<p>This ordering is intentional.</p>

<p>Experienced engineers look for caching patterns:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Copy dependency manifests
↓
Install dependencies
↓
Copy application source code
</code></pre></div></div>

<p>Why this matters:</p>

<p>Docker layer caching allows dependency installation to be reused when only source files change.</p>

<p>Poor ordering causes <strong>dependency layers to rebuild on every commit</strong>, slowing CI pipelines dramatically.</p>

<hr />

<h2 id="️-step-5--inspect-environment-and-build-arguments">⚙️ Step 5 — Inspect Environment and Build Arguments</h2>

<p>Next check for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ENV
ARG
</code></pre></div></div>

<p>Examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ENV NODE_ENV=production
ARG BUILD_VERSION
</code></pre></div></div>

<p>Key differences:</p>

<p>ARG → build-time variables<br />
ENV → runtime environment variables</p>

<p>Signals revealed here:</p>

<p>• environment assumptions<br />
• runtime configuration defaults<br />
• version injection patterns</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ARG VERSION
ENV APP_VERSION=$VERSION
</code></pre></div></div>

<p>These patterns often connect the Docker build process with CI/CD pipelines.</p>
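<p>As a hedged sketch of that connection (the version value and image name below are hypothetical), a pipeline typically injects the build argument on the command line:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># In the Dockerfile
ARG VERSION
ENV APP_VERSION=$VERSION

# In the CI pipeline
docker build --build-arg VERSION=1.4.2 -t myapp:1.4.2 .
</code></pre></div></div>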

<hr />

<h2 id="-step-6--identify-the-runtime-process">🚀 Step 6 — Identify the Runtime Process</h2>

<p>Now find the container’s startup command.</p>

<p>Look for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CMD
ENTRYPOINT
</code></pre></div></div>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CMD ["node", "server.js"]
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ENTRYPOINT ["dotnet", "payments-api.dll"]
</code></pre></div></div>

<p>This line reveals the <strong>actual workload process</strong>.</p>

<p>Everything earlier in the Dockerfile simply prepares the environment required to run this command.</p>

<p>When debugging runtime behavior, this is one of the most important lines in the file.</p>

<hr />

<h2 id="-step-7--look-for-security-signals">🔐 Step 7 — Look for Security Signals</h2>

<p>Experienced engineers also quickly scan for security posture signals.</p>

<p>Things worth checking:</p>

<p>• containers running as root<br />
• absence of a <code class="language-plaintext highlighter-rouge">USER</code> directive<br />
• leftover package managers<br />
• unnecessary build toolchains in runtime images</p>

<p>Example improvement pattern:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN adduser --system appuser
USER appuser
</code></pre></div></div>

<p>Running containers as non‑root is a common baseline in hardened Kubernetes platforms.</p>

<p>Another signal:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM node:latest
</code></pre></div></div>

<p>Mutable tags like <code class="language-plaintext highlighter-rouge">latest</code> make image reproducibility harder and complicate incident debugging.</p>
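<p>A common hardening step is to pin the base image. The specific version and digest below are placeholders, not real values:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Pin to an exact version tag
FROM node:20.11.1-alpine

# Or, stricter, pin by digest
FROM node:20-alpine@sha256:&lt;digest&gt;
</code></pre></div></div>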

<hr />

<h2 id="-reconstruct-the-image-build-pipeline">🧠 Reconstruct the Image Build Pipeline</h2>

<p>After scanning these sections, you should be able to mentally reconstruct the container build process.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Base runtime image (node:20-alpine)
        ↓
Install system dependencies
        ↓
Install Node dependencies
        ↓
Copy application code
        ↓
Set environment configuration
        ↓
Start application process
</code></pre></div></div>

<p>At this point you understand how the container is assembled and what environment the application runs inside.</p>

<hr />

<h2 id="️-signals-that-a-dockerfile-deserves-extra-attention">⚠️ Signals That a Dockerfile Deserves Extra Attention</h2>

<p>Experienced engineers slow down when they see patterns like:</p>

<ul>
  <li>mutable base image tags (<code class="language-plaintext highlighter-rouge">latest</code>)</li>
  <li>large dependency installs in runtime images</li>
  <li>missing <code class="language-plaintext highlighter-rouge">.dockerignore</code></li>
  <li>application code copied before dependency manifests</li>
  <li>unnecessary package managers left installed</li>
  <li>lack of explicit runtime user</li>
</ul>

<p>These signals often correlate with:</p>

<p>• slow builds<br />
• oversized images<br />
• increased vulnerability surface<br />
• inconsistent deployments</p>
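<p>Of the signals above, a missing <code class="language-plaintext highlighter-rouge">.dockerignore</code> is usually the cheapest to fix. A minimal example — the entries are illustrative:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>node_modules
.git
dist
*.log
</code></pre></div></div>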

<hr />

<h2 id="-key-takeaway">🧭 Key Takeaway</h2>

<p>To understand a Dockerfile quickly, scan in this order:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>base image
multi-stage structure
dependency installation
file copy strategy
environment configuration
runtime command
security signals
</code></pre></div></div>

<p>This sequence allows you to reconstruct how the container image is built and how the workload will behave in production — without reading every line of the Dockerfile.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Containers" /><category term="Infrastructure in 60 Seconds" /><category term="Docker" /><category term="Containers" /><category term="Platform Engineering" /><summary type="html"><![CDATA[Infrastructure in 60 Seconds — How to Read a Dockerfile]]></summary></entry><entry><title type="html">Infrastructure in 60 Seconds — How to Read a Terraform Module</title><link href="https://maungsan.github.io/terraform/2026/03/06/tf-read-terraform-module-60-seconds/" rel="alternate" type="text/html" title="Infrastructure in 60 Seconds — How to Read a Terraform Module" /><published>2026-03-06T00:00:00-08:00</published><updated>2026-03-06T00:00:00-08:00</updated><id>https://maungsan.github.io/terraform/2026/03/06/tf-read-terraform-module-60-seconds</id><content type="html" xml:base="https://maungsan.github.io/terraform/2026/03/06/tf-read-terraform-module-60-seconds/"><![CDATA[<h2 id="infrastructure-in-60-seconds--how-to-read-a-terraform-module">Infrastructure in 60 Seconds — How to Read a Terraform Module</h2>

<p>Opening a Terraform module for the first time can be disorienting. Mature infrastructure repositories often contain dozens of resources, nested modules, dynamic expressions, and environment‑specific behavior. Reading the code from top to bottom rarely helps.</p>

<p>Experienced engineers instead reconstruct the <strong>infrastructure topology</strong> by scanning a few structural signals. The goal is not to understand every line immediately — it is to answer a small set of questions quickly:</p>

<p>• What infrastructure does this module create?<br />
• What external systems does it depend on?<br />
• What inputs control its behavior?<br />
• What outputs does it expose to the rest of the platform?</p>

<p>Once those answers are clear, the rest of the module becomes predictable.</p>

<hr />

<h2 id="step-1--identify-the-provider-and-platform">Step 1 — Identify the Provider and Platform</h2>

<p>Start by locating the <strong>provider configuration</strong>.</p>

<p>Typical signals appear in:</p>

<p>providers.tf<br />
main.tf<br />
terraform blocks</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>provider "azurerm" {
  features {}
}
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>required_providers {
  aws = {
    source = "hashicorp/aws"
  }
}
</code></pre></div></div>

<p>This immediately tells you <strong>which infrastructure domain the module controls</strong>.</p>

<p>Examples:</p>

<p>azurerm → Azure infrastructure<br />
aws → AWS infrastructure<br />
google → GCP infrastructure<br />
kubernetes → cluster resources<br />
helm → application deployments</p>

<p>This step narrows the scope of what the module can possibly create.</p>

<hr />

<h2 id="step-2--identify-the-primary-resource-type">Step 2 — Identify the Primary Resource Type</h2>

<p>Next search for the dominant <code class="language-plaintext highlighter-rouge">resource</code> blocks.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>resource "azurerm_kubernetes_cluster" "this" {}

resource "aws_vpc" "main" {}

resource "azurerm_storage_account" "logs" {}
</code></pre></div></div>

<p>Most modules revolve around <strong>one primary resource type</strong>. Everything else usually supports that resource.</p>

<p>For example:</p>

<p>AKS module → network, identities, monitoring attached to cluster<br />
VPC module → subnets, routing tables, gateways<br />
Storage module → networking, encryption, access policies</p>

<p>Identifying the primary resource reveals the <strong>module’s architectural purpose</strong>.</p>

<hr />

<h2 id="step-3--scan-for-nested-modules">Step 3 — Scan for Nested Modules</h2>

<p>Many production Terraform modules are composed of smaller modules.</p>

<p>Look for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module "network" {}
module "monitoring" {}
module "identity" {}
</code></pre></div></div>

<p>These often represent <strong>platform building blocks</strong>.</p>

<p>Example architecture reconstruction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module.cluster
  ↓
module.network
module.identity
module.monitoring
</code></pre></div></div>

<p>This step shows whether the module represents:</p>

<p>• a small reusable component<br />
• a platform layer<br />
• a full environment stack</p>

<p>Understanding this hierarchy prevents you from misreading supporting infrastructure as the main purpose of the module.</p>

<hr />

<h2 id="step-4--check-data-sources-external-dependencies">Step 4 — Check Data Sources (External Dependencies)</h2>

<p>Data sources reveal infrastructure that already exists.</p>

<p>Search for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data "…"
</code></pre></div></div>

<p>Examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data "azurerm_subnet"
data "aws_ami"
data "azurerm_resource_group"
</code></pre></div></div>

<p>This tells you the module <strong>depends on external infrastructure</strong>.</p>

<p>Typical signals:</p>

<p>existing network topology<br />
existing identity systems<br />
shared platform resources</p>

<p>This step helps reconstruct <strong>the boundary between this module and the wider platform</strong>.</p>

<hr />

<h2 id="step-5--identify-input-variables">Step 5 — Identify Input Variables</h2>

<p>Variables define how the module is controlled.</p>

<p>Look for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>variable "…"
</code></pre></div></div>

<p>Typical examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>variable "environment" {}
variable "location" {}
variable "cluster_name" {}
variable "subnet_id" {}
</code></pre></div></div>

<p>Experienced engineers read variables not to understand syntax but to identify:</p>

<p>• required inputs<br />
• optional configuration paths<br />
• environment-specific behavior</p>

<p>Large variable surfaces often indicate the module is designed to support <strong>multiple deployment patterns</strong>.</p>
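<p>For illustration, a fuller variable definition might look like the sketch below — the name, description, and allowed values are assumptions, not taken from any specific module:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>variable "environment" {
  type        = string
  description = "Deployment environment"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod."
  }
}
</code></pre></div></div>

<p>Validation blocks like this are a useful reading signal in their own right: they document the deployment patterns the module authors expected.</p>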

<hr />

<h2 id="step-6--scan-locals-for-derived-architecture">Step 6 — Scan Locals for Derived Architecture</h2>

<p>Locals often encode important design decisions.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>locals {
  cluster_name = "${var.environment}-aks"
  tags = merge(var.tags, {
    platform = "core"
  })
}
</code></pre></div></div>

<p>These blocks frequently reveal:</p>

<p>naming conventions<br />
tagging strategy<br />
environment isolation<br />
derived resource structure</p>

<p>Scanning locals can quickly expose <strong>organizational infrastructure patterns</strong>.</p>

<hr />

<h2 id="step-7--identify-outputs">Step 7 — Identify Outputs</h2>

<p>Outputs reveal how other modules depend on this module.</p>

<p>Search for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output "…"
</code></pre></div></div>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output "cluster_id" {}
output "subnet_ids" {}
output "vnet_id" {}
</code></pre></div></div>

<p>Outputs usually represent <strong>integration points</strong> with the rest of the infrastructure.</p>

<p>Example mental model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Network Module
     ↓
AKS Module
     ↓
Application Platform
</code></pre></div></div>

<p>Understanding outputs tells you where this module fits in the larger architecture.</p>
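<p>A complete output block ties a resource attribute to that integration point. A hedged sketch — the resource address is hypothetical:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output "cluster_id" {
  description = "Consumed by downstream platform modules"
  value       = azurerm_kubernetes_cluster.this.id
}
</code></pre></div></div>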

<hr />

<h2 id="step-8--watch-for-conditional-infrastructure">Step 8 — Watch for Conditional Infrastructure</h2>

<p>Terraform modules often support multiple deployment modes using conditionals.</p>

<p>Examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>count = var.enable_monitoring ? 1 : 0
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for_each = var.subnets
</code></pre></div></div>

<p>These patterns indicate:</p>

<p>feature flags<br />
environment-specific behavior<br />
optional platform components</p>

<p>When you see many conditional expressions, expect the module to support <strong>several operational configurations</strong>.</p>
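<p>Both patterns can be sketched together. The variable names and attributes below are illustrative assumptions, not a definitive implementation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Feature flag: create the resource only when monitoring is enabled
resource "azurerm_resource_group" "monitoring" {
  count    = var.enable_monitoring ? 1 : 0
  name     = "${var.environment}-monitoring"
  location = var.location
}

# Fan-out: one subnet per entry in a map of name =&gt; address prefix
resource "azurerm_subnet" "this" {
  for_each             = var.subnets
  name                 = each.key
  resource_group_name  = var.resource_group_name
  virtual_network_name = var.vnet_name
  address_prefixes     = [each.value]
}
</code></pre></div></div>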

<hr />

<h2 id="reconstructing-the-architecture">Reconstructing the Architecture</h2>

<p>After scanning these areas, you should be able to reconstruct the infrastructure model quickly.</p>

<p>Example mental model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Terraform Module
      ↓
Primary Resource (AKS Cluster)
      ↓
Supporting Infrastructure
   • networking
   • managed identity
   • monitoring
      ↓
Outputs exposed to platform modules
</code></pre></div></div>

<p>This allows you to reason about the module without reading every line of Terraform.</p>

<hr />

<h2 id="signals-that-a-terraform-module-is-complex">Signals That a Terraform Module Is Complex</h2>

<p>Experienced engineers watch for these indicators:</p>

<p>large variable surfaces<br />
many nested modules<br />
heavy conditional logic<br />
dynamic blocks<br />
cross‑module outputs</p>

<p>These signals often indicate the module supports <strong>multiple environments or platform layers</strong>.</p>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>To understand a Terraform module quickly, scan in this order:</p>

<p>provider / terraform block<br />
primary resource types<br />
nested modules<br />
data sources<br />
variables<br />
locals<br />
outputs</p>

<p>This sequence reveals the module’s role in the infrastructure architecture without reading the entire codebase.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Terraform" /><category term="Infrastructure in 60 Seconds" /><category term="Terraform" /><category term="Infrastructure as Code" /><category term="Platform Engineering" /><summary type="html"><![CDATA[Infrastructure in 60 Seconds — How to Read a Terraform Module]]></summary></entry><entry><title type="html">DevOps Quick Read - How to Read a Packer Template in 60 Seconds</title><link href="https://maungsan.github.io/devops/2026/03/05/devops-read-packer-template-60-seconds/" rel="alternate" type="text/html" title="DevOps Quick Read - How to Read a Packer Template in 60 Seconds" /><published>2026-03-05T00:00:00-08:00</published><updated>2026-03-05T00:00:00-08:00</updated><id>https://maungsan.github.io/devops/2026/03/05/devops-read-packer-template-60-seconds</id><content type="html" xml:base="https://maungsan.github.io/devops/2026/03/05/devops-read-packer-template-60-seconds/"><![CDATA[<h2 id="-how-to-read-a-packer-template-in-60-seconds">⚡ How to Read a Packer Template in 60 Seconds</h2>

<p>If you inherit a Packer template in an existing infrastructure repository, reading it line-by-line is usually the wrong approach.</p>

<p>Senior engineers typically reconstruct the architecture first, then fill in details only if needed.</p>

<p>With some pattern recognition you can usually understand a Packer build in under a minute by answering four questions:</p>

<p>• What platform is the image built on?<br />
• What base operating system does it start from?<br />
• What software is installed during provisioning?<br />
• Where is the final image stored?</p>

<p>Once those are clear, you already understand most of the pipeline.</p>

<hr />

<h2 id="step-1--identify-the-target-platform">Step 1 — Identify the Target Platform</h2>

<p>Start by locating the <strong>builder / source block</strong>.</p>

<p>Examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source "azure-arm" "image" {}
source "amazon-ebs" "image" {}
source "googlecompute" "image" {}
source "vmware-iso" "image" {}
</code></pre></div></div>

<p>This tells you <strong>where the image is being built</strong>.</p>

<p>Common builders:</p>

<table>
  <thead>
    <tr>
      <th>Builder</th>
      <th>Platform</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>azure-arm</td>
      <td>Azure VM Image</td>
    </tr>
    <tr>
      <td>amazon-ebs</td>
      <td>AWS AMI</td>
    </tr>
    <tr>
      <td>googlecompute</td>
      <td>Google Cloud Image</td>
    </tr>
    <tr>
      <td>vmware-iso</td>
      <td>On‑prem VMware</td>
    </tr>
    <tr>
      <td>docker</td>
      <td>Container image</td>
    </tr>
  </tbody>
</table>

<p>Once you see the builder you can already visualize the architecture:</p>

<p>Packer → Temporary VM → Provision → Capture Image</p>

<hr />

<h2 id="step-2--identify-the-base-image">Step 2 — Identify the Base Image</h2>

<p>Next determine what operating system the image starts from.</p>

<p>Look for fields like:</p>

<p>image_publisher<br />
image_offer<br />
image_sku</p>

<p>or on AWS:</p>

<p>source_ami</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>image_publisher = "Canonical"
image_offer     = "UbuntuServer"
image_sku       = "20_04-lts"
</code></pre></div></div>

<p>Meaning the build starts from <strong>Ubuntu 20.04</strong>.</p>

<p>The base image defines the entire starting state of the system. If it changes, everything downstream changes as well.</p>

<p>Many enterprise pipelines fail simply because the upstream image was updated or deprecated.</p>

<hr />

<h2 id="step-3--locate-provisioners">Step 3 — Locate Provisioners</h2>

<p>Provisioners describe what happens inside the temporary VM.</p>

<p>Search for:</p>

<p>provisioner</p>

<p>Common types include:</p>

<p>provisioner “shell”<br />
provisioner “powershell”<br />
provisioner “ansible”<br />
provisioner “file”</p>

<p>These blocks reveal the <strong>purpose of the image</strong>.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>provisioner "shell" {
  inline = [
    "apt-get install -y docker.io",
    "apt-get install -y nginx"
  ]
}
</code></pre></div></div>

<p>You can immediately infer that the image likely prepares a <strong>web server environment</strong>.</p>

<p>Another example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>provisioner "ansible" {
  playbook_file = "hardening.yml"
}
</code></pre></div></div>

<p>This typically indicates a <strong>security-hardened base image</strong>.</p>

<p>Provisioners usually reveal why the image exists.</p>

<hr />

<h2 id="step-4--check-the-communicator">Step 4 — Check the Communicator</h2>

<p>Packer needs a way to connect to the temporary machine.</p>

<p>Look for:</p>

<p>communicator</p>

<p>Common values:</p>

<table>
  <thead>
    <tr>
      <th>Communicator</th>
      <th>OS</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ssh</td>
      <td>Linux</td>
    </tr>
    <tr>
      <td>winrm</td>
      <td>Windows</td>
    </tr>
  </tbody>
</table>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>communicator = "ssh"
</code></pre></div></div>

<p>This indicates a Linux build environment.</p>

<hr />

<h2 id="step-5--identify-image-output">Step 5 — Identify Image Output</h2>

<p>Next determine where the final image is stored.</p>

<p>Look for fields such as:</p>

<p>shared_image_gallery<br />
managed_image_name<br />
ami_name</p>

<p>Modern Azure builds typically publish to:</p>

<p>Azure Shared Image Gallery</p>

<p>This matters because other infrastructure systems depend on this image, such as:</p>

<p>Terraform deployments<br />
VM scale sets<br />
AKS node pools</p>

<p>Understanding the output tells you how the image participates in the wider infrastructure pipeline.</p>
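<p>As a hedged sketch (the gallery, resource group, and version values are hypothetical), an Azure image destination often appears inside the source block like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source "azure-arm" "image" {
  managed_image_name                = "base-runtime"
  managed_image_resource_group_name = "rg-images"

  shared_image_gallery_destination {
    resource_group      = "rg-images"
    gallery_name        = "platform_gallery"
    image_name          = "base-runtime"
    image_version       = "1.0.0"
    replication_regions = ["westus2"]
  }
}
</code></pre></div></div>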

<hr />

<h2 id="step-6--scan-variables-and-locals">Step 6 — Scan Variables and Locals</h2>

<p>Variables reveal how the template integrates with CI/CD pipelines.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>variable "client_id" {
  default = env("pipeline-client-id")
}
</code></pre></div></div>

<p>This indicates that credentials are injected by the pipeline environment.</p>

<p>Locals are helper values used during the build.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>locals {
  imagetag = formatdate("MMDDYYYY-HHmm", timestamp())
}
</code></pre></div></div>

<p>This pattern typically creates unique image versions for each build.</p>

<hr />

<h2 id="step-7--check-postprocessors-optional">Step 7 — Check Post‑Processors (Optional)</h2>

<p>Some templates include post‑processors such as:</p>

<p>docker-tag<br />
manifest<br />
compress<br />
vagrant</p>

<p>These steps usually publish metadata or push artifacts after the image is built.</p>

<p>Not every template uses them.</p>

<hr />

<h2 id="reconstructing-the-architecture-quickly">Reconstructing the Architecture Quickly</h2>

<p>After scanning those sections you should be able to mentally reconstruct the pipeline.</p>

<p>Example mental model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Base Image (Ubuntu 20.04)
        ↓
Temporary Azure VM
        ↓
Provisioners install Docker, monitoring agents, and security tools
        ↓
Image captured
        ↓
Stored in Shared Image Gallery
        ↓
Terraform deploys infrastructure from that image
</code></pre></div></div>

<p>You now understand the system without reading every line.</p>

<hr />

<h2 id="the-typical-enterprise-image-pipeline">The Typical Enterprise Image Pipeline</h2>

<p>Most organizations structure Packer pipelines roughly like this:</p>

<p>Git Repository<br />
        ↓<br />
CI/CD Pipeline<br />
        ↓<br />
Packer Build<br />
        ↓<br />
Shared Image Gallery<br />
        ↓<br />
Terraform Deployment<br />
        ↓<br />
VM Scale Sets or Platform Nodes</p>

<p>This pattern is commonly referred to as a <strong>golden image pipeline</strong>.</p>

<hr />

<h2 id="common-causes-of-packer-pipeline-failures">Common Causes of Packer Pipeline Failures</h2>

<h3 id="base-image-changes">Base Image Changes</h3>

<p>Using a mutable tag like <code class="language-plaintext highlighter-rouge">latest</code> can cause builds to break when the upstream image changes.</p>

<h3 id="provisioner-timing-issues">Provisioner Timing Issues</h3>

<p>Examples include:</p>

<p>apt lock conflicts<br />
services not ready<br />
reboots not handled correctly</p>

<p>These often create flaky builds.</p>

<h3 id="missing-credentials">Missing Credentials</h3>

<p>Many templates rely on pipeline‑injected secrets through environment variables.</p>

<p>If those variables are not injected correctly, authentication errors occur.</p>

<hr />

<h2 id="where-packer-fits-in-modern-infrastructure">Where Packer Fits in Modern Infrastructure</h2>

<p>In many modern platforms Packer is no longer used for traditional application servers.</p>

<p>Instead it typically produces:</p>

<p>• hardened base operating systems<br />
• platform VM images<br />
• AKS node base images<br />
• enterprise baseline machine images</p>

<p>It becomes the <strong>first stage of the infrastructure pipeline</strong>, producing standardized images that downstream systems deploy.</p>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>To understand most Packer templates quickly, scan these sections:</p>

<p>Builder / Source<br />
Base Image<br />
Provisioners<br />
Communicator<br />
Image Destination<br />
Variables</p>

<p>With those pieces you can usually reconstruct the entire pipeline in under a minute.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="DevOps" /><category term="DevOps" /><category term="Packer" /><category term="Image Pipelines" /><category term="Infrastructure as Code" /><summary type="html"><![CDATA[⚡ How to Read a Packer Template in 60 Seconds]]></summary></entry><entry><title type="html">Infrastructure in 60 Seconds — How to Read a CloudFormation Template Like a Pro</title><link href="https://maungsan.github.io/aws/2026/03/04/aws-read-cloudformation-template-60-seconds/" rel="alternate" type="text/html" title="Infrastructure in 60 Seconds — How to Read a CloudFormation Template Like a Pro" /><published>2026-03-04T00:00:00-08:00</published><updated>2026-03-04T00:00:00-08:00</updated><id>https://maungsan.github.io/aws/2026/03/04/aws-read-cloudformation-template-60-seconds</id><content type="html" xml:base="https://maungsan.github.io/aws/2026/03/04/aws-read-cloudformation-template-60-seconds/"><![CDATA[<h2 id="infrastructure-in-60-seconds--how-to-read-a-cloudformation-template-like-a-pro">Infrastructure in 60 Seconds — How to Read a CloudFormation Template Like a Pro</h2>

<p>Large CloudFormation templates can easily reach hundreds or thousands of lines. Reading them sequentially rarely helps when the real goal is to understand what infrastructure is actually being created.</p>

<p>Experienced engineers treat a CloudFormation template as a <strong>dependency graph</strong>, not a text document. The objective is to quickly reconstruct:</p>

<p>• what infrastructure exists<br />
• how resources depend on each other<br />
• which parts are configurable<br />
• which outputs other systems consume</p>

<p>Once that mental model is clear, the rest of the template becomes easier to navigate.</p>

<hr />

<h2 id="step-1--identify-the-stack-purpose">Step 1 — Identify the Stack Purpose</h2>

<p>Start by scanning the template description and top-level structure.</p>

<p>CloudFormation templates usually follow this structure:</p>

<p>Parameters<br />
Mappings<br />
Conditions<br />
Resources<br />
Outputs</p>

<p>Immediately jump to <strong>Resources</strong> to understand what the stack actually builds. Everything else mostly controls behavior around those resources.</p>

<hr />

<h2 id="step-2--scan-the-resource-types-first">Step 2 — Scan the Resource Types First</h2>

<p>Look for:</p>

<p>Type: AWS::</p>

<p>Examples:</p>

<p>AWS::EC2::Instance<br />
AWS::EC2::VPC<br />
AWS::RDS::DBInstance<br />
AWS::Lambda::Function<br />
AWS::EKS::Cluster</p>

<p>Experienced engineers scan resource types before reading properties because resource types quickly reveal the <strong>architectural layer</strong> of the stack.</p>

<p>Examples:</p>

<p>VPC, Subnet, RouteTable<br />
→ networking layer</p>

<p>ECS Service, Lambda, EKS<br />
→ compute layer</p>

<p>S3, DynamoDB, RDS<br />
→ data layer</p>

<p>IAM Role, Policy<br />
→ identity layer</p>

<p>Within a few seconds you can determine whether the template defines:</p>

<p>• foundational infrastructure<br />
• application platform resources<br />
• a single service deployment</p>

<hr />

<h2 id="step-3--identify-the-primary-resource">Step 3 — Identify the Primary Resource</h2>

<p>Most templates revolve around one main resource and supporting infrastructure.</p>

<p>Examples:</p>

<p>EKS cluster template<br />
→ main resource: AWS::EKS::Cluster</p>

<p>Lambda stack<br />
→ main resource: AWS::Lambda::Function</p>

<p>VPC stack<br />
→ main resource: AWS::EC2::VPC</p>

<p>Supporting resources often include:</p>

<p>security groups<br />
roles<br />
logging resources<br />
network attachments</p>

<p>Identifying the primary resource tells you <strong>why the stack exists</strong>.</p>

<hr />

<h2 id="step-4--look-for-implicit-dependency-signals">Step 4 — Look for Implicit Dependency Signals</h2>

<p>CloudFormation builds a dependency graph automatically.</p>

<p>Key signals include:</p>

<p>Ref<br />
Fn::GetAtt<br />
DependsOn</p>

<p>Examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ref: MySecurityGroup

Fn::GetAtt:
  - MyLoadBalancer
  - DNSName
</code></pre></div></div>

<p>These references tell you how resources connect.</p>

<p>Example mental model:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Load Balancer
↓
Target Group
↓
Auto Scaling Group
↓
EC2 instances
</code></pre></div></div>

<p>You are reconstructing the <strong>resource graph</strong>, not reading YAML.</p>

<hr />

<h2 id="step-5--check-parameters-for-external-control">Step 5 — Check Parameters for External Control</h2>

<p>Parameters define how the template is controlled by users, pipelines, or higher-level stacks.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Parameters:
  Environment:
    Type: String

  InstanceType:
    Type: String
</code></pre></div></div>

<p>Signals to extract quickly:</p>

<p>• environment configuration<br />
• instance sizing<br />
• networking inputs<br />
• external resource identifiers</p>

<p>Large parameter sets usually indicate the template supports <strong>multiple environments or deployment modes</strong>.</p>

<hr />

<h2 id="step-6--check-conditions-for-deployment-variants">Step 6 — Check Conditions for Deployment Variants</h2>

<p>Conditions allow templates to deploy different infrastructure depending on environment.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Conditions:
  IsProduction: !Equals [ !Ref Environment, prod ]
</code></pre></div></div>

<p>These often control:</p>

<p>optional logging systems<br />
high-availability resources<br />
multi-AZ behavior<br />
monitoring stacks</p>

<p>Conditional resources usually signal <strong>environment-aware infrastructure</strong>.</p>
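<p>A condition takes effect when attached to a resource. A hedged sketch — the log group is a hypothetical example, not part of any specific template:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Conditions:
  IsProduction: !Equals [ !Ref Environment, prod ]

Resources:
  AuditLogGroup:
    Type: AWS::Logs::LogGroup
    Condition: IsProduction
    Properties:
      RetentionInDays: 365
</code></pre></div></div>

<p>When the stack is deployed with any other environment value, the resource is simply not created.</p>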

<hr />

<h2 id="step-7--scan-outputs-to-understand-stack-integration">Step 7 — Scan Outputs to Understand Stack Integration</h2>

<p>Outputs show what the stack exposes to other stacks.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Outputs:
  VpcId:
    Value: !Ref VPC

  ClusterEndpoint:
    Value: !GetAtt EKSCluster.Endpoint
</code></pre></div></div>

<p>Outputs usually reveal:</p>

<p>network identifiers<br />
service endpoints<br />
resource ARNs</p>

<p>These values often feed into:</p>

<p>Terraform deployments<br />
CloudFormation nested stacks<br />
CI/CD pipelines</p>

<p>Outputs tell you how this stack fits into the <strong>larger system architecture</strong>.</p>
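<p>Cross-stack consumption usually goes through exports. A hedged sketch — the export name is illustrative:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Producing stack
Outputs:
  VpcId:
    Value: !Ref VPC
    Export:
      Name: platform-vpc-id

# Consuming stack (inside a resource's Properties)
  VpcId: !ImportValue platform-vpc-id
</code></pre></div></div>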

<hr />

<h2 id="step-8--watch-for-nested-stacks">Step 8 — Watch for Nested Stacks</h2>

<p>Some templates include:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Type: AWS::CloudFormation::Stack
</code></pre></div></div>

<p>This indicates a <strong>nested stack architecture</strong>.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>network stack
↓
security stack
↓
application stack
</code></pre></div></div>

<p>Nested stacks usually mean the infrastructure is split into logical layers.</p>

<hr />

<h2 id="reconstruct-the-architecture">Reconstruct the Architecture</h2>

<p>After scanning these signals, you should be able to build a mental model quickly.</p>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VPC stack
    ↓
subnets and route tables
    ↓
security groups
    ↓
EKS cluster
    ↓
node groups
    ↓
application workloads
</code></pre></div></div>

<p>You now understand the <strong>core architecture</strong> without reading every property.</p>

<hr />

<h2 id="signals-that-a-cloudformation-template-is-complex">Signals That a CloudFormation Template Is Complex</h2>

<p>Experienced engineers slow down when they see:</p>

<ul>
  <li>very large parameter sets</li>
  <li>many conditions controlling resources</li>
  <li>heavy use of intrinsic functions</li>
  <li>deep nested stacks</li>
  <li>large IAM policy blocks</li>
</ul>

<p>These signals indicate the template supports multiple deployment patterns or complex environments.</p>

<hr />

<h2 id="key-takeaway">Key Takeaway</h2>

<p>To understand a CloudFormation template quickly, scan in this order:</p>

<ol>
  <li>resource types</li>
  <li>primary resource</li>
  <li>dependency references</li>
  <li>parameters</li>
  <li>conditions</li>
  <li>outputs</li>
  <li>nested stacks</li>
</ol>

<p>This approach reconstructs the infrastructure graph quickly without reading the entire template.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="AWS" /><category term="Infrastructure in 60 Seconds" /><category term="CloudFormation" /><category term="Infrastructure as Code" /><summary type="html"><![CDATA[Infrastructure in 60 Seconds — How to Read a CloudFormation Template Like a Pro]]></summary></entry><entry><title type="html">Kubernetes Resource Isolation - 14. A Catalog of Cluster Design Patterns</title><link href="https://maungsan.github.io/kubernetes/containers/kubernetes%20resource%20isolation/2025/10/17/k8s-iso-14/" rel="alternate" type="text/html" title="Kubernetes Resource Isolation - 14. A Catalog of Cluster Design Patterns" /><published>2025-10-17T00:00:00-07:00</published><updated>2025-10-17T00:00:00-07:00</updated><id>https://maungsan.github.io/kubernetes/containers/kubernetes%20resource%20isolation/2025/10/17/k8s-iso-14</id><content type="html" xml:base="https://maungsan.github.io/kubernetes/containers/kubernetes%20resource%20isolation/2025/10/17/k8s-iso-14/"><![CDATA[<p><strong>Segment 14</strong> is a catalog of <strong>cluster design patterns</strong> you can combine:</p>

<ul>
  <li>How to slice the cluster into <strong>node pools</strong></li>
  <li>How to slice workloads via <strong>namespaces, tenants, and QoS</strong></li>
  <li>How to use <strong>taints/tolerations, priority classes, PDBs, and topology</strong> to control behavior</li>
  <li>When to make <strong>more clusters vs fewer clusters</strong></li>
</ul>

<p>I’ll keep each pattern fairly tight so you can remix them.</p>

<hr />

<h2 id="1-node-pool-segmentation-patterns">1. Node Pool Segmentation Patterns</h2>

<h3 id="11-general-vs-specialized-pools">1.1 <strong>General vs Specialized Pools</strong></h3>

<p><strong>Pattern:</strong></p>

<ul>
  <li><strong>general-pool</strong> for 80–90% of workloads</li>
  <li>
    <p>One or more <strong>specialized pools</strong>:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">perf</code> (CPUManager, TopologyManager)</li>
      <li><code class="language-plaintext highlighter-rouge">gpu</code></li>
      <li><code class="language-plaintext highlighter-rouge">batch</code></li>
      <li><code class="language-plaintext highlighter-rouge">db</code> or <code class="language-plaintext highlighter-rouge">stateful</code></li>
    </ul>
  </li>
</ul>

<p><strong>Mechanics:</strong></p>

<ul>
  <li>
    <p>Labels:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl label node node-1 node-pool<span class="o">=</span>general
kubectl label node node-2 node-pool<span class="o">=</span>perf
</code></pre></div>    </div>
  </li>
  <li>
    <p>Taints on special pools:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl taint node node-2 perf-only<span class="o">=</span><span class="nb">true</span>:NoSchedule
</code></pre></div>    </div>
  </li>
  <li>
    <p>Workload spec:</p>

    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">nodeSelector</span><span class="pi">:</span>
  <span class="na">node-pool</span><span class="pi">:</span> <span class="s">perf</span>
<span class="na">tolerations</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s2">"</span><span class="s">perf-only"</span>
    <span class="na">operator</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Exists"</span>
    <span class="na">effect</span><span class="pi">:</span> <span class="s2">"</span><span class="s">NoSchedule"</span>
</code></pre></div>    </div>
  </li>
</ul>

<p><strong>When to use:</strong> almost always. This is the baseline pattern.</p>

<hr />

<h3 id="12-horizontal-isolation-by-noisy-class">1.2 <strong>Horizontal Isolation by “Noisy Class”</strong></h3>

<p>Separate node pools for:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">system</code> (CNI, CSI, metrics, logging)</li>
  <li><code class="language-plaintext highlighter-rouge">user-apps</code></li>
  <li><code class="language-plaintext highlighter-rouge">noisy-batch</code> (Spark, ETL, big cronjobs)</li>
</ul>

<p><strong>Idea:</strong>
Keep noisy, spiky workloads from contaminating general services.</p>

<p><strong>Mechanics:</strong></p>

<ul>
  <li>
    <p>System DaemonSets:</p>

    <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">nodeSelector</span><span class="pi">:</span>
  <span class="na">node-role.kubernetes.io/system</span><span class="pi">:</span> <span class="s2">"</span><span class="s">true"</span>
</code></pre></div>    </div>
  </li>
  <li>
    <p>Batch node pool tainted:</p>

    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl taint node batch-pool batch-only<span class="o">=</span><span class="nb">true</span>:NoSchedule
</code></pre></div>    </div>
  </li>
</ul>

<hr />

<h3 id="13-costhardware-pools">1.3 <strong>Cost/Hardware Pools</strong></h3>

<p>Pools by machine type:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">spot</code> or <code class="language-plaintext highlighter-rouge">preemptible</code></li>
  <li><code class="language-plaintext highlighter-rouge">standard</code></li>
  <li><code class="language-plaintext highlighter-rouge">high-mem</code></li>
  <li><code class="language-plaintext highlighter-rouge">ssd-local</code></li>
</ul>

<p>Use them like:</p>

<ul>
  <li>Non-critical workers → <code class="language-plaintext highlighter-rouge">spot</code></li>
  <li>Latency-critical → <code class="language-plaintext highlighter-rouge">standard</code></li>
  <li>Memory-heavy → <code class="language-plaintext highlighter-rouge">high-mem</code></li>
  <li>Spark/Redis → <code class="language-plaintext highlighter-rouge">ssd-local</code></li>
</ul>

<p><strong>Key:</strong>
Every pool has labels &amp; taints; workloads choose via <code class="language-plaintext highlighter-rouge">nodeSelector</code> / <code class="language-plaintext highlighter-rouge">nodeAffinity</code> + tolerations.</p>

<hr />

<h2 id="2-namespace--tenant-patterns">2. Namespace &amp; Tenant Patterns</h2>

<h3 id="21-namespace-per-team--namespace-per-product">2.1 <strong>Namespace-per-team / namespace-per-product</strong></h3>

<p><strong>Pattern:</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">team-a-dev</code>, <code class="language-plaintext highlighter-rouge">team-a-prod</code></li>
  <li><code class="language-plaintext highlighter-rouge">product-x-dev</code>, <code class="language-plaintext highlighter-rouge">product-x-prod</code></li>
</ul>

<p><strong>Controls per namespace:</strong></p>

<ul>
  <li><strong>ResourceQuota</strong></li>
  <li><strong>LimitRange</strong></li>
  <li><strong>NetworkPolicy</strong></li>
  <li><strong>RBAC</strong></li>
</ul>

<p>Example:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ResourceQuota</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">team-a-quota</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">team-a-prod</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">hard</span><span class="pi">:</span>
    <span class="na">requests.cpu</span><span class="pi">:</span> <span class="s2">"</span><span class="s">40"</span>
    <span class="na">requests.memory</span><span class="pi">:</span> <span class="s2">"</span><span class="s">80Gi"</span>
    <span class="na">limits.cpu</span><span class="pi">:</span> <span class="s2">"</span><span class="s">80"</span>
    <span class="na">limits.memory</span><span class="pi">:</span> <span class="s2">"</span><span class="s">160Gi"</span>
    <span class="na">pods</span><span class="pi">:</span> <span class="s2">"</span><span class="s">200"</span>
</code></pre></div></div>

<p><strong>When to use:</strong>
Multi-team clusters, platform teams serving app teams.</p>

<hr />

<h3 id="22-soft-multi-tenancy-vs-hard-multi-tenancy">2.2 <strong>Soft Multi-Tenancy vs Hard Multi-Tenancy</strong></h3>

<ul>
  <li><strong>Soft</strong>: Same cluster, tenants isolated via namespaces, quotas, network policies, and RBAC. This is what most enterprises run.</li>
  <li><strong>Hard</strong>: Separate clusters per tenant or per BU, sometimes separate accounts/subscriptions.</li>
</ul>

<p><strong>Rules of thumb:</strong></p>

<ul>
  <li>If tenants can be semi-trusted &amp; share infra → soft.</li>
  <li>If you need strong isolation / different compliance regimes / noisy security boundaries → multiple clusters.</li>
</ul>

<hr />

<h2 id="3-workload-admission--qos-patterns">3. Workload Admission &amp; QoS Patterns</h2>

<h3 id="31-enforce-requests--limits-via-policy">3.1 <strong>Enforce Requests &amp; Limits via Policy</strong></h3>

<p>Use an admission policy (OPA/Gatekeeper, Kyverno, or built-in ValidatingAdmissionPolicy) to:</p>

<ul>
  <li>Reject Pods without <code class="language-plaintext highlighter-rouge">resources.requests</code> &amp; <code class="language-plaintext highlighter-rouge">resources.limits</code></li>
  <li>Forbid BestEffort except for <code class="language-plaintext highlighter-rouge">debug</code> namespaces</li>
  <li>Enforce max/min resource sizes per namespace</li>
</ul>

<p><strong>Pattern:</strong></p>

<ul>
  <li><strong>Default</strong>: require at least <code class="language-plaintext highlighter-rouge">requests</code> and <code class="language-plaintext highlighter-rouge">limits.memory</code>.</li>
  <li><strong>Exception</strong>: special <code class="language-plaintext highlighter-rouge">allow-bursty</code> namespace.</li>
</ul>
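<p>As one possible implementation, a Kyverno policy along these lines could enforce the default rule (a sketch modeled on Kyverno's sample require-requests-limits policy; verify the pattern syntax against the Kyverno version you run):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"      # any non-empty value
                    memory: "?*"
                  limits:
                    memory: "?*"
</code></pre></div></div>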

<hr />

<h3 id="32-priority-classes-for-slo-layers">3.2 <strong>Priority Classes for SLO Layers</strong></h3>

<p>Define <strong>PriorityClasses</strong> like:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">system-critical</code>   (CNI, kube-dns)</li>
  <li><code class="language-plaintext highlighter-rouge">platform-critical</code> (ingress, logging, metrics)</li>
  <li><code class="language-plaintext highlighter-rouge">business-critical</code> (user-facing prod services)</li>
  <li><code class="language-plaintext highlighter-rouge">batch</code>             (ETL, reports)</li>
  <li><code class="language-plaintext highlighter-rouge">best-effort</code>       (preemptible stuff)</li>
</ul>

<p>Example:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">scheduling.k8s.io/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">PriorityClass</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">business-critical</span>
<span class="na">value</span><span class="pi">:</span> <span class="m">900</span>
<span class="na">globalDefault</span><span class="pi">:</span> <span class="no">false</span>
</code></pre></div></div>

<p>Use in Pod spec:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">priorityClassName</span><span class="pi">:</span> <span class="s">business-critical</span>
</code></pre></div></div>

<p><strong>Behavior:</strong></p>

<ul>
  <li>On resource pressure, lower-priority Pods get evicted first.</li>
  <li>Scheduler gives high-priority workloads first dibs on resources.</li>
</ul>

<hr />

<h3 id="33-poddisruptionbudget-pdb--autoscaling">3.3 <strong>PodDisruptionBudget (PDB) + Autoscaling</strong></h3>

<p>Pattern:</p>

<ul>
  <li>For every <strong>stateful</strong> or <strong>important stateless</strong> workload, define PDB:</li>
</ul>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-api
</code></pre></div></div>

<p>Combine with:</p>

<ul>
  <li><strong>HPA</strong> for scale-out</li>
  <li><strong>Cluster Autoscaler / Karpenter</strong> for node scale-out</li>
</ul>

<p>This gives:</p>

<ul>
  <li>Safe rollouts</li>
  <li>Safe node drain / spot preemption</li>
  <li>Enough replicas for resilience</li>
</ul>
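<p>A minimal HPA to pair with the PDB might look like this sketch (the target name and thresholds are illustrative):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 3           # keep headroom above the PDB's minAvailable
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
</code></pre></div></div>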

<hr />

<h2 id="4-topology--failure-domain-patterns">4. Topology &amp; Failure-Domain Patterns</h2>

<h3 id="41-spread-across-zones--nodes">4.1 <strong>Spread Across Zones / Nodes</strong></h3>

<p>Use <strong>topology spread constraints</strong> or anti-affinity:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">topologySpreadConstraints</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">maxSkew</span><span class="pi">:</span> <span class="m">1</span>
    <span class="na">topologyKey</span><span class="pi">:</span> <span class="s">topology.kubernetes.io/zone</span>
    <span class="na">whenUnsatisfiable</span><span class="pi">:</span> <span class="s">ScheduleAnyway</span>
    <span class="na">labelSelector</span><span class="pi">:</span>
      <span class="na">matchLabels</span><span class="pi">:</span>
        <span class="na">app</span><span class="pi">:</span> <span class="s">my-api</span>
</code></pre></div></div>

<p>Or simpler:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">affinity</span><span class="pi">:</span>
  <span class="na">podAntiAffinity</span><span class="pi">:</span>
    <span class="na">preferredDuringSchedulingIgnoredDuringExecution</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">weight</span><span class="pi">:</span> <span class="m">100</span>
        <span class="na">podAffinityTerm</span><span class="pi">:</span>
          <span class="na">topologyKey</span><span class="pi">:</span> <span class="s">kubernetes.io/hostname</span>
          <span class="na">labelSelector</span><span class="pi">:</span>
            <span class="na">matchLabels</span><span class="pi">:</span>
              <span class="na">app</span><span class="pi">:</span> <span class="s">my-api</span>
</code></pre></div></div>

<p>Goal:
Avoid all replicas landing on the same node or in the same AZ.</p>

<hr />

<h3 id="42-zone-aware-node-pools">4.2 <strong>Zone-aware Node Pools</strong></h3>

<p>Per cloud:</p>

<ul>
  <li>Separate node pools per AZ</li>
  <li>Label nodes with zone</li>
  <li>Use <code class="language-plaintext highlighter-rouge">topologySpreadConstraints</code> to distribute workloads evenly</li>
</ul>

<p>This prevents:</p>

<ul>
  <li>All traffic going through a single zone</li>
  <li>Single-AZ outages taking entire app down</li>
</ul>

<hr />

<h2 id="5-security--network-isolation-patterns">5. Security &amp; Network Isolation Patterns</h2>

<h3 id="51-zero-trust-by-default-networkpolicy">5.1 <strong>Zero-Trust-by-default NetworkPolicy</strong></h3>

<p>Base policy in each namespace:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">networking.k8s.io/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">NetworkPolicy</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">default-deny</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">podSelector</span><span class="pi">:</span> <span class="pi">{}</span>
  <span class="na">policyTypes</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">Ingress</span>
    <span class="pi">-</span> <span class="s">Egress</span>
</code></pre></div></div>

<p>Then explicit “allow” policies for:</p>

<ul>
  <li>namespace-local communication</li>
  <li>calls to specific backends (DBs, APIs)</li>
  <li>calls to observability stack</li>
</ul>

<p><strong>Pattern:</strong>
No ingress/egress allowed by default → everything opt-in.</p>

<hr />

<h3 id="52-security-boundary-namespaces">5.2 <strong>Security Boundary Namespaces</strong></h3>

<p>For particularly sensitive apps, combine:</p>

<ul>
  <li>Dedicated namespace</li>
  <li>Dedicated node pool (taints)</li>
  <li>Strict <code class="language-plaintext highlighter-rouge">NetworkPolicy</code></li>
  <li>Stricter <code class="language-plaintext highlighter-rouge">PodSecurity</code> / PSP replacement (restricted baseline)</li>
  <li>Separate secrets store (external KMS, Vault, AKV, etc.)</li>
</ul>

<p>This is a <strong>cluster-within-a-cluster</strong> pattern.</p>
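<p>The Pod Security piece of this pattern can be applied with the standard Pod Security Admission labels on the namespace (the namespace name is illustrative):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: Namespace
metadata:
  name: payments-secure    # hypothetical sensitive-app namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
</code></pre></div></div>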

<hr />

<h2 id="6-multi-cluster-patterns">6. Multi-Cluster Patterns</h2>

<h3 id="61-env-tier-clusters">6.1 <strong>Env-tier Clusters</strong></h3>

<p>One of the most common:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">prod</code> cluster(s)</li>
  <li><code class="language-plaintext highlighter-rouge">nonprod</code> cluster(s) (dev/uat/stage)</li>
</ul>

<p>Sometimes:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">prod-us</code>, <code class="language-plaintext highlighter-rouge">prod-eu</code> (data residency)</li>
</ul>

<p><strong>Pros:</strong></p>

<ul>
  <li>Strong blast-radius isolation</li>
  <li>Simple mental model: “prod is sacred”</li>
</ul>

<p><strong>Cons:</strong></p>

<ul>
  <li>More control-plane overhead</li>
  <li>You need a GitOps story that understands multiple clusters (ArgoCD, Flux).</li>
</ul>

<hr />

<h3 id="62-function-based-clusters">6.2 <strong>Function-based Clusters</strong></h3>

<p>Patterns like:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">core-platform</code> cluster (ingress, observability, shared platform services)</li>
  <li><code class="language-plaintext highlighter-rouge">app-tenant</code> clusters for main product lines</li>
  <li><code class="language-plaintext highlighter-rouge">data</code> cluster for Kafka/Spark/Cassandra</li>
</ul>

<p>This is helpful if:</p>

<ul>
  <li>Data-plane loads are wildly different than API-plane loads</li>
  <li>Observability stack is heavy and you want to isolate it</li>
</ul>

<hr />

<h2 id="7-putting-it-together--example-design">7. Putting It Together – Example Design</h2>

<p>Here’s a concrete <strong>cluster design pattern</strong> you can adapt:</p>

<h3 id="clusters">Clusters</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">corp-nonprod</code></li>
  <li><code class="language-plaintext highlighter-rouge">corp-prod</code></li>
</ul>

<h3 id="node-pools-in-each-cluster">Node Pools in each cluster</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">system</code> (small, stable, for CNI/CSI/monitoring)</li>
  <li><code class="language-plaintext highlighter-rouge">general</code> (default microservice nodes, D/E/m6i/n2)</li>
  <li><code class="language-plaintext highlighter-rouge">perf</code> (CPUManager+TopologyManager, latency/cpu-critical)</li>
  <li><code class="language-plaintext highlighter-rouge">batch</code> (cheaper, spot, larger nodes)</li>
  <li><code class="language-plaintext highlighter-rouge">db</code> (memory-heavy, local SSD, tainted)</li>
</ul>

<h3 id="namespaces">Namespaces</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">platform-system</code> (CNI, CSI, logging, metrics, ingress)</li>
  <li><code class="language-plaintext highlighter-rouge">platform-observability</code> (Prometheus, Loki, Tempo, etc.)</li>
  <li><code class="language-plaintext highlighter-rouge">team-a-dev</code>, <code class="language-plaintext highlighter-rouge">team-a-prod</code></li>
  <li><code class="language-plaintext highlighter-rouge">team-b-dev</code>, <code class="language-plaintext highlighter-rouge">team-b-prod</code></li>
  <li><code class="language-plaintext highlighter-rouge">shared-services</code> (auth, messaging, etc.)</li>
</ul>

<h3 id="controls">Controls</h3>

<ul>
  <li>ResourceQuota + LimitRange per team namespace</li>
  <li>NetworkPolicy default-deny per namespace</li>
  <li>
    <p>PriorityClasses:</p>

    <ul>
      <li><code class="language-plaintext highlighter-rouge">system-critical</code></li>
      <li><code class="language-plaintext highlighter-rouge">platform-critical</code></li>
      <li><code class="language-plaintext highlighter-rouge">business-critical</code></li>
      <li><code class="language-plaintext highlighter-rouge">batch-low</code></li>
    </ul>
  </li>
</ul>

<h3 id="scheduling-hints">Scheduling hints</h3>

<ul>
  <li>Platform &amp; observability → <code class="language-plaintext highlighter-rouge">system</code> &amp; <code class="language-plaintext highlighter-rouge">general</code> pools</li>
  <li>Latency-critical apps → <code class="language-plaintext highlighter-rouge">perf</code> pool (Guaranteed, pinned CPUs)</li>
  <li>Spark jobs → <code class="language-plaintext highlighter-rouge">batch</code> pool (spot, large nodes, local SSD)</li>
  <li>Redis/DB → <code class="language-plaintext highlighter-rouge">db</code> pool (memory-heavy, local SSD)</li>
</ul>

<hr />

<h2 id="8-quick-design-checklist">8. Quick design checklist</h2>

<p>When you design or refactor a cluster, ask:</p>

<ol>
  <li><strong>Do I have at least two node pools?</strong> (general + something else)</li>
  <li><strong>Are system components isolated or competing with apps?</strong></li>
  <li><strong>Do teams have clear namespace boundaries, quotas, and limits?</strong></li>
  <li><strong>Are BestEffort workloads controlled or confined?</strong></li>
  <li><strong>Do I have PriorityClasses &amp; PDBs for production services?</strong></li>
  <li><strong>Are workloads spread across zones and nodes?</strong></li>
  <li><strong>Do sensitive workloads have network &amp; node isolation?</strong></li>
  <li><strong>Do I need multiple clusters for prod vs nonprod or for legal isolation?</strong></li>
</ol>

<p>If the answer to most of these is “yes”, you’re in <strong>serious platform-engineering territory</strong> already.</p>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Kubernetes" /><category term="Containers" /><category term="Kubernetes Resource Isolation" /><category term="cgroups" /><category term="kubelet" /><category term="resource isolation" /><category term="container runtime" /><category term="systemd" /><category term="cpu" /><category term="memory" /><category term="pod" /><category term="QoS" /><category term="kubepods" /><category term="CRI" /><category term="containerd" /><category term="CRI-O" /><category term="scheduling" /><category term="node allocatable" /><category term="overcommit" /><category term="bin-packing" /><summary type="html"><![CDATA[Segment 14 is a catalog of cluster design patterns you can combine.]]></summary></entry><entry><title type="html">Kubernetes Resource Isolation - 12. Ultimate Node Sizing Guide for AKS, EKS, and GKE</title><link href="https://maungsan.github.io/kubernetes/containers/kubernetes%20resource%20isolation/2025/10/16/k8s-iso-12/" rel="alternate" type="text/html" title="Kubernetes Resource Isolation - 12. Ultimate Node Sizing Guide for AKS, EKS, and GKE" /><published>2025-10-16T00:00:00-07:00</published><updated>2025-10-16T00:00:00-07:00</updated><id>https://maungsan.github.io/kubernetes/containers/kubernetes%20resource%20isolation/2025/10/16/k8s-iso-12</id><content type="html" xml:base="https://maungsan.github.io/kubernetes/containers/kubernetes%20resource%20isolation/2025/10/16/k8s-iso-12/"><![CDATA[<p><strong>Segment 12</strong> is where we get <em>extremely practical</em> about selecting the right node sizes and VM shapes in AKS/EKS/GKE. This is one of the <strong>most important but least understood</strong> aspects of Kubernetes performance engineering.</p>

<p>Choosing the wrong node size leads to:</p>

<ul>
  <li>Constant evictions</li>
  <li>Memory pressure</li>
  <li>CPU throttling</li>
  <li>NUMA imbalance</li>
  <li>Poor inference latency</li>
  <li>Overpaying for unused cores</li>
  <li>Underpowered control-plane components (CNIs, CSI, monitoring agents)</li>
</ul>

<p>This guide will help you select the <strong>best node types</strong> for:</p>

<ul>
  <li>Microservices</li>
  <li>JVM workloads</li>
  <li>High-throughput services</li>
  <li>Dataplanes (Cilium, Envoy)</li>
  <li>Redis, Postgres</li>
  <li>AI/ML</li>
  <li>Spark</li>
  <li>GPU workloads</li>
</ul>

<p>Let’s go deep.</p>

<hr />

<h1 id="segment-12--ultimate-node-sizing-guide-for-aks-eks-and-gke"><strong>SEGMENT 12 — Ultimate Node Sizing Guide for AKS, EKS, and GKE</strong></h1>

<p>We will cover:</p>

<ol>
  <li>The principles for choosing node sizes</li>
  <li>CPU-to-memory ratios that actually work</li>
  <li>Understanding NUMA (critical!)</li>
  <li>Choosing VM families in each cloud (AKS/EKS/GKE)</li>
  <li>Node sizes for different workload types</li>
  <li>When to use large nodes vs many small nodes</li>
  <li>When to use local SSD</li>
  <li>Cost optimization rules</li>
</ol>

<hr />

<h1 id="part-1--principles-of-good-node-sizing"><strong>PART 1 — Principles of Good Node Sizing</strong></h1>

<p>These are universal across AKS/EKS/GKE.</p>

<h2 id="1-memory-pressure-kills-nodes--not-cpu"><strong>1. Memory pressure kills nodes — not CPU</strong></h2>

<p>Always design node capacity with memory as <em>primary constraint</em>.</p>

<p>Nodes rarely fail from high CPU usage.
Nodes frequently fail from memory exhaustion → eviction → OOM → kubelet death → NotReady.</p>

<h2 id="2-numa-topology-heavily-affects-performance"><strong>2. NUMA topology heavily affects performance</strong></h2>

<p>Nodes with <strong>≥ 2 sockets or ≥ 2 NUMA nodes</strong> require careful placement.</p>

<ul>
  <li>JVM</li>
  <li>Redis</li>
  <li>AI inference</li>
  <li>network dataplanes</li>
</ul>

<p>These workloads <strong>cannot</strong> randomly bounce across NUMA nodes.</p>

<p>Prefer:</p>

<ul>
  <li><strong>single-NUMA nodes for latency-sensitive workloads</strong></li>
</ul>

<h2 id="3-avoid-nodes-with--64-vcpus-unless-you-use-pinned-cpu-workloads"><strong>3. Avoid nodes with &gt; 64 vCPUs unless you use pinned CPU workloads</strong></h2>

<p>Large nodes → more NUMA topology → more cgroup fragmentation → lower efficiency.</p>

<h2 id="4-prefer-more-medium-nodes-over-fewer-huge-nodes"><strong>4. Prefer more medium nodes over fewer huge nodes</strong></h2>

<ul>
  <li>reduces blast radius</li>
  <li>avoids multi-Pod NUMA fragmentation</li>
  <li>improves bin packing</li>
  <li>reduces eviction chain reactions</li>
</ul>

<h2 id="5-always-leave-space-for-system-daemons"><strong>5. Always leave space for system daemons</strong></h2>

<p>Rule of thumb:</p>

<ul>
  <li>Reserve <strong>6–12% of node memory</strong></li>
  <li>Reserve <strong>0.5–1.5 vCPU</strong> for system/kube daemons</li>
</ul>
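<p>On clusters where you control kubelet configuration, those reservations can be expressed like this sketch (the values are illustrative starting points, not tuned recommendations; managed AKS/EKS/GKE node pools usually set their own defaults):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:            # OS daemons (sshd, systemd, etc.)
  cpu: 500m
  memory: 1Gi
kubeReserved:              # kubelet, container runtime
  cpu: 500m
  memory: 1Gi
evictionHard:              # kubelet evicts Pods below this threshold
  memory.available: "500Mi"
</code></pre></div></div>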

<hr />

<h1 id="part-2--recommended-cpu--memory-ratios"><strong>PART 2 — Recommended CPU : Memory Ratios</strong></h1>

<p>Use these ratios as starting points:</p>

<table>
  <thead>
    <tr>
      <th>Workload Type</th>
      <th>Recommended Ratio</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Stateless microservices (Go, Node, Python)</td>
      <td><strong>1 vCPU : 2–4 GiB</strong></td>
    </tr>
    <tr>
      <td>JVM microservices (Spring Boot, Micronaut)</td>
      <td><strong>1 vCPU : 3–8 GiB</strong></td>
    </tr>
    <tr>
      <td>Databases (Redis, Postgres)</td>
      <td><strong>1 vCPU : 4–8 GiB</strong></td>
    </tr>
    <tr>
      <td>High-throughput dataplane (Envoy, Cilium)</td>
      <td><strong>1 vCPU : 1–2 GiB</strong></td>
    </tr>
    <tr>
      <td>AI Inference (CPU-heavy)</td>
      <td><strong>1 vCPU : 1–3 GiB</strong></td>
    </tr>
    <tr>
      <td>AI w/ GPU</td>
      <td>CPU not bottleneck → <strong>1 vCPU : 4–16 GiB</strong></td>
    </tr>
    <tr>
      <td>Spark/Flink executors</td>
      <td><strong>1 vCPU : 2–8 GiB</strong>, memory-bound</td>
    </tr>
  </tbody>
</table>

<hr />

<h1 id="part-3--numa-topology-explained-critical-selection-factor"><strong>PART 3 — NUMA Topology Explained (Critical Selection Factor)</strong></h1>

<h2 id="how-to-think-about-numa"><strong>How to think about NUMA:</strong></h2>

<ol>
  <li><strong>Single NUMA node</strong> = predictable, consistent latency</li>
  <li>
    <p><strong>Multiple NUMA nodes</strong> =</p>

    <ul>
      <li>Remote memory access</li>
      <li>20–80% slowdown for AI/Redis/Envoy</li>
      <li>Complex scheduling</li>
    </ul>
  </li>
</ol>

<p>Cloud providers rarely document NUMA, but here’s the real mapping:</p>

<h3 id="aws-eks-numa"><strong>AWS (EKS) NUMA</strong></h3>

<ul>
  <li><strong>m5 / c5 / r5</strong> → 1 NUMA node up to 24–32 vCPUs</li>
  <li><strong>m6i / c6i / r6i</strong> → 1 NUMA until ~32–48 vCPUs</li>
  <li><strong>m5.24xlarge / c5.24xlarge</strong> → 2 NUMA nodes</li>
</ul>

<h3 id="azure-aks-numa"><strong>Azure (AKS) NUMA</strong></h3>

<p>Azure uses “CPU groups”, but effectively:</p>

<ul>
  <li><strong>D-series, E-series</strong> → 1 NUMA up to ~32 vCPUs</li>
  <li><strong>F-series</strong> → 1 NUMA up to ~16 vCPUs</li>
  <li><strong>Lsv2</strong> → 2+ NUMA nodes (local SSD optimized)</li>
</ul>

<h3 id="gcp-gke-numa"><strong>GCP (GKE) NUMA</strong></h3>

<ul>
  <li><strong>n2-standard, e2-standard</strong> → single NUMA up to 32 vCPUs</li>
  <li><strong>n2-highmem/highcpu</strong> → single NUMA up to 48 vCPUs</li>
  <li><strong>a2 / g2 GPU nodes</strong> → big NUMA topology</li>
</ul>

<hr />

<h1 id="part-4--recommended-vm-families-per-cloud"><strong>PART 4 — Recommended VM Families Per Cloud</strong></h1>

<h2 id="aks-azure"><strong>AKS (Azure)</strong></h2>

<h3 id="-best-general-purpose-workload-nodes">⭐ Best General Purpose Workload Nodes:</h3>

<ul>
  <li><strong>D4s_v5, D8s_v5, D16s_v5</strong>: a balance of
    <ul>
      <li>memory</li>
      <li>CPU</li>
      <li>no NUMA surprises</li>
    </ul>
  </li>
</ul>

<h3 id="-best-compute-nodes">⭐ Best Compute Nodes:</h3>

<ul>
  <li><strong>F4s_v2, F8s_v2</strong>: best for
    <ul>
      <li>Cilium agents</li>
      <li>API gateways</li>
      <li>small services</li>
    </ul>
  </li>
</ul>

<p><strong>Avoid &gt; F16</strong> (NUMA segmentation).</p>

<h3 id="-best-memory-optimized">⭐ Best Memory-Optimized:</h3>

<ul>
  <li><strong>E8ds_v5, E16ds_v5, E20</strong> are ideal for:
    <ul>
      <li>Java</li>
      <li>Elasticsearch</li>
      <li>Redis</li>
    </ul>
  </li>
</ul>

<h3 id="-best-for-nvme-heavy-workloads">⭐ Best for NVMe-heavy workloads:</h3>

<ul>
  <li><strong>L8s_v3, L16s_v3</strong> for:
    <ul>
      <li>Spark</li>
      <li>batch jobs</li>
      <li>caching</li>
      <li>databases with high random IO</li>
    </ul>
  </li>
</ul>

<h3 id="-best-cpu-optimized-for-aidpdk">⭐ Best CPU-optimized for AI/DPDK:</h3>

<ul>
  <li><strong>D8as_v5, F8as_v4</strong>
(start with 8 cores to keep single NUMA)</li>
</ul>

<hr />

<h2 id="eks-aws"><strong>EKS (AWS)</strong></h2>

<h3 id="-best-general-workloads">⭐ Best general workloads:</h3>

<ul>
  <li><strong>m6i.large / xlarge / 2xlarge / 4xlarge</strong></li>
</ul>

<h3 id="-best-for-high-throughput">⭐ Best for high-throughput:</h3>

<ul>
  <li><strong>c6i.xlarge / 2xlarge</strong></li>
</ul>

<h3 id="-best-memory-heavy">⭐ Best memory-heavy:</h3>

<ul>
  <li><strong>r6i.xlarge / 2xlarge / 4xlarge</strong></li>
</ul>

<h3 id="-best-ai-cpu-side-prepost-processing">⭐ Best AI CPU-side pre/post processing:</h3>

<ul>
  <li><strong>c7g (Graviton3)</strong> — extreme performance/price</li>
  <li><strong>m7g</strong> — best balance</li>
</ul>

<h3 id="-best-with-local-ssd">⭐ Best with local SSD:</h3>

<ul>
  <li><strong>i3.xlarge / 2xlarge</strong>
(best throughput in AWS)</li>
</ul>

<h3 id="avoid">Avoid:</h3>

<ul>
  <li>m5.24xlarge</li>
  <li>c5.18xlarge
(NUMA splitting → inconsistent performance)</li>
</ul>

<hr />

<h2 id="gke-google-cloud"><strong>GKE (Google Cloud)</strong></h2>

<h3 id="-best-general-workloads-1">⭐ Best general workloads:</h3>

<ul>
  <li><strong>n2-standard-4 / 8 / 16</strong></li>
</ul>

<h3 id="-best-memory-workloads">⭐ Best memory workloads:</h3>

<ul>
  <li><strong>n2-highmem-4 / 8 / 16</strong></li>
</ul>

<h3 id="-best-cpu-heavy">⭐ Best CPU-heavy:</h3>

<ul>
  <li><strong>c2-standard-4 / 8</strong></li>
</ul>

<h3 id="-best-local-ssd">⭐ Best local SSD:</h3>

<ul>
  <li><strong>n2-standard-8 w/ Local SSD</strong></li>
</ul>

<h3 id="avoid-1">Avoid:</h3>

<ul>
  <li>n1 or older instance types</li>
  <li>Very large machine types (&gt; 64 vCPUs)</li>
</ul>

<hr />

<h1 id="part-5--node-sizes-per-workload-type"><strong>PART 5 — Node Sizes Per Workload Type</strong></h1>

<h2 id="1-microservices-go-node-python"><strong>1. Microservices (Go, Node, Python)</strong></h2>

<p>Best sizes:</p>

<ul>
  <li><strong>4 vCPU / 16 GiB</strong></li>
  <li><strong>8 vCPU / 32 GiB</strong></li>
</ul>

<p>Why:</p>

<ul>
  <li>Good bin packing</li>
  <li>No NUMA pressure</li>
  <li>Fits 10–25 Pods safely</li>
</ul>

<p>Avoid:</p>

<ul>
  <li>Very small nodes (inefficient)</li>
  <li>Very large nodes (blast radius)</li>
</ul>
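<p>A rough way to sanity-check the "10–25 Pods" claim: subtract system-reserved resources from the node, then divide by the per-Pod request. A sketch with illustrative reservation numbers (the real values come from your kubelet's <code>--system-reserved</code> and <code>--kube-reserved</code> settings):</p>

```python
def pods_per_node(node_cpu: float, node_mem_gib: float,
                  pod_cpu: float, pod_mem_gib: float,
                  reserved_cpu: float = 0.5,
                  reserved_mem_gib: float = 2.0) -> int:
    """Estimate how many Pods fit on a node after system reservations.

    The reserved_* defaults are illustrative assumptions, not actual
    kubelet values; check node allocatable on your cluster.
    """
    by_cpu = (node_cpu - reserved_cpu) // pod_cpu
    by_mem = (node_mem_gib - reserved_mem_gib) // pod_mem_gib
    return int(min(by_cpu, by_mem))

# An 8 vCPU / 32 GiB node with 0.5 vCPU / 1 GiB microservice Pods:
print(pods_per_node(8, 32, 0.5, 1.0))  # → 15 (CPU-bound)
```

<p>Run the same arithmetic against your real requests before picking a node size; whichever dimension binds first tells you whether to shift toward compute- or memory-optimized SKUs.</p>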

<hr />

<h2 id="2-jvm-apps-spring-boot-pega-kafka-clients"><strong>2. JVM Apps (Spring Boot, Pega, Kafka clients)</strong></h2>

<p>Needs:</p>

<ul>
  <li>high memory per Pod</li>
  <li>JVM heap + direct buffers</li>
</ul>

<p>Best sizes:</p>

<ul>
  <li><strong>8 vCPU / 64 GiB</strong></li>
  <li><strong>16 vCPU / 128 GiB</strong></li>
</ul>

<p>If each Pod needs 4 GiB:</p>

<ul>
  <li>a node with 64 GiB fits 10–12 Pods comfortably</li>
  <li>with headroom left for system daemons</li>
</ul>
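<p>Why only 10–12 and not 16? A JVM Pod needs more than its heap: metaspace, thread stacks and direct buffers add real overhead on top. A sketch using an assumed 25% overhead factor (tune it for your workload):</p>

```python
def jvm_pod_memory_gib(heap_gib: float,
                       overhead_fraction: float = 0.25) -> float:
    """Container memory to budget for a JVM Pod: heap plus an assumed
    overhead fraction for metaspace, thread stacks and direct buffers."""
    return heap_gib * (1 + overhead_fraction)

# A 4 GiB heap really needs about 5 GiB of container memory:
print(jvm_pod_memory_gib(4.0))  # → 5.0
```

<p>At 5 GiB per Pod, a 64 GiB node minus ~2 GiB of system reservations lands at about 12 Pods, matching the range above.</p>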

<hr />

<h2 id="3-redis--memcached"><strong>3. Redis / Memcached</strong></h2>

<p>Needs:</p>

<ul>
  <li>single NUMA node</li>
  <li>predictable CPU</li>
  <li>local SSD optional</li>
</ul>

<p>Best sizes:</p>

<ul>
  <li><strong>8 vCPU / 64 GiB</strong></li>
  <li><strong>16 vCPU / 128 GiB</strong></li>
</ul>

<p>Never deploy Redis on:</p>

<ul>
  <li>multi-NUMA 32–64 core nodes
(unless CPUs are pinned)</li>
</ul>

<hr />

<h2 id="4-envoy-proxy--api-gateway"><strong>4. Envoy Proxy / API Gateway</strong></h2>

<p>Needs:</p>

<ul>
  <li>stable CPU</li>
  <li>no throttling</li>
  <li>low jitter</li>
</ul>

<p>Best sizes:</p>

<ul>
  <li><strong>4 vCPU / 8 GiB</strong></li>
  <li><strong>8 vCPU / 16 GiB</strong></li>
</ul>

<p>Run fewer Pods per node for isolation.</p>

<hr />

<h2 id="5-aiml-inference-cpu-bound"><strong>5. AI/ML Inference (CPU-bound)</strong></h2>

<p>Needs:</p>

<ul>
  <li>NUMA alignment</li>
  <li>large memory for models</li>
  <li>predictable batching latency</li>
</ul>

<p>Best sizes:</p>

<ul>
  <li><strong>8 vCPU / 32 GiB</strong></li>
  <li><strong>16 vCPU / 64 GiB</strong></li>
</ul>

<p>With CPUManager:</p>

<ul>
  <li>Pin 4–8 CPUs exclusively for inference worker</li>
</ul>
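<p>Pinning comes from the kubelet's static CPU manager policy combined with a Guaranteed-QoS Pod (integer CPU count, requests equal to limits). A hedged sketch of the two pieces; the name <code>inference-worker</code> and the image are placeholders:</p>

```yaml
# KubeletConfiguration fragment: enable exclusive core allocation
# and keep each container's resources on one NUMA node.
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
---
# Guaranteed-QoS Pod: integer CPUs with requests == limits
# is what makes the static policy pin cores exclusively.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
  - name: worker
    image: example.com/inference:latest
    resources:
      requests:
        cpu: "8"
        memory: 32Gi
      limits:
        cpu: "8"
        memory: 32Gi
```

<p>The same pattern is what makes Redis safe on larger multi-NUMA nodes, per the warning in the Redis section above.</p>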

<hr />

<h2 id="6-aiml-with-gpu"><strong>6. AI/ML with GPU</strong></h2>

<p>CPU sizing is <em>secondary</em>.</p>

<p>Good rule:</p>

<ul>
  <li>4–6 vCPUs per GPU</li>
  <li>16–32 GiB memory per GPU</li>
</ul>

<p>Node example:</p>

<ul>
  <li>A10 GPU node → 8 vCPU / 32 GiB</li>
  <li>A100 GPU node → 32 vCPU / 128–256 GiB</li>
</ul>
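<p>The per-GPU rule of thumb composes directly into a node shape. A back-of-envelope sketch (the defaults encode the lower end of the rule; pass the upper end explicitly):</p>

```python
def gpu_node_shape(gpus: int, vcpu_per_gpu: int = 4,
                   mem_gib_per_gpu: int = 16) -> tuple[int, int]:
    """Rough CPU/memory sizing for a GPU node using the
    4-6 vCPU and 16-32 GiB per-GPU rule of thumb."""
    return gpus * vcpu_per_gpu, gpus * mem_gib_per_gpu

# An 8-GPU training node sized at the upper end of the rule:
print(gpu_node_shape(8, vcpu_per_gpu=6, mem_gib_per_gpu=32))  # → (48, 256)
```

<p>If your data loaders are CPU-hungry, raise the per-GPU numbers; otherwise keeping them low is exactly the cost lever discussed in Part 8.</p>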

<hr />

<h2 id="7-databases-postgres-mysql-elasticsearch"><strong>7. Databases (Postgres, MySQL, Elasticsearch)</strong></h2>

<p>Needs:</p>

<ul>
  <li>huge page cache</li>
  <li>high memory</li>
  <li>stable IO</li>
</ul>

<p>Best sizes:</p>

<ul>
  <li><strong>8 vCPU / 64 GiB</strong></li>
  <li><strong>16 vCPU / 128 GiB</strong></li>
</ul>

<p>With local SSD:</p>

<ul>
  <li>Lsv2 (AKS)</li>
  <li>i3/i4i (EKS)</li>
  <li>n2-standard w/ local SSD (GKE)</li>
</ul>

<p>Avoid:</p>

<ul>
  <li>memory-poor compute nodes</li>
</ul>

<hr />

<h2 id="8-spark--flink--ray"><strong>8. Spark / Flink / Ray</strong></h2>

<p>Executors need:</p>

<ul>
  <li>memory</li>
  <li>local SSD</li>
  <li>CPU bursts</li>
</ul>

<p>Best sizes:</p>

<ul>
  <li><strong>16 vCPU / 64 GiB</strong></li>
  <li><strong>32 vCPU / 128 GiB</strong></li>
  <li>with <strong>local SSD</strong></li>
</ul>

<p>Avoid:</p>

<ul>
  <li>small nodes (executor fragmentation)</li>
  <li>massive nodes (NUMA issues)</li>
</ul>
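<p>Executor fragmentation is easy to quantify: divide usable node resources by the executor shape and look at the remainder. A sketch with assumed executor sizes (it ignores kubelet reservations for simplicity):</p>

```python
def executor_fit(node_cpu: int, node_mem_gib: int,
                 exec_cpu: int, exec_mem_gib: int) -> tuple[int, int, int]:
    """Return (executors that fit, stranded vCPUs, stranded GiB)."""
    n = min(node_cpu // exec_cpu, node_mem_gib // exec_mem_gib)
    return n, node_cpu - n * exec_cpu, node_mem_gib - n * exec_mem_gib

# 5 vCPU / 20 GiB executors on a small node vs a large one:
print(executor_fit(8, 32, 5, 20))    # → (1, 3, 12): heavy waste per node
print(executor_fit(32, 128, 5, 20))  # → (6, 2, 8): far less waste per vCPU
```

<p>This is the arithmetic behind "avoid small nodes" for Spark: the stranded remainder is paid on every node in the pool.</p>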

<hr />

<h1 id="part-6--when-to-use-large-nodes-vs-small-nodes"><strong>PART 6 — When to Use Large Nodes vs Small Nodes</strong></h1>

<h2 id="use-smallmedium-nodes-16-vcpu-for">Use <strong>small/medium nodes</strong> (&lt;16 vCPU) for:</h2>

<ul>
  <li>microservices</li>
  <li>latency-sensitive workloads</li>
  <li>Cilium/Envoy</li>
  <li>Redis</li>
  <li>AI inference</li>
  <li>clusters with high Pod churn</li>
</ul>

<p>Benefits:</p>

<ul>
  <li>low blast radius</li>
  <li>easier bin packing</li>
  <li>fast autoscaling</li>
</ul>

<hr />

<h2 id="use-large-nodes-3264-vcpu-for">Use <strong>large nodes</strong> (32–64 vCPU) for:</h2>

<ul>
  <li>Spark executors</li>
  <li>Flink task managers</li>
  <li>ETL workloads</li>
  <li>AI training (multi-GPU nodes)</li>
</ul>

<hr />

<h2 id="avoid-very-large-nodes-64-vcpu-unless">Avoid <strong>very large nodes</strong> (&gt;64 vCPU) unless:</h2>

<ul>
  <li>you’re doing ML training</li>
  <li>pods are pinned to cores</li>
  <li>you fully understand NUMA management</li>
</ul>

<hr />

<h1 id="part-7--local-ssd-guidance"><strong>PART 7 — Local SSD Guidance</strong></h1>

<p>Use nodes with local SSD when:</p>

<ul>
  <li>Redis</li>
  <li>Postgres WAL/logs</li>
  <li>Spark shuffle</li>
  <li>ML preprocessing</li>
  <li>High local IO workloads</li>
</ul>

<p>Avoid local SSD for:</p>

<ul>
  <li>general microservices (no benefit)</li>
  <li>workloads using remote storage (EBS/EFS/Azure Disk/Premium)</li>
</ul>

<hr />

<h1 id="part-8--cost-optimization-rules"><strong>PART 8 — Cost Optimization Rules</strong></h1>

<ol>
  <li>
    <p><strong>Use medium nodes for better bin packing</strong></p>

    <ul>
      <li>8 vCPU / 32 GiB is the global sweet spot</li>
    </ul>
  </li>
  <li>
    <p><strong>Avoid high-memory SKUs unless necessary</strong></p>

    <ul>
      <li>r-series / E-series cost premium</li>
    </ul>
  </li>
  <li>
    <p><strong>Graviton (AWS) or Ampere (GKE/Oracle) &gt; x86</strong></p>

    <ul>
      <li>typically 20–40% cheaper</li>
      <li>comparable or better performance per dollar</li>
    </ul>
  </li>
  <li>
    <p><strong>GPU nodes: choose smallest CPU SKU that meets throughput</strong></p>

    <ul>
      <li>oversizing CPU around GPUs is the #1 cost waste in AI clusters</li>
    </ul>
  </li>
  <li>
    <p><strong>Use autoscaling with Pod Disruption Budgets</strong></p>

    <ul>
      <li>avoids evacuation storms</li>
    </ul>
  </li>
</ol>
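<p>For rule 5, a PodDisruptionBudget is what stops the cluster autoscaler from draining too many replicas at once. A minimal sketch; the <code>app: api</code> selector and name are placeholders for your own workload:</p>

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2        # never voluntarily evict below 2 ready replicas
  selector:
    matchLabels:
      app: api
```

<p>With this in place, scale-down proceeds one node at a time instead of triggering an evacuation storm.</p>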

<hr />

<h1 id="segment-12-summary"><strong>SEGMENT 12 SUMMARY</strong></h1>

<p>You now have a <strong>cloud-agnostic, workload-driven node sizing strategy</strong>:</p>

<h3 id="core-principles">Core Principles</h3>

<ul>
  <li>memory &gt; CPU</li>
  <li>avoid NUMA fragmentation</li>
  <li>prefer several medium nodes</li>
  <li>leave room for system daemons</li>
</ul>

<h3 id="best-vm-families">Best VM Families</h3>

<ul>
  <li><strong>Azure</strong>: D-series, E-series, F-series, Lsv2 for SSD</li>
  <li><strong>AWS</strong>: m6i, c6i, r6i, c7g (Graviton), i3/i4i</li>
  <li><strong>GCP</strong>: n2-standard, n2-highmem, c2-standard</li>
</ul>

<h3 id="per-workload-node-size-playbooks">Per-Workload Node Size Playbooks</h3>

<ul>
  <li>Microservices → 4–8 vCPU</li>
  <li>JVM → 8–16 vCPU, high-memory</li>
  <li>Redis → 8 vCPU single-NUMA</li>
  <li>AI inference → 8–16 vCPU</li>
  <li>AI GPU → 4–6 CPUs per GPU</li>
  <li>Spark → 16–32 vCPU, local SSD</li>
</ul>

<h3 id="cost-optimization">Cost Optimization</h3>

<ul>
  <li>medium nodes pack best</li>
  <li>avoid big NUMA nodes</li>
  <li>Graviton/Ampere highly efficient</li>
  <li>GPU nodes should minimize CPU</li>
</ul>]]></content><author><name>Maung San</name><email>msan001@live.com</email></author><category term="Kubernetes" /><category term="Containers" /><category term="Kubernetes Resource Isolation" /><category term="cgroups" /><category term="kubelet" /><category term="resource isolation" /><category term="container runtime" /><category term="systemd" /><category term="cpu" /><category term="memory" /><category term="pod" /><category term="QoS" /><category term="kubepods" /><category term="CRI" /><category term="containerd" /><category term="CRI-O" /><category term="scheduling" /><category term="node allocatable" /><category term="overcommit" /><category term="bin-packing" /><summary type="html"><![CDATA[Segment 12 is where we get extremely practical about selecting the right node sizes and VM shapes in AKS/EKS/GKE. This is one of the most important but least understood aspects of Kubernetes performance engineering.]]></summary></entry></feed>