Packer Images

Build repeatable base images once, validate them hard, then deploy instances from those images instead of re-running long bootstrap scripts at boot.

Golden image rules
  • Bake packages, agents, and baseline config into the image; do not bake secrets or per-host identity.
  • Promote images by metadata or channel after validation, not by guessing which AMI ID is "probably the last good one".
  • Validation is part of the build. A successful Packer run without boot tests is only a faster way to ship a broken machine.
  • Rollback means deploying the previous known-good image, not mutating a bad one in place.

Why use golden images

A golden image is a prebuilt machine image with the OS, base packages, agents, and common configuration already installed. The trade-off is simple: spend time once during image build so new instances boot fast and consistently.

This works especially well for fleets where the same baseline is launched repeatedly: web nodes, CI runners, bastions, worker pools, or disposable test hosts. Packer builds the image; Terraform or the cloud autoscaling system consumes it later. That split mirrors the difference between Terraform Basics and configuration management with Ansible.

Builders, provisioners, and post-processors

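Packer terminology is straightforward once you map each part to its job. A source block (the HCL2 name for a builder) launches a temporary instance and snapshots it into an image, provisioners install and configure software on that instance, and post-processors act on the finished artifact: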

packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = "~> 1.3"
    }
    ansible = {
      source  = "github.com/hashicorp/ansible"
      version = "~> 1.1"
    }
  }
}

variable "region" {
  type    = string
  default = "eu-west-1"
}

source "amazon-ebs" "rhel9" {
  region        = var.region
  instance_type = "t3.medium"
  ssh_username  = "ec2-user"
  ami_name      = "web-base-${formatdate("YYYYMMDDhhmm", timestamp())}"

  source_ami_filter {
    filters = {
      name                = "RHEL-9.*_HVM-*"
      architecture        = "x86_64"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["309956199498"]
    most_recent = true
  }

  tags = {
    Role         = "web"
    ImageChannel = "candidate"
    ManagedBy    = "packer"
  }
}

build {
  name    = "web-base"
  sources = ["source.amazon-ebs.rhel9"]

  provisioner "shell" {
    inline = [
      "sudo dnf -y update",
      "sudo dnf -y install python3"
    ]
  }

  provisioner "ansible" {
    playbook_file   = "ansible/image.yml"
    extra_arguments = ["--extra-vars", "image_role=web"]
  }

  provisioner "shell" {
    inline = [
      "sudo cloud-init clean --logs --seed || true",
      "sudo truncate -s 0 /etc/machine-id"
    ]
  }

  post-processor "manifest" {
    output = "manifest.json"
  }
}

The last shell step matters. You want the image to be generic again before snapshotting it: no stale machine ID, no installer leftovers, and no one-off temp files from the build.
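
The cleanup described above can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not the page's canonical script: the name clean_image_identity and its optional root-directory argument are hypothetical, introduced so the logic can be tried against a scratch directory. Removing SSH host keys and emptying /tmp are common additions that the provisioner above does not perform; on a real build host the function would need root privileges.

```shell
#!/bin/sh
# Hypothetical helper: reset per-host identity under a filesystem root.
# ROOT defaults to "/" on a real build host; pass a scratch dir to try it out.
clean_image_identity() {
  root="${1:-/}"
  root="${root%/}"   # strip a trailing slash so paths join cleanly

  # Empty (do not delete) machine-id so it is regenerated on first boot.
  : > "$root/etc/machine-id"

  # Remove SSH host keys so each instance gets its own set.
  rm -f "$root"/etc/ssh/ssh_host_*

  # Drop one-off temp files left behind by the build, keeping /tmp itself.
  if [ -d "$root/tmp" ]; then
    find "$root/tmp" -mindepth 1 -delete
  fi
}
```

Host keys are normally recreated at first boot by cloud-init or the distribution's sshd key-generation unit; confirm that for your base image before relying on it.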

Ansible integration

Packer and Ansible fit well together. Packer handles image lifecycle; Ansible applies the OS-level baseline. The same roles can often be reused later during config drift correction or app deploys, as long as you keep host-identity work out of the image build.

---
- name: Configure image baseline
  hosts: all
  become: true
  roles:
    - baseline
    - node-exporter
    - journald

  tasks:
    - name: Remove temporary build users
      ansible.builtin.user:
        name: packer
        state: absent
        remove: true

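Good image-build content:
  • OS packages and security updates
  • Monitoring and logging agents, such as node-exporter
  • Baseline configuration shared by every host in the role

Bad image-build content:
  • Secrets: passwords, API tokens, private keys
  • Per-host identity: machine ID, SSH host keys, cloud-init state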

Use runtime secret injection instead; the patterns are covered in GitLab Secrets & OIDC and later in HashiCorp Vault.
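
One way to catch an accidentally baked secret before the snapshot is a crude scan for private-key material. The sketch below is an assumption-laden illustration, not a complete secret scanner: find_baked_private_keys is a hypothetical name, and the check only looks for PEM private-key headers. Run it after identity cleanup, because SSH host keys would otherwise match.

```shell
#!/bin/sh
# Hypothetical pre-snapshot guard: print files under a directory that
# contain a PEM private-key header. Returns non-zero if any are found.
find_baked_private_keys() {
  dir="$1"
  # -e is required because the pattern itself starts with dashes.
  hits=$(grep -rlE -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' "$dir" 2>/dev/null) || true
  if [ -n "$hits" ]; then
    echo "$hits"
    return 1
  fi
  return 0
}
```

A non-zero exit can fail the Packer build from a shell provisioner, which stops the image from being published at all.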

Validation and smoke tests

An image build is not done when Packer says "artifact created". It is done when a real instance boots from that image and passes the checks you care about.

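# Example post-build validation
# artifact_id has the form "region:ami-id"; keep only the AMI ID
IMAGE_ID=$(jq -r '.builds[-1].artifact_id' manifest.json | cut -d: -f2)

# Launch a throwaway instance and remember its ID for cleanup
INSTANCE_ID=$(aws ec2 run-instances \
  --image-id "$IMAGE_ID" \
  --instance-type t3.micro \
  --subnet-id subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --query 'Instances[0].InstanceId' \
  --output text)

# Wait until EC2 status checks pass before probing services
aws ec2 wait instance-status-ok --instance-ids "$INSTANCE_ID"

# Then validate from CI or a canary runner (test-host stands for the new instance)
nc -zv test-host 22
curl -fsS http://test-host/health
ssh test-host 'systemctl is-active node_exporter'

# Always remove the test instance afterwards
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"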

Keep validation focused on boot success, service reachability, baseline telemetry, and one role-specific smoke test. If you already have deployment smoke tests in Ansible Testing or a service runbook, reuse them here.

Fast feedback beats deep heroics. Ten short checks that run on every build are better than one giant manual validation document no one actually follows.
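
A set of short checks like this can be driven by a very small runner. The following is a minimal sketch, assuming plain POSIX shell; run_check, retry, and the FAILED counter are hypothetical names, not part of any existing tooling. Every check runs even if an earlier one fails, so one build reports all problems at once.

```shell
#!/bin/sh
# Hypothetical check runner: run each named check, print PASS or FAIL,
# and count failures instead of stopping at the first one.
FAILED=0

run_check() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    FAILED=$((FAILED + 1))
  fi
}

# Retry a command while the instance is still booting.
# Usage: retry ATTEMPTS DELAY_SECONDS command [args...]
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example wiring (test-host is a placeholder for the instance under test):
#   run_check "ssh port" retry 10 5 nc -z test-host 22
#   run_check "health"   curl -fsS http://test-host/health
#   [ "$FAILED" -eq 0 ]   # non-zero exit fails the pipeline
```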

Image promotion and consumption

Do not point production directly at "the latest build". Promote a validated candidate into a stable channel, then let Terraform or the autoscaling group consume the promoted image.

# Simple tag-based promotion example
aws ec2 create-tags \
  --resources ami-0123456789abcdef0 \
  --tags Key=ImageChannel,Value=prod Key=Role,Value=web

# Terraform side: resolve the newest image in the prod channel
data "aws_ami" "web_prod" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:Role"
    values = ["web"]
  }

  filter {
    name   = "tag:ImageChannel"
    values = ["prod"]
  }
}

resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = data.aws_ami.web_prod.id
  instance_type = "t3.medium"
}

This keeps image production and image consumption decoupled. Packer publishes candidates; Terraform chooses the promoted image. That is easier to review and easier to roll back.
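
The promotion step can be tied directly to validation so that a candidate never reaches the prod channel without passing its checks. This is a sketch under stated assumptions: promote_if_valid is a hypothetical function, and the channel value "rejected" is an arbitrary label chosen for illustration. It relies only on the aws ec2 create-tags call already used on this page and assumes the AWS CLI is configured with permission to tag the image.

```shell
#!/bin/sh
# Hypothetical promotion gate: tag the AMI "prod" only if the supplied
# validation command succeeds; otherwise tag it "rejected".
# Usage: promote_if_valid AMI_ID validation-command [args...]
promote_if_valid() {
  ami_id="$1"; shift
  if "$@"; then
    channel="prod"
  else
    channel="rejected"
  fi
  # Record the decision on the image itself so it is visible to Terraform.
  aws ec2 create-tags \
    --resources "$ami_id" \
    --tags "Key=ImageChannel,Value=$channel" || return 1
  # Succeed only when the image was actually promoted.
  [ "$channel" = "prod" ]
}
```

The Role tag is left untouched because the Packer template already sets it at build time.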

Rollback

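Rollback means selecting the previous known-good image and redeploying instances from it. Do not try to "fix" a bad AMI in place; images are immutable artifacts. If Terraform selects the most recent image in the prod channel, the bad image must also leave that channel, otherwise it stays the newest match and keeps being selected.

# Demote the bad image so it no longer matches the prod filter
# (the value "rejected" is an arbitrary example)
aws ec2 create-tags \
  --resources ami-0123456789abcdef0 \
  --tags Key=ImageChannel,Value=rejected

# Re-promote the previous image
aws ec2 create-tags \
  --resources ami-0fedcba9876543210 \
  --tags Key=ImageChannel,Value=prod Key=Role,Value=web

# Then re-run Terraform or your rollout job
terraform apply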

Keep at least one earlier production image available until the new one has survived a real deployment. If the image also includes package-level risk, pair the rollback plan with normal service rollback guidance from Infra Change Lifecycle.
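
Finding the previous known-good image is easier when every promotion is recorded somewhere simple. The sketch below assumes a plain text log with one promoted AMI ID per line, oldest first; the file name promotions.log and the functions record_promotion and previous_good_image are hypothetical. Tags or a manifest store could serve the same purpose.

```shell
#!/bin/sh
# Hypothetical promotion log: one promoted AMI ID per line, oldest first.
PROMOTION_LOG="${PROMOTION_LOG:-promotions.log}"

# Append an AMI ID each time an image is promoted to production.
record_promotion() {
  echo "$1" >> "$PROMOTION_LOG"
}

# Print the AMI that was in production before the current one.
# Fails if fewer than two promotions have been recorded.
previous_good_image() {
  count=$(wc -l < "$PROMOTION_LOG" 2>/dev/null) || count=0
  count=$((count))   # normalise any padding that wc adds
  [ "$count" -ge 2 ] || return 1
  tail -n 2 "$PROMOTION_LOG" | head -n 1
}
```

The printed ID is the one to re-promote during a rollback; keep that image registered until the newer one has proven itself.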

Troubleshooting and failure modes

Symptom | Likely cause | What to do
Packer cannot SSH into the builder instance | Wrong username, network path, or security group. | Verify the source image's default user and confirm the temporary instance is reachable.
Ansible provisioner fails immediately | Python is missing or the communicator user lacks sudo. | Install Python in an earlier shell provisioner and confirm privilege escalation works.
Image boots, but clones have duplicate identity weirdness | Machine ID, host keys, or cloud-init state was baked in. | Clean those artifacts before snapshot and regenerate at first boot.
New instances come up unhealthy even though the build passed | No real post-build validation was run. | Launch a test instance and add role-specific smoke tests to the pipeline.
Credentials leak across many hosts at once | A secret was baked into the image. | Rebuild without the secret, rotate the credential, and switch to runtime injection.
Terraform keeps rolling instances unexpectedly | Image selection is "most recent" without a stable promotion rule. | Consume only a promoted channel or pinned image ID, not every fresh candidate build.
Rollback is slow and confusing | No manifest, no tags, or no clear record of the last good image. | Publish metadata every build and make the production channel explicit.

Related pages: Terraform Basics, Ansible, Ansible Testing, Ansible Deploy Flow, Infra Change Lifecycle.