Packer Images
- Bake packages, agents, and baseline config into the image; do not bake secrets or per-host identity.
- Promote images by metadata or channel after validation, not by guessing which AMI ID is "probably the last good one".
- Validation is part of the build. A successful Packer run without boot tests is only a faster way to ship a broken machine.
- Rollback means deploying the previous known-good image, not mutating a bad one in place.
Why use golden images
A golden image is a prebuilt machine image with the OS, base packages, agents, and common configuration already installed. The trade-off is simple: spend time once during image build so new instances boot fast and consistently.
This works especially well for fleets that launch the same baseline repeatedly: web nodes, CI runners, bastions, worker pools, or disposable test hosts. Packer builds the image; Terraform or the cloud autoscaling system consumes it later. That split mirrors the one between infrastructure provisioning (Terraform Basics) and configuration management (Ansible).
Builders, provisioners, and post-processors
Packer terminology is straightforward once you map each part to its job:
- Builder: creates the temporary machine and snapshots the image.
- Provisioner: configures the temporary machine before snapshotting it.
- Post-processor: writes build metadata, copies artifacts, or prepares output for downstream tooling.
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = "~> 1.3"
    }
    ansible = {
      source  = "github.com/hashicorp/ansible"
      version = "~> 1.1"
    }
  }
}

variable "region" {
  type    = string
  default = "eu-west-1"
}

# Builder: launches a temporary instance from the newest matching RHEL 9 AMI.
source "amazon-ebs" "rhel9" {
  region        = var.region
  instance_type = "t3.medium"
  ssh_username  = "ec2-user"
  # Quotes inside ${...} are not escaped in HCL2.
  ami_name      = "web-base-${formatdate("YYYYMMDDhhmm", timestamp())}"

  source_ami_filter {
    filters = {
      name                = "RHEL-9.*_HVM-*"
      architecture        = "x86_64"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["309956199498"]
    most_recent = true
  }

  # Every build starts in the candidate channel; promotion happens later.
  tags = {
    Role         = "web"
    ImageChannel = "candidate"
    ManagedBy    = "packer"
  }
}

build {
  name    = "web-base"
  sources = ["source.amazon-ebs.rhel9"]

  # Ansible needs Python on the target, so install it first.
  provisioner "shell" {
    inline = [
      "sudo dnf -y update",
      "sudo dnf -y install python3"
    ]
  }

  provisioner "ansible" {
    playbook_file   = "ansible/image.yml"
    extra_arguments = ["--extra-vars", "image_role=web"]
  }

  # Make the image generic again before the snapshot.
  provisioner "shell" {
    inline = [
      "sudo cloud-init clean --logs --seed || true",
      "sudo truncate -s 0 /etc/machine-id"
    ]
  }

  post-processor "manifest" {
    output = "manifest.json"
  }
}
The last shell step matters. You want the image to be generic again before snapshotting it: no stale machine ID, no installer leftovers, and no one-off temp files from the build.
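The same cleanup idea can be extended beyond the machine ID. The sketch below is one possible shape for it, not part of the build above: the function name `scrub_image_root` and the chosen paths (SSH host keys, `/tmp`, `/var/tmp`, root's shell history) are assumptions. It takes the root directory as an argument so it can be tried against a scratch directory before being pointed at `/`.

```shell
# Hypothetical helper: reset per-host state under a given root directory.
# Pass "/" inside the Packer build (run with sudo), or a scratch directory
# when trying it out. The body runs in a subshell, so variables stay local.
scrub_image_root() (
  root="${1:?usage: scrub_image_root ROOT_DIR}"
  root="${root%/}"   # "/" becomes "", so the paths below stay well-formed

  # Empty (do not delete) machine-id so it is regenerated on first boot.
  if [ -e "$root/etc/machine-id" ]; then
    : > "$root/etc/machine-id"
  fi

  # SSH host keys must be unique per instance; they are normally
  # recreated at first boot.
  rm -f "$root"/etc/ssh/ssh_host_*

  # Drop build-time temp files and root's shell history.
  for dir in "$root/tmp" "$root/var/tmp"; do
    [ -d "$dir" ] && find "$dir" -mindepth 1 -delete
  done
  rm -f "$root/root/.bash_history"
)
```

Emptying `/etc/machine-id` rather than deleting it matches the `truncate` call in the build: the file stays in place and only its content is reset.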
Ansible integration
Packer and Ansible fit well together. Packer handles image lifecycle; Ansible applies the OS-level baseline. The same roles can often be reused later during config drift correction or app deploys, as long as you keep host-identity work out of the image build.
---
- name: Configure image baseline
  hosts: all
  become: true
  roles:
    - baseline
    - node-exporter
    - journald
  tasks:
    - name: Remove temporary build users
      ansible.builtin.user:
        name: packer
        state: absent
        remove: true
Good image-build content:
- package installation
- security baseline
- monitoring agents
- common service config that is identical on every node of that role
Bad image-build content:
- database passwords
- host-specific TLS private keys
- Kerberos host principals and keytabs
- any token you expect to rotate frequently
Use runtime secret injection instead; the patterns are covered in GitLab Secrets & OIDC and later in HashiCorp Vault.
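The "bad content" list can also be checked mechanically before the snapshot. The following is a minimal sketch, assuming a guard that scans a directory tree for keytabs and PEM private keys; the function names and the two patterns are illustrative and far from exhaustive. Point it at the directories your provisioning writes to rather than at all of `/`, and run it after image cleanup, since SSH host keys would otherwise match.

```shell
# Hypothetical pre-snapshot guard: print files under ROOT_DIR that look
# like baked-in secrets. Best-effort scan; unreadable paths are skipped.
find_baked_secrets() (
  root="${1:?usage: find_baked_secrets ROOT_DIR}"
  # Kerberos keytabs, matched by file name.
  find "$root" -xdev -type f -name '*.keytab' 2>/dev/null || true
  # PEM private keys, matched by content (small files only).
  find "$root" -xdev -type f -size -64k \
    -exec grep -l -e '-----BEGIN .*PRIVATE KEY-----' {} + 2>/dev/null || true
)

# Fail (non-zero exit) if anything secret-like is found.
assert_no_baked_secrets() (
  hits=$(find_baked_secrets "$1")
  if [ -n "$hits" ]; then
    echo "refusing to snapshot, secret-like files found:" >&2
    echo "$hits" >&2
    exit 1
  fi
)

# Usage, e.g. as a late shell provisioner step:
#   assert_no_baked_secrets /etc/myapp   # /etc/myapp is a placeholder path
```

A guard like this only catches obvious file-shaped secrets; it does not replace keeping passwords and tokens out of the playbook in the first place.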
Validation and smoke tests
An image build is not done when Packer says "artifact created". It is done when a real instance boots from that image and passes the checks you care about.
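# Example post-build validation
IMAGE_ID=$(jq -r '.builds[-1].artifact_id' manifest.json | cut -d: -f2)

# Launch a throwaway instance and keep its ID for the later steps
INSTANCE_ID=$(aws ec2 run-instances \
  --image-id "$IMAGE_ID" \
  --instance-type t3.micro \
  --subnet-id subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --query 'Instances[0].InstanceId' --output text)
aws ec2 wait instance-status-ok --instance-ids "$INSTANCE_ID"

# Then validate from CI or a canary runner
# (test-host is a placeholder for the new instance's address)
nc -zv test-host 22
curl -fsS http://test-host/health
ssh test-host 'systemctl is-active node_exporter'

# Clean up the test instance
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"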
Keep validation focused on boot success, service reachability, baseline telemetry, and one role-specific smoke test. If you already have deployment smoke tests in Ansible Testing or a service runbook, reuse them here.
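A freshly booted instance often needs a few seconds before SSH and the application answer, so one-shot checks tend to fail spuriously. One way to handle that is a small retry wrapper; the sketch below assumes a helper named `retry` (not an existing tool) and uses illustrative attempt counts and delays.

```shell
# Hypothetical helper: run a check up to ATTEMPTS times, sleeping DELAY
# seconds between failed attempts. Succeeds on the first passing attempt,
# fails once every attempt has failed.
retry() (
  attempts="$1"; delay="$2"; shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    echo "attempt $n/$attempts failed: $*" >&2
    [ "$n" -lt "$attempts" ] && sleep "$delay"
    n=$((n + 1))
  done
  exit 1
)

# Usage, with the placeholder host from the checks above:
#   retry 30 10 nc -z test-host 22
#   retry 12 5 curl -fsS http://test-host/health
```

Keep the total wait bounded. If the instance is not healthy after a few minutes, the image should fail validation instead of blocking the pipeline.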
Image promotion and consumption
Do not point production directly at "the latest build". Promote a validated candidate into a stable channel, then let Terraform or the autoscaling group consume the promoted image.
# Simple tag-based promotion example
aws ec2 create-tags \
  --resources ami-0123456789abcdef0 \
  --tags Key=ImageChannel,Value=prod Key=Role,Value=web

data "aws_ami" "web_prod" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:Role"
    values = ["web"]
  }

  filter {
    name   = "tag:ImageChannel"
    values = ["prod"]
  }
}

resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = data.aws_ami.web_prod.id
  instance_type = "t3.medium"
}
This keeps image production and image consumption decoupled. Packer publishes candidates; Terraform chooses the promoted image. That is easier to review and easier to roll back.
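Promotion is easier to trust when the tagging step refuses anything that is not currently a candidate. A minimal sketch of such a gate follows, assuming the `ImageChannel`/`Role` tag scheme from this page and an authenticated AWS CLI; the function names `current_channel` and `promote_image` are hypothetical.

```shell
# Hypothetical promotion gate: only move an AMI from "candidate" to "prod".

# Print the AMI's current ImageChannel tag value ("None" if the tag is unset).
current_channel() {
  aws ec2 describe-images --image-ids "$1" \
    --query 'Images[0].Tags[?Key==`ImageChannel`].Value | [0]' \
    --output text
}

promote_image() (
  ami="${1:?usage: promote_image AMI_ID ROLE}"
  role="${2:?usage: promote_image AMI_ID ROLE}"
  channel=$(current_channel "$ami") || exit 1
  if [ "$channel" != "candidate" ]; then
    echo "refusing to promote $ami: ImageChannel is '$channel'" >&2
    exit 1
  fi
  aws ec2 create-tags --resources "$ami" \
    --tags "Key=ImageChannel,Value=prod" "Key=Role,Value=$role"
)

# Usage, after validation has passed:
#   promote_image ami-0123456789abcdef0 web
```

The check is deliberately strict: an image that was never tagged, or that has already been promoted, is rejected rather than silently re-tagged.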
Rollback
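Rollback means selecting the previous known-good image and redeploying instances from it. Do not try to "fix" a bad AMI in place; images are immutable artifacts. Because the Terraform lookup takes the most recent image in the prod channel, the bad image must also leave that channel, or it will keep winning the lookup.
# Demote the bad image so most_recent no longer selects it
# ("rejected" is an example channel name)
aws ec2 create-tags \
  --resources ami-0123456789abcdef0 \
  --tags Key=ImageChannel,Value=rejected
# Re-promote the previous image
aws ec2 create-tags \
  --resources ami-0fedcba9876543210 \
  --tags Key=ImageChannel,Value=prod Key=Role,Value=web
# Then re-run Terraform or your rollout job
terraform apply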
Keep at least one earlier production image available until the new one has survived a real deployment. If the image also includes package-level risk, pair the rollback plan with normal service rollback guidance from Infra Change Lifecycle.
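Finding that earlier image should not depend on memory. The sketch below is one hedged way to look it up from tags alone: it lists a role's prod-channel images newest first and prints the second one. The function names are hypothetical. It also assumes every image in the prod channel before the latest promotion was good, so run it before changing any tags and confirm the result against your build records.

```shell
# Hypothetical lookup: print the second-newest AMI in a role's prod channel,
# i.e. the image that was live before the most recent promotion.

# List "CreationDate<TAB>ImageId" for the role's prod images, newest first.
list_prod_images() {
  aws ec2 describe-images --owners self \
    --filters "Name=tag:Role,Values=$1" "Name=tag:ImageChannel,Values=prod" \
    --query 'reverse(sort_by(Images, &CreationDate))[].[CreationDate, ImageId]' \
    --output text
}

previous_prod_image() (
  role="${1:?usage: previous_prod_image ROLE}"
  prev=$(list_prod_images "$role" | awk 'NR == 2 { print $2 }')
  if [ -z "$prev" ]; then
    echo "no earlier prod image found for role '$role'" >&2
    exit 1
  fi
  echo "$prev"
)

# Usage:
#   previous_prod_image web
```

If the lookup fails because only one prod image exists, there is nothing to roll back to, which is exactly the situation the retention advice above is meant to prevent.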
Troubleshooting and failure modes
| Symptom | Likely cause | What to do |
|---|---|---|
| Packer cannot SSH into the builder instance | Wrong username, network path, or security group. | Verify the source image's default user and confirm the temporary instance is reachable. |
| Ansible provisioner fails immediately | Python is missing or the communicator user lacks sudo. | Install Python in an earlier shell provisioner and confirm privilege escalation works. |
| Image boots, but clones share a machine ID or SSH host keys | Machine ID, host keys, or cloud-init state was baked in. | Clean those artifacts before snapshot and regenerate at first boot. |
| New instances come up unhealthy even though the build passed | No real post-build validation was run. | Launch a test instance and add role-specific smoke tests to the pipeline. |
| Credentials leak across many hosts at once | A secret was baked into the image. | Rebuild without the secret, rotate the credential, and switch to runtime injection. |
| Terraform keeps rolling instances unexpectedly | Image selection is "most recent" without a stable promotion rule. | Consume only a promoted channel or pinned image ID, not every fresh candidate build. |
| Rollback is slow and confusing | No manifest, no tags, or no clear record of the last good image. | Publish metadata every build and make the production channel explicit. |
Related pages: Terraform Basics, Ansible, Ansible Testing, Ansible Deploy Flow, Infra Change Lifecycle.