Change Window Runbook

A practical runbook for making production changes without improvising: prove readiness, move in small steps, capture evidence, and know when to stop.

Window discipline
  • If rollback is not written before the window starts, you are not ready.
  • One person drives the change, one person watches impact and keeps notes.
  • Capture before-and-after evidence while things are still calm. You will not reconstruct it cleanly after things go sideways.
  • Validation means a real smoke test, not just systemctl status.
  • If the scope changes mid-window, stop and re-declare the plan instead of drifting into a bigger change by accident.

Before the window opens

Most production pain comes from doing real planning during the window instead of before it. The minimum readiness bar:

Pre-window header
- Change ID: CHG-1234
- Window: 22:00-22:30 UTC
- Scope: production web tier only
- Operator: alice
- Watcher: bob
- Rollback: revert MR !442 and redeploy previous image
- Stop conditions: 5xx > 2%, login failures, or smoke test failure
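
A quick way to enforce that bar is to refuse to start when a header field is blank. The following is a minimal sketch, assuming the header is saved to a text file in the "- Field: value" shape shown above; the function name check_header and the file name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fail fast if any required pre-window header field is missing or empty.
# Assumes one "- Field: value" line per field, as in the example header.

check_header() {
  local file="$1" missing=0 field
  for field in "Change ID" "Window" "Scope" "Operator" "Watcher" "Rollback" "Stop conditions"; do
    # Require the field name, a colon, and at least one non-space character after it.
    if ! grep -Eq "^- ${field}: *[^[:space:]]" "$file"; then
      echo "MISSING: ${field}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (hypothetical file name):
#   check_header header.txt || echo "NO-GO: header incomplete"
```

An empty "Rollback:" line fails the check in the same way as a missing one, which matches the first rule of window discipline.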

If the change came from normal engineering flow, the supporting detail should already exist in Infra Change Lifecycle or the related implementation MR. The window is not where you figure out what to run for the first time.

Go / no-go pre-checks

Run these checks shortly before the declared start and record the result. A failed pre-check is a normal reason to delay. It is not a personal failure.

Check | What success looks like | If not
Repo and artifact version | Exact commit, image ID, or package version is pinned and reviewed. | Stop and identify what is actually being deployed.
Dependencies healthy | Databases, caches, auth, and network path are already green. | Do not stack a planned change on top of an active platform issue.
Backups or snapshots | Recent and relevant, with restore path known. | Delay if the rollback depends on state you do not actually have.
Dashboards open | Error rate, latency, saturation, and health checks are visible. | Open them now; do not change prod blind.
Rollback permissions | The operator can actually execute the revert if needed. | Fix access before start, not during a bad rollout.

# A practical pre-check bundle
date -Is
git rev-parse HEAD
curl -fsS https://app.example.com/health
journalctl -u myapp --since "10 minutes ago" --no-pager | tail -40
systemctl is-active myapp
df -h
df -i
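
To record the result rather than just eyeball it, the same commands can be wrapped so each one logs PASS or FAIL with a timestamp. This is a sketch only: run_check, go_no_go, and the PRECHECK_LOG path are illustrative names, not an existing tool, and the three checks are a subset of the bundle above.

```shell
#!/usr/bin/env bash
# Sketch: run pre-checks, log each result, and print GO or NO-GO.
# PRECHECK_LOG is a hypothetical location; adjust to your evidence directory.
PRECHECK_LOG="${PRECHECK_LOG:-/var/tmp/change-CHG-1234/prechecks.log}"

# run_check NAME COMMAND...: run COMMAND quietly and log PASS or FAIL.
run_check() {
  local name="$1" result=PASS
  shift
  "$@" >/dev/null 2>&1 || result=FAIL
  echo "$(date -Is) ${result} ${name}" | tee -a "$PRECHECK_LOG"
  [ "$result" = PASS ]
}

go_no_go() {
  local failed=0
  mkdir -p "$(dirname "$PRECHECK_LOG")"
  run_check "health endpoint"   curl -fsS https://app.example.com/health || failed=1
  run_check "service active"    systemctl is-active myapp || failed=1
  run_check "revision readable" git rev-parse HEAD || failed=1
  if [ "$failed" -eq 0 ]; then
    echo "GO"
  else
    echo "NO-GO"
    return 1
  fi
}

# Usage: go_no_go || echo "delay the window and record why"
```

Every check runs even after an earlier failure, so the log shows the full picture instead of only the first problem.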

For service-specific checks, pair this page with Generic or Service Troubleshooting.

Opening comms and declaring the start

Start the window with a short message that states scope, owner, expected impact, and next update time. The goal is to prevent "is anything happening?" chatter while the operator is working.

[START][CHG-1234] 22:00 UTC
Beginning production web rollout.
Scope: web01-web04 only.
Expected impact: brief connection draining per node, no full outage expected.
Operator: alice. Watcher: bob.
Rollback: previous image + launch template.
Next update: 22:10 UTC or sooner on material change.

If there is any chance the change could mask an ongoing incident, freeze unrelated deploys during the window. That makes later attribution much easier.

Execution loop and rollback points

Move in small, reversible steps. Every meaningful step should have a checkpoint and a clear decision: continue, pause, or roll back.

  1. Drain or isolate one canary node if possible.
  2. Apply the change to the canary.
  3. Run the defined smoke tests.
  4. Check dashboards and logs for a short interval.
  5. Only then continue to the next batch or the rest of the fleet.

Step | Checkpoint | Rollback point
Canary deploy | Node is healthy and passes smoke tests. | Revert on the single node or restore previous image/package/config.
Partial fleet | Error rate remains normal and no backlog appears. | Stop rollout, return traffic to untouched nodes, revert changed batch.
Full fleet | All nodes healthy and dashboards stable for the agreed watch period. | Execute the pre-written global rollback plan.

# Example execution shape for a controlled Ansible deploy
ansible-playbook site.yml -i inventories/prod --limit web01 --tags myapp
curl -fsS https://web01.example.com/health

ansible-playbook site.yml -i inventories/prod --limit web02:web03 --tags myapp
curl -fsS https://app.example.com/health

ansible-playbook site.yml -i inventories/prod --limit web04 --tags myapp

Do not widen the scope mid-window. "Since we are here, let's also roll the database config" is how controlled maintenance turns into an incident review.
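
The continue / pause / roll back decision can be made explicit by gating each batch on its checkpoint. The sketch below assumes the same playbook, inventory, and health URL as the example above; deploy_batch, smoke_check, and rollout are hypothetical wrapper names, and a real run would likely use a per-node health URL for the canary.

```shell
#!/usr/bin/env bash
# Sketch: deploy batch by batch and stop at the first failed checkpoint.

# Wraps the Ansible command from the example; $1 is a --limit pattern.
deploy_batch() {
  ansible-playbook site.yml -i inventories/prod --limit "$1" --tags myapp
}

# Minimal checkpoint: the health endpoint must answer successfully.
smoke_check() {
  curl -fsS "$1" >/dev/null
}

rollout() {
  local batch
  for batch in "$@"; do
    echo "deploying ${batch}"
    if ! deploy_batch "$batch" || ! smoke_check "https://app.example.com/health"; then
      echo "STOP at ${batch}: checkpoint failed, follow the written rollback for this step"
      return 1
    fi
    echo "checkpoint passed for ${batch}"
  done
}

# Usage: rollout web01 web02:web03 web04
```

The loop never touches a later batch once a checkpoint fails, which keeps the untouched nodes available as the place to return traffic.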

Smoke tests and validation

Your smoke tests should prove the user-visible or operator-visible path still works. Health endpoints are useful, but not sufficient by themselves.

# Example smoke pack
curl -fsS https://app.example.com/health
curl -fsSI https://app.example.com/login
ssh bastion 'systemctl is-active myapp'
journalctl -u myapp --since "2 minutes ago" --no-pager | tail -20
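
To make the pack fail loudly instead of relying on someone reading curl output, each probe can assert an expected status or body. This is a sketch under assumptions: the helper names are hypothetical, and the login-page marker name="password" is an invented placeholder that should be replaced with something your real page contains.

```shell
#!/usr/bin/env bash
# Sketch: smoke probes that assert on status code and page content.

# expect_status URL CODE: succeed only if URL answers with exactly CODE.
expect_status() {
  local got
  got=$(curl -sS -o /dev/null -w '%{http_code}' "$1") || return 1
  if [ "$got" = "$2" ]; then
    echo "OK $1 (${got})"
  else
    echo "FAIL $1: expected $2, got ${got}"
    return 1
  fi
}

# expect_body URL TEXT: succeed only if the response body contains TEXT.
expect_body() {
  if curl -fsS "$1" | grep -F -- "$2" >/dev/null; then
    echo "OK $1 contains expected text"
  else
    echo "FAIL $1: expected text not found"
    return 1
  fi
}

smoke_pack() {
  local failed=0
  expect_status https://app.example.com/health 200 || failed=1
  expect_status https://app.example.com/login 200 || failed=1
  # Placeholder marker: replace with text your login page really renders.
  expect_body https://app.example.com/login 'name="password"' || failed=1
  return "$failed"
}

# Usage: smoke_pack || echo "smoke test failure: this is a stop condition"
```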

Validation usually has three layers:

For the dashboard side, lean on Observability Overview. If the service has its own troubleshooting path, jump to Service Troubleshooting instead of guessing.
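
If you want a number for the notes alongside the dashboard, a 5xx stop condition such as "5xx > 2%" can be approximated from an access log. This is a rough sketch, assuming combined log format where the HTTP status is the ninth whitespace-separated field; the function names and the log path in the usage comment are hypothetical, and it is not a substitute for the dashboards.

```shell
#!/usr/bin/env bash
# Sketch: compute the 5xx percentage from an access log and compare to a threshold.
# Assumes combined log format (status code is field 9).

# five_xx_rate FILE: print the percentage of 5xx responses, two decimals.
five_xx_rate() {
  awk '{ total++; if ($9 ~ /^5[0-9][0-9]$/) errors++ }
       END {
         if (total == 0) print "0.00"
         else printf "%.2f\n", 100 * errors / total
       }' "$1"
}

# stop_condition_breached FILE PERCENT: exit 0 if the rate is strictly above PERCENT.
stop_condition_breached() {
  local rate
  rate=$(five_xx_rate "$1")
  awk -v r="$rate" -v t="$2" 'BEGIN { exit !(r > t) }'
}

# Usage (hypothetical log path):
#   stop_condition_breached /var/log/nginx/access.log 2 && echo "STOP: 5xx above 2%"
```

Feed it only the lines from the window (for example the tail of the log since the start message), otherwise earlier traffic dilutes the rate.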

Evidence capture and closeout

Capture evidence as you go, not from memory afterwards. A minimal evidence bundle should answer: what changed, when, what was observed, and how success was validated.

mkdir -p /var/tmp/change-CHG-1234

date -Is | tee /var/tmp/change-CHG-1234/timestamps.txt
git rev-parse HEAD | tee /var/tmp/change-CHG-1234/revision.txt
curl -fsS https://app.example.com/health | tee /var/tmp/change-CHG-1234/health.txt
journalctl -u myapp --since "15 minutes ago" --no-pager \
  > /var/tmp/change-CHG-1234/journal.txt

[COMPLETE][CHG-1234] 22:23 UTC
Rollout completed successfully.
Validation: health endpoint 200, login test passed, 5xx flat, latency unchanged.
Evidence: attached logs, dashboard screenshots, commit SHA, smoke output.
Watch period: continue normal monitoring for 30 minutes.

That closure message becomes the source material for the ticket, handover note, or change record. If the change was noisy or surprising, capture what should be improved for the next run instead of waiting for the next on-call engineer to rediscover it.
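
Before-and-after evidence is easier to compare when both captures use the same commands. The sketch below reuses the commands from this page; snapshot, capture, compare_snapshots, and the EVIDENCE_DIR layout are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: take labelled snapshots with identical commands, then diff them.
EVIDENCE_DIR="${EVIDENCE_DIR:-/var/tmp/change-CHG-1234}"

# capture FILE COMMAND...: save output; if COMMAND fails, record its exit status too.
capture() {
  local out="$1"
  shift
  "$@" > "$out" 2>&1 || echo "exit status: $?" >> "$out"
}

# snapshot LABEL: write one evidence set under EVIDENCE_DIR/LABEL.
snapshot() {
  local dir="${EVIDENCE_DIR}/$1"
  mkdir -p "$dir"
  capture "${dir}/timestamp.txt" date -Is
  capture "${dir}/revision.txt"  git rev-parse HEAD
  capture "${dir}/service.txt"   systemctl is-active myapp
  capture "${dir}/health.txt"    curl -fsS https://app.example.com/health
  capture "${dir}/disk.txt"      df -h
}

# compare_snapshots: show differences between before and after, ignoring timestamps.
# diff exits 1 when the snapshots differ, which is expected after a change.
compare_snapshots() {
  diff -r -x timestamp.txt "${EVIDENCE_DIR}/before" "${EVIDENCE_DIR}/after"
}

# Usage:
#   snapshot before    # at the start of the window
#   snapshot after     # once validation has passed
#   compare_snapshots > "${EVIDENCE_DIR}/diff.txt"
```

A failing command still produces a file, so a broken health check at capture time shows up in the record instead of silently leaving a gap.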

Failure modes

Failure mode | Why it happens | Safer move
No real rollback plan | Team assumes "we can just undo it" without rehearsing how. | Write the actual revert command or package/image rollback before the window starts.
Validation is only service-up checks | The daemon is up but the business path is broken. | Add one end-to-end smoke test per critical path.
Unrelated background change confuses the result | Multiple deploys or operator edits overlap. | Freeze unrelated changes for the window.
Scope creep | "While we are here" turns one change into three. | Stop, re-scope, and reschedule the extra work.
Rollback takes too long | Evidence, access, or artifact references were not prepared. | Pre-stage the old config, previous image ID, or revert MR link.
No useful record afterwards | Evidence was never captured during the run. | Assign one watcher to keep timestamps, outputs, and dashboard notes live.

Related pages: Generic, Infra Change Lifecycle, Service Troubleshooting, Observability Overview, Backup & Restore.