Change Window Runbook
- If rollback is not written before the window starts, you are not ready.
- One person drives the change, one person watches impact and keeps notes.
- Capture before and after evidence while calm. You will not reconstruct it cleanly after things go sideways.
- Validation means a real smoke test, not just systemctl status.
- If the scope changes mid-window, stop and re-declare the plan instead of drifting into a bigger change by accident.
Before the window opens
Most production pain comes from doing real planning during the window instead of before it. The minimum readiness bar:
- scope is explicit: hosts, services, environments, and what is out of scope
- one operator and one reviewer or watcher are assigned
- the exact command path is known and tested in staging where possible
- rollback command or rollback document is ready to paste
- monitoring dashboards and logs are bookmarked
- affected stakeholders know the window and expected user impact
Pre-window header
- Change ID: CHG-1234
- Window: 22:00-22:30 UTC
- Scope: production web tier only
- Operator: alice
- Watcher: bob
- Rollback: revert MR !442 and redeploy previous image
- Stop conditions: 5xx > 2%, login failures, or smoke test failure
If the change came from normal engineering flow, the supporting detail should already exist in Infra Change Lifecycle or the related implementation MR. The window is not where you figure out what to run for the first time.
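A stop condition such as "5xx > 2%" is easier to act on when it is a check rather than a judgment call under pressure. The sketch below is one way to express it in bash. The function name `error_rate_exceeds` and its arguments are hypothetical, and the request and error counts are assumed to come from whatever metrics source the team already uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 (meaning "stop") when errors/total is
# strictly greater than threshold_pct. Integer math only, no floats.
error_rate_exceeds() {
  local errors=$1 total=$2 threshold_pct=$3
  # No traffic means no measurable error rate; treated as "not exceeded".
  if [ "$total" -eq 0 ]; then
    return 1
  fi
  [ $(( errors * 100 )) -gt $(( threshold_pct * total )) ]
}

# Example: 30 5xx responses out of 1000 requests against a 2% threshold.
if error_rate_exceeds 30 1000 2; then
  echo "STOP: 5xx rate above 2%"
fi
```

Zero traffic is treated as "not exceeded" here only to avoid dividing by nothing. In a real window, zero traffic on a node that should be serving is itself worth pausing on.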
Go / no-go pre-checks
Run these checks shortly before the declared start and record the result. A failed pre-check is a normal reason to delay. It is not a personal failure.
| Check | What success looks like | If not |
|---|---|---|
| Repo and artifact version | Exact commit, image ID, or package version is pinned and reviewed. | Stop and identify what is actually being deployed. |
| Dependencies healthy | Databases, caches, auth, and network path are already green. | Do not stack a planned change on top of an active platform issue. |
| Backups or snapshots | Recent and relevant, with restore path known. | Delay if the rollback depends on state you do not actually have. |
| Dashboards open | Error rate, latency, saturation, and health checks are visible. | Open them now; do not change prod blind. |
| Rollback permissions | The operator can actually execute the revert if needed. | Fix access before start, not during a bad rollout. |
# A practical pre-check bundle
date -Is
git rev-parse HEAD
curl -fsS https://app.example.com/health
journalctl -u myapp --since "10 minutes ago" --no-pager | tail -40
systemctl is-active myapp
df -h
df -i
For service-specific checks, pair this page with Generic or Service Troubleshooting.
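Since the result of each pre-check should be recorded, the checks can be wrapped so that every one leaves a timestamped PASS or FAIL line. This is a minimal sketch, not a standard tool: `run_check`, `precheck_all`, and the log path are assumptions for illustration, and the commands inside are taken from the bundle above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run one named check and log PASS/FAIL with a UTC timestamp.
PRECHECK_LOG="${PRECHECK_LOG:-/var/tmp/precheck.log}"

run_check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    printf '%s PASS %s\n' "$(date -u +%FT%TZ)" "$name" | tee -a "$PRECHECK_LOG"
  else
    printf '%s FAIL %s\n' "$(date -u +%FT%TZ)" "$name" | tee -a "$PRECHECK_LOG"
    return 1
  fi
}

# Go only if every check passes; && stops at the first failure.
precheck_all() {
  run_check "service active"  systemctl is-active --quiet myapp &&
  run_check "health endpoint" curl -fsS https://app.example.com/health &&
  run_check "repo revision"   git rev-parse HEAD
}

# Usage:
# precheck_all || echo "NO-GO: fix the failed check or delay the window"
```

The wrapper discards command output on purpose so the log stays a short go / no-go record. Run the raw commands when you need the detail.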
Opening comms and declaring the start
Start the window with a short message that states scope, owner, expected impact, and next update time. The goal is to prevent "is anything happening?" chatter while the operator is working.
[START][CHG-1234] 22:00 UTC
Beginning production web rollout.
Scope: web01-web04 only.
Expected impact: brief connection draining per node, no full outage expected.
Operator: alice. Watcher: bob.
Rollback: revert MR !442 and redeploy previous image.
Next update: 22:10 UTC or sooner on material change.
If there is any chance the change could mask an ongoing incident, freeze unrelated deploys during the window. That makes later attribution much easier.
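If the start message is built from a template, no field gets dropped when the operator is in a hurry. The sketch below assumes the fields are set as shell variables beforehand. The function name `start_message` and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render the start message from shell variables.
# The ${VAR:?} checks abort with an error if any field is unset or empty.
start_message() {
  : "${CHANGE_ID:?}" "${START_TIME:?}" "${SUMMARY:?}" "${SCOPE:?}" "${IMPACT:?}"
  : "${OPERATOR:?}" "${WATCHER:?}" "${ROLLBACK:?}" "${NEXT_UPDATE:?}"
  cat <<EOF
[START][$CHANGE_ID] $START_TIME
$SUMMARY
Scope: $SCOPE
Expected impact: $IMPACT
Operator: $OPERATOR. Watcher: $WATCHER.
Rollback: $ROLLBACK
Next update: $NEXT_UPDATE or sooner on material change.
EOF
}

# Usage: set the variables from the pre-window header, then
# start_message
```

Paste the output into whichever channel the team uses for change comms. How it gets posted is deliberately left out of the sketch.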
Execution loop and rollback points
Move in small, reversible steps. Every meaningful step should have a checkpoint and a clear decision: continue, pause, or roll back.
- Drain or isolate one canary node if possible.
- Apply the change to the canary.
- Run the defined smoke tests.
- Check dashboards and logs for a short interval.
- Only then continue to the next batch or the rest of the fleet.
| Step | Checkpoint | Rollback point |
|---|---|---|
| Canary deploy | Node is healthy and passes smoke tests. | Revert on the single node or restore previous image/package/config. |
| Partial fleet | Error rate remains normal and no backlog appears. | Stop rollout, return traffic to untouched nodes, revert changed batch. |
| Full fleet | All nodes healthy and dashboards stable for the agreed watch period. | Execute the pre-written global rollback plan. |
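# Example execution shape for a controlled Ansible deploy
# Canary: one node, then check that node directly
ansible-playbook site.yml -i inventories/prod --limit web01 --tags myapp
curl -fsS https://web01.example.com/health
# Partial fleet: next batch, then check through the public endpoint
ansible-playbook site.yml -i inventories/prod --limit web02:web03 --tags myapp
curl -fsS https://app.example.com/health
# Full fleet: last node, then check again before the watch period starts
ansible-playbook site.yml -i inventories/prod --limit web04 --tags myapp
curl -fsS https://app.example.com/health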
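The batch-by-batch shape above can also be wrapped in a loop that refuses to continue past a failed checkpoint. This is a sketch under assumptions: `deploy_batch`, `smoke_batch`, and `rollout` are hypothetical names, and the hook bodies should be swapped for the real deploy and smoke commands of the change.

```shell
#!/usr/bin/env bash
# Hypothetical hooks: replace the bodies with the real deploy and smoke commands.
deploy_batch() { ansible-playbook site.yml -i inventories/prod --limit "$1" --tags myapp; }
smoke_batch()  { curl -fsS https://app.example.com/health >/dev/null; }

# Roll through batches in order; stop at the first failed deploy or checkpoint.
rollout() {
  local batch
  for batch in "$@"; do
    echo "deploying: $batch"
    if ! deploy_batch "$batch"; then
      echo "deploy failed on $batch; stop and decide on rollback"
      return 1
    fi
    if ! smoke_batch "$batch"; then
      echo "checkpoint failed after $batch; stop and decide on rollback"
      return 1
    fi
  done
  echo "all batches passed checkpoints"
}

# Usage (canary first, then the rest):
# rollout web01 web02:web03 web04
```

The loop stops but does not roll back on its own. The continue, pause, or roll back decision stays with the operator and watcher.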
Smoke tests and validation
Your smoke tests should prove the user-visible or operator-visible path still works. Health endpoints are useful, but not sufficient by themselves.
# Example smoke pack
curl -fsS https://app.example.com/health
curl -fsSI https://app.example.com/login
ssh bastion 'systemctl is-active myapp'
journalctl -u myapp --since "2 minutes ago" --no-pager | tail -20
Validation usually has three layers:
- service layer: process up, config loaded, no immediate errors
- application layer: real request succeeds, auth still works, queue or database still reachable
- observability layer: no silent 5xx rise, latency spike, saturation jump, or log storm
For the dashboard side, lean on Observability Overview. If the service has its own troubleshooting path, jump to Service Troubleshooting instead of guessing.
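The layers can be checked in one pass that keeps going after a failure, so a single run shows every layer that is broken rather than only the first. A minimal sketch follows. `check_layer`, `validate`, and `FAILED_LAYERS` are hypothetical names. Only the service and application layers are scripted here, on the assumption that the observability layer stays with the watcher at the dashboards.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one check for a named layer and record failures
# without stopping, so the full picture is visible after one run.
FAILED_LAYERS=""

check_layer() {
  local layer=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   $layer"
  else
    echo "FAIL $layer"
    FAILED_LAYERS="$FAILED_LAYERS $layer"
  fi
}

validate() {
  FAILED_LAYERS=""
  check_layer service     systemctl is-active --quiet myapp
  check_layer application curl -fsSI https://app.example.com/login
  # Succeeds only if no layer failed.
  [ -z "$FAILED_LAYERS" ]
}

# Usage:
# validate || echo "validation failed in:$FAILED_LAYERS"
```

Call `check_layer` directly in the current shell, not inside a pipeline or `$(...)`. Otherwise the failure list is updated in a subshell and lost.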
Evidence capture and closeout
Capture evidence as you go, not from memory afterwards. A minimal evidence bundle should answer: what changed, when, what was observed, and how success was validated.
mkdir -p /var/tmp/change-CHG-1234
date -Is | tee /var/tmp/change-CHG-1234/timestamps.txt
git rev-parse HEAD | tee /var/tmp/change-CHG-1234/revision.txt
curl -fsS https://app.example.com/health | tee /var/tmp/change-CHG-1234/health.txt
journalctl -u myapp --since "15 minutes ago" --no-pager \
> /var/tmp/change-CHG-1234/journal.txt
[COMPLETE][CHG-1234] 22:23 UTC
Rollout completed successfully.
Validation: health endpoint 200, login test passed, 5xx flat, latency unchanged.
Evidence: attached logs, dashboard screenshots, commit SHA, smoke output.
Watch period: continue normal monitoring for 30 minutes.
That closure message becomes the source material for the ticket, handover note, or change record. If the change was noisy or surprising, capture what should be improved for the next run instead of waiting for the next on-call engineer to rediscover it.
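To make capture-as-you-go cheap, the evidence commands in this section can be routed through one helper that stamps each output with the time and the command that produced it. This is a sketch under assumptions: `capture` and `EVIDENCE_DIR` are hypothetical names, and the directory default mirrors the example path used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and append its output, with a UTC
# timestamp and the command line, to a named file in the evidence directory.
EVIDENCE_DIR="${EVIDENCE_DIR:-/var/tmp/change-CHG-1234}"

capture() {
  local name=$1; shift
  mkdir -p "$EVIDENCE_DIR" || return 1
  {
    printf '# %s\n# command: %s\n' "$(date -u +%FT%TZ)" "$*"
    "$@" 2>&1
  } >> "$EVIDENCE_DIR/$name.txt"
}

# Usage:
# capture revision git rev-parse HEAD
# capture health   curl -fsS https://app.example.com/health
# capture journal  journalctl -u myapp --since "15 minutes ago" --no-pager
```

Because the helper appends, running the same capture before and after the change leaves both results in one file, each under its own timestamp. The helper returns the exit status of the captured command, so a failing check still reads as a failure.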
Failure modes
| Failure mode | Why it happens | Safer move |
|---|---|---|
| No real rollback plan | Team assumes "we can just undo it" without rehearsing how. | Write the actual revert command or package/image rollback before the window starts. |
| Validation is only service-up checks | The daemon is up but the business path is broken. | Add one end-to-end smoke test per critical path. |
| Unrelated background change confuses the result | Multiple deploys or operator edits overlap. | Freeze unrelated changes for the window. |
| Scope creep | "While we are here" turns one change into three. | Stop, re-scope, and reschedule the extra work. |
| Rollback takes too long | Evidence, access, or artifact references were not prepared. | Pre-stage the old config, previous image ID, or revert MR link. |
| No useful record afterwards | Evidence was never captured during the run. | Assign one watcher to keep timestamps, outputs, and dashboard notes live. |
Related pages: Generic, Infra Change Lifecycle, Service Troubleshooting, Observability Overview, Backup & Restore.