The Infra Change Lifecycle
- Overview: the full lifecycle
- 1. Understand the change
- 2. Find what to change
- 3. Branch and edit
- 4. Lint and validate locally
- 5. Dry-run against production inventory
- 6. Open the merge request
- 7. Pipeline runs automated checks
- 8. Peer review
- 9. Merge and deploy
- 10. Verify in production
- Rollback
- Shortcuts and when to use them
- Post-deploy smoke tests
- Dashboards checklist
Overview: the full lifecycle
Ticket → Understand → Find file → Branch → Edit → Lint → Dry-run → MR → CI → Review → Merge → Deploy → Verify
Each step catches a different class of problem. Skipping steps is how production outages happen.
1. Understand the change
Before touching any code, make sure you understand:
- What is the desired end state? (not "what should I change" but "what should the system look like after")
- Which hosts are affected? (one? a group? all?)
- Is this change reversible? What does rollback look like?
- Are there dependencies? (does this require a firewall change? a cert? a FreeIPA rule?)
- Is there a maintenance window? Can this be deployed during business hours?
If the ticket is unclear, ask before starting. A 10-minute conversation is faster than a 2-hour incident.
2. Find what to change
# Search for the variable or config option
grep -r "setting_name" inventories/ roles/
# Find which template generates the config
grep -r "setting_name" roles/*/templates/
# Check the role defaults to understand what is configurable
cat roles/nginx/defaults/main.yml
Common outcomes:
- Change a value — edit the right group_vars or host_vars file
- Add a new variable — add it to group_vars, make sure the role's template uses it
- Change role behaviour — edit the task or template file in the role
- New service on a host — add the role to site.yml or a playbook, add host to the group
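Before editing, it helps to see every place the variable is already assigned, because a value in host_vars overrides group_vars, which in turn overrides the role's defaults. The helper below is a minimal sketch of that search, assuming the inventories/ and roles/ layout used in this guide; the function name and the variable nginx_client_max_body_size are hypothetical.

```shell
# Sketch only: nginx_client_max_body_size is a hypothetical variable name.
# Lists every file and line where a variable is assigned under inventories/ and roles/.
find_var_definitions() {
    local var="$1"
    local root="${2:-.}"
    # Match the variable only where it is a YAML key ("name:"),
    # not where a template merely reads it.
    ( cd "$root" && grep -rnE "^[[:space:]]*${var}:" inventories roles 2>/dev/null ) | sort
}

# Example (run from the repository root):
#   find_var_definitions nginx_client_max_body_size
```

If the same key shows up in more than one layer, decide which layer should own the value before you change anything.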
3. Branch and edit
git checkout main
git pull origin main
git checkout -b feature/INF-1234-description-of-change
# Make your changes
# Use $EDITOR or your preferred tool
# Stage only what you intend to change
git add inventories/production/group_vars/webservers.yml
# Review the diff
git diff --staged
# Commit with a meaningful message
git commit \
-m "Update nginx client_max_body_size for webservers" \
-m "Ticket: INF-1234
Increasing from 1m to 64m to support large file uploads.
Applies to all hosts in the webservers group."
4. Lint and validate locally
# YAML syntax check
yamllint inventories/production/group_vars/webservers.yml
# Ansible lint — checks for best practice violations
ansible-lint
# Syntax check the playbook
ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini
Fix any errors before proceeding. Lint failures in CI will block your MR anyway — catch them now.
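The three checks above can be wrapped so that one failure does not hide the others. This is a minimal sketch, not a prescribed tool: run_step, lint_summary and FAILED_STEPS are hypothetical names, and the commented usage simply repeats the commands from this step.

```shell
# Sketch: run each check, keep going after a failure, and report a summary.
FAILED_STEPS=""

run_step() {
    local name="$1"
    shift
    # Run the command as given; its own output is left visible.
    if "$@"; then
        printf 'OK   %s\n' "$name"
    else
        printf 'FAIL %s\n' "$name"
        FAILED_STEPS="${FAILED_STEPS}${FAILED_STEPS:+, }${name}"
    fi
}

lint_summary() {
    if [ -n "$FAILED_STEPS" ]; then
        printf 'Failed: %s\n' "$FAILED_STEPS"
        return 1
    fi
    printf 'All checks passed\n'
}

# Usage:
#   run_step yamllint     yamllint inventories/production/group_vars/webservers.yml
#   run_step ansible-lint ansible-lint
#   run_step syntax-check ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini
#   lint_summary
```

Seeing every failure in one pass usually saves a round trip compared with fixing them one at a time.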
5. Dry-run against production inventory
# Full dry-run with diff — see exactly what would change
ansible-playbook site.yml \
--check --diff \
-i inventories/production/hosts.ini \
--limit webservers # only run against webservers group
# Narrow further to a single host if possible
ansible-playbook site.yml \
--check --diff \
-i inventories/production/hosts.ini \
--limit web01.example.com
Review the diff carefully:
- Do you see only the files you expected to change?
- Does the diff show the right before/after values?
- Are any unexpected files changing? (indicates a variable collision or wider impact)
- Are any hosts changing that should not be? (check your --limit)
Paste the --check --diff output into the MR description or attach it to the ticket.
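To answer "are any hosts changing that should not be?" quickly, the PLAY RECAP at the end of the saved output can be filtered for hosts with a non-zero changed count. The function below is a sketch that assumes the output was saved without colour codes; changed_hosts and dry-run.log are hypothetical names.

```shell
# Sketch: list hosts that would change, from saved --check --diff output.
changed_hosts() {
    # PLAY RECAP lines look like:
    #   host : ok=5 changed=1 unreachable=0 failed=0 ...
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && match($0, /changed=[0-9]+/) {
            # "changed=" is 8 characters; the rest of the match is the count.
            n = substr($0, RSTART + 8, RLENGTH - 8)
            if (n + 0 > 0) print $1, n
        }
    ' "$@"
}

# Usage:
#   ansible-playbook site.yml --check --diff \
#     -i inventories/production/hosts.ini --limit webservers | tee dry-run.log
#   changed_hosts dry-run.log
```

Each output line is a host name followed by the number of tasks that would change on it; compare that list with the hosts named in the ticket.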
6. Open the merge request
git push -u origin feature/INF-1234-description-of-change
Open the MR in GitLab. Use a description template like:
## What
[What is being changed]
## Why
Ticket: INF-1234 — [Brief description from ticket]
## Hosts affected
[list of hosts or groups]
## Dry-run output
[paste --check --diff output or attach file]
## Rollback
[How to revert: revert this MR commit, or manual steps]
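Filling in the template by hand is easy to get wrong under time pressure. As a sketch, a small function can render it from the ticket number, the affected hosts and the saved dry-run file; render_mr_description and its arguments are hypothetical, and the placeholders that need human judgment are left in brackets.

```shell
# Sketch: render the MR description template with the dry-run output included.
render_mr_description() {
    local ticket="$1"
    local hosts="$2"
    local dryrun_file="$3"
    printf '## What\n[What is being changed]\n\n'
    printf '## Why\nTicket: %s\n\n' "$ticket"
    printf '## Hosts affected\n%s\n\n' "$hosts"
    printf '## Dry-run output\n\n'
    # Indent by four spaces so the output renders as preformatted text.
    sed 's/^/    /' "$dryrun_file"
    printf '\n## Rollback\n[How to revert: revert this MR commit, or manual steps]\n'
}

# Usage:
#   render_mr_description INF-1234 webservers dry-run.log > mr-description.md
```

The What and Rollback sections still need to be written by the author; the script only removes the copy-and-paste steps.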
7. Pipeline runs automated checks
When you push and open an MR, the CI pipeline runs automatically. Typical jobs:
- ansible-lint — must pass before anything else runs
- syntax-check — verifies all playbooks parse correctly
- dry-run (optional) — runs --check --diff against the production or staging inventory; output saved as an artifact
If the pipeline fails: click the failed job in GitLab, read the log, fix the issue, push again. The pipeline re-runs automatically.
8. Peer review
Assign a reviewer who knows the relevant system. What a good reviewer looks at:
- Is the change in the right file? (group_vars vs host_vars, right group name)
- Does the value make sense? Is it correct syntax/format for what the role expects?
- Are there unintended side effects? (variable used elsewhere, other roles referencing it)
- Is the dry-run output as expected?
- Is there a documented rollback plan for high-risk changes?
Respond to comments, push updates, and mark conversations resolved when addressed.
9. Merge and deploy
Once the MR is approved and the pipeline is green, merge it.
Depending on your pipeline setup:
- Auto-deploy on merge — pipeline on main automatically runs the playbook. Watch the pipeline.
- Manual deploy job — click the play button on the deploy job in the pipeline. Review the job log as it runs.
- No CI deploy — clone main and run the playbook manually with the correct inventory and limit.
# Manual deploy (if no CI automation)
git checkout main
git pull origin main
ansible-playbook site.yml \
-i inventories/production/hosts.ini \
--limit webservers \
--tags nginx
# Run everything EXCEPT a specific tag (e.g. skip a long data migration)
ansible-playbook site.yml \
-i inventories/production/hosts.ini \
--skip-tags migrate-db
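A caveat about the dry-run from step 5 before you trust it for the real deploy: command and shell tasks are skipped rather than executed in check mode (unless a task sets check_mode: false), so later tasks that depend on their registered output can fail or report misleading results. template and copy report changes reliably in check mode; command and shell do not. Always treat --check output as a guide, not a guarantee.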
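Since command and shell tasks are the ones a dry-run cannot vouch for, it can be useful to list them for the roles you are about to deploy and review those by hand. The search below is a rough sketch: it assumes task files end in .yml or .yaml and that the module name appears as a YAML key, and the function name is hypothetical.

```shell
# Sketch: find command/shell tasks, whose check-mode results need manual review.
unreliable_in_check_mode() {
    local root="${1:-roles}"
    # Matches "command:", "shell:" and their ansible.builtin.* forms,
    # with or without a leading list dash.
    find "$root" -type f \( -name '*.yml' -o -name '*.yaml' \) \
        -exec grep -nHE '^[[:space:]]*(-[[:space:]]+)?(ansible\.builtin\.)?(command|shell):' {} + \
        | sort
}

# Usage (from the repository root):
#   unreliable_in_check_mode roles/nginx
```

An empty result means the dry-run diff for that role is about as trustworthy as check mode gets.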
10. Verify in production
After deployment, confirm the change took effect and the service is healthy:
# Check the service is running
systemctl status nginx
# Check the config file has the expected content
# nginx.conf often includes conf.d/ — search there too
grep -r client_max_body_size /etc/nginx/
# Validate the live config
nginx -t
# Test the service responds
curl -v http://app.example.com/health
# Check logs for errors since the deployment
journalctl -u nginx --since "5 minutes ago"
Only close the ticket once you have confirmed the change works as expected.
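If the service responds but something still looks wrong, verify at the network level. Run tcpdump -i any -w /tmp/verify.pcap port 443 for about a minute while you reproduce the request, then open the capture in Wireshark. Filter on tls.handshake to inspect the TLS negotiation, or on http.request if the traffic is plaintext or has been decrypted, to see what the client and server are actually exchanging. To confirm a port is exposed from a peer's perspective (for example from a bastion, and only against hosts that are in scope for you to scan), use nmap: nmap -sV -p 443 app.example.com.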
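The config check in this step can also be scripted so that it fails loudly when a stale value survives in an included file. This is a sketch under the assumption that the directive appears as "name value;" on one line; directive_values and check_directive are hypothetical helpers, while nginx -T is the standard flag that tests the configuration and dumps it to standard output.

```shell
# Sketch: extract every value of an nginx directive, ignoring commented lines.
directive_values() {
    local directive="$1"
    shift
    awk -v d="$directive" '
        $1 == d { v = $2; sub(/;.*$/, "", v); print v }
    ' "$@"
}

# Succeeds only if the directive is set and every occurrence has the expected value.
check_directive() {
    local directive="$1"
    local expected="$2"
    shift 2
    local values
    values=$(directive_values "$directive" "$@" | sort -u)
    if [ "$values" = "$expected" ]; then
        printf 'OK   %s = %s\n' "$directive" "$expected"
    else
        printf 'FAIL %s: expected %s, found: %s\n' "$directive" "$expected" \
            "$(printf '%s' "$values" | tr '\n' ' ')"
        return 1
    fi
}

# Usage on the host:
#   nginx -T 2>/dev/null | check_directive client_max_body_size 64m
```

A failure that lists two values usually means an older include still sets the directive somewhere under /etc/nginx/.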
Rollback
Something went wrong after merge. Act quickly:
Option 1: Revert the MR commit
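# Start a revert branch from an up-to-date main
git checkout main && git pull origin main
git checkout -b revert/fix-bad-change
# Find the merge commit
git log --oneline --merges -5
# Revert it (-m 1 keeps main's side; drop -m 1 if the MR was squashed or fast-forwarded)
git revert -m 1 COMMIT_HASH
# Push and open an MR for the revert
git push -u origin revert/fix-bad-change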
Option 2: Quick emergency fix without waiting for review
git checkout main && git pull
git checkout -b hotfix/urgent-rollback
# Edit the file back to the previous value manually
# Stage only the file you changed
git add inventories/production/group_vars/webservers.yml
git commit -m "hotfix: revert bad setting causing nginx errors"
git push -u origin hotfix/urgent-rollback
# Open MR (mark as urgent), deploy, then get it reviewed and merged after the fact
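One detail of Option 1 is easy to trip over under pressure: git revert -m 1 only works on a merge commit, and git rejects it for a squashed or fast-forwarded MR. The sketch below prints the appropriate command by counting the commit's parents; revert_command_for is a hypothetical helper and it deliberately prints the command instead of running it.

```shell
# Sketch: print the revert command that matches the commit type.
revert_command_for() {
    local commit="$1"
    local words
    # "git rev-list --parents -n 1" prints the commit hash followed by each
    # parent hash, so three or more words means a merge commit.
    words=$(git rev-list --parents -n 1 "$commit" | wc -w)
    if [ "$words" -gt 2 ]; then
        printf 'git revert -m 1 %s\n' "$commit"
    else
        printf 'git revert %s\n' "$commit"
    fi
}

# Usage:
#   revert_command_for COMMIT_HASH
```

Printing the command first gives you a moment to check the hash before anything is rewritten.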
Shortcuts and when to use them
In real-world situations, shortcuts exist. Use them consciously, not by default:
- Skip the MR for a hotfix — apply directly and open a post-hoc MR for the record. Only for genuine production emergencies where minutes matter.
- Skip the dry-run — acceptable for trivially safe changes (adding a new comment, changing a log level). Never skip for config changes that affect running services.
- Use -e for one-time override — fine for testing a value. Always follow up with a proper change in the repo.
Post-deploy smoke tests
Step 10 ("Verify in production") should not end at systemctl status. Run a short, scripted set of smoke tests that exercise the change from a real caller's perspective — they turn "the daemon is running" into "the feature actually works end-to-end". Keep them idempotent, read-only, and fast enough to run after every deploy.
# A minimal curl-based smoke test harness, runnable from your laptop or a CI job.
# With set -e, the script stops at the first failing check.
set -euo pipefail
smoke() {
  local name="$1" expected="$2" ; shift 2
  local got
  # No -f here: it makes curl exit non-zero on 4xx/5xx, which would turn the
  # expected 401/405 responses below into "network-error".
  got=$(curl -sS -o /dev/null -w '%{http_code}' "$@") || got="network-error"
  if [[ "$got" != "$expected" ]]; then
    printf 'FAIL %-30s got=%s expected=%s\n' "$name" "$got" "$expected" >&2
    return 1
  fi
  printf 'OK %-30s %s\n' "$name" "$got"
}
smoke "health endpoint" 200 https://app.example.com/health
smoke "static asset" 200 https://app.example.com/static/app.css
smoke "api login (GET)" 405 https://app.example.com/api/login # method not allowed proves the route exists
smoke "auth required" 401 https://app.example.com/api/profile # 401 without a token
# curl verifies the certificate by default, so a 200 here also proves the TLS chain is valid
smoke "tls valid" 200 --resolve "app.example.com:443:$(dig +short app.example.com | head -1)" https://app.example.com/health
For fleet-wide verification — "did the config land on every webserver?" — an Ansible ad-hoc one-liner is usually the fastest answer:
# Confirm nginx is active and the new directive is in the generated config
ansible webservers -i inventories/production/hosts.ini -m shell \
-a "systemctl is-active nginx && grep -H client_max_body_size /etc/nginx/conf.d/*.conf"
# Hit the health endpoint from each host (internal-side check, bypasses LB)
ansible webservers -i inventories/production/hosts.ini -m uri \
-a "url=http://127.0.0.1/health status_code=200"
Commit the smoke-test script alongside the role or inventory it validates (e.g. scripts/smoke/webservers.sh). Run it automatically as the last stage of the deploy pipeline and again manually after any out-of-band change. A deploy that passed Ansible but fails smoke is almost always a DNS, certificate, or firewall mismatch — catch it in the five-minute window while you're still paying attention, not at 02:00 when pager duty calls.
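When smoke tests run as the last pipeline stage, the service may still be reloading for a few seconds, so a single failed probe is not always a real failure. A bounded retry is one way to absorb that; the sketch below is an assumption about how you might do it, not part of the harness above, and the attempt count and delay are illustrative.

```shell
# Sketch: retry a command a fixed number of times with a fixed delay.
retry() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            printf 'gave up after %s attempts: %s\n' "$attempts" "$*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage: wait up to roughly 15 seconds for the health endpoint before the real tests
#   retry 5 3 curl -fsS -o /dev/null https://app.example.com/health
```

Keep the retry budget small so that a genuinely broken deploy still fails within the window where you are watching.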
Dashboards checklist
After merge, the second tab you should open (right after the deploy pipeline) is the service's dashboard. A useful dashboard answers the only five questions that matter during a deploy; if any are missing, the dashboard is incomplete and this is the moment to fix it, not during the next incident.
| Panel | What it proves | Typical signal |
|---|---|---|
| Service up | The process is running and reachable from the outside. | Prometheus up{job="app"}, blackbox-exporter HTTP 200, or an uptime ping from a remote probe. |
| Error rate | Requests are succeeding, not just arriving. | 5xx-per-second, app-level exception count, nginx 5xx from access log, or http_requests_total{status=~"5.."}. |
| Latency | Users are getting timely answers, not slow ones. | p50 / p95 / p99 request duration; watch p99 — averages hide slow tails. |
| Saturation | The service has headroom under current load. | CPU %, memory RSS vs limit, worker pool utilisation, connection pool usage, thread/goroutine count. |
| Queue depth | Async work isn't backing up (if the service has a queue). | Redis / RabbitMQ / SQS length, worker lag, oldest unprocessed message age. |
Watch all five for 5–10 minutes after a deploy before declaring success. If any panel moves in a direction you didn't expect — even if smoke tests passed — treat it as a red flag and have the revert command ready to paste. A green CI plus a suspicious dashboard is how quiet regressions reach customers.
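If the error-rate panel is missing or lagging, the nginx access log can give a rough number while you wait. The function below is a sketch that assumes the default "combined" log format, where the status code is the ninth whitespace-separated field; error_rate_5xx is a hypothetical name and the log path in the usage line is the common default, not something this guide specifies.

```shell
# Sketch: percentage of 5xx responses in combined-format access log lines.
error_rate_5xx() {
    awk '
        # Skip malformed lines where field 9 is not a three-digit status.
        $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", errors * 100 / total
        }
    ' "$@"
}

# Usage on a webserver:
#   tail -n 1000 /var/log/nginx/access.log | error_rate_5xx
```

Run it once before the deploy and once after; the absolute number matters less than whether it moved.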