The Infra Change Lifecycle

End-to-end: from ticket to production deployment — what happens at each step and why.

Overview: the full lifecycle

Ticket → Understand → Find file → Branch → Edit → Lint → Dry-run → MR → CI → Review → Merge → Deploy → Verify

Each step catches a different class of problem. Skipping steps is how production outages happen.

1. Understand the change

Before touching any code, make sure you understand:

What is the desired end state? (not "what should I change" but "what should the system look like after")
Which hosts are affected? (one? a group? all?)
Is this change reversible? What does rollback look like?
Are there dependencies? (does this require a firewall change? a cert? a FreeIPA rule?)
Is there a maintenance window? Can this be deployed during business hours?

If the ticket is unclear, ask before starting. A 10-minute conversation is faster than a 2-hour incident.

2. Find what to change

# Search for the variable or config option
grep -r "setting_name" inventories/ roles/

# Find which template generates the config
grep -r "setting_name" roles/*/templates/

# Check the role defaults to understand what is configurable
cat roles/nginx/defaults/main.yml

Common outcomes:

Change a value — edit the right group_vars or host_vars file
Add a new variable — add it to group_vars, make sure the role's template uses it
Change role behaviour — edit the task or template file in the role
New service on a host — add the role to site.yml or a playbook, add host to the group
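
Before choosing between these outcomes, it helps to know where a variable is already set. The following is a minimal sketch, not part of the repository: the function name `find_var_definitions` is hypothetical, and it assumes variables are declared as top-level `name:` keys in `.yml` files.

```shell
# List "file:line:content" for every top-level definition of a variable.
# Usage: find_var_definitions VAR_NAME DIR [DIR...]
find_var_definitions() {
    var_name="$1"
    shift
    # Match only lines that START with "var_name:" so nested keys and
    # template references are not reported.
    find "$@" -type f -name '*.yml' -exec \
        grep -nHE "^${var_name}:" {} +
}

# Example (directories from this guide):
# find_var_definitions setting_name inventories/ roles/
```

If the same variable shows up in both a group_vars file and a role's defaults, the group_vars value normally takes precedence, so that is usually the file to edit.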

3. Branch and edit

git checkout main
git pull origin main
git checkout -b feature/INF-1234-description-of-change

# Make your changes
# Use $EDITOR or your preferred tool

# Stage only what you intend to change
git add inventories/production/group_vars/webservers.yml

# Review the diff
git diff --staged

# Commit with a meaningful message
git commit -m "Update nginx client_max_body_size for webservers

Ticket: INF-1234
Increasing from 1m to 64m to support large file uploads.
Applies to all hosts in the webservers group."
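
One way to make the ticket reference hard to forget is a commit-msg hook. This is a sketch under assumptions: the helper name `has_ticket_id` is hypothetical, and the `INF-` prefix is taken from the examples in this guide rather than from any confirmed tracker configuration.

```shell
# Return 0 if the commit message text contains a ticket reference such as
# "INF-1234"; return 1 otherwise.
has_ticket_id() {
    printf '%s\n' "$1" | grep -Eq 'INF-[0-9]+'
}

# Intended use inside .git/hooks/commit-msg, where git passes the path of
# the file holding the proposed message as "$1":
# has_ticket_id "$(cat "$1")" || { echo "Commit message needs a ticket ID" >&2; exit 1; }
```

The hook is local to each clone, so it is a convenience for the author, not an enforcement mechanism.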

4. Lint and validate locally

# YAML syntax check
yamllint inventories/production/group_vars/webservers.yml

# Ansible lint — checks for best practice violations
ansible-lint

# Syntax check the playbook
ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini

Fix any errors before proceeding. Lint failures in CI will block your MR anyway — catch them now.
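
The three checks above can be chained so that one command runs them in order and stops at the first failure. A minimal sketch, assuming the tools are installed and the repository layout matches this guide; `run_step` and `local_checks` are hypothetical names.

```shell
# Run one named check; print a PASS/FAIL line and propagate the exit status.
run_step() {
    step_name="$1"
    shift
    if "$@"; then
        printf 'PASS  %s\n' "$step_name"
    else
        printf 'FAIL  %s\n' "$step_name" >&2
        return 1
    fi
}

# Run the local checks from this section, stopping at the first failure.
# Usage: local_checks inventories/production/group_vars/webservers.yml
local_checks() {
    changed_file="$1"
    run_step "yamllint"     yamllint "$changed_file" &&
    run_step "ansible-lint" ansible-lint &&
    run_step "syntax-check" ansible-playbook site.yml --syntax-check \
        -i inventories/production/hosts.ini
}
```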

5. Dry-run against production inventory

# Full dry-run with diff — see exactly what would change
ansible-playbook site.yml \
  --check --diff \
  -i inventories/production/hosts.ini \
  --limit webservers         # only run against webservers group

# Narrow further to a single host if possible
ansible-playbook site.yml \
  --check --diff \
  -i inventories/production/hosts.ini \
  --limit web01.example.com

Review the diff carefully:

Do you see only the files you expected to change?
Does the diff show the right before/after values?
Are any unexpected files changing? (indicates a variable collision or wider impact)
Are any hosts changing that should not be? (check your --limit)

Paste the --check --diff output into the MR description or attach it to the ticket.
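
To answer "are any hosts changing that should not be?" quickly, the PLAY RECAP at the end of the output can be filtered for hosts with a non-zero changed count. This sketch assumes the standard recap line format (`host : ok=N changed=N ...`) and a saved log file; the function name `changed_hosts` and the file name `dryrun.log` are hypothetical.

```shell
# Read ansible-playbook output on stdin and print every host whose
# PLAY RECAP line reports changed > 0.
changed_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && match($0, /changed=[0-9]+/) {
            # "changed=" is 8 characters long; the digits follow it.
            count = substr($0, RSTART + 8, RLENGTH - 8)
            if (count + 0 > 0) print $1
        }
    '
}

# Example:
# ansible-playbook site.yml --check --diff \
#   -i inventories/production/hosts.ini --limit webservers | tee dryrun.log
# changed_hosts < dryrun.log
```

In check mode the changed count is a prediction, so the list shows hosts that would change, not hosts that did.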

6. Open the merge request

git push -u origin feature/INF-1234-description-of-change

Open the MR in GitLab. Use a description template like:

## What
[What is being changed]

## Why
Ticket: INF-1234 — [Brief description from ticket]

## Hosts affected
[list of hosts or groups]

## Dry-run output
[paste --check --diff output or attach file]

## Rollback
[How to revert: revert this MR commit, or manual steps]
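
If the dry-run output is already saved to a file, the description can be generated and pasted rather than typed by hand. A minimal sketch following the template above; the function name `mr_description` and its argument order are hypothetical.

```shell
# Print an MR description that follows the template in this section.
# Usage: mr_description TICKET SUMMARY HOSTS DRYRUN_FILE [ROLLBACK]
mr_description() {
    ticket="$1"
    summary="$2"
    hosts="$3"
    dryrun_file="$4"
    rollback="${5:-Revert this MR commit.}"
    cat <<EOF
## What
$summary

## Why
Ticket: $ticket

## Hosts affected
$hosts

## Dry-run output
$(cat "$dryrun_file")

## Rollback
$rollback
EOF
}

# Example:
# mr_description INF-1234 "Raise client_max_body_size to 64m" webservers dryrun.log
```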

7. Pipeline runs automated checks

When you push and open an MR, the CI pipeline runs automatically. Typical jobs:

ansible-lint — must pass before anything else runs
syntax-check — verifies all playbooks parse correctly
dry-run (optional) — runs --check --diff against the production or staging inventory; output saved as an artifact

If the pipeline fails: click the failed job in GitLab, read the log, fix the issue, push again. The pipeline re-runs automatically.

8. Peer review

Assign a reviewer who knows the relevant system. What a good reviewer looks at:

Is the change in the right file? (group_vars vs host_vars, right group name)
Does the value make sense? Is it correct syntax/format for what the role expects?
Are there unintended side effects? (variable used elsewhere, other roles referencing it)
Is the dry-run output as expected?
Is there a documented rollback plan for high-risk changes?

Respond to comments, push updates, and mark conversations resolved when addressed.

9. Merge and deploy

Once the MR is approved and the pipeline is green, merge it.

Depending on your pipeline setup:

Auto-deploy on merge — pipeline on main automatically runs the playbook. Watch the pipeline.
Manual deploy job — click the play button on the deploy job in the pipeline. Review the job log as it runs.
No CI deploy — clone main and run the playbook manually with the correct inventory and limit.

# Manual deploy (if no CI automation)
git checkout main
git pull origin main
ansible-playbook site.yml \
  -i inventories/production/hosts.ini \
  --limit webservers \
  --tags nginx

# Run everything EXCEPT a specific tag (e.g. skip a long data migration)
ansible-playbook site.yml \
  -i inventories/production/hosts.ini \
  --skip-tags migrate-db

--check mode caveats: not every module supports check mode. By default, command and shell tasks are skipped in check mode, so later tasks that depend on their registered output can fail or report misleading results. template and copy report their changes reliably in check mode; command and shell do not. Treat --check output as a guide, not a guarantee.
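
To see which tasks a dry-run cannot vouch for, you can list every command and shell task in the roles you are about to run. A minimal sketch; the function name `list_unchecked_tasks` is hypothetical, and the pattern only catches the module written as a YAML key, with or without the `ansible.builtin.` prefix.

```shell
# Print "file:line:content" for every task that uses the command or shell module.
# Usage: list_unchecked_tasks roles/
list_unchecked_tasks() {
    find "$@" -type f \( -name '*.yml' -o -name '*.yaml' \) -exec \
        grep -nHE '^[[:space:]]*(-[[:space:]]+)?(ansible\.builtin\.)?(command|shell):' {} +
}
```

Treat the result as a starting list for manual review: a task on it may still be safe, but the --check output says nothing about what it would do.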

10. Verify in production

After deployment, confirm the change took effect and the service is healthy:

# Check the service is running
systemctl status nginx

# Check the config file has the expected content
# nginx.conf often includes conf.d/ — search there too
grep -r client_max_body_size /etc/nginx/

# Validate the live config
nginx -t

# Test the service responds
curl -v http://app.example.com/health

# Check logs for errors since the deployment
journalctl -u nginx --since "5 minutes ago"

Only close the ticket once you have confirmed the change works as expected.

If the service responds incorrectly but the logs look clean, the problem is often on the wire. Start a capture with tcpdump -i any -w /tmp/verify.pcap port 443, reproduce the request while it runs (a minute is usually enough), then open the file in Wireshark. Filter on tls.handshake or http.request to see what the client and server actually exchange; http.request only matches when the traffic is unencrypted or Wireshark has the keys to decrypt it. To confirm that a port is reachable from a peer's perspective, run nmap from a bastion or another host that is in scope for scanning: nmap -sV -p 443 app.example.com.

Rollback

Something went wrong after merge. Act quickly:

Option 1: Revert the MR commit

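# Start a revert branch from an up-to-date main
git checkout main && git pull origin main
git checkout -b revert/fix-bad-change

# Find the merge commit
git log --oneline -5

# Revert it (-m 1 applies only to merge commits; omit it for a squashed or fast-forwarded MR)
git revert -m 1 COMMIT_HASH

# Push and open an MR for the revert
git push -u origin revert/fix-bad-change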

Option 2: Quick emergency fix without waiting for review

git checkout main && git pull
git checkout -b hotfix/urgent-rollback

# Edit the file back to the previous value manually, then stage only that file
git add inventories/production/group_vars/webservers.yml
git commit -m "hotfix: revert bad setting causing nginx errors"
git push -u origin hotfix/urgent-rollback

# Open the MR (mark as urgent), deploy, then have it reviewed and merged after the fact

Shortcuts and when to use them

In real-world situations, shortcuts exist. Use them consciously, not by default:

Skip the MR for a hotfix — apply directly and open a post-hoc MR for the record. Only for genuine production emergencies where minutes matter.
Skip the dry-run — acceptable for trivially safe changes (adding a new comment, changing a log level). Never skip for config changes that affect running services.
Use -e for one-time override — fine for testing a value. Always follow up with a proper change in the repo.

Config drift happens when shortcuts become habits. If you regularly make changes directly on hosts without going through the repo, the repo no longer reflects reality — and the next time Ansible runs, it will undo your manual changes.

Post-deploy smoke tests

Step 10 ("Verify in production") should not end at systemctl status. Run a short, scripted set of smoke tests that exercise the change from a real caller's perspective — they turn "the daemon is running" into "the feature actually works end-to-end". Keep them idempotent, read-only, and fast enough to run after every deploy.

# A minimal curl-based smoke test harness — runnable from your laptop or a CI job
set -euo pipefail

smoke() {
    local name="$1" expected="$2" ; shift 2
    local got
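    # No -f here: with -f, curl exits non-zero on 4xx/5xx, so the expected 401/405 checks could never pass
    got=$(curl -sS -o /dev/null -w '%{http_code}' "$@") || got="network-error"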
    if [[ "$got" != "$expected" ]]; then
        printf 'FAIL  %-30s got=%s expected=%s\n' "$name" "$got" "$expected" >&2
        return 1
    fi
    printf 'OK    %-30s %s\n' "$name" "$got"
}

smoke "health endpoint"   200 https://app.example.com/health
smoke "static asset"      200 https://app.example.com/static/app.css
smoke "api login (GET)"   405 https://app.example.com/api/login       # method not allowed proves the route exists
smoke "auth required"     401 https://app.example.com/api/profile     # 401 without a token
smoke "tls valid"         200 --resolve app.example.com:443:$(dig +short app.example.com | head -1) https://app.example.com/health

For fleet-wide verification — "did the config land on every webserver?" — an Ansible ad-hoc one-liner is usually the fastest answer:

# Confirm nginx is active and the new directive is in the generated config
ansible webservers -i inventories/production/hosts.ini -m shell \
  -a "systemctl is-active nginx && grep -H client_max_body_size /etc/nginx/conf.d/*.conf"

# Hit the health endpoint from each host (internal-side check, bypasses LB)
ansible webservers -i inventories/production/hosts.ini -m uri \
  -a "url=http://127.0.0.1/health status_code=200"

Commit the smoke-test script alongside the role or inventory it validates (e.g. scripts/smoke/webservers.sh). Run it automatically as the last stage of the deploy pipeline and again manually after any out-of-band change. A deploy that passed Ansible but fails smoke is almost always a DNS, certificate, or firewall mismatch — catch it in the five-minute window while you're still paying attention, not at 02:00 when pager duty calls.
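
When the script runs as a pipeline stage, it is often more useful to see every failing check than to stop at the first one, which is what a harness running under set -e does. The sketch below is one way to collect failures and set the stage's exit status at the end; `check` and `summary` are hypothetical names, and the wiring example reuses the health URL from this guide.

```shell
failures=0

# Run one check command; record a failure instead of exiting so that the
# remaining checks still run. Call it directly, not inside $( ), or the
# counter update is lost in a subshell.
check() {
    check_name="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'OK    %s\n' "$check_name"
    else
        printf 'FAIL  %s\n' "$check_name" >&2
        failures=$((failures + 1))
    fi
}

# Print the failure count; return 0 only if every check passed.
summary() {
    printf '%s check(s) failed\n' "$failures"
    [ "$failures" -eq 0 ]
}

# Example wiring:
# check "health endpoint" curl -fsS https://app.example.com/health
# summary || exit 1
```

Here curl's -f flag is appropriate, because each check only needs to know whether the request succeeded, not which status code came back.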

Dashboards checklist

After merge, the second tab you should open (right after the deploy pipeline) is the service's dashboard. A useful dashboard answers the only five questions that matter during a deploy; if any are missing, the dashboard is incomplete and this is the moment to fix it, not during the next incident.

Panel	What it proves	Typical signal
Service up	The process is running and reachable from the outside.	Prometheus `up{job="app"}`, blackbox-exporter HTTP 200, or an uptime ping from a remote probe.
Error rate	Requests are succeeding, not just arriving.	5xx-per-second, app-level exception count, `nginx` 5xx from access log, or `http_requests_total{status=~"5.."}`.
Latency	Users are getting timely answers, not slow ones.	p50 / p95 / p99 request duration; watch p99 — averages hide slow tails.
Saturation	The service has headroom under current load.	CPU %, memory RSS vs limit, worker pool utilisation, connection pool usage, thread/goroutine count.
Queue depth	Async work isn't backing up (if the service has a queue).	Redis / RabbitMQ / SQS length, worker lag, oldest unprocessed message age.

Watch all five for 5–10 minutes after a deploy before declaring success. If any panel moves in a direction you didn't expect — even if smoke tests passed — treat it as a red flag and have the revert command ready to paste. A green CI plus a suspicious dashboard is how quiet regressions reach customers.
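
If the error-rate panel is missing or you want a second opinion from the host itself, the 5xx share can be computed from the nginx access log. A rough sketch, assuming the default "combined" log format (status code in the 9th whitespace-separated field) and the common log path /var/log/nginx/access.log; the function name `error_rate_5xx` is hypothetical.

```shell
# Read access-log lines on stdin and print the percentage of requests
# that returned a 5xx status, with two decimal places.
error_rate_5xx() {
    awk '
        { total++ }
        $9 ~ /^5[0-9][0-9]$/ { errors++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * errors / total
        }
    '
}

# Example: error rate over the most recent 1000 requests
# tail -n 1000 /var/log/nginx/access.log | error_rate_5xx
```

Compare the number before and after the deploy; the trend matters more than the absolute value.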
