The Infra Change Lifecycle

End-to-end: from ticket to production deployment — what happens at each step and why.

Overview: the full lifecycle

Ticket → Understand → Find file → Branch → Edit → Lint → Dry-run → MR → CI → Review → Merge → Deploy → Verify

Each step catches a different class of problem. Skipping steps is how production outages happen.

1. Understand the change

Before touching any code, make sure you understand what is being changed, why, which hosts are affected, and how the change would be rolled back.

If the ticket is unclear, ask before starting. A 10-minute conversation is faster than a 2-hour incident.

2. Find what to change

# Search for the variable or config option
grep -r "setting_name" inventories/ roles/

# Find which template generates the config
grep -r "setting_name" roles/*/templates/

# Check the role defaults to understand what is configurable
cat roles/nginx/defaults/main.yml

Common outcomes:

3. Branch and edit

git checkout main
git pull origin main
git checkout -b feature/INF-1234-description-of-change

# Make your changes
# Use $EDITOR or your preferred tool

# Stage only what you intend to change
git add inventories/production/group_vars/webservers.yml

# Review the diff
git diff --staged

# Commit with a meaningful message
git commit -m "Update nginx client_max_body_size for webservers

Ticket: INF-1234
Increasing from 1m to 64m to support large file uploads.
Applies to all hosts in the webservers group."
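
The commit message above follows a simple convention: a summary line, then a Ticket: line, then the reasoning. A minimal sketch of how that convention could be checked locally, for instance from a commit-msg hook. The function name has_ticket_ref and the INF-1234-style pattern are assumptions made for illustration; they are not an existing tool in the repo.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the commit message file contains a
# line such as "Ticket: INF-1234" (pattern assumed from the example above).
has_ticket_ref() {
    local msg_file="$1"
    grep -Eq '^Ticket: [A-Z]+-[0-9]+' "$msg_file"
}

# In .git/hooks/commit-msg, Git passes the message file path as "$1":
#   has_ticket_ref "$1" || { echo "add a 'Ticket: INF-1234' line" >&2; exit 1; }
```

Keeping the check in a function makes it easy to reuse the same rule in a CI job later, if the team decides to enforce it there as well.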

4. Lint and validate locally

# YAML syntax check
yamllint inventories/production/group_vars/webservers.yml

# Ansible lint — checks for best practice violations
ansible-lint

# Syntax check the playbook
ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini

Fix any errors before proceeding. Lint failures in CI will block your MR anyway — catch them now.
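
The three checks above can be wrapped so they always run in the same order and stop at the first failure. This is only a sketch: run_step and preflight are hypothetical helper names, and the playbook name site.yml is taken from the examples in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one named check and report OK or FAIL.
# The check's own output is left untouched so lint errors stay visible.
run_step() {
    local name="$1"; shift
    if "$@"; then
        printf 'OK    %s\n' "$name"
    else
        printf 'FAIL  %s\n' "$name" >&2
        return 1
    fi
}

# Run the local checks in order; && stops the chain at the first failure.
preflight() {
    local vars_file="$1" inventory="$2"
    run_step "yamllint"     yamllint "$vars_file" &&
    run_step "ansible-lint" ansible-lint &&
    run_step "syntax-check" ansible-playbook site.yml --syntax-check -i "$inventory"
}
```

A typical call would be preflight inventories/production/group_vars/webservers.yml inventories/production/hosts.ini, run from the repository root before pushing.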

5. Dry-run against production inventory

# Full dry-run with diff — see exactly what would change
ansible-playbook site.yml \
  --check --diff \
  -i inventories/production/hosts.ini \
  --limit webservers         # only run against webservers group

# Narrow further to a single host if possible
ansible-playbook site.yml \
  --check --diff \
  -i inventories/production/hosts.ini \
  --limit web01.example.com

Review the diff carefully:

Paste the --check --diff output into the MR description or attach it to the ticket.
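
When the dry-run covers many hosts, it helps to see at a glance which of them would actually change. The sketch below assumes the dry-run output was saved to a file (for example by piping ansible-playbook through tee) and that it ends with Ansible's usual PLAY RECAP block. The function name changed_hosts is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list hosts whose PLAY RECAP line reports changed > 0.
# Recap lines look like:
#   web01.example.com : ok=12 changed=1 unreachable=0 failed=0 ...
changed_hosts() {
    local log_file="$1"
    awk '
        /^PLAY RECAP/ { recap = 1; next }      # only parse lines after the recap header
        recap && /changed=[0-9]+/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^changed=/) {
                    split($i, kv, "=")
                    if (kv[2] + 0 > 0) print $1   # $1 is the host name
                }
            }
        }
    ' "$log_file"
}
```

If the list contains hosts you did not expect, stop and work out why before opening the MR.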

6. Open the merge request

git push -u origin feature/INF-1234-description-of-change

Open the MR in GitLab. Use a description template like:

## What
[What is being changed]

## Why
Ticket: INF-1234 — [Brief description from ticket]

## Hosts affected
[list of hosts or groups]

## Dry-run output
[paste --check --diff output or attach file]

## Rollback
[How to revert: revert this MR commit, or manual steps]

7. Pipeline runs automated checks

When you push and open an MR, the CI pipeline runs automatically. Typical jobs:

  1. ansible-lint — must pass before anything else runs
  2. syntax-check — verifies all playbooks parse correctly
  3. dry-run (optional) — runs --check --diff against the production or staging inventory; output saved as an artifact

If the pipeline fails: click the failed job in GitLab, read the log, fix the issue, push again. The pipeline re-runs automatically.

8. Peer review

Assign a reviewer who knows the relevant system. What a good reviewer looks at:

Respond to comments, push updates, and mark conversations resolved when addressed.

9. Merge and deploy

Once the MR is approved and the pipeline is green, merge it.

Depending on your pipeline setup:

# Manual deploy (if no CI automation)
git checkout main
git pull origin main
ansible-playbook site.yml \
  -i inventories/production/hosts.ini \
  --limit webservers \
  --tags nginx

# Run everything EXCEPT a specific tag (e.g. skip a long data migration)
ansible-playbook site.yml \
  -i inventories/production/hosts.ini \
  --skip-tags migrate-db

--check mode caveats: not all modules support check mode. The command and shell modules skip execution entirely under --check, so later tasks that depend on their output may fail or report incorrectly. template and copy are reliable in check mode; command and shell are not. Always treat --check output as a guide, not a guarantee.
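
To know how far a dry-run can be trusted for a given role, it can help to list the tasks that use command or shell. This is a rough sketch: unchecked_tasks is a hypothetical name, the roles/ layout is assumed from the earlier examples, and a plain-text search will miss tasks written in unusual styles.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list tasks that use command/shell, i.e. the tasks
# whose behaviour a --check run cannot vouch for.
# Prints file:line:match for each hit; exits 1 if nothing is found.
unchecked_tasks() {
    local roles_dir="${1:-roles}"
    grep -rnE --include='*.yml' --include='*.yaml' \
        '^[[:space:]]*(-[[:space:]]+)?(ansible\.builtin\.)?(command|shell):' \
        "$roles_dir"
}
```

Any task it lists deserves a manual read during review, because the --check --diff output says nothing about what that task will do on a real run.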

10. Verify in production

After deployment, confirm the change took effect and the service is healthy:

# Check the service is running
systemctl status nginx

# Check the config file has the expected content
# nginx.conf often includes conf.d/ — search there too
grep -r client_max_body_size /etc/nginx/

# Validate the live config
nginx -t

# Test the service responds
curl -v http://app.example.com/health

# Check logs for errors since the deployment
journalctl -u nginx --since "5 minutes ago"

Only close the ticket once you have confirmed the change works as expected.

If the service responds incorrectly but the logs look fine, the problem is usually on the wire. Capture a pcap with tcpdump -i any -w /tmp/verify.pcap port 443 for a minute, reproduce the request, and open the capture in Wireshark. Filter on tls.handshake to see what the client and server are actually exchanging; http.request only matches plaintext or decrypted traffic. To confirm a port is exposed from a peer's perspective (for example from a bastion, and only against hosts that are in scope), use nmap: nmap -sV -p 443 app.example.com.

Rollback

Something went wrong after merge. Act quickly:

Option 1: Revert the MR commit

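# Start from an up-to-date main on a dedicated branch
git checkout main && git pull origin main
git checkout -b revert/fix-bad-change

# Find the merge commit
git log --oneline -5

# Revert it (-m 1 keeps the main-line parent; omit -m for a non-merge commit)
git revert -m 1 COMMIT_HASH

# Push and open an MR for the revert
git push -u origin revert/fix-bad-change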

Option 2: Quick emergency fix without waiting for review

git checkout main && git pull
git checkout -b hotfix/urgent-rollback

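# Edit the file back to the previous value manually, then stage only that file
# (path shown is the one from the example in step 3)
git add inventories/production/group_vars/webservers.yml
git commit -m "hotfix: revert bad setting causing nginx errors"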
git push -u origin hotfix/urgent-rollback

# Open MR (mark as urgent), deploy, then close MR after the fact

Shortcuts and when to use them

In real-world situations, shortcuts exist. Use them consciously, not by default:

Config drift happens when shortcuts become habits. If you regularly make changes directly on hosts without going through the repo, the repo no longer reflects reality — and the next time Ansible runs, it will undo your manual changes.

Post-deploy smoke tests

Step 10 ("Verify in production") should not end at systemctl status. Run a short, scripted set of smoke tests that exercise the change from a real caller's perspective — they turn "the daemon is running" into "the feature actually works end-to-end". Keep them idempotent, read-only, and fast enough to run after every deploy.

#!/usr/bin/env bash
# A minimal curl-based smoke test harness, runnable from your laptop or a CI job
set -euo pipefail

smoke() {
    local name="$1" expected="$2" ; shift 2
    local got
    # No -f here: with -f, curl exits non-zero on 4xx/5xx, which would turn
    # the expected 401/405 responses below into "network-error".
    got=$(curl -sS -o /dev/null -w '%{http_code}' "$@") || got="network-error"
    if [[ "$got" != "$expected" ]]; then
        printf 'FAIL  %-30s got=%s expected=%s\n' "$name" "$got" "$expected" >&2
        return 1
    fi
    printf 'OK    %-30s %s\n' "$name" "$got"
}

smoke "health endpoint"   200 https://app.example.com/health
smoke "static asset"      200 https://app.example.com/static/app.css
smoke "api login (GET)"   405 https://app.example.com/api/login       # method not allowed proves the route exists
smoke "auth required"     401 https://app.example.com/api/profile     # 401 without a token
# Pin the connection to the first address DNS returns; curl still verifies the certificate
smoke "tls valid"         200 --resolve "app.example.com:443:$(dig +short app.example.com | head -1)" https://app.example.com/health

For fleet-wide verification — "did the config land on every webserver?" — an Ansible ad-hoc one-liner is usually the fastest answer:

# Confirm nginx is active and the new directive is in the generated config
ansible webservers -i inventories/production/hosts.ini -m shell \
  -a "systemctl is-active nginx && grep -H client_max_body_size /etc/nginx/conf.d/*.conf"

# Hit the health endpoint from each host (internal-side check, bypasses LB)
ansible webservers -i inventories/production/hosts.ini -m uri \
  -a "url=http://127.0.0.1/health status_code=200"

Commit the smoke-test script alongside the role or inventory it validates (e.g. scripts/smoke/webservers.sh). Run it automatically as the last stage of the deploy pipeline and again manually after any out-of-band change. A deploy that passed Ansible but fails smoke is almost always a DNS, certificate, or firewall mismatch — catch it in the five-minute window while you're still paying attention, not at 02:00 when pager duty calls.

Dashboards checklist

After merge, the second tab you should open (right after the deploy pipeline) is the service's dashboard. A useful dashboard answers the only five questions that matter during a deploy; if any are missing, the dashboard is incomplete and this is the moment to fix it, not during the next incident.

| Panel | What it proves | Typical signal |
|---|---|---|
| Service up | The process is running and reachable from the outside. | Prometheus up{job="app"}, blackbox-exporter HTTP 200, or an uptime ping from a remote probe. |
| Error rate | Requests are succeeding, not just arriving. | 5xx-per-second, app-level exception count, nginx 5xx from access log, or http_requests_total{status=~"5.."}. |
| Latency | Users are getting timely answers, not slow ones. | p50 / p95 / p99 request duration; watch p99, because averages hide slow tails. |
| Saturation | The service has headroom under current load. | CPU %, memory RSS vs limit, worker pool utilisation, connection pool usage, thread/goroutine count. |
| Queue depth | Async work isn't backing up (if the service has a queue). | Redis / RabbitMQ / SQS length, worker lag, oldest unprocessed message age. |

Watch all five for 5–10 minutes after a deploy before declaring success. If any panel moves in a direction you didn't expect — even if smoke tests passed — treat it as a red flag and have the revert command ready to paste. A green CI plus a suspicious dashboard is how quiet regressions reach customers.
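
If the error-rate panel is missing or you want a second opinion, the nginx access log gives a quick approximation. The sketch below assumes nginx's default "combined" log format, where the status code is the ninth whitespace-separated field; error_rate is a hypothetical helper name, and a custom log_format would need a different field number.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the share of 5xx responses in an nginx access log.
# Assumes the default "combined" format, where the status code is field 9.
error_rate() {
    local log_file="$1"
    awk '
        $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 ~ /^5/) errors++ }
        END {
            if (total == 0) { print "no requests"; exit }
            printf "%d/%d requests failed (%.1f%%)\n", errors, total, 100 * errors / total
        }
    ' "$log_file"
}
```

Running it against the log before and after the deploy gives a rough comparison; it does not replace the dashboard, which shows the trend over time.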