GitLab CI/CD for Infrastructure
- What CI/CD does for infrastructure
- .gitlab-ci.yml structure
- Stages and jobs
- A real infra pipeline
- Running Ansible in a CI job
- Passing secrets — CI/CD variables
- SSH keys in CI
- Ansible Vault in CI
- Reading a failed pipeline job
- Artifacts
- rules — control when jobs run
- Manual deployment approval
- Runners — what runs your jobs
- cache: vs artifacts:
- needs: DAG with parallel jobs
- parallel: matrix
- include: project, remote, template
- OIDC to cloud
What CI/CD does for infrastructure
Without CI/CD, deploying infrastructure changes looks like this: push to GitLab → someone SSHes in manually → runs ansible-playbook → hopes they used the right inventory and flags.
With CI/CD, pushing a branch triggers automatic jobs that:
- Lint the Ansible code (ansible-lint)
- Run a syntax check (ansible-playbook --syntax-check)
- Optionally run a dry-run in staging
- Deploy to production when the MR is merged
This means every change is automatically validated before it can merge, and deployment is consistent and repeatable — not dependent on who runs the command or what flags they remember.
.gitlab-ci.yml structure
The pipeline is defined in a file called .gitlab-ci.yml at the root of the repo.
# .gitlab-ci.yml: basic structure
stages:                          # define the order of stages
  - lint
  - check
  - deploy

variables:                       # repo-level variable defaults
  ANSIBLE_FORCE_COLOR: "1"

lint-ansible:                    # job name
  stage: lint                    # which stage this belongs to
  image: cytopia/ansible:latest  # Docker image to run in
  script:
    - ansible-lint site.yml
  only:
    - merge_requests
    - main
Key concepts:
- stages — jobs in the same stage run in parallel; stages run sequentially
- image — the Docker image the job runs inside
- script — the commands to run (one per line)
- rules — when this job should run (on what branches or events).
  only: and except: are the legacy syntax; prefer rules: in all new pipelines. It is more expressive, and GitLab has soft-deprecated only/except.
Stages and jobs
Multiple jobs can exist within the same stage and run in parallel. Stages are sequential — if a job in the lint stage fails, the check and deploy stages do not run.
stages:
  - lint
  - check
  - deploy

lint-ansible:
  stage: lint
  script:
    - ansible-lint site.yml

lint-yaml:
  stage: lint   # runs in parallel with lint-ansible
  script:
    - yamllint .

syntax-check:
  stage: check  # only runs if lint stage passed
  script:
    - ansible-playbook site.yml --syntax-check
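The stage semantics described above can be modelled in a few lines of Python. This is only a sketch, not GitLab's scheduler: `run_pipeline` and the `outcomes` mapping are hypothetical names, and jobs in a stage are treated as a group rather than actually run in parallel.

```python
def run_pipeline(stages, jobs, outcomes):
    """Return (ran, skipped) job-name lists for a stage-ordered pipeline.

    stages:   ordered list of stage names
    jobs:     dict of job name -> stage name
    outcomes: dict of job name -> True (passed) / False (failed); made-up input
    """
    ran, skipped = [], []
    failed = False
    for stage in stages:
        stage_jobs = [name for name, s in jobs.items() if s == stage]
        if failed:
            # An earlier stage failed, so nothing in this stage starts.
            skipped.extend(stage_jobs)
            continue
        # Every job of the current stage runs, even if a sibling fails.
        ran.extend(stage_jobs)
        if not all(outcomes[name] for name in stage_jobs):
            failed = True
    return ran, skipped


jobs = {"lint-ansible": "lint", "lint-yaml": "lint", "syntax-check": "check"}
ran, skipped = run_pipeline(
    ["lint", "check", "deploy"],
    jobs,
    {"lint-ansible": True, "lint-yaml": False, "syntax-check": True},
)
print("ran:", ran)
print("skipped:", skipped)
```

With lint-yaml failing, both lint jobs still run, and syntax-check is skipped because its stage never starts.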
A real infra pipeline
---
stages:
  - lint
  - syntax
  - check
  - deploy

variables:
  ANSIBLE_FORCE_COLOR: "1"
  ANSIBLE_STDOUT_CALLBACK: yaml
  # Do NOT set ANSIBLE_HOST_KEY_CHECKING=False; use SSH_KNOWN_HOSTS below instead

# Shared config applied to all jobs
.ansible-base: &ansible-base
  image: willhallonline/ansible:2.14-ubuntu-22.04
  before_script:
    - eval "$(ssh-agent -s)"
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - echo "$SSH_KNOWN_HOSTS" > ~/.ssh/known_hosts

lint:
  <<: *ansible-base
  stage: lint
  script:
    - ansible-lint
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

syntax-check:
  <<: *ansible-base
  stage: syntax
  script:
    - ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini

dry-run:
  <<: *ansible-base
  stage: check
  script:
    - ansible-playbook site.yml --check --diff -i inventories/production/hosts.ini
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

deploy-production:
  <<: *ansible-base
  stage: deploy
  script:
    - echo "$ANSIBLE_VAULT_PASS" > .vault_pass
    - ansible-playbook site.yml -i inventories/production/hosts.ini --vault-password-file .vault_pass
    - rm -f .vault_pass
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual  # requires a human to click "play" in GitLab
Running Ansible in a CI job
The CI runner needs to be able to SSH to your infrastructure hosts. The standard approach:
- Generate a dedicated deploy SSH key pair (no passphrase): ssh-keygen -t ed25519 -f deploy_key -N ""
- Add the public key to ~/.ssh/authorized_keys on every managed host (or via FreeIPA)
- Store the private key as a CI/CD variable named SSH_PRIVATE_KEY
- Load it in the job's before_script
before_script:
  - eval "$(ssh-agent -s)"
  - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh
tr -d '\r' removes Windows-style carriage returns that sometimes appear when copying a key through the browser. Without it, ssh-add may reject the key.
Passing secrets — CI/CD variables
In GitLab: Settings → CI/CD → Variables. Add these for an Ansible infra project:
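- SSH_PRIVATE_KEY: the deploy private key (protect it only if you want branch restrictions). GitLab masks only single-line values, so a raw multi-line key cannot be masked as-is; base64-encode it and decode it in the job if you need masking.
- SSH_KNOWN_HOSTS: output of ssh-keyscan -H host1 host2, to avoid host key prompts
- ANSIBLE_VAULT_PASS: the vault password (masked)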
Marking a variable masked prevents it from appearing in job logs. Marking it protected means only protected branches (like main) can access it.
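GitLab only masks single-line values, which a multi-line private key is not. One common workaround is to store the key base64-encoded and decode it inside the job. The sketch below shows the round trip in Python; `encode_for_variable` and `decode_in_job` are hypothetical helper names, and the key text is a placeholder, not a real key.

```python
import base64


def encode_for_variable(key_text: str) -> str:
    """Turn a multi-line key into a single line of base64 (no spaces or newlines)."""
    return base64.b64encode(key_text.encode("utf-8")).decode("ascii")


def decode_in_job(value: str) -> str:
    """Reverse the encoding; this is what the job does before loading the key."""
    return base64.b64decode(value).decode("utf-8")


# Placeholder standing in for the contents of deploy_key.
fake_key = (
    "-----BEGIN OPENSSH PRIVATE KEY-----\n"
    "AAAAexamplebody\n"
    "-----END OPENSSH PRIVATE KEY-----\n"
)

encoded = encode_for_variable(fake_key)
print(encoded)
print(decode_in_job(encoded) == fake_key)
```

In a job script the decode step would typically be done with a base64 decode command piped into ssh-add, assuming the image ships a base64 tool.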
SSH keys in CI
# Generate known_hosts to avoid interactive prompts
ssh-keyscan -H web01.example.com mail01.example.com >> known_hosts_file
# Paste the output into a CI variable: SSH_KNOWN_HOSTS
# Then in before_script:
echo "$SSH_KNOWN_HOSTS" > ~/.ssh/known_hosts
chmod 644 ~/.ssh/known_hosts
Ansible Vault in CI
# In the CI job script
- echo "$ANSIBLE_VAULT_PASS" > /tmp/.vault_pass
- ansible-playbook site.yml -i inventories/production/hosts.ini --vault-password-file /tmp/.vault_pass
- rm /tmp/.vault_pass # clean up
Writing to a temp file and deleting it is preferable to passing the password directly via -e vault_password=..., which would appear in the process list and potentially in logs.
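One weakness of the write-then-rm pattern is that the cleanup line never runs if the playbook fails first. The sketch below shows the same idea with guaranteed cleanup, written in Python for illustration. `run_with_vault_file` and its `build_command` callback are hypothetical, and a harmless stand-in command replaces ansible-playbook so the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_with_vault_file(password, build_command):
    """Write the password to a private temp file, run a command, always clean up.

    build_command is a hypothetical callback: it receives the file path and
    returns the argv list to execute.
    Returns (exit_code, path) so callers can confirm the file is gone.
    """
    # mkstemp creates the file readable and writable by the current user only.
    fd, path = tempfile.mkstemp(prefix="vault_pass_")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(password)
        completed = subprocess.run(build_command(path), check=False)
        return completed.returncode, path
    finally:
        os.remove(path)  # runs even if the command fails or raises


# Stand-in command that exits with status 3. A real job would return
# ["ansible-playbook", "site.yml", "--vault-password-file", path] instead.
exit_code, used_path = run_with_vault_file(
    "example-password",
    lambda path: [sys.executable, "-c", "import sys; sys.exit(3)"],
)
print(exit_code, os.path.exists(used_path))
```

The failing command's exit code is still reported, and the password file is removed regardless.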
Reading a failed pipeline job
When a pipeline fails, click the failed job (shown in red) to read its log. The log shows every command that ran and its output.
What to look for:
- Which command failed: look for "$ command" lines and the exit code (ERROR: Job failed: exit code 1)
- Ansible task failure: look for fatal: lines followed by a JSON error object with a msg field
- SSH failure: look for UNREACHABLE followed by an SSH error message
- Lint failure: ansible-lint prints rule violations in the format rulename: description [tag]
# Example failed ansible-lint output in CI log
$ ansible-lint
WARNING Listing 2 violation(s) that are fatal
roles/nginx/tasks/main.yml:12: yaml[truthy] Truthy value should be one of [false, true]
roles/nginx/handlers/main.yml:3: no-handler Use [module] instead of command/shell for service management
Finished with 2 failure(s), 0 warning(s) on 8 files.
ERROR: Job failed: exit code 2
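For long logs, the markers listed above can be pulled out mechanically. The following is a rough sketch, assuming the log has been saved as text; `summarize` and both regular expressions are illustrative and only cover the line shapes shown in the example log, not every format ansible-lint or Ansible can emit.

```python
import re

# file:line: rule message   (the shape of the violation lines in the sample log)
VIOLATION = re.compile(r"^(?P<file>[^\s:]+):(?P<line>\d+): (?P<rule>\S+) (?P<message>.+)$")
EXIT_CODE = re.compile(r"^ERROR: Job failed: exit code (\d+)$")


def summarize(log_text):
    """Collect lint violations, fatal/unreachable lines, and the job exit code."""
    summary = {"violations": [], "fatal": [], "unreachable": [], "exit_code": None}
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("fatal:"):
            summary["fatal"].append(line)
        if "UNREACHABLE" in line:
            summary["unreachable"].append(line)
        match = VIOLATION.match(line)
        if match:
            summary["violations"].append(
                (match["file"], int(match["line"]), match["rule"])
            )
        match = EXIT_CODE.match(line)
        if match:
            summary["exit_code"] = int(match.group(1))
    return summary


sample_log = """\
$ ansible-lint
WARNING Listing 2 violation(s) that are fatal
roles/nginx/tasks/main.yml:12: yaml[truthy] Truthy value should be one of [false, true]
roles/nginx/handlers/main.yml:3: no-handler Use [module] instead of command/shell for service management
Finished with 2 failure(s), 0 warning(s) on 8 files.
ERROR: Job failed: exit code 2
"""

result = summarize(sample_log)
for violation in result["violations"]:
    print(violation)
print("exit code:", result["exit_code"])
```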
Artifacts
Jobs can save files that persist after the job ends. Useful for saving Ansible output, reports, or files that later jobs need.
dry-run:
  stage: check
  script:
    - ansible-playbook site.yml --check --diff 2>&1 | tee ansible-output.txt
  artifacts:
    name: "ansible-dry-run-${CI_COMMIT_SHORT_SHA}"
    paths:
      - ansible-output.txt
    expire_in: 7 days
    when: always  # save even on failure
rules — control when jobs run
# Run only on merge requests
rules:
  - if: $CI_PIPELINE_SOURCE == "merge_request_event"

# Run only on main branch
rules:
  - if: $CI_COMMIT_BRANCH == "main"

# Run on MRs AND main
rules:
  - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  - if: $CI_COMMIT_BRANCH == "main"

# Skip if commit message contains [skip-ci]
rules:
  - if: $CI_COMMIT_MESSAGE =~ /\[skip-ci\]/
    when: never
  - when: on_success
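The key to reading rules: is that entries are checked top to bottom and the first match decides; if nothing matches, the job is left out of the pipeline. A small Python model of that behaviour follows. It is a sketch: `evaluate_rules` is a hypothetical name, and conditions are plain Python callables rather than GitLab's expression syntax.

```python
import re


def evaluate_rules(rules, variables):
    """Return the `when` of the first matching rule, or "never" if none match."""
    for rule in rules:
        condition = rule.get("if")
        # A rule without an `if` always matches.
        if condition is None or condition(variables):
            return rule.get("when", "on_success")
    return "never"


# Mirrors the [skip-ci] snippet: exclude first, then a catch-all.
skip_ci_rules = [
    {
        "if": lambda v: re.search(r"\[skip-ci\]", v.get("CI_COMMIT_MESSAGE", "")) is not None,
        "when": "never",
    },
    {"when": "on_success"},
]

# Mirrors "Run on MRs AND main".
mr_or_main_rules = [
    {"if": lambda v: v.get("CI_PIPELINE_SOURCE") == "merge_request_event"},
    {"if": lambda v: v.get("CI_COMMIT_BRANCH") == "main"},
]

print(evaluate_rules(skip_ci_rules, {"CI_COMMIT_MESSAGE": "fix typo [skip-ci]"}))
print(evaluate_rules(skip_ci_rules, {"CI_COMMIT_MESSAGE": "fix typo"}))
print(evaluate_rules(mr_or_main_rules, {"CI_COMMIT_BRANCH": "feature/x"}))
```

Order matters: putting the catch-all entry first in skip_ci_rules would make the [skip-ci] check unreachable.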
Manual deployment approval
For production deployments, require a human to click "play" in the GitLab pipeline view:
deploy-production:
  stage: deploy
  script:
    - ansible-playbook site.yml -i inventories/production/hosts.ini
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual          # shows a play button in the pipeline view
      allow_failure: false  # pipeline stays "blocked" until triggered
Runners — what runs your jobs
A runner is a service that picks up CI jobs and executes them. There are two types:
- Shared runners — provided by GitLab; run in isolated containers; no access to your internal network
- Self-hosted runners — you run these on a machine in your network; they can reach internal hosts; needed for Ansible deployments to private infrastructure
Check runner status: Settings → CI/CD → Runners.
If your job shows "Waiting for runner" or "No runners available", the runner is offline or no runner matches the job's tags.
# Specify a job must run on a runner with a specific tag
deploy-production:
  tags:
    - infra  # only run on runners tagged "infra"
  script:
    - ansible-playbook site.yml
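Tag matching is a subset test: a runner can take a job only if it carries every tag the job lists, and untagged jobs go only to runners configured to accept them. A minimal sketch of that check, with `runner_can_pick` as a hypothetical name and the runner's settings passed in as plain arguments:

```python
def runner_can_pick(runner_tags, job_tags, run_untagged=False):
    """Return True if a runner with runner_tags may execute a job with job_tags."""
    job_tags = set(job_tags)
    if not job_tags:
        # Untagged jobs need a runner that is allowed to run untagged work.
        return run_untagged
    # Every tag on the job must be present on the runner.
    return job_tags <= set(runner_tags)


print(runner_can_pick(["infra", "docker"], ["infra"]))
print(runner_can_pick(["docker"], ["infra"]))
print(runner_can_pick(["infra"], []))
```

A job stuck waiting for a runner usually corresponds to the second case: no online runner has all of the job's tags.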
cache: vs artifacts:
Both put files somewhere for a later run to pick up, but they serve different purposes and follow different rules. Get this wrong and you'll either waste minutes rebuilding deps every job, or silently ship stale build output to a later stage.
| | cache | artifacts |
|---|---|---|
| Purpose | Speed up repeat runs (pip/npm/molecule deps) | Hand output to a later stage / save for humans |
| Flows between | Pipeline runs (same job or any with matching key) | Jobs within one pipeline (needs:/dependencies:) |
| Trust level | Best-effort; job must cope if cache is missing/stale | Mandatory; downstream job relies on it |
| Storage | Runner-local (or S3 when distributed cache is on) | GitLab object storage, downloadable from the UI |
| Keyed by | Content of a file or branch name | Job name + pipeline ID |
cache: — speed up repeat jobs
lint-ansible:
  stage: lint
  image: python:3.11
  cache:
    key:
      files:
        - requirements.txt  # re-key the cache when deps change
    paths:
      - .cache/pip/
      - .venv/
    policy: pull-push       # default: pull before, push after
  before_script:
    - pip install --cache-dir .cache/pip -r requirements.txt
  script:
    - ansible-lint site.yml
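The point of key: files: is that the cache key changes whenever the dependency file changes, so an old cache is never reused for new dependencies. The sketch below illustrates that effect by hashing file contents. It is a stand-in for illustration, not GitLab's actual key derivation; `cache_key` is a hypothetical helper and the pinned versions are placeholders.

```python
import hashlib
import os
import tempfile


def cache_key(paths):
    """Derive a short key from the contents of the given files."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        with open(path, "rb") as handle:
            digest.update(handle.read())
    return digest.hexdigest()[:16]


with tempfile.TemporaryDirectory() as workdir:
    requirements = os.path.join(workdir, "requirements.txt")

    with open(requirements, "w") as handle:
        handle.write("ansible-lint==6.22.0\n")  # placeholder pin
    key_before = cache_key([requirements])
    key_same = cache_key([requirements])        # unchanged file, same key

    with open(requirements, "w") as handle:
        handle.write("ansible-lint==24.2.0\n")  # dependency bump
    key_after = cache_key([requirements])

print(key_before == key_same)
print(key_before == key_after)
```

An unchanged file maps to the same key (cache hit); a bumped dependency maps to a new key (fresh cache).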
artifacts: — pass files to the next stage
build-image:
  stage: build
  script:
    - packer build image.pkr.hcl
    - cp manifest.json build/
  artifacts:
    name: "image-${CI_COMMIT_SHORT_SHA}"
    paths:
      - build/
    expire_in: 1 week

deploy-image:
  stage: deploy
  needs: [build-image]  # explicitly consumes the artifact
  script:
    - terraform apply -var-file=build/manifest.json
needs: DAG with parallel jobs
By default, jobs in a later stage wait for every job in earlier stages to finish. needs: breaks that lockstep — a job runs as soon as its specific predecessors complete, turning the pipeline into a directed acyclic graph (DAG) and dramatically reducing wall-clock time on wide pipelines.
stages: [lint, check, deploy]

# --- lint stage (run in parallel) ---
lint-ansible:
  stage: lint
  script: ansible-lint .

lint-yaml:
  stage: lint
  script: yamllint .

lint-shell:
  stage: lint
  script: shellcheck scripts/*.sh

# --- check stage (each depends on only one lint) ---
syntax-check:
  stage: check
  needs: [lint-ansible]  # starts as soon as lint-ansible is green
  script: ansible-playbook site.yml --syntax-check

yaml-schema:
  stage: check
  needs: [lint-yaml]     # independent of lint-ansible
  script: ./scripts/validate-schema.sh

# --- deploy stage (needs only the relevant check) ---
deploy-staging:
  stage: deploy
  needs: [syntax-check, yaml-schema]
  script: ansible-playbook site.yml -i inventories/staging/
Without needs:, deploy-staging would wait for lint-shell even though shell scripts have nothing to do with the deploy. With the DAG, syntax-check starts the moment lint-ansible finishes — no need to wait for the slowest lint job.
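To put numbers on the difference, the sketch below computes finish times for the same six jobs with and without needs:. The durations are invented for illustration, `finish_times` is a hypothetical helper, and the model assumes there are always enough free runners.

```python
STAGES = ["lint", "check", "deploy"]

# job name -> (stage, needs, duration in minutes); durations are made up
JOBS = {
    "lint-ansible": ("lint", [], 2),
    "lint-yaml": ("lint", [], 1),
    "lint-shell": ("lint", [], 10),
    "syntax-check": ("check", ["lint-ansible"], 3),
    "yaml-schema": ("check", ["lint-yaml"], 1),
    "deploy-staging": ("deploy", ["syntax-check", "yaml-schema"], 5),
}


def finish_times(jobs, stages, use_needs):
    """Return job name -> finish time, honouring needs only if use_needs is True."""
    finish = {}
    stage_end = 0  # moment at which all jobs of every earlier stage are done
    for stage in stages:
        names = [n for n, (s, _, _) in jobs.items() if s == stage]
        for name in names:
            _, needs, duration = jobs[name]
            if use_needs and needs:
                # DAG mode: wait only for the listed predecessors.
                start = max(finish[n] for n in needs)
            else:
                # Stage mode: wait for the whole previous stage.
                start = stage_end
            finish[name] = start + duration
        stage_end = max([stage_end] + [finish[n] for n in names])
    return finish


staged = finish_times(JOBS, STAGES, use_needs=False)
dag = finish_times(JOBS, STAGES, use_needs=True)
print("stage ordering:", staged["deploy-staging"], "minutes")
print("with needs:    ", dag["deploy-staging"], "minutes")
```

With these invented durations, the slow lint-shell job holds up everything under stage ordering, while the DAG version finishes the deploy without waiting for it.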
parallel: matrix
Generate a cartesian product of jobs from a variable matrix — ideal for testing the same role against multiple distros and Ansible versions without copy-pasting the job definition six times.
molecule-test:
  stage: check
  image: quay.io/ansible/molecule-runner:latest
  parallel:
    matrix:
      - DISTRO: [rockylinux9, ubuntu2204, debian12]
        ANSIBLE_VERSION: ["2.16", "2.17"]
  variables:
    MOLECULE_DISTRO: $DISTRO
  before_script:
    - pip install "ansible-core==${ANSIBLE_VERSION}.*"
  script:
    - molecule test
That's a 3 × 2 matrix — 6 parallel jobs named molecule-test: [rockylinux9, 2.16], molecule-test: [rockylinux9, 2.17], and so on. The matrix variables are exposed as ordinary environment variables inside each job. Failures are reported per cell, so you can see exactly which combination broke.
Matrix jobs all count against your concurrent-job limit — if you have 6 matrix cells but only 2 free runners, the remaining 4 queue until a runner frees up. Size the matrix with that in mind.
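The expansion is an ordinary cartesian product, which is easy to reproduce when you want to know the job count before pushing. A small sketch, where `expand_matrix` is a hypothetical helper and the name format follows the job names quoted above:

```python
from itertools import product


def expand_matrix(job_name, matrix):
    """Return a list of (display name, variables) pairs, one per matrix cell."""
    keys = list(matrix)
    cells = []
    for values in product(*(matrix[key] for key in keys)):
        variables = dict(zip(keys, values))
        cells.append((f"{job_name}: [{', '.join(values)}]", variables))
    return cells


cells = expand_matrix(
    "molecule-test",
    {
        "DISTRO": ["rockylinux9", "ubuntu2204", "debian12"],
        "ANSIBLE_VERSION": ["2.16", "2.17"],
    },
)

print(len(cells), "jobs")
for name, _variables in cells:
    print(name)
```

Adding a third variable with four values would multiply this to 24 jobs, which is worth checking against your runner capacity first.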
include: project, remote, template
Keep .gitlab-ci.yml short by pulling in pipeline fragments from elsewhere. Four common sources:
# .gitlab-ci.yml: one file, many sources
include:
  # 1. local: a file in this repo
  - local: '.gitlab/ci/ansible.yml'

  # 2. project: from another GitLab project, pinned to a ref
  - project: 'infra/ci-templates'
    ref: v3.1.0  # tag, branch, or commit SHA
    file:
      - '/ansible/lint.yml'
      - '/ansible/molecule.yml'

  # 3. remote: any public URL returning YAML
  - remote: 'https://raw.githubusercontent.com/example/ci-shared/main/python.yml'

  # 4. template: GitLab-maintained stock templates
  - template: Security/SAST.gitlab-ci.yml

stages: [lint, check, deploy, security]
A shared infra/ci-templates project holding the common Ansible lint / molecule / syntax-check jobs means every service repo only writes the bits that are genuinely specific to its pipeline. Upgrade the template in one place, then pin consumers to new versions at their own pace.
Always pin ref:. An unpinned project: include resolves to HEAD of the default branch, so a change on that branch will silently alter every downstream pipeline on its next run. Pin to a tag and bump the tag deliberately.
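A check for missing pins is easy to automate. The sketch below takes the include: list as already-parsed Python data (no YAML library is used) and reports project includes without a ref. `unpinned_project_includes` and the second project path are hypothetical.

```python
def unpinned_project_includes(includes):
    """Return the project paths of `project:` includes that have no ref."""
    return [
        entry["project"]
        for entry in includes
        if isinstance(entry, dict) and "project" in entry and not entry.get("ref")
    ]


includes = [
    {"local": ".gitlab/ci/ansible.yml"},
    {
        "project": "infra/ci-templates",
        "ref": "v3.1.0",
        "file": ["/ansible/lint.yml", "/ansible/molecule.yml"],
    },
    # Hypothetical second include with the ref forgotten.
    {"project": "infra/other-templates", "file": ["/shared.yml"]},
    {"template": "Security/SAST.gitlab-ci.yml"},
]

print(unpinned_project_includes(includes))
```

This only detects a missing ref; it cannot tell whether a ref that is present names a tag or a moving branch.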
OIDC to cloud
The old pattern for "pipeline needs AWS credentials" was a long-lived AWS_ACCESS_KEY_ID in CI/CD variables — which gets leaked in logs, rotated by nobody, and has an unlimited blast radius. GitLab's id_tokens: replaces that with short-lived JWTs exchanged at the cloud's OIDC endpoint.
deploy-terraform:
  stage: deploy
  id_tokens:
    AWS_WEB_IDENTITY_TOKEN:
      aud: https://gitlab.example.com  # must match the IAM trust policy
    VAULT_ID_TOKEN:
      aud: vault-prod
  script:
    # AWS: exchange the JWT for STS credentials
    - >-
      aws sts assume-role-with-web-identity
      --role-arn arn:aws:iam::123456789012:role/gitlab-deploy
      --role-session-name "$CI_JOB_ID"
      --web-identity-token "$AWS_WEB_IDENTITY_TOKEN"
    # Vault: exchange the JWT for a Vault token
    - vault write -field=token auth/jwt/login
      role=ci-deploy jwt="$VAULT_ID_TOKEN" > ~/.vault-token
    - terraform apply -auto-approve
No static cloud creds live in GitLab at all — the cloud side trusts GitLab's OIDC issuer and binds the role/policy to claims from the JWT (project path, branch, environment). For the matching Vault / AWS / GCP trust-policy setup and the full worked example, see Secrets & OIDC.
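When a trust policy rejects the token, it helps to see which claims the JWT actually carries. The sketch below decodes the payload segment of a JWT for inspection. It is for debugging only and performs no signature verification; `jwt_claims` is a hypothetical helper, and the token and its claim values are fabricated for the example.

```python
import base64
import json


def _b64url_decode(segment):
    """Decode base64url text, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def _b64url_encode(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwt_claims(token):
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(_b64url_decode(payload))


# Fabricated, unsigned token so the example is self-contained.
fake_token = ".".join(
    [
        _b64url_encode(json.dumps({"alg": "none"}).encode("utf-8")),
        _b64url_encode(
            json.dumps(
                {"aud": "vault-prod", "project_path": "infra/ansible", "ref": "main"}
            ).encode("utf-8")
        ),
        "",
    ]
)

claims = jwt_claims(fake_token)
print(claims["aud"], claims["project_path"], claims["ref"])
```

Avoid printing real tokens or full claim sets in job logs; compare only the specific claims the trust policy binds to.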