GitLab CI/CD for Infrastructure

How pipelines work, how Ansible runs in CI, how to pass secrets, and how to read failed jobs.

On this page

What CI/CD does for infrastructure
.gitlab-ci.yml structure
Stages and jobs
A real infra pipeline
Running Ansible in a CI job
Passing secrets — CI/CD variables
SSH keys in CI
Ansible Vault in CI
Reading a failed pipeline job
Artifacts
rules — control when jobs run
Manual deployment approval
Runners — what runs your jobs
cache: vs artifacts:
needs: DAG with parallel jobs
parallel: matrix
include: project, remote, template
OIDC to cloud

What CI/CD does for infrastructure

Without CI/CD, deploying infrastructure changes looks like: push to GitLab → someone manually SSHes in → runs ansible-playbook → hopes they used the right inventory and flags.

With CI/CD, pushing a branch triggers automatic jobs that:

Lint the Ansible code (ansible-lint)
Run a syntax check (ansible-playbook --syntax-check)
Optionally run a dry-run in staging
Deploy to production when the MR is merged

This means every change is automatically validated before it can merge, and deployment is consistent and repeatable — not dependent on who runs the command or what flags they remember.

.gitlab-ci.yml structure

The pipeline is defined in a file called .gitlab-ci.yml at the root of the repo.

# .gitlab-ci.yml — basic structure

stages:         # define the order of stages
  - lint
  - check
  - deploy

variables:      # repo-level variable defaults
  ANSIBLE_FORCE_COLOR: "1"

lint-ansible:                  # job name
  stage: lint                  # which stage this belongs to
  image: cytopia/ansible:latest  # Docker image to run in
  script:
    - ansible-lint site.yml
  rules:                         # run on merge requests and on main
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

Key concepts:

stages — jobs in the same stage run in parallel; stages run sequentially
image — the Docker image the job runs inside
script — the commands to run (one per line)
rules — when this job should run (on what branches or events). only: and except: are the legacy syntax; prefer rules: in all new pipelines — it is more expressive and GitLab has soft-deprecated only/except.

Stages and jobs

Multiple jobs can exist within the same stage and run in parallel. Stages are sequential — if a job in the lint stage fails, the check and deploy stages do not run.

stages:
  - lint
  - check
  - deploy

lint-ansible:
  stage: lint
  script:
    - ansible-lint site.yml

lint-yaml:
  stage: lint          # runs in parallel with lint-ansible
  script:
    - yamllint .

syntax-check:
  stage: check         # only runs if lint stage passed
  script:
    - ansible-playbook site.yml --syntax-check

A real infra pipeline

---
stages:
  - lint
  - syntax
  - check
  - deploy

variables:
  ANSIBLE_FORCE_COLOR: "1"
  ANSIBLE_STDOUT_CALLBACK: yaml
  # Do NOT set ANSIBLE_HOST_KEY_CHECKING=False — use SSH_KNOWN_HOSTS below instead

# Shared config applied to all jobs
.ansible-base: &ansible-base
  image: willhallonline/ansible:2.14-ubuntu-22.04
  before_script:
    - eval "$(ssh-agent -s)"
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - echo "$SSH_KNOWN_HOSTS" > ~/.ssh/known_hosts

lint:
  <<: *ansible-base
  stage: lint
  script:
    - ansible-lint
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

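syntax-check:
  <<: *ansible-base
  stage: syntax
  script:
    - ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini
  rules:                 # a job with no rules is skipped in merge request pipelines
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"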

dry-run:
  <<: *ansible-base
  stage: check
  script:
    - ansible-playbook site.yml --check --diff -i inventories/production/hosts.ini
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

deploy-production:
  <<: *ansible-base
  stage: deploy
  script:
    - echo "$ANSIBLE_VAULT_PASS" > .vault_pass
    - ansible-playbook site.yml -i inventories/production/hosts.ini --vault-password-file .vault_pass
  after_script:
    - rm -f .vault_pass    # after_script runs even when the playbook fails
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  when: manual           # requires a human to click "play" in GitLab

Running Ansible in a CI job

The CI runner needs to be able to SSH to your infrastructure hosts. The standard approach:

Generate a dedicated deploy SSH key pair (no passphrase): ssh-keygen -t ed25519 -f deploy_key -N ""
Add the public key to ~/.ssh/authorized_keys on every managed host (or via FreeIPA)
Store the private key as a CI/CD variable named SSH_PRIVATE_KEY
Load it in the job's before_script

before_script:
  - eval "$(ssh-agent -s)"
  - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh

tr -d '\r' removes Windows-style carriage returns that sometimes appear when copying a key through the browser. Without it, ssh-add may reject the key.
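As a minimal sketch of the loading step, the helpers below fail fast when a required variable is empty (which happens, for example, when a protected variable is read from an unprotected branch) and strip carriage returns before the key reaches ssh-add. The function names require_var and normalize_key are hypothetical; they are not part of GitLab or OpenSSH.

```shell
#!/bin/sh
# Hypothetical helpers for a before_script; the names are illustrative only.

# require_var NAME: return non-zero with a readable message when the
# environment variable NAME is unset or empty.
require_var() {
  eval "_value=\${$1:-}"
  if [ -z "$_value" ]; then
    echo "required CI/CD variable $1 is empty or unset" >&2
    return 1
  fi
}

# normalize_key: copy stdin to stdout with carriage returns removed,
# so CRLF line endings become plain LF.
normalize_key() {
  tr -d '\r'
}

# Intended use inside a job (assumes ssh-agent is already running):
#   require_var SSH_PRIVATE_KEY && printf '%s\n' "$SSH_PRIVATE_KEY" | normalize_key | ssh-add -
```

Checking the variable first turns a cryptic ssh-add error into a message that names the missing variable.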

Passing secrets — CI/CD variables

In GitLab: Settings → CI/CD → Variables. Add these for an Ansible infra project:

SSH_PRIVATE_KEY — the deploy private key. GitLab only masks single-line values, so a raw multi-line key cannot be masked as-is; mark it protected if only protected branches should be able to deploy.
SSH_KNOWN_HOSTS — output of ssh-keyscan -H host1 host2 to avoid host key prompts
ANSIBLE_VAULT_PASS — the vault password (masked)

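Marking a variable masked hides its exact value in job logs; it does not stop a script from printing a transformed copy (for example, a base64-encoded one). Marking it protected means only pipelines running on protected branches or tags (like main) can access it.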
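GitLab only masks values that fit on a single line, so a multi-line private key is often stored base64-encoded and decoded inside the job. The sketch below shows that round trip, assuming a base64 tool is available both on your workstation and in the job image. The variable name SSH_PRIVATE_KEY_B64 and both function names are hypothetical.

```shell
#!/bin/sh
# encode_for_variable: read data on stdin and print it as one base64
# line, suitable for pasting into a masked CI/CD variable.
encode_for_variable() {
  base64 | tr -d '\n'
}

# decode_variable VALUE: print the original bytes of a value produced
# by encode_for_variable. (-d is the GNU spelling; some platforms differ.)
decode_variable() {
  printf '%s' "$1" | base64 -d
}

# On a workstation:  encode_for_variable < deploy_key
# In the job:        decode_variable "$SSH_PRIVATE_KEY_B64" | tr -d '\r' | ssh-add -
```

If you adopt this, the before_script has to decode the value before handing it to ssh-add; the plain echo shown earlier would pass the encoded text instead.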

SSH keys in CI

# Generate known_hosts to avoid interactive prompts
ssh-keyscan -H web01.example.com mail01.example.com >> known_hosts_file

# Paste the output into a CI variable: SSH_KNOWN_HOSTS
# Then in before_script:
echo "$SSH_KNOWN_HOSTS" > ~/.ssh/known_hosts
chmod 644 ~/.ssh/known_hosts

Do not set ANSIBLE_HOST_KEY_CHECKING=False in production. It disables SSH host verification, which is a security risk. Use SSH_KNOWN_HOSTS instead to pre-populate the known hosts file.

Ansible Vault in CI

# In the CI job script
- echo "$ANSIBLE_VAULT_PASS" > /tmp/.vault_pass
- ansible-playbook site.yml -i inventories/production/hosts.ini --vault-password-file /tmp/.vault_pass
- rm /tmp/.vault_pass     # clean up

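Writing the password to a temp file and deleting it is preferable to passing it on the command line (for example via -e vault_password=...), where it would be visible in the process list and potentially in logs. Because a job stops at the first failing command, the rm line above is skipped when the playbook fails; moving the cleanup into after_script makes it run either way.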
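A minimal sketch of the temp-file approach that also cleans up when the playbook fails: the wrapper below writes the password to a file created by mktemp, appends --vault-password-file to whatever command it is given, and removes the file before returning the command's exit status. The function name run_with_vault_file is hypothetical.

```shell
#!/bin/sh
# run_with_vault_file CMD [ARGS...]
# Writes $ANSIBLE_VAULT_PASS to a private temp file, runs CMD with
# "--vault-password-file <file>" appended, then always deletes the file
# and returns CMD's exit status.
run_with_vault_file() {
  _vault_file=$(mktemp) || return 1
  printf '%s\n' "$ANSIBLE_VAULT_PASS" > "$_vault_file"
  _status=0
  "$@" --vault-password-file "$_vault_file" || _status=$?
  rm -f "$_vault_file"
  return "$_status"
}

# Intended use in a job script:
#   run_with_vault_file ansible-playbook site.yml -i inventories/production/hosts.ini
```

Capturing the exit status before deleting the file keeps the job red when the playbook fails, while the password file is removed in both cases.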

Reading a failed pipeline job

When a pipeline fails, click the failed job (shown in red) to read its log. The log shows every command that ran and its output.

What to look for:

Which command failed — look for $ command lines and the exit code (ERROR: Job failed: exit code 1)
Ansible task failure — look for fatal: lines followed by a JSON error object with a msg field
SSH failure — look for UNREACHABLE followed by an SSH error message
Lint failure — ansible-lint prints one line per violation: the file path and line number, then the rule ID and a description

# Example failed ansible-lint output in CI log
$ ansible-lint
WARNING  Listing 2 violation(s) that are fatal

roles/nginx/tasks/main.yml:12: yaml[truthy] Truthy value should be one of [false, true]
roles/nginx/handlers/main.yml:3: no-handler Use [module] instead of command/shell for service management

Finished with 2 failure(s), 0 warning(s) on 8 files.
ERROR: Job failed: exit code 2
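
For long logs it can help to filter for the markers listed above. The sketch below assumes the job log has been saved to a local file (job.log is a hypothetical name) and prints only the lines that usually explain a failure, with their line numbers. The function name summarize_job_log is also hypothetical.

```shell
#!/bin/sh
# summarize_job_log FILE: print, with line numbers, the lines of a saved
# job log that mention an Ansible task failure, an unreachable host, or
# the runner's final "Job failed" message.
summarize_job_log() {
  grep -nE 'fatal: \[|UNREACHABLE!|ERROR: Job failed' "$1"
}

# Intended use:
#   summarize_job_log job.log
```

The patterns are deliberately unanchored, so they still match when colour escape codes precede the text.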

Artifacts

Jobs can save files that persist after the job ends. Useful for saving Ansible output, reports, or files that later jobs need.

dry-run:
  stage: check
  script:
    - ansible-playbook site.yml --check --diff 2>&1 | tee ansible-output.txt
  artifacts:
    name: "ansible-dry-run-${CI_COMMIT_SHORT_SHA}"
    paths:
      - ansible-output.txt
    expire_in: 7 days
    when: always    # save even on failure

rules — control when jobs run

# Run only on merge requests
rules:
  - if: $CI_PIPELINE_SOURCE == "merge_request_event"

# Run only on main branch
rules:
  - if: $CI_COMMIT_BRANCH == "main"

# Run on MRs AND main
rules:
  - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  - if: $CI_COMMIT_BRANCH == "main"

# Skip if commit message contains [skip-ci]
rules:
  - if: $CI_COMMIT_MESSAGE =~ /\[skip-ci\]/
    when: never
  - when: on_success

Manual deployment approval

For production deployments, require a human to click "play" in the GitLab pipeline view:

deploy-production:
  stage: deploy
  script:
    - ansible-playbook site.yml -i inventories/production/hosts.ini
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  when: manual          # shows a play button in the pipeline view
  allow_failure: false  # pipeline stays "blocked" until triggered

Runners — what runs your jobs

A runner is a service that picks up CI jobs and executes them. There are two types:

Shared runners — available to every project on the instance; on GitLab.com they are provided by GitLab, run jobs in isolated environments, and have no access to your internal network
Self-hosted runners — you run these on a machine in your network; they can reach internal hosts; needed for Ansible deployments to private infrastructure

Check runner status: Settings → CI/CD → Runners.

If your job shows "Waiting for runner" or "No runners available", the runner is offline or no runner matches the job's tags.

# Specify a job must run on a runner with a specific tag
deploy-production:
  tags:
    - infra         # only run on runners tagged "infra"
  script:
    - ansible-playbook site.yml

cache: vs artifacts:

Both put files somewhere for a later run to pick up, but they serve different purposes and follow different rules. Get this wrong and you'll either waste minutes rebuilding deps every job, or silently ship stale build output to a later stage.

	`cache`	`artifacts`
Purpose	Speed up repeat runs (pip/npm/molecule deps)	Hand output to a later stage / save for humans
Flows between	Pipeline runs (same job or any with matching key)	Jobs within one pipeline (`needs:`/`dependencies:`)
Trust level	Best-effort; job must cope if cache is missing/stale	Mandatory; downstream job relies on it
Storage	Runner-local (or S3 when distributed cache is on)	GitLab object storage, downloadable from the UI
Keyed by	Content of a file or branch name	Job name + pipeline ID

cache: — speed up repeat jobs

lint-ansible:
  stage: lint
  image: python:3.11
  cache:
    key:
      files:
        - requirements.txt          # re-key the cache when deps change
    paths:
      - .cache/pip/
      - .venv/
    policy: pull-push               # default: pull before, push after
  before_script:
    - pip install --cache-dir .cache/pip -r requirements.txt
  script:
    - ansible-lint site.yml

artifacts: — pass files to the next stage

build-image:
  stage: build
  script:
    - packer build image.pkr.hcl
    - mkdir -p build                # make sure the artifact directory exists
    - cp manifest.json build/
  artifacts:
    name: "image-${CI_COMMIT_SHORT_SHA}"
    paths:
      - build/
    expire_in: 1 week

deploy-image:
  stage: deploy
  needs: [build-image]              # explicitly consumes the artifact
  script:
    - terraform apply -auto-approve -var-file=build/manifest.json   # CI has no TTY to answer the approval prompt

If the next job would still succeed (just slower) without the files, it's cache. If the next job would fail without them, it's artifacts.

needs: DAG with parallel jobs

By default, jobs in a later stage wait for every job in earlier stages to finish. needs: breaks that lockstep — a job runs as soon as its specific predecessors complete, turning the pipeline into a directed acyclic graph (DAG) and dramatically reducing wall-clock time on wide pipelines.

stages: [lint, check, deploy]

# --- lint stage (run in parallel) ---
lint-ansible:
  stage: lint
  script: ansible-lint .

lint-yaml:
  stage: lint
  script: yamllint .

lint-shell:
  stage: lint
  script: shellcheck scripts/*.sh

# --- check stage (each depends on only one lint) ---
syntax-check:
  stage: check
  needs: [lint-ansible]            # starts as soon as lint-ansible is green
  script: ansible-playbook site.yml --syntax-check

yaml-schema:
  stage: check
  needs: [lint-yaml]               # independent of lint-ansible
  script: ./scripts/validate-schema.sh

# --- deploy stage (needs only the relevant check) ---
deploy-staging:
  stage: deploy
  needs: [syntax-check, yaml-schema]
  script: ansible-playbook site.yml -i inventories/staging/

Without needs:, deploy-staging would wait for lint-shell even though shell scripts have nothing to do with the deploy. With the DAG, syntax-check starts the moment lint-ansible finishes — no need to wait for the slowest lint job.

parallel: matrix

Generate a cartesian product of jobs from a variable matrix — ideal for testing the same role against multiple distros and Ansible versions without copy-pasting the job definition six times.

molecule-test:
  stage: check
  image: quay.io/ansible/molecule-runner:latest
  parallel:
    matrix:
      - DISTRO: [rockylinux9, ubuntu2204, debian12]
        ANSIBLE_VERSION: ["2.16", "2.17"]
  variables:
    MOLECULE_DISTRO: $DISTRO
  before_script:
    - pip install "ansible-core==${ANSIBLE_VERSION}.*"
  script:
    - molecule test

That's a 3 × 2 matrix — 6 parallel jobs named molecule-test: [rockylinux9, 2.16], molecule-test: [rockylinux9, 2.17], and so on. The matrix variables are exposed as ordinary environment variables inside each job. Failures are reported per cell, so you can see exactly which combination broke.

Matrix jobs all count against your concurrent-job limit — if you have 6 matrix cells but only 2 free runners, the remaining 4 queue until a runner frees up. Size the matrix with that in mind.
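To make the expansion concrete, the loop below enumerates the same DISTRO × ANSIBLE_VERSION pairs and prints the six job names. It only illustrates the cartesian product; it is not how GitLab generates the jobs, and the function name expand_matrix is hypothetical.

```shell
#!/bin/sh
# expand_matrix: print one job name per DISTRO x ANSIBLE_VERSION pair,
# mirroring how a 3 x 2 matrix becomes six jobs.
expand_matrix() {
  for distro in rockylinux9 ubuntu2204 debian12; do
    for version in 2.16 2.17; do
      printf 'molecule-test: [%s, %s]\n' "$distro" "$version"
    done
  done
}

expand_matrix
```

Adding a third variable with two values would double the count to twelve, which is worth checking against your runner capacity before you commit it.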

include: project, remote, template

Keep .gitlab-ci.yml short by pulling in pipeline fragments from elsewhere. Four common sources:

# .gitlab-ci.yml — one file, many sources
include:
  # 1. local — a file in this repo
  - local: '.gitlab/ci/ansible.yml'

  # 2. project — from another GitLab project, pinned to a ref
  - project: 'infra/ci-templates'
    ref: v3.1.0                       # tag, branch, or commit SHA
    file:
      - '/ansible/lint.yml'
      - '/ansible/molecule.yml'

  # 3. remote — any public URL returning YAML
  - remote: 'https://raw.githubusercontent.com/example/ci-shared/main/python.yml'

  # 4. template — GitLab-maintained stock templates
  - template: Security/SAST.gitlab-ci.yml

stages: [lint, check, deploy, security]

Centralise once, consume everywhere. An infra/ci-templates project holding the common Ansible lint / molecule / syntax-check jobs means every service repo only writes the bits that are genuinely specific to its pipeline. Upgrade the template in one place and let consumers pin to new versions at their own pace.

Always pin ref:. An unpinned project: include resolves to HEAD of the default branch — a change on that branch will silently alter every downstream pipeline on its next run. Pin to a tag and bump the tag deliberately.

OIDC to cloud

The old pattern for "pipeline needs AWS credentials" was a long-lived AWS_ACCESS_KEY_ID in CI/CD variables — which gets leaked in logs, rotated by nobody, and has an unlimited blast radius. GitLab's id_tokens: replaces that with short-lived JWTs exchanged at the cloud's OIDC endpoint.

deploy-terraform:
  stage: deploy
  id_tokens:
    AWS_WEB_IDENTITY_TOKEN:
      aud: https://gitlab.example.com     # must match the IAM trust policy
    VAULT_ID_TOKEN:
      aud: vault-prod
  script:
    # AWS — exchange the JWT for STS credentials
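    # Without the export, the credentials would only be printed to the job log
    # and terraform would never see them.
    - >-
      export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s"
      $(aws sts assume-role-with-web-identity
      --role-arn arn:aws:iam::123456789012:role/gitlab-deploy
      --role-session-name "$CI_JOB_ID"
      --web-identity-token "$AWS_WEB_IDENTITY_TOKEN"
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]'
      --output text))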

    # Vault — exchange the JWT for a Vault token
    - vault write -field=token auth/jwt/login
      role=ci-deploy jwt="$VAULT_ID_TOKEN" > ~/.vault-token

    - terraform apply -auto-approve

No static cloud creds live in GitLab at all — the cloud side trusts GitLab's OIDC issuer and binds the role/policy to claims from the JWT (project path, branch, environment). For the matching Vault / AWS / GCP trust-policy setup and the full worked example, see Secrets & OIDC.
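
When a trust policy rejects the token, it helps to see which claims the job actually presented. The sketch below decodes the payload segment of a JWT so you can compare claims such as project_path and ref against the policy. It does not verify the signature, and the function name jwt_payload is hypothetical.

```shell
#!/bin/sh
# jwt_payload TOKEN: print the decoded JSON payload (second segment)
# of a JWT. Debugging aid only; the signature is NOT verified.
jwt_payload() {
  # take the middle segment and convert base64url to standard base64
  _payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # restore the '=' padding that base64url omits
  case $(( ${#_payload} % 4 )) in
    2) _payload="${_payload}==" ;;
    3) _payload="${_payload}=" ;;
  esac
  printf '%s' "$_payload" | base64 -d
}

# Intended use inside a job that defines id_tokens:
#   jwt_payload "$AWS_WEB_IDENTITY_TOKEN"
```

Only the payload is printed. Avoid echoing the full token in a job log, since anyone who can read the log could replay it until it expires.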
