GitLab CI/CD for Infrastructure

How pipelines work, how Ansible runs in CI, how to pass secrets, and how to read failed jobs.

What CI/CD does for infrastructure

Without CI/CD, deploying infrastructure changes looks like: push to GitLab → someone manually SSHes in → runs ansible-playbook → hopes they used the right inventory and flags.

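With CI/CD, pushing a branch triggers automatic jobs that:

  - lint the playbooks and YAML (ansible-lint, yamllint)
  - syntax-check the playbook (ansible-playbook --syntax-check)
  - dry-run the change against the inventory (--check --diff)
  - deploy to production once a human approves the manual job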

This means every change is automatically validated before it can merge, and deployment is consistent and repeatable — not dependent on who runs the command or what flags they remember.

.gitlab-ci.yml structure

The pipeline is defined in a file called .gitlab-ci.yml at the root of the repo.

# .gitlab-ci.yml — basic structure

stages:         # define the order of stages
  - lint
  - check
  - deploy

variables:      # repo-level variable defaults
  ANSIBLE_FORCE_COLOR: "1"

lint-ansible:                  # job name
  stage: lint                  # which stage this belongs to
  image: cytopia/ansible:latest  # Docker image to run in
  script:
    - ansible-lint site.yml
  only:
    - merge_requests
    - main

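Key concepts:

  - stages: the ordered list of stages; every job belongs to one of them
  - variables: repo-level defaults exported to every job's environment
  - job: a named block (lint-ansible above) with a script to run
  - image: the Docker image the job's script runs in
  - only: limits which pipelines include the job (here merge requests and main)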

Stages and jobs

Multiple jobs can exist within the same stage and run in parallel. Stages are sequential — if a job in the lint stage fails, the check and deploy stages do not run.

stages:
  - lint
  - check
  - deploy

lint-ansible:
  stage: lint
  script:
    - ansible-lint site.yml

lint-yaml:
  stage: lint          # runs in parallel with lint-ansible
  script:
    - yamllint .

syntax-check:
  stage: check         # only runs if lint stage passed
  script:
    - ansible-playbook site.yml --syntax-check

A real infra pipeline

---
stages:
  - lint
  - syntax
  - check
  - deploy

variables:
  ANSIBLE_FORCE_COLOR: "1"
  ANSIBLE_STDOUT_CALLBACK: yaml
  # Do NOT set ANSIBLE_HOST_KEY_CHECKING=False — use SSH_KNOWN_HOSTS below instead

# Shared config applied to all jobs
.ansible-base: &ansible-base
  image: willhallonline/ansible:2.14-ubuntu-22.04
  before_script:
    - eval "$(ssh-agent -s)"
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - echo "$SSH_KNOWN_HOSTS" > ~/.ssh/known_hosts

lint:
  <<: *ansible-base
  stage: lint
  script:
    - ansible-lint
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    - if: $CI_COMMIT_BRANCH == "main"

syntax-check:
  <<: *ansible-base
  stage: syntax
  script:
    - ansible-playbook site.yml --syntax-check -i inventories/production/hosts.ini

dry-run:
  <<: *ansible-base
  stage: check
  script:
    - ansible-playbook site.yml --check --diff -i inventories/production/hosts.ini
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

deploy-production:
  <<: *ansible-base
  stage: deploy
  script:
    - echo "$ANSIBLE_VAULT_PASS" > .vault_pass
    - ansible-playbook site.yml -i inventories/production/hosts.ini --vault-password-file .vault_pass
  after_script:
    - rm -f .vault_pass    # after_script runs even if the playbook fails
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  when: manual           # requires a human to click "play" in GitLab

Running Ansible in a CI job

The CI runner needs to be able to SSH to your infrastructure hosts. The standard approach:

  1. Generate a dedicated deploy SSH key pair (no passphrase): ssh-keygen -t ed25519 -f deploy_key -N ""
  2. Add the public key to ~/.ssh/authorized_keys on every managed host (or via FreeIPA)
  3. Store the private key as a CI/CD variable named SSH_PRIVATE_KEY
  4. Load it in the job's before_script

before_script:
  - eval "$(ssh-agent -s)"
  - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh

tr -d '\r' removes Windows-style carriage returns that sometimes appear when copying a key through the browser. Without it, ssh-add may reject the key.
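
To see what that step does in isolation, here is a minimal sketch. The helper name strip_cr and the sample text are made up for illustration; only the tr -d '\r' call comes from the before_script above.

```shell
#!/bin/sh
# strip_cr: hypothetical helper wrapping the tr call from before_script.
# Copies stdin to stdout with every carriage return removed.
strip_cr() {
  tr -d '\r'
}

# Simulated key text pasted with Windows (CRLF) line endings.
# od -c prints each byte, so any surviving \r would be visible.
printf 'line one\r\nline two\r\n' | strip_cr | od -c
```

Running the same printf without strip_cr shows the \r bytes that ssh-add would otherwise receive.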

Passing secrets — CI/CD variables

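In GitLab: Settings → CI/CD → Variables. Add these for an Ansible infra project:

  - SSH_PRIVATE_KEY: the private half of the deploy key
  - SSH_KNOWN_HOSTS: the ssh-keyscan output for the managed hosts
  - ANSIBLE_VAULT_PASS: the Ansible Vault password

Marking a variable masked hides its value in job logs. Marking it protected makes it available only to pipelines that run on protected branches or tags (like main).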
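
Because a protected variable is simply absent on unprotected branches, a job can fail later with a confusing SSH or Vault error. One option is a pre-flight check at the top of the script. This is a minimal sketch: require_vars is a hypothetical helper, not a GitLab feature, and it assumes the variable names used in this guide.

```shell
#!/bin/sh
# require_vars NAME...  (hypothetical helper)
# Returns non-zero if any named variable is unset or empty.
# Prints only the variable names, never their values.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup; callers must pass plain variable names only.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing CI/CD variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call for a job script:
# require_vars SSH_PRIVATE_KEY SSH_KNOWN_HOSTS ANSIBLE_VAULT_PASS
```

Failing early with the variable's name usually points straight at a missing protected flag or a typo in the variable key.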

SSH keys in CI

# Generate known_hosts to avoid interactive prompts
ssh-keyscan -H web01.example.com mail01.example.com >> known_hosts_file

# Paste the output into a CI variable: SSH_KNOWN_HOSTS
# Then in before_script:
echo "$SSH_KNOWN_HOSTS" > ~/.ssh/known_hosts
chmod 644 ~/.ssh/known_hosts

Do not set ANSIBLE_HOST_KEY_CHECKING=False in production. It disables SSH host verification, which is a security risk. Use SSH_KNOWN_HOSTS instead to pre-populate the known hosts file.

Ansible Vault in CI

# In the CI job script
- echo "$ANSIBLE_VAULT_PASS" > /tmp/.vault_pass
- ansible-playbook site.yml -i inventories/production/hosts.ini --vault-password-file /tmp/.vault_pass
- rm /tmp/.vault_pass     # clean up

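Writing the password to a temporary file keeps it off the command line; a password passed as an argument (for example via -e vault_password=...) would be visible in the process list and potentially in logs. Note that a job's script stops at the first failing command, so the rm above is skipped when the playbook fails; put the cleanup in after_script if the file must always be removed.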
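
The same idea can be wrapped so the cleanup happens whether or not the command succeeds. This is a minimal sketch, not part of Ansible or GitLab: run_with_vault_pass is a hypothetical helper, and it assumes the password is in ANSIBLE_VAULT_PASS as elsewhere in this guide.

```shell
#!/bin/sh
# run_with_vault_pass CMD [ARGS...]  (hypothetical helper)
# Writes $ANSIBLE_VAULT_PASS to a private temp file, runs CMD with
# --vault-password-file appended, then removes the file and returns
# CMD's exit status.
run_with_vault_pass() {
  vault_file=$(mktemp) || return 1   # mktemp creates the file readable by the owner only
  printf '%s\n' "$ANSIBLE_VAULT_PASS" > "$vault_file"
  status=0
  # "|| status=$?" keeps a failure from aborting the script before cleanup.
  "$@" --vault-password-file "$vault_file" || status=$?
  rm -f "$vault_file"
  return "$status"
}

# Example call for a job script:
# run_with_vault_pass ansible-playbook site.yml -i inventories/production/hosts.ini
```

The design choice here is to capture the exit status instead of relying on a later script line, so the file is gone before the job reports failure.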

Reading a failed pipeline job

When a pipeline fails, click the failed job (shown in red) to read its log. The log shows every command that ran and its output.

What to look for:

# Example failed ansible-lint output in CI log
$ ansible-lint
WARNING  Listing 2 violation(s) that are fatal

roles/nginx/tasks/main.yml:12: yaml[truthy] Truthy value should be one of [false, true]
roles/nginx/handlers/main.yml:3: no-handler Use [module] instead of command/shell for service management

Finished with 2 failure(s), 0 warning(s) on 8 files.
ERROR: Job failed: exit code 2
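
In a long log, the lines that matter are the ones naming a file and line number. A minimal sketch for pulling them out of a saved log follows; list_violations is a hypothetical helper, and the pattern assumes the "path:line: rule message" layout shown in the example above, which may differ between ansible-lint versions.

```shell
#!/bin/sh
# list_violations  (hypothetical helper)
# Reads a job log on stdin and prints only lines shaped like
# "path:line: message". Prints nothing if there are none.
list_violations() {
  # "|| true" keeps an empty result from counting as a failure.
  grep -E '^[^[:space:]:]+:[0-9]+: ' || true
}

# Example call, assuming the log was saved as job.log:
# list_violations < job.log
```

The final "exit code" line in the log is the exit status of the last command that ran, so the command printed just above the violations is the one to re-run locally.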

Artifacts

Jobs can save files that persist after the job ends. Useful for saving Ansible output, reports, or files that later jobs need.

dry-run:
  stage: check
  script:
    - ansible-playbook site.yml --check --diff 2>&1 | tee ansible-output.txt
  artifacts:
    name: "ansible-dry-run-${CI_COMMIT_SHORT_SHA}"
    paths:
      - ansible-output.txt
    expire_in: 7 days
    when: always    # save even on failure
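
One thing to watch with the | tee pattern: a pipeline's exit status is that of its last command unless the shell runs with pipefail, so a failing playbook could be reported as success. Whether pipefail is active depends on the shell the runner uses in your image, so treat that as something to verify. A minimal sketch that avoids the pipe altogether is below; run_and_log is a hypothetical helper.

```shell
#!/bin/sh
# run_and_log LOGFILE CMD [ARGS...]  (hypothetical helper)
# Runs CMD, saves stdout and stderr to LOGFILE, echoes the log to the
# job output, and returns CMD's own exit status.
run_and_log() {
  log_file=$1
  shift
  status=0
  "$@" > "$log_file" 2>&1 || status=$?
  cat "$log_file"
  return "$status"
}

# Example call for a job script:
# run_and_log ansible-output.txt ansible-playbook site.yml --check --diff
```

The trade-off is that output appears only after the command finishes instead of streaming live, which may matter for long playbook runs.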

rules — control when jobs run

# Run only on merge requests
rules:
  - if: $CI_PIPELINE_SOURCE == "merge_request_event"

# Run only on main branch
rules:
  - if: $CI_COMMIT_BRANCH == "main"

# Run on MRs AND main
rules:
  - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  - if: $CI_COMMIT_BRANCH == "main"

# Skip if commit message contains [skip-ci]
rules:
  - if: $CI_COMMIT_MESSAGE =~ /\[skip-ci\]/
    when: never
  - when: on_success

Manual deployment approval

For production deployments, require a human to click "play" in the GitLab pipeline view:

deploy-production:
  stage: deploy
  script:
    - ansible-playbook site.yml -i inventories/production/hosts.ini
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  when: manual          # shows a play button in the pipeline view
  allow_failure: false  # pipeline stays "blocked" until triggered

Runners — what runs your jobs

A runner is a service that picks up CI jobs and executes them. There are two types:

Check runner status: Settings → CI/CD → Runners.

If your job shows "Waiting for runner" or "No runners available", the runner is offline or no runner matches the job's tags.

# Specify a job must run on a runner with a specific tag
deploy-production:
  tags:
    - infra         # only run on runners tagged "infra"
  script:
    - ansible-playbook site.yml

cache: vs artifacts:

Both put files somewhere for a later run to pick up, but they serve different purposes and follow different rules. Get this wrong and you'll either waste minutes rebuilding deps every job, or silently ship stale build output to a later stage.

| | cache | artifacts |
|---|---|---|
| Purpose | Speed up repeat runs (pip/npm/molecule deps) | Hand output to a later stage / save for humans |
| Flows between | Pipeline runs (same job or any with matching key) | Jobs within one pipeline (needs:/dependencies:) |
| Trust level | Best-effort; job must cope if cache is missing/stale | Mandatory; downstream job relies on it |
| Storage | Runner-local (or S3 when distributed cache is on) | GitLab object storage, downloadable from the UI |
| Keyed by | Content of a file or branch name | Job name + pipeline ID |

cache: — speed up repeat jobs

lint-ansible:
  stage: lint
  image: python:3.11
  cache:
    key:
      files:
        - requirements.txt          # re-key the cache when deps change
    paths:
      - .cache/pip/
      - .venv/
    policy: pull-push               # default: pull before, push after
  before_script:
    - pip install --cache-dir .cache/pip -r requirements.txt
  script:
    - ansible-lint site.yml

artifacts: — pass files to the next stage

build-image:
  stage: build
  script:
    - packer build image.pkr.hcl
    - cp manifest.json build/
  artifacts:
    name: "image-${CI_COMMIT_SHORT_SHA}"
    paths:
      - build/
    expire_in: 1 week

deploy-image:
  stage: deploy
  needs: [build-image]              # explicitly consumes the artifact
  script:
    - terraform apply -auto-approve -var-file=build/manifest.json   # no one can answer the approval prompt in CI

If the next job would still succeed (just slower) without the files, it's cache. If the next job would fail without them, it's artifacts.

needs: DAG with parallel jobs

By default, jobs in a later stage wait for every job in earlier stages to finish. needs: breaks that lockstep — a job runs as soon as its specific predecessors complete, turning the pipeline into a directed acyclic graph (DAG) and dramatically reducing wall-clock time on wide pipelines.

stages: [lint, check, deploy]

# --- lint stage (run in parallel) ---
lint-ansible:
  stage: lint
  script: ansible-lint .

lint-yaml:
  stage: lint
  script: yamllint .

lint-shell:
  stage: lint
  script: shellcheck scripts/*.sh

# --- check stage (each depends on only one lint) ---
syntax-check:
  stage: check
  needs: [lint-ansible]            # starts as soon as lint-ansible is green
  script: ansible-playbook site.yml --syntax-check

yaml-schema:
  stage: check
  needs: [lint-yaml]               # independent of lint-ansible
  script: ./scripts/validate-schema.sh

# --- deploy stage (needs only the relevant check) ---
deploy-staging:
  stage: deploy
  needs: [syntax-check, yaml-schema]
  script: ansible-playbook site.yml -i inventories/staging/

Without needs:, deploy-staging would wait for lint-shell even though shell scripts have nothing to do with the deploy. With the DAG, syntax-check starts the moment lint-ansible finishes — no need to wait for the slowest lint job.

parallel: matrix

Generate a cartesian product of jobs from a variable matrix — ideal for testing the same role against multiple distros and Ansible versions without copy-pasting the job definition six times.

molecule-test:
  stage: check
  image: quay.io/ansible/molecule-runner:latest
  parallel:
    matrix:
      - DISTRO: [rockylinux9, ubuntu2204, debian12]
        ANSIBLE_VERSION: ["2.16", "2.17"]
  variables:
    MOLECULE_DISTRO: $DISTRO
  before_script:
    - pip install "ansible-core==${ANSIBLE_VERSION}.*"
  script:
    - molecule test

That's a 3 × 2 matrix — 6 parallel jobs named molecule-test: [rockylinux9, 2.16], molecule-test: [rockylinux9, 2.17], and so on. The matrix variables are exposed as ordinary environment variables inside each job. Failures are reported per cell, so you can see exactly which combination broke.

Matrix jobs all count against your concurrent-job limit — if you have 6 matrix cells but only 2 free runners, the remaining 4 queue until a runner frees up. Size the matrix with that in mind.
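
To check how many cells a matrix produces before committing it, the expansion can be reproduced with two nested loops. This is only an illustrative sketch: list_matrix_cells is a hypothetical helper, with the values copied from the example above.

```shell
#!/bin/sh
# list_matrix_cells  (hypothetical helper)
# Prints one line per DISTRO x ANSIBLE_VERSION combination, using the
# job naming shown in the pipeline view.
list_matrix_cells() {
  for distro in rockylinux9 ubuntu2204 debian12; do
    for version in 2.16 2.17; do
      echo "molecule-test: [$distro, $version]"
    done
  done
}

# Example: count the cells to compare against the number of free runners.
# list_matrix_cells | wc -l
```

Adding one more value to either list adds a whole row or column of jobs, which is why the count grows faster than the lists do.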

include: project, remote, template

Keep .gitlab-ci.yml short by pulling in pipeline fragments from elsewhere. Four common sources:

# .gitlab-ci.yml — one file, many sources
include:
  # 1. local — a file in this repo
  - local: '.gitlab/ci/ansible.yml'

  # 2. project — from another GitLab project, pinned to a ref
  - project: 'infra/ci-templates'
    ref: v3.1.0                       # tag, branch, or commit SHA
    file:
      - '/ansible/lint.yml'
      - '/ansible/molecule.yml'

  # 3. remote — any public URL returning YAML
  - remote: 'https://raw.githubusercontent.com/example/ci-shared/main/python.yml'

  # 4. template — GitLab-maintained stock templates
  - template: Security/SAST.gitlab-ci.yml

stages: [lint, check, deploy, security]

Centralise once, consume everywhere. An infra/ci-templates project holding the common Ansible lint / molecule / syntax-check jobs means every service repo only writes the bits that are genuinely specific to its pipeline. Upgrade the template in one place, pin consumers to new versions at their own pace.

Always pin ref:. An unpinned project: include resolves to HEAD of the default branch, so a change on that branch will silently alter every downstream pipeline on its next run. Pin to a tag and bump the tag deliberately.

OIDC to cloud

The old pattern for "pipeline needs AWS credentials" was a long-lived AWS_ACCESS_KEY_ID in CI/CD variables — which gets leaked in logs, rotated by nobody, and has an unlimited blast radius. GitLab's id_tokens: replaces that with short-lived JWTs exchanged at the cloud's OIDC endpoint.

deploy-terraform:
  stage: deploy
  id_tokens:
    AWS_WEB_IDENTITY_TOKEN:
      aud: https://gitlab.example.com     # must match the IAM trust policy
    VAULT_ID_TOKEN:
      aud: vault-prod
  script:
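    # AWS: exchange the JWT for STS credentials and export them for the commands below
    - >-
      export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s"
      $(aws sts assume-role-with-web-identity
      --role-arn arn:aws:iam::123456789012:role/gitlab-deploy
      --role-session-name "$CI_JOB_ID"
      --web-identity-token "$AWS_WEB_IDENTITY_TOKEN"
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]'
      --output text))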

    # Vault — exchange the JWT for a Vault token
    - vault write -field=token auth/jwt/login
      role=ci-deploy jwt="$VAULT_ID_TOKEN" > ~/.vault-token

    - terraform apply -auto-approve

No static cloud creds live in GitLab at all — the cloud side trusts GitLab's OIDC issuer and binds the role/policy to claims from the JWT (project path, branch, environment). For the matching Vault / AWS / GCP trust-policy setup and the full worked example, see Secrets & OIDC.
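
When the cloud side rejects a token, it helps to see which claims the JWT actually carries and compare them with the trust policy. A minimal sketch for decoding only the payload is below; jwt_payload is a hypothetical helper, and it assumes a base64 command that accepts -d (GNU coreutils, BusyBox, recent macOS).

```shell
#!/bin/sh
# jwt_payload TOKEN  (hypothetical helper)
# Prints the decoded JSON claims of a JWT. It does not verify the
# signature and never prints the raw token.
jwt_payload() {
  # A JWT is header.payload.signature; take the middle part and map the
  # base64url alphabet back to standard base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops the "=" padding; restore it so base64 accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Example call for a debugging job:
# jwt_payload "$VAULT_ID_TOKEN"
```

Only the claims are printed, but they still describe your project and branches, so keep this to a temporary debugging job and do not echo the token itself.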