Ansible Best Practices & Refactoring

How to write Ansible that other humans can read, safely extend, and roll back. Conventions, idempotency patterns, error handling, performance, security, refactoring recipes, anti-patterns, and a review checklist.

If you only remember six things
  • Every task must be idempotent — a second run does nothing.
  • Use the right module. shell/command is an escape hatch, not a default.
  • Roles expose a small, well-named public interface in defaults/; anything else is private.
  • Handlers exist for service restarts and only service restarts. Everything else is a task.
  • Secrets live in Vault. The variable name they are referenced by does not.
  • Run --check --diff on the same inventory your CI uses before you open the MR.
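
The Vault rule above is easiest to see as a pair of files. This is a minimal sketch, assuming a hypothetical db group and the illustrative variable names db_password and vault_db_password: the plain file carries the name that roles reference and that shows up in grep, while the vaulted file carries only the value.

```yaml
# inventory/group_vars/db/main.yml (plain text, reviewable in diffs)
# db_password and vault_db_password are hypothetical names for illustration.
db_password: "{{ vault_db_password }}"

# inventory/group_vars/db/vault.yml (encrypted with ansible-vault;
# shown decrypted here only so the structure is visible)
vault_db_password: "example-not-a-real-secret"
```

Roles only ever read db_password, so a reviewer can find where the secret is used without decrypting anything.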

Naming conventions

Roles

Tasks, plays, handlers

Tags

Hosts

Layout and file placement

Where does this variable go?

What it is | Where | Why
Default for a role-internal value | roles/x/defaults/main.yml | User-overridable, lowest precedence, documents the role's knobs.
Constant the user must not override | roles/x/vars/main.yml | High precedence; used for package names, service names, platform split.
Per-group value | inventory/group_vars/<group>/main.yml | Group scope is the first knob humans reach for.
Per-host value | inventory/host_vars/<host>.yml | Reserve for genuine per-host deltas (IPs, certs).
Site-wide for the whole inventory | inventory/group_vars/all/main.yml | Timezone, operator email, retention days.
Secrets | inventory/group_vars/<group>/vault.yml (vaulted) | Separate vault files per scope so you can rotate per-env passwords.
Ephemeral one-off | --extra-vars (-e) | Highest precedence; use for hotfix overrides, never for "permanent" config.
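
To make the first two rows concrete, here is a sketch for the placeholder role x, assuming it manages nginx. Every variable name and value is illustrative, not taken from a real role.

```yaml
# roles/x/defaults/main.yml: the role's public knobs, lowest precedence
x_worker_processes: 2
x_listen_port: 8080

# roles/x/vars/main.yml: constants the role relies on, not meant to be overridden
x_package_name: nginx
x_service_name: nginx
```

A group_vars file can then override x_listen_port without touching the role, while x_package_name stays fixed inside it.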

group_vars/<group>/ as a directory

Split one giant YAML into topic files. Ansible reads all of them.

inventory/group_vars/web/
├── main.yml        # common web defaults
├── tls.yml         # cert paths, ciphers
├── performance.yml # worker processes, keepalive, timeouts
└── vault.yml       # vaulted secrets

Inventory per environment

inventories/
├── dev/
│   ├── hosts.ini
│   └── group_vars/...
├── stage/
│   ├── hosts.ini
│   └── group_vars/...
└── prod/
    ├── hosts.ini
    └── group_vars/...

Deliberately avoid "branch per environment". Same code, different inventory — the diff between what dev runs and what prod runs is just the inventory directory.

One site.yml, many playbooks

Keep site.yml as the "everything" entrypoint. Per-layer entrypoints (db.yml, app.yml, proxy.yml) are optional shortcuts; --tags should already give you that selectivity.
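
A minimal sketch of such a site.yml, assuming three hypothetical roles (db, app, proxy) and matching inventory groups. The tag names are an illustrative convention.

```yaml
# site.yml: the "everything" entrypoint (group, role and tag names are hypothetical)
- name: Configure database servers
  hosts: db
  roles:
    - role: db
      tags: [db]

- name: Configure application servers
  hosts: app
  roles:
    - role: app
      tags: [app]

- name: Configure proxies
  hosts: proxy
  roles:
    - role: proxy
      tags: [proxy]
```

With this layout, ansible-playbook -i inventories/dev site.yml --tags app runs only the tasks tagged app, which is the selectivity a separate app.yml would otherwise provide.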

Idempotency

Idempotency means a second run reports zero changes. It is non-negotiable because Ansible's safety model depends on it: check-mode, diff, and CI dry-runs all assume changed ≈ real drift.
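
As an illustration of the difference, the two tasks below aim at the same end state. The file path and setting are assumptions chosen for the example.

```yaml
# Not idempotent: appends a duplicate line and reports "changed" on every run.
- name: Add sysctl setting (anti-example)
  ansible.builtin.shell: echo "vm.swappiness = 10" >> /etc/sysctl.d/99-tuning.conf

# Idempotent: touches the file only when the line is missing or different.
- name: Add sysctl setting
  ansible.builtin.lineinfile:
    path: /etc/sysctl.d/99-tuning.conf
    regexp: '^vm\.swappiness\s*='
    line: vm.swappiness = 10
    create: true
    mode: '0644'
```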

Patterns that make tasks idempotent

# 1. Prefer real modules. file, copy, template, package, service, lineinfile,
#    blockinfile, replace, user, group, firewalld, systemd, mount — all idempotent.
- ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/app.conf
    mode: '0644'
    validate: 'app -t -c %s'
  notify: reload app

# 2. `creates:` / `removes:` for commands that set up durable state.
- ansible.builtin.command: /opt/app/bin/init-db
  args:
    creates: /var/lib/app/.initialised

# 3. Explicit changed_when for scripts whose exit code doesn't indicate change.
- ansible.builtin.command: /opt/app/bin/check-quota
  register: quota
  changed_when: "'updated' in quota.stdout"
  failed_when: "quota.rc != 0 and 'ok' not in quota.stdout"

# 4. check_mode for "report only" tasks that should never mutate.
- ansible.builtin.command: /usr/sbin/audit2allow -l
  check_mode: false           # always run, even in --check
  changed_when: false         # is pure read
  register: audit

When idempotency is genuinely hard

Error handling and failure budgets

block / rescue / always

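- name: Swap the backend pool with rollback
  block:
    # Assumption: nothing else creates the .bak file, so take the copy first.
    - name: Keep a copy of the current config for rollback
      ansible.builtin.copy:
        remote_src: true
        src: /etc/nginx/upstreams.conf
        dest: /etc/nginx/upstreams.conf.bak
    - name: Put the edge into drain mode
      ansible.builtin.command: /opt/edge/bin/drain
    - name: Swap the backend pool
      ansible.builtin.template:
        src: upstream.conf.j2
        dest: /etc/nginx/upstreams.conf
      notify: reload nginx
  rescue:
    - name: Roll back to the previous config
      ansible.builtin.copy:
        remote_src: true
        src: /etc/nginx/upstreams.conf.bak
        dest: /etc/nginx/upstreams.conf
      notify: reload nginx
    # The fail task below marks the host failed, and failed hosts skip
    # pending handlers, so run the reload now.
    - name: Flush handlers before failing
      ansible.builtin.meta: flush_handlers
    - name: Re-raise so CI fails
      ansible.builtin.fail:
        msg: "Edge swap failed, rolled back."
  always:
    - name: Take the edge out of drain mode
      ansible.builtin.command: /opt/edge/bin/undrain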

Play-level safety valves

- hosts: app
  become: true
  serial: "25%"              # roll out 25% of hosts at a time
  max_fail_percentage: 10    # bail the play if >10% of a batch fails
  any_errors_fatal: false    # let successful hosts keep going
  roles:
    - app

Explicit failure conditions

- ansible.builtin.command: systemctl is-active --quiet nginx
  register: nginx_up
  failed_when: nginx_up.rc != 0
  changed_when: false

# Fail early on bad input
- ansible.builtin.assert:
    that:
      - db_name is defined
      - db_name is match('^[a-z_]+$')
    fail_msg: "db_name must be snake_case"

Anti-pattern: ignore_errors: true on normal tasks. Use failed_when: to narrow what counts as failure. Reserve ignore_errors for genuinely best-effort tasks (e.g. gathering debug output after a failure) and pair it with an assert later.
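
A sketch of that narrowing, using a hypothetical /opt/app/bin/unlock tool. The meaning of exit code 3 ("no lock held") is an assumption made for the example.

```yaml
# Anti-example: every failure is swallowed, including "command not found".
- name: Remove stale lock (anti-example)
  ansible.builtin.command: /opt/app/bin/unlock
  ignore_errors: true

# Narrowed: only the assumed "no lock held" exit code (3) is tolerated.
- name: Remove stale lock
  ansible.builtin.command: /opt/app/bin/unlock
  register: unlock
  failed_when: unlock.rc not in [0, 3]
  changed_when: unlock.rc == 0
```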

See also: Error Handling.

Performance

Fact gathering

# Only gather what you use
- hosts: web
  gather_facts: true
  # gather_subset is a play keyword, not a variable
  gather_subset:
    - "!all"
    - "!min"
    - network
    - distribution
    - os_family

For big fleets, enable fact caching so you are not hitting every host every run:

# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .facts-cache
fact_caching_timeout = 7200

Connection tuning

# ansible.cfg
[defaults]
forks = 30
# Disable host key checking only in trusted networks; prefer signed keys in prod.
# (ansible.cfg treats an inline "#" as part of the value, so comments get their own line.)
host_key_checking = False

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey

With OpenSSH ControlMaster plus pipelining enabled, a typical run takes 40-60% less wall time on large inventories. Pipelining only works with become if requiretty is not set in sudoers on the managed hosts.

Strategy

# Default is 'linear': every host waits at each task.
# 'free' lets each host race to the end independently — finishes faster, logs interleave.
- hosts: many_hosts
  strategy: free
  tasks:
    - ...

async / poll for long tasks

- name: Run a slow backup
  ansible.builtin.command: /usr/local/bin/slow-backup
  async: 3600   # up to one hour
  poll: 0       # fire-and-forget
  register: backup_job

- name: Wait for backup to finish
  ansible.builtin.async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_status
  until: backup_status.finished
  retries: 120
  delay: 30

Avoid with_items over huge lists

Use loop on small collections. For hundreds of items, call a module that accepts a list in a single invocation (for example, package with a list of names), or build a single file from a template rather than looping a task.
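
For instance, the package case can be sketched like this. The package names are illustrative.

```yaml
# Anti-example: one module invocation (and one package-manager call) per item.
- name: Install base packages (anti-example)
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop: [curl, git, rsync]

# Preferred: a single invocation that receives the whole list.
- name: Install base packages
  ansible.builtin.package:
    name:
      - curl
      - git
      - rsync
    state: present
```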

Mitogen (pip install mitogen, strategy plugin) is another option and gives large (often 2–4x) speedups for CPU-bound plays. It is battle-tested but unofficial; pin the version and know what happens when you upgrade Ansible.

See also: Performance.

Security

Refactoring recipes

Recipe 1 — "This playbook is 400 lines. How do I split it?"

  1. Draw a line under each group of tasks that ends in a handler or an assert. That's usually a role boundary.
  2. For each chunk, create roles/<name>/tasks/main.yml and move the tasks in.
  3. Move any vars: used only by that chunk into roles/<name>/defaults/main.yml (prefixed).
  4. Replace the original chunk in the playbook with - { role: name }.
  5. Run --check --diff against dev. Zero diff = successful refactor.
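
Steps 3 and 4 might end up looking like the sketch below, assuming one of the extracted chunks became a hypothetical nginx role. The variable name and value are illustrative.

```yaml
# roles/nginx/defaults/main.yml (step 3: vars moved out of the play, prefixed)
nginx_worker_connections: 1024
---
# playbook after the split (step 4): the inline tasks are replaced by the role
- name: Configure web hosts
  hosts: web
  roles:
    - role: nginx
```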

Recipe 2 — "This role does three things. How do I cleave it?"

  1. Identify the three subjects (install / configure / monitor is common).
  2. Either: split into three roles (x_install, x_config, x_monitor) with explicit dependencies in meta/main.yml; or: keep one role and split its tasks/main.yml into install.yml, config.yml, monitor.yml and include_tasks them.
  3. Prefer the multi-file-single-role approach first. Only split into separate roles once the pieces are being mixed-and-matched across plays.
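
The multi-file, single-role option could be sketched as follows. Here x is the recipe's placeholder role name and x_monitor_enabled is a hypothetical toggle.

```yaml
# roles/x/tasks/main.yml: one entry per subject, each in its own file
- name: Install packages
  ansible.builtin.include_tasks: install.yml

- name: Write configuration
  ansible.builtin.include_tasks: config.yml

- name: Set up monitoring
  ansible.builtin.include_tasks: monitor.yml
  when: x_monitor_enabled | default(true)   # hypothetical toggle
```

If the pieces later need to be reused independently, each file is already a natural starting point for its own role.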

Recipe 3 — "Same 20 lines copy/pasted across three roles"

  1. Pull the shared code out into a new role, or a shared collection if you have one (Ansible Collection).
  2. Declare the new role as a dependency in the three consuming roles' meta/main.yml.
  3. For tiny snippets (2-5 lines, not worth a role), use include_tasks: from a shared roles/common_snippets/tasks/*.yml.
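
A sketch of steps 2 and 3, assuming a consuming role called app, a shared role called common_tls, and a snippet file wait_for_port.yml. All three names are hypothetical.

```yaml
# roles/app/meta/main.yml (step 2: declare the shared role as a dependency)
dependencies:
  - role: common_tls
---
# roles/app/tasks/main.yml (step 3: pull in a tiny shared snippet)
# Assumes roles/ sits next to the playbook, so playbook_dir resolves the path.
- name: Wait for the app port
  ansible.builtin.include_tasks: "{{ playbook_dir }}/roles/common_snippets/tasks/wait_for_port.yml"
```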

Recipe 4 — "The when: chain is unreadable"

# Before
- name: Install on Debian-family
  ansible.builtin.apt: { name: foo, state: present }
  when: ansible_facts.os_family == "Debian" and not ansible_check_mode and foo_enabled | default(true) and inventory_hostname in groups['enabled_hosts']

# After: narrow scope with a block, factor out the real condition
- name: Install foo (Debian-family only)
  when: foo_enabled | default(true)
  block:
    - name: Install the foo package
      ansible.builtin.apt:
        name: foo
        state: present
      when: ansible_facts.os_family == "Debian"
  # apt supports check mode, so the "not ansible_check_mode" guard is dropped.
  # Host gating is the inventory's job, not the play's: move it to hosts: enabled_hosts.

Recipe 5 — "We abuse set_fact"

Symptoms: set_fact calls scattered through tasks, later tasks reading facts that were set 200 lines up, multi-line Jinja in every task name. Fixes:

Anti-patterns to grep for in code review

Anti-pattern | Why it's bad | Fix
shell: "foo --bar {{ user_input }}" | Command injection surface; non-idempotent | Use the module for foo, or at minimum command: with argv:
command: true / command: "true" | Noise; defeats changed detection | Delete the task, or meta: noop
changed_when: false on everything | Hides real drift, breaks check-mode | Fix idempotency; only use on genuine read-only tasks
ignore_errors: true without assert after | Failures invisible; CI green on red | Narrow with failed_when:, or block/rescue
when: ansible_hostname == "web03" | Host-specific logic in playbooks | Move to inventory groups + group_vars
{{ lookup('env', 'SECRET') }} outside CI | Leaks caller environment, non-portable | Vault, or a proper secrets backend (hashi_vault lookup)
- include: foo.yml (the deprecated dynamic one) | Ambiguous; gone in modern Ansible | include_tasks (dynamic) or import_tasks (static)
Hardcoded 10.0.0.5 IPs in tasks | Breaks the moment you re-IP or clone to staging | Vars for addresses; groups for discovery
Magic numbers (worker_processes: 4) | Why 4? Works until it doesn't | Express as "{{ ansible_processor_vcpus }}" or document the reason inline
Roles that only exist to include_role other roles | Indirection for indirection's sake | Inline in the playbook, or merge the wrapper into the child
become: true at play level, single role needs it | All tasks get sudo, expands blast radius | Move become to the role block that needs it
Vault files outside the inventory tree | Breaks the "inventory-per-env" principle | inventory/<env>/group_vars/<group>/vault.yml
Handlers doing non-restart work | Side-effects that skip in check-mode | Convert to regular tasks with when: some_fact.changed
--extra-vars committed to CI config | Highest-precedence vars surprise everyone | Put values in inventory; use -e only for hotfixes
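
The first row's fix can be sketched like this. foo is the table's placeholder command, and the 'updated' output marker is an assumption about how such a tool might report changes.

```yaml
# Anti-example: the value is interpolated into a shell command line.
- name: Run foo (anti-example)
  ansible.builtin.shell: "foo --bar {{ user_input }}"

# Safer: no shell is involved, and each argv element is passed as one argument.
- name: Run foo
  ansible.builtin.command:
    argv:
      - foo
      - --bar
      - "{{ user_input }}"
  register: foo_result
  changed_when: "'updated' in foo_result.stdout"   # assumed output marker
```

Because argv bypasses the shell, characters such as ; or $( ) in user_input reach foo as literal text instead of being interpreted.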

Review checklist

Paste this into your MR template and tick through it before asking for review.

Correctness

Style

Safety

Operability