Ansible Best Practices & Refactoring

How to write Ansible that other humans can read, safely extend, and roll back. Conventions, idempotency patterns, error handling, performance, security, refactoring recipes, anti-patterns, and a review checklist.

If you only remember six things

Every task must be idempotent — a second run does nothing.
Use the right module. shell/command is an escape hatch, not a default.
Roles expose a small, well-named public interface in defaults/; anything else is private.
Handlers exist for service restarts and only service restarts. Everything else is a task.
Secrets live in Vault. The variable name they are referenced by does not.
Run --check --diff on the same inventory your CI uses before you open the MR.

On this page

Naming conventions
Layout and file placement
Idempotency
Error handling and failure budgets
Performance
Security
Refactoring recipes
Anti-patterns to grep for
Review checklist

Naming conventions

Roles

One role, one verb on one subject. nginx, postgresql, ssh_hardening. Not web_and_db, not setup_everything.
Prefix variables with the role name. nginx_, postgresql_, sshd_. Stops collisions when two roles are in the same play.
Private vars start with an underscore. _nginx_package_name lives in vars/main.yml and signals "do not override from outside".
Public vars live in defaults/main.yml. That file IS your role's documented interface. If a var is not in defaults/, users should not be setting it.
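
A minimal sketch of how that public/private split might look for an nginx role. Only _nginx_package_name comes from the text above; the other variable names and values are illustrative assumptions, not a prescribed interface.

```yaml
# roles/nginx/defaults/main.yml: public interface, safe to override
nginx_worker_processes: 2        # hypothetical knob
nginx_listen_port: 80            # hypothetical knob

# roles/nginx/vars/main.yml: private constants, not meant to be overridden
_nginx_package_name: nginx
_nginx_service_name: nginx       # hypothetical
```

Anyone reading defaults/main.yml sees every knob the role supports; the underscore prefix marks the rest as internal.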

Tasks, plays, handlers

Every task has a name:. The name is a log line — write it as one. Good: Install nginx package. Bad: install.
Plays too. - name: Configure the web tier.
Handlers are imperative. reload nginx, restart postgresql. Never nginx reload handler.
No emoji, no "TODO", no "temporary" in names. Comments on tasks or commits carry that.
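
The naming rules above could look roughly like this in a play. The template file name is an assumption, and privilege escalation is left out for brevity.

```yaml
- name: Configure the web tier
  hosts: web
  tasks:
    - name: Install nginx package
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Deploy nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2          # hypothetical template
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: reload nginx

  handlers:
    - name: reload nginx            # imperative, matches the notify string exactly
      ansible.builtin.service:
        name: nginx
        state: reloaded
```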

Hosts

Group by role, not by environment. web, db, mail. Environment is an inventory dir (inventories/prod/), not a group named prod-web.
No per-host special-casing in playbooks. If web03 needs something different, give it a group and put the delta in group_vars.

Layout and file placement

Where does this variable go?

What it is	Where	Why
Default for a user-tunable role setting	`roles/x/defaults/main.yml`	User-overridable, lowest precedence, documents the role's knobs.
Constant the user must not override	`roles/x/vars/main.yml`	High precedence; used for package names, service names, platform split.
Per-group value	`inventory/group_vars/<group>/main.yml`	Group scope is the first knob humans reach for.
Per-host value	`inventory/host_vars/<host>.yml`	Reserve for genuine per-host deltas (IPs, certs).
Site-wide for the whole inventory	`inventory/group_vars/all/main.yml`	Timezone, operator email, retention days.
Secrets	`inventory/group_vars/<group>/vault.yml` (vaulted)	Separate vault files per scope so you can rotate per-env passwords.
Ephemeral one-off	`--extra-vars` (`-e`)	Highest precedence; use for hotfix overrides, never for "permanent" config.

`group_vars/<group>/` as a directory

Split one giant YAML into topic files. Ansible reads all of them.

inventory/group_vars/web/
├── main.yml        # common web defaults
├── tls.yml         # cert paths, ciphers
├── performance.yml # worker processes, keepalive, timeouts
└── vault.yml       # vaulted secrets
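
One common way to keep secrets vaulted while keeping their variable names greppable is an indirection through a vault_-prefixed variable. The sketch below assumes that convention; the variable names and the value are hypothetical.

```yaml
# inventory/group_vars/web/tls.yml: plain text, shows up in grep and in reviews
web_tls_key_passphrase: "{{ vault_web_tls_key_passphrase }}"

# inventory/group_vars/web/vault.yml: encrypted with ansible-vault before commit
vault_web_tls_key_passphrase: "example-only-not-a-real-secret"
```

Roles and templates reference web_tls_key_passphrase; only the vault file needs decrypting to change the value.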

Inventory per environment

inventories/
├── dev/
│   ├── hosts.ini
│   └── group_vars/...
├── stage/
│   ├── hosts.ini
│   └── group_vars/...
└── prod/
    ├── hosts.ini
    └── group_vars/...

Deliberately avoid "branch per environment". Same code, different inventory — the diff between what dev runs and what prod runs is just the inventory directory.

One `site.yml`, many playbooks

Keep site.yml as the "everything" entrypoint. Per-layer entrypoints (db.yml, app.yml, proxy.yml) are nice but are optional shortcuts; --tags should already give you that selectivity.
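
A minimal sketch of such an entrypoint, assuming the per-layer playbooks db.yml, app.yml and proxy.yml exist next to site.yml. The tag names are assumptions.

```yaml
# site.yml
- name: Configure the database tier
  ansible.builtin.import_playbook: db.yml
  tags: [db]

- name: Configure the application tier
  ansible.builtin.import_playbook: app.yml
  tags: [app]

- name: Configure the proxy tier
  ansible.builtin.import_playbook: proxy.yml
  tags: [proxy]
```

With this layout, `ansible-playbook -i inventories/dev site.yml --tags proxy` and `ansible-playbook -i inventories/dev proxy.yml` should select the same work.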

Idempotency

Idempotency means a second run reports zero changes. It is non-negotiable because Ansible's safety model depends on it: check-mode, diff, and CI dry-runs all assume changed ≈ real drift.

Patterns that make tasks idempotent

# 1. Prefer real modules. file, copy, template, package, service, lineinfile,
#    blockinfile, replace, user, group, firewalld, systemd, mount: all idempotent.
- name: Render the app configuration
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/app.conf
    mode: '0644'
    validate: 'app -t -c %s'
  notify: reload app

# 2. `creates:` / `removes:` for commands that set up durable state.
- name: Initialise the database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db
    creates: /var/lib/app/.initialised

# 3. Explicit changed_when for scripts whose exit code doesn't indicate change.
- name: Check and adjust the quota
  ansible.builtin.command: /opt/app/bin/check-quota
  register: quota
  changed_when: "'updated' in quota.stdout"
  failed_when: "quota.rc != 0 and 'ok' not in quota.stdout"

# 4. check_mode: false for "report only" tasks that should never mutate.
- name: Report audit2allow output since the last policy reload
  ansible.builtin.command: /usr/sbin/audit2allow -l
  check_mode: false           # always run, even in --check
  changed_when: false         # pure read, never reports changed
  register: audit

When idempotency is genuinely hard

External API (a SaaS, a cloud): check current state with a GET, act only if it differs. The community.general and cloud collections already do this — use them.
Migrations: wrap in a creates: flag file or a "migration table row exists" check.
"Run once per deploy": use run_once: true + delegate_to, and combine with a flag file if it must also be once-ever.

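The "GET first, act only on a difference" pattern can be sketched with ansible.builtin.uri when no dedicated module exists. The API URL, payload and status codes below are hypothetical, and authentication is omitted.

```yaml
- name: Look up the monitor in the (hypothetical) SaaS API
  ansible.builtin.uri:
    url: "https://api.example.com/v1/monitors/{{ inventory_hostname }}"
    method: GET
    status_code: [200, 404]     # 404 is an expected answer, not a failure
  register: monitor_lookup
  delegate_to: localhost
  check_mode: false             # read-only, so it is safe to run in --check

- name: Create the monitor only when it does not exist yet
  ansible.builtin.uri:
    url: "https://api.example.com/v1/monitors"
    method: POST
    body_format: json
    body:
      name: "{{ inventory_hostname }}"
    status_code: [201]
  delegate_to: localhost
  when: monitor_lookup.status == 404
  changed_when: true            # a successful POST is a real change
```

On a second run the GET returns 200, the POST is skipped, and the play reports no changes.
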
Error handling and failure budgets

`block / rescue / always`

- block:
    - name: Put the edge into drain mode
      ansible.builtin.command: /opt/edge/bin/drain
    - name: Swap the backend pool
      ansible.builtin.template:
        src: upstream.conf.j2
        dest: /etc/nginx/upstreams.conf
      notify: reload nginx
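  rescue:
    - name: Roll back to the previous config
      ansible.builtin.copy:
        remote_src: true
        src: /etc/nginx/upstreams.conf.bak
        dest: /etc/nginx/upstreams.conf
    # Reload explicitly instead of notifying: the fail task below marks the host
    # as failed, and notified handlers do not run on failed hosts unless
    # force_handlers is enabled.
    - name: Reload nginx with the restored config
      ansible.builtin.service:
        name: nginx
        state: reloaded
    - name: Re-raise so CI fails
      ansible.builtin.fail:
        msg: "Edge swap failed, rolled back."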
  always:
    - name: Take the edge out of drain mode
      ansible.builtin.command: /opt/edge/bin/undrain

Play-level safety valves

- hosts: app
  become: true
  serial: "25%"              # roll out 25% of hosts at a time
  max_fail_percentage: 10    # bail the play if >10% of a batch fails
  any_errors_fatal: false    # let successful hosts keep going
  roles:
    - app

Explicit failure conditions

- ansible.builtin.command: systemctl is-active --quiet nginx
  register: nginx_up
  failed_when: nginx_up.rc != 0
  changed_when: false

# Fail early on bad input
- ansible.builtin.assert:
    that:
      - db_name is defined
      - db_name is match('^[a-z_]+$')
    fail_msg: "db_name must be snake_case"

Anti-pattern: ignore_errors: true on normal tasks. Use failed_when: to narrow what counts as failure. Reserve ignore_errors for genuinely best-effort tasks (e.g. gathering debug output after a failure) and pair it with an assert later.
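
As a sketch of narrowing failure with failed_when instead of ignore_errors: grep exits 1 when nothing matches and 2 on a real error, so only the latter should fail the task. The option name being searched for is an assumption.

```yaml
- name: Check whether the legacy option is still present
  ansible.builtin.command: grep -q '^legacy_mode' /etc/app/app.conf
  register: legacy_check
  changed_when: false                          # read-only check
  failed_when: legacy_check.rc not in [0, 1]   # 0 = found, 1 = not found, 2+ = real error
```

Later tasks can branch on `legacy_check.rc == 0` while an unreadable file still fails the run.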

Performance

Fact gathering

# Only gather what you use
- hosts: web
  gather_facts: true
  gather_subset:
    - "!all"
    - "!min"
    - network
    - distribution
    - os_family

For big fleets, enable fact caching so you are not hitting every host every run:

# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .facts-cache
fact_caching_timeout = 7200

Connection tuning

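# ansible.cfg
[defaults]
forks = 30
# Only in trusted networks; prefer signed keys in prod.
# (Inline comments after a value must use ';', so this one gets its own line.)
host_key_checking = False

[ssh_connection]
# pipelining is read from [ssh_connection], not [defaults]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey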

With OpenSSH ControlMaster and pipelining enabled, a typical run on a large inventory takes 40-60% less wall time. Pipelining with become requires that requiretty is not set in sudoers on the managed hosts.

Strategy

# Default is 'linear': every host waits at each task.
# 'free' lets each host race to the end independently — finishes faster, logs interleave.
- hosts: many_hosts
  strategy: free
  tasks:
    - ...

`async` / `poll` for long tasks

- name: Run a slow backup
  ansible.builtin.command: /usr/local/bin/slow-backup
  async: 3600   # up to one hour
  poll: 0       # fire-and-forget
  register: backup_job

- name: Wait for backup to finish
  ansible.builtin.async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_status
  until: backup_status.finished
  retries: 120
  delay: 30

Avoid `with_items` over huge lists

Use loop on small collections. For hundreds of items, call a module that accepts a list in a single invocation (e.g. package with a list in name:), or build a single file from a template rather than looping a task.
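
A small sketch of the difference, assuming a hypothetical base_packages list defined in group_vars:

```yaml
# Slow: one module invocation (and one package-manager run) per item
- name: Install base packages one at a time
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop: "{{ base_packages }}"

# Faster: the whole list goes to the package manager in one call
- name: Install base packages in one call
  ansible.builtin.package:
    name: "{{ base_packages }}"
    state: present
```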

Mitogen (pip install mitogen, strategy plugin) is another option and gives large (often 2–4x) speedups for CPU-bound plays. It is battle-tested but unofficial; pin the version and know what happens when you upgrade Ansible.

Security

no_log: true on tasks whose diff or result contains secrets. Set it by default on any task that receives a password variable.
become scope. Don't set it play-wide; set it at the smallest scope that still works. Role-level is the usual right answer.
Vault passwords are rotated and stored outside the repo. ~/.vault-pass-prod, HashiCorp Vault, or your CI's secret store.
Never check in SSH private keys. Use SSH certificates for short-lived credentials (see SSH CA Certs).
Signed commits in the Ansible repo. Infra repos are high-value targets, so enable signed-commit and protected-branch requirements (see Git for Infra).
Pin collection versions in requirements.yml. Use ansible-galaxy collection install -r requirements.yml in CI.
Reject fallbacks from a module to shell in review: if a contributor's change introduces ansible.builtin.shell where a module exists, push back.
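
The first two points could be sketched as follows. The token variable and file path are hypothetical; the vault_ prefix assumes the secret is defined in a vaulted group_vars file.

```yaml
- name: Configure the application tier
  hosts: app
  roles:
    - role: app
      become: true            # escalation is limited to this role
  tasks:
    - name: Write the API token file
      ansible.builtin.copy:
        content: "{{ vault_app_api_token }}"   # hypothetical vaulted variable
        dest: /etc/app/api-token               # hypothetical path
        mode: '0600'
      become: true            # task-level, because the play itself is unprivileged
      no_log: true            # keeps the token out of task output and --diff
```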

Refactoring recipes

Recipe 1 — "This playbook is 400 lines. How do I split it?"

Draw a line under each group of tasks that ends in a handler or an assert. That's usually a role boundary.
For each chunk, create roles/<name>/tasks/main.yml and move the tasks in.
Move any vars: used only by that chunk into roles/<name>/defaults/main.yml (prefixed).
Replace the original chunk in the playbook with - { role: name }.
Run --check --diff against dev. Zero diff = successful refactor.
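
Steps 3 and 4 might look like this for a chunk that becomes an nginx role. The variable is the worker_processes example used elsewhere on this page; the rest is an illustrative sketch.

```yaml
# Before: a play-level variable used by only one chunk of tasks
# - hosts: web
#   vars:
#     worker_processes: 4
#   tasks:
#     - ...nginx tasks...

# After: roles/nginx/defaults/main.yml holds the prefixed variable
nginx_worker_processes: 4

# After: the playbook chunk shrinks to a role reference
# - hosts: web
#   roles:
#     - { role: nginx }
```

Templates and tasks moved into the role need the same rename, otherwise the --check --diff run in step 5 will not be empty.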

Recipe 2 — "This role does three things. How do I cleave it?"

Identify the three subjects (install / configure / monitor is common).
Either: split into three roles (x_install, x_config, x_monitor) with explicit dependencies in meta/main.yml; or: keep one role and split its tasks/main.yml into install.yml, config.yml, monitor.yml and include_tasks them.
Prefer the multi-file-single-role approach first. Only split into separate roles once the pieces are being mixed-and-matched across plays.
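
A minimal sketch of the single-role, multi-file layout, using the install / config / monitor split named above:

```yaml
# roles/x/tasks/main.yml
- name: Install x
  ansible.builtin.include_tasks: install.yml

- name: Configure x
  ansible.builtin.include_tasks: config.yml

- name: Set up monitoring for x
  ansible.builtin.include_tasks: monitor.yml
```

If the pieces must be selectable with --tags, note that tags on a dynamic include_tasks apply only to the include itself unless apply: is used; the static import_tasks passes tags down to the imported tasks.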

Recipe 3 — "Same 20 lines copy/pasted across three roles"

Pull the shared code out into a new role, or into a shared collection if you have one (see Ansible Collection).
Declare the new role as a dependency in the three consuming roles' meta/main.yml.
For tiny snippets (2-5 lines, not worth a role), use include_tasks: from a shared roles/common_snippets/tasks/*.yml.
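
Both options can be sketched as below. The role name common_tls and the task file deploy_user.yml are hypothetical; common_snippets is the shared role named above.

```yaml
# roles/app/meta/main.yml: declare the extracted role as a dependency
dependencies:
  - role: common_tls              # hypothetical name for the extracted role
---
# In a consuming role's tasks: pull one small task file from the shared role
- name: Ensure the deploy user exists
  ansible.builtin.include_role:
    name: common_snippets
    tasks_from: deploy_user.yml   # hypothetical file in roles/common_snippets/tasks/
```

include_role with tasks_from resolves the file through the normal role search path, so consumers do not need relative paths into another role's directory.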

Recipe 4 — "The `when:` chain is unreadable"

# Before
- name: Install on Debian-family
  ansible.builtin.apt: { name: foo, state: present }
  when: ansible_facts.os_family == "Debian" and not ansible_check_mode and foo_enabled | default(true) and inventory_hostname in groups['enabled_hosts']

# After: narrow scope with a block, factor out the real condition
- name: Install foo where it is enabled
  when: foo_enabled | default(true)
  block:
    - name: Install foo (Debian-family only)
      ansible.builtin.apt:
        name: foo
        state: present
      when: ansible_facts.os_family == "Debian"
  # "not ansible_check_mode" is dropped: apt supports check mode on its own.
  # Host gating is the inventory's job, not the play's: move it to hosts: enabled_hosts.

Recipe 5 — "We abuse `set_fact`"

Symptoms: set_fact calls scattered through tasks, later tasks reading facts that were set 200 lines up, multi-line Jinja in every task name. Fixes:

Compute values in vars: at play or role scope, not at task time.
Use vars_files: to load precomputed data.
Reserve set_fact for values that genuinely only become known at runtime (a package version discovered from a command, a remote random token).
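
A sketch of the split between derived values and runtime discoveries. All names and the --version command are assumptions.

```yaml
- name: Configure the application tier
  hosts: app
  vars:
    app_name: app
    # Derivable before the run: plain vars, evaluated lazily where they are used
    app_config_dir: "/etc/{{ app_name }}"
    app_config_file: "{{ app_config_dir }}/app.conf"
  tasks:
    - name: Read the installed app version
      ansible.builtin.command: /opt/app/bin/app --version   # hypothetical CLI
      register: app_version_cmd
      changed_when: false
      check_mode: false

    # Known only at runtime, so set_fact is appropriate here
    - name: Record the discovered version
      ansible.builtin.set_fact:
        app_installed_version: "{{ app_version_cmd.stdout | trim }}"
```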

Anti-patterns to grep for in code review

Anti-pattern	Why it's bad	Fix
`shell: "foo --bar {{ user_input }}"`	Command injection surface; non-idempotent	Use the module for `foo`, or at minimum `command:` with `argv:`
`command: true` / `command: "true"`	Noise; defeats changed detection	Delete the task, or `meta: noop`
`changed_when: false` on everything	Hides real drift, breaks check-mode	Fix idempotency; only use on genuine read-only tasks
`ignore_errors: true` without `assert` after	Failures invisible; CI green on red	Narrow with `failed_when:`, or block/rescue
`when: ansible_hostname == "web03"`	Host-specific logic in playbooks	Move to inventory groups + group_vars
`{{ lookup('env', 'SECRET') }}` outside CI	Leaks caller environment, non-portable	Vault, or a proper secrets backend (`hashi_vault` lookup)
`- include: foo.yml` (the bare `include` action)	Ambiguous static/dynamic behaviour; removed in ansible-core 2.16	`include_tasks` (dynamic) or `import_tasks` (static)
Hardcoded `10.0.0.5` IPs in tasks	Breaks the moment you re-IP or clone to staging	Vars for addresses; groups for discovery
Magic numbers (`worker_processes: 4`)	Why 4? Works until it doesn't	Express as `"{{ ansible_facts.processor_vcpus }}"` or document the reason inline
Roles that only exist to `include_role` other roles	Indirection for indirection's sake	Inline in the playbook, or merge the wrapper into the child
`become: true` at play level when only one role needs it	All tasks get sudo, expands blast radius	Move `become` to the role or block that needs it
Vault files outside the inventory tree	Breaks the "inventory-per-env" principle	`inventories/<env>/group_vars/<group>/vault.yml`
Handlers doing non-restart work	Side-effects that skip in check-mode	Convert to regular tasks with `when: some_result is changed` (a registered result)
`--extra-vars` committed to CI config	Highest-precedence vars surprise everyone	Put values in inventory; use `-e` only for hotfixes
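
For the first row, a sketch of the argv form when no module exists for foo. The changed_when marker is an assumption about what foo prints.

```yaml
- name: Run foo with untrusted input passed as a single argument
  ansible.builtin.command:
    argv:
      - foo
      - --bar
      - "{{ user_input }}"      # never passed through a shell, so no word splitting or expansion
  register: foo_result
  changed_when: "'updated' in foo_result.stdout"   # assumed output marker
```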

Review checklist

Paste this into your MR template and tick through it before asking for review.

Correctness

[ ] ansible-playbook --syntax-check passes
[ ] ansible-lint is clean (or skipped rules are documented)
[ ] ansible-playbook --check --diff against dev shows only the diff I expected
[ ] Running the playbook twice in a row reports zero changed tasks (idempotent)
[ ] The new code has at least one Molecule scenario or --check CI job asserting behaviour

Style

[ ] Every task has a name: that reads like a log line
[ ] Role vars are prefixed with the role name (e.g. nginx_*); private vars start with _
[ ] shell/command is only used where no module exists, with a comment explaining why
[ ] Vars in defaults/ are the documented interface; nothing meant to be overridden lives in vars/
[ ] No hardcoded IPs, hostnames, emails, or secrets

Safety

[ ] Secrets are in Vault and no_log: true is set on tasks that receive them
[ ] become is scoped to the smallest block that needs it
[ ] Destructive tasks have a rollback path (block/rescue, or a prior backup)
[ ] Play-level serial: / max_fail_percentage: suit the blast radius

Operability

[ ] Tags exist for each major chunk and the README documents them
[ ] The diff between dev and prod is just inventory/vars
[ ] A handler, not a regular task, is what restarts services
[ ] The MR description includes the --check --diff output against dev
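
The first three Correctness items lend themselves to automation. Below is a minimal sketch as a GitLab CI file, which is an assumption based on the MR wording; it also assumes the runner image already provides ansible and ansible-lint, and that Vault password handling is configured separately.

```yaml
# .gitlab-ci.yml (sketch)
stages:
  - lint
  - dry-run

lint:
  stage: lint
  script:
    - ansible-playbook -i inventories/dev site.yml --syntax-check
    - ansible-lint

check-diff:
  stage: dry-run
  script:
    - ansible-playbook -i inventories/dev site.yml --check --diff
```

The output of the check-diff job is what the last Operability item asks you to paste into the MR description.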