Ansible Best Practices & Refactoring
- Every task must be idempotent — a second run does nothing.
- Use the right module. `shell`/`command` is an escape hatch, not a default.
- Roles expose a small, well-named public interface in `defaults/`; anything else is private.
- Handlers exist for service restarts and only service restarts. Everything else is a task.
- Secrets live in Vault. The variable name they are referenced by does not.
- Run `--check --diff` on the same inventory your CI uses before you open the MR.
Naming conventions
Roles
- One role, one verb on one subject. `nginx`, `postgresql`, `ssh_hardening`. Not `web_and_db`, not `setup_everything`.
- Prefix variables with the role name. `nginx_`, `postgresql_`, `sshd_`. Stops collisions when two roles are in the same play.
- Private vars start with an underscore. `_nginx_package_name` lives in `vars/main.yml` and signals "do not override from outside".
- Public vars live in `defaults/main.yml`. That file IS your role's documented interface. If a var is not in `defaults/`, users should not be setting it.
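The public/private split above can be sketched as two files. This is a minimal illustration for a hypothetical `nginx` role; the variable names and values are assumptions, not taken from a real role.

```yaml
---
# roles/nginx/defaults/main.yml: public interface, safe to override
nginx_listen_port: 80            # illustrative value
nginx_worker_connections: 1024   # illustrative value

---
# roles/nginx/vars/main.yml: private constants, note the leading underscore
_nginx_package_name: nginx
_nginx_service_name: nginx
```

Anything a user might reasonably tune goes in the first file; anything that would break the role if changed goes in the second.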
Tasks, plays, handlers
- Every task has a `name:`. The name is a log line, so write it as one. Good: `Install nginx package`. Bad: `install`.
- Plays too. `- name: Configure the web tier`.
- Handlers are imperative. `reload nginx`, `restart postgresql`. Never `nginx reload handler`.
- No emoji, no "TODO", no "temporary" in names. Comments on tasks or commits carry that.
Tags
- Layer tags, not task tags. `base`, `db`, `app`, `edge`. Not `step5`.
- One role, one tag. If you need to run part of a role, split the role.
- Always: `always`. Tasks that must run (fact collection, version stamping) get `tags: always`.
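A short sketch of how layer tags and `always` might look in a play. The `deploy_version` variable and the `/etc/deploy-version` path are hypothetical, chosen only to illustrate a version-stamping task.

```yaml
- name: Configure the web tier
  hosts: web
  pre_tasks:
    - name: Record the deployed version
      ansible.builtin.copy:
        content: "{{ deploy_version }}\n"   # hypothetical variable
        dest: /etc/deploy-version           # hypothetical path
        mode: '0644'
      tags: always                          # runs whatever --tags selects
  roles:
    - role: nginx
      tags: edge                            # one role, one layer tag
```

With this layout, `ansible-playbook site.yml --tags edge` would run the `nginx` role plus the version stamp, unless `always` is explicitly skipped.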
Hosts
- Group by role, not by environment. `web`, `db`, `mail`. Environment is an inventory dir (`inventories/prod/`), not a group named `prod-web`.
- No per-host special-casing in playbooks. If `web03` needs something different, give it a group and put the delta in `group_vars`.
Layout and file placement
Where does this variable go?
| What it is | Where | Why |
|---|---|---|
| Default for a user-tunable role value | roles/x/defaults/main.yml | User-overridable, lowest precedence, documents the role's knobs. |
| Constant the user must not override | roles/x/vars/main.yml | High precedence; used for package names, service names, platform split. |
| Per-group value | inventory/group_vars/<group>/main.yml | Group scope is the first knob humans reach for. |
| Per-host value | inventory/host_vars/<host>.yml | Reserve for genuine per-host deltas (IPs, certs). |
| Site-wide for the whole inventory | inventory/group_vars/all/main.yml | Timezone, operator email, retention days. |
| Secrets | inventory/group_vars/<group>/vault.yml (vaulted) | Separate vault files per scope so you can rotate per-env passwords. |
| Ephemeral one-off | --extra-vars (-e) | Highest precedence; use for hotfix overrides, never for "permanent" config. |
group_vars/<group>/ as a directory
Split one giant YAML into topic files. Ansible reads all of them.
inventory/group_vars/web/
├── main.yml # common web defaults
├── tls.yml # cert paths, ciphers
├── performance.yml # worker processes, keepalive, timeouts
└── vault.yml # vaulted secrets
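One common way to keep a secret vaulted while leaving its variable name searchable in plain text is an indirection between `main.yml` and `vault.yml`. The sketch below assumes a `vault_` prefix convention and a hypothetical variable name; neither comes from this page.

```yaml
---
# inventory/group_vars/web/main.yml: plain text, greppable
web_tls_key_passphrase: "{{ vault_web_tls_key_passphrase }}"

---
# inventory/group_vars/web/vault.yml: encrypted with ansible-vault
# (shown decrypted here; the value is a placeholder)
vault_web_tls_key_passphrase: "CHANGE_ME"
```

Roles and templates reference only `web_tls_key_passphrase`, so a grep for the name finds where it is defined even though the value itself is encrypted.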
Inventory per environment
inventories/
├── dev/
│ ├── hosts.ini
│ └── group_vars/...
├── stage/
│ ├── hosts.ini
│ └── group_vars/...
└── prod/
    ├── hosts.ini
    └── group_vars/...
Deliberately avoid "branch per environment". Same code, different inventory — the diff between what dev runs and what prod runs is just the inventory directory.
One site.yml, many playbooks
Keep site.yml as the "everything" entrypoint. Per-layer entrypoints (db.yml, app.yml, proxy.yml) are nice but are optional shortcuts; --tags should already give you that selectivity.
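A minimal sketch of such a `site.yml`, assuming the per-layer playbooks `db.yml`, `app.yml`, and `proxy.yml` sit next to it:

```yaml
# site.yml: the "everything" entrypoint
- name: Database layer
  ansible.builtin.import_playbook: db.yml

- name: Application layer
  ansible.builtin.import_playbook: app.yml

- name: Proxy layer
  ansible.builtin.import_playbook: proxy.yml
```

Running `ansible-playbook site.yml` then covers every layer, while `ansible-playbook db.yml` remains available as a shortcut.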
Idempotency
Idempotency means a second run reports zero changes. It is non-negotiable because Ansible's safety model depends on it: check-mode, diff, and CI dry-runs all assume changed ≈ real drift.
Patterns that make tasks idempotent
# 1. Prefer real modules. file, copy, template, package, service, lineinfile,
#    blockinfile, replace, user, group, firewalld, systemd, mount are all idempotent.
- name: Render app config
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/app.conf
    mode: '0644'
    validate: 'app -t -c %s'
  notify: reload app

# 2. `creates:` / `removes:` for commands that set up durable state.
- name: Initialise the app database
  ansible.builtin.command: /opt/app/bin/init-db
  args:
    creates: /var/lib/app/.initialised

# 3. Explicit changed_when for scripts whose exit code doesn't indicate change.
- name: Check quota
  ansible.builtin.command: /opt/app/bin/check-quota
  register: quota
  changed_when: "'updated' in quota.stdout"
  failed_when: "quota.rc != 0 and 'ok' not in quota.stdout"

# 4. check_mode: false for "report only" tasks that should never mutate.
- name: Collect audit2allow report
  ansible.builtin.command: /usr/sbin/audit2allow -l
  check_mode: false     # always run, even in --check
  changed_when: false   # is pure read
  register: audit
When idempotency is genuinely hard
- External API (a SaaS, a cloud): check current state with a `GET`, act only if it differs. The `community.general` and cloud collections already do this, so use them.
- Migrations: wrap in a `creates:` flag file or a "migration table row exists" check.
- "Run once per deploy": use `run_once: true` + `delegate_to`, and combine with a flag file if it must also be once-ever.
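The migration and run-once cases above can be combined roughly as follows. This is a sketch under assumptions: the `/opt/app/bin/migrate` command, the flag path, and the choice of the first host in a `db` group are all hypothetical.

```yaml
- name: Run migrations once (skipped when the flag file exists)
  ansible.builtin.command:
    cmd: /opt/app/bin/migrate            # hypothetical command
    creates: /var/lib/app/.migrated      # hypothetical flag file
  run_once: true
  delegate_to: "{{ groups['db'] | first }}"

- name: Record that migrations ran
  ansible.builtin.copy:
    content: "migrated\n"
    dest: /var/lib/app/.migrated
    mode: '0644'
  run_once: true
  delegate_to: "{{ groups['db'] | first }}"
```

The second task exists because `creates:` only checks for the file; something still has to write it. Using `copy` with fixed content keeps that step idempotent as well.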
Error handling and failure budgets
block / rescue / always
- name: Swap the backend pool with rollback
  block:
    - name: Put the edge into drain mode
      ansible.builtin.command: /opt/edge/bin/drain
    - name: Swap the backend pool
      ansible.builtin.template:
        src: upstream.conf.j2
        dest: /etc/nginx/upstreams.conf
      notify: reload nginx
  rescue:
    - name: Roll back to the previous config
      ansible.builtin.copy:
        remote_src: true
        src: /etc/nginx/upstreams.conf.bak
        dest: /etc/nginx/upstreams.conf
      notify: reload nginx
    - name: Re-raise so CI fails
      ansible.builtin.fail:
        msg: "Edge swap failed, rolled back."
  always:
    - name: Take the edge out of drain mode
      ansible.builtin.command: /opt/edge/bin/undrain
Play-level safety valves
- name: Roll out the app tier
  hosts: app
  become: true
  serial: "25%"              # roll out 25% of hosts at a time
  max_fail_percentage: 10    # bail the play if >10% of a batch fails
  any_errors_fatal: false    # let successful hosts keep going
  roles:
    - app
Explicit failure conditions
- name: Check that nginx is active
  ansible.builtin.command: systemctl is-active --quiet nginx
  register: nginx_up
  failed_when: nginx_up.rc != 0
  changed_when: false

# Fail early on bad input
- name: Validate db_name
  ansible.builtin.assert:
    that:
      - db_name is defined
      - db_name is match('^[a-z_]+$')
    fail_msg: "db_name must be snake_case"
Avoid `ignore_errors: true` on normal tasks. Use `failed_when:` to narrow what counts as failure. Reserve `ignore_errors` for genuinely best-effort tasks (e.g. gathering debug output after a failure) and pair it with an `assert` later.
See also: Error Handling.
Performance
Fact gathering
# Only gather what you use
- hosts: web
  gather_facts: true
  gather_subset:
    - "!all"
    - "!min"
    - network
    - distribution
    - os_family
For big fleets, enable fact caching so you are not hitting every host every run:
# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = .facts-cache
fact_caching_timeout = 7200
Connection tuning
# ansible.cfg
[defaults]
forks = 30
# Disable host key checking only in trusted networks; prefer signed keys in prod.
host_key_checking = False
[ssh_connection]
pipelining = true
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o PreferredAuthentications=publickey
With OpenSSH ControlMaster and `pipelining = true`, a run on a large inventory can take 40-60% less wall time. Pipelining only works when `requiretty` is not enabled in sudoers on the managed hosts.
Strategy
# Default is 'linear': every host waits at each task.
# 'free' lets each host race to the end independently: it finishes faster, but logs interleave.
- hosts: many_hosts
  strategy: free
  tasks:
    - ...
async / poll for long tasks
- name: Run a slow backup
  ansible.builtin.command: /usr/local/bin/slow-backup
  async: 3600   # up to one hour
  poll: 0       # fire-and-forget
  register: backup_job

- name: Wait for backup to finish
  ansible.builtin.async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_status
  until: backup_status.finished
  retries: 120
  delay: 30
Avoid with_items over huge lists
Use `loop` on small collections. For hundreds of items, call a real module that accepts a list in a single invocation (e.g. `package` with a list of names), or build a single file from a template rather than looping a task.
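The difference can be sketched with the `package` module. `base_packages` is a hypothetical list variable, assumed to be defined in inventory or role defaults.

```yaml
# Slow: one module invocation per item
- name: Install tools one at a time
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop: "{{ base_packages }}"

# Faster: the whole list is handed to the package manager in one call
- name: Install tools in one call
  ansible.builtin.package:
    name: "{{ base_packages }}"
    state: present
```

Both forms are idempotent; the second simply avoids paying the per-task overhead once per item.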
Mitogen (pip install mitogen, strategy plugin) is another option and gives large (often 2–4x) speedups for CPU-bound plays. It is battle-tested but unofficial; pin the version and know what happens when you upgrade Ansible.
See also: Performance.
Security
- `no_log: true` on tasks whose diff or result contains secrets. Default on any task that receives a password variable.
- `become` scope. Don't set it play-wide; set it at the smallest scope that still works. Role-level is the usual right answer.
- Vault passwords are rotated and stored outside the repo. `~/.vault-pass-prod`, HashiCorp Vault, or your CI's secret store.
- Check in no SSH private keys. Ever. Use SSH certificates (SSH CA Certs) for short-lived credentials.
- Signed commits in the Ansible repo. Infra repos are high-value targets, so enable signed-commit and protected-branch requirements in Git for Infra.
- Pin collection versions in `requirements.yml`. Use `ansible-galaxy collection install -r requirements.yml` in CI.
- Disable module fallback to shell in your reviews: if a contributor's change makes an `ansible.builtin.shell` show up where a module existed, push back.
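Two of the points above, sketched under assumptions. The version numbers in `requirements.yml` are illustrative pins, not recommendations, and the database user and `vault_app_db_password` variable are hypothetical.

```yaml
# requirements.yml: pin exact versions (numbers are illustrative)
collections:
  - name: community.general
    version: "8.6.0"
  - name: community.postgresql
    version: "3.4.0"
```

```yaml
# A task that receives a secret: become and no_log scoped to the task itself
- name: Set the application database password
  community.postgresql.postgresql_user:
    name: app                                  # hypothetical DB user
    password: "{{ vault_app_db_password }}"    # hypothetical vaulted variable
  become: true
  become_user: postgres
  no_log: true
```

With `no_log: true`, a failure on this task hides its result in the output, so debugging may require temporarily disabling it in a safe environment.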
Refactoring recipes
Recipe 1 — "This playbook is 400 lines. How do I split it?"
- Draw a line under each group of tasks that ends in a handler or an `assert`. That's usually a role boundary.
- For each chunk, create `roles/<name>/tasks/main.yml` and move the tasks in.
- Move any `vars:` used only by that chunk into `roles/<name>/defaults/main.yml` (prefixed).
- Replace the original chunk in the playbook with `- { role: name }`.
- Run `--check --diff` against dev. Zero diff = successful refactor.
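A compressed before/after of these steps, assuming a hypothetical nginx chunk; the task contents and the `listen_port` variable are illustrative.

```yaml
# Before: tasks, vars, and handler inline in the playbook
- name: Configure the web tier
  hosts: web
  vars:
    listen_port: 8080
  tasks:
    - name: Install nginx package
      ansible.builtin.package:
        name: nginx
        state: present
    - name: Render nginx config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
      notify: reload nginx
  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded

# After: tasks and handler live under roles/nginx/, and the variable
# becomes nginx_listen_port in roles/nginx/defaults/main.yml
- name: Configure the web tier
  hosts: web
  roles:
    - { role: nginx }
```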
Recipe 2 — "This role does three things. How do I cleave it?"
- Identify the three subjects (install / configure / monitor is common).
- Either: split into three roles (`x_install`, `x_config`, `x_monitor`) with explicit dependencies in `meta/main.yml`; or: keep one role and split its `tasks/main.yml` into `install.yml`, `config.yml`, `monitor.yml` and `include_tasks` them.
- Prefer the multi-file-single-role approach first. Only split into separate roles once the pieces are being mixed-and-matched across plays.
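The multi-file-single-role option might look like this for a role named `x`; the task names are illustrative.

```yaml
# roles/x/tasks/main.yml
- name: Install x
  ansible.builtin.include_tasks: install.yml

- name: Configure x
  ansible.builtin.include_tasks: config.yml

- name: Set up monitoring for x
  ansible.builtin.include_tasks: monitor.yml
```

Each included file then holds only the tasks for its subject, which also makes a later split into separate roles mostly a matter of moving files.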
Recipe 3 — "Same 20 lines copy/pasted across three roles"
- Pull the shared code out into a new role, or a shared collection if you have one (Ansible Collection).
- Declare the new role as a dependency in the three consuming roles' `meta/main.yml`.
- For tiny snippets (2-5 lines, not worth a role), use `include_tasks:` from a shared `roles/common_snippets/tasks/*.yml`.
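Declaring the extracted role as a dependency could look like the following. `common_tls` is a hypothetical name for the new shared role.

```yaml
# roles/<consumer>/meta/main.yml, repeated in each of the three consuming roles
dependencies:
  - role: common_tls   # hypothetical shared role holding the extracted tasks
```

By default Ansible runs a dependency only once per play when it is listed with identical parameters, so the shared tasks should not repeat for each consumer.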
Recipe 4 — "The when: chain is unreadable"
# Before
- name: Install on Debian-family
  ansible.builtin.apt: { name: foo, state: present }
  when: ansible_facts.os_family == "Debian" and not ansible_check_mode and foo_enabled | default(true) and inventory_hostname in groups['enabled_hosts']

# After: narrow scope with a block, factor out the real condition
- name: Install foo (Debian-family only)
  when: foo_enabled | default(true)
  block:
    - name: Install foo package
      ansible.builtin.apt:
        name: foo
        state: present
      when: ansible_facts.os_family == "Debian"
# apt supports check mode itself, so the ansible_check_mode guard is dropped.
# Host gating is the inventory's job, not the play's: move to hosts: enabled_hosts.
Recipe 5 — "We abuse set_fact"
Symptoms: set_fact calls scattered through tasks, later tasks reading facts that were set 200 lines up, multi-line Jinja in every task name. Fixes:
- Compute values in `vars:` at play or role scope, not at task time.
- Use `vars_files:` to load precomputed data.
- Reserve `set_fact` for values that genuinely only become known at runtime (a package version discovered from a command, a remote random token).
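A sketch of the split between the two kinds of value. `app_name` and the `/opt/app/bin/app --version` command are assumptions made for the example.

```yaml
- name: Configure the app tier
  hosts: app
  vars:
    # Known before the run: compute at play scope, not with set_fact
    app_config_dir: "/etc/{{ app_name }}"     # app_name is hypothetical
  tasks:
    - name: Read the installed app version
      ansible.builtin.command: /opt/app/bin/app --version   # hypothetical command
      register: app_version_cmd
      changed_when: false
      check_mode: false

    # Only known at runtime: a legitimate use of set_fact
    - name: Remember the installed version
      ansible.builtin.set_fact:
        app_installed_version: "{{ app_version_cmd.stdout | trim }}"
```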
Anti-patterns to grep for in code review
| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| `shell: "foo --bar {{ user_input }}"` | Command injection surface; non-idempotent | Use the module for foo, or at minimum `command:` with `argv:` |
| `command: true` / `command: "true"` | Noise; defeats changed detection | Delete the task, or `meta: noop` |
| `changed_when: false` on everything | Hides real drift, breaks check-mode | Fix idempotency; only use on genuine read-only tasks |
| `ignore_errors: true` without `assert` after | Failures invisible; CI green on red | Narrow with `failed_when:`, or block/rescue |
| `when: ansible_hostname == "web03"` | Host-specific logic in playbooks | Move to inventory groups + `group_vars` |
| `{{ lookup('env', 'SECRET') }}` outside CI | Leaks caller environment, non-portable | Vault, or a proper secrets backend (`hashi_vault` lookup) |
| `- include: foo.yml` (the deprecated dynamic one) | Ambiguous; gone in modern Ansible | `include_tasks` (dynamic) or `import_tasks` (static) |
| Hardcoded `10.0.0.5` IPs in tasks | Breaks the moment you re-IP or clone to staging | Vars for addresses; groups for discovery |
| Magic numbers (`worker_processes: 4`) | Why 4? Works until it doesn't | Express as `"{{ ansible_processor_vcpus }}"` or document the reason inline |
| Roles that only exist to `include_role` other roles | Indirection for indirection's sake | Inline in the playbook, or merge the wrapper into the child |
| `become: true` at play level, single role needs it | All tasks get sudo, expands blast radius | Move `become` to the role block that needs it |
| Vault files outside the inventory tree | Breaks the "inventory-per-env" principle | `inventory/<env>/group_vars/<group>/vault.yml` |
| Handlers doing non-restart work | Side-effects that skip in check-mode | Convert to regular tasks with `when: some_fact.changed` |
| `--extra-vars` committed to CI config | Highest-precedence vars surprise everyone | Put values in inventory; use `-e` only for hotfixes |
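For the first row, when no dedicated module exists, the `command:` with `argv:` fallback might look like this. The `/usr/bin/foo` path is hypothetical.

```yaml
- name: Run foo with a user-supplied argument, no shell involved
  ansible.builtin.command:
    argv:
      - /usr/bin/foo          # hypothetical binary path
      - --bar
      - "{{ user_input }}"    # passed as a single argument, never parsed by a shell
```

Because `argv` bypasses the shell, metacharacters in `user_input` are not interpreted. The task still needs `creates:` or `changed_when:` to be idempotent.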
Review checklist
Paste this into your MR template and tick through it before asking for review.
Correctness
- [ ] `ansible-playbook --syntax-check` passes
- [ ] `ansible-lint` is clean (or skipped rules are documented)
- [ ] `ansible-playbook --check --diff` against dev shows only the diff I expected
- [ ] Running the playbook twice in a row reports zero changed tasks (idempotent)
- [ ] The new code has at least one Molecule scenario or `--check` CI job asserting behaviour
Style
- [ ] Every task has a `name:` that reads like a log line
- [ ] Role vars are `role_*`; private vars start with `_`
- [ ] `shell`/`command` is only used where no module exists, with a comment explaining why
- [ ] Vars in `defaults/` are the documented interface; nothing else overridable lives in `vars/`
- [ ] No hardcoded IPs, hostnames, emails, or secrets
Safety
- [ ] Secrets are in Vault and `no_log: true` is set on tasks that receive them
- [ ] `become` is scoped to the smallest block that needs it
- [ ] Destructive tasks have a rollback path (block/rescue, or a prior backup)
- [ ] Play-level `serial:` / `max_fail_percentage:` suit the blast radius
Operability
- [ ] Tags exist for each major chunk and the README documents them
- [ ] The diff between dev and prod is just inventory/vars
- [ ] A handler, not a regular task, is what restarts services
- [ ] The MR description includes the `--check --diff` output against dev