Ansible Error Handling
- Ansible's default is fail fast, per host. A host that fails stops receiving tasks; other hosts keep going.
- Never silence an error with ignore_errors: true. Narrow what counts as failure with failed_when: instead.
- Use block / rescue / always the way Python uses try / except / finally. rescue is for recovery, always is for cleanup, and fail in rescue re-raises.
- Pre-flight with ansible.builtin.assert. Catching a bad variable at task 1 is worth a dozen smart rescues at task 40.
- Pick exactly one of any_errors_fatal, max_fail_percentage, or the per-host default. Combining them confuses everyone.
The failure model
Every task is a function call: Ansible ships a module to the target, runs it, and gets back JSON containing at least changed, failed, and rc (for command-like modules). Ansible then evaluates, in this order:
1. failed_when: if present, overrides the module's own verdict.
2. changed_when: if present, overrides the module's changed.
3. ignore_errors: if true, a failure is logged but not treated as fatal for the host.
4. If failure is still fatal, the host is removed from the play's active list. All other hosts continue.
A failure inside a block: redirects to the rescue: section. A failure in rescue: is final unless further wrapped.
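The overrides described above can be seen on a single task. This is a minimal sketch, assuming a hypothetical script /usr/local/bin/sync-cache that exits 3 when there is nothing to do and prints "synced" when it changed something; neither the script nor its exit codes come from a real tool.

```yaml
- name: Run the sync (hypothetical script; rc 3 is assumed to mean "nothing to do")
  ansible.builtin.command: /usr/local/bin/sync-cache
  register: sync
  # Overrides the module's verdict: only rc values other than 0 and 3 count as failure.
  failed_when: sync.rc not in [0, 3]
  # Overrides the module's "changed", which command reports as true by default.
  changed_when: "'synced' in sync.stdout"

- name: Show the verdict Ansible ended up with
  ansible.builtin.debug:
    msg: "rc={{ sync.rc }} failed={{ sync.failed }} changed={{ sync.changed }}"
```

The registered result carries the final failed and changed values, so later tasks see the overridden verdict rather than the module's raw one.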
block / rescue / always
The structural primitive. block: groups tasks. rescue: runs if anything in the block fails. always: runs whether the block succeeded or not.
- name: Deploy the app
  block:
    - name: Put the host into drain mode on the LB
      ansible.builtin.uri:
        url: "https://lb.example.com/api/drain/{{ inventory_hostname }}"
        method: POST
        status_code: 200
      delegate_to: localhost
    - name: Stop the app
      ansible.builtin.systemd:
        name: myapp
        state: stopped
    - name: Install the new package
      ansible.builtin.package:
        name: "myapp-{{ app_version }}"
        state: present
    - name: Start the app
      ansible.builtin.systemd:
        name: myapp
        state: started
        enabled: true
    - name: Smoke test
      ansible.builtin.uri:
        url: http://localhost:8080/health
        status_code: 200
      register: smoke
      # Pair retries with until so the probe is actually retried.
      until: smoke.status == 200
      retries: 10
      delay: 2
  rescue:
    - name: Roll back to the previous package
      ansible.builtin.package:
        name: "myapp-{{ app_version_previous }}"
        state: present
    - name: Start the old app
      ansible.builtin.systemd:
        name: myapp
        state: started
    - name: Re-raise so CI is red
      ansible.builtin.fail:
        msg: "Deploy failed on {{ inventory_hostname }}, rolled back to {{ app_version_previous }}."
  always:
    - name: Take the host out of drain mode
      ansible.builtin.uri:
        url: "https://lb.example.com/api/undrain/{{ inventory_hostname }}"
        method: POST
        status_code: 200
      delegate_to: localhost
Inspecting the failure inside rescue:
Ansible sets ansible_failed_task and ansible_failed_result automatically:
rescue:
  - name: Show what blew up
    ansible.builtin.debug:
      msg: |
        Task: {{ ansible_failed_task.name }}
        Error: {{ ansible_failed_result.msg | default(ansible_failed_result) }}
Nesting blocks
Blocks nest. Use nesting when a recoverable step lives inside a bigger recoverable step:
- block:
    - block:
        - name: Inner thing that might fail harmlessly
          ansible.builtin.command: /usr/local/bin/try-stuff
      rescue:
        - name: Shrug and continue
          ansible.builtin.debug:
            msg: "try-stuff failed; continuing"
    - name: The real work
      ansible.builtin.command: /usr/local/bin/do-stuff
  rescue:
    - name: Full rollback on do-stuff failure
      ansible.builtin.command: /usr/local/bin/rollback
failed_when in depth
The module's own verdict is usually right (rc != 0 = failed). Override when:
- The command sets non-zero rc for a non-error (e.g. grep returns 1 on no match).
- A string in stdout is the real signal (some CLIs always return 0 and print ERROR:).
- You want to fail later than the module does (gather more data first).
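The third case, failing later than the module does, could look like the sketch below. The script /usr/local/bin/migrate and the log path are hypothetical names chosen for illustration.

```yaml
- name: Run the migration but do not fail yet (hypothetical script)
  ansible.builtin.command: /usr/local/bin/migrate
  register: migrate
  failed_when: false            # defer the verdict to the explicit fail task below
  changed_when: migrate.rc == 0

- name: Collect the last log lines while the evidence is still there
  ansible.builtin.command: tail -n 50 /var/log/migrate.log
  register: migrate_log
  changed_when: false
  when: migrate.rc != 0

- name: Fail now, with the log attached
  ansible.builtin.fail:
    msg: "migrate exited {{ migrate.rc }}:\n{{ migrate_log.stdout | default('(no log)') }}"
  when: migrate.rc != 0
```

The trade-off is that failed_when: false hides the failure from Ansible until the final task, so the explicit fail must never be skipped or tagged out.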
Non-zero rc that is not a failure
- name: Check whether a pattern exists (grep returns 1 when not found)
  ansible.builtin.command: grep -q pattern /etc/app.conf
  register: grep_out
  failed_when: grep_out.rc not in [0, 1]
  changed_when: false
Complex rc sets
- name: Run the tool; 0=ok, 2=warning, anything else=bad
  ansible.builtin.command: /usr/local/bin/check-something
  register: check
  failed_when: check.rc not in [0, 2]
  changed_when: check.rc == 0 and 'updated' in check.stdout
String-matching on stdout
- name: Push config
  ansible.builtin.command: /opt/vendor/bin/apply-config
  register: apply
  failed_when: >
    apply.rc != 0
    or 'ERROR' in apply.stdout
    or 'FATAL' in apply.stderr
  changed_when: "'applied' in apply.stdout"
Multi-line condition
For anything longer than one line, use a YAML list. The elements are ANDed, so the task fails only when every condition is true:
- name: Restart cluster node
  ansible.builtin.command: /usr/local/bin/cluster-restart
  register: restart
  failed_when:
    - restart.rc != 0
    - "'graceful shutdown in progress' not in restart.stderr"
    - not (ansible_check_mode and restart.rc == 42)
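failed_when: takes a bare Jinja expression (no {{ }} delimiters) that should evaluate to a boolean. YAML quoting does not change the expression, so failed_when: "result.rc != 0" and failed_when: result.rc != 0 behave the same. Watch for expressions that are always truthy, such as a quoted string literal ('ERROR' where 'ERROR' in result.stdout was meant): the task then fails every time.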
ignore_errors vs failed_when vs rescue
They look similar. They are not.
| Mechanism | What it does | When to use |
|---|---|---|
| failed_when: | Redefines what "failure" means for the task. | The module says failed but the condition isn't actually a failure (grep rc=1, exit code 2 from a tool that means "warning"). |
| ignore_errors: true | Still counts as failed, but does not stop the host. Result still shows up red in output. | Best-effort cleanup / diagnostic tasks. Almost always paired with a later assert or when: registered.rc == 0. |
| rescue: | Catches failure at a group scope and runs an alternate path. The block is considered handled, so further tasks in the play run normally. | Rollback, alternate strategy, compensating actions. The grown-up answer. |
| ignore_unreachable: true | Task that can't connect does not mark the host unreachable. | Reachability probes; tearing down hosts that are expected to have already gone away. |
Order of preference: assert / failed_when > rescue > ignore_errors.
ignore_errors: true on a task that you want to succeed is a lie told to your CI. Use failed_when: to describe the real condition, or wrap in block/rescue to handle the failure.
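For the legitimate best-effort case, ignore_errors reads more honestly when a later task checks the registered result. A minimal sketch, assuming a hypothetical diagnostic helper /usr/local/bin/myapp-threaddump and an arbitrary output path:

```yaml
- name: Grab a thread dump for diagnostics (best effort; hypothetical helper)
  ansible.builtin.command: /usr/local/bin/myapp-threaddump
  register: dump
  changed_when: false
  ignore_errors: true     # a missing dump should not stop the host

- name: Save the dump only if we actually got one
  ansible.builtin.copy:
    content: "{{ dump.stdout }}"
    dest: /var/tmp/myapp-threaddump.txt
    mode: "0600"
  when: dump is succeeded
```

Here the ignored failure is visible in the output and explicitly handled, instead of being silently assumed away.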
assert for pre-flight sanity
ansible.builtin.assert fails the task if any condition is false. It is the cheapest, loudest way to catch bad variables and impossible states before doing anything destructive.
- name: Pre-flight - required variables
  ansible.builtin.assert:
    that:
      - app_version is defined
      - app_version is string
      - app_version is match('^[0-9]+\.[0-9]+\.[0-9]+$')
      - target_env in ['dev', 'stage', 'prod']
      - inventory_hostname in groups['app']
    fail_msg: "Required variables missing or invalid. Got app_version={{ app_version | default('UNSET') }}, target_env={{ target_env | default('UNSET') }}"
    success_msg: "Pre-flight OK for {{ app_version }} on {{ target_env }}"
  tags: always
Run assert before everything else in the play. A failed assert stops the host immediately and nothing else runs.
A useful pattern is a roles/x/tasks/preflight.yml containing only assert tasks, pulled in with import_tasks: preflight.yml at the top of tasks/main.yml. Breaks early, breaks cheap.
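A sketch of that layout, using the roles/x paths from the text. The specific assertions and the placeholder task are assumptions for illustration. First, roles/x/tasks/preflight.yml:

```yaml
# roles/x/tasks/preflight.yml: assert tasks only
- name: Pre-flight checks on required variables
  ansible.builtin.assert:
    that:
      - app_version is defined
      - target_env is defined
    fail_msg: "app_version and target_env must both be set"
    quiet: true
```

Then roles/x/tasks/main.yml imports it before doing anything else:

```yaml
# roles/x/tasks/main.yml
- name: Run pre-flight checks first
  ansible.builtin.import_tasks: preflight.yml
  tags: always

- name: Rest of the role goes here (placeholder task)
  ansible.builtin.debug:
    msg: "inputs validated"
```

Tagging the import with always keeps the checks running even when the role is invoked with a narrower --tags selection.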
assert with a nice list output
- name: Validate deployment inputs
  ansible.builtin.assert:
    quiet: true # only print on failure
    that: "{{ item.condition }}"
    fail_msg: "{{ item.msg }}"
  loop:
    - { condition: "app_version is defined", msg: "app_version must be set" }
    - { condition: "app_port | int > 1024", msg: "app_port must be non-privileged" }
    - { condition: "db_host != inventory_hostname", msg: "do not colocate db and app" }
any_errors_fatal, max_fail_percentage, serial
These are play-level knobs that control how group failure propagates.
serial — batch size
Controls how many hosts run in parallel. Default is all of them (well, up to forks). Common values:
- hosts: app
  serial: 1 # one at a time (canary-style rollout)
  # serial: "25%" # percentage
  # serial: [1, 5, 10] # ramp: 1 host, then 5, then 10-at-a-time
  roles: [app]
If every host in a batch fails, the play stops and the remaining batches never run. With serial: 1 that means the first failure halts the rollout, so the blast radius is one batch, not the fleet.
max_fail_percentage
If more than X% of a batch fails, abort the whole play. Forces a "fleet minority can fail but don't burn the whole deploy" model:
- hosts: app
  serial: "25%"
  max_fail_percentage: 10
  roles: [app]
Reading that: roll out 25% at a time; if more than 10% of a 25% batch fails, stop everything.
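To make the threshold concrete, here is the same play annotated for an assumed inventory of 40 hosts in the app group. The host count is an assumption for illustration; the detail worth noticing is that the percentage has to be exceeded, not merely reached.

```yaml
- hosts: app                 # assume 40 hosts for this walk-through
  serial: "25%"              # 25% of 40 = 10 hosts per batch
  max_fail_percentage: 10    # abort once MORE than 10% of a batch has failed
  roles: [app]
  # In a batch of 10:
  #   1 failed host  = 10% -> not above the threshold, the play continues
  #   2 failed hosts = 20% -> above the threshold, the play aborts
```

If a single failure in a batch of 10 should already abort the play, the threshold would need to sit below 10, for example max_fail_percentage: 9.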
any_errors_fatal
If any host in the batch fails, Ansible finishes the current task on the other hosts in the batch and then stops the play for all of them. Useful when hosts coordinate (a cluster where you don't want a half-upgraded state):
- hosts: galera
  any_errors_fatal: true
  serial: 1
  roles: [galera_rolling_upgrade]
Combining them
You can set all three. Ansible picks the first one triggered:
- any_errors_fatal trumps the others at the first failure.
- max_fail_percentage kicks in once a batch's failures cross the threshold.
- Without either, the default is "fail this host, keep going".
ignore_errors sprinkled around is unreviewable. Pick the semantics you want at the top of the play and comment why.
meta: end_play / end_host / clear_host_errors
meta: is Ansible's escape hatch for controlling the executor.
| Directive | Effect | Use when |
|---|---|---|
| meta: end_play | Stop the whole play for all hosts immediately. No further tasks run. | A pre-flight check found something bad that affects the whole fleet (wrong environment, kill switch present). |
| meta: end_host | Stop the play for this host only. Others continue normally. | This host is not applicable (wrong OS family, feature flag off) and you want to short-circuit cleanly. |
| meta: end_batch | End the current serial: batch early. Next batch still runs. | A batch's canary succeeded; no need to keep poking the same batch. |
| meta: clear_host_errors | Un-fail every host that has previously failed in this play. They become active again. | After a successful rescue at the fleet level, to let subsequent tasks include the recovered hosts. |
| meta: flush_handlers | Run any pending handlers now, not at end of play. | Between two roles where the second depends on a service restart that the first triggered. |
| meta: refresh_inventory | Re-read the inventory sources. | After creating hosts at runtime (for example new cloud instances) so that a dynamic inventory, and pattern-matching against it, sees them. |
- name: Skip this host if not in canary wave
  ansible.builtin.meta: end_host
  when: inventory_hostname not in groups['canary']
- name: Abort the entire deploy if the kill switch is set
  ansible.builtin.meta: end_play
  when: lookup('env', 'DEPLOY_KILL_SWITCH') == '1'
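flush_handlers from the table is the directive most often needed mid-play. A minimal sketch, assuming a handler named "Restart myapp" is defined elsewhere in the play or role and that a template app.conf.j2 exists; both names are illustrative.

```yaml
- name: Update the app config (handler name and template are assumptions)
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/myapp/app.conf
    mode: "0644"
  notify: Restart myapp

- name: Run the pending restart now, not at the end of the play
  ansible.builtin.meta: flush_handlers

- name: Check that the restarted service answers
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
```

Without the flush, the health check would run against the old process, because the restart would still be queued.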
Example: deploy with rollback
A complete pattern you can copy. Deploys a new package, smoke-tests it, rolls back on failure, always removes the drain flag.
---
- name: Rolling deploy of myapp
  hosts: app
  become: true
  serial: "25%"
  max_fail_percentage: 10
  vars:
    app_version: "{{ version | mandatory }}"
  pre_tasks:
    - name: Pre-flight
      ansible.builtin.assert:
        that:
          - app_version is match('^[0-9]+\.[0-9]+\.[0-9]+$')
          - target_env is defined
        fail_msg: "Missing or invalid inputs"
      tags: always
    - name: Stash current version for rollback
      ansible.builtin.command: rpm -q --qf '%{version}-%{release}' myapp
      register: current_pkg
      changed_when: false
      check_mode: false
  tasks:
    - name: Deploy block
      block:
        - name: Drain from load balancer
          ansible.builtin.uri:
            url: "https://lb.example.com/api/drain/{{ inventory_hostname }}"
            method: POST
            status_code: 200
          delegate_to: localhost
        - name: Wait for connections to drain
          ansible.builtin.wait_for:
            host: "{{ ansible_host }}"
            port: 8080
            state: drained
            timeout: 60
        - name: Install new package
          ansible.builtin.package:
            name: "myapp-{{ app_version }}"
            state: present
        - name: Restart service
          ansible.builtin.systemd:
            name: myapp
            state: restarted
        - name: Health probe
          ansible.builtin.uri:
            url: http://localhost:8080/health
            status_code: 200
          register: health
          retries: 20
          delay: 3
          until: health.status == 200
        - name: Synthetic transaction
          ansible.builtin.uri:
            url: http://localhost:8080/selftest
            status_code: 200
            return_content: true
          register: selftest
          failed_when: "'OK' not in selftest.content"
      rescue:
        - name: Log what failed
          ansible.builtin.debug:
            msg: "ROLLBACK: {{ ansible_failed_task.name }} -> {{ ansible_failed_result.msg | default('(no message)') }}"
        - name: Reinstall previous package
          ansible.builtin.package:
            name: "myapp-{{ current_pkg.stdout }}"
            state: present
            allow_downgrade: true
        - name: Restart service
          ansible.builtin.systemd:
            name: myapp
            state: restarted
        - name: Verify rollback is healthy
          ansible.builtin.uri:
            url: http://localhost:8080/health
            status_code: 200
          register: rollback_health
          # Pair retries with until so the probe is actually retried.
          until: rollback_health.status == 200
          retries: 20
          delay: 3
        - name: Re-raise so CI fails
          ansible.builtin.fail:
            msg: "Deploy {{ app_version }} failed; rolled back to {{ current_pkg.stdout }}"
      always:
        - name: Undrain from load balancer
          ansible.builtin.uri:
            url: "https://lb.example.com/api/undrain/{{ inventory_hostname }}"
            method: POST
            status_code: 200
          delegate_to: localhost
Example: safely drop a service
Decommissioning is where rescue shines: you want partial tear-down with clean-up regardless.
- name: Decommission legacy queue
  hosts: queue_workers
  serial: 1
  become: true
  tasks:
    - name: Decommission block
      block:
        - name: Stop accepting new jobs
          ansible.builtin.command: /usr/local/bin/queue drain
          register: drain
          changed_when: "'drained' in drain.stdout"
        - name: Wait for in-flight jobs to finish
          ansible.builtin.command: /usr/local/bin/queue wait-idle --timeout 300
          changed_when: false
        - name: Stop the service
          ansible.builtin.systemd:
            name: queue-worker
            state: stopped
            enabled: false
        - name: Remove the package
          ansible.builtin.package:
            name: queue-worker
            state: absent
        - name: Remove config and state
          ansible.builtin.file:
            path: "{{ item }}"
            state: absent
          loop:
            - /etc/queue-worker/
            - /var/lib/queue-worker/
      rescue:
        - name: Restore from a partial teardown
          ansible.builtin.systemd:
            name: queue-worker
            state: started
          when: >
            'queue-worker' in ansible_facts.services | default({})
        - name: Re-raise so the failure is visible
          ansible.builtin.fail:
            msg: "Decommission failed on {{ inventory_hostname }}; left as-was."
      always:
        - name: Report to inventory system
          ansible.builtin.uri:
            url: "https://cmdb.example.com/api/host/{{ inventory_hostname }}/decommissioned"
            method: POST
          delegate_to: localhost
          ignore_errors: true # CMDB outage shouldn't fail the play
Common gotchas
| Gotcha | Why it happens | Fix |
|---|---|---|
| Handlers don't run after a failed task | By default, if a task failed on a host, notified handlers on that host are discarded at end-of-play. | Set force_handlers: true on the play, or meta: flush_handlers before the risky section. |
| rescue: never fires | The task was skipped (a when: was false) or the failure was unreachable rather than failed. | Use any_errors_fatal for unreachable propagation, or set ignore_unreachable: true and check register state. |
| always: did not run | Host became unreachable mid-block; Ansible cannot run more tasks on a dead host. | Put "always" work on localhost with delegate_to: localhost. |
| ignore_errors: true but the task still aborts the play | The error is unreachable, not failed. | Add ignore_unreachable: true as well. |
| failed_when: "rc != 0" errors out instead of checking the return code | rc is not a variable on its own; the return code lives on the registered result. The YAML quotes make no difference to how the expression is evaluated. | Register the result and reference it: failed_when: result.rc != 0. Do not wrap the expression in {{ }}. |
| assert doesn't print my message | fail_msg has a typo (older Ansible used msg:) or quiet: true suppresses success output. | Use fail_msg: on modern Ansible. success_msg: needs quiet: false to be seen. |
| Rescue succeeded, but the host is still "failed" | The rescue ran fail: at the end (to re-raise), which re-marks the host as failed. | If you want the host to continue, drop the fail:, but then CI won't know. The usual right answer is to re-raise and let CI be red. |
| any_errors_fatal stops hosts that did not fail | That's the documented behaviour: the failing task finishes on every host in the batch, then the play ends for all of them. | If you want to tolerate some failures before aborting, use max_fail_percentage with a non-zero threshold instead. |
| Registered variable undefined in rescue | An earlier task in the block failed, so the task that was supposed to register the variable never ran. | Use default() in the rescue, or register on a prior safe task. |
| end_host inside a block doesn't end the block cleanly | It skips remaining tasks but still runs always:. | That is usually what you want. If not, guard the always block with when:. |
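The unreachable-versus-failed distinction behind several of these rows can be sketched as follows. The group name retired and the service name legacy-agent are hypothetical.

```yaml
- name: Tear down hosts that may already be gone
  hosts: retired               # hypothetical group
  gather_facts: false          # fact gathering would itself trip on a dead host
  tasks:
    - name: Stop the agent if the host still answers (hypothetical service)
      ansible.builtin.systemd:
        name: legacy-agent
        state: stopped
      register: stop_agent
      ignore_unreachable: true   # covers "cannot connect"
      ignore_errors: true        # covers "connected, but the task failed"

    - name: Report hosts that could not be reached
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} was unreachable; nothing to stop"
      when: stop_agent.unreachable | default(false)
```

Setting both keywords is deliberate here: each one only covers its own kind of error, so a play that should survive either needs the pair.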
Related reading: Ansible Best Practices, Ansible Testing, Handlers & Templates, Ansible Deploy Flow.