Ansible Error Handling

How Ansible fails, how to catch and redirect failure, and how to design plays whose blast radius you actually control. Block/rescue, failed_when, assert, serial batching, meta directives, and real rollback examples.

Default stance

Ansible's default is fail fast, per host. A host that fails stops receiving tasks; other hosts keep going.
Never silence an error with ignore_errors: true. Narrow what counts as failure with failed_when: instead.
Use block / rescue / always the way Python uses try / except / finally. Rescue is for recovery, always is for cleanup, fail in rescue to re-raise.
Pre-flight with ansible.builtin.assert. Catching a bad variable at task 1 is worth a dozen smart rescues at task 40.
Pick exactly one of any_errors_fatal, max_fail_percentage, or per-host default. Combining them confuses everyone.

On this page

The failure model
block / rescue / always
failed_when in depth
ignore_errors vs failed_when vs rescue
assert for pre-flight
any_errors_fatal, max_fail_percentage, serial
meta: end_play / end_host / clear_host_errors
Example: deploy with rollback
Example: safely drop a service
Common gotchas

The failure model

Every task is a function call: Ansible ships a module to the target, runs it, and gets back JSON containing at least changed, failed, and rc (for command-like modules). Ansible then evaluates, in this order:

failed_when: if present, overrides the module's own verdict.
changed_when: if present, overrides the module's changed.
ignore_errors: if true, a failure is logged but not treated as fatal for the host.
If failure is still fatal, the host is removed from the play's active list. All other hosts continue.

A failure inside a block: redirects to the rescue: section. A failure in rescue: is final unless further wrapped.

block / rescue / always

The structural primitive. block: groups tasks. rescue: runs if anything in the block fails. always: runs whether the block succeeded or not.

- name: Deploy the app
  block:
    - name: Put the host into drain mode on the LB
      ansible.builtin.uri:
        url: "https://lb.example.com/api/drain/{{ inventory_hostname }}"
        method: POST
        status_code: 200
      delegate_to: localhost

    - name: Stop the app
      ansible.builtin.systemd:
        name: myapp
        state: stopped

    - name: Install the new package
      ansible.builtin.package:
        name: "myapp-{{ app_version }}"
        state: present

    - name: Start the app
      ansible.builtin.systemd:
        name: myapp
        state: started
        enabled: true

    - name: Smoke test
      ansible.builtin.uri:
        url: http://localhost:8080/health
        status_code: 200
      retries: 10
      delay: 2

  rescue:
    - name: Roll back to the previous package
      ansible.builtin.package:
        name: "myapp-{{ app_version_previous }}"
        state: present

    - name: Start the old app
      ansible.builtin.systemd:
        name: myapp
        state: started

    - name: Re-raise so CI is red
      ansible.builtin.fail:
        msg: "Deploy failed on {{ inventory_hostname }}, rolled back to {{ app_version_previous }}."

  always:
    - name: Take the host out of drain mode
      ansible.builtin.uri:
        url: "https://lb.example.com/api/undrain/{{ inventory_hostname }}"
        method: POST
        status_code: 200
      delegate_to: localhost

Inspecting the failure inside `rescue:`

Ansible sets ansible_failed_task and ansible_failed_result automatically:

rescue:
  - name: Show what blew up
    ansible.builtin.debug:
      msg: |
        Task: {{ ansible_failed_task.name }}
        Error: {{ ansible_failed_result.msg | default(ansible_failed_result) }}

Nesting blocks

Blocks nest. Use it when a recoverable step lives inside a bigger recoverable step:

- block:
    - block:
        - name: Inner thing that might fail harmlessly
          ansible.builtin.command: /usr/local/bin/try-stuff
      rescue:
        - name: Shrug and continue
          ansible.builtin.debug:
            msg: "try-stuff failed; continuing"

    - name: The real work
      ansible.builtin.command: /usr/local/bin/do-stuff

  rescue:
    - name: Full rollback on do-stuff failure
      ansible.builtin.command: /usr/local/bin/rollback

failed_when in depth

The module's own verdict is usually right (rc != 0 = failed). Override when:

The command sets non-zero rc for a non-error (e.g. grep returns 1 on no match).
A string in stdout is the real signal (some CLIs always return 0 and print ERROR:).
You want to fail later than the module does (gather more data first).

Non-zero rc that is not a failure

- name: Check whether a pattern exists (grep returns 1 when not found)
  ansible.builtin.command: grep -q pattern /etc/app.conf
  register: grep_out
  failed_when: grep_out.rc not in [0, 1]
  changed_when: false

Complex rc sets

- name: Run the tool; 0=ok, 2=warning, anything else=bad
  ansible.builtin.command: /usr/local/bin/check-something
  register: check
  failed_when: check.rc not in [0, 2]
  changed_when: check.rc == 0 and 'updated' in check.stdout

String-matching on stdout

- name: Push config
  ansible.builtin.command: /opt/vendor/bin/apply-config
  register: apply
  failed_when: >
    apply.rc != 0
    or 'ERROR' in apply.stdout
    or 'FATAL' in apply.stderr
  changed_when: "'applied' in apply.stdout"

Multi-line condition

For anything longer than one line, use a YAML list — each element is ANDed:

- name: Restart cluster node
  ansible.builtin.command: /usr/local/bin/cluster-restart
  register: restart
  failed_when:
    - restart.rc != 0
    - "'graceful shutdown in progress' not in restart.stderr"
    - not (ansible_check_mode and restart.rc == 42)

Gotcha: failed_when: evaluates a Jinja expression — it must return a boolean. A string failed_when: "foo" is truthy and will fail every time.

ignore_errors vs failed_when vs rescue

They look similar. They are not.

Mechanism	What it does	When to use
`failed_when:`	Redefines what "failure" means for the task.	The module says failed but the condition isn't actually a failure (grep rc=1, exit code 2 from a tool that means "warning").
`ignore_errors: true`	Still counts as failed, but does not stop the host. Result still shows up red in output.	Best-effort cleanup / diagnostic tasks. Almost always paired with a later `assert` or `when: registered.rc == 0`.
`rescue:`	Catches failure at a group scope and runs an alternate path. The block is considered handled — further tasks in the play run normally.	Rollback, alternate strategy, compensating actions. The grown-up answer.
`ignore_unreachable: true`	Task that can't connect does not mark the host unreachable.	Reachability probes; tearing down hosts that are expected to have already gone away.

Reach order of preference: assert / failed_when > rescue > ignore_errors.

Anti-pattern. ignore_errors: true on a task that you want to succeed is a lie told to your CI. Use failed_when: to describe the real condition, or wrap in block/rescue to handle the failure.

assert for pre-flight sanity

ansible.builtin.assert fails the task if any condition is false. It is the cheapest, loudest way to catch bad variables and impossible states before doing anything destructive.

- name: Pre-flight — required variables
  ansible.builtin.assert:
    that:
      - app_version is defined
      - app_version is string
      - app_version is match('^[0-9]+\.[0-9]+\.[0-9]+$')
      - target_env in ['dev', 'stage', 'prod']
      - inventory_hostname in groups['app']
    fail_msg: "Required variables missing or invalid. Got app_version={{ app_version | default('UNSET') }}, target_env={{ target_env | default('UNSET') }}"
    success_msg: "Pre-flight OK for {{ app_version }} on {{ target_env }}"
  tags: always

Run assert before everything else in the play. A failed assert stops the host immediately and nothing else runs.

Pattern: give every role a roles/x/tasks/preflight.yml containing only assert tasks, and import_tasks: preflight.yml at the top of tasks/main.yml. Breaks early, breaks cheap.

assert with a nice list output

- name: Validate deployment inputs
  ansible.builtin.assert:
    quiet: true       # only print on failure
    that: "{{ item.condition }}"
    fail_msg: "{{ item.msg }}"
  loop:
    - { condition: "app_version is defined", msg: "app_version must be set" }
    - { condition: "app_port | int > 1024", msg: "app_port must be non-privileged" }
    - { condition: "db_host != inventory_hostname", msg: "do not colocate db and app" }

any_errors_fatal, max_fail_percentage, serial

These are play-level knobs that control how group failure propagates.

`serial` — batch size

Controls how many hosts run in parallel. Default is all of them (well, up to forks). Common values:

- hosts: app
  serial: 1               # one at a time (canary-style rollout)
  # serial: "25%"         # percentage
  # serial: [1, 5, 10]    # ramp: 1 host, then 5, then 10-at-a-time
  roles: [app]

With a small serial, a failure on one batch means the remaining batches don't run — the blast radius is one batch, not the fleet.

`max_fail_percentage`

If more than X% of a batch fails, abort the whole play. Forces a "fleet minority can fail but don't burn the whole deploy" model:

- hosts: app
  serial: "25%"
  max_fail_percentage: 10
  roles: [app]

Reading that: roll out 25% at a time; if more than 10% of a 25% batch fails, stop everything.

`any_errors_fatal`

If any host in the batch fails, stop all hosts immediately, even mid-task. Useful when hosts coordinate (a cluster where you don't want a half-upgraded state):

- hosts: galera
  any_errors_fatal: true
  serial: 1
  roles: [galera_rolling_upgrade]

Combining them

You can set all three. Ansible picks the first one triggered:

any_errors_fatal trumps the others at the first failure.
max_fail_percentage kicks in once a batch's failures cross the threshold.
Without either, the default is "fail this host, keep going".

Pick one deliberately. A play that has all three plus ignore_errors sprinkled around is unreviewable. Pick the semantics you want at the top of the play and comment why.

meta: end_play / end_host / clear_host_errors

meta: is Ansible's escape hatch for controlling the executor.

Directive	Effect	Use when
`meta: end_play`	Stop the whole play for all hosts immediately. No further tasks run.	A pre-flight check found something bad that affects the whole fleet (wrong environment, kill switch present).
`meta: end_host`	Stop the play for this host only. Others continue normally.	This host is not applicable (wrong OS family, feature flag off) and you want to short-circuit cleanly.
`meta: end_batch`	End the current `serial:` batch early. Next batch still runs.	A batch's canary succeeded; no need to keep poking the same batch.
`meta: clear_host_errors`	Un-fail every host that has previously failed in this play. They become active again.	After a successful `rescue` at the fleet level, to let subsequent tasks include the recovered hosts.
`meta: flush_handlers`	Run any pending handlers now, not at end of play.	Between two roles where the second depends on a service restart that the first triggered.
`meta: refresh_inventory`	Re-read the inventory.	After adding hosts to a group at runtime (`add_host`) so that pattern-matching sees them.

- name: Skip this host if not in canary wave
  ansible.builtin.meta: end_host
  when: inventory_hostname not in groups['canary']

- name: Abort the entire deploy if the kill switch is set
  ansible.builtin.meta: end_play
  when: lookup('env', 'DEPLOY_KILL_SWITCH') == '1'

Example: deploy with rollback

A complete pattern you can copy. Deploys a new package, smoke-tests it, rolls back on failure, always removes the drain flag.

---
- name: Rolling deploy of myapp
  hosts: app
  become: true
  serial: "25%"
  max_fail_percentage: 10
  vars:
    app_version: "{{ version | mandatory }}"
  pre_tasks:
    - name: Pre-flight
      ansible.builtin.assert:
        that:
          - app_version is match('^[0-9]+\.[0-9]+\.[0-9]+$')
          - target_env is defined
        fail_msg: "Missing or invalid inputs"
      tags: always

    - name: Stash current version for rollback
      ansible.builtin.command: rpm -q --qf '%{version}-%{release}' myapp
      register: current_pkg
      changed_when: false
      check_mode: false

  tasks:
    - name: Deploy block
      block:
        - name: Drain from load balancer
          ansible.builtin.uri:
            url: "https://lb.example.com/api/drain/{{ inventory_hostname }}"
            method: POST
            status_code: 200
          delegate_to: localhost

        - name: Wait for connections to drain
          ansible.builtin.wait_for:
            host: "{{ ansible_host }}"
            port: 8080
            state: drained
            timeout: 60

        - name: Install new package
          ansible.builtin.package:
            name: "myapp-{{ app_version }}"
            state: present

        - name: Restart service
          ansible.builtin.systemd:
            name: myapp
            state: restarted

        - name: Health probe
          ansible.builtin.uri:
            url: http://localhost:8080/health
            status_code: 200
          register: health
          retries: 20
          delay: 3
          until: health.status == 200

        - name: Synthetic transaction
          ansible.builtin.uri:
            url: http://localhost:8080/selftest
            status_code: 200
            return_content: true
          register: selftest
          failed_when: "'OK' not in selftest.content"

      rescue:
        - name: Log what failed
          ansible.builtin.debug:
            msg: "ROLLBACK: {{ ansible_failed_task.name }} -> {{ ansible_failed_result.msg | default('(no message)') }}"

        - name: Reinstall previous package
          ansible.builtin.package:
            name: "myapp-{{ current_pkg.stdout }}"
            state: present
            allow_downgrade: true

        - name: Restart service
          ansible.builtin.systemd:
            name: myapp
            state: restarted

        - name: Verify rollback is healthy
          ansible.builtin.uri:
            url: http://localhost:8080/health
            status_code: 200
          retries: 20
          delay: 3

        - name: Re-raise so CI fails
          ansible.builtin.fail:
            msg: "Deploy {{ app_version }} failed; rolled back to {{ current_pkg.stdout }}"

      always:
        - name: Undrain from load balancer
          ansible.builtin.uri:
            url: "https://lb.example.com/api/undrain/{{ inventory_hostname }}"
            method: POST
            status_code: 200
          delegate_to: localhost

Example: safely drop a service

Decommissioning is where rescue shines: you want partial tear-down with clean-up regardless.

- name: Decommission legacy queue
  hosts: queue_workers
  serial: 1
  become: true
  tasks:
    - name: Decommission block
      block:
        - name: Stop accepting new jobs
          ansible.builtin.command: /usr/local/bin/queue drain
          register: drain
          changed_when: "'drained' in drain.stdout"

        - name: Wait for in-flight jobs to finish
          ansible.builtin.command: /usr/local/bin/queue wait-idle --timeout 300
          changed_when: false

        - name: Stop the service
          ansible.builtin.systemd:
            name: queue-worker
            state: stopped
            enabled: false

        - name: Remove the package
          ansible.builtin.package:
            name: queue-worker
            state: absent

        - name: Remove config and state
          ansible.builtin.file:
            path: "{{ item }}"
            state: absent
          loop:
            - /etc/queue-worker/
            - /var/lib/queue-worker/

      rescue:
        - name: Restore from a partial teardown
          ansible.builtin.systemd:
            name: queue-worker
            state: started
          when: >
            'queue-worker' in ansible_facts.services | default({})

        - ansible.builtin.fail:
            msg: "Decommission failed on {{ inventory_hostname }}; left as-was."

      always:
        - name: Report to inventory system
          ansible.builtin.uri:
            url: "https://cmdb.example.com/api/host/{{ inventory_hostname }}/decommissioned"
            method: POST
          delegate_to: localhost
          ignore_errors: true     # CMDB outage shouldn't fail the play

Common gotchas

Gotcha	Why it happens	Fix
Handlers don't run after a failed task	By default, if a task failed on a host, notified handlers on that host are discarded at end-of-play.	Set `force_handlers: true` on the play, or `meta: flush_handlers` before the risky section.
`rescue:` never fires	The task was skipped (a `when:` was false) or the failure was `unreachable` rather than failed.	Use `any_errors_fatal` for unreachable propagation, or set `ignore_unreachable: true` and check register state.
`always:` did not run	Host became unreachable mid-block; Ansible cannot run more tasks on a dead host.	Put "always" work on localhost with `delegate_to: localhost`.
`ignore_errors: true` but the task still aborts the play	The error is `unreachable`, not `failed`.	Add `ignore_unreachable: true` as well.
`failed_when: "rc != 0"` always fails	Quoted as a string containing no Jinja — Ansible evaluates it as a literal truthy string.	Remove the outer quotes, or use `failed_when: "{{ rc != 0 }}"`. Prefer a list: `failed_when: [rc != 0]`.
`assert` doesn't print my message	`fail_msg` has a typo (older Ansible used `msg:`) or `quiet: true` suppresses success output.	Use `fail_msg:` on modern Ansible. `success_msg:` needs `quiet: false` to be seen.
Rescue succeeded, but the host is still "failed"	The rescue ran `fail:` at the end (to re-raise), which re-marks the host as failed.	If you want the host to continue, drop the `fail:` — but then CI won't know. The usual right answer is to re-raise and let CI be red.
`any_errors_fatal` stops other hosts mid-task	That's the documented behaviour.	If you want "fail remaining tasks but let the current task finish on the other hosts", use `max_fail_percentage: 0` instead.
Registered variable undefined in rescue	The task that was supposed to register it failed before it could register.	Use `default()` in the rescue, or register on a prior safe task.
`end_host` inside a block doesn't end the block cleanly	It skips remaining tasks but still runs `always:`.	That is usually what you want. If not, guard the always block with `when:`.

Ansible Error Handling

The failure model

block / rescue / always

Inspecting the failure inside rescue:

Nesting blocks

failed_when in depth

Non-zero rc that is not a failure

Complex rc sets

String-matching on stdout

Multi-line condition

ignore_errors vs failed_when vs rescue

assert for pre-flight sanity

assert with a nice list output

any_errors_fatal, max_fail_percentage, serial

serial — batch size

max_fail_percentage

any_errors_fatal

Combining them

meta: end_play / end_host / clear_host_errors

Example: deploy with rollback

Example: safely drop a service

Common gotchas

Inspecting the failure inside `rescue:`

`serial` — batch size

`max_fail_percentage`

`any_errors_fatal`