Ansible Error Handling

How Ansible fails, how to catch and redirect failure, and how to design plays whose blast radius you actually control. Block/rescue, failed_when, assert, serial batching, meta directives, and real rollback examples.

Default stance
  • Ansible's default is fail fast, per host. A host that fails stops receiving tasks; other hosts keep going.
  • Don't silence an error with ignore_errors: true when you need the task to succeed. Narrow what counts as failure with failed_when: instead.
  • Use block / rescue / always the way Python uses try / except / finally. Rescue is for recovery, always is for cleanup, fail in rescue to re-raise.
  • Pre-flight with ansible.builtin.assert. Catching a bad variable at task 1 is worth a dozen smart rescues at task 40.
  • Pick exactly one of any_errors_fatal, max_fail_percentage, or per-host default. Combining them confuses everyone.

The failure model

Every task is a function call: Ansible ships a module to the target, runs it, and gets back JSON that typically includes changed, failed, and, for command-like modules, rc. Ansible then evaluates, in this order:

  1. failed_when: if present, overrides the module's own verdict.
  2. changed_when: if present, overrides the module's changed.
  3. ignore_errors: if true, a failure is logged but not treated as fatal for the host.
  4. If failure is still fatal, the host is removed from the play's active list. All other hosts continue.

A failure inside a block: redirects to the rescue: section. A failure in rescue: is final unless further wrapped.
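
As a rough illustration of that order, the sketch below overrides both verdicts on a single task. The script path /usr/local/bin/sync-cache and its exit codes are assumptions made up for this example, not part of any real tool.

```yaml
- name: Sync the cache (hypothetical script; rc 3 is assumed to mean "nothing to do")
  ansible.builtin.command: /usr/local/bin/sync-cache
  register: sync
  # Overrides the module's verdict: only rc values other than 0 and 3 count as failure.
  failed_when: sync.rc not in [0, 3]
  # Overrides the module's "changed", which command would otherwise always report.
  changed_when: sync.rc == 0
  # Only matters if the task still counts as failed after failed_when.
  ignore_errors: true

- name: Show the verdict Ansible recorded
  ansible.builtin.debug:
    msg: "rc={{ sync.rc | default('n/a') }} failed={{ sync.failed }} changed={{ sync.changed }}"
```

ignore_errors appears here only to show where it sits in the sequence; in a real play you would normally drop it once failed_when describes the failure precisely.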

block / rescue / always

The structural primitive. block: groups tasks. rescue: runs if anything in the block fails. always: runs whether the block succeeded or not.

- name: Deploy the app
  block:
    - name: Put the host into drain mode on the LB
      ansible.builtin.uri:
        url: "https://lb.example.com/api/drain/{{ inventory_hostname }}"
        method: POST
        status_code: 200
      delegate_to: localhost

    - name: Stop the app
      ansible.builtin.systemd:
        name: myapp
        state: stopped

    - name: Install the new package
      ansible.builtin.package:
        name: "myapp-{{ app_version }}"
        state: present

    - name: Start the app
      ansible.builtin.systemd:
        name: myapp
        state: started
        enabled: true

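    - name: Smoke test
      ansible.builtin.uri:
        url: http://localhost:8080/health
        status_code: 200
      register: smoke
      # Explicit until: older Ansible releases ignore retries without it.
      until: smoke.status == 200
      retries: 10
      delay: 2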

  rescue:
    - name: Roll back to the previous package
      ansible.builtin.package:
        name: "myapp-{{ app_version_previous }}"
        state: present

    - name: Start the old app
      ansible.builtin.systemd:
        name: myapp
        state: started

    - name: Re-raise so CI is red
      ansible.builtin.fail:
        msg: "Deploy failed on {{ inventory_hostname }}, rolled back to {{ app_version_previous }}."

  always:
    - name: Take the host out of drain mode
      ansible.builtin.uri:
        url: "https://lb.example.com/api/undrain/{{ inventory_hostname }}"
        method: POST
        status_code: 200
      delegate_to: localhost

Inspecting the failure inside rescue:

Ansible sets ansible_failed_task and ansible_failed_result automatically:

rescue:
  - name: Show what blew up
    ansible.builtin.debug:
      msg: |
        Task: {{ ansible_failed_task.name }}
        Error: {{ ansible_failed_result.msg | default(ansible_failed_result) }}

Nesting blocks

Blocks nest. Nest them when a recoverable step lives inside a bigger recoverable step:

- block:
    - block:
        - name: Inner thing that might fail harmlessly
          ansible.builtin.command: /usr/local/bin/try-stuff
      rescue:
        - name: Shrug and continue
          ansible.builtin.debug:
            msg: "try-stuff failed; continuing"

    - name: The real work
      ansible.builtin.command: /usr/local/bin/do-stuff

  rescue:
    - name: Full rollback on do-stuff failure
      ansible.builtin.command: /usr/local/bin/rollback

failed_when in depth

The module's own verdict is usually right (rc != 0 = failed). Override when:

Non-zero rc that is not a failure

- name: Check whether a pattern exists (grep returns 1 when not found)
  ansible.builtin.command: grep -q pattern /etc/app.conf
  register: grep_out
  failed_when: grep_out.rc not in [0, 1]
  changed_when: false

Complex rc sets

- name: Run the tool; 0=ok, 2=warning, anything else=bad
  ansible.builtin.command: /usr/local/bin/check-something
  register: check
  failed_when: check.rc not in [0, 2]
  changed_when: check.rc == 0 and 'updated' in check.stdout

String-matching on stdout

- name: Push config
  ansible.builtin.command: /opt/vendor/bin/apply-config
  register: apply
  failed_when: >
    apply.rc != 0
    or 'ERROR' in apply.stdout
    or 'FATAL' in apply.stderr
  changed_when: "'applied' in apply.stdout"

Multi-line condition

For anything longer than one line, use a YAML list — each element is ANDed:

- name: Restart cluster node
  ansible.builtin.command: /usr/local/bin/cluster-restart
  register: restart
  failed_when:
    - restart.rc != 0
    - "'graceful shutdown in progress' not in restart.stderr"
    - not (ansible_check_mode and restart.rc == 42)
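
Gotcha: failed_when: is always evaluated as a Jinja expression, quoted or not, and it must return a boolean. failed_when: "foo" does not test a literal string: it looks up a variable named foo and fails with an undefined-variable error if there isn't one. To match text, test a registered result, e.g. failed_when: "'foo' in result.stdout".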

ignore_errors vs failed_when vs rescue

They look similar. They are not.

Mechanism | What it does | When to use
failed_when: | Redefines what "failure" means for the task. | The module says failed but the condition isn't actually a failure (grep rc=1, exit code 2 from a tool that means "warning").
ignore_errors: true | Still counts as failed, but does not stop the host. Result still shows up red in output. | Best-effort cleanup / diagnostic tasks. Almost always paired with a later assert or when: registered.rc == 0.
rescue: | Catches failure at a group scope and runs an alternate path. The block is considered handled; further tasks in the play run normally. | Rollback, alternate strategy, compensating actions. The grown-up answer.
ignore_unreachable: true | Task that can't connect does not mark the host unreachable. | Reachability probes; tearing down hosts that are expected to have already gone away.

Reach for them in this order of preference: assert / failed_when, then rescue, then ignore_errors.

Anti-pattern. ignore_errors: true on a task that you want to succeed is a lie told to your CI. Use failed_when: to describe the real condition, or wrap in block/rescue to handle the failure.
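
A sketch of that swap, assuming a PostgreSQL createdb call and a database named myapp (both are illustrative choices, and the 'already exists' match assumes the tool's usual error wording). Instead of ignoring every error, the task names the one outcome that is acceptable:

```yaml
# Before: hides every failure, including the ones you care about.
# - name: Create the app database
#   ansible.builtin.command: createdb myapp
#   ignore_errors: true

# After: only "already exists" is tolerated; any other error still fails the host.
- name: Create the app database
  ansible.builtin.command: createdb myapp
  register: createdb
  failed_when:
    - createdb.rc != 0
    - "'already exists' not in createdb.stderr"
  # Report a change only when the database was actually created.
  changed_when: createdb.rc == 0
```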

assert for pre-flight sanity

ansible.builtin.assert fails the task if any condition is false. It is the cheapest, loudest way to catch bad variables and impossible states before doing anything destructive.

- name: Pre-flight — required variables
  ansible.builtin.assert:
    that:
      - app_version is defined
      - app_version is string
      - app_version is match('^[0-9]+\.[0-9]+\.[0-9]+$')
      - target_env in ['dev', 'stage', 'prod']
      - inventory_hostname in groups['app']
    fail_msg: "Required variables missing or invalid. Got app_version={{ app_version | default('UNSET') }}, target_env={{ target_env | default('UNSET') }}"
    success_msg: "Pre-flight OK for {{ app_version }} on {{ target_env }}"
  tags: always

Run assert before everything else in the play. A failed assert stops the host immediately and nothing else runs.

Pattern: give every role a roles/x/tasks/preflight.yml containing only assert tasks, and import_tasks: preflight.yml at the top of tasks/main.yml. Breaks early, breaks cheap.
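
A minimal sketch of that layout, assuming a role named app (the role name and the install task are placeholders). The second file is shown as comments because it lives in its own file:

```yaml
# roles/app/tasks/main.yml
- name: Run pre-flight checks before anything else
  ansible.builtin.import_tasks: preflight.yml
  tags: always

- name: Install the application package
  ansible.builtin.package:
    name: "myapp-{{ app_version }}"
    state: present

# roles/app/tasks/preflight.yml (separate file, assert tasks only)
# - name: Required variables are set and sane
#   ansible.builtin.assert:
#     that:
#       - app_version is defined
#       - target_env in ['dev', 'stage', 'prod']
#     fail_msg: "app_version or target_env is missing or invalid"
```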

assert with a nice list output

- name: Validate deployment inputs
  ansible.builtin.assert:
    quiet: true       # only print on failure
    that: "{{ item.condition }}"
    fail_msg: "{{ item.msg }}"
  loop:
    - { condition: "app_version is defined", msg: "app_version must be set" }
    - { condition: "app_port | int >= 1024", msg: "app_port must be non-privileged" }
    - { condition: "db_host != inventory_hostname", msg: "do not colocate db and app" }

any_errors_fatal, max_fail_percentage, serial

These are play-level knobs that control how group failure propagates.

serial — batch size

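Controls how many hosts go through the play at a time: each batch finishes the whole play before the next batch starts. Default is all hosts in one batch (with at most forks of them running a task concurrently). Common values: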

- hosts: app
  serial: 1               # one at a time (canary-style rollout)
  # serial: "25%"         # percentage
  # serial: [1, 5, 10]    # ramp: 1 host, then 5, then 10-at-a-time
  roles: [app]

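With serial: 1, a failed host is its whole batch, so the play aborts and the remaining batches don't run: the blast radius is one batch, not the fleet. With larger batches, the play by default only aborts when every host in a batch fails, so pair serial with max_fail_percentage or any_errors_fatal to stop on a partial failure.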

max_fail_percentage

If more than X% of a batch fails, abort the whole play. Forces a "fleet minority can fail but don't burn the whole deploy" model:

- hosts: app
  serial: "25%"
  max_fail_percentage: 10
  roles: [app]

Reading that: roll out 25% at a time; if more than 10% of a 25% batch fails, stop everything.
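
A worked version of that arithmetic, assuming a group of 40 hosts (the group size is made up for illustration). The threshold has to be exceeded, not merely reached:

```yaml
- hosts: app                # assume 40 hosts in this group
  serial: "25%"             # 25% of 40 = batches of 10 hosts
  max_fail_percentage: 10   # abort once MORE than 10% of a batch has failed
  roles: [app]
  # 1 failed host in a batch of 10 = 10%: not exceeded, the next batch runs.
  # 2 failed hosts in a batch of 10 = 20%: exceeded, the play stops.
```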

any_errors_fatal

If any host in the batch fails a task, Ansible finishes that task on the other hosts in the batch and then stops the play for all hosts. Useful when hosts coordinate (a cluster where you don't want a half-upgraded state):

- hosts: galera
  any_errors_fatal: true
  serial: 1
  roles: [galera_rolling_upgrade]

Combining them

You can set all three on one play; whichever condition is met first stops it.

Pick one deliberately. A play that has all three plus ignore_errors sprinkled around is unreviewable. Pick the semantics you want at the top of the play and comment why.
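
One way to follow that advice is to state the failure policy in a comment right next to the keywords that implement it. The sketch below reuses the app group from the earlier examples; the reasoning in the comment is illustrative:

```yaml
- name: Rolling upgrade of the app tier
  hosts: app
  # Failure policy: batches of 25%, abort the play once more than 10% of a
  # batch has failed. any_errors_fatal is deliberately not set, because app
  # hosts are independent and one bad host should not halt its whole batch.
  serial: "25%"
  max_fail_percentage: 10
  roles: [app]
```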

meta: end_play / end_host / clear_host_errors

meta: is Ansible's escape hatch for controlling the executor.

Directive | Effect | Use when
meta: end_play | Stop the whole play for all hosts immediately. No further tasks run. | A pre-flight check found something bad that affects the whole fleet (wrong environment, kill switch present).
meta: end_host | Stop the play for this host only. Others continue normally. | This host is not applicable (wrong OS family, feature flag off) and you want to short-circuit cleanly.
meta: end_batch | End the current serial: batch early. Next batch still runs. | A batch's canary succeeded; no need to keep poking the same batch.
meta: clear_host_errors | Un-fail every host that has previously failed in this play. They become active again. | After a successful rescue at the fleet level, to let subsequent tasks include the recovered hosts.
meta: flush_handlers | Run any pending handlers now, not at end of play. | Between two roles where the second depends on a service restart that the first triggered.
meta: refresh_inventory | Re-read the inventory. | After adding hosts to a group at runtime (add_host) so that pattern-matching sees them.

- name: Skip this host if not in canary wave
  ansible.builtin.meta: end_host
  when: inventory_hostname not in groups['canary']

- name: Abort the entire deploy if the kill switch is set
  ansible.builtin.meta: end_play
  when: lookup('env', 'DEPLOY_KILL_SWITCH') == '1'
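
flush_handlers is the directive most often needed mid-play. A minimal sketch, assuming an nginx host group named web and a template called site.conf.j2 (both hypothetical), where a later task depends on a reload that an earlier task only notified:

```yaml
- name: Configure and verify nginx
  hosts: web
  become: true
  tasks:
    - name: Render the site config (template name is an assumption)
      ansible.builtin.template:
        src: site.conf.j2
        dest: /etc/nginx/conf.d/site.conf
      notify: Reload nginx

    - name: Run pending handlers now instead of at the end of the play
      ansible.builtin.meta: flush_handlers

    - name: Probe the site, which needs the reloaded config
      ansible.builtin.uri:
        url: http://localhost/
        status_code: 200

  handlers:
    - name: Reload nginx
      ansible.builtin.systemd:
        name: nginx
        state: reloaded
```

Without the flush, the probe would run against the old configuration, because handlers normally fire only after all tasks in the play have finished.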

Example: deploy with rollback

A complete pattern you can copy. Deploys a new package, smoke-tests it, rolls back on failure, always removes the drain flag.

---
- name: Rolling deploy of myapp
  hosts: app
  become: true
  serial: "25%"
  max_fail_percentage: 10
  vars:
    app_version: "{{ version | mandatory }}"
  pre_tasks:
    - name: Pre-flight
      ansible.builtin.assert:
        that:
          - app_version is match('^[0-9]+\.[0-9]+\.[0-9]+$')
          - target_env is defined
        fail_msg: "Missing or invalid inputs"
      tags: always

    - name: Stash current version for rollback
      ansible.builtin.command: rpm -q --qf '%{version}-%{release}' myapp
      register: current_pkg
      changed_when: false
      check_mode: false

  tasks:
    - name: Deploy block
      block:
        - name: Drain from load balancer
          ansible.builtin.uri:
            url: "https://lb.example.com/api/drain/{{ inventory_hostname }}"
            method: POST
            status_code: 200
          delegate_to: localhost

        - name: Wait for connections to drain
          ansible.builtin.wait_for:
            host: "{{ ansible_host }}"
            port: 8080
            state: drained
            timeout: 60

        - name: Install new package
          ansible.builtin.package:
            name: "myapp-{{ app_version }}"
            state: present

        - name: Restart service
          ansible.builtin.systemd:
            name: myapp
            state: restarted

        - name: Health probe
          ansible.builtin.uri:
            url: http://localhost:8080/health
            status_code: 200
          register: health
          retries: 20
          delay: 3
          until: health.status == 200

        - name: Synthetic transaction
          ansible.builtin.uri:
            url: http://localhost:8080/selftest
            status_code: 200
            return_content: true
          register: selftest
          failed_when: selftest.status != 200 or 'OK' not in (selftest.content | default(''))

      rescue:
        - name: Log what failed
          ansible.builtin.debug:
            msg: "ROLLBACK: {{ ansible_failed_task.name }} -> {{ ansible_failed_result.msg | default('(no message)') }}"

        - name: Reinstall previous package
          ansible.builtin.package:
            name: "myapp-{{ current_pkg.stdout }}"
            state: present
            allow_downgrade: true

        - name: Restart service
          ansible.builtin.systemd:
            name: myapp
            state: restarted

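        - name: Verify rollback is healthy
          ansible.builtin.uri:
            url: http://localhost:8080/health
            status_code: 200
          register: rollback_health
          # Explicit until: older Ansible releases ignore retries without it.
          until: rollback_health.status == 200
          retries: 20
          delay: 3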

        - name: Re-raise so CI fails
          ansible.builtin.fail:
            msg: "Deploy {{ app_version }} failed; rolled back to {{ current_pkg.stdout }}"

      always:
        - name: Undrain from load balancer
          ansible.builtin.uri:
            url: "https://lb.example.com/api/undrain/{{ inventory_hostname }}"
            method: POST
            status_code: 200
          delegate_to: localhost

Example: safely drop a service

Decommissioning is where rescue shines: you want partial tear-down with clean-up regardless.

- name: Decommission legacy queue
  hosts: queue_workers
  serial: 1
  become: true
  tasks:
    - name: Decommission block
      block:
        - name: Stop accepting new jobs
          ansible.builtin.command: /usr/local/bin/queue drain
          register: drain
          changed_when: "'drained' in drain.stdout"

        - name: Wait for in-flight jobs to finish
          ansible.builtin.command: /usr/local/bin/queue wait-idle --timeout 300
          changed_when: false

        - name: Stop the service
          ansible.builtin.systemd:
            name: queue-worker
            state: stopped
            enabled: false

        - name: Remove the package
          ansible.builtin.package:
            name: queue-worker
            state: absent

        - name: Remove config and state
          ansible.builtin.file:
            path: "{{ item }}"
            state: absent
          loop:
            - /etc/queue-worker/
            - /var/lib/queue-worker/

      rescue:
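        - name: Gather service facts so ansible_facts.services is populated
          ansible.builtin.service_facts:

        - name: Restore from a partial teardown
          ansible.builtin.systemd:
            name: queue-worker
            state: started
          # service_facts keys systemd units with their .service suffix
          when: "'queue-worker.service' in (ansible_facts.services | default({}))"

        - name: Re-raise so the decommission is reported as failed
          ansible.builtin.fail:
            msg: "Decommission failed on {{ inventory_hostname }}; left as-was."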

      always:
        - name: Report to inventory system
          ansible.builtin.uri:
            url: "https://cmdb.example.com/api/host/{{ inventory_hostname }}/decommissioned"
            method: POST
          delegate_to: localhost
          ignore_errors: true     # CMDB outage shouldn't fail the play

Common gotchas

Gotcha | Why it happens | Fix
Handlers don't run after a failed task | By default, if a task failed on a host, notified handlers on that host are discarded at end-of-play. | Set force_handlers: true on the play, or meta: flush_handlers before the risky section.
rescue: never fires | The task was skipped (a when: was false) or the failure was unreachable rather than failed. | Use any_errors_fatal for unreachable propagation, or set ignore_unreachable: true and check register state.
always: did not run | Host became unreachable mid-block; Ansible cannot run more tasks on a dead host. | Put "always" work on localhost with delegate_to: localhost.
ignore_errors: true but the task still aborts the play | The error is unreachable, not failed. | Add ignore_unreachable: true as well.
failed_when: "rc != 0" always fails | Quoting is not the problem: failed_when: is evaluated as a Jinja expression whether quoted or not. A bare rc is not a variable, so the expression references an undefined name and the task errors out. | Register the result and test that: failed_when: result.rc != 0. Don't wrap the condition in {{ }}; Ansible warns about template delimiters in conditionals.
assert doesn't print my message | fail_msg has a typo (older Ansible used msg:) or quiet: true suppresses success output. | Use fail_msg: on modern Ansible. success_msg: needs quiet: false to be seen.
Rescue succeeded, but the host is still "failed" | The rescue ran fail: at the end (to re-raise), which re-marks the host as failed. | If you want the host to continue, drop the fail:, but then CI won't know. The usual right answer is to re-raise and let CI be red.
any_errors_fatal stops hosts that did not fail | That's the documented behaviour: Ansible finishes the failing task on every host in the current batch, then ends the play for all of them. | If healthy hosts should keep going, drop any_errors_fatal and set a threshold with max_fail_percentage instead.
Registered variable undefined in rescue | The task that was supposed to register it failed before it could register. | Use default() in the rescue, or register on a prior safe task.
end_host inside a block doesn't end the block cleanly | It skips remaining tasks but still runs always:. | That is usually what you want. If not, guard the always block with when:.
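
The "registered variable undefined in rescue" case is easy to reproduce. In the sketch below, both script paths are hypothetical: if prepare fails, the migrate task never runs, so migrate is never registered and the rescue has to allow for that.

```yaml
- name: Migrate with a rescue that tolerates a missing register
  block:
    - name: Prepare (hypothetical script; may fail first)
      ansible.builtin.command: /usr/local/bin/prepare
      changed_when: false

    - name: Migrate (never runs, and never registers, if prepare failed)
      ansible.builtin.command: /usr/local/bin/migrate
      register: migrate

  rescue:
    - name: Report without assuming migrate was registered
      ansible.builtin.debug:
        msg: "migrate rc: {{ migrate.rc | default('n/a') if migrate is defined else 'never ran' }}"
```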

Related reading: Ansible Best Practices, Ansible Testing, Handlers & Templates, Ansible Deploy Flow.