DR Runbook Template

A disaster recovery runbook should tell an operator what to restore, in what order, with which credentials, how to validate the result, and who must be updated while the work is happening.

If you only remember six things
  • A DR runbook needs owner, scope, dependencies, RPO, RTO, and exact activation criteria.
  • Do not put secrets in the document. Put references to where break-glass credentials or keys live.
  • Recovery steps must be ordered by dependency, not by team preference or service familiarity.
  • Every runbook needs validation steps that prove the service works, not just that processes started.
  • Include communication checkpoints and roles so updates happen without stalling technical work.
  • The most important field in a runbook is often Last tested. Untested runbooks are drafts.

What a DR runbook must contain

A usable runbook answers seven operator questions: when do we declare DR, what are we restoring, what must come back first, what backups or artifacts do we trust, who does each step, how do we prove success, and who needs status updates along the way.

| Section | Why it matters |
| --- | --- |
| Document control | Shows owner, review date, last successful test, and version of the procedure being followed |
| Activation criteria | Stops teams from improvising DR for routine incidents or waiting too long for real disasters |
| Dependencies and priorities | Prevents restoring the app before DNS, identity, secrets, storage, or databases are ready |
| Recovery assets | Names the backups, build artifacts, infra code, credentials references, and dashboards needed to recover |
| Step-by-step procedure | Turns intent into an execution order an operator can follow under pressure |
| Validation | Defines how to prove the recovered service is usable and within RPO/RTO targets |
| Communication and review | Keeps stakeholders informed during recovery and captures learning afterward |

Write DR runbooks alongside your backup plans from Backup & Restore. If a step depends on a database-specific restore flow, link the exact page, such as Postgres Backup or MySQL Backup.

Copyable markdown template

Copy this into your internal docs system and replace the placeholders. Keep it versioned. Keep it reviewed. Keep it tested.

# Disaster Recovery Runbook - [service-name]

## Document control
- Service: [service-name]
- Service owner: [team or primary owner]
- Runbook owner: [person or team]
- Recovery tier: [tier-1 | tier-2 | tier-3]
- Target RPO: [e.g. 15 minutes]
- Target RTO: [e.g. 2 hours]
- Last reviewed: [YYYY-MM-DD]
- Last tested: [YYYY-MM-DD]
- Source of truth: [link to repo or doc path]
- Related runbooks:
  - [backup runbook]
  - [database restore runbook]
  - [network / DNS runbook]

## Activation criteria
- DR is declared when:
  - [condition 1]
  - [condition 2]
- Not a DR event when:
  - [condition 1]
  - [condition 2]
- Declared by: [role]
- Incident commander: [name or role]
- Incident channel / bridge: [channel, room, bridge URL]
- Incident / ticket ID: [identifier]
- Change freeze required: [yes/no]

## Service summary
- What this service does:
- Primary users / customers:
- Business impact if unavailable:
- Data classification:
- Primary region / site:
- Recovery region / site:

## Dependencies and recovery order
| Priority | Dependency | Why required | Owner | Validation |
| --- | --- | --- | --- | --- |
| 1 | [identity / DNS / secrets / DB] | [reason] | [team] | [check] |
| 2 | [queue / object storage / cache] | [reason] | [team] | [check] |
| 3 | [application tier] | [reason] | [team] | [check] |

## Roles and contacts
| Role | Primary | Secondary | Contact path |
| --- | --- | --- | --- |
| Incident commander | [name] | [name] | [pager / phone / chat] |
| Service operator | [name] | [name] | [pager / phone / chat] |
| Database owner | [name] | [name] | [pager / phone / chat] |
| Network / DNS owner | [name] | [name] | [pager / phone / chat] |
| Communications lead | [name] | [name] | [pager / phone / chat] |

## Recovery assets
- Infrastructure code repo: [link]
- Deployment manifests / charts / Terraform path: [link]
- Golden images or build artifacts: [location]
- Backup locations:
  - Full backup: [path / bucket / repo]
  - Incremental / WAL / binlog: [path / bucket / repo]
  - Config backup: [path / bucket / repo]
- Credential references:
  - Break-glass account: [vault path or process]
  - KMS / decryption key reference: [vault path or process]
  - Cloud account / subscription reference: [account ID]
- Monitoring and logs:
  - Dashboard: [link]
  - Log search: [link]
  - Backup freshness metric: [link]

## Preconditions
1. Confirm the blast radius and affected dependencies.
2. Freeze non-essential changes and deployments.
3. Capture the current time and incident timeline start.
4. Confirm backup currency against the target RPO.
5. Confirm the target recovery environment is reachable.
6. Confirm required credentials and keys are accessible.

## Recovery procedure
### Scenario A - primary region loss
1. [Bring up control plane / networking in recovery region]
2. [Restore identity, DNS, or secrets dependencies]
3. [Restore database or data store]
4. [Restore application state or object storage]
5. [Deploy service tier]
6. [Switch traffic]
7. [Run validation]

### Scenario B - accidental deletion or corruption
1. [Stop writes or put the service in maintenance mode]
2. [Identify recovery target time]
3. [Restore from latest good backup]
4. [Replay logs to target time if needed]
5. [Validate data integrity]
6. [Resume writes]

### Scenario C - credential compromise or ransomware
1. [Isolate affected accounts / hosts]
2. [Use immutable or offline backup source]
3. [Rotate credentials before exposing recovered service]
4. [Restore clean infrastructure and data]
5. [Validate and reconnect dependencies]

## Command log
- Restore command(s):
- Validation command(s):
- Rollback command(s):
- Links to automation jobs:

## Validation checklist
- [ ] Service reachable at expected endpoint
- [ ] Authentication and authorization work
- [ ] Database or state store current within RPO
- [ ] Critical background jobs running
- [ ] Metrics, logs, and alerts flowing
- [ ] Customer-visible health checks green
- [ ] Recovery duration recorded against RTO

## Communication plan
### Internal status update
- Audience: [engineering leadership / support / stakeholders]
- Frequency: [every 15 minutes / every milestone]
- Template:
  - Status: [investigating | recovering | validating | complete]
  - Impact: [what users see]
  - Current step: [what operators are doing now]
  - Next update: [time]

### External / customer update
- Channel: [status page / email / support]
- Approval needed from: [role]
- Template:
  - We are currently [recovering / validating] [service-name].
  - Customer impact: [summary]
  - Data loss expectation: [none / up to X minutes / unknown]
  - Next update by: [time]

## Exit criteria
- [ ] Validation checklist complete
- [ ] Monitoring stable for [duration]
- [ ] Support / stakeholders informed
- [ ] Temporary access revoked or rotated
- [ ] Follow-up work captured

## Failback or normalization
1. [State whether service remains in DR site or returns to primary]
2. [Conditions required before failback]
3. [Traffic switch procedure]
4. [Data sync direction and cutover checks]

## Evidence to retain
- Timeline of major actions
- Backup identifiers used
- Restore timings
- Validation output
- Communication timestamps
- Follow-up issues / tickets

## Post-incident review
- What worked:
- What failed:
- What slowed recovery:
- Missing tools or access:
- Runbook changes required:
- Next scheduled DR test date:

Example recovery scenarios

Do not write one generic recovery section and hope it covers everything. Different failure modes need different restore sources, different stop points, and different validation emphasis.

| Scenario | Primary decision | Typical restore source | Validation focus |
| --- | --- | --- | --- |
| Primary region outage | Fail over entire service stack or operate degraded | Recovery-region infrastructure plus recent backups | Traffic routing, dependency health, customer access |
| Application data corruption | Choose the recovery target time before the bad write | Latest full backup plus WAL or binlogs | Data correctness, application transactions, downstream consistency |
| Credential compromise | Whether restored assets are still trustworthy | Immutable backups and clean infrastructure | Credential rotation, audit trail, residual attacker access |
| Accidental deletion of a single dataset | Object-level restore or full-service restore | Logical dump, snapshot clone, or object version history | Only affected dataset restored, no unrelated rollback |
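
For the data-corruption scenario, the restore base is the newest full backup that finished at or before the recovery target time; logs are then replayed from that base up to the target. The following is a minimal sketch of that selection, assuming a backup catalog you can list as records with completion timestamps. The `Backup` class, `pick_restore_base` function, and the sample identifiers are hypothetical names for illustration, not part of any backup tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backup:
    backup_id: str          # hypothetical identifier from your backup catalog
    completed_at: datetime  # when the full backup finished (timezone-aware, UTC)


def pick_restore_base(backups: Sequence[Backup], target_time: datetime) -> Optional[Backup]:
    """Return the newest full backup completed at or before target_time, or None."""
    candidates = [b for b in backups if b.completed_at <= target_time]
    return max(candidates, key=lambda b: b.completed_at, default=None)


if __name__ == "__main__":
    utc = timezone.utc
    catalog = [
        Backup("full-jan10-0300", datetime(2024, 1, 10, 3, 0, tzinfo=utc)),
        Backup("full-jan10-1500", datetime(2024, 1, 10, 15, 0, tzinfo=utc)),
        Backup("full-jan11-0300", datetime(2024, 1, 11, 3, 0, tzinfo=utc)),
    ]
    # Assumed for illustration: the bad write landed at 16:42 UTC,
    # so the recovery target is just before it.
    target = datetime(2024, 1, 10, 16, 41, tzinfo=utc)
    base = pick_restore_base(catalog, target)
    print(base.backup_id if base else "no usable backup")  # prints: full-jan10-1500
```

A `None` result means no backup predates the target time, which is worth surfacing early because it changes the data-loss expectation you communicate.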

For the first 15 minutes of a real incident, pair the DR runbook with Incident: First 15 Minutes. The incident guide helps with initial control; the DR runbook tells you how to recover.

Validation checklist

Validation should be explicit and test the user path, data path, and operator path. If your checklist only says "service started", it is not a validation checklist.

| Area | What to validate | Example |
| --- | --- | --- |
| Reachability | Endpoint responds from expected region or site | curl -fsS https://service.example.com/healthz |
| Authentication | Users and service accounts can log in | OIDC login or API token smoke test |
| Data integrity | Critical tables, objects, or queues look sane | Row counts, checksum spot checks, known order lookup |
| Background processing | Schedulers, workers, and cron jobs are running | Queue drains, scheduled job heartbeat present |
| Observability | Metrics, logs, and alerts flow from the recovered system | Dashboards update and alerts route correctly |
| Security | Emergency credentials are rotated or revoked | Break-glass account logged and disabled after use |

Run validation in the same order you restored dependencies. The fastest way to lose time is to debug an application symptom caused by an unvalidated lower-layer dependency.
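
One way to keep restore order and validation order aligned is to derive both from a single dependency map. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) to order components so that dependencies come first, then runs one check per component and stops at the first failure. The component names and the stand-in checks are assumptions for illustration; real checks would probe health endpoints or run queries.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def restore_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return components so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def validate_in_order(order: List[str], checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run checks in restore order and stop at the first failure.

    Returns the components that passed before the stop.
    """
    passed: List[str] = []
    for component in order:
        if not checks[component]():
            print(f"FAILED: {component}; fix this layer before checking its dependents")
            break
        passed.append(component)
    return passed


if __name__ == "__main__":
    # Hypothetical map: component -> the components it needs first.
    deps = {
        "dns": set(),
        "identity": {"dns"},
        "database": {"dns"},
        "app": {"identity", "database"},
    }
    order = restore_order(deps)
    # Stand-in checks that always pass, except a simulated database failure.
    checks: Dict[str, Callable[[], bool]] = {name: (lambda: True) for name in order}
    checks["database"] = lambda: False
    print("restore order:", order)
    print("validated:", validate_in_order(order, checks))
```

Stopping at the first failed layer is a deliberate choice here: it keeps operators from chasing application symptoms while a lower-layer dependency is still broken.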

Communication plan

Communication belongs in the runbook because operators under pressure will otherwise defer it. Define who speaks, where they speak, and how often updates happen.

Internal update example

Status: recovering
Impact: customer logins fail and API writes are unavailable
Current step: restoring primary database in recovery region
RPO expectation: up to 15 minutes of data loss
Next update: 14:30 UTC

External update example

We are recovering service in an alternate environment.
Current impact: logins and API writes remain unavailable.
Current data-loss estimate: no more than 15 minutes, still being validated.
Next public update by 14:30 UTC.
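
If updates are posted from a script or chat bot, the internal format above can be generated so the next checkpoint is never left out. This is a minimal sketch, assuming a fixed update cadence; the `internal_update` function and its parameters are hypothetical, and the field labels simply mirror the internal example.

```python
from datetime import datetime, timedelta, timezone


def internal_update(status: str, impact: str, current_step: str,
                    rpo_expectation: str, now: datetime,
                    cadence_minutes: int = 15) -> str:
    """Format an internal status update with an explicit next checkpoint.

    `now` must be timezone-aware so the UTC conversion is unambiguous.
    """
    next_update = (now + timedelta(minutes=cadence_minutes)).astimezone(timezone.utc)
    return "\n".join([
        f"Status: {status}",
        f"Impact: {impact}",
        f"Current step: {current_step}",
        f"RPO expectation: {rpo_expectation}",
        f"Next update: {next_update:%H:%M} UTC",
    ])


if __name__ == "__main__":
    print(internal_update(
        status="recovering",
        impact="customer logins fail and API writes are unavailable",
        current_step="restoring primary database in recovery region",
        rpo_expectation="up to 15 minutes of data loss",
        now=datetime(2024, 1, 10, 14, 15, tzinfo=timezone.utc),
    ))
```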

Use the same discipline as a planned change in Change Window Runbook: clear roles, timestamped updates, and an explicit next checkpoint.

Post-incident review

A DR event or drill should always produce edits. Review what slowed recovery, what information was stale, what access was missing, and whether the declared RPO/RTO were realistic.
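
Checking whether the declared targets were realistic comes down to simple timestamp arithmetic on the retained evidence. The sketch below assumes achieved RPO is measured from the last recoverable write to the start of the outage, and achieved RTO from the start of the outage to the moment validation passed; some teams measure RTO from the DR declaration instead, so adjust to your own definition. The function name and sample times are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def recovery_metrics(last_recoverable_write: datetime, outage_start: datetime,
                     service_validated: datetime,
                     target_rpo: timedelta, target_rto: timedelta) -> Dict[str, object]:
    """Compare achieved data loss and recovery time against the declared targets."""
    achieved_rpo = outage_start - last_recoverable_write  # window of lost data
    achieved_rto = service_validated - outage_start       # time until validated service
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= target_rpo,
        "rto_met": achieved_rto <= target_rto,
    }


if __name__ == "__main__":
    utc = timezone.utc
    result = recovery_metrics(
        last_recoverable_write=datetime(2024, 1, 10, 13, 50, tzinfo=utc),
        outage_start=datetime(2024, 1, 10, 14, 0, tzinfo=utc),
        service_validated=datetime(2024, 1, 10, 16, 30, tzinfo=utc),
        target_rpo=timedelta(minutes=15),
        target_rto=timedelta(hours=2),
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

In this made-up drill the RPO target is met (10 minutes of loss against a 15-minute target) while the RTO target is missed (2.5 hours against 2 hours), which is exactly the kind of finding that should turn into a runbook edit.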

If the service is containerized, the recovery path may depend on image registries, cluster state, and manifest sources. Carry those into the runbook and then continue with Containers 101 for platform-specific recovery dependencies.

Troubleshooting bad runbooks

| Failure mode | What it looks like during an incident | How to fix it now |
| --- | --- | --- |
| No activation criteria | Teams debate whether to declare DR while recovery time slips away | Add explicit triggers, decision owners, and a "not DR" list |
| No last-tested date | Operators discover commands, paths, or owners changed months ago | Track review and test dates in document control and alert on staleness |
| Secrets embedded or missing | Either sensitive data leaks in docs, or recovery stalls because no one knows where credentials live | Reference vault paths or break-glass procedures, never raw secrets |
| Restore steps not dependency-ordered | Application recovery starts before DNS, database, or identity are ready | Rewrite the procedure by dependency chain and restore priority |
| No validation steps | Incident is declared resolved even though the service is only partially working | Add concrete commands, health checks, and data sanity tests |
| No communication owner | Status updates stop while engineers are busy recovering | Assign a communications lead and define update cadence in the runbook |
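
Alerting on staleness can start as a small script run in CI or on a schedule against each runbook file. The sketch below reads the "Last reviewed" and "Last tested" lines from the template's document-control section and reports any that are missing, still a placeholder, or older than a threshold. The 180-day default and the `stale_fields` name are assumptions for illustration; pick a threshold that matches your own DR test schedule.

```python
import re
from datetime import date, timedelta
from typing import List

# Matches template lines such as "- Last tested: 2024-01-10".
FIELD_RE = re.compile(r"^- (Last reviewed|Last tested):[ \t]*(.*)$", re.MULTILINE)


def stale_fields(runbook_text: str, today: date, max_age_days: int = 180) -> List[str]:
    """Return problems found in the review and test dates of a runbook."""
    problems: List[str] = []
    found = dict(FIELD_RE.findall(runbook_text))
    for field in ("Last reviewed", "Last tested"):
        raw = found.get(field, "").strip()
        try:
            when = date.fromisoformat(raw)
        except ValueError:
            # Covers a missing field and an unfilled "[YYYY-MM-DD]" placeholder.
            problems.append(f"{field}: missing or not a YYYY-MM-DD date ({raw!r})")
            continue
        if today - when > timedelta(days=max_age_days):
            problems.append(f"{field}: {when} is older than {max_age_days} days")
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        "## Document control",
        "- Last reviewed: 2024-01-05",
        "- Last tested: [YYYY-MM-DD]",
    ])
    for problem in stale_fields(sample, today=date(2024, 3, 1)):
        print(problem)
```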

See also: Backup & Restore, Incident: First 15 Minutes, Change Window Runbook, and SLOs & On-Call.