DR Runbook Template

A disaster recovery runbook should tell an operator what to restore, in what order, with which credentials, how to validate the result, and who must be updated while the work is happening.

If you only remember six things

A DR runbook needs an owner, a scope, dependencies, RPO and RTO targets, and exact activation criteria.
Do not put secrets in the document. Put references to where break-glass credentials or keys live.
Recovery steps must be ordered by dependency, not by team preference or service familiarity.
Every runbook needs validation steps that prove the service works, not just that processes started.
Include communication checkpoints and roles so updates happen without stalling technical work.
The most important field in a runbook is often "Last tested". An untested runbook is a draft.

On this page

What a DR runbook must contain
Copyable markdown template
Example recovery scenarios
Validation checklist
Communication plan
Post-incident review
Troubleshooting bad runbooks

What a DR runbook must contain

A usable runbook answers seven operator questions: when do we declare DR, what are we restoring, what must come back first, what backups or artifacts do we trust, who does each step, how do we prove success, and who needs status updates along the way.

Section	Why it matters
Document control	Shows owner, review date, last successful test, and version of the procedure being followed
Activation criteria	Stops teams from improvising DR for routine incidents or waiting too long for real disasters
Dependencies and priorities	Prevents restoring the app before DNS, identity, secrets, storage, or databases are ready
Recovery assets	Names the backups, build artifacts, infra code, credentials references, and dashboards needed to recover
Step-by-step procedure	Turns intent into an execution order an operator can follow under pressure
Validation	Defines how to prove the recovered service is usable and within RPO/RTO targets
Communication and review	Keeps stakeholders informed during recovery and captures learning afterward

Write DR runbooks alongside your backup plans from Backup & Restore. If a step depends on a database-specific restore flow, link the exact page such as Postgres Backup or MySQL Backup.
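One way to keep the "Dependencies and priorities" section honest is to derive the restore order from a declared dependency map rather than from memory. The following is a minimal sketch, not a prescribed tool: the component names and the `DEPENDS_ON` map are hypothetical, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each key lists what must be healthy before it.
DEPENDS_ON = {
    "dns": set(),
    "identity": {"dns"},
    "secrets": {"identity"},
    "database": {"secrets", "dns"},
    "object-storage": {"secrets"},
    "application": {"database", "object-storage", "identity"},
}


def recovery_order(depends_on):
    """Return a restore order in which every component follows its prerequisites."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # args[1] holds the nodes that form the cycle.
        raise ValueError(f"circular dependency in runbook: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, component in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(f"{step}. restore {component}")
```

A circular dependency raises an error instead of producing an order, which is usually a sign that the runbook hides a bootstrap problem worth resolving before an incident.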

Copyable markdown template

Copy this into your internal docs system and replace the placeholders. Keep it versioned. Keep it reviewed. Keep it tested.

# Disaster Recovery Runbook - [service-name]

## Document control
- Service: [service-name]
- Service owner: [team or primary owner]
- Runbook owner: [person or team]
- Recovery tier: [tier-1 | tier-2 | tier-3]
- Target RPO: [e.g. 15 minutes]
- Target RTO: [e.g. 2 hours]
- Last reviewed: [YYYY-MM-DD]
- Last tested: [YYYY-MM-DD]
- Source of truth: [link to repo or doc path]
- Related runbooks:
  - [backup runbook]
  - [database restore runbook]
  - [network / DNS runbook]

## Activation criteria
- DR is declared when:
  - [condition 1]
  - [condition 2]
- Not a DR event when:
  - [condition 1]
  - [condition 2]
- Declared by: [role]
- Incident commander: [name or role]
- Incident channel / bridge: [channel, room, bridge URL]
- Incident / ticket ID: [identifier]
- Change freeze required: [yes/no]

## Service summary
- What this service does:
- Primary users / customers:
- Business impact if unavailable:
- Data classification:
- Primary region / site:
- Recovery region / site:

## Dependencies and recovery order
| Priority | Dependency | Why required | Owner | Validation |
| --- | --- | --- | --- | --- |
| 1 | [identity / DNS / secrets / DB] | [reason] | [team] | [check] |
| 2 | [queue / object storage / cache] | [reason] | [team] | [check] |
| 3 | [application tier] | [reason] | [team] | [check] |

## Roles and contacts
| Role | Primary | Secondary | Contact path |
| --- | --- | --- | --- |
| Incident commander | [name] | [name] | [pager / phone / chat] |
| Service operator | [name] | [name] | [pager / phone / chat] |
| Database owner | [name] | [name] | [pager / phone / chat] |
| Network / DNS owner | [name] | [name] | [pager / phone / chat] |
| Communications lead | [name] | [name] | [pager / phone / chat] |

## Recovery assets
- Infrastructure code repo: [link]
- Deployment manifests / charts / Terraform path: [link]
- Golden images or build artifacts: [location]
- Backup locations:
  - Full backup: [path / bucket / repo]
  - Incremental / WAL / binlog: [path / bucket / repo]
  - Config backup: [path / bucket / repo]
- Credential references:
  - Break-glass account: [vault path or process]
  - KMS / decryption key reference: [vault path or process]
  - Cloud account / subscription reference: [account ID]
- Monitoring and logs:
  - Dashboard: [link]
  - Log search: [link]
  - Backup freshness metric: [link]

## Preconditions
1. Confirm the blast radius and affected dependencies.
2. Freeze non-essential changes and deployments.
3. Capture the current time and incident timeline start.
4. Confirm backup currency against the target RPO.
5. Confirm the target recovery environment is reachable.
6. Confirm required credentials and keys are accessible.

## Recovery procedure
### Scenario A - primary region loss
1. [Bring up control plane / networking in recovery region]
2. [Restore identity, DNS, or secrets dependencies]
3. [Restore database or data store]
4. [Restore application state or object storage]
5. [Deploy service tier]
6. [Switch traffic]
7. [Run validation]

### Scenario B - accidental deletion or corruption
1. [Stop writes or put the service in maintenance mode]
2. [Identify recovery target time]
3. [Restore from latest good backup]
4. [Replay logs to target time if needed]
5. [Validate data integrity]
6. [Resume writes]

### Scenario C - credential compromise or ransomware
1. [Isolate affected accounts / hosts]
2. [Use immutable or offline backup source]
3. [Rotate credentials before exposing recovered service]
4. [Restore clean infrastructure and data]
5. [Validate and reconnect dependencies]

## Command log
- Restore command(s):
- Validation command(s):
- Rollback command(s):
- Links to automation jobs:

## Validation checklist
- [ ] Service reachable at expected endpoint
- [ ] Authentication and authorization work
- [ ] Database or state store current within RPO
- [ ] Critical background jobs running
- [ ] Metrics, logs, and alerts flowing
- [ ] Customer-visible health checks green
- [ ] Recovery duration recorded against RTO

## Communication plan
### Internal status update
- Audience: [engineering leadership / support / stakeholders]
- Frequency: [every 15 minutes / every milestone]
- Template:
  - Status: [investigating | recovering | validating | complete]
  - Impact: [what users see]
  - Current step: [what operators are doing now]
  - Next update: [time]

### External / customer update
- Channel: [status page / email / support]
- Approval needed from: [role]
- Template:
  - We are currently [recovering / validating] [service-name].
  - Customer impact: [summary]
  - Data loss expectation: [none / up to X minutes / unknown]
  - Next update by: [time]

## Exit criteria
- [ ] Validation checklist complete
- [ ] Monitoring stable for [duration]
- [ ] Support / stakeholders informed
- [ ] Temporary access revoked or rotated
- [ ] Follow-up work captured

## Failback or normalization
1. [State whether service remains in DR site or returns to primary]
2. [Conditions required before failback]
3. [Traffic switch procedure]
4. [Data sync direction and cutover checks]

## Evidence to retain
- Timeline of major actions
- Backup identifiers used
- Restore timings
- Validation output
- Communication timestamps
- Follow-up issues / tickets

## Post-incident review
- What worked:
- What failed:
- What slowed recovery:
- Missing tools or access:
- Runbook changes required:
- Next scheduled DR test date:

Example recovery scenarios

Do not write one generic recovery section and hope it covers everything. Different failure modes need different restore sources, different stop points, and different validation emphasis.

Scenario	Primary decision	Typical restore source	Validation focus
Primary region outage	Fail over entire service stack or operate degraded	Recovery-region infrastructure plus recent backups	Traffic routing, dependency health, customer access
Application data corruption	Choose the recovery target time before the bad write	Latest good full backup taken before the bad write, plus WAL or binlogs	Data correctness, application transactions, downstream consistency
Credential compromise	Whether restored assets are still trustworthy	Immutable backups and clean infrastructure	Credential rotation, audit trail, residual attacker access
Accidental deletion of a single dataset	Object-level restore or full-service restore	Logical dump, snapshot clone, or object version history	Only affected dataset restored, no unrelated rollback

For the first 15 minutes of a real incident, pair the DR runbook with Incident: First 15 Minutes. The incident guide helps with initial control; the DR runbook tells you how to recover.
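For the data corruption scenario, the key decision is the recovery target time and which full backup to start from. The sketch below illustrates that choice under stated assumptions: the backup IDs, timestamps, and the 30-second safety margin are hypothetical, and the function only plans the restore; it does not call any database tooling.

```python
from datetime import datetime, timedelta, timezone


def plan_point_in_time_restore(backups, bad_write_at, safety_margin=timedelta(seconds=30)):
    """Pick the newest full backup completed before the recovery target.

    backups: iterable of (backup_id, completed_at) tuples.
    Returns (backup_id, target_time, replay_window), where replay_window is
    how much WAL or binlog must be replayed on top of the full backup.
    """
    target = bad_write_at - safety_margin
    candidates = [(done, backup_id) for backup_id, done in backups if done <= target]
    if not candidates:
        raise LookupError("no full backup predates the recovery target")
    done, backup_id = max(candidates)
    return backup_id, target, target - done


if __name__ == "__main__":
    utc = timezone.utc
    # Hypothetical nightly full backups.
    backups = [
        ("full-01", datetime(2024, 5, 1, 0, 0, tzinfo=utc)),
        ("full-02", datetime(2024, 5, 2, 0, 0, tzinfo=utc)),
        ("full-03", datetime(2024, 5, 3, 0, 0, tzinfo=utc)),
    ]
    bad_write = datetime(2024, 5, 2, 14, 5, tzinfo=utc)
    print(plan_point_in_time_restore(backups, bad_write))
```

Note that the newest backup overall (`full-03` here) is deliberately not chosen, because it was taken after the bad write and would contain the corruption.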

Validation checklist

Validation should be explicit and test the user path, data path, and operator path. If your checklist only says "service started", it is not a validation checklist.

Area	What to validate	Example
Reachability	Endpoint responds from expected region or site	`curl -fsS https://service.example.com/healthz`
Authentication	Users and service accounts can log in	OIDC login or API token smoke test
Data integrity	Critical tables, objects, or queues look sane	Row counts, checksum spot checks, known order lookup
Background processing	Schedulers, workers, and cron jobs are running	Queue drains, scheduled job heartbeat present
Observability	Metrics, logs, and alerts flow from the recovered system	Dashboards update and alerts route correctly
Security	Emergency credentials are rotated or revoked	Break-glass account logged and disabled after use

Run validation in the same order you restored dependencies. The fastest way to lose time is to debug an application symptom caused by an unvalidated lower-layer dependency.
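The ordering rule above can be sketched as a small runner that executes checks in restore order and stops at the first failure, so nobody debugs the application while a lower layer is still broken. This is an illustrative sketch: the check names and the lambda stand-ins are hypothetical, and real checks would wrap commands such as the health-check request from the table.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def run_validation(checks: List[Check]) -> List[Tuple[str, str]]:
    """Run checks in the given order and stop at the first failure.

    Returns (name, status) pairs where status is "pass", "fail", or "skipped".
    """
    results = []
    failed = False
    for name, check in checks:
        if failed:
            # A lower layer already failed; later results would be misleading.
            results.append((name, "skipped"))
            continue
        try:
            ok = bool(check())
        except Exception:
            # A check that crashes counts as a failed check.
            ok = False
        results.append((name, "pass" if ok else "fail"))
        failed = not ok
    return results


if __name__ == "__main__":
    # Stand-in checks, listed in the same order the dependencies were restored.
    checks = [
        ("dns", lambda: True),
        ("database", lambda: False),
        ("application", lambda: True),
    ]
    for name, status in run_validation(checks):
        print(f"{name}: {status}")
```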

Communication plan

Communication belongs in the runbook because operators under pressure will otherwise defer it. Define who speaks, where they speak, and how often updates happen.

Internal update example

Status: recovering
Impact: customer logins fail and API writes are unavailable
Current step: restoring primary database in recovery region
Data-loss estimate: up to 15 minutes
Next update: 14:30 UTC

External update example

We are recovering service in an alternate environment.
Current impact: logins and API writes remain unavailable.
Current data-loss estimate: no more than 15 minutes, still being validated.
Next public update by 14:30 UTC.

Use the same discipline as a planned change in Change Window Runbook: clear roles, timestamped updates, and an explicit next checkpoint.
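Generating updates from a fixed structure is one way to keep them consistent when the communications lead is under pressure. The sketch below renders the internal update fields from the template and computes the next checkpoint from a cadence. The `StatusUpdate` class and its field names are hypothetical, and it assumes timestamps are timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Status values taken from the internal update template.
ALLOWED_STATUSES = ("investigating", "recovering", "validating", "complete")


@dataclass
class StatusUpdate:
    status: str
    impact: str
    current_step: str
    sent_at: datetime
    cadence: timedelta = timedelta(minutes=15)

    def render(self) -> str:
        """Format the update and compute the next checkpoint in UTC."""
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if self.sent_at.tzinfo is None:
            raise ValueError("sent_at must be timezone-aware")
        next_update = (self.sent_at + self.cadence).astimezone(timezone.utc)
        return "\n".join([
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"Current step: {self.current_step}",
            f"Next update: {next_update.strftime('%H:%M')} UTC",
        ])


if __name__ == "__main__":
    update = StatusUpdate(
        status="recovering",
        impact="customer logins fail and API writes are unavailable",
        current_step="restoring primary database in recovery region",
        sent_at=datetime(2024, 5, 2, 14, 15, tzinfo=timezone.utc),
    )
    print(update.render())
```

Rejecting unknown status values keeps the vocabulary in updates aligned with the runbook, so readers do not have to interpret ad hoc wording.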

Post-incident review

A DR event or drill should always produce edits. Review what slowed recovery, what information was stale, what access was missing, and whether the declared RPO/RTO were realistic.

Capture actual restore duration per major component.
Record any manual steps that should become automation.
Update dependency order if a hidden prerequisite appeared during recovery.
Review whether communication cadence matched stakeholder needs.
Schedule the next drill while the pain is still memorable.
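To judge whether the declared RTO was realistic, it helps to compute restore durations from the incident timeline rather than estimate them afterward. The following is a rough sketch with a hypothetical timeline and component names; it treats the recovery duration as the span from the first restore start to the last restore finish.

```python
from datetime import datetime, timedelta


def restore_report(timeline, rto):
    """Summarize per-component restore durations and compare the total to the RTO.

    timeline: list of (component, started_at, finished_at) tuples.
    """
    durations = {name: end - start for name, start, end in timeline}
    first_start = min(start for _, start, _ in timeline)
    last_finish = max(end for _, _, end in timeline)
    total = last_finish - first_start
    return {
        "durations": durations,
        "total": total,
        "slowest": max(durations, key=durations.get),
        "within_rto": total <= rto,
    }


if __name__ == "__main__":
    # Hypothetical timeline captured during a drill.
    timeline = [
        ("dns", datetime(2024, 5, 2, 13, 0), datetime(2024, 5, 2, 13, 10)),
        ("database", datetime(2024, 5, 2, 13, 10), datetime(2024, 5, 2, 14, 20)),
        ("application", datetime(2024, 5, 2, 14, 20), datetime(2024, 5, 2, 14, 50)),
    ]
    report = restore_report(timeline, rto=timedelta(hours=2))
    print(report["total"], report["slowest"], report["within_rto"])
```

The slowest component is usually the best candidate for the next automation or pre-provisioning effort.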

If the service is containerized, the recovery path may depend on image registries, cluster state, and manifest sources. Carry those into the runbook and then continue with Containers 101 for platform-specific recovery dependencies.

Troubleshooting bad runbooks

Failure mode	What it looks like during an incident	How to fix it now
No activation criteria	Teams debate whether to declare DR while recovery time slips away	Add explicit triggers, decision owners, and a "not DR" list
No last-tested date	Operators discover commands, paths, or owners changed months ago	Track review and test dates in document control and alert on staleness
Secrets embedded or missing	Either sensitive data leaks in docs, or recovery stalls because no one knows where credentials live	Reference vault paths or break-glass procedures, never raw secrets
Restore steps not dependency-ordered	Application recovery starts before DNS, database, or identity are ready	Rewrite the procedure by dependency chain and restore priority
No validation steps	Incident is declared resolved even though the service is only partially working	Add concrete commands, health checks, and data sanity tests
No communication owner	Status updates stop while engineers are busy recovering	Assign a communications lead and define update cadence in the runbook
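Several of these failure modes can be caught before an incident with a simple lint over the runbook file. The sketch below checks for required `##` sections from the template and flags a missing or stale "Last tested" date. The `lint_runbook` function, the chosen section list, and the 180-day threshold are assumptions for illustration, not a standard.

```python
import re
from datetime import date, timedelta

# Section headings taken from the markdown template; adjust to your own.
REQUIRED_SECTIONS = (
    "Document control",
    "Activation criteria",
    "Dependencies and recovery order",
    "Recovery procedure",
    "Validation checklist",
    "Communication plan",
)


def lint_runbook(text, today, max_age=timedelta(days=180)):
    """Return a list of human-readable problems found in a markdown runbook."""
    problems = []
    headings = {m.group(1).strip() for m in re.finditer(r"^## (.+)$", text, re.MULTILINE)}
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")

    match = re.search(r"^- Last tested:\s*(\d{4}-\d{2}-\d{2})\s*$", text, re.MULTILINE)
    if match is None:
        problems.append("Last tested date is missing or still a placeholder")
        return problems
    try:
        tested = date.fromisoformat(match.group(1))
    except ValueError:
        problems.append(f"Last tested is not a valid date: {match.group(1)}")
        return problems
    if today - tested > max_age:
        problems.append(f"Last tested {tested} is older than {max_age.days} days")
    return problems


if __name__ == "__main__":
    sample = "## Document control\n- Last tested: [YYYY-MM-DD]\n"
    for problem in lint_runbook(sample, today=date.today()):
        print(problem)
```

Running a check like this in CI or on a schedule is one way to implement the "alert on staleness" fix without relying on someone remembering to review the document.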