DR Runbook Template
- A DR runbook needs an owner, a scope, dependencies, RPO and RTO targets, and exact activation criteria.
- Do not put secrets in the document. Put references to where break-glass credentials or keys live.
- Recovery steps must be ordered by dependency, not by team preference or service familiarity.
- Every runbook needs validation steps that prove the service works, not just that processes started.
- Include communication checkpoints and roles so updates happen without stalling technical work.
- The most important field in a runbook is often Last tested. Untested runbooks are drafts.
What a DR runbook must contain
A usable runbook answers seven operator questions: when do we declare DR, what are we restoring, what must come back first, what backups or artifacts do we trust, who does each step, how do we prove success, and who needs status updates along the way.
| Section | Why it matters |
|---|---|
| Document control | Shows owner, review date, last successful test, and version of the procedure being followed |
| Activation criteria | Stops teams from improvising DR for routine incidents or waiting too long for real disasters |
| Dependencies and priorities | Prevents restoring the app before DNS, identity, secrets, storage, or databases are ready |
| Recovery assets | Names the backups, build artifacts, infra code, credentials references, and dashboards needed to recover |
| Step-by-step procedure | Turns intent into an execution order an operator can follow under pressure |
| Validation | Defines how to prove the recovered service is usable and within RPO/RTO targets |
| Communication and review | Keeps stakeholders informed during recovery and captures learning afterward |
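The "Dependencies and priorities" row can be made concrete by deriving the restore order from a dependency map instead of from memory. The following is a minimal sketch using Python's standard `graphlib` module; the component names and the `DEPENDS_ON` map are hypothetical placeholders, not a description of any real service.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists what must be restored before it.
DEPENDS_ON = {
    "dns": [],
    "identity": ["dns"],
    "secrets": ["identity"],
    "database": ["secrets"],
    "application": ["database", "identity"],
}

def recovery_order(depends_on):
    """Return components in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError if the map contains a cycle.
    return list(TopologicalSorter(depends_on).static_order())

if __name__ == "__main__":
    for priority, component in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(priority, component)
```

A cycle in the map is worth treating as a finding in its own right: it usually means two components each assume the other is already up, which is exactly the situation a DR test should expose.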
Write DR runbooks alongside your backup plans from Backup & Restore. If a step depends on a database-specific restore flow, link the exact page, such as Postgres Backup or MySQL Backup.
Copyable markdown template
Copy this into your internal docs system and replace the placeholders. Keep it versioned. Keep it reviewed. Keep it tested.
# Disaster Recovery Runbook - [service-name]
## Document control
- Service: [service-name]
- Service owner: [team or primary owner]
- Runbook owner: [person or team]
- Recovery tier: [tier-1 | tier-2 | tier-3]
- Target RPO: [e.g. 15 minutes]
- Target RTO: [e.g. 2 hours]
- Last reviewed: [YYYY-MM-DD]
- Last tested: [YYYY-MM-DD]
- Source of truth: [link to repo or doc path]
- Related runbooks:
  - [backup runbook]
  - [database restore runbook]
  - [network / DNS runbook]
## Activation criteria
- DR is declared when:
  - [condition 1]
  - [condition 2]
- Not a DR event when:
  - [condition 1]
  - [condition 2]
- Declared by: [role]
- Incident commander: [name or role]
- Incident channel / bridge: [channel, room, bridge URL]
- Incident / ticket ID: [identifier]
- Change freeze required: [yes/no]
## Service summary
- What this service does:
- Primary users / customers:
- Business impact if unavailable:
- Data classification:
- Primary region / site:
- Recovery region / site:
## Dependencies and recovery order
| Priority | Dependency | Why required | Owner | Validation |
| --- | --- | --- | --- | --- |
| 1 | [identity / DNS / secrets / DB] | [reason] | [team] | [check] |
| 2 | [queue / object storage / cache] | [reason] | [team] | [check] |
| 3 | [application tier] | [reason] | [team] | [check] |
## Roles and contacts
| Role | Primary | Secondary | Contact path |
| --- | --- | --- | --- |
| Incident commander | [name] | [name] | [pager / phone / chat] |
| Service operator | [name] | [name] | [pager / phone / chat] |
| Database owner | [name] | [name] | [pager / phone / chat] |
| Network / DNS owner | [name] | [name] | [pager / phone / chat] |
| Communications lead | [name] | [name] | [pager / phone / chat] |
## Recovery assets
- Infrastructure code repo: [link]
- Deployment manifests / charts / Terraform path: [link]
- Golden images or build artifacts: [location]
- Backup locations:
  - Full backup: [path / bucket / repo]
  - Incremental / WAL / binlog: [path / bucket / repo]
  - Config backup: [path / bucket / repo]
- Credential references:
  - Break-glass account: [vault path or process]
  - KMS / decryption key reference: [vault path or process]
  - Cloud account / subscription reference: [account ID]
- Monitoring and logs:
  - Dashboard: [link]
  - Log search: [link]
  - Backup freshness metric: [link]
## Preconditions
1. Confirm the blast radius and affected dependencies.
2. Freeze non-essential changes and deployments.
3. Capture the current time and incident timeline start.
4. Confirm backup currency against the target RPO.
5. Confirm the target recovery environment is reachable.
6. Confirm required credentials and keys are accessible.
## Recovery procedure
### Scenario A - primary region loss
1. [Bring up control plane / networking in recovery region]
2. [Restore identity, DNS, or secrets dependencies]
3. [Restore database or data store]
4. [Restore application state or object storage]
5. [Deploy service tier]
6. [Switch traffic]
7. [Run validation]
### Scenario B - accidental deletion or corruption
1. [Stop writes or put the service in maintenance mode]
2. [Identify recovery target time]
3. [Restore from latest good backup]
4. [Replay logs to target time if needed]
5. [Validate data integrity]
6. [Resume writes]
### Scenario C - credential compromise or ransomware
1. [Isolate affected accounts / hosts]
2. [Use immutable or offline backup source]
3. [Rotate credentials before exposing recovered service]
4. [Restore clean infrastructure and data]
5. [Validate and reconnect dependencies]
## Command log
- Restore command(s):
- Validation command(s):
- Rollback command(s):
- Links to automation jobs:
## Validation checklist
- [ ] Service reachable at expected endpoint
- [ ] Authentication and authorization work
- [ ] Database or state store current within RPO
- [ ] Critical background jobs running
- [ ] Metrics, logs, and alerts flowing
- [ ] Customer-visible health checks green
- [ ] Recovery duration recorded against RTO
## Communication plan
### Internal status update
- Audience: [engineering leadership / support / stakeholders]
- Frequency: [every 15 minutes / every milestone]
- Template:
  - Status: [investigating | recovering | validating | complete]
  - Impact: [what users see]
  - Current step: [what operators are doing now]
  - Next update: [time]
### External / customer update
- Channel: [status page / email / support]
- Approval needed from: [role]
- Template:
  - We are currently [recovering / validating] [service-name].
  - Customer impact: [summary]
  - Data loss expectation: [none / up to X minutes / unknown]
  - Next update by: [time]
## Exit criteria
- [ ] Validation checklist complete
- [ ] Monitoring stable for [duration]
- [ ] Support / stakeholders informed
- [ ] Temporary access revoked or rotated
- [ ] Follow-up work captured
## Failback or normalization
1. [State whether service remains in DR site or returns to primary]
2. [Conditions required before failback]
3. [Traffic switch procedure]
4. [Data sync direction and cutover checks]
## Evidence to retain
- Timeline of major actions
- Backup identifiers used
- Restore timings
- Validation output
- Communication timestamps
- Follow-up issues / tickets
## Post-incident review
- What worked:
- What failed:
- What slowed recovery:
- Missing tools or access:
- Runbook changes required:
- Next scheduled DR test date:
Example recovery scenarios
Do not write one generic recovery section and hope it covers everything. Different failure modes need different restore sources, different stop points, and different validation emphasis.
| Scenario | Primary decision | Typical restore source | Validation focus |
|---|---|---|---|
| Primary region outage | Fail over entire service stack or operate degraded | Recovery-region infrastructure plus recent backups | Traffic routing, dependency health, customer access |
| Application data corruption | Choose the recovery target time before the bad write | Latest full backup plus WAL or binlogs | Data correctness, application transactions, downstream consistency |
| Credential compromise | Whether restored assets are still trustworthy | Immutable backups and clean infrastructure | Credential rotation, audit trail, residual attacker access |
| Accidental deletion of a single dataset | Object-level restore or full-service restore | Logical dump, snapshot clone, or object version history | Only affected dataset restored, no unrelated rollback |
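For the data corruption row, the restore source follows from the recovery target time: take the newest full backup completed before the bad write, then replay logs up to the target. The sketch below illustrates only the selection step. The `Backup` record and the backup identifiers are hypothetical, and it assumes timezone-aware UTC timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Backup:
    backup_id: str          # hypothetical identifier
    completed_at: datetime  # when the backup finished (UTC)

def pick_base_backup(backups, target_time):
    """Return the newest backup completed at or before target_time, or None."""
    candidates = [b for b in backups if b.completed_at <= target_time]
    return max(candidates, key=lambda b: b.completed_at, default=None)

backups = [
    Backup("full-01", datetime(2024, 5, 1, 0, 0, tzinfo=timezone.utc)),
    Backup("full-02", datetime(2024, 5, 2, 0, 0, tzinfo=timezone.utc)),
    Backup("full-03", datetime(2024, 5, 3, 0, 0, tzinfo=timezone.utc)),
]
# Recovery target: just before a bad write observed at 2024-05-02 13:00 UTC.
target = datetime(2024, 5, 2, 12, 59, tzinfo=timezone.utc)
base = pick_base_backup(backups, target)
print(base.backup_id)  # full-02
```

The gap between the chosen backup and the target time still has to be covered by WAL or binlogs. If `pick_base_backup` returns `None`, no backup predates the bad write and the scenario needs a different restore source.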
For the first 15 minutes of a real incident, pair the DR runbook with Incident: First 15 Minutes. The incident guide helps with initial control; the DR runbook tells you how to recover.
Validation checklist
Validation should be explicit and test the user path, data path, and operator path. If your checklist only says "service started", it is not a validation checklist.
| Area | What to validate | Example |
|---|---|---|
| Reachability | Endpoint responds from expected region or site | `curl -fsS https://service.example.com/healthz` |
| Authentication | Users and service accounts can log in | OIDC login or API token smoke test |
| Data integrity | Critical tables, objects, or queues look sane | Row counts, checksum spot checks, known order lookup |
| Background processing | Schedulers, workers, and cron jobs are running | Queue drains, scheduled job heartbeat present |
| Observability | Metrics, logs, and alerts flow from the recovered system | Dashboards update and alerts route correctly |
| Security | Emergency credentials are rotated or revoked | Break-glass account logged and disabled after use |
Run validation in the same order you restored dependencies. The fastest way to lose time is to debug an application symptom caused by an unvalidated lower-layer dependency.
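One way to enforce that ordering is to run checks as an ordered list and stop at the first failure, so nobody debugs the application tier while a lower layer is still broken. This is a minimal sketch; the check names and the lambdas are hypothetical stand-ins for real probes such as a health-check request or a database query.

```python
def run_validation(checks):
    """Run (name, check) pairs in order; stop at the first failure.

    Returns (passed_names, failed_name); failed_name is None if all passed.
    """
    passed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            return passed, name
        passed.append(name)
    return passed, None

# Hypothetical checks listed in restore order; replace the lambdas with real probes.
checks = [
    ("dns", lambda: True),
    ("database", lambda: False),    # simulated lower-layer failure
    ("application", lambda: True),  # not reached
]
passed, failed = run_validation(checks)
print(passed, failed)  # ['dns'] database
```

Stopping early is a deliberate choice here: later checks would mostly report symptoms of the first failure rather than new information.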
Communication plan
Communication belongs in the runbook because operators under pressure will otherwise defer it. Define who speaks, where they speak, and how often updates happen.
Internal update example
Status: recovering
Impact: customer logins fail and API writes are unavailable
Current step: restoring primary database in recovery region
RPO expectation: up to 15 minutes of data loss
Next update: 14:30 UTC
External update example
We are recovering service in an alternate environment.
Current impact: logins and API writes remain unavailable.
Current data-loss estimate: no more than 15 minutes, still being validated.
Next public update by 14:30 UTC.
Use the same discipline as a planned change in Change Window Runbook: clear roles, timestamped updates, and an explicit next checkpoint.
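If updates are generated rather than typed by hand, the next checkpoint can be computed from the agreed cadence so it is never omitted. The helper below is a hypothetical sketch, not part of any incident tooling. It assumes `now` is a timezone-aware UTC timestamp and defaults to a 15-minute cadence.

```python
from datetime import datetime, timedelta, timezone

def internal_update(status, impact, current_step, now, cadence_minutes=15):
    """Render an internal status update with an explicit next checkpoint."""
    next_update = now + timedelta(minutes=cadence_minutes)
    return "\n".join([
        f"Status: {status}",
        f"Impact: {impact}",
        f"Current step: {current_step}",
        f"Next update: {next_update:%H:%M} UTC",
    ])

now = datetime(2024, 5, 2, 14, 15, tzinfo=timezone.utc)
print(internal_update(
    "recovering",
    "customer logins fail and API writes are unavailable",
    "restoring primary database in recovery region",
    now,
))
```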
Post-incident review
A DR event or drill should always produce edits. Review what slowed recovery, what information was stale, what access was missing, and whether the declared RPO/RTO were realistic.
- Capture actual restore duration per major component.
- Record any manual steps that should become automation.
- Update dependency order if a hidden prerequisite appeared during recovery.
- Review whether communication cadence matched stakeholder needs.
- Schedule the next drill while the pain is still memorable.
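The first item in the list above is easier to act on when the timings are compared against the RTO in one place. The sketch below assumes components were restored one after another, so durations add up; parallel restores would need the wall-clock span instead. The component names and timings are hypothetical drill data.

```python
from datetime import timedelta

def rto_report(durations, rto):
    """Summarize per-component restore durations against an RTO target.

    Assumes components were restored sequentially, so durations add up.
    """
    total = sum(durations.values(), timedelta())
    slowest = max(durations, key=durations.get)
    return {"total": total, "within_rto": total <= rto, "slowest": slowest}

# Hypothetical timings from a drill.
durations = {
    "network": timedelta(minutes=20),
    "database": timedelta(minutes=75),
    "application": timedelta(minutes=40),
}
report = rto_report(durations, rto=timedelta(hours=2))
print(report["total"], report["within_rto"], report["slowest"])  # 2:15:00 False database
```

In this example the drill misses a 2-hour RTO by 15 minutes, and the database restore is the obvious place to look first.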
If the service is containerized, the recovery path may depend on image registries, cluster state, and manifest sources. Carry those into the runbook and then continue with Containers 101 for platform-specific recovery dependencies.
Troubleshooting bad runbooks
| Failure mode | What it looks like during an incident | How to fix it now |
|---|---|---|
| No activation criteria | Teams debate whether to declare DR while recovery time slips away | Add explicit triggers, decision owners, and a "not DR" list |
| No last-tested date | Operators discover commands, paths, or owners changed months ago | Track review and test dates in document control and alert on staleness |
| Secrets embedded or missing | Either sensitive data leaks in docs, or recovery stalls because no one knows where credentials live | Reference vault paths or break-glass procedures, never raw secrets |
| Restore steps not dependency-ordered | Application recovery starts before DNS, database, or identity are ready | Rewrite the procedure by dependency chain and restore priority |
| No validation steps | Incident is declared resolved even though the service is only partially working | Add concrete commands, health checks, and data sanity tests |
| No communication owner | Status updates stop while engineers are busy recovering | Assign a communications lead and define update cadence in the runbook |
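The "alert on staleness" fix for a missing last-tested date can be automated with a small check over the runbook text. This sketch assumes the `- Last tested: YYYY-MM-DD` line format from the template above. The 180-day threshold is an arbitrary assumption to adjust per recovery tier, and the function names are hypothetical.

```python
import re
from datetime import date

# Matches the document-control line from the template, e.g. "- Last tested: 2024-01-15".
LAST_TESTED = re.compile(r"^- Last tested:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)

def days_since_last_test(runbook_text, today):
    """Return days since the 'Last tested' date, or None if missing or invalid."""
    match = LAST_TESTED.search(runbook_text)
    if match is None:
        return None  # field absent or still a placeholder
    try:
        tested = date.fromisoformat(match.group(1))
    except ValueError:
        return None  # looks like a date but is not a real one
    return (today - tested).days

def is_stale(runbook_text, today, max_age_days=180):
    """Treat a missing date as stale: an untested runbook is a draft."""
    age = days_since_last_test(runbook_text, today)
    return age is None or age > max_age_days

text = "- Last reviewed: 2024-01-10\n- Last tested: 2024-01-15\n"
print(days_since_last_test(text, date(2024, 7, 1)))  # 168
print(is_stale(text, date(2024, 8, 1)))              # True
```

Running a check like this on a schedule against the runbook repository turns the last-tested field from a note into an alert.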
See also: Backup & Restore, Incident: First 15 Minutes, Change Window Runbook, and SLOs & On-Call.