Backup & Restore
- Use the 3-2-1 rule: at least three copies, on two different media or systems, with one copy off-site.
- Define RPO and RTO per service before choosing a backup schedule.
- State-heavy services usually need application-consistent backups, not just storage snapshots.
- Retention that is never pruned or restore-tested eventually fails silently.
- Encrypt backup data and document how recovery keys are accessed during an incident.
- Alert on backup freshness and age of the last successful restore drill, not just job exit codes.
3-2-1 strategy
Three copies means production data plus at least two backup copies. Two media or systems means not depending on a single storage platform, account, or API path. One off-site means a copy in a different failure domain: another region, another account, or another provider.
# backup inventory.yml
services:
  - name: payments-postgres
    tier: 1
    method: app-consistent + WAL
    rpo: 15m
    rto: 2h
    copies:
      - type: local-object-storage
        location: dc-a / s3-main
      - type: off-site-object-storage
        location: dc-b / s3-dr
      - type: immutable-monthly
        location: cloud-archive / locked-bucket
  - name: media-files
    tier: 2
    method: filesystem snapshot + object replication
    rpo: 4h
    rto: 8h
    copies:
      - type: local-nas
        location: dc-a / nas01
      - type: off-site-object-storage
        location: dc-b / archive-store
The local copy is for speed. The off-site copy is for facility loss. The immutable or offline copy is for ransomware, credential compromise, or an operator deleting the wrong repository.
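The rule can also be checked mechanically. The following is a minimal sketch, assuming each backup copy is written as `failure-domain/target` (a hypothetical format modeled on the `location` fields in the inventory, with the spaces removed). The function name `check_321` is illustrative, not part of any tool.

```shell
# check_321 PRIMARY_DOMAIN COPY...
# Each COPY is "failure-domain/target". Production data counts as the first
# of the three copies, so at least two backup copies must be listed.
check_321() {
    primary_domain=$1
    shift
    if [ "$#" -lt 2 ]; then
        echo "fail: need at least two backup copies"
        return 1
    fi
    # "Two media or systems": the copies must not all point at one target.
    distinct=$(printf '%s\n' "$@" | sort -u | wc -l)
    if [ "$distinct" -lt 2 ]; then
        echo "fail: copies share one target"
        return 1
    fi
    # "One off-site": at least one copy outside the primary failure domain.
    offsite=0
    for copy in "$@"; do
        case $copy in
            "$primary_domain"/*) ;;
            *) offsite=1 ;;
        esac
    done
    if [ "$offsite" -ne 1 ]; then
        echo "fail: no copy outside $primary_domain"
        return 1
    fi
    echo "ok"
}
```

For the `payments-postgres` entry, `check_321 dc-a dc-a/s3-main dc-b/s3-dr cloud-archive/locked-bucket` prints `ok`. The check only compares names; it cannot tell whether two differently named targets share an account or credentials.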
RPO, RTO, and restore priorities
RPO is how much data loss the business can tolerate. RTO is how long the service may stay unavailable. These are restore targets, not storage targets. They should drive schedule, tooling, monitoring, and staffing.
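To see how an RPO constrains a schedule, here is a rough sketch. It assumes a backup reflects the state at the moment it starts and is usable only once it finishes, so the worst-case loss is the interval between backup starts plus the time one backup takes. The helper names `to_seconds` and `meets_rpo` are hypothetical.

```shell
# to_seconds DURATION: convert a value such as 15m, 2h, or 1d to seconds.
to_seconds() {
    value=${1%?}          # numeric part, e.g. "15"
    unit=${1#"$value"}    # trailing unit, e.g. "m"
    case $unit in
        s) echo "$value" ;;
        m) echo $((value * 60)) ;;
        h) echo $((value * 3600)) ;;
        d) echo $((value * 86400)) ;;
        *) echo "unknown unit: $1" >&2; return 1 ;;
    esac
}

# meets_rpo INTERVAL BACKUP_DURATION RPO
# Assumed worst case: failure just before the next backup completes,
# so data loss = INTERVAL + BACKUP_DURATION.
meets_rpo() {
    interval=$(to_seconds "$1") || return 2
    duration=$(to_seconds "$2") || return 2
    rpo=$(to_seconds "$3") || return 2
    worst=$((interval + duration))
    if [ "$worst" -le "$rpo" ]; then
        echo "ok: worst-case loss ${worst}s <= RPO ${rpo}s"
    else
        echo "breach: worst-case loss ${worst}s > RPO ${rpo}s"
        return 1
    fi
}
```

Under this model, a 15-minute schedule does not meet a 15-minute RPO once the backup itself takes any time: `meets_rpo 15m 5m 15m` reports a breach, while `meets_rpo 10m 3m 15m` passes. Continuous log shipping such as WAL archiving changes the arithmetic and is not covered by this sketch.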
| Tier | System | Target RPO | Target RTO | Restore order |
|---|---|---|---|---|
| 1 | Identity, DNS, primary database | 15 minutes | 2 hours | First |
| 2 | API workers, queue broker, internal dashboards | 1 hour | 4 hours | After tier 1 dependencies are healthy |
| 3 | Analytics, batch jobs, historical reporting | 24 hours | 24 hours | Last |
Restore order matters because a fast application restore is useless if DNS, credentials, object storage, or the primary database are still down. Treat restore dependency mapping the same way you treat service dependency mapping in SLOs & On-Call.
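One way to derive a restore order from a dependency map is a topological sort. The sketch below uses the standard `tsort` utility; the dependency pairs are hypothetical examples based loosely on the tiers in the table, not a statement about any real environment.

```shell
# restore_order: read "dependency dependent" pairs on stdin and print one
# service per line so that every dependency appears before its dependents.
restore_order() {
    tsort
}

# Hypothetical dependency pairs: the left service must be healthy
# before the right one is restored.
deps='dns identity
identity primary-db
primary-db queue-broker
primary-db api-workers
queue-broker api-workers
api-workers analytics'

printf '%s\n' "$deps" | restore_order
# dns
# identity
# primary-db
# queue-broker
# api-workers
# analytics
```

GNU `tsort` reports a loop on stderr when the input contains a cycle, which is a useful signal that two services cannot be restored independently of each other.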
Crash-consistent vs application-consistent
Crash-consistent means "as if the machine lost power." This is often fine for stateless nodes and some filesystems, but it may leave databases replaying logs or recovering from half-written pages. Application-consistent means the service cooperated: transactions were flushed, backup hooks ran, or the tool understood the database.
# Filesystem-consistent snapshot for a quiet host
fsfreeze -f /srv/app
lvcreate --snapshot --size 20G --name app_$(date +%F_%H%M) /dev/vg0/app
fsfreeze -u /srv/app
# Application-consistent database backup
pg_dump -Fc -f /srv/backups/appdb-$(date +%F_%H%M).dump appdb
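If the snapshot command fails between the freeze and the thaw, the filesystem stays frozen and writers block. A minimal sketch of a wrapper that always thaws, assuming the same mount point and volume layout as the commands above; `run`, `snapshot_frozen`, and the `DRY_RUN` switch are hypothetical helpers added for illustration.

```shell
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snapshot_frozen MOUNTPOINT LOGICAL_VOLUME SNAPSHOT_NAME
# Freeze, snapshot, and thaw. The thaw runs even when lvcreate fails,
# and the first failure is reported through the return status.
snapshot_frozen() {
    mountpoint=$1
    lv=$2
    name=$3
    run fsfreeze -f "$mountpoint" || return 1
    status=0
    run lvcreate --snapshot --size 20G --name "$name" "$lv" || status=$?
    run fsfreeze -u "$mountpoint" || status=$?
    return "$status"
}
```

Running `DRY_RUN=1 snapshot_frozen /srv/app /dev/vg0/app app_test` prints the three commands in order without touching the system, which makes it easy to review before a real run.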
Retention, off-site copies, and immutability
Retention should match business, legal, and operational needs. Keep short retention for frequent rollback, longer retention for slow-burn corruption, and a small number of long-term archives for audit or catastrophic recovery.
# Example retention on a repository that supports pruning
restic forget \
--keep-hourly 24 \
--keep-daily 7 \
--keep-weekly 4 \
--keep-monthly 12 \
--keep-yearly 3 \
--prune
- Off-site means an independent blast radius: another region, another account, or another provider. "A second bucket in the same compromised account" is not off-site.
- Use immutable storage where available for at least one copy. WORM or object-lock style policies are common choices.
- Document retention exceptions. Finance or compliance systems often need different schedules from application logs.
- Test pruning in a non-production repository first. A bad retention rule can delete your only usable history faster than a disk failure.
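To build intuition for what a rule keeps before testing it on a repository, the sketch below models a keep-daily policy on a plain list of timestamps. It is a simplified illustration, not restic's implementation; for a real repository, `restic forget --dry-run` shows what would be removed without deleting anything. The function name `keep_daily` is hypothetical.

```shell
# keep_daily N: read "YYYY-MM-DD HH:MM" snapshot timestamps on stdin and
# print the newest snapshot from each of the N most recent days that have one.
keep_daily() {
    # Newest first; the first line seen for a given date is that day's newest.
    sort -r | awk -v n="$1" '!seen[$1]++ && ++kept <= n'
}

snapshots='2024-05-01 02:00
2024-05-01 14:00
2024-05-02 02:00
2024-05-03 02:00
2024-05-03 14:00'

printf '%s\n' "$snapshots" | keep_daily 2
# 2024-05-03 14:00
# 2024-05-02 02:00
```

Everything not printed would be a candidate for removal under this rule alone. Real policies combine several rules, and a snapshot survives if any one of them keeps it.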
Encryption and key handling
Encrypt backups in transit and at rest. Just as important: document who can decrypt them during an incident, how break-glass access works, and how keys are rotated.
# /etc/backup/restic.env
RESTIC_REPOSITORY=s3:https://s3.example.net/prod-backups
RESTIC_PASSWORD_FILE=/etc/backup/restic.pass
AWS_SHARED_CREDENTIALS_FILE=/etc/backup/s3.credentials
# Restrict the environment, password, and credential files to their owner
chmod 0600 /etc/backup/restic.env /etc/backup/restic.pass /etc/backup/s3.credentials
# Encrypt an exported dump before moving it off host
age -r age1backuprecipientexample appdb.dump > appdb.dump.age
shred -u appdb.dump
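File permissions on secrets tend to drift after restores, copies, or configuration management changes. A small sketch of a check that could run before each backup job, assuming GNU `stat` (already used elsewhere on this page); the function name `check_private` is hypothetical.

```shell
# check_private FILE...: report any file that is missing or not mode 600.
# Returns non-zero if at least one file fails the check.
check_private() {
    rc=0
    for f in "$@"; do
        mode=$(stat -c %a "$f" 2>/dev/null) || {
            echo "missing: $f"
            rc=1
            continue
        }
        if [ "$mode" != "600" ]; then
            echo "too open: $f (mode $mode)"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_private /etc/backup/restic.env /etc/backup/restic.pass /etc/backup/s3.credentials` prints nothing and returns 0 when all three files are mode 600. The check looks only at the mode bits, not at ownership or ACLs.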
Restore drills and validation
A successful backup job proves only that bytes were written somewhere. A restore drill proves the data can be recovered, permissions are correct, dependencies are known, and operators can complete the work inside the target RTO.
- Restore to an isolated host or network, never into live production.
- Time the full workflow: data transfer, service startup, validation, DNS or endpoint changes.
- Run service-specific validation, not just "process started". Check schema version, row counts, health endpoints, and credentials.
- Capture gaps in the runbook immediately. The value of a drill is the edit that happens after it.
# Example restore drill skeleton
set -euo pipefail
RESTORE_ROOT=/srv/restore-drill/$(date +%F)
mkdir -p "$RESTORE_ROOT"
restic restore latest --target "$RESTORE_ROOT"
# Verify from inside the restore root so relative paths in the manifest resolve
(cd "$RESTORE_ROOT" && sha256sum -c checksums.sha256)
systemctl start app-restore-check.service
curl -fsS http://127.0.0.1:8080/healthz
Use DR Runbook Template to turn each drill into a repeatable recovery procedure instead of tribal memory.
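The drill skeleton assumes a `checksums.sha256` manifest exists inside the backed-up data. One possible way to produce and verify such a manifest is sketched below, assuming GNU coreutils and findutils; the function names `write_manifest` and `verify_manifest` are hypothetical.

```shell
# write_manifest DIR: record SHA-256 sums for every file under DIR in
# DIR/checksums.sha256, using paths relative to DIR.
write_manifest() {
    (
        cd "$1" || exit 1
        find . -type f ! -name checksums.sha256 -print0 |
            sort -z |
            xargs -0 -r sha256sum > checksums.sha256
    )
}

# verify_manifest DIR: re-check every file listed in DIR/checksums.sha256.
# Prints only failures; returns non-zero if any file is missing or changed.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c checksums.sha256 )
}
```

Run `write_manifest` on the source tree just before the backup, and `verify_manifest` on the restored tree during the drill. A manifest only proves the files match what was backed up; it says nothing about whether the application can use them.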
Monitoring backup freshness
Monitor age of the last known-good backup and the last restore drill. Alert before you breach the declared RPO. A backup system that quietly stopped yesterday is an incident today, not tomorrow.
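#!/usr/bin/env bash
# Export the time of the last successful backup for node_exporter's textfile collector
set -euo pipefail
stamp=$(stat -c %Y /srv/backups/postgres/latest.ok)
# Metric name and label match the expression used in the alerting rule
cat > /var/lib/node_exporter/textfile_collector/backup.prom <<EOF
backup_latest_success_unixtime{scope="prod"} ${stamp}
EOF
# Prometheus alerting rule
groups:
  - name: backup-freshness
    rules:
      - alert: BackupStale
        expr: time() - backup_latest_success_unixtime{scope="prod"} > 3600
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Backup is older than the allowed RPO"
          description: "Backup age is {{ $value }} seconds for {{ $labels.job }}."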
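Alerting before the RPO is breached implies a warning threshold below the RPO itself. A minimal sketch of that classification, assuming a warning at 80% of the RPO by default; the function name `freshness_state` and the threshold are illustrative choices, not values from this page.

```shell
# freshness_state AGE_SECONDS RPO_SECONDS [WARN_PERCENT]
# Prints "stale" when the backup is older than the RPO, "warn" when its age
# has reached WARN_PERCENT of the RPO (default 80), and "ok" otherwise.
freshness_state() {
    age=$1
    rpo=$2
    warn_pct=${3:-80}
    if [ "$age" -gt "$rpo" ]; then
        echo stale
    elif [ $((age * 100)) -ge $((rpo * warn_pct)) ]; then
        echo warn
    else
        echo ok
    fi
}
```

With a 15-minute RPO, `freshness_state 600 900` prints `ok`, `freshness_state 750 900` prints `warn`, and `freshness_state 901 900` prints `stale`. The age can be computed from a marker file, for example as the current `date +%s` minus the `stat -c %Y` of the last success file.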
Surface freshness on dashboards too. Graphs in Grafana Basics make stale backups visible during normal operations instead of only during incidents.
Troubleshooting and operator mistakes
| Symptom or mistake | Why it hurts | What to do instead |
|---|---|---|
| All copies live in one cloud account or region | Credential loss or regional outage wipes both primary and backup paths | Keep at least one copy in a separate failure domain with separate access control |
| Backup job is green, restore fails immediately | The job only checked write success, not data usability or metadata completeness | Schedule restore drills and validate application start, schema, and credentials |
| Storage snapshot restores but the database is corrupt | The backup was crash-consistent, not application-consistent | Use database-aware tooling or replay logs/WAL/binlogs as designed |
| Backups exist but nobody can decrypt them | Keys, passwords, or IAM paths were not part of DR planning | Document break-glass access and test decryption during drills |
| Retention keeps growing until the backup target fills | Pruning was never enabled or never verified | Define explicit retention policy, automate prune, and monitor repository size |
| Tier 3 systems are restored before identity or primary data stores | Operators are restoring by familiarity, not by dependency | Publish a restore priority list tied to service dependencies and business impact |
See also: Borg & Borgmatic for file-level deduplicated backups, Postgres Backup, MySQL Backup, DR Runbook Template, and Change Window Runbook.