Backup & Restore

Backups are only useful if operators can restore the right data, at the right time, in the right order. Plan around restore objectives, not tool defaults.

If you only remember six things
  • Use the 3-2-1 rule: at least three copies, on two different media or systems, with one copy off-site.
  • Define RPO and RTO per service before choosing a backup schedule.
  • State-heavy services usually need application-consistent backups, not just storage snapshots.
  • Retention without pruning, and backups without restore testing, eventually fail silently.
  • Encrypt backup data and document how recovery keys are accessed during an incident.
  • Alert on backup freshness and age of the last successful restore drill, not just job exit codes.

3-2-1 strategy

Three copies means production data plus at least two backup copies. Two media or systems means do not trust one storage platform, one account, or one API path. One off-site means a different failure domain: another region, another account, or another provider.

# backup inventory.yml
services:
  - name: payments-postgres
    tier: 1
    method: app-consistent + WAL
    rpo: 15m
    rto: 2h
    copies:
      - type: local-object-storage
        location: dc-a / s3-main
      - type: off-site-object-storage
        location: dc-b / s3-dr
      - type: immutable-monthly
        location: cloud-archive / locked-bucket

  - name: media-files
    tier: 2
    method: filesystem snapshot + object replication
    rpo: 4h
    rto: 8h
    copies:
      - type: local-nas
        location: dc-a / nas01
      - type: off-site-object-storage
        location: dc-b / archive-store

The local copy is for speed. The off-site copy is for facility loss. The immutable or offline copy is for ransomware, credential compromise, or an operator deleting the wrong repository.
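
As a rough sketch, the 3-2-1 rule can be checked mechanically once every copy is described by its medium and its failure domain. The check_321 helper and the medium names (block, object, archive) below are hypothetical and not part of any tool mentioned here; the failure domains are taken from the payments-postgres entry above, and the medium of the production copy is an assumption.

```shell
# check_321: read "medium failure_domain" pairs on stdin, one line per copy
# (production included), and report whether the set satisfies 3-2-1.
# "Off-site" is approximated as "at least two distinct failure domains".
check_321() {
  awk '
    NF >= 2 {
      copies++
      if (!($1 in media))   { media[$1] = 1;   nmedia++ }
      if (!($2 in domains)) { domains[$2] = 1; ndomains++ }
    }
    END {
      ok = (copies >= 3 && nmedia >= 2 && ndomains >= 2)
      printf "copies=%d media=%d domains=%d %s\n", copies, nmedia, ndomains, (ok ? "OK" : "FAIL")
      exit (ok ? 0 : 1)
    }'
}

# payments-postgres: production volume plus the three copies from the inventory.
printf '%s\n' \
  'block dc-a' \
  'object dc-a' \
  'object dc-b' \
  'archive cloud-archive' | check_321
# prints: copies=4 media=3 domains=3 OK
```

The same check fails for a service whose only backup sits next to production, which is the case the rule is meant to catch.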

RPO, RTO, and restore priorities

RPO is how much data loss the business can tolerate. RTO is how long the service may stay unavailable. These are restore targets, not storage targets. They should drive schedule, tooling, monitoring, and staffing.
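
One way to see how RPO drives the schedule: if a backup starts every INTERVAL and takes DURATION to finish, a failure just before the next backup completes leaves the newest restorable state roughly INTERVAL + DURATION old. The sketch below encodes that arithmetic. The helper names and the example timings are illustrative assumptions, not measurements.

```shell
# to_seconds: convert a duration such as 15m, 2h, or 1d to seconds.
to_seconds() (
  value=${1%[smhd]}
  case $1 in
    *s) echo "$value" ;;
    *m) echo $((value * 60)) ;;
    *h) echo $((value * 3600)) ;;
    *d) echo $((value * 86400)) ;;
    *)  echo "unsupported duration: $1" >&2; exit 1 ;;
  esac
)

# meets_rpo INTERVAL DURATION RPO
# Succeeds when the worst-case data loss (INTERVAL + DURATION) fits the RPO.
meets_rpo() (
  interval=$(to_seconds "$1") || exit 2
  duration=$(to_seconds "$2") || exit 2
  rpo=$(to_seconds "$3") || exit 2
  [ $((interval + duration)) -le "$rpo" ]
)

# Illustrative numbers only: a dump every 15 minutes that takes 3 minutes
# cannot meet a 15-minute RPO, while a 5-minute cycle taking 1 minute can.
meets_rpo 15m 3m 15m || echo "15m interval misses a 15m RPO"
meets_rpo 5m 1m 15m && echo "5m interval fits a 15m RPO"
```

This is why a tight RPO usually pushes a tier 1 database toward continuous log shipping rather than more frequent full dumps.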

Tier | System                                         | Target RPO | Target RTO | Restore order
1    | Identity, DNS, primary database                | 15 minutes | 2 hours    | First
2    | API workers, queue broker, internal dashboards | 1 hour     | 4 hours    | After tier 1 dependencies are healthy
3    | Analytics, batch jobs, historical reporting    | 24 hours   | 24 hours   | Last

Restore order matters because a fast application restore is useless if DNS, credentials, object storage, or the primary database are still down. Treat restore dependency mapping the same way you treat service dependency mapping in SLOs & On-Call.
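
A restore sequence can be derived from the dependency map rather than from memory. The sketch below feeds "dependency dependent" pairs to the standard tsort utility, which prints a topological order. The service names and the dependency chain are hypothetical examples, and restore_order is only a thin illustrative wrapper.

```shell
# restore_order: read "dependency dependent" pairs on stdin and print one
# valid restore sequence, dependencies first. tsort reports a loop on stderr
# if the pairs contain a circular dependency.
restore_order() {
  tsort
}

# Hypothetical dependency chain: each service needs the one before it.
printf '%s\n' \
  'dns identity' \
  'identity primary-db' \
  'primary-db api-workers' \
  'api-workers analytics' | restore_order
# prints, one per line: dns identity primary-db api-workers analytics
```

When several services are independent of each other, tsort picks one of the valid orders, so treat the output as a starting point for the published priority list.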

Crash-consistent vs application-consistent

Crash-consistent means "as if the machine lost power." This is often fine for stateless nodes and some filesystems, but it may leave databases replaying logs or recovering from half-written pages. Application-consistent means the service cooperated: transactions were flushed, backup hooks ran, or the tool understood the database.

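# Filesystem-consistent snapshot for a quiet host
# LVM (device-mapper) freezes the filesystem itself while the snapshot is
# created, so no separate fsfreeze call is needed here. Use fsfreeze -f / -u
# only around snapshots taken outside device-mapper.
lvcreate --snapshot --size 20G --name app_$(date +%F_%H%M) /dev/vg0/app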

# Application-consistent database backup
pg_dump -Fc -f /srv/backups/appdb-$(date +%F_%H%M).dump appdb

Operator trap: a VM snapshot is not automatically a database backup. For database-specific methods and point-in-time recovery, see Postgres Backup and MySQL Backup.
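
When a service offers its own quiesce and resume hooks, the important property is that the resume step always runs, even if the snapshot fails. The wrapper below is a minimal sketch of that pattern. run_quiesced, app_quiesce, and app_resume are hypothetical names, and the stub hooks only echo what a real hook would do. It does not protect against the script itself being killed.

```shell
# run_quiesced QUIESCE_CMD RESUME_CMD SNAPSHOT_CMD [ARGS...]
# Quiesce the service, run the snapshot command, then always resume.
# Returns the first non-zero status from the snapshot or resume step.
run_quiesced() {
  quiesce_cmd=$1
  resume_cmd=$2
  shift 2
  "$quiesce_cmd" || return 1        # never snapshot an unquiesced service
  snap_rc=0
  "$@" || snap_rc=$?                # remember a snapshot failure
  "$resume_cmd" || snap_rc=$?       # resume even when the snapshot failed
  return "$snap_rc"
}

# Stub hooks standing in for real application commands.
app_quiesce() { echo "flush and pause writes"; }
app_resume()  { echo "resume writes"; }

run_quiesced app_quiesce app_resume echo "take snapshot"
```

In practice the snapshot command would be whatever the storage layer provides, and the hooks would call the application's own flush or backup-mode commands.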

Retention, off-site copies, and immutability

Retention should match business, legal, and operational needs. Keep short retention for frequent rollback, longer retention for slow-burn corruption, and a small number of long-term archives for audit or catastrophic recovery.

# Example retention on a repository that supports pruning
restic forget \
  --keep-hourly 24 \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --keep-yearly 3 \
  --prune
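
For capacity planning it helps to know the most snapshots a policy can keep. Because one snapshot can satisfy several keep rules at once, the sum of the keep counts is an upper bound, not an exact figure. The max_retained helper below is a hypothetical illustration of that arithmetic for the policy above.

```shell
# max_retained COUNT...
# Sum the --keep-* counts to get an upper bound on retained snapshots.
max_retained() {
  total=0
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# hourly daily weekly monthly yearly, as in the restic example above
max_retained 24 7 4 12 3
# prints: 50
```

To see what a policy would actually remove before committing to it, restic forget also accepts --dry-run.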

Encryption and key handling

Encrypt backups in transit and at rest. Just as important: document who can decrypt them during an incident, how break-glass access works, and how keys are rotated.

# /etc/backup/restic.env
RESTIC_REPOSITORY=s3:https://s3.example.net/prod-backups
RESTIC_PASSWORD_FILE=/etc/backup/restic.pass
AWS_SHARED_CREDENTIALS_FILE=/etc/backup/s3.credentials
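
# Run once as a shell command (not part of restic.env): restrict the env file
# and the secrets it references to the owning user
chmod 0600 /etc/backup/restic.env /etc/backup/restic.pass /etc/backup/s3.credentials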

# Encrypt an exported dump before moving it off host
age -r age1backuprecipientexample appdb.dump > appdb.dump.age
shred -u appdb.dump

Keys belong in the disaster recovery scope too. If only one admin knows where the decryption material lives, your backup process has a people-shaped single point of failure.
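
File permissions on key material tend to drift, so a drill or a cron job can verify them. The sketch below assumes GNU stat (the same stat -c form used elsewhere on this page); check_secret_perms is a hypothetical helper name.

```shell
# check_secret_perms FILE...
# Fail if any file is missing or grants any permission to group or others.
check_secret_perms() {
  bad=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "MISSING $f"
      bad=1
      continue
    fi
    mode=$(stat -c %a "$f")          # octal mode, e.g. 600 or 644
    case $mode in
      0|*00) ;;                      # owner-only (or no access at all)
      *) echo "TOO OPEN $f ($mode)"; bad=1 ;;
    esac
  done
  return "$bad"
}
```

A typical call would pass the three files from the example above: check_secret_perms /etc/backup/restic.env /etc/backup/restic.pass /etc/backup/s3.credentials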

Restore drills and validation

A successful backup job proves only that bytes were written somewhere. A restore drill proves the data can be recovered, permissions are correct, dependencies are known, and operators can complete the work inside the target RTO.

  1. Restore to an isolated host or network, never into live production.
  2. Time the full workflow: data transfer, service startup, validation, DNS or endpoint changes.
  3. Run service-specific validation, not just "process started". Check schema version, row counts, health endpoints, and credentials.
  4. Capture gaps in the runbook immediately. The value of a drill is the edit that happens after it.

# Example restore drill skeleton
set -euo pipefail

RESTORE_ROOT=/srv/restore-drill/$(date +%F)
mkdir -p "$RESTORE_ROOT"
restic restore latest --target "$RESTORE_ROOT"

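# Paths inside checksums.sha256 are assumed to be relative to the restore root,
# so verify from there instead of from the current directory.
(cd "$RESTORE_ROOT" && sha256sum -c checksums.sha256)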
systemctl start app-restore-check.service
curl -fsS http://127.0.0.1:8080/healthz

Use DR Runbook Template to turn each drill into a repeatable recovery procedure instead of tribal memory.
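
To make "inside the target RTO" measurable, a drill can be wrapped in a timer. The sketch below is one minimal way to do that; drill_within_rto is a hypothetical helper and the command it wraps would be your own drill script.

```shell
# drill_within_rto RTO_SECONDS COMMAND [ARGS...]
# Run the drill command, print how long it took, and succeed only if the
# command succeeded and finished within the RTO.
drill_within_rto() {
  rto=$1
  shift
  start=$(date +%s)
  "$@" || { echo "drill FAILED"; return 1; }
  elapsed=$(( $(date +%s) - start ))
  echo "drill took ${elapsed}s (RTO ${rto}s)"
  [ "$elapsed" -le "$rto" ]
}
```

For a tier 1 service with a 2 hour RTO this would be called as drill_within_rto 7200 followed by the drill script, and the printed duration belongs in the runbook next to the target.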

Monitoring backup freshness

Monitor age of the last known-good backup and the last restore drill. Alert before you breach the declared RPO. A backup system that quietly stopped yesterday is an incident today, not tomorrow.

#!/usr/bin/env bash
set -euo pipefail

stamp=$(stat -c %Y /srv/backups/postgres/latest.ok)
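# Metric name and label match the alert expression in the rule below.
cat > /var/lib/node_exporter/textfile_collector/backup.prom <<EOF
backup_latest_success_unixtime{scope="prod"} ${stamp}
EOF

# Prometheus alerting rule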
groups:
  - name: backup-freshness
    rules:
      - alert: BackupStale
        expr: time() - backup_latest_success_unixtime{scope="prod"} > 3600
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Backup is older than the allowed RPO"
          description: "Backup age is {{ $value }} seconds for {{ $labels.job }}."
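
The same comparison the alert expression makes can be run locally, for example from a pre-flight check before a change window. This is a sketch only: backup_is_stale is a hypothetical helper, and the optional third argument exists so the check can be exercised with a fixed clock.

```shell
# backup_is_stale STAMP_EPOCH MAX_AGE_SECONDS [NOW_EPOCH]
# Print the backup age and succeed (exit 0) when the backup is older than
# the limit, mirroring "time() - metric > limit" in the alert rule.
backup_is_stale() {
  now=${3:-$(date +%s)}
  age=$((now - $1))
  echo "age=${age}s limit=${2}s"
  [ "$age" -gt "$2" ]
}

# Fixed timestamps for illustration: the backup is 4000 seconds old.
backup_is_stale 1000 3600 5000 && echo "stale"
# prints: age=4000s limit=3600s
# prints: stale
```

In a real check the first argument would come from the marker file, as in the stat call above, and the limit from the service's declared RPO.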

Surface freshness on dashboards too. Graphs in Grafana Basics make stale backups visible during normal operations instead of only during incidents.

Troubleshooting and operator mistakes

Symptom or mistake | Why it hurts | What to do instead
All copies live in one cloud account or region | Credential loss or regional outage wipes both primary and backup paths | Keep at least one copy in a separate failure domain with separate access control
Backup job is green, restore fails immediately | The job only checked write success, not data usability or metadata completeness | Schedule restore drills and validate application start, schema, and credentials
Storage snapshot restores but the database is corrupt | The backup was crash-consistent, not application-consistent | Use database-aware tooling or replay logs/WAL/binlogs as designed
Backups exist but nobody can decrypt them | Keys, passwords, or IAM paths were not part of DR planning | Document break-glass access and test decryption during drills
Retention keeps growing until the backup target fills | Pruning was never enabled or never verified | Define explicit retention policy, automate prune, and monitor repository size
Tier 3 systems are restored before identity or primary data stores | Operators are restoring by familiarity, not by dependency | Publish a restore priority list tied to service dependencies and business impact

See also: Borg & Borgmatic for file-level deduplicated backups, Postgres Backup, MySQL Backup, DR Runbook Template, and Change Window Runbook.