Backup & Restore

Backups are only useful if operators can restore the right data, at the right time, in the right order. Plan around restore objectives, not tool defaults.

If you only remember six things

Use the 3-2-1 rule: at least three copies, on two different media or systems, with one copy off-site.
Define RPO and RTO per service before choosing a backup schedule.
State-heavy services usually need application-consistent backups, not just storage snapshots.
Retention without pruning and restore testing eventually becomes silent failure.
Encrypt backup data and document how recovery keys are accessed during an incident.
Alert on backup freshness and on the age of the last successful restore drill, not just job exit codes.

On this page

3-2-1 strategy
RPO, RTO, and restore priorities
Crash-consistent vs application-consistent
Retention, off-site copies, and immutability
Encryption and key handling
Restore drills and validation
Monitoring backup freshness
Troubleshooting and operator mistakes

3-2-1 strategy

Three copies means production data plus at least two backup copies. Two media or systems means do not trust one storage platform, one account, or one API path. One off-site means a different failure domain: another region, another account, or another provider.

# backup inventory.yml
services:
  - name: payments-postgres
    tier: 1
    method: app-consistent + WAL
    rpo: 15m
    rto: 2h
    copies:
      - type: local-object-storage
        location: dc-a / s3-main
      - type: off-site-object-storage
        location: dc-b / s3-dr
      - type: immutable-monthly
        location: cloud-archive / locked-bucket

  - name: media-files
    tier: 2
    method: filesystem snapshot + object replication
    rpo: 4h
    rto: 8h
    copies:
      - type: local-nas
        location: dc-a / nas01
      - type: off-site-object-storage
        location: dc-b / archive-store

The local copy is for speed. The off-site copy is for facility loss. The immutable or offline copy is for ransomware, credential compromise, or an operator deleting the wrong repository.
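As a quick sanity check, the rule can be written as three comparisons. The sketch below is illustrative only: `meets_321` is a hypothetical helper, and the counts are assumed to be read off the inventory by hand rather than parsed from the YAML.

```shell
#!/usr/bin/env bash
# Succeeds when the counts satisfy 3-2-1.
#   $1 = total copies, production included
#   $2 = distinct media or storage systems holding those copies
#   $3 = copies held in a separate failure domain (off-site)
meets_321() {
  local copies=$1 systems=$2 offsite=$3
  (( copies >= 3 && systems >= 2 && offsite >= 1 ))
}

# Example counts (assumed): production plus two backups, two systems,
# one copy off-site.
if meets_321 3 2 1; then
  echo "3-2-1 satisfied"
else
  echo "3-2-1 NOT satisfied" >&2
fi
```

A service with production data and a single local backup would be checked as `meets_321 2 2 0` and fail on both the copy count and the off-site count.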

RPO, RTO, and restore priorities

RPO is how much data loss the business can tolerate. RTO is how long the service may stay unavailable. These are restore targets, not storage targets. They should drive schedule, tooling, monitoring, and staffing.
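One way to make "RPO drives schedule" concrete is to compare a proposed backup interval with the declared RPO. This is a minimal sketch: `to_seconds` and `interval_meets_rpo` are hypothetical helpers, and the `15m` / `2h` duration format is assumed from the inventory example rather than taken from any standard.

```shell
#!/usr/bin/env bash
# Convert a duration such as "15m", "4h", or "1d" into seconds.
to_seconds() {
  local value=${1%[smhd]} unit=${1: -1}
  case "$unit" in
    s) echo "$value" ;;
    m) echo $(( value * 60 )) ;;
    h) echo $(( value * 3600 )) ;;
    d) echo $(( value * 86400 )) ;;
    *) echo "unknown unit in '$1'" >&2; return 1 ;;
  esac
}

# Succeeds when backups run at least as often as the RPO demands.
#   $1 = backup interval, $2 = declared RPO
interval_meets_rpo() {
  local interval rpo
  interval=$(to_seconds "$1") || return 2
  rpo=$(to_seconds "$2") || return 2
  (( interval <= rpo ))
}

interval_meets_rpo 5m 15m && echo "5m schedule fits a 15m RPO"
interval_meets_rpo 1h 15m || echo "1h schedule breaches a 15m RPO"
```

The interval is only a lower bound on worst-case loss: a failure just before the next run loses a full interval plus whatever the running job had not yet copied, so leave headroom.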

Tier	System	Target RPO	Target RTO	Restore order
1	Identity, DNS, primary database	15 minutes	2 hours	First
2	API workers, queue broker, internal dashboards	1 hour	4 hours	After tier 1 dependencies are healthy
3	Analytics, batch jobs, historical reporting	24 hours	24 hours	Last

Restore order matters because a fast application restore is useless if DNS, credentials, object storage, or the primary database are still down. Treat restore dependency mapping the same way you treat service dependency mapping in SLOs & On-Call.
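If dependencies are written down as pairs, a restore order can be derived instead of remembered. The sketch below leans on the standard `tsort` utility; the service names and the pair format are assumptions for illustration, loosely based on the tier table.

```shell
#!/usr/bin/env bash
# Read "restore-first restore-after" pairs on stdin and print one valid
# restore order, one service per line.
restore_plan() {
  tsort
}

# Hypothetical dependency pairs: the left side must be healthy before the
# right side is restored.
restore_plan <<'EOF'
dns identity
identity primary-database
primary-database queue-broker
queue-broker api-workers
api-workers analytics
EOF
```

When the pairs branch, several orders are valid and `tsort` prints one of them. GNU `tsort` also complains on stderr if the pairs contain a cycle, which usually signals a dependency map that needs fixing before an incident.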

Crash-consistent vs application-consistent

Crash-consistent means "as if the machine lost power." This is often fine for stateless nodes and some filesystems, but it may leave databases replaying logs or recovering from half-written pages. Application-consistent means the service cooperated: transactions were flushed, backup hooks ran, or the tool understood the database.

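# Filesystem-consistent snapshot for a quiet host.
# Note: LVM freezes the filesystem itself when it creates a snapshot, so the
# explicit fsfreeze calls matter mostly for snapshot mechanisms that do not.
fsfreeze -f /srv/app
lvcreate --snapshot --size 20G --name "app_$(date +%F_%H%M)" /dev/vg0/app
fsfreeze -u /srv/app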

# Application-consistent database backup
pg_dump -Fc -f /srv/backups/appdb-$(date +%F_%H%M).dump appdb

Operator trap: a VM snapshot is not automatically a database backup. For database-specific methods and point-in-time recovery, see Postgres Backup and MySQL Backup.

Retention, off-site copies, and immutability

Retention should match business, legal, and operational needs. Keep short retention for frequent rollback, longer retention for slow-burn corruption, and a small number of long-term archives for audit or catastrophic recovery.

# Example retention on a repository that supports pruning
restic forget \
  --keep-hourly 24 \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 12 \
  --keep-yearly 3 \
  --prune

Off-site means an independent blast radius: another region, another account, or another provider. "A second bucket in the same compromised account" is not off-site.
Use immutable storage where available for at least one copy. WORM or object-lock style policies are common choices.
Document retention exceptions. Finance or compliance systems often need different schedules from application logs.
Test pruning in a non-production repository first. A bad retention rule can delete your only usable history faster than a disk failure.
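Before trusting a retention rule, it helps to see which snapshots it would keep. The sketch below simulates a single keep-daily style rule on a list of timestamps. `keep_daily` is a hypothetical helper that mimics the documented idea of the rule, not restic's implementation; against a real repository, `restic forget --dry-run` with the same `--keep-*` flags is the safer preview.

```shell
#!/usr/bin/env bash
# From ISO-8601 timestamps on stdin (one per line), keep the newest
# snapshot for each of the most recent N days that have any snapshot.
keep_daily() {
  local n=$1
  sort -r | awk -v n="$n" '
    { day = substr($0, 1, 10) }        # YYYY-MM-DD prefix
    !(day in seen) && kept < n {       # newest entry of a day not yet seen
      seen[day] = 1; kept++; print
    }'
}

# Illustrative timestamps: two snapshots on the 3rd, one on each earlier day.
printf '%s\n' \
  2024-05-03T12:00:00 2024-05-03T01:00:00 \
  2024-05-02T01:00:00 2024-05-01T01:00:00 | keep_daily 2
```

Anything the function does not print is what the rule would remove, which makes an over-aggressive value easy to spot before it touches real history.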

Encryption and key handling

Encrypt backups in transit and at rest. Just as important: document who can decrypt them during an incident, how break-glass access works, and how keys are rotated.

# /etc/backup/restic.env
RESTIC_REPOSITORY=s3:https://s3.example.net/prod-backups
RESTIC_PASSWORD_FILE=/etc/backup/restic.pass
AWS_SHARED_CREDENTIALS_FILE=/etc/backup/s3.credentials

chmod 0600 /etc/backup/restic.env /etc/backup/restic.pass /etc/backup/s3.credentials

# Encrypt an exported dump before moving it off host
age -r age1backuprecipientexample appdb.dump > appdb.dump.age
shred -u appdb.dump

Keys belong in the disaster recovery scope too. If only one admin knows where the decryption material lives, your backup process has a people-shaped single point of failure.
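Permissions on secret files tend to drift after manual edits or restores. A small check like the one below can run before each backup job. `check_secret_perms` is a hypothetical helper, and it assumes GNU `stat`.

```shell
#!/usr/bin/env bash
# Fail if any listed file is missing or readable by group or others.
# Accepts modes 600 and 400 only.
check_secret_perms() {
  local rc=0 f mode
  for f in "$@"; do
    mode=$(stat -c %a "$f") || { rc=1; continue; }
    if [ "$mode" != "600" ] && [ "$mode" != "400" ]; then
      echo "too permissive: $f (mode $mode)" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A typical call would be `check_secret_perms /etc/backup/restic.env /etc/backup/restic.pass /etc/backup/s3.credentials`, aborting the job on a non-zero exit.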

Restore drills and validation

A successful backup job proves only that bytes were written somewhere. A restore drill proves the data can be recovered, permissions are correct, dependencies are known, and operators can complete the work inside the target RTO.

Restore to an isolated host or network, never into live production.
Time the full workflow: data transfer, service startup, validation, DNS or endpoint changes.
Run service-specific validation, not just "process started". Check schema version, row counts, health endpoints, and credentials.
Capture gaps in the runbook immediately. The value of a drill is the edit that happens after it.

# Example restore drill skeleton
set -euo pipefail

RESTORE_ROOT=/srv/restore-drill/$(date +%F)
mkdir -p "$RESTORE_ROOT"
restic restore latest --target "$RESTORE_ROOT"

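# sha256sum resolves the paths in the checksum file relative to the current
# directory, so verify from inside the restore root.
(cd "$RESTORE_ROOT" && sha256sum -c checksums.sha256)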
systemctl start app-restore-check.service
curl -fsS http://127.0.0.1:8080/healthz
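The skeleton above does not record how long the restore took. One possible way to compare a drill against the RTO is to wrap the workflow in a timer. `run_within_rto` is a hypothetical helper; it measures whole seconds using bash's `SECONDS` variable.

```shell
#!/usr/bin/env bash
# Run a command, report elapsed wall-clock seconds on stderr, and fail if
# the command fails or takes longer than the budget.
#   $1 = budget in seconds, remaining arguments = command to run
run_within_rto() {
  local budget=$1; shift
  local start=$SECONDS
  "$@" || return $?
  local elapsed=$(( SECONDS - start ))
  echo "elapsed=${elapsed}s budget=${budget}s" >&2
  (( elapsed <= budget ))
}
```

For a two-hour RTO, a drill script saved under an assumed name could be run as `run_within_rto 7200 ./restore-drill.sh`. A non-zero exit then means either the restore failed or it finished too late, and both belong in the drill notes.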

Use DR Runbook Template to turn each drill into a repeatable recovery procedure instead of tribal memory.

Monitoring backup freshness

Monitor the age of the last known-good backup and of the last restore drill. Alert before you breach the declared RPO. A backup system that quietly stopped yesterday is an incident today, not tomorrow.

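#!/usr/bin/env bash
set -euo pipefail

# Publish the mtime of the success marker for node_exporter's textfile collector.
stamp=$(stat -c %Y /srv/backups/postgres/latest.ok)
cat > /var/lib/node_exporter/textfile_collector/backup.prom <<EOF
# TYPE backup_latest_success_unixtime gauge
backup_latest_success_unixtime{scope="prod"} ${stamp}
EOF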

groups:
  - name: backup-freshness
    rules:
      - alert: BackupStale
        expr: time() - backup_latest_success_unixtime{scope="prod"} > 3600
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Backup is older than the allowed RPO"
          description: "Backup age is {{ $value }} seconds for {{ $labels.job }}."
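The same freshness comparison can be run by hand or from cron on hosts that are not scraped. This is a minimal sketch, assuming GNU `stat` and a marker file that the backup job touches only on success; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the age in seconds of a success marker file.
backup_age_seconds() {
  local stamp
  stamp=$(stat -c %Y "$1") || return 1
  echo $(( $(date +%s) - stamp ))
}

# Succeeds when the marker is no older than the allowed age.
#   $1 = marker file, $2 = maximum age in seconds
backup_is_fresh() {
  local age
  age=$(backup_age_seconds "$1") || return 2
  (( age <= $2 ))
}
```

For example, `backup_is_fresh /srv/backups/postgres/latest.ok 3600 || echo "backup stale" >&2` mirrors the one-hour threshold in the alert rule. A missing marker returns 2 rather than 1, so "never succeeded" can be told apart from "stale".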

Surface freshness on dashboards too. Graphs in Grafana Basics make stale backups visible during normal operations instead of only during incidents.

Troubleshooting and operator mistakes

Symptom or mistake	Why it hurts	What to do instead
All copies live in one cloud account or region	Credential loss or regional outage wipes both primary and backup paths	Keep at least one copy in a separate failure domain with separate access control
Backup job is green, restore fails immediately	The job only checked write success, not data usability or metadata completeness	Schedule restore drills and validate application start, schema, and credentials
Storage snapshot restores but the database is corrupt	The backup was crash-consistent, not application-consistent	Use database-aware tooling or replay logs/WAL/binlogs as designed
Backups exist but nobody can decrypt them	Keys, passwords, or IAM paths were not part of DR planning	Document break-glass access and test decryption during drills
Retention keeps growing until the backup target fills	Pruning was never enabled or never verified	Define explicit retention policy, automate prune, and monitor repository size
Tier 3 systems are restored before identity or primary data stores	Operators are restoring by familiarity, not by dependency	Publish a restore priority list tied to service dependencies and business impact

See also: Borg & Borgmatic for file-level deduplicated backups, Postgres Backup, MySQL Backup, DR Runbook Template, and Change Window Runbook.

    