Postgres Backup

Use logical dumps for portability, physical backups plus WAL for fast full-cluster recovery, and validate restores often enough that point-in-time recovery is routine instead of guesswork.

If you only remember six things
  • pg_dump is for logical exports; pg_basebackup is for physical cluster backups. They solve different problems.
  • Back up roles and tablespace metadata separately with pg_dumpall --globals-only.
  • Point-in-time recovery requires both a usable base backup and complete WAL archiving.
  • Always verify a physical backup with pg_verifybackup and a logical backup with a real restore.
  • Ownership and mode on restored data directories matter: postgres:postgres and 0700 are the usual minimum.
  • Monitor backup age, WAL archiving health, and restore test results. Silent archive failure is a real outage precursor.

Logical backups with pg_dump

pg_dump is ideal when you want schema-level portability, a single database export, or selective restore. It is not the fastest way to recover a large cluster, but it is excellent for migrations, developer restores, and application-level backups.

# One database, custom format, compressed
pg_dump \
  -Fc \
  -Z 6 \
  -f /srv/backups/postgres/appdb-$(date +%F_%H%M).dump \
  appdb

# Cluster-wide globals: roles, tablespaces, grants
pg_dumpall --globals-only \
  > /srv/backups/postgres/globals-$(date +%F_%H%M).sql

# Inspect the archive contents before you need it
pg_restore --list /srv/backups/postgres/appdb-2026-04-23_0100.dump

Use custom format (-Fc) unless you have a specific reason not to. It supports parallel restore with pg_restore -j and selective object recovery.

Large clusters usually need physical backups plus WAL for primary recovery. Keep logical dumps anyway for schema diffing, object-level restore, and cross-version migrations.

Physical backups with pg_basebackup

pg_basebackup copies the data directory in a consistent state while Postgres is running. This is the built-in foundation for full-cluster restore, replica bootstrap, and point-in-time recovery.

pg_basebackup \
  -h primary.db.internal \
  -U backup \
  -D /srv/backups/postgres/base-$(date +%F_%H%M) \
  -Fp \
  -X stream \
  -P \
  --checkpoint=fast

pg_verifybackup /srv/backups/postgres/base-2026-04-23_0100

Key flags:

If your data set is large or your retention rules are complex, move quickly from ad hoc pg_basebackup scripts to a purpose-built tool such as pgBackRest or wal-g.

WAL archiving

WAL archiving is what turns "restore the last backup" into "restore to 14:20:37 UTC just before the bad deploy". Without archived WAL, your RPO is the time between base backups.

# postgresql.conf
wal_level = replica
archive_mode = on
archive_timeout = 60s
archive_command = 'test ! -f /var/lib/pgsql/wal-archive/%f && cp %p /var/lib/pgsql/wal-archive/%f'
max_wal_senders = 10
# Smoke-test the archive path as the postgres user
sudo -u postgres test -w /var/lib/pgsql/wal-archive
sudo -u postgres psql -c "SELECT pg_switch_wal();"
ls -lh /var/lib/pgsql/wal-archive | tail

Archive storage must outlive the primary. Copying WAL to the same local disk as PGDATA is not archiving; it is rearranging risk.

Point-in-time recovery

PITR means: restore a base backup, replay archived WAL until the chosen point, then promote. Practice it before you need it. The incident is not the time to learn whether your archive actually contains the target window.

systemctl stop postgresql-16
rm -rf /var/lib/pgsql/16/data/*
rsync -a /srv/backups/postgres/base-2026-04-23_0100/ /var/lib/pgsql/16/data/

cat >> /var/lib/pgsql/16/data/postgresql.auto.conf <<'EOF'
restore_command = 'cp /var/lib/pgsql/wal-archive/%f %p'
recovery_target_time = '2026-04-23 14:20:00+00'
recovery_target_action = 'promote'
EOF

touch /var/lib/pgsql/16/data/recovery.signal
chown -R postgres:postgres /var/lib/pgsql/16/data
chmod 0700 /var/lib/pgsql/16/data
systemctl start postgresql-16

Other valid targets include transaction ID, named restore point, or immediate recovery to the end of archived WAL. Pick the least ambiguous target you can explain in a runbook.

pgBackRest and wal-g

Both are widely used because they solve the operational work around Postgres backups: retention, manifests, integrity checks, parallelism, cloud storage integration, and restore ergonomics.

# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-path=/pgbackrest
repo1-retention-full=7
repo1-retention-diff=14
start-fast=y
process-max=4

[main]
pg1-path=/var/lib/pgsql/16/data
pg1-port=5432
# wal-g example
export WALG_S3_PREFIX=s3://sysref-prod/postgres
export PGDATA=/var/lib/pgsql/16/data
export WALG_COMPRESSION_METHOD=zstd

wal-g backup-push "$PGDATA"
wal-g wal-push /var/lib/pgsql/16/data/pg_wal/0000000100000000000000A1

pgBackRest is a strong default for self-managed clusters and formal retention policy. wal-g is common in object-storage-heavy environments and cloud-native flows. Either is better than an unverified shell script glued together at 02:00.

Restore validation

Validation is service-specific. Do not stop at "Postgres started." Confirm that roles, extensions, data, and critical application queries all work on the restored instance.

# Restore globals first
psql -f /srv/backups/postgres/globals-2026-04-23_0100.sql postgres

# Restore one database from a logical dump
createdb appdb_restore
pg_restore -d appdb_restore -j 4 /srv/backups/postgres/appdb-2026-04-23_0100.dump

# Smoke tests
psql -d appdb_restore -c "SELECT count(*) FROM orders;"
psql -d appdb_restore -c "SELECT extname FROM pg_extension ORDER BY 1;"
psql -d postgres -c "SELECT rolname FROM pg_roles ORDER BY 1;"

Ownership, permissions, and access

Many restore failures are not data problems but permissions problems. The Postgres service user must own the restored files, tablespace mount points must exist, and the backup user must be allowed to connect for replication or backup operations.

# pg_hba.conf
host replication backup       10.20.0.0/24  scram-sha-256
host all         restorecheck 10.20.1.0/24  scram-sha-256
install -d -o postgres -g postgres -m 0700 /srv/backups/postgres
chown -R postgres:postgres /var/lib/pgsql/16/data
chmod 0700 /var/lib/pgsql/16/data

# Tablespaces must exist with the right owner before restore
install -d -o postgres -g postgres -m 0700 /pg_tblspc/ts_fast
Common miss: restoring a dump into a server that is missing roles or tablespace directories. Logical data may restore, but ownership, grants, or object placement will fail partway through.

Monitoring

Watch three things: last successful backup time, WAL archive health, and restore test recency. If you only monitor the cron exit code, you will miss archive stalls and unusable restore media.

SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time
FROM   pg_stat_archiver;
#!/usr/bin/env bash
set -euo pipefail

age=$(sudo -u postgres psql -Atqc \
  "SELECT EXTRACT(EPOCH FROM now() - last_archived_time)::bigint FROM pg_stat_archiver")

printf 'postgres_wal_archive_age_seconds{cluster="main"} %s\n' "$age" \
  > /var/lib/node_exporter/textfile_collector/postgres_wal.prom
groups:
  - name: postgres-backup
    rules:
      - alert: PostgresWalArchiveStale
        expr: postgres_wal_archive_age_seconds{cluster="main"} > 300
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "WAL archiving has stalled for more than five minutes"

Plot these alongside replication and disk metrics from Postgres Ops so backup health is visible during routine operations, not only during restores.

Troubleshooting

SymptomLikely causeFix
pg_restore: ERROR: role "app" does not existGlobals were not restored firstRestore pg_dumpall --globals-only output before the database dump
requested WAL segment ... has already been removedWAL retention too short, archive broken, or wrong recovery targetConfirm archive completeness, extend retention, and retry from an earlier base backup if needed
archive_command failed keeps incrementingArchive destination full, path missing, or wrong permissionsFix the destination immediately, force WAL rotation, and confirm pg_stat_archiver recovers
FATAL: data directory ... has wrong ownershipFiles restored as root or another userchown -R postgres:postgres and chmod 0700 on the data directory
pg_basebackup: no pg_hba.conf entryBackup user or source network is not allowed for replicationAdd a replication entry in pg_hba.conf and reload config
pg_verifybackup reports checksum or manifest errorsCorrupted transfer, incomplete copy, or storage faultDiscard that backup set, investigate storage, and do not trust the artifact for recovery

See also: Backup & Restore, Postgres Ops, and DR Runbook Template.