Postgres Backup
pg_dumpis for logical exports;pg_basebackupis for physical cluster backups. They solve different problems.- Back up roles and tablespace metadata separately with
pg_dumpall --globals-only. - Point-in-time recovery requires both a usable base backup and complete WAL archiving.
- Always verify a physical backup with
pg_verifybackupand a logical backup with a real restore. - Ownership and mode on restored data directories matter:
postgres:postgresand0700are the usual minimum. - Monitor backup age, WAL archiving health, and restore test results. Silent archive failure is a real outage precursor.
Logical backups with pg_dump
pg_dump is ideal when you want schema-level portability, a single database export, or selective restore. It is not the fastest way to recover a large cluster, but it is excellent for migrations, developer restores, and application-level backups.
# One database, custom format, compressed
pg_dump \
-Fc \
-Z 6 \
-f /srv/backups/postgres/appdb-$(date +%F_%H%M).dump \
appdb
# Cluster-wide globals: roles, tablespaces, grants
pg_dumpall --globals-only \
> /srv/backups/postgres/globals-$(date +%F_%H%M).sql
# Inspect the archive contents before you need it
pg_restore --list /srv/backups/postgres/appdb-2026-04-23_0100.dump
Use custom format (-Fc) unless you have a specific reason not to. It supports parallel restore with pg_restore -j and selective object recovery.
Physical backups with pg_basebackup
pg_basebackup copies the data directory in a consistent state while Postgres is running. This is the built-in foundation for full-cluster restore, replica bootstrap, and point-in-time recovery.
pg_basebackup \
-h primary.db.internal \
-U backup \
-D /srv/backups/postgres/base-$(date +%F_%H%M) \
-Fp \
-X stream \
-P \
--checkpoint=fast
pg_verifybackup /srv/backups/postgres/base-2026-04-23_0100
Key flags:
-Fpwrites plain files. Use-Ftif you prefer tar archives.-X streamstreams required WAL during the backup so the result is self-consistent.--checkpoint=fastfinishes sooner at the cost of more immediate I/O on the primary.
If your data set is large or your retention rules are complex, move quickly from ad hoc pg_basebackup scripts to a purpose-built tool such as pgBackRest or wal-g.
WAL archiving
WAL archiving is what turns "restore the last backup" into "restore to 14:20:37 UTC just before the bad deploy". Without archived WAL, your RPO is the time between base backups.
# postgresql.conf
wal_level = replica
archive_mode = on
archive_timeout = 60s
archive_command = 'test ! -f /var/lib/pgsql/wal-archive/%f && cp %p /var/lib/pgsql/wal-archive/%f'
max_wal_senders = 10
# Smoke-test the archive path as the postgres user
sudo -u postgres test -w /var/lib/pgsql/wal-archive
sudo -u postgres psql -c "SELECT pg_switch_wal();"
ls -lh /var/lib/pgsql/wal-archive | tail
Archive storage must outlive the primary. Copying WAL to the same local disk as PGDATA is not archiving; it is rearranging risk.
Point-in-time recovery
PITR means: restore a base backup, replay archived WAL until the chosen point, then promote. Practice it before you need it. The incident is not the time to learn whether your archive actually contains the target window.
systemctl stop postgresql-16
rm -rf /var/lib/pgsql/16/data/*
rsync -a /srv/backups/postgres/base-2026-04-23_0100/ /var/lib/pgsql/16/data/
cat >> /var/lib/pgsql/16/data/postgresql.auto.conf <<'EOF'
restore_command = 'cp /var/lib/pgsql/wal-archive/%f %p'
recovery_target_time = '2026-04-23 14:20:00+00'
recovery_target_action = 'promote'
EOF
touch /var/lib/pgsql/16/data/recovery.signal
chown -R postgres:postgres /var/lib/pgsql/16/data
chmod 0700 /var/lib/pgsql/16/data
systemctl start postgresql-16
Other valid targets include transaction ID, named restore point, or immediate recovery to the end of archived WAL. Pick the least ambiguous target you can explain in a runbook.
pgBackRest and wal-g
Both are widely used because they solve the operational work around Postgres backups: retention, manifests, integrity checks, parallelism, cloud storage integration, and restore ergonomics.
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-path=/pgbackrest
repo1-retention-full=7
repo1-retention-diff=14
start-fast=y
process-max=4
[main]
pg1-path=/var/lib/pgsql/16/data
pg1-port=5432
# wal-g example
export WALG_S3_PREFIX=s3://sysref-prod/postgres
export PGDATA=/var/lib/pgsql/16/data
export WALG_COMPRESSION_METHOD=zstd
wal-g backup-push "$PGDATA"
wal-g wal-push /var/lib/pgsql/16/data/pg_wal/0000000100000000000000A1
pgBackRest is a strong default for self-managed clusters and formal retention policy. wal-g is common in object-storage-heavy environments and cloud-native flows. Either is better than an unverified shell script glued together at 02:00.
Restore validation
Validation is service-specific. Do not stop at "Postgres started." Confirm that roles, extensions, data, and critical application queries all work on the restored instance.
# Restore globals first
psql -f /srv/backups/postgres/globals-2026-04-23_0100.sql postgres
# Restore one database from a logical dump
createdb appdb_restore
pg_restore -d appdb_restore -j 4 /srv/backups/postgres/appdb-2026-04-23_0100.dump
# Smoke tests
psql -d appdb_restore -c "SELECT count(*) FROM orders;"
psql -d appdb_restore -c "SELECT extname FROM pg_extension ORDER BY 1;"
psql -d postgres -c "SELECT rolname FROM pg_roles ORDER BY 1;"
- Verify object counts or business-critical row counts, not just table existence.
- Check extension availability on the restore target before you need it.
- Record restore duration and compare it against the declared RTO from Backup & Restore.
Ownership, permissions, and access
Many restore failures are not data problems but permissions problems. The Postgres service user must own the restored files, tablespace mount points must exist, and the backup user must be allowed to connect for replication or backup operations.
# pg_hba.conf
host replication backup 10.20.0.0/24 scram-sha-256
host all restorecheck 10.20.1.0/24 scram-sha-256
install -d -o postgres -g postgres -m 0700 /srv/backups/postgres
chown -R postgres:postgres /var/lib/pgsql/16/data
chmod 0700 /var/lib/pgsql/16/data
# Tablespaces must exist with the right owner before restore
install -d -o postgres -g postgres -m 0700 /pg_tblspc/ts_fast
Monitoring
Watch three things: last successful backup time, WAL archive health, and restore test recency. If you only monitor the cron exit code, you will miss archive stalls and unusable restore media.
SELECT archived_count,
failed_count,
last_archived_wal,
last_archived_time,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
#!/usr/bin/env bash
set -euo pipefail
age=$(sudo -u postgres psql -Atqc \
"SELECT EXTRACT(EPOCH FROM now() - last_archived_time)::bigint FROM pg_stat_archiver")
printf 'postgres_wal_archive_age_seconds{cluster="main"} %s\n' "$age" \
> /var/lib/node_exporter/textfile_collector/postgres_wal.prom
groups:
- name: postgres-backup
rules:
- alert: PostgresWalArchiveStale
expr: postgres_wal_archive_age_seconds{cluster="main"} > 300
for: 10m
labels:
severity: page
annotations:
summary: "WAL archiving has stalled for more than five minutes"
Plot these alongside replication and disk metrics from Postgres Ops so backup health is visible during routine operations, not only during restores.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
pg_restore: ERROR: role "app" does not exist | Globals were not restored first | Restore pg_dumpall --globals-only output before the database dump |
requested WAL segment ... has already been removed | WAL retention too short, archive broken, or wrong recovery target | Confirm archive completeness, extend retention, and retry from an earlier base backup if needed |
archive_command failed keeps incrementing | Archive destination full, path missing, or wrong permissions | Fix the destination immediately, force WAL rotation, and confirm pg_stat_archiver recovers |
FATAL: data directory ... has wrong ownership | Files restored as root or another user | chown -R postgres:postgres and chmod 0700 on the data directory |
pg_basebackup: no pg_hba.conf entry | Backup user or source network is not allowed for replication | Add a replication entry in pg_hba.conf and reload config |
pg_verifybackup reports checksum or manifest errors | Corrupted transfer, incomplete copy, or storage fault | Discard that backup set, investigate storage, and do not trust the artifact for recovery |
See also: Backup & Restore, Postgres Ops, and DR Runbook Template.