Postgres Backup

Use logical dumps for portability, physical backups plus WAL for fast full-cluster recovery, and validate restores often enough that point-in-time recovery is routine instead of guesswork.

If you only remember six things

pg_dump is for logical exports; pg_basebackup is for physical cluster backups. They solve different problems.
Back up roles and tablespace metadata separately with pg_dumpall --globals-only.
Point-in-time recovery requires both a usable base backup and complete WAL archiving.
Always verify a physical backup with pg_verifybackup and a logical backup with a real restore.
Ownership and mode on restored data directories matter: postgres:postgres and 0700 are the usual minimum.
Monitor backup age, WAL archiving health, and restore test results. Silent archive failure is a real outage precursor.

On this page

Logical backups with pg_dump
Physical backups with pg_basebackup
WAL archiving
Point-in-time recovery
pgBackRest and wal-g
Restore validation
Ownership, permissions, and access
Monitoring
Troubleshooting

Logical backups with pg_dump

pg_dump is ideal when you want schema-level portability, a single database export, or selective restore. It is not the fastest way to recover a large cluster, but it is excellent for migrations, developer restores, and application-level backups.

# One database, custom format, compressed
pg_dump \
  -Fc \
  -Z 6 \
  -f /srv/backups/postgres/appdb-$(date +%F_%H%M).dump \
  appdb

# Cluster-wide globals: roles, tablespaces, grants
pg_dumpall --globals-only \
  > /srv/backups/postgres/globals-$(date +%F_%H%M).sql

# Inspect the archive contents before you need it
pg_restore --list /srv/backups/postgres/appdb-2026-04-23_0100.dump

Use custom format (-Fc) unless you have a specific reason not to. It supports parallel restore with pg_restore -j and selective object recovery.

Large clusters usually need physical backups plus WAL for primary recovery. Keep logical dumps anyway for schema diffing, object-level restore, and cross-version migrations.

Physical backups with pg_basebackup

pg_basebackup copies the data directory in a consistent state while Postgres is running. This is the built-in foundation for full-cluster restore, replica bootstrap, and point-in-time recovery.

pg_basebackup \
  -h primary.db.internal \
  -U backup \
  -D /srv/backups/postgres/base-$(date +%F_%H%M) \
  -Fp \
  -X stream \
  -P \
  --checkpoint=fast

pg_verifybackup /srv/backups/postgres/base-2026-04-23_0100

Key flags:

-Fp writes plain files. Use -Ft if you prefer tar archives.
-X stream streams required WAL during the backup so the result is self-consistent.
--checkpoint=fast finishes sooner at the cost of more immediate I/O on the primary.

If your data set is large or your retention rules are complex, move quickly from ad hoc pg_basebackup scripts to a purpose-built tool such as pgBackRest or wal-g.

WAL archiving

WAL archiving is what turns "restore the last backup" into "restore to 14:20:37 UTC just before the bad deploy". Without archived WAL, your RPO is the time between base backups.

# postgresql.conf
wal_level = replica
archive_mode = on
archive_timeout = 60s
archive_command = 'test ! -f /var/lib/pgsql/wal-archive/%f && cp %p /var/lib/pgsql/wal-archive/%f'
max_wal_senders = 10

# Smoke-test the archive path as the postgres user
sudo -u postgres test -w /var/lib/pgsql/wal-archive
sudo -u postgres psql -c "SELECT pg_switch_wal();"
ls -lh /var/lib/pgsql/wal-archive | tail

Archive storage must outlive the primary. Copying WAL to the same local disk as PGDATA is not archiving; it is rearranging risk.

Point-in-time recovery

PITR means: restore a base backup, replay archived WAL until the chosen point, then promote. Practice it before you need it. The incident is not the time to learn whether your archive actually contains the target window.

systemctl stop postgresql-16
rm -rf /var/lib/pgsql/16/data/*
rsync -a /srv/backups/postgres/base-2026-04-23_0100/ /var/lib/pgsql/16/data/

cat >> /var/lib/pgsql/16/data/postgresql.auto.conf <<'EOF'
restore_command = 'cp /var/lib/pgsql/wal-archive/%f %p'
recovery_target_time = '2026-04-23 14:20:00+00'
recovery_target_action = 'promote'
EOF

touch /var/lib/pgsql/16/data/recovery.signal
chown -R postgres:postgres /var/lib/pgsql/16/data
chmod 0700 /var/lib/pgsql/16/data
systemctl start postgresql-16

Other valid targets include transaction ID, named restore point, or immediate recovery to the end of archived WAL. Pick the least ambiguous target you can explain in a runbook.

pgBackRest and wal-g

Both are widely used because they solve the operational work around Postgres backups: retention, manifests, integrity checks, parallelism, cloud storage integration, and restore ergonomics.

# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-path=/pgbackrest
repo1-retention-full=7
repo1-retention-diff=14
start-fast=y
process-max=4

[main]
pg1-path=/var/lib/pgsql/16/data
pg1-port=5432

# wal-g example
export WALG_S3_PREFIX=s3://sysref-prod/postgres
export PGDATA=/var/lib/pgsql/16/data
export WALG_COMPRESSION_METHOD=zstd

wal-g backup-push "$PGDATA"
wal-g wal-push /var/lib/pgsql/16/data/pg_wal/0000000100000000000000A1

pgBackRest is a strong default for self-managed clusters and formal retention policy. wal-g is common in object-storage-heavy environments and cloud-native flows. Either is better than an unverified shell script glued together at 02:00.

Restore validation

Validation is service-specific. Do not stop at "Postgres started." Confirm that roles, extensions, data, and critical application queries all work on the restored instance.

# Restore globals first
psql -f /srv/backups/postgres/globals-2026-04-23_0100.sql postgres

# Restore one database from a logical dump
createdb appdb_restore
pg_restore -d appdb_restore -j 4 /srv/backups/postgres/appdb-2026-04-23_0100.dump

# Smoke tests
psql -d appdb_restore -c "SELECT count(*) FROM orders;"
psql -d appdb_restore -c "SELECT extname FROM pg_extension ORDER BY 1;"
psql -d postgres -c "SELECT rolname FROM pg_roles ORDER BY 1;"

Verify object counts or business-critical row counts, not just table existence.
Check extension availability on the restore target before you need it.
Record restore duration and compare it against the declared RTO from Backup & Restore.

Ownership, permissions, and access

Many restore failures are not data problems but permissions problems. The Postgres service user must own the restored files, tablespace mount points must exist, and the backup user must be allowed to connect for replication or backup operations.

# pg_hba.conf
host replication backup       10.20.0.0/24  scram-sha-256
host all         restorecheck 10.20.1.0/24  scram-sha-256

install -d -o postgres -g postgres -m 0700 /srv/backups/postgres
chown -R postgres:postgres /var/lib/pgsql/16/data
chmod 0700 /var/lib/pgsql/16/data

# Tablespaces must exist with the right owner before restore
install -d -o postgres -g postgres -m 0700 /pg_tblspc/ts_fast

Common miss: restoring a dump into a server that is missing roles or tablespace directories. Logical data may restore, but ownership, grants, or object placement will fail partway through.

Monitoring

Watch three things: last successful backup time, WAL archive health, and restore test recency. If you only monitor the cron exit code, you will miss archive stalls and unusable restore media.

SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time
FROM   pg_stat_archiver;

#!/usr/bin/env bash
set -euo pipefail

age=$(sudo -u postgres psql -Atqc \
  "SELECT EXTRACT(EPOCH FROM now() - last_archived_time)::bigint FROM pg_stat_archiver")

printf 'postgres_wal_archive_age_seconds{cluster="main"} %s\n' "$age" \
  > /var/lib/node_exporter/textfile_collector/postgres_wal.prom

groups:
  - name: postgres-backup
    rules:
      - alert: PostgresWalArchiveStale
        expr: postgres_wal_archive_age_seconds{cluster="main"} > 300
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "WAL archiving has stalled for more than five minutes"

Plot these alongside replication and disk metrics from Postgres Ops so backup health is visible during routine operations, not only during restores.

Troubleshooting

Symptom	Likely cause	Fix
`pg_restore: ERROR: role "app" does not exist`	Globals were not restored first	Restore `pg_dumpall --globals-only` output before the database dump
`requested WAL segment ... has already been removed`	WAL retention too short, archive broken, or wrong recovery target	Confirm archive completeness, extend retention, and retry from an earlier base backup if needed
`archive_command failed` keeps incrementing	Archive destination full, path missing, or wrong permissions	Fix the destination immediately, force WAL rotation, and confirm `pg_stat_archiver` recovers
`FATAL: data directory ... has wrong ownership`	Files restored as root or another user	`chown -R postgres:postgres` and `chmod 0700` on the data directory
`pg_basebackup: no pg_hba.conf entry`	Backup user or source network is not allowed for replication	Add a replication entry in `pg_hba.conf` and reload config
`pg_verifybackup` reports checksum or manifest errors	Corrupted transfer, incomplete copy, or storage fault	Discard that backup set, investigate storage, and do not trust the artifact for recovery