LVM Thin & Snapshots
- Overprovision with a plan, not by accident. Track actual usage (data_percent) and metadata (metadata_percent) on every pool.
- Metadata exhaustion = data loss. Watch metadata_percent at least as closely as data_percent. Metadata is roughly 0.1–0.5% of data in normal use; grow it before it fills.
- Autoextend is a safety belt, not a plan. Set thin_pool_autoextend_threshold = 80 and thin_pool_autoextend_percent = 20 in /etc/lvm/lvm.conf.
- Thick snapshots age into silent invalidation when their CoW area fills. Thin snapshots don't, but they do share metadata with the parent pool.
- Snapshots are not backups. They protect against wrong-delete and enable consistent reads for backup tooling; they do not survive loss of the underlying VG.
- Thin pool vs regular LV
- Creating a thin pool and thin volumes
- Overprovisioning math
- Monitoring: lvs and systemd
- Autoextend configuration
- Thick snapshots
- Thin snapshots
- Snapshot-based backups
- Rollback procedure
- Online extension
- Metadata repair
- Gotchas and failure modes
- Troubleshooting
- Cross-reference
Thin pool vs regular LV
- Regular (thick) LV
- Allocates all its extents from the VG at create time. Snapshots reserve a separate copy-on-write area sized at create time.
- Thin pool
- A special LV containing a data sub-LV and a metadata sub-LV. Thin volumes carved from it allocate chunks only on first write.
- Thin volume
- A virtual-sized LV backed by a thin pool. Its virtual size can exceed the pool's physical size; real blocks are materialised on demand.
- Chunk size
- The allocation unit for the pool (default 64 KiB; 256 KiB or more is common for VMs/databases). Bigger chunks = less metadata, more internal fragmentation.
Use thin LVM when you want fast cheap snapshots (CI, dev environments, container overlays), when you genuinely cannot size volumes up-front, or when you need many similar volumes (VM images). Prefer thick LVs for single-purpose hosts where the cost of a pool exhaustion incident outweighs the flexibility.
Creating a thin pool and thin volumes
pvcreate /dev/sdb /dev/sdc
vgcreate data /dev/sdb /dev/sdc
# Create a 500G thin pool with explicit metadata size and chunk size
lvcreate -L 500G -T data/thinpool \
--poolmetadatasize 1G \
--chunksize 256K
# Carve a 200G thin volume (virtual)
lvcreate -V 200G -T data/thinpool -n app
mkfs.xfs /dev/data/app
mkdir -p /srv/app
mount /dev/data/app /srv/app
# Carve another, overprovisioned
lvcreate -V 400G -T data/thinpool -n db
mkfs.xfs /dev/data/db
mount /dev/data/db /srv/db
lvs -a -o +chunk_size,metadata_percent,data_percent
Sample lvs -a output:
LV VG Attr LSize Pool Origin Data% Meta% Chunk
thinpool data twi-aotz-- 500.00g 0.05 0.20 256.00k
[thinpool_tdata] data Twi-ao---- 500.00g 0
[thinpool_tmeta] data ewi-ao---- 1.00g 0
app data Vwi-aotz-- 200.00g thinpool 0.11 0
db data Vwi-aotz-- 400.00g thinpool 0.00 0
Overprovisioning math
Overprovisioning is the ratio of the sum of thin-volume virtual sizes to the pool's physical data size. A pool with 500 GiB of data and 2 TiB of thin volumes is roughly 4× overprovisioned.
| Term | Symbol | Notes |
|---|---|---|
| Pool data size | D | From lvs: the _tdata LV size |
| Pool metadata size | M | From lvs: the _tmeta LV size (typ. 0.1–0.5% of D) |
| Sum of thin virtual sizes | V | awk '{s+=$4} END{print s}' across thin volumes |
| Overprovision ratio | V/D | > 1 means the pool is overprovisioned; 2–3× typical |
| Used data | U | data_percent * D / 100 |
| Used metadata | Um | metadata_percent * M / 100 |
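The ratio V/D from the table can be computed directly from lvs output. The following is a minimal sketch, not a definitive tool: the function name overprovision_ratio is hypothetical, and it assumes whitespace-separated input in the column order lv_name, lv_attr, lv_size (bytes), pool_lv.

```shell
# Sketch: compute the overprovision ratio V/D for one thin pool.
# Assumed stdin columns: lv_name lv_attr lv_size_in_bytes [pool_lv]
overprovision_ratio() {
    pool="$1"
    awk -v pool="$pool" '
        # Pool row (attr starts with "t"): physical data size D
        $1 == pool && $2 ~ /^t/ { d = $3 }
        # Thin volume or thin snapshot row (attr starts with "V"): add to V
        $2 ~ /^V/ && $4 == pool { v += $3 }
        END {
            if (d == 0) { print "pool not found" > "/dev/stderr"; exit 1 }
            printf "%.2f\n", v / d
        }'
}
```

On a live host the input could come from lvs --noheadings --units b --nosuffix -o lv_name,lv_attr,lv_size,pool_lv data, piped into overprovision_ratio thinpool. For the 500G pool with 200G and 400G volumes created earlier, the result is 1.20.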
Rule-of-thumb sizing:
- Metadata: start at 1 GiB per TiB of data with 256 KiB chunks. Grow when metadata_percent crosses 50%.
- Data: keep data_percent comfortably below the autoextend threshold (typically 80%). Treat 90% as page-the-oncall.
- Overprovision: 1.5–3× is common. Beyond that, you must have both monitoring and headroom in the VG to extend into.
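As a cross-check on the metadata rule of thumb, the kernel's thin-provisioning documentation suggests roughly 48 bytes of metadata per data chunk, with a 2 MiB floor. The sketch below applies that guide; the function name thin_meta_estimate_bytes is hypothetical, and the result is a lower bound that does not account for heavy snapshot use.

```shell
# Sketch: kernel-guide estimate of thin-pool metadata size.
# bytes = 48 * data_size / chunk_size, never less than 2 MiB.
thin_meta_estimate_bytes() {
    data_bytes="$1"
    chunk_bytes="$2"
    est=$(( 48 * data_bytes / chunk_bytes ))
    min=$(( 2 * 1024 * 1024 ))
    if [ "$est" -lt "$min" ]; then
        est="$min"
    fi
    echo "$est"
}
```

For 1 TiB of data with 256 KiB chunks, thin_meta_estimate_bytes 1099511627776 262144 prints 201326592 (192 MiB), so starting at 1 GiB per TiB leaves several times the baseline as headroom for snapshots.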
Monitoring: lvs and systemd
# Headline numbers
lvs -a -o +metadata_percent,data_percent,chunk_size
# Full state: pool vs origins, attributes, flags
lvs -a -o lv_name,lv_attr,lv_size,pool_lv,origin,data_percent,metadata_percent
# Also check the health flag (9th lv_attr character)
lvs --noheadings -o lv_name,lv_attr | awk 'substr($2,9,1) ~ /[DMF]/' # D = out of data space, M = metadata read-only, F = failed
dmsetup status # kernel view of thin targets
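The health flag lives in the ninth character of lv_attr. A small decoder makes that explicit for use in scripts; this is a sketch, and the function name thin_health and its output strings are hypothetical.

```shell
# Sketch: decode the volume-health character (9th position) of lv_attr.
thin_health() {
    attr="$1"
    # cut -c9 extracts the ninth character of the attribute string
    case "$(printf '%s' "$attr" | cut -c9)" in
        D) echo "out-of-data-space" ;;
        M) echo "metadata-read-only" ;;
        F) echo "failed" ;;
        -) echo "ok" ;;
        *) echo "other" ;;
    esac
}
```

For example, thin_health twi-aotzD- prints out-of-data-space, while the healthy twi-aotz-- from the sample output prints ok.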
LVM ships an lvm2-monitor service that registers pools with dmeventd, the userspace daemon that enforces autoextend thresholds. Keep it enabled on every host:
systemctl enable --now lvm2-monitor
systemctl status lvm2-monitor
journalctl -u dm-event -u lvm2-monitor --since today
A simple host-level alerting loop (Prometheus node_exporter textfile, cron, or Ansible-delivered timer):
#!/bin/bash
# /usr/local/sbin/lvm-thin-probe.sh
set -euo pipefail
lvs --noheadings --units b -o vg_name,lv_name,lv_attr,data_percent,metadata_percent 2>/dev/null \
| awk '$3 ~ /^t/ {
gsub("%", "", $4); gsub("%", "", $5);
if ($4+0 > 85 || $5+0 > 50)
printf "ALERT %s/%s data=%s%% meta=%s%%\n", $1, $2, $4, $5
}'
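For the Prometheus textfile route, the same lvs columns can be turned into metrics instead of alert lines. This is a sketch only: the function name thin_pool_metrics and the metric names lvm_thin_pool_data_percent and lvm_thin_pool_metadata_percent are hypothetical, not an established exporter convention.

```shell
# Sketch: convert lvs output into Prometheus textfile metrics.
# Assumed stdin columns: vg_name lv_name lv_attr data_percent metadata_percent
thin_pool_metrics() {
    awk '$3 ~ /^t/ {
        # Only thin pools (attr starts with "t") carry both percentages
        printf "lvm_thin_pool_data_percent{vg=\"%s\",lv=\"%s\"} %s\n", $1, $2, $4
        printf "lvm_thin_pool_metadata_percent{vg=\"%s\",lv=\"%s\"} %s\n", $1, $2, $5
    }'
}
```

A cron job or timer could pipe lvs --noheadings -o vg_name,lv_name,lv_attr,data_percent,metadata_percent into thin_pool_metrics, write the result to a temporary file in the node_exporter textfile directory, and rename it into place so the collector never reads a partial file.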
Autoextend configuration
Autoextend grows the pool (data and/or metadata) when usage crosses a threshold. Configure it in the activation { } section of /etc/lvm/lvm.conf. The default threshold of 100 effectively disables autoextend, so change it.
# /etc/lvm/lvm.conf (activation {})
activation {
thin_pool_autoextend_threshold = 80
thin_pool_autoextend_percent = 20
# Optional: metadata pool auto-extend (modern LVM)
# thin_pool_autoextend_threshold applies to both data and metadata
# when the pool is monitored.
}
lvmconfig --type current activation/thin_pool_autoextend_threshold
lvmconfig --type current activation/thin_pool_autoextend_percent
# Per-pool settings (applied with lvchange)
lvchange --errorwhenfull y data/thinpool # fail I/O instead of queueing when the pool is full
lvchange --monitor y data/thinpool # ensure dmeventd is watching
--errorwhenfull y is safer than the default. By default a full pool queues writes, so the writing process hangs in D state until space is added or the kernel's queue timeout (60 seconds unless changed) expires. With --errorwhenfull y the kernel returns ENOSPC to the filesystem immediately, which is usually handled more gracefully than a freeze.
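Autoextend only helps while the VG has free extents, so it is worth knowing how many extension steps the VG can absorb. The sketch below is an approximation: the function name autoextend_steps is hypothetical, all sizes are in MiB, and it assumes each step grows the pool by thin_pool_autoextend_percent of its current size, ignoring extent rounding.

```shell
# Sketch: count how many autoextend steps fit in the VG's free space.
# Arguments: pool size (MiB), VG free space (MiB), autoextend percent.
# Prints: "<steps> <final pool size in MiB>"
autoextend_steps() {
    size="$1"
    vg_free="$2"
    percent="$3"
    steps=0
    while :; do
        grow=$(( size * percent / 100 ))
        # Stop when the next step is empty or no longer fits in the VG
        [ "$grow" -gt 0 ] && [ "$grow" -le "$vg_free" ] || break
        size=$(( size + grow ))
        vg_free=$(( vg_free - grow ))
        steps=$(( steps + 1 ))
    done
    echo "$steps $size"
}
```

With a 500G pool (512000 MiB), 250G free in the VG (256000 MiB), and percent 20, autoextend_steps 512000 256000 20 prints 2 737280: two extensions succeed, and the third would need 144G that the VG no longer has.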
Thick snapshots
A thick snapshot allocates a fixed CoW region. Once that region fills, the snapshot is invalidated (unusable). They are cheap for short-lived consistency points (backup of a quiesced DB) but expensive to keep open on busy volumes.
# Create: -s for snapshot, -L for CoW area, -n for name
lvcreate -s -L 20G -n db-snap /dev/data/db-thick
# Check CoW usage
lvs -a -o +origin,data_percent | grep db-snap
# When done (or once the snapshot has been invalidated), remove it
lvremove -f /dev/data/db-snap
Sizing the CoW: during the snapshot's lifetime you need one chunk per modified block in the origin. For a database taking 5 minutes of backup with ~5% of blocks changing, 5–10% of origin size is usually safe. Too small = invalidation; too big = wasted VG space.
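The sizing rule reduces to simple arithmetic: expected changed fraction times origin size, times a safety factor. A minimal sketch, with the hypothetical function name cow_size_mib and an integer safety multiplier as an assumption:

```shell
# Sketch: size a thick snapshot's CoW area.
# Arguments: origin size (MiB), expected changed blocks (%), safety multiplier.
cow_size_mib() {
    origin_mib="$1"
    change_pct="$2"
    safety="$3"
    echo $(( origin_mib * change_pct * safety / 100 ))
}
```

For a 400G origin (409600 MiB) with 5% of blocks changing and a 2× safety factor, cow_size_mib 409600 5 2 prints 40960, that is 40G or 10% of the origin, the upper end of the range quoted above.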
Thin snapshots
Thin snapshots are first-class thin LVs sharing chunks with the origin. They cost almost nothing at create time; writes to either origin or snapshot allocate from the pool. They do not invalidate when "full" — they share the same space accounting as the pool.
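# Create (thin snapshots are writable by default)
lvcreate -s -n app-snap-preupgrade /dev/data/app
# Read-only snapshot: -pr sets the LV permission to read-only
lvcreate -s -pr -n app-ro-snap /dev/data/app
# Thin snapshots carry the activation-skip flag; -K overrides it
lvchange -ay -K data/app-ro-snap
mkdir -p /mnt/app-snap
# XFS: nouuid because the origin is mounted, norecovery because the LV is read-only
mount -o ro,nouuid,norecovery /dev/data/app-ro-snap /mnt/app-snap
lvs -a -o +origin,data_percent,metadata_percent | grep app
lvremove -f /dev/data/app-snap-preupgrade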
-o ro is enforced only at the filesystem layer; the LV itself still accepts writes unless it was created read-only. Use lvcreate -s -pr (permissions read-only) for immutable snapshots.
Snapshot-based backups
The value of a snapshot for backup is a stable point-in-time image while the application keeps running. The pattern:
- Quiesce the application just enough for a consistent on-disk state (DB: FLUSH TABLES WITH READ LOCK, pg_start_backup / low-level backup mode (renamed pg_backup_start in PostgreSQL 15), or a filesystem freeze).
- Create a snapshot.
- Release the application.
- Mount the snapshot read-only, stream to backup target, unmount, remove the snapshot.
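#!/bin/bash
# /usr/local/sbin/thin-backup.sh
set -euo pipefail
: "${BACKUP_REPO:?set BACKUP_REPO to the restic repository}"
VG=data
ORIGIN=db
SNAP="${ORIGIN}-bak-$(date +%Y%m%dT%H%M%S)"
MOUNT="/mnt/${SNAP}"
# Freeze only while the snapshot is created; thaw even if lvcreate fails
fsfreeze -f "/srv/${ORIGIN}"
trap 'fsfreeze -u "/srv/${ORIGIN}"' EXIT
lvcreate -s -pr -n "$SNAP" "/dev/${VG}/${ORIGIN}"
fsfreeze -u "/srv/${ORIGIN}"
trap - EXIT
# Thin snapshots skip activation by default; -K overrides the flag
lvchange -ay -K "${VG}/${SNAP}"
mkdir -p "$MOUNT"
# XFS options; for ext4 use ro,noload instead
mount -o ro,norecovery,nouuid "/dev/${VG}/${SNAP}" "$MOUNT"
restic -r "$BACKUP_REPO" backup --tag "lvm-snap" "$MOUNT"
umount "$MOUNT"
rmdir "$MOUNT"
lvremove -fy "/dev/${VG}/${SNAP}"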
Mount options: mounting an XFS snapshot containing an unclean log requires -o ro,norecovery, plus nouuid while the origin is still mounted. Since snapshots of a live filesystem almost always have a dirty log, this flag is routine. Ext4's equivalent is -o ro,noload.
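A backup script that handles more than one filesystem type can pick the mount options from the detected type. The following is a sketch; the function name snapshot_mount_opts is hypothetical, and the fallback of plain ro for other filesystems is an assumption.

```shell
# Sketch: choose read-only mount options for a snapshot by filesystem type.
snapshot_mount_opts() {
    case "$1" in
        # XFS: skip log replay and tolerate the UUID shared with the origin
        xfs)       echo "ro,norecovery,nouuid" ;;
        # ext3/ext4: skip journal replay
        ext3|ext4) echo "ro,noload" ;;
        # Assumption: other filesystems get a plain read-only mount
        *)         echo "ro" ;;
    esac
}
```

The type can be detected with blkid -o value -s TYPE on the snapshot device and the result passed to mount -o.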
Rollback procedure
Rolling back an origin to a snapshot is called merge in LVM. It works for both thick and thin snapshots.
# Create a pre-change snapshot
lvcreate -s -n app-pre /dev/data/app
# ... change goes wrong ...
# Unmount the origin first (rollback can't merge while it's open read-write)
umount /srv/app
lvconvert --merge /dev/data/app-pre
# If the origin is open, the merge is deferred to next activation.
# For thin volumes, you often have to deactivate + reactivate:
lvchange -an data/app
lvchange -ay data/app
mount /dev/data/app /srv/app
- After the merge completes, the snapshot LV is removed automatically.
- You can monitor progress with lvs -a -o +seg_pe_ranges,progress on the merging origin (some LVM versions use different column names).
- Rolling back a thin origin does not free pool space that was shared with other snapshots; accounting gets subtle, so watch data_percent.
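Because a rollback is destructive, it can help to print the exact command sequence for review before running anything. The sketch below only prints commands; the function name rollback_plan is hypothetical and the sequence simply mirrors the procedure above.

```shell
# Sketch: print the rollback command sequence for review (executes nothing).
# Arguments: VG name, origin LV, snapshot LV, origin mount point.
rollback_plan() {
    vg="$1"
    origin="$2"
    snap="$3"
    mnt="$4"
    cat <<EOF
umount $mnt
lvconvert --merge $vg/$snap
lvchange -an $vg/$origin
lvchange -ay $vg/$origin
mount /dev/$vg/$origin $mnt
EOF
}
```

Running rollback_plan data app app-pre /srv/app prints the five commands for the example above; once reviewed, the output can be pasted or piped into a shell.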
Online extension
# Grow the pool's data area by 200G
lvextend -L +200G data/thinpool
# Grow the pool's metadata area by 512M
lvextend --poolmetadatasize +512M data/thinpool
# Grow a thin volume and then the filesystem in one step
lvextend -L +50G --resizefs /dev/data/app # ext4/xfs both supported
# Or step-by-step
lvextend -L +50G /dev/data/app
xfs_growfs /srv/app
# or: resize2fs /dev/data/app
# After extending, verify
lvs -a -o +data_percent,metadata_percent
df -h /srv/app
XFS cannot be shrunk; reducing a filesystem means xfsdump/mkfs.xfs/xfsrestore. Size with some thought; overprovisioning hides sizing mistakes but doesn't fix them.
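When extending a pool in response to an alert, the amount to add follows from the current usage and the usage level you want to return to. A minimal sketch, assuming sizes in GiB and a hypothetical function name pool_growth_needed_gib:

```shell
# Sketch: GiB to add so that current usage falls to a target percentage.
# Arguments: pool size (GiB), current data_percent, target data_percent.
pool_growth_needed_gib() {
    awk -v size="$1" -v pct="$2" -v target="$3" 'BEGIN {
        used = size * pct / 100
        need = used * 100 / target - size
        if (need <= 0) { print 0; exit }
        # Round up to a whole GiB
        n = int(need)
        if (n < need) n++
        print n
    }'
}
```

A 500G pool at 90% holds 450G of data; to bring usage back to 70%, pool_growth_needed_gib 500 90 70 prints 143, which would then be passed to lvextend -L +143G data/thinpool.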
Metadata repair
If a pool's metadata is corrupt (unexpected power loss, kernel bug, storage problem) or exhausted, LVM may refuse to activate the pool, and you see messages such as thin-pool: no free metadata space or Thin metadata device has insufficient space.
# Deactivate the pool
lvchange -an data/thinpool
# Dump and repair metadata (modern LVM wraps thin_repair for you)
lvconvert --repair data/thinpool
# After repair, activate and inspect
lvchange -ay data/thinpool
lvs -a -o +metadata_percent,data_percent
dmesg | grep -i thin
--repair does a round trip through a spare metadata LV. Keep at least one metadata LV's worth of free extents in the VG, or the repair cannot run.
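The free-extent requirement can be checked before a repair is ever needed. This is a sketch under the assumption that free space equal to the metadata LV size is sufficient; the function name repair_has_room is hypothetical and both arguments are in bytes.

```shell
# Sketch: check that the VG has room for a spare metadata LV.
# Arguments: pool metadata size (bytes), VG free space (bytes).
repair_has_room() {
    meta_bytes="$1"
    vg_free_bytes="$2"
    if [ "$vg_free_bytes" -ge "$meta_bytes" ]; then
        echo "ok"
    else
        echo "short by $(( meta_bytes - vg_free_bytes )) bytes"
        return 1
    fi
}
```

The inputs can be read with lvs --noheadings --units b --nosuffix -o lv_metadata_size data/thinpool and vgs --noheadings --units b --nosuffix -o vg_free data, with surrounding whitespace stripped.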
Gotchas and failure modes
- Metadata exhaustion: the #1 cause of unrecoverable thin pools. Autoextend helps but only if the VG has free extents. If metadata fills and you cannot extend in time, the pool may need offline thin_repair with good odds of data loss.
- Data exhaustion with default behaviour: writes hang in D state. Combined with the default systemd journal on the same pool, you lose logging and SSH responsiveness.
- Discards propagation: deleting files returns chunks to the pool only when the filesystem issues discards. Run fstrim periodically or mount with -o discard, and check that the pool's discards mode is not ignore (lvs -o +discards). issue_discards = 1 in lvm.conf only controls discards sent to the PVs when an LV is removed or reduced.
- Snapshot churn on CI pools: hundreds of short-lived thin snapshots can saturate metadata faster than data. Size metadata for the workload, not just the data volume.
- Encryption layers: LUKS below LVM hides discards by default; pass discard in /etc/crypttab to let deletes reach the underlying device.
- Cloning a VM with thin LV inside a thin LV: double CoW blows up amplification. Prefer file-based VM images on thick XFS for simple setups, or a single layer of thin if you want snapshots.
- Chunk size is fixed: it is set per pool at creation and cannot be changed afterwards, and a pool cannot span VGs. Plan chunk size per pool up-front.
- RAID underneath: a thin pool on top of mdraid or lvmraid is fine; note that resync traffic counts against physical I/O but not against data_percent.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Writes hang; dmesg says thin-pool: reached low water mark | Pool near exhaustion | lvextend -L +N data/thinpool; ensure autoextend is enabled; consider --errorwhenfull y |
| thin-pool: no free metadata space | Metadata full | lvextend --poolmetadatasize +1G data/thinpool; if too late, offline lvconvert --repair |
| Pool attr shows D (out of data space) or M (metadata read-only) | Pool flagged unhealthy by dmeventd | Extend the relevant area; lvchange --monitor y after |
| Autoextend never fires | lvm2-monitor disabled, or thresholds at defaults (100%) | systemctl enable --now lvm2-monitor; set thin_pool_autoextend_threshold = 80 |
| Snapshot activation fails after reboot | Merge was pending; origin was active at boot | lvchange -an the origin, then lvchange -ay; lvs -a should show the snap gone |
| XFS snapshot mount fails with needs recovery | Dirty log, as expected | Mount with -o ro,norecovery; never run recovery on a snapshot you plan to discard |
| Deleted files don't shrink data_percent | Discards not propagated | fstrim -av; check that the pool's discards mode is not ignore (lvs -o +discards); verify LUKS discard if present |
| Rollback merge refuses to start | Origin is mounted; merge deferred | Unmount origin, lvchange -an, lvchange -ay; merge completes during activation |
| Chunk size "too small" warning at pool creation | Virtual size vs chunk size exceeds metadata limits | Pick a larger chunk (256K/512K) or a larger metadata LV; reconsider overprovision ratio |
| After snapshot removed, space still held | Filesystem never trimmed; chunks retained pending discard | fstrim -v /mountpoint; confirm lvs -a -o +data_percent drops afterwards |
Cross-reference
- LVM: PV/VG/LV fundamentals and thick-LV workflow.
- Backup & Restore: where snapshot-based flows fit into your overall backup plan.
- Postgres backup and MySQL backup: quiescing DBs before taking snapshots.
- sysctl tuning: vm.dirty_*, elevator, and I/O knobs that interact with thin pools under load.
- SSSD: unrelated, but often deployed on the same EL hosts that carry LVM thin pools.
- DR Runbook Template: incorporate pool-exhaustion recovery into your DR plan.