Prometheus & node_exporter
- Prometheus is a pull system. If it can't reach the target, there's no metric. Plan networking accordingly.
- Use file_sd for any fleet over 20 hosts. Ansible writes the JSON; Prometheus reloads.
- Keep retention small locally (15–30 days) and use remote_write for long-term.
- Alert on symptoms (up == 0, filesystem < 10%), not on causes.
- The textfile collector is the escape hatch for any metric that doesn't have a real exporter: cron jobs, backups, certificate expiry.
- Run promtool check config prometheus.yml in CI. Catches half of all outages before they ship.
Install Prometheus
The distro packages lag. Use the upstream binary, install it under /usr/local/bin, and manage it with systemd.
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus: /var/lib/prometheus
PROM_VER=2.53.2
cd /tmp
curl -sSL -O https://github.com/prometheus/prometheus/releases/download/v${PROM_VER}/prometheus-${PROM_VER}.linux-amd64.tar.gz
tar xzf prometheus-${PROM_VER}.linux-amd64.tar.gz
sudo install -m 0755 prometheus-${PROM_VER}.linux-amd64/{prometheus,promtool} /usr/local/bin/
sudo cp -r prometheus-${PROM_VER}.linux-amd64/{consoles,console_libraries} /etc/prometheus/
Minimal /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    env: prod
    region: eu-west-1
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.internal:9093']
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/node/*.json
        refresh_interval: 30s
systemd unit
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network-online.target
Wants=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--storage.tsdb.retention.size=50GB \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.console.templates=/etc/prometheus/consoles \
--web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
sudo chown -R prometheus: /etc/prometheus
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
curl -s localhost:9090/-/healthy
--web.enable-lifecycle lets you curl -X POST localhost:9090/-/reload instead of restarting. Pair with a config-change handler in your config-management tool.
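A config-change handler is safer if it validates before it reloads, so a broken file never reaches the running server. A minimal sketch, assuming the paths and port used above; reload_prometheus, PROM_CONFIG, and PROM_URL are names made up for this example:

```shell
# Hypothetical helper: validate the config, then ask Prometheus to reload it.
# Assumes promtool and curl are on PATH and --web.enable-lifecycle is set.
PROM_CONFIG="${PROM_CONFIG:-/etc/prometheus/prometheus.yml}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

reload_prometheus() {
    # Refuse to reload if promtool rejects the config or its rule files.
    if ! promtool check config "$PROM_CONFIG" >&2; then
        echo "config check failed, not reloading" >&2
        return 1
    fi
    # -f makes curl exit non-zero on an HTTP error status.
    curl -fsS -X POST "$PROM_URL/-/reload"
}
```

Calling reload_prometheus from the handler keeps the old configuration running whenever the check fails.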
Install node_exporter on a fleet
node_exporter is a single static binary. Install identically on every host, run as a dedicated user, expose port 9100.
# roles/node_exporter/tasks/main.yml
- name: Create node_exporter user
  ansible.builtin.user:
    name: node_exporter
    system: true
    shell: /usr/sbin/nologin
    create_home: false
- name: Download node_exporter
  ansible.builtin.unarchive:
    src: "https://github.com/prometheus/node_exporter/releases/download/v{{ nex_ver }}/node_exporter-{{ nex_ver }}.linux-amd64.tar.gz"
    dest: /tmp
    remote_src: true
    creates: "/tmp/node_exporter-{{ nex_ver }}.linux-amd64/node_exporter"
- name: Install binary
  ansible.builtin.copy:
    remote_src: true
    src: "/tmp/node_exporter-{{ nex_ver }}.linux-amd64/node_exporter"
    dest: /usr/local/bin/node_exporter
    mode: '0755'
  notify: restart node_exporter
- name: Textfile directory
  ansible.builtin.file:
    path: /var/lib/node_exporter/textfile_collector
    state: directory
    owner: node_exporter
    mode: '0755'
- name: systemd unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/node_exporter.service
    mode: '0644'
    content: |
      [Unit]
      Description=Prometheus node_exporter
      After=network-online.target
      [Service]
      User=node_exporter
      ExecStart=/usr/local/bin/node_exporter \
        --collector.systemd \
        --collector.processes \
        --collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
        --web.listen-address=:9100
      Restart=on-failure
      [Install]
      WantedBy=multi-user.target
  notify: restart node_exporter
- name: Enable
  ansible.builtin.systemd:
    name: node_exporter
    enabled: true
    state: started
    daemon_reload: true
Scrape configs — static and file_sd
Two hosts, static_configs is fine. Twenty hosts, write it by hand and live with the regret. A hundred, you want file_sd.
file_sd — Prometheus reads JSON files on disk
// /etc/prometheus/targets/node/prod-web.json
[
  {
    "targets": [
      "web01.prod.internal:9100",
      "web02.prod.internal:9100",
      "web03.prod.internal:9100"
    ],
    "labels": {
      "env": "prod",
      "role": "web",
      "dc": "eu-west-1a"
    }
  }
]
Generate these from Ansible inventory, Consul, or whatever source of truth you have. Prometheus picks up file changes within refresh_interval — no restart.
# Ansible: build the target file from inventory
- name: Render node target file
  ansible.builtin.copy:
    dest: "/etc/prometheus/targets/node/{{ item.key }}.json"
    mode: '0644'
    # Inventory hostnames carry no port, so append the exporter port to each target.
    content: "{{ [{'targets': item.value | map('regex_replace', '$', ':9100') | list, 'labels': {'env': env, 'role': item.key}}] | to_nice_json }}"
  loop: "{{ groups | dict2items | selectattr('key', 'in', roles_to_scrape) | list }}"
  delegate_to: "{{ prometheus_host }}"
  run_once: true
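Any tool that can write a file can feed file_sd. A sketch in plain shell, assuming hostnames arrive without a port, the exporter listens on 9100, and no hostname needs JSON escaping; render_targets and write_targets are hypothetical helpers:

```shell
# Hypothetical helper: print a file_sd JSON document for one role.
# Usage: render_targets ENV ROLE HOST...
render_targets() {
    env_label=$1
    role_label=$2
    shift 2
    printf '[\n  {\n    "targets": ['
    sep=''
    for host in "$@"; do
        # file_sd targets are host:port pairs, so append the exporter port.
        printf '%s"%s:9100"' "$sep" "$host"
        sep=', '
    done
    printf '],\n    "labels": {"env": "%s", "role": "%s"}\n  }\n]\n' \
        "$env_label" "$role_label"
}

# Write to a temp file next to the destination, then rename into place,
# so Prometheus never reads a partially written file.
# Usage: write_targets DEST ENV ROLE HOST...
write_targets() {
    dest=$1
    shift
    tmp=$(mktemp "$dest.XXXXXX") || return 1
    render_targets "$@" > "$tmp" && chmod 0644 "$tmp" && mv "$tmp" "$dest"
}
```

The temp file name does not end in .json, so the *.json glob in file_sd_configs skips it until the rename.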
Relabeling — the part that trips everyone
You rarely want the raw __address__ as the instance label. Drop the port for readability:
- job_name: node
  file_sd_configs:
    - files: [/etc/prometheus/targets/node/*.json]
  relabel_configs:
    - source_labels: [__address__]
      regex: '([^:]+)(?::\d+)?'
      replacement: '${1}'
      target_label: instance
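Prometheus anchors relabel regexes at both ends, so the pattern has to match the whole address. To see what the rule does to sample addresses, the same pattern can be approximated with sed. This is only an illustration: sed's extended regex syntax stands in for Prometheus's RE2, and strip_port is a made-up name.

```shell
# Approximates the relabel rule above: regex '([^:]+)(?::\d+)?', replacement '${1}'.
# The explicit ^ and $ mirror the anchoring Prometheus applies implicitly.
strip_port() {
    printf '%s\n' "$1" | sed -E 's/^([^:]+)(:[0-9]+)?$/\1/'
}

strip_port 'web01.prod.internal:9100'   # web01.prod.internal
strip_port '10.0.0.5:9100'              # 10.0.0.5
strip_port 'web02.prod.internal'        # web02.prod.internal (no port, unchanged)
```

An address the pattern cannot match in full, such as a bracketed IPv6 literal, is left untouched, which is also how the replace action behaves when its regex does not match.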
Enabling collectors
node_exporter ships with dozens of collectors; some are on by default, some not. Enable what you use, disable what you don't.
| Collector | Flag | When to enable |
|---|---|---|
| systemd | --collector.systemd | Always on Linux; gives unit state metrics |
| textfile | --collector.textfile.directory=… | Always; escape hatch for custom metrics |
| processes | --collector.processes | For per-process counts (expensive on huge process counts) |
| ethtool | --collector.ethtool | Physical hosts with real NIC debugging needs |
| hwmon | on by default | Temperature/fan; noisy on VMs, disable with --no-collector.hwmon |
| nfs / nfsd | on by default | Disable if you don't use NFS; otherwise keep |
List all collectors the running binary knows about:
node_exporter --help 2>&1 | grep -E '^\s+--(\[no-\]|no-)?collector\.'
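The help output shows what the binary could run. A live exporter reports what it is actually running through the node_scrape_collector_success metric, one series per active collector. A small sketch that pulls the collector names out of a metrics dump on stdin; list_collectors is a hypothetical name:

```shell
# Reads node_exporter /metrics text on stdin and prints one active collector per line.
list_collectors() {
    sed -n 's/^node_scrape_collector_success{collector="\([^"]*\)"}.*/\1/p' | sort
}

# Typical use against a live exporter (assumes it listens on localhost:9100):
#   curl -s localhost:9100/metrics | list_collectors
```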
The textfile collector
A cron job, a backup script, a certificate renewer — none of these have real exporters. The textfile collector reads *.prom files from a directory and exposes their contents.
#!/usr/bin/env bash
# /usr/local/bin/backup-metrics
set -euo pipefail
TARGET=/var/lib/node_exporter/textfile_collector/backup.prom
# Temp file in the same directory, so the final mv is an atomic rename.
TMP=$(mktemp "${TARGET}.XXXXXX")
START=$(date +%s)
# Capture the exit code without letting set -e abort on a failed backup.
RC=0
/usr/local/bin/restic-backup.sh || RC=$?
END=$(date +%s)
cat > "$TMP" <<EOF
# HELP backup_last_success_unixtime Timestamp of last successful backup
# TYPE backup_last_success_unixtime gauge
backup_last_success_unixtime $( [ "$RC" -eq 0 ] && echo "$END" || echo 0 )
# HELP backup_duration_seconds Duration of last backup run
# TYPE backup_duration_seconds gauge
backup_duration_seconds $((END - START))
# HELP backup_exit_code Last backup exit code (0 == success)
# TYPE backup_exit_code gauge
backup_exit_code $RC
EOF
# mktemp creates the file with mode 0600; node_exporter needs read access.
chmod 0644 "$TMP"
mv "$TMP" "$TARGET"
Always write to a temp file and mv it into place. Writing in place can leave a half-written file that node_exporter fails to parse.
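The same write-then-rename pattern works for any job. A sketch of a reusable helper, assuming the textfile directory configured above; write_prom, record_cron_run, and the cron_last_run_unixtime metric are illustrative names, not part of node_exporter:

```shell
# Hypothetical helper: atomically publish metrics read from stdin as NAME.prom.
# TEXTFILE_DIR defaults to the directory configured for node_exporter above.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"

write_prom() {
    name=$1
    # Temp file lives in the same directory, so mv is a rename, not a copy.
    # Its name does not end in .prom, so the collector ignores it.
    tmp=$(mktemp "$TEXTFILE_DIR/$name.prom.XXXXXX") || return 1
    if cat > "$tmp" && chmod 0644 "$tmp"; then
        mv "$tmp" "$TEXTFILE_DIR/$name.prom"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example: a cron job records when it last completed.
record_cron_run() {
    job=$1
    write_prom "cron_$job" <<EOF
# HELP cron_last_run_unixtime Timestamp of the last completed run
# TYPE cron_last_run_unixtime gauge
cron_last_run_unixtime{job_name="$job"} $(date +%s)
EOF
}
```

The label is called job_name rather than job because Prometheus attaches its own job label at scrape time.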
Now you can alert on missing backups:
time() - backup_last_success_unixtime > 36 * 3600
Alertmanager and useful alerts
Install alertmanager alongside Prometheus, same pattern (systemd unit, dedicated user). Point Prometheus at it via the alerting: block shown above.
A handful of alerts that actually matter
# /etc/prometheus/rules/node.yml
groups:
  - name: node
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "node_exporter on {{ $labels.instance }} is down"
          runbook_url: "https://wiki.internal/runbooks/node-down"
      - alert: NodeFilesystemAlmostFull
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
            / node_filesystem_size_bytes) * 100 < 10
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "{{ $labels.instance }}:{{ $labels.mountpoint }} below 10% free"
      - alert: NodeLoadHigh
        # Aggregate away mode as well as cpu, or the CPU count keeps mode="idle"
        # and never matches the labels on node_load5.
        expr: node_load5 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 30m
        labels:
          severity: ticket
      - alert: BackupStale
        expr: time() - backup_last_success_unixtime > 36 * 3600
        labels:
          severity: page
        annotations:
          summary: "No successful backup on {{ $labels.instance }} in 36h"
Note: severity: page vs severity: ticket — your Alertmanager route splits those to pager vs email. Never page on disk-almost-full; that's a ticket.
Remote-write for long-term storage
Prometheus local storage is a 30-day goldfish. For long-term, ship samples out via remote_write to Mimir, Thanos, VictoriaMetrics, or a hosted service.
remote_write:
  - url: https://mimir.internal/api/v1/push
    basic_auth:
      username: prom-eu-west-1
      password_file: /etc/prometheus/mimir.pass
    queue_config:
      max_samples_per_send: 2000
      capacity: 20000
      max_shards: 50
    write_relabel_configs:
      # Drop noisy series we don't want to keep long-term
      - source_labels: [__name__]
        regex: 'go_gc_.*|process_open_fds'
        action: drop
Monitor the remote-write queue itself: prometheus_remote_storage_samples_pending and prometheus_remote_storage_samples_failed_total. Silent drops are the worst kind.
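These can be spot-checked from the shell before any dashboard exists. A sketch, assuming Prometheus exposes its own metrics on localhost:9090; sum_metric is a hypothetical helper that reads exposition text on stdin and sums one metric across all of its label sets:

```shell
# Sum all samples of one metric from Prometheus exposition text on stdin.
# Usage: sum_metric METRIC_NAME
sum_metric() {
    awk -v name="$1" '
        # Match "name{...} value" or "name value"; skip comments and other metrics.
        $1 == name || index($1, name "{") == 1 { total += $NF }
        END { printf "%d\n", total }
    '
}

# Typical use against the local server:
#   curl -s localhost:9090/metrics | sum_metric prometheus_remote_storage_samples_pending
```

A pending count that keeps growing between runs suggests the remote end is not keeping up.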
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Target is down, but curl on the target itself works | Firewall drops on target, or SELinux | firewall-cmd --add-port=9100/tcp --permanent && firewall-cmd --reload; semanage port -a -t http_port_t -p tcp 9100 |
| Cardinality explosion; Prometheus OOMs | Labeled metric with user_id/uuid | topk(10, count by (__name__)({__name__=~".+"})); drop with metric_relabel_configs |
| Scrape durations rising | Exporter too slow, or too many collectors | Disable unused collectors; measure with scrape_duration_seconds |
| promtool check config errors on rules/*.yml | Wrong indentation or missing groups: | Always wrap rules in a groups: list |
| Textfile metrics disappear | Script wrote in place, parser failed once, then file removed | Always mv from a temp file; keep node_textfile_mtime_seconds > 0 as a health gauge |
| No alerts fire despite metric being obviously bad | Alertmanager unreachable; Prometheus shows prometheus_notifications_alertmanagers_discovered = 0 | Check the alerting: block DNS + port; hit /api/v2/status on AM |
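For the cardinality row, the offending exporter can often be found before its data reaches Prometheus. A sketch that counts series per metric name in raw exposition text; top_metrics is a hypothetical helper:

```shell
# Count series per metric name in exposition text on stdin, largest first.
# Usage: curl -s HOST:PORT/metrics | top_metrics [N]
top_metrics() {
    grep -v -e '^#' -e '^$' \
        | sed 's/[{ ].*//' \
        | sort | uniq -c | sort -rn \
        | head -n "${1:-10}" \
        | awk '{ print $1, $2 }'
}
```

A metric name with thousands of series on a single exporter is the usual candidate for a metric_relabel_configs drop.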
Next: put dashboards on top of all this in Grafana Basics.