The model
Every service exposes a /metrics endpoint in the Prometheus text format. Prometheus scrapes those endpoints on a schedule (every 15s in the configuration below; the built-in default is 1m), labels each sample with the target's identity (the job and instance labels), and stores it in its local time-series database. Alerting rules are evaluated on a fixed interval as well; when one fires, Prometheus sends the alert to Alertmanager, which decides where (and whether) to notify.
Install Prometheus
sudo useradd -r -s /sbin/nologin prometheus
sudo mkdir -p /var/lib/prometheus /etc/prometheus/rules
sudo chown prometheus:prometheus /var/lib/prometheus
PV=2.55.0
curl -L -o /tmp/prom.tar.gz \
  "https://github.com/prometheus/prometheus/releases/download/v${PV}/prometheus-${PV}.linux-amd64.tar.gz"
tar -xzf /tmp/prom.tar.gz -C /tmp
sudo mv /tmp/prometheus-${PV}.linux-amd64/prometheus /usr/local/bin/
sudo mv /tmp/prometheus-${PV}.linux-amd64/promtool /usr/local/bin/
sudo mv /tmp/prometheus-${PV}.linux-amd64/consoles /etc/prometheus/
sudo mv /tmp/prometheus-${PV}.linux-amd64/console_libraries /etc/prometheus/
sudo tee /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: lab
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: [ "alertmanager:9093" ]
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: [ "localhost:9090" ]
  - job_name: node
    static_configs:
      - targets: [ "node1:9100", "node2:9100", "node3:9100" ]
  - job_name: cadvisor
    static_configs:
      - targets: [ "node1:8080", "node2:8080" ]
EOF
sudo tee /etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=90d \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.console.templates=/etc/prometheus/consoles \
  --web.enable-lifecycle
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now prometheus
--web.enable-lifecycle lets curl -X POST http://localhost:9090/-/reload reload the config without restarting the server.
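Prometheus rejects a broken config on reload and keeps running with the old one, but it is easier to catch the mistake before sending the request. A minimal sketch that wraps both steps in one shell function; the name reload_prometheus and the PROM_URL variable are illustrative and not part of Prometheus:

```shell
# Hypothetical helper: validate the config with promtool, then ask the
# running server to reload it. Requires --web.enable-lifecycle.
reload_prometheus() {
  cfg="${1:-/etc/prometheus/prometheus.yml}"
  url="${PROM_URL:-http://localhost:9090}"

  # promtool exits non-zero on a syntax or semantic error,
  # including errors in the rule files the config references.
  promtool check config "$cfg" || return 1

  # -f makes curl fail on an HTTP error status instead of printing the body.
  curl -fsS -X POST "$url/-/reload"
}
```

Call it with no arguments after editing /etc/prometheus/prometheus.yml, or pass another path as the first argument.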
node_exporter on every host
node_exporter covers per-host CPU, memory, disk, network, filesystem, load average, and NTP drift. It is a single binary that reads /proc and /sys and exposes what it finds as Prometheus metrics:
NV=1.9.0
curl -L -o /tmp/node.tar.gz \
  "https://github.com/prometheus/node_exporter/releases/download/v${NV}/node_exporter-${NV}.linux-amd64.tar.gz"
tar -xzf /tmp/node.tar.gz -C /tmp
sudo mv /tmp/node_exporter-${NV}.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -r -s /sbin/nologin node_exporter
sudo tee /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now node_exporter
curl http://localhost:9100/metrics | head
The output is the Prometheus text format: for each metric, a # HELP line, a # TYPE line, then one metric{labels} value line per series.
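To get a feel for the format, it helps to pull out just the # TYPE lines. The sketch below does that with awk against a small hand-written sample; the function names and the sample values are made up for illustration:

```shell
# Hypothetical helper: list metric families and their types from
# text-format input on stdin, using the "# TYPE <name> <type>" lines.
metric_types() {
  awk '$1 == "#" && $2 == "TYPE" { print $3, $4 }'
}

# A hand-written sample in the same format (values are made up).
sample_metrics() {
  cat <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 234.5
EOF
}

# Prints "node_load1 gauge" and "node_cpu_seconds_total counter".
sample_metrics | metric_types
```

Against a live exporter, the same filter works as: curl -s http://localhost:9100/metrics | metric_types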
cAdvisor for container metrics
For Docker/Podman containers, cAdvisor (Google's container resource analyzer) exposes per-container CPU, memory, network, and I/O:
docker run -d \
  --name cadvisor \
  --restart unless-stopped \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  gcr.io/cadvisor/cadvisor:latest
PromQL by example
# CPU usage per host (idle inverted)
100 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
# Disk usage % per filesystem
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100
# Memory available % per host
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
# Top-10 containers by memory
topk(10, container_memory_working_set_bytes{name!=""})
# Rate of HTTP errors per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
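These expressions can be pasted into the Prometheus web UI, but they can also be run from a shell through the HTTP API, which is handy for scripts and quick checks. A small sketch, assuming Prometheus listens on localhost:9090; the function name promql and the PROM_URL variable are illustrative:

```shell
# Hypothetical helper: run an instant PromQL query via the HTTP API.
# -G sends the data as a GET query string; --data-urlencode escapes
# the braces, quotes and spaces that PromQL expressions contain.
promql() {
  curl -fsS -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, promql 'up{job="node"}' returns a JSON document whose data.result array holds one entry per matching series.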
Alerting rules
# /etc/prometheus/rules/host.yml
groups:
  - name: host
    rules:
      - alert: HostDiskAlmostFull
        expr: |
          100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} disk almost full ({{ $value | humanize }}%)"
      - alert: HostHighCPU
        expr: |
          100 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU >90% for 15m"
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
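The for: field is the dwell time: the condition must hold continuously for that long before the alert fires. Until then the alert sits in the pending state, and a single evaluation where the condition is false resets the timer. Tuning it is one of the most effective ways to cut alert noise, because brief spikes never reach Alertmanager.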
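The pending and firing states are easier to see on a timeline. The following is a toy simulation of that logic in awk, not Prometheus code; the function name and the input format (one 1 or 0 per rule evaluation) are assumptions made for the example:

```shell
# Sketch of the pending -> firing logic behind "for:". Not Prometheus code.
# Usage: simulate_for EVAL_INTERVAL_SECONDS FOR_SECONDS < results
# stdin carries one line per rule evaluation: 1 if the expression matched, else 0.
simulate_for() {
  awk -v step="$1" -v hold="$2" '
    {
      t = (NR - 1) * step
      if ($1 == 1) {
        if (!active) { active = 1; since = t }  # condition just became true
        state = (t - since >= hold) ? "firing" : "pending"
      } else {
        active = 0                               # one false result resets the timer
        state = "inactive"
      }
      printf "%ds %s\n", t, state
    }'
}

# 15s evaluation interval, for: 45s, with a one-evaluation dip at t=30s.
printf '%s\n' 1 1 0 1 1 1 1 1 | simulate_for 15 45
```

In this run the dip at 30s sends the alert back to inactive, so it only reaches firing at 90s, which is 45s after the condition became true again.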
Validate before reloading:
promtool check rules /etc/prometheus/rules/host.yml
curl -X POST http://localhost:9090/-/reload
Alertmanager
AV=0.27.0
curl -L -o /tmp/am.tar.gz \
  "https://github.com/prometheus/alertmanager/releases/download/v${AV}/alertmanager-${AV}.linux-amd64.tar.gz"
tar -xzf /tmp/am.tar.gz -C /tmp
sudo mv /tmp/alertmanager-${AV}.linux-amd64/alertmanager /usr/local/bin/
sudo mv /tmp/alertmanager-${AV}.linux-amd64/amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo tee /etc/alertmanager/alertmanager.yml <<'EOF'
global:
  resolve_timeout: 5m
route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match: { severity: critical }
      receiver: pagerduty
    - match: { severity: warning }
      receiver: slack
receivers:
  - name: default
    email_configs:
      - to: ops@example.com
        from: alerts@example.com
        smarthost: smtp.example.com:587
        auth_username: alerts@example.com
        auth_password: '<app-password>'
        require_tls: true
  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T.../B.../...'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Instance:* {{ .Labels.instance }}
          *Description:* {{ .Annotations.summary }}
          {{ end }}
  - name: pagerduty
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
EOF
Three concepts in that config:
- group_by: alerts that share the same values for these labels are collapsed into one notification.
- group_wait: how long to wait before sending the first notification for a new group, so that related alerts can roll in.
- repeat_interval: how often to re-notify while the alert is still firing.
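Routing mistakes are easier to find before an incident than during one. amtool, installed alongside Alertmanager above, can check the file and report which receiver a given label set would reach. A sketch with two thin wrappers; the function names and the AM_CONFIG variable are illustrative:

```shell
# Hypothetical helpers around amtool; AM_CONFIG is an assumed variable name.
AM_CONFIG="${AM_CONFIG:-/etc/alertmanager/alertmanager.yml}"

# Syntax-check the config file and any templates it references.
check_alertmanager() {
  amtool check-config "$AM_CONFIG"
}

# Print the receiver that an alert with the given labels would be routed to.
# Example: route_for severity=critical alertname=HostDown
route_for() {
  amtool config routes test --config.file="$AM_CONFIG" "$@"
}
```

With the config above, route_for severity=critical should print pagerduty, and a label set that matches no child route, such as severity=info, should fall through to default.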
Silences
Before a planned maintenance window, silence the alerts for the relevant hosts so the team isn't paged:
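# instance values in this setup include the port (node1:9100, node1:8080), so match by regex
amtool silence add 'instance=~"node1:.+"' --duration=2h --comment="Planned reboot" \
  --alertmanager.url=http://localhost:9093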
Or do it via the Alertmanager web UI (http://<host>:9093/).
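For recurring maintenance it can be convenient to wrap the amtool calls. The sketch below assumes Alertmanager is reachable on localhost:9093; the function names and the AM_URL variable are illustrative. It uses a regex matcher because the instance label in this setup carries the port (for example node1:9100), so an exact match on the bare hostname would not hit anything:

```shell
# Hypothetical helpers for maintenance windows; AM_URL is an assumed variable.
AM_URL="${AM_URL:-http://localhost:9093}"

# Silence every alert whose instance label starts with "<host>:".
# Usage: silence_host HOST DURATION COMMENT
# amtool prints the ID of the new silence.
silence_host() {
  amtool --alertmanager.url="$AM_URL" silence add \
    "instance=~\"$1:.+\"" --duration="$2" --comment="$3"
}

# List active silences, and expire one early by ID.
list_silences()  { amtool --alertmanager.url="$AM_URL" silence query; }
expire_silence() { amtool --alertmanager.url="$AM_URL" silence expire "$1"; }
```

A reboot of node1 then becomes silence_host node1 2h "Planned reboot", followed by expire_silence with the printed ID if the work finishes early.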
Grafana for the dashboards
Install Grafana, add Prometheus as a data source (URL http://localhost:9090 if Grafana runs on the same host), then import dashboard 1860 (Node Exporter Full) and dashboard 14282 (cAdvisor). Both are community-maintained, work out of the box, and cover most of the questions a homelab dashboard needs to answer.
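The data source can also be provisioned from a file instead of clicked together in the UI, which keeps the setup reproducible. A sketch, assuming a package install of Grafana that reads provisioning files from /etc/grafana/provisioning; the function name and the default path are assumptions, so adjust them to your install:

```shell
# Hypothetical helper: write a Grafana data source provisioning file.
# The default path assumes a package install of Grafana.
write_grafana_datasource() {
  target="${1:-/etc/grafana/provisioning/datasources/prometheus.yml}"
  cat > "$target" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

Writing under /etc needs root, and Grafana only reads provisioning files at startup, so restart the Grafana service afterwards.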
Long-term storage
Prometheus's local TSDB is great for the recent past (last few weeks). For multi-year retention or cross-replica querying, point Prometheus at a remote_write endpoint:
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
  # or VictoriaMetrics:
  # - url: http://victoriametrics:8428/api/v1/write
  # or Grafana Mimir / Grafana Cloud, etc.
VictoriaMetrics is the easiest single-binary long-term-storage drop-in; Thanos / Mimir are the higher-scale options. For a homelab, 90 days local on a 50 GB disk is usually enough — don't add complexity until you have a question that requires it.
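Whether 50 GB is enough can be checked with the rule of thumb from the Prometheus storage documentation: disk needed is roughly retention time in seconds, times ingested samples per second, times bytes per sample, where a sample typically costs 1 to 2 bytes. A sketch of that arithmetic; the function name and the workload of 20000 active series are assumptions for illustration:

```shell
# Rough disk estimate: retention * samples/s * bytes/sample, in GB (10^9 bytes).
# Usage: estimate_disk_gb ACTIVE_SERIES SCRAPE_INTERVAL_S RETENTION_DAYS [BYTES_PER_SAMPLE]
estimate_disk_gb() {
  awk -v series="$1" -v interval="$2" -v days="$3" -v bytes="${4:-2}" 'BEGIN {
    samples_per_second = series / interval
    printf "%.1f\n", days * 86400 * samples_per_second * bytes / 1e9
  }'
}

# Assumed workload: 20000 active series, 15s scrapes, 90 days, 2 bytes/sample.
estimate_disk_gb 20000 15 90   # prints 20.7
```

The real number of active series is available from Prometheus itself as the prometheus_tsdb_head_series metric, so the estimate can be redone with measured input.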