Prometheus + Alertmanager: metrics and alerting end-to-end

The model

Every service exposes a /metrics endpoint in the Prometheus text format. Prometheus scrapes those endpoints on a schedule (every 15s in the config below; the built-in default is 1m), attaches job and instance labels identifying the target to each sample, and stores it. Alerting rules are evaluated on their own schedule (evaluation_interval); when one fires, the alert is sent to Alertmanager, which decides where (and whether) to notify.

Install Prometheus

sudo useradd -r -s /sbin/nologin prometheus
sudo mkdir -p /var/lib/prometheus /etc/prometheus/rules
sudo chown prometheus:prometheus /var/lib/prometheus

PV=2.55.0
curl -L -o /tmp/prom.tar.gz \
    "https://github.com/prometheus/prometheus/releases/download/v${PV}/prometheus-${PV}.linux-amd64.tar.gz"
tar -xzf /tmp/prom.tar.gz -C /tmp
sudo mv /tmp/prometheus-${PV}.linux-amd64/prometheus     /usr/local/bin/
sudo mv /tmp/prometheus-${PV}.linux-amd64/promtool       /usr/local/bin/
sudo mv /tmp/prometheus-${PV}.linux-amd64/consoles       /etc/prometheus/
sudo mv /tmp/prometheus-${PV}.linux-amd64/console_libraries /etc/prometheus/

sudo tee /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
    cluster: lab

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: [ "alertmanager:9093" ]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: [ "localhost:9090" ]

  - job_name: node
    static_configs:
      - targets: [ "node1:9100", "node2:9100", "node3:9100" ]

  - job_name: cadvisor
    static_configs:
      - targets: [ "node1:8080", "node2:8080" ]
EOF

sudo tee /etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
After=network-online.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=90d \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.console.templates=/etc/prometheus/consoles \
    --web.enable-lifecycle
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now prometheus

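--web.enable-lifecycle enables the HTTP lifecycle endpoints: curl -X POST http://localhost:9090/-/reload reloads the config without restarting the server. The same flag also enables /-/quit, so keep port 9090 off untrusted networks.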

node_exporter on every host

Per-host CPU, memory, disk, network, filesystem, load average, and NTP drift all come from one binary that reads /proc and /sys and exposes the values as Prometheus metrics:

NV=1.9.0
curl -L -o /tmp/node.tar.gz \
    "https://github.com/prometheus/node_exporter/releases/download/v${NV}/node_exporter-${NV}.linux-amd64.tar.gz"
tar -xzf /tmp/node.tar.gz -C /tmp
sudo mv /tmp/node_exporter-${NV}.linux-amd64/node_exporter /usr/local/bin/

sudo useradd -r -s /sbin/nologin node_exporter
sudo tee /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now node_exporter
curl -s http://localhost:9100/metrics | head

The output is the Prometheus text format: lots of # HELP, # TYPE, then metric{labels} value lines.
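
To make the format concrete, here is a minimal sketch. The sample values are invented for illustration, and metric_samples and sample_exposition are hypothetical helper names, not part of node_exporter.

```shell
# Hypothetical helper: drop "# HELP" / "# TYPE" comment lines and blank
# lines, leaving only the sample lines (metric{labels} value).
metric_samples() {
    grep -v -e '^#' -e '^[[:space:]]*$'
}

# Illustrative exposition text; the values are invented for this example.
sample_exposition() {
    cat <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
EOF
}
```

Against a live exporter the same filter would be used as: curl -s http://localhost:9100/metrics | metric_samples | head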

cAdvisor for container metrics

For Docker/Podman containers, cAdvisor (Google's container resource analyzer) exposes per-container CPU, memory, network, and I/O:

docker run -d \
    --name cadvisor \
    --restart unless-stopped \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=8080:8080 \
    gcr.io/cadvisor/cadvisor:latest

PromQL by example

# CPU usage per host (idle inverted)
100 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Disk usage % per filesystem
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
       / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100

# Memory available % per host
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Top-10 containers by memory
topk(10, container_memory_working_set_bytes{name!=""})

# Rate of HTTP errors per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
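
These expressions can be pasted into the Prometheus web UI, or run from a shell through the HTTP API. The following is a small sketch of the latter; prom_query is a hypothetical helper name and the base URL is an assumption that matches the install above.

```shell
# Base URL is an assumption; override PROM_URL if Prometheus runs elsewhere.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query against the HTTP API and print the raw JSON response.
# --get makes curl send the urlencoded data as a query string on a GET request.
prom_query() {
    curl -fsS --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

For example: prom_query 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100'. If jq is installed, piping the output through jq '.data.result' makes it easier to read.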

Alerting rules

# /etc/prometheus/rules/host.yml
groups:
  - name: host
    rules:
      - alert: HostDiskAlmostFull
        expr: |
          100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                 / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} almost full ({{ $value | humanize }}% used)"

      - alert: HostHighCPU
        expr: |
          100 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU >90% for 15m"

      - alert: HostDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"

The for: field is the dwell time: the condition must hold continuously for that long before the alert moves from pending to firing. Short spikes that clear within the window never notify anyone, which makes for: one of the most effective ways to cut alert noise.
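
The pending-then-firing behaviour can be checked offline with promtool's rule unit tests. The sketch below assumes the HostDown rule above lives in /etc/prometheus/rules/host.yml; host_rule_test is a hypothetical helper that prints a test file, and the /etc/prometheus/tests directory is an assumption. The test file has to live outside /etc/prometheus/rules, because the rule_files glob would otherwise try to load it as a rule file.

```shell
# Hypothetical helper: print a promtool unit test for the HostDown rule.
host_rule_test() {
    cat <<'EOF'
rule_files:
  - /etc/prometheus/rules/host.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # up stays at 0 for the whole test (six samples, one per minute)
      - series: 'up{job="node", instance="node1:9100"}'
        values: '0+0x5'
    alert_rule_test:
      # 1m in: the condition is true but the alert is still pending
      - eval_time: 1m
        alertname: HostDown
        exp_alerts: []
      # 3m in: past the 2m dwell time, so the alert is firing
      - eval_time: 3m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: node1:9100
            exp_annotations:
              summary: "node1:9100 is down"
EOF
}
```

One way to use it: sudo mkdir -p /etc/prometheus/tests, then host_rule_test | sudo tee /etc/prometheus/tests/host_test.yml, then promtool test rules /etc/prometheus/tests/host_test.yml.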

Validate before reloading:

promtool check rules /etc/prometheus/rules/host.yml
curl -X POST http://localhost:9090/-/reload
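
The two steps can be tied together so that a reload only happens when validation passes. This is a sketch, not a required workflow; checked_reload is a hypothetical name, and the variables default to the paths used in this guide.

```shell
# Overridable settings; the defaults match the paths used in this guide.
PROMTOOL="${PROMTOOL:-promtool}"
PROM_CONFIG="${PROM_CONFIG:-/etc/prometheus/prometheus.yml}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Validate the main config (promtool also checks the rule files it
# references) and only ask Prometheus to reload when validation succeeds.
checked_reload() {
    "$PROMTOOL" check config "$PROM_CONFIG" || {
        echo "config check failed; not reloading" >&2
        return 1
    }
    curl -fsS -X POST "${PROM_URL}/-/reload"
}
```

Checking the whole config rather than a single rule file also catches a typo in prometheus.yml itself before it reaches the running server.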

Alertmanager

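AV=0.27.0
curl -L -o /tmp/am.tar.gz \
    "https://github.com/prometheus/alertmanager/releases/download/v${AV}/alertmanager-${AV}.linux-amd64.tar.gz"
tar -xzf /tmp/am.tar.gz -C /tmp
sudo mv /tmp/alertmanager-${AV}.linux-amd64/alertmanager /usr/local/bin/
sudo mv /tmp/alertmanager-${AV}.linux-amd64/amtool       /usr/local/bin/

# The config and data directories must exist before the config is written
sudo useradd -r -s /sbin/nologin alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager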

sudo tee /etc/alertmanager/alertmanager.yml <<'EOF'
global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait:      30s
  group_interval:  5m
  repeat_interval: 4h
  routes:
    - matchers: [ 'severity="critical"' ]
      receiver: pagerduty
    - matchers: [ 'severity="warning"' ]
      receiver: slack

receivers:
  - name: default
    email_configs:
      - to: ops@example.com
        from: alerts@example.com
        smarthost: smtp.example.com:587
        auth_username: alerts@example.com
        auth_password: '<app-password>'
        require_tls: true

  - name: slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T.../B.../...'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Instance:* {{ .Labels.instance }}
          *Description:* {{ .Annotations.summary }}
          {{ end }}

  - name: pagerduty
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
EOF
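
The commands above install the binaries and the config but do not start Alertmanager. A unit file mirroring the Prometheus one is a reasonable way to run it. This is a sketch under assumptions: a dedicated alertmanager system user and a /var/lib/alertmanager data directory owned by it are assumed to exist, and alertmanager_unit is a hypothetical helper that only prints the unit.

```shell
# Hypothetical helper: print a systemd unit for Alertmanager.
# The user, group, and storage path are assumptions, not fixed requirements.
alertmanager_unit() {
    cat <<'EOF'
[Unit]
Description=Alertmanager
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager
Restart=always

[Install]
WantedBy=multi-user.target
EOF
}
```

Typical use would be alertmanager_unit | sudo tee /etc/systemd/system/alertmanager.service, followed by sudo systemctl daemon-reload and sudo systemctl enable --now alertmanager. Running amtool check-config /etc/alertmanager/alertmanager.yml beforehand validates the config file.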

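Four settings in that config control how alerts are batched:

group_by: alerts that share the same values for these labels are collapsed into one notification.
group_wait: how long to wait before sending the first notification for a new group, so that related alerts can roll in.
group_interval: how long to wait before notifying about new alerts added to a group that has already been notified.
repeat_interval: how often to re-notify while an alert is still firing.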
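
Before relying on the routing tree in the config above, it can help to check offline which receiver a given label set lands on. amtool has a routes test subcommand for this; route_for below is a hypothetical wrapper around it, and the config path is the one used in this guide.

```shell
AM_CONFIG="${AM_CONFIG:-/etc/alertmanager/alertmanager.yml}"

# Print the receiver(s) the routing tree would pick for a label set,
# without sending anything. Arguments are label pairs such as
# severity=critical.
route_for() {
    amtool config routes test --config.file="$AM_CONFIG" "$@"
}
```

With the config above, route_for severity=critical should name the pagerduty receiver, and a label set that matches neither child route should fall through to default.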

Silences

Before a planned maintenance window, silence the alerts for the relevant hosts so the team isn't paged:

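# instance labels include the port (e.g. node1:9100), so match on the host prefix
amtool --alertmanager.url=http://localhost:9093 silence add 'instance=~"node1:.+"' --duration=2h --comment="Planned reboot"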

Or do it via the Alertmanager web UI (http://<host>:9093/).
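
For repeated maintenance windows, the amtool calls can be wrapped in small helpers. This is a sketch: the function names are hypothetical, the Alertmanager URL is an assumption, and the regex matcher is used because the instance label carries host:port (node1:9100, node1:8080), so an exact match on the bare hostname would silence nothing.

```shell
# URL is an assumption; override AM_URL if Alertmanager runs elsewhere.
AM_URL="${AM_URL:-http://localhost:9093}"

# Silence every alert whose instance label starts with "<host>:", so that
# all exporters on that host are covered.
# $1 = hostname, $2 = duration (e.g. 2h), $3 = comment
silence_host() {
    amtool --alertmanager.url="$AM_URL" silence add \
        "instance=~\"$1:.+\"" \
        --duration="$2" \
        --comment="$3"
}

# List active silences, and expire one early by its ID.
list_silences()  { amtool --alertmanager.url="$AM_URL" silence query; }
expire_silence() { amtool --alertmanager.url="$AM_URL" silence expire "$1"; }
```

For example: silence_host node1 2h "Planned reboot". If the work finishes early, list_silences shows the silence ID to pass to expire_silence.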

Grafana for the dashboards

Install Grafana, add Prometheus as a data source (URL http://localhost:9090 if Grafana runs on the same host), then import dashboard 1860 (Node Exporter Full) and dashboard 14282 (cAdvisor). Both are community-maintained, work out of the box, and cover most of the questions a homelab dashboard needs to answer.
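
The data source can also be provisioned from a file instead of through the UI. The sketch below assumes a package install of Grafana, where provisioning files are read from /etc/grafana/provisioning/datasources/; grafana_datasource is a hypothetical helper and the target file name is arbitrary.

```shell
# Hypothetical helper: print a Grafana data source provisioning file.
# The URL assumes Grafana and Prometheus run on the same host.
grafana_datasource() {
    cat <<'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

Usage would be grafana_datasource | sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml, then sudo systemctl restart grafana-server so Grafana picks up the file.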

Long-term storage

Prometheus's local TSDB is great for the recent past (last few weeks). For multi-year retention or cross-replica querying, point Prometheus at a remote_write endpoint:

remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
  # or, instead (listing both entries would send every sample to both backends):
  # - url: http://victoriametrics:8428/api/v1/write
  # or Grafana Mimir / Grafana Cloud, etc.

VictoriaMetrics is the easiest single-binary long-term-storage drop-in; Thanos / Mimir are the higher-scale options. For a homelab, 90 days local on a 50 GB disk is usually enough — don't add complexity until you have a question that requires it.
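
Whether 50 GB is enough depends on the ingestion rate. A rough estimate follows the sizing rule from the Prometheus storage documentation: retention in seconds times ingested samples per second times bytes per sample, with roughly 1 to 2 bytes per sample after compression. The helper below is a back-of-the-envelope sketch; tsdb_bytes is a hypothetical name.

```shell
# Rough TSDB size estimate:
#   bytes = retention_seconds * samples_per_second * bytes_per_sample
# $1 = retention in days, $2 = ingested samples/s, $3 = bytes per sample
tsdb_bytes() {
    echo $(( $1 * 86400 * $2 * $3 ))
}
```

For example, tsdb_bytes 90 3000 2 prints 46656000000 (about 46.7 GB), so at the pessimistic 2 bytes per sample, 90 days fits in 50 GB up to roughly 3,000 samples per second. With a 15s scrape interval that is about 45,000 active series. The current ingestion rate can be read from rate(prometheus_tsdb_head_samples_appended_total[5m]).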