smartmontools: read SMART data, predict disk failures

Install

# Debian / Ubuntu
sudo apt install smartmontools

# Fedora / RHEL
sudo dnf install smartmontools

# Arch
sudo pacman -S smartmontools

# macOS
brew install smartmontools

smartctl --version

Identify your disks

sudo smartctl --scan
# /dev/sda -d sat # /dev/sda [SAT], ATA device
# /dev/sdb -d sat
# /dev/nvme0 -d nvme # /dev/nvme0, NVMe device

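-d <type> tells smartctl which protocol to use when talking to the drive: sat (SCSI-to-ATA Translation) for SATA disks, including those behind a USB bridge, nvme for NVMe, scsi for SAS, and so on. --scan lists the devices it finds along with a guessed type; --scan-open also opens each device, which lets autodetection confirm or correct that type.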
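
To run the same check on every disk that --scan reports, the scan output can be split into device and type pairs. This is a minimal sketch: parse_scan and check_all_disks are hypothetical helper names, and it assumes each scan line has the form "DEVICE -d TYPE # comment" as in the sample above.

```shell
# parse_scan: read `smartctl --scan` output on stdin and print
# "DEVICE TYPE" pairs. Everything after '#' on a line is a comment.
parse_scan() {
    awk '{ sub(/#.*/, "") } $2 == "-d" { print $1, $3 }'
}

# check_all_disks (hypothetical helper): print the health summary
# for every device that the scan reports. Needs root.
check_all_disks() {
    smartctl --scan | parse_scan | while read -r dev type; do
        echo "== $dev ($type)"
        smartctl -H -d "$type" "$dev"
    done
}
```

Run check_all_disks as root; parse_scan on its own is handy for feeding device lists to other scripts.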

Read SMART data

# Health summary
sudo smartctl -H /dev/sda
# SMART overall-health self-assessment test result: PASSED

# Full SMART data + drive identification
sudo smartctl -a /dev/sda

# The interesting attributes (varies by drive vendor)
sudo smartctl -A /dev/sda

The attribute table is the meat. Key columns: ID, name, normalized value, threshold, raw value. Each vendor's interpretation differs slightly; the names that matter most:

5 Reallocated_Sector_Ct — sectors the drive remapped after failure. Anything >0 is a flag; rapid growth is "replace soon."
197 Current_Pending_Sector — sectors waiting for reallocation. >0 means a future failure if those reads come back bad.
198 Offline_Uncorrectable — uncorrectable read errors. Bad sign.
196 Reallocated_Event_Count — how many distinct reallocation events; complementary to attribute 5.
199 UDMA_CRC_Error_Count — cable / connection errors (not necessarily the drive's fault). Sudden spikes = bad SATA cable.
9 Power_On_Hours — total uptime. Useful for "is this drive 10 years old?"
194 Temperature_Celsius — current temp. Sustained >55°C significantly shortens HDD life.
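
As a rough illustration of watching the sector-related attributes from that list, the sketch below prints any of 5, 197, or 198 whose raw value is non-zero. flag_bad_attributes is a hypothetical name, and the column positions assume the usual ATA attribute table printed by smartctl -A (ID in column 1, name in column 2, raw value starting in column 10); vendor-specific layouts may differ.

```shell
# flag_bad_attributes: read `smartctl -A` output on stdin and print
# "ID NAME RAW" for attributes 5, 197 and 198 with a non-zero raw value.
flag_bad_attributes() {
    awk '
        $1 ~ /^[0-9]+$/ && ($1 == 5 || $1 == 197 || $1 == 198) {
            # $10 is the first token of RAW_VALUE in the standard layout
            if ($10 + 0 > 0) print $1, $2, $10
        }
    '
}
```

Usage: sudo smartctl -A /dev/sda | flag_bad_attributes. No output means all three counters are zero.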

For SSDs / NVMe

SMART for SSDs uses different attributes:

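177 Wear_Leveling_Count (Samsung) or 173 Wear_Leveling_Count (some Crucial / Micron models): remaining life as a percentage, read from the normalized value. 100 = new; it decreases as the drive is written.
233 Media_Wearout_Indicator (Intel): the same idea, also counting down from 100.
241 Total_LBAs_Written: lifetime writes, counted in logical blocks on most drives. Multiply by the logical sector size (usually 512 bytes) to get bytes, then compare against the drive's rated TBW (terabytes written) endurance.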
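
Comparing attribute 241 against a TBW rating needs a unit conversion. The sketch below assumes the raw value really is a count of logical blocks and defaults to 512-byte sectors; some vendors report this attribute in other units, so check the result against the drive's documentation. lbas_to_tb is a hypothetical helper name.

```shell
# lbas_to_tb LBAS [SECTOR_BYTES]
# Convert a raw Total_LBAs_Written count to decimal terabytes (10^12 bytes).
# SECTOR_BYTES defaults to 512, which is an assumption about the drive.
lbas_to_tb() {
    local lbas="$1" sector_bytes="${2:-512}"
    awk -v n="$lbas" -v b="$sector_bytes" \
        'BEGIN { printf "%.2f\n", n * b / 1e12 }'
}
```

For example, lbas_to_tb 5432109000 prints 2.78, meaning about 2.78 TB written so far.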

For NVMe drives, smartctl reads NVMe-native data:

sudo smartctl -a /dev/nvme0
# Critical Warning: 0x00
# Temperature: 38 Celsius
# Available Spare: 100%
# Available Spare Threshold: 10%
# Percentage Used: 3%
# Data Units Read: 12,345,678 [6.32 TB]
# Data Units Written: 5,432,109 [2.78 TB]
# Media and Data Integrity Errors: 0
# Error Information Log Entries: 0

"Percentage Used" is the headline: the drive's estimate of how much of its rated endurance has been consumed. Replace when it hits ~90%.

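To act on Percentage Used from a script, the value can be pulled out of the text report. This is a minimal sketch: both function names are hypothetical, the default limit of 90 mirrors the rule of thumb above, and the parsing assumes the "Percentage Used:" line starts at the beginning of a line as in smartctl's NVMe report.

```shell
# nvme_percentage_used: read `smartctl -a /dev/nvmeX` output on stdin
# and print the number from the "Percentage Used:" line.
nvme_percentage_used() {
    awk -F: '/^Percentage Used:/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# nvme_wear_check [LIMIT]: same input; returns 0 below LIMIT (default 90),
# 1 at or above it, and 2 if the line was not found.
nvme_wear_check() {
    local limit="${1:-90}" used
    used="$(nvme_percentage_used)"
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$limit" ]; then
        echo "REPLACE: ${used}% of rated endurance used"
        return 1
    fi
    echo "OK: ${used}% of rated endurance used"
}
```

Usage: sudo smartctl -a /dev/nvme0 | nvme_wear_check 90
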
Self-tests: let the drive check itself

# Quick (2 minutes; reads from various sectors)
sudo smartctl -t short /dev/sda

# Long (hours; reads every sector)
sudo smartctl -t long /dev/sda

# Conveyance (for newly-shipped drives; checks for damage in transit)
sudo smartctl -t conveyance /dev/sda

# View results
sudo smartctl -l selftest /dev/sda

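Self-tests run inside the drive's firmware and yield to host I/O, so the performance impact is usually small, although a long test takes longer on a busy disk. Schedule a short test weekly and a long test monthly with cron or a systemd timer, or let smartd handle the schedule (next section).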
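
Because smartctl -t returns immediately and the test finishes in the background, a scheduled job needs a separate step to check the result. The sketch below reads the self-test log and succeeds only if the most recent entry completed without error. last_selftest_ok is a hypothetical name, and the matching assumes the ATA log layout, where the newest entry is the line starting with "# 1".

```shell
# last_selftest_ok: read `smartctl -l selftest` output on stdin.
# Exit 0 if the newest entry ("# 1") says "Completed without error",
# exit 1 otherwise (including when no entry is found).
last_selftest_ok() {
    awk '
        /^# *1 / { found = 1; if ($0 ~ /Completed without error/) ok = 1 }
        END { exit !(found && ok) }
    '
}
```

Usage: sudo smartctl -l selftest /dev/sda | last_selftest_ok || echo "self-test problem on /dev/sda"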

smartd: monitor + email on warning

The smartd daemon scans drives on a schedule, runs tests, and emails you when bad attributes change. Edit /etc/smartd.conf:

# Scan all drives, email on problems, short test daily, long test weekly.
# -M test sends a test email each time smartd starts; remove it once delivery works.
DEVICESCAN -a -m admin@example.com -M test \
    -s (S/../.././02|L/../../6/03) \
    -W 4,45,55    # log temp changes of 4°C or more, log at 45°C, log + email at 55°C

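# Or list disks individually for finer control. Use this instead of DEVICESCAN:
# smartd ignores every line that follows a DEVICESCAN entry.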
/dev/sda -d sat -a -m admin@example.com \
    -s (S/../.././02|L/../../6/03) \
    -W 4,45,55

/dev/nvme0 -d nvme -a -m admin@example.com

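The cryptic -s argument is the test schedule: a regular expression that smartd matches against the string T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour). This one runs a short test every day at 02:00 and a long test every Saturday at 03:00.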
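
A schedule expression can be sanity-checked before it goes into smartd.conf by building the T/MM/DD/d/HH string for a given moment and matching it with grep. This sketch only imitates the matching; it is not smartd's own code, schedule_matches is a hypothetical name, and it assumes the expression has to match the whole string.

```shell
# schedule_matches REGEX TYPE MONTH DAY DOW HOUR
# TYPE is S (short), L (long), etc.; DOW is 1 (Monday) to 7 (Sunday).
# Returns 0 if a test of TYPE would be due at that moment under REGEX.
schedule_matches() {
    awk -v t="$2" -v m="$3" -v d="$4" -v w="$5" -v h="$6" \
        'BEGIN { printf "%s/%02d/%02d/%d/%02d\n", t, m, d, w, h }' |
        grep -Eqx -- "$1"
}
```

Usage: schedule_matches '(S/../.././02|L/../../6/03)' L 3 16 6 3 && echo "long test due" checks a Saturday (day 6) at 03:00.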

Enable + start the daemon:

sudo systemctl enable --now smartd
sudo systemctl status smartd

# Test that email works
sudo smartd -q onecheck     # one-shot run; logs to syslog and, with -M test in the config, sends a test email

Email backend (msmtp / sendmail)

smartd sends warnings through the mail command (mailx), which in turn needs a working /usr/sbin/sendmail. Install msmtp-mta or postfix to provide one:

sudo apt install msmtp-mta bsd-mailx
# Configure /etc/msmtprc with your SMTP relay (Gmail, Mailgun, etc.)

Or skip email entirely and use the executable-on-warning option (-M exec) to fire a script that posts to ntfy / Discord / Slack. -M directives only take effect alongside -m, so pass -m <nomailer> to run the script without a mail address:

/dev/sda -d sat -a -m <nomailer> -M exec /usr/local/bin/disk-alert.sh

# /usr/local/bin/disk-alert.sh
#!/bin/bash
curl -d "Disk $SMARTD_DEVICESTRING raised SMART warning: $SMARTD_MESSAGE" \
    ntfy.example.com/disks
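
The hook can be exercised by hand before a real warning arrives by setting the SMARTD_* variables yourself. The sketch below factors the message into a function with fallbacks so a missing variable is visible in the notification. format_alert is a hypothetical name; SMARTD_FAILTYPE, SMARTD_DEVICESTRING and SMARTD_MESSAGE are variables smartd exports to -M exec scripts.

```shell
# format_alert: build the notification text from smartd's environment,
# falling back to placeholders when a variable is unset or empty.
format_alert() {
    printf '[%s] %s: %s\n' \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_DEVICESTRING:-unknown device}" \
        "${SMARTD_MESSAGE:-no message}"
}
```

Inside disk-alert.sh this would replace the inline string, for example curl -d "$(format_alert)" ntfy.example.com/disks. A dry run looks like: SMARTD_FAILTYPE=EmailTest SMARTD_DEVICESTRING=/dev/sda SMARTD_MESSAGE=test /usr/local/bin/disk-alert.sh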

Push into Prometheus

The smartctl_exporter turns SMART attributes into Prometheus metrics:

# --user root: an unprivileged container user cannot open the raw disk devices
docker run -d --name smartctl-exporter \
    --restart unless-stopped \
    -p 9633:9633 \
    --privileged \
    --user root \
    -v /dev:/dev:ro \
    prometheuscommunity/smartctl-exporter

Scrape the exporter from Prometheus, build a Grafana dashboard for disk health across the whole fleet, and add PromQL alerts on rising reallocated-sector counts.

Read the data with context

A few SMART attributes have well-known meaning across drives; many are vendor-specific or interpreted differently. For HDDs, Backblaze's published drive-failure data (their quarterly Drive Stats reports) is the gold-standard reference for "which SMART attributes actually predict failure."

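The headline: attributes 5 (Reallocated_Sector_Ct), 187 (Reported_Uncorrect), 188 (Command_Timeout), 197 (Current_Pending_Sector), and 198 (Offline_Uncorrectable) are the strongest failure predictors for HDDs. Any of these going non-zero, especially trending up, is "schedule a replacement" territory.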

For RAID / ZFS

Filesystem-level checksums (ZFS, Btrfs) catch silent corruption, but they can't replace SMART monitoring — ZFS doesn't tell you the drive is heading for failure until it actually corrupts data. Run both: ZFS scrub for silent-corruption catching, SMART for early-warning replacement timing.

What SMART won't tell you

Firmware bugs — some drives die from firmware issues without any SMART warning.
Sudden controller failures — the drive's electronics can fail completely with no graceful transition.
Whole-drive logical errors that the firmware doesn't notice.

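Treat the numbers as rough, since they vary by study and drive population: SMART gives advance warning for something like 60% of failures, and the remaining 40% or so are "the drive disappears one day." That's why backups exist (see restic tutorial) and why RAID isn't optional for important data.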