Install
# Debian / Ubuntu
sudo apt install smartmontools
# Fedora / RHEL
sudo dnf install smartmontools
# Arch
sudo pacman -S smartmontools
# macOS
brew install smartmontools
smartctl --version
Identify your disks
sudo smartctl --scan
# /dev/sda -d sat # /dev/sda [SAT], ATA device
# /dev/sdb -d sat
# /dev/nvme0 -d nvme # /dev/nvme0, NVMe device
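-d <type> tells smartctl which protocol to use when talking to the drive: sat (SCSI-to-ATA Translation) for SATA disks that the OS reaches through a SCSI layer, such as a USB bridge or an AHCI controller on Linux; nvme for NVMe; scsi for SAS; and so on. --scan prints its best guess for each device, and smartctl usually auto-detects the type when -d is omitted.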
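To act on every detected disk in one pass, the scan output can be split into device and type. This is a minimal sketch, assuming the line layout shown above (device, -d, type, then a comment); scan_to_pairs is a hypothetical helper name, not part of smartmontools.

```shell
# Print "device type" pairs from `smartctl --scan` output read on stdin.
# Assumes each line looks like: <device> -d <type> # comment
scan_to_pairs() {
    awk '$2 == "-d" { print $1, $3 }'
}

# Usage (needs root and real disks):
#   sudo smartctl --scan | scan_to_pairs | while read -r dev type; do
#       sudo smartctl -H -d "$type" "$dev" </dev/null
#   done
```

Fed the example scan output above, scan_to_pairs would print /dev/sda sat, /dev/sdb sat, and /dev/nvme0 nvme, one pair per line.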
Read SMART data
# Health summary
sudo smartctl -H /dev/sda
# SMART overall-health self-assessment test result: PASSED
# Full SMART data + drive identification
sudo smartctl -a /dev/sda
# Attribute table only (the set of attributes varies by drive vendor)
sudo smartctl -A /dev/sda
The attribute table is the meat. Key columns: ID, name, normalized value, threshold, raw value. Each vendor's interpretation differs slightly; the names that matter most:
- 5 Reallocated_Sector_Ct — sectors the drive remapped after failure. Anything >0 is a flag; rapid growth is "replace soon."
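- 197 Current_Pending_Sector — unstable sectors waiting to be remapped. A sector leaves the list if a later read or write succeeds and is reallocated if it fails, so a count that stays above 0 or keeps growing signals trouble.
- 198 Offline_Uncorrectable — sectors that could not be read or corrected during offline scans. Any non-zero value is a bad sign.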
- 196 Reallocated_Event_Count — how many distinct reallocation events; complementary to attribute 5.
- 199 UDMA_CRC_Error_Count — cable / connection errors (not necessarily the drive's fault). Sudden spikes = bad SATA cable.
- 9 Power_On_Hours — total uptime. Useful for "is this drive 10 years old?"
- 194 Temperature_Celsius — current temp. Sustained >55°C significantly shortens HDD life.
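The "anything >0 is a flag" rule for attributes 5, 197, and 198 can be checked mechanically. This is a rough sketch, assuming the usual ten-column ATA attribute table printed by smartctl -A with RAW_VALUE in column 10; flag_nonzero is a hypothetical helper name, and the default ID list is only the three attributes named above.

```shell
# Read `smartctl -A` output on stdin and print "ID NAME RAW" for every
# watched attribute whose raw value is above zero.
# $1 = space-separated attribute IDs to watch (default: 5 197 198).
# Assumption: RAW_VALUE is the 10th whitespace-separated column.
flag_nonzero() {
    awk -v ids=" ${1:-5 197 198} " '
        $1 ~ /^[0-9]+$/ && NF >= 10 && index(ids, " " $1 " ") && $10 + 0 > 0 {
            print $1, $2, $10
        }'
}

# Usage (needs root and a real disk):
#   sudo smartctl -A /dev/sda | flag_nonzero
#   sudo smartctl -A /dev/sda | flag_nonzero "5 196 197 198 199"
```

An empty result means none of the watched raw counters has moved off zero; it says nothing about attributes outside the list.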
For SSDs / NVMe
SMART for SSDs uses different attributes:
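- 177 Wear_Leveling_Count (Samsung) or 173 Wear_Leveling_Count (Crucial / Micron; some models and smartmontools versions label it differently, for example 202 Percent_Lifetime_Remain) — remaining life as a percentage, read from the normalized value rather than the raw value. 100 = new, decreasing with writes.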
- 233 Media_Wearout_Indicator (Intel) — similar.
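- 241 Total_LBAs_Written — total writes over the drive's lifetime. Multiply by the logical sector size (usually 512 bytes; some vendors report in other units) and compare to the drive's rated TBW (terabytes written) endurance.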
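Comparing Total_LBAs_Written against a TBW rating takes one multiplication. This is a small sketch, assuming the raw value counts logical sectors of 512 bytes unless told otherwise; lbas_to_tb is a hypothetical helper name, and drives that report in other units need a different multiplier.

```shell
# Convert a raw Total_LBAs_Written count to decimal terabytes.
# $1 = raw LBA count, $2 = bytes per LBA (assumed 512 if omitted).
lbas_to_tb() {
    awk -v lbas="$1" -v bytes="${2:-512}" \
        'BEGIN { printf "%.2f\n", lbas * bytes / 1e12 }'
}

# Example: 10,000,000,000 LBAs of 512 bytes
#   lbas_to_tb 10000000000    # prints 5.12
```

A drive rated for 150 TBW that reports 5.12 TB written has used only a small fraction of its rated endurance.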
For NVMe drives, smartctl reads NVMe-native data:
sudo smartctl -a /dev/nvme0
# Critical Warning: 0x00
# Temperature: 38 Celsius
# Available Spare: 100%
# Available Spare Threshold: 10%
# Percentage Used: 3%
# Data Units Read: 12,345,678 [6.32 TB]
# Data Units Written: 5,432,109 [2.78 TB]
# Media and Data Integrity Errors: 0
# Error Information Log Entries: 0
"Percentage Used" is the headline: the drive's estimate of how much of its rated endurance has been consumed. Replace when it hits ~90%.
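The ~90% rule can be turned into a simple check on the NVMe report. This is a sketch only, assuming the field labels shown in the sample output above; nvme_wear_check is a hypothetical helper name, and the extra spare-versus-threshold comparison is an added assumption about what is worth flagging.

```shell
# Read `smartctl -a /dev/nvmeX` output on stdin and print one status line.
# $1 = "Percentage Used" limit that triggers a warning (default 90).
# Also warns if Available Spare has dropped below its threshold.
nvme_wear_check() {
    awk -F': *' -v limit="${1:-90}" '
        function pct(s) { gsub(/%/, "", s); return s + 0 }
        /^Percentage Used:/           { used   = pct($2) }
        /^Available Spare:/           { spare  = pct($2) }
        /^Available Spare Threshold:/ { thresh = pct($2) }
        END {
            status = "OK"
            if (used >= limit || spare < thresh) status = "WARN"
            printf "%s used=%d%% spare=%d%% threshold=%d%%\n", status, used, spare, thresh
        }'
}

# Usage (needs root and a real NVMe drive):
#   sudo smartctl -a /dev/nvme0 | nvme_wear_check
```

With the sample values above (3% used, 100% spare, 10% threshold) this prints OK used=3% spare=100% threshold=10%.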
Self-tests: let the drive check itself
# Quick (2 minutes; reads from various sectors)
sudo smartctl -t short /dev/sda
# Long (hours; reads every sector)
sudo smartctl -t long /dev/sda
# Conveyance (for newly-shipped drives; checks for damage in transit)
sudo smartctl -t conveyance /dev/sda
# View results
sudo smartctl -l selftest /dev/sda
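Self-tests run in the drive's firmware and yield to host I/O, so the performance impact is usually small, though a long test takes longer on a busy drive. Run a short test often and a long test less often, for example weekly and monthly via cron / systemd timer, or let smartd schedule them as described in the next section.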
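When tests run unattended, someone still has to read the log. This is a minimal sketch for picking out problem entries, assuming the ATA self-test log layout in which each entry starts with "# <number>" and a clean run reads "Completed without error"; failed_selftests is a hypothetical helper name, and NVMe logs use a different layout.

```shell
# Read `smartctl -l selftest` output on stdin and print every log entry
# that is neither a clean pass nor a test still in progress.
# Prints nothing when all recorded tests completed without error.
failed_selftests() {
    grep -E '^# *[0-9]+ ' | grep -Ev 'Completed without error|in progress'
}

# Usage (needs root and a real disk):
#   sudo smartctl -l selftest /dev/sda | failed_selftests
```

Aborted or interrupted tests also show up in the result, so treat any output as a prompt to look at the full log rather than as proof of a failing drive.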
smartd: monitor + email on warning
The smartd daemon scans drives on a schedule, runs tests, and emails you when bad attributes change. Edit /etc/smartd.conf:
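# Scan all disks + email on issues + short test daily at 02:00, long test Saturdays at 03:00
# (-M test sends a test email each time smartd starts; remove it once mail works)
DEVICESCAN -a -m admin@example.com -M test \
-s (S/../.././02|L/../../6/03) \
-W 4,45,55 # log temp changes >=4°C, log at >=45°C, warn + email at >=55°C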
# Or per-disk for finer control
/dev/sda -d sat -a -m admin@example.com \
-s (S/../.././02|L/../../6/03) \
-W 4,45,55
/dev/nvme0 -d nvme -a -m admin@example.com
The cryptic -s argument is the test schedule: short test every day at 02:00, long test every Saturday at 03:00.
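The schedule is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour). To try a pattern before putting it in smartd.conf, the following sketch approximates that match with grep; schedule_matches and schedule_now are hypothetical helper names, and whole-line matching with grep -x is an assumption about how smartd anchors the pattern.

```shell
# Succeed if the -s regex in $1 matches the whole schedule string in $2.
# Approximates smartd's matching with POSIX extended regex, full-line match.
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Build the schedule string for the current hour and test type $1 (S, L, C, O).
# date's %u gives the day of week as 1 (Monday) through 7 (Sunday).
schedule_now() {
    printf '%s/%s\n' "$1" "$(date +%m/%d/%u/%H)"
}

# Example: would a short test be due right now?
#   schedule_matches '(S/../.././02|L/../../6/03)' "$(schedule_now S)" && echo due
```

For instance, S/03/15/5/02 (a Friday at 02:00) matches the pattern above through the short-test branch, while L/03/15/5/03 does not, because the long-test branch requires day 6.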
Enable + start the daemon:
sudo systemctl enable --now smartd
sudo systemctl status smartd
# Test that email works
sudo smartd -q onecheck # one-shot run: registers devices, checks once, sends the -M test emails, then exits
Email backend (msmtp / sendmail)
smartd sends warnings through the mail / mailx command, which in turn needs a working /usr/sbin/sendmail; install msmtp-mta or postfix to provide one:
sudo apt install msmtp-mta
# Configure /etc/msmtprc with your SMTP relay (Gmail, Mailgun, etc.)
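Or skip email entirely and use the executable-on-warning option (-M exec /usr/local/bin/disk-alert.sh) to fire a script that posts to ntfy / Discord / Slack. -M only takes effect together with -m, so pass the placeholder address <nomailer> to run the script without sending mail:
/dev/sda -d sat -a -m <nomailer> -M exec /usr/local/bin/disk-alert.sh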
# /usr/local/bin/disk-alert.sh
#!/bin/bash
curl -d "Disk $SMARTD_DEVICESTRING raised SMART warning: $SMARTD_MESSAGE" \
ntfy.example.com/disks
Push into Prometheus
The smartctl_exporter turns SMART attributes into Prometheus metrics:
docker run -d --name smartctl-exporter \
--restart unless-stopped \
-p 9633:9633 \
--privileged \
-v /dev:/dev:ro \
prometheuscommunity/smartctl-exporter
Scrape the exporter from Prometheus, build a Grafana dashboard for disk health across the whole fleet, and add PromQL alerts on rising reallocated-sector counts.
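Before wiring up alerts, it helps to confirm what the exporter actually emits. The sketch below filters the metrics text for non-zero reallocated-sector counts. The metric name smartctl_device_attribute and the labels attribute_id and attribute_value_type are assumptions about the exporter's output and may differ between versions, so check them against your own /metrics page; nonzero_reallocated is a hypothetical helper name.

```shell
# Read Prometheus text-format metrics on stdin and print the raw
# reallocated-sector samples (attribute 5) whose value is above zero.
# Assumed metric and label names; verify against your exporter's output.
nonzero_reallocated() {
    grep '^smartctl_device_attribute{' |
        grep 'attribute_id="5"' |
        grep 'attribute_value_type="raw"' |
        awk '$NF + 0 > 0'
}

# Usage (exporter listening on the port published above):
#   curl -s http://localhost:9633/metrics | nonzero_reallocated
```

If the label names match, the same selector carries over to a PromQL alert expression on that metric.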
Read the data with context
A few SMART attributes have well-known meaning across drives; many are vendor-specific or interpreted differently. For HDDs, Backblaze's published drive-failure data (their quarterly Drive Stats reports) is the gold-standard reference for "which SMART attributes actually predict failure."
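The headline: attributes 5 (Reallocated_Sector_Ct), 187 (Reported_Uncorrect), 188 (Command_Timeout), 197 (Current_Pending_Sector), and 198 (Offline_Uncorrectable) are the strongest failure predictors for HDDs. Any of these going non-zero, especially trending up, is "schedule a replacement" territory.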
For RAID / ZFS
Filesystem-level checksums (ZFS, Btrfs) catch silent corruption, but they can't replace SMART monitoring — ZFS doesn't tell you the drive is heading for failure until it actually corrupts data. Run both: ZFS scrub for silent-corruption catching, SMART for early-warning replacement timing.
What SMART won't tell you
- Firmware bugs — some drives die from firmware issues without any SMART warning.
- Sudden controller failures — the drive's electronics can fail completely with no graceful transition.
- Whole-drive logical errors that the firmware doesn't notice.
As a rough figure, SMART gives advance warning for about 60% of failures (the exact share varies by study and drive population); the rest are "the drive disappears one day." That's why backups exist (see restic tutorial) and why RAID isn't optional for important data.