Smartctl
The most complete guide I've seen on interpreting SMART HDD info, written by the author of the tool: https://www.linuxjournal.com/article/6983?page=0%2C1
Documentation
Wikipedia SMART attributes https://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes
Tutorials
Watching a hard drive die (in smartctl) http://notemagnet.blogspot.com/2009/10/watching-hard-drive-die.html
Dan Langille monitoring HDD health https://www.freebsddiary.org/smart.php
Install
sudo pacman -S smartmontools     # ArchLinux
sudo pkg install smartmontools   # FreeBSD
Usage
TL;DR - familiarize yourself with your drive's smartctl info early,
determine for each attribute whether to watch the normalized value or the raw value,
and create monitors specific to each drive. If error rates look suspicious, view the smartctl error logs.
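As one possible shape for a per-drive monitor, the sketch below pulls an attribute's RAW_VALUE out of `smartctl -A` output and compares it with a baseline recorded while the drive was known to be healthy. The function names `smart_raw` and `check_attr` and the baseline numbers are hypothetical, not part of smartmontools. The parsing assumes the standard ATA attribute table, where the attribute name is column 2 and the raw value starts at column 10.

```shell
#!/bin/sh
# Sketch of a per-drive monitor. Function names and baselines are illustrative.

# smart_raw ATTRIBUTE_NAME
# Reads `smartctl -A` output on stdin and prints the first field of RAW_VALUE
# (column 10 in the standard ATA attribute table).
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# check_attr ATTRIBUTE_NAME BASELINE
# Reads `smartctl -A` output on stdin.
# Returns 0 if raw <= baseline, 1 if raw > baseline,
# 2 if the attribute is missing or its raw value is not a plain integer.
check_attr() {
    raw=$(smart_raw "$1")
    case "$raw" in
        ''|*[!0-9]*)
            echo "UNKNOWN: $1 has no plain integer raw value"
            return 2
            ;;
    esac
    if [ "$raw" -gt "$2" ]; then
        echo "WARN: $1 raw=$raw baseline=$2"
        return 1
    fi
    return 0
}

# Usage (needs root and a real device; the baseline 0 is an assumption):
#   sudo smartctl -A /dev/sda | check_attr Reallocated_Sector_Ct 0
```

Some vendors pack several counters into one raw value or print it in a non-integer form, so it is worth confirming by eye what each drive reports before trusting a comparison like this.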
Error Logs
sudo smartctl -l error /dev/sda   # list logs of type 'error' (see man page to help read result)
...
Error 2 occurred at disk power-on lifetime: 153 hours (6 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ec 00 00 00 00 00 a0 08      00:00:17.300  IDENTIFY DEVICE

Error 1 occurred at disk power-on lifetime: 154 hours (6 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 01 00 00 a0  Error: UNC 1 sectors at LBA = 0x00000001 = 1

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 03 01 01 00 00 a0 ff      11:17:17.000  READ DMA EXT
  25 03 01 01 00 00 a0 ff      11:17:17.000  READ DMA EXT
  25 03 30 5e 00 d4 48 04      11:17:05.800  READ DMA EXT
  25 03 40 4f 00 d4 40 00      11:16:56.600  READ DMA EXT
  35 03 08 4f 00 9c 40 00      11:16:56.600  WRITE DMA EXT
Auto Health Assessment (not always accurate)
sudo smartctl -H /dev/sda   # overall health assessment
SMART overall-health self-assessment test result: PASSED
SmartCtl Attributes and values
sudo smartctl -a /dev/sda # list smartctl values
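The attribute table printed by this command carries both normalized columns (VALUE, WORST, THRESH) and a vendor-specific RAW_VALUE. smartmontools treats an attribute as failed when its normalized VALUE is at or below THRESH. A minimal sketch of that check follows; the function name `failing_attrs` is hypothetical, and the parsing assumes the standard ATA table layout as printed by `smartctl -A` (attributes only).

```shell
# failing_attrs: reads `smartctl -A` output on stdin and prints the name of
# every attribute whose normalized VALUE is at or below its THRESH.
# Assumed column layout: ID# NAME FLAG VALUE WORST THRESH TYPE ...
# The FLAG test ($3 starts with 0x) keeps non-table lines from matching.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-fA-F]+$/ &&
         $4 ~ /^[0-9]+$/ && $6 ~ /^[0-9]+$/ &&
         ($4 + 0) <= ($6 + 0) { print $2 }'
}

# Usage (needs root and a real device):
#   sudo smartctl -A /dev/sda | failing_attrs
```

A threshold check alone can stay silent on a drive that is already degrading: in the sample output in this section, Reallocated_Sector_Ct has a raw value of 7 while its VALUE is still 100 against a THRESH of 005. That is why the checks listed here look at raw values.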
- check raw_value of Reallocated_Sector_Ct and Reallocated_Event_Count. A value over 0 means sectors have already been remapped, and the drive is starting to die.
- check Power_On_Hours to verify your error log entries pertain to the current drive issues.
- check Total_LBAs_Written on SSDs. SSDs have a maximum number of writes. The raw_value generally counts logical sectors, but the unit is vendor-specific: on INTEL drives, for example, raw_value is incremented by 1 for every 32MB written. Find your logical sector size using fdisk -l. http://www.dewassoc.com/kbase/hard_drives/lba.htm
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   092   092   016    Pre-fail  Always       -       2621443
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   111   111   024    Pre-fail  Always       -       600 (Average 660)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       74
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       7
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       154
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       69
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       77
193 Load_Cycle_Count        0x0012   100   100   050    Old_age   Always       -       77
194 Temperature_Celsius     0x0002   114   114   000    Old_age   Always       -       48 (Lifetime Min/Max 18/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       4
...
Self Testing
sudo smartctl -t short /dev/sda      # run a short test
sudo smartctl -t long /dev/sda       # run a long test
sudo smartctl -l selftest /dev/sda   # print self test report
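The self-test commands return immediately and the test runs in the background on the drive, so the result has to be read back from the self-test log afterwards. The sketch below checks whether the most recent log entry finished cleanly. The function name `last_selftest_ok` is hypothetical, and the parsing assumes the usual ATA self-test log layout, where the newest entry is the line starting with "# 1" and a clean run has the status "Completed without error".

```shell
# last_selftest_ok: reads `smartctl -l selftest` output on stdin.
# Returns 0 if the most recent entry (the line starting with "# 1") reports
# "Completed without error"; returns 1 otherwise, including when the log
# has no entries at all.
last_selftest_ok() {
    awk '/^# *1 / { found = 1; ok = ($0 ~ /Completed without error/); exit }
         END { exit !(found && ok) }'
}

# Usage (needs root and a real device):
#   sudo smartctl -t short /dev/sda
#   ... wait for the estimated completion time that smartctl prints ...
#   sudo smartctl -l selftest /dev/sda | last_selftest_ok && echo "self-test passed"
```

Reading the log before the test has finished shows the previous run (or a test still in progress) as entry 1, so the wait in the usage lines matters.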