====== xymon-SMART.sh ======

^ Author | [[ jlaidman+xymon-smart@rebel-it.com.au | Jeremy Laidman ]] |
^ Compatibility | Xymon 4.3.3 |
^ Requirements | smartmontools, GNU ls, GNU date |
^ Download | None |
^ Last Update | 2012-08-30 |

===== Description =====

This script queries the SMART parameters of the drives on a system, and returns the status of those drives as well as reporting various metrics available from the SMART data.

The script gets its configuration from the environment or from a configuration file.

The script runs in write mode (with a "-w" switch) to create the status file from the output of the smartctl command. Typically this is done every 5 minutes from cron.

The script also runs in read mode (with a "-r" switch) to read in the status file and parse it for sending data and status reports to Xymon. Typically this is done every 5 minutes from a xymonlaunch configuration file (tasks.cfg on a Xymon server, or xymonlaunch.cfg on a Xymon client).

In read mode, the script constructs a status report for Xymon to warn if one of the following problems is detected:

  * SMART is not enabled on the drive
  * SMART self-test is not "OK"
  * SMART health status is not "OK"

The script also sends a data report for Xymon to turn into RRD files for graphing. The data points reported are:

  * corrected read errors
  * corrected write errors
  * uncorrected read errors
  * uncorrected write errors
  * non-medium errors
  * disk temperature

{{:monitors:xymon-smart.sh-1.png?200|}} {{:monitors:xymon-smart.sh-2.png?200|}}

===== Installation =====

=== Client side ===

1) Copy the script into a suitable location, such as ''/usr/lib/xymon/client/ext/xymon-SMART.sh''

2) Create a crontab entry (e.g. /etc/cron.d/xymon-SMART.cron) containing this:

  */5 * * * * root ( umask 002; XYMONCLIENTHOME=/usr/lib/xymon/client \
      CONTROLLER=cciss COUNT=0 DEVICE=cciss/c0d0 \
      /path/to/xymon-SMART.sh -w /tmp/SMART.status ) 2>/tmp/SMART.status.err

Most cron implementations do not join backslash-continued lines, so enter the entry as a single line; it is wrapped here only for readability.

Adjust for your requirements.
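As a rough illustration of how the CONTROLLER, COUNT and DEVICE values from the crontab entry combine into a smartctl device address, the sketch below prints (without running) one smartctl probe command per drive number. The function name ''print_smartctl_cmds'' is hypothetical and not part of xymon-SMART.sh; accepting a comma-separated COUNT mirrors what the script's write mode does, and is an assumption about how you might want to check several drives at once.

```shell
#!/bin/sh
# Hypothetical helper, not part of xymon-SMART.sh.
# Prints the smartctl probe command for each drive number in a
# comma-separated list. The body runs in a subshell "( ... )" so the
# IFS change does not leak into the calling shell.
print_smartctl_cmds() (
    controller=$1      # e.g. cciss
    count_list=$2      # e.g. 0 or 0,1
    device=$3          # e.g. cciss/c0d0 (relative to /dev)
    IFS=,
    set -- $count_list # split the list on commas into $1, $2, ...
    for n in "$@"; do
        echo "smartctl -d $controller,$n -i /dev/$device"
    done
)

# Example: two drives behind one controller
print_smartctl_cmds cciss 0,1 cciss/c0d0
# prints:
#   smartctl -d cciss,0 -i /dev/cciss/c0d0
#   smartctl -d cciss,1 -i /dev/cciss/c0d0
```

Running each printed command by hand (as root) shows whether a given address answers before the values go into the crontab entry.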
Use "cat /proc/partitions" to find a suitable DEVICE value. Test out values with:

  smartctl -d $CONTROLLER,$COUNT -i /dev/$DEVICE

For multiple devices, specify a comma-separated list of numbers in the COUNT variable, such as:

  ... COUNT=0,1 ...

Note: The comma-separated form of COUNT is handled by the script, which runs smartctl once per number; smartctl itself does not accept a list.

3) Create a Xymon client tasks entry like this:

  [smart]
      ENVFILE $XYMONCLIENTHOME/etc/xymonclient.cfg
      CMD /path/to/xymon-SMART.sh -r /tmp/SMART.status
      LOGFILE $XYMONCLIENTLOGS/xymonclient.log
      INTERVAL 5m

=== Server side ===

4) Create entries in graphs.cfg like so:

  [smart]
      # total read/write errors
      TITLE S.M.A.R.T. Total Media Errors
      YAXIS errors per second
      FNPATTERN ^smart.(.*).rrd
      DEF:rc@RRDIDX@=@RRDFN@:err_r_c:AVERAGE
      DEF:ru@RRDIDX@=@RRDFN@:err_r_u:AVERAGE
      DEF:wc@RRDIDX@=@RRDFN@:err_w_c:AVERAGE
      DEF:wu@RRDIDX@=@RRDFN@:err_w_u:AVERAGE
      CDEF:re@RRDIDX@=rc@RRDIDX@,ru@RRDIDX@,+
      CDEF:we@RRDIDX@=wc@RRDIDX@,wu@RRDIDX@,+
      COMMENT:@RRDPARAM@\:\n
      LINE1:re@RRDIDX@#@COLOR@:Read Errors :
      GPRINT:re@RRDIDX@:LAST:\: %5.1lf %s (cur)
      GPRINT:re@RRDIDX@:MAX: %5.1lf %s (max)
      GPRINT:re@RRDIDX@:MIN: %5.1lf %s (min)
      GPRINT:re@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
      LINE1:we@RRDIDX@#@COLOR@:Write Errors :
      GPRINT:we@RRDIDX@:LAST:\: %5.1lf %s (cur)
      GPRINT:we@RRDIDX@:MAX: %5.1lf %s (max)
      GPRINT:we@RRDIDX@:MIN: %5.1lf %s (min)
      GPRINT:we@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n

  [smart_temp]
      TITLE S.M.A.R.T. Disk Temperature
      YAXIS Celsius
      FNPATTERN ^smart.(.*).rrd
      DEF:temp@RRDIDX@=@RRDFN@:temp:AVERAGE
      LINE1:temp@RRDIDX@#@COLOR@:@RRDPARAM@ temperature:
      GPRINT:temp@RRDIDX@:LAST:\: %5.1lf°C (cur)
      GPRINT:temp@RRDIDX@:MAX: %5.1lf°C (max)
      GPRINT:temp@RRDIDX@:MIN: %5.1lf°C (min)
      GPRINT:temp@RRDIDX@:AVERAGE: %5.1lf°C (avg)\n

  [smart_nonmedium]
      TITLE S.M.A.R.T. Non-Medium Errors
      YAXIS errors per second
      FNPATTERN ^smart.(.*).rrd
      DEF:nmec@RRDIDX@=@RRDFN@:err_nmec:AVERAGE
      LINE1:nmec@RRDIDX@#@COLOR@:@RRDPARAM@ non-medium errors:
      GPRINT:nmec@RRDIDX@:LAST:\: %5.1lf %s (cur)
      GPRINT:nmec@RRDIDX@:MAX: %5.1lf %s (max)
      GPRINT:nmec@RRDIDX@:MIN: %5.1lf %s (min)
      GPRINT:nmec@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n

Add further graph definitions as desired. The RRD files produce the following DS names:

  * err_r_c = corrected read errors
  * err_w_c = corrected write errors
  * err_r_u = uncorrected read errors
  * err_w_u = uncorrected write errors
  * err_nmec = non-medium errors
  * temp = disk temperature

5) Add "smart" to the TEST2RRD and GRAPHS variables in xymonserver.cfg, to have the graphs included on the smart status page and the trends page.

6) Add "TRENDS:*,smart:smart|smart_temp" to the relevant entries in hosts.cfg, or the "_default_" entry.

===== Source =====

==== xymon-SMART.sh ====

#!/bin/sh

# SMART disk monitor
# Jeremy Laidman, 2012
#
# Version 0.5 - August 2012
# - initial public release
#
# Initially based on Michael Adelmann's "smart" script
# (see: http://xymonton.org/monitors:smart), the main
# improvements are to support multiple disks, and to
# send error counts for graphing.
#
# This script queries the SMART parameters of the drives
# on a system, and returns the status of those drives
# as well as reporting various metrics available from
# the SMART data.
#
# How it Works
# ------------
#
# The script gets its configuration from the environment
# or from a configuration file.
#
# The script runs in write mode (with a "-w" switch) to
# create the status file from the output of the smartctl
# command. Typically this is done every 5 minutes from cron.
#
# The script also runs in read mode (with a "-r" switch)
# to read in the status file and parse it for sending data
# and status reports to Xymon.
# Typically this is done
# every 5 minutes from a xymonlaunch configuration file
# (tasks.cfg on a Xymon server, or xymonlaunch.cfg on
# a Xymon client).
#
# In read mode, the script constructs a status report
# for Xymon to warn if one of the following problems is
# detected:
#  - SMART is not enabled on the drive
#  - SMART self-test is not "OK"
#  - SMART health status is not "OK"
#
# The script also sends a data report for Xymon to turn
# into RRD files for graphing. The data points reported
# are:
#  - corrected read errors
#  - corrected write errors
#  - uncorrected read errors
#  - uncorrected write errors
#  - non-medium errors
#  - disk temperature
#
#
# To Install
# ----------
#
# Client-side:
#
# 1) Copy the script into a suitable location,
#    such as /usr/lib/xymon/client/ext/xymon-SMART.sh
#
# 2) Create a crontab entry (e.g. /etc/cron.d/xymon-SMART.cron) containing this:
#
#    */5 * * * * root ( umask 002; XYMONCLIENTHOME=/usr/lib/xymon/client \
#        CONTROLLER=cciss COUNT=0 DEVICE=cciss/c0d0 \
#        /path/to/xymon-SMART.sh -w /tmp/SMART.status ) 2>/tmp/SMART.status.err
#
#    (Enter this as a single line; most cron implementations do not
#    join backslash-continued lines.)
#
#    Adjust for your requirements. Use "cat /proc/partitions" to
#    find a suitable DEVICE value. Test out values with:
#
#        smartctl -d $CONTROLLER,$COUNT -i /dev/$DEVICE
#
#    For multiple devices, specify a comma-separated list of numbers
#    in the COUNT variable, such as:
#        ... COUNT=0,1 ...
#    This usage of COUNT is handled by this script; it is not
#    supported by smartctl itself.
#
# 3) Create a Xymon client tasks entry like this:
#
#    [smart]
#        ENVFILE $XYMONCLIENTHOME/etc/xymonclient.cfg
#        CMD /path/to/xymon-SMART.sh -r /tmp/SMART.status
#        LOGFILE $XYMONCLIENTLOGS/xymonclient.log
#        INTERVAL 5m
#
# Server-side:
#
# 4) Create entries in graphs.cfg like so:
#
#    [smart]
#        # total read/write errors
#        TITLE S.M.A.R.T. Total Media Errors
#        YAXIS errors per second
#        FNPATTERN ^smart.(.*).rrd
#        DEF:rc@RRDIDX@=@RRDFN@:err_r_c:AVERAGE
#        DEF:ru@RRDIDX@=@RRDFN@:err_r_u:AVERAGE
#        DEF:wc@RRDIDX@=@RRDFN@:err_w_c:AVERAGE
#        DEF:wu@RRDIDX@=@RRDFN@:err_w_u:AVERAGE
#        CDEF:re@RRDIDX@=rc@RRDIDX@,ru@RRDIDX@,+
#        CDEF:we@RRDIDX@=wc@RRDIDX@,wu@RRDIDX@,+
#        COMMENT:@RRDPARAM@\:\n
#        LINE1:re@RRDIDX@#@COLOR@:Read Errors :
#        GPRINT:re@RRDIDX@:LAST:\: %5.1lf %s (cur)
#        GPRINT:re@RRDIDX@:MAX: %5.1lf %s (max)
#        GPRINT:re@RRDIDX@:MIN: %5.1lf %s (min)
#        GPRINT:re@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#        LINE1:we@RRDIDX@#@COLOR@:Write Errors :
#        GPRINT:we@RRDIDX@:LAST:\: %5.1lf %s (cur)
#        GPRINT:we@RRDIDX@:MAX: %5.1lf %s (max)
#        GPRINT:we@RRDIDX@:MIN: %5.1lf %s (min)
#        GPRINT:we@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#
#    [smart_temp]
#        TITLE S.M.A.R.T. Disk Temperature
#        YAXIS Celsius
#        FNPATTERN ^smart.(.*).rrd
#        DEF:temp@RRDIDX@=@RRDFN@:temp:AVERAGE
#        LINE1:temp@RRDIDX@#@COLOR@:@RRDPARAM@ temperature:
#        GPRINT:temp@RRDIDX@:LAST:\: %5.1lf°C (cur)
#        GPRINT:temp@RRDIDX@:MAX: %5.1lf°C (max)
#        GPRINT:temp@RRDIDX@:MIN: %5.1lf°C (min)
#        GPRINT:temp@RRDIDX@:AVERAGE: %5.1lf°C (avg)\n
#
#    [smart_nonmedium]
#        TITLE S.M.A.R.T. Non-Medium Errors
#        YAXIS errors per second
#        FNPATTERN ^smart.(.*).rrd
#        DEF:nmec@RRDIDX@=@RRDFN@:err_nmec:AVERAGE
#        LINE1:nmec@RRDIDX@#@COLOR@:@RRDPARAM@ non-medium errors:
#        GPRINT:nmec@RRDIDX@:LAST:\: %5.1lf %s (cur)
#        GPRINT:nmec@RRDIDX@:MAX: %5.1lf %s (max)
#        GPRINT:nmec@RRDIDX@:MIN: %5.1lf %s (min)
#        GPRINT:nmec@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#
#    Add further graph definitions as desired.
#    The RRD files produce the following DS names:
#     - err_r_c = corrected read errors
#     - err_w_c = corrected write errors
#     - err_r_u = uncorrected read errors
#     - err_w_u = uncorrected write errors
#     - err_nmec = non-medium errors
#     - temp = disk temperature
#
# 5) Add "smart" to the TEST2RRD and GRAPHS variables in
#    xymonserver.cfg, to have the graphs included on the
#    smart status page and the trends page.
#
# 6) Add "TRENDS:*,smart:smart|smart_temp" to the relevant
#    entries in hosts.cfg, or the "_default_" entry.
#
#
# Troubleshooting
# ---------------
#
# * Check the cron output in /tmp/SMART.status.err and look
#   for errors that indicate where the problem might be.
#
# * Check that the file /tmp/SMART.status is being updated.
#   If not, ensure that the script is being run by cron.
#
# * Ensure that the crontab entry is being run. On some
#   systems, simply creating a file in /etc/cron.d/ will
#   not tell crond that there has been a change to its
#   configuration. If this appears to be a problem, simply
#   touch the directory containing the crontabs, such as
#
#       sudo touch /var/spool/cron/tabs
#
# * If the status file appears correct, manually run the
#   script in read (-r) mode with debugging and dry-run:
#
#       xymoncmd /path/to/xymon-SMART.sh -r -d 1 -n /tmp/SMART.status
#
# * Check the Xymon log files, particularly xymonclient.log,
#   xymonlaunch.log and rrd-status.log.
#
#
# A note about compatibility
# --------------------------
#
# This script makes use of features of GNU "ls" and
# GNU "date" to determine if a status file is fresh.
# This probably won't work on systems that don't have
# GNU "ls" and GNU "date". However such a scenario
# is unlikely on systems where smartctl is functioning.

die() { echo "$@" >&2; exit 1; }

VERSION=0.5
NL="
" # newline character

if [ "$DEBUG" ]; then
    BB="echo"
    [ "$XYMONCLIENTHOME" ] || XYMONCLIENTHOME="/usr/lib/xymon/client"
    [ "$BBDISP" ] || BBDISP="0.0.0.0"
    [ "$MACHINE" ] || MACHINE="machine"
fi

[ "$XYMON" ] || XYMON="$BB"
[ "$XYMSRV" ] || XYMSRV="$BBDISP"

COLOR="clear"
COLUMN="smart"
CONFIG="${XYMONCLIENTHOME}/etc/smart.conf"
MSG="No S.M.A.R.T. device detected."
RAID=""
RAIDADDR=""
SMARTCTL="/usr/sbin/smartctl"
SUDO="/usr/bin/sudo"

setup_config() {
    # read config file ("." rather than "source", which plain /bin/sh may lack)
    if [ -f "$CONFIG" ]; then
        . "$CONFIG"
    else
        [ "$CONTROLLER" -a "$COUNT" -a "$DEVICE" ] || die "Configuration file not found: $CONFIG"
    fi

    if [ -n "$CONTROLLER" ]; then
        RAIDADDR="$CONTROLLER,$COUNT"
        RAID="-d $RAIDADDR"
        [ 0$DEBUG -gt 1 ] && echo "debug: RAID set to '$RAID'"
    fi

    [ -b "/dev/$DEVICE" ] || die "Invalid device: /dev/$DEVICE"

    RESULT="Device:\n\t$DEVICE\n\nStatus:\n\n"
}

get_smart_status() {
    # we parse the output and set some flags
    echo "$@" | while read LINE; do
        case $LINE in
        "Device Address:"*)
            COUNTER=`expr 0$COUNTER + 1`
            set - $LINE""
            DEVADDR=$3
            echo "DEVADDR_$COUNTER=$DEVADDR"
            echo "DEVICES=\"\$DEVICES $COUNTER\""
            ;;
        "Self Test returned without error")
            echo "SMART_SELFTEST_$COUNTER=OK"
            ;;
        "SMART Health Status:"*)
            set - $LINE""
            echo "SMART_HEALTH_$COUNTER=$4"
            ;;
        "Device supports SMART and is Enabled")
            set - $LINE""
            echo "SMART_ENABLED_$COUNTER=1"
            echo "SMART_ENABLED=1"
            ;;
        esac
    done
}

get_rrd_data() {
    # we parse the output and show some numbers
    echo "$@" | while read LINE; do
        case $LINE in
        "Device Address:"*)
            set - $LINE""
            [ "$FIRST" ] && echo ""
            echo "[smart.$3.rrd]"
            FIRST=1
            [ 0$DEBUG -gt 0 ] && echo "Found device $3" >&2
            ;;
        read:*)
            set - $LINE""
            echo "DS:err_r_c:COUNTER:600:0:U $5"
            echo "DS:err_r_u:COUNTER:600:0:U $8"
            ;;
        write:*)
            set - $LINE""
            echo "DS:err_w_c:COUNTER:600:0:U $5"
            echo "DS:err_w_u:COUNTER:600:0:U $8"
            ;;
        "Non-medium error count:"*)
            set - $LINE""
            echo "DS:err_nmec:COUNTER:600:0:U $4"
            ;;
        "Current Drive Temperature:"*)
            set - $LINE""
            echo "DS:temp:GAUGE:600:U:U $4"
            ;;
        esac
    done
}

show_version() {
    echo "Version: $VERSION"
}

show_usage() {
    echo "Usage: $0 [-w writefile|-r readfile|-n|-d|-d N|-h|-V]"
    show_version;
    echo "Specify -w filename (or --write) to write to file (use '-' for STDOUT)"
    echo "Specify -r filename (or --read) to read from a file (use '-' for STDIN)"
    echo "Specify -d [N] (or --debug [N]) to enable debug mode, optionally with a debug level"
    echo "Specify -n (or --dryrun) to stop short of updating Xymon (typically used with -d)"
    echo "Typically, run as root: '$0 -w > tmpfile' and then as Xymon: '$0 -r < tmpfile'."
    echo "If no switches are given, Xymon must have sudo rights to run the script with no password."
}

# Handle CLI modifiers
while [ "$1" ]; do
    case "$1" in
    "")
        ;;
    -d|--debug)
        DEBUG=1
        test 0$2 -gt 0 2>/dev/null && { DEBUG=$2; shift; }
        echo "debug: Debug level $DEBUG"
        ;;
    -q|--quiet)
        QUIET=1
        ;;
    -r|--read)
        READ=1
        [ 0$DEBUG -gt 0 ] && echo "debug: read mode"
        [ "$2" ] || die "Specify file to read"
        READFILE="$2"
        shift
        if [ -f "$READFILE" ]; then
            [ -r "$READFILE" ] || die "Unable to read file: $READFILE"
        else
            [ 0$QUIET -gt 0 ] && exit
            die "File not found: $READFILE"
        fi
        ;;
    -w|--write)
        [ 0$DEBUG -gt 0 ] && echo "debug: write mode"
        [ "$2" ] || die "Specify file to write"
        WRITEFILE="$2"
        shift
        # truncate the status file, but not when writing to STDOUT
        # (otherwise a file literally named "-" would be created)
        [ "$WRITEFILE" = "-" ] || : > "$WRITEFILE"
        # run smartctl once for each number in the comma-separated COUNT list
        for C in `IFS=,; set - ""$COUNT; echo $@`; do
            COUNT=$C
            setup_config
            if [ "$WRITEFILE" = "-" ]; then
                [ "$RAIDADDR" ] && echo "Device Address: $RAIDADDR"
                $SMARTCTL /dev/$DEVICE $RAID --all -X
            else
                # assume that $SMARTCTL or ">" will output any errors
                # so we just bail silently with RC=1
                {
                    if [ "$RAIDADDR" ]; then
                        [ -s "$WRITEFILE" ] && echo ""
                        echo "Device Address: $RAIDADDR"
                    fi
                    $SMARTCTL /dev/$DEVICE $RAID --all -X
                } >> "$WRITEFILE" || exit 1
            fi
            [ 0$DEBUG -eq 0 -o "$WRITEFILE" = "-" ] || cat "$WRITEFILE" | sed 's/^/debug: /'
        done
        exit
        ;;
    -n|--dryrun)
        DRYRUN=1
        ;;
    -V|--version)
        show_version
        exit
        ;;
    -h|--help)
        show_usage
        exit
        ;;
    *)
        die "Unexpected parameter: $1"
        ;;
    esac
    shift
done

if [ 0$READ -gt 0 ]; then
    [ 0$DEBUG -gt 0 ] && echo "debug: reading status from file '$READFILE'"
    # bail if the file is older than 5 minutes
    if [ "$READFILE" = "-" ]; then
        FILETIME=`ls -lL --time-style "+%s"

===== Known Bugs and Issues =====

===== To Do =====

===== Credits =====

===== Changelog =====

  * **2012-08-30**
    * Initial release