xymon-SMART.sh

Author Jeremy Laidman
Compatibility Xymon 4.3.3
Requirements smartmontools (for smartctl), GNU ls, GNU date
Download None
Last Update 2012-08-30

This script queries the SMART parameters of the drives on a system, and returns the status of those drives as well as reporting various metrics available from the SMART data.

The script gets its configuration from the environment or from a configuration file.

The script runs in write mode (with a “-w” switch) to create the status file from the output of the smartctl command. Typically this is done every 5 minutes from cron.
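As a rough illustration of what write mode produces, the sketch below assembles a status file with one section per device, each introduced by a "Device Address:" line and separated by a blank line, which is the layout read mode later parses. The names write_status and fake_smartctl are hypothetical; fake_smartctl is only a stand-in for the real smartctl call so the sketch runs without any hardware.

```shell
#!/bin/sh
# Sketch only: assemble a status file the way write mode does, one section per device.
# fake_smartctl is a stand-in for the real smartctl call (assumption for illustration).
fake_smartctl() { echo "SMART Health Status: OK"; }

# write_status OUTFILE ADDR... (hypothetical helper)
write_status() {
        OUTFILE=$1; shift
        : > "$OUTFILE"                          # truncate the status file
        for ADDR in "$@"; do
                {
                        [ -s "$OUTFILE" ] && echo ""    # blank line between devices
                        echo "Device Address: $ADDR"
                        fake_smartctl
                } >> "$OUTFILE"
        done
}

write_status /tmp/SMART.status.example cciss,0 cciss,1
cat /tmp/SMART.status.example
```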

The script also runs in read mode (with a “-r” switch) to read in the status file and parse it for sending data and status reports to Xymon. Typically this is done every 5 minutes from a xymonlaunch configuration file (tasks.cfg on a Xymon server, or clientlaunch.cfg on a Xymon client).
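In read mode the script also refuses a status file that is more than 600 seconds old. A minimal sketch of such a freshness check follows. It assumes GNU date's "-r FILE" option, whereas the script itself gets the timestamp from GNU ls with --time-style; is_fresh is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: succeed if FILE was modified within MAXAGE seconds (default 600).
# is_fresh is a hypothetical helper; "date -r FILE" is a GNU date extension.
is_fresh() {
        FILE=$1 MAXAGE=${2:-600}
        FILETIME=`date -r "$FILE" "+%s" 2>/dev/null` || return 2   # missing or unreadable file
        TIMENOW=`date "+%s"`
        AGE=$((TIMENOW - FILETIME))
        # a negative age means a timestamp in the future, which is also rejected
        [ "$AGE" -ge 0 ] && [ "$AGE" -le "$MAXAGE" ]
}

if is_fresh /tmp/SMART.status 600; then
        echo "status file is fresh"
else
        echo "status file is stale or missing"
fi
```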

In read mode, the script constructs a status report for Xymon to warn if any of the following problems is detected:

  • SMART is not enabled on the drive
  • SMART self-test is not “OK”
  • SMART health status is not “OK”
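The three checks above map to a Xymon colour roughly as sketched below for a single drive: a drive without SMART enabled is yellow, and a health status or self-test result other than "OK" is red. The function name smart_colour and its argument order are hypothetical; the script performs these checks inline.

```shell
#!/bin/sh
# Sketch only: map the three per-drive checks to a Xymon colour.
# smart_colour is a hypothetical helper, not part of the script.
# $1 = non-empty if SMART is enabled, $2 = health status, $3 = self-test result
smart_colour() {
        COLOUR=green
        [ -n "$1" ] || COLOUR=yellow        # SMART unsupported or disabled
        [ "$2" = "OK" ] || COLOUR=red       # health status is not OK
        [ "$3" = "OK" ] || COLOUR=red       # self-test is not OK
        echo "$COLOUR"
}

smart_colour 1 OK OK            # all checks pass
smart_colour "" OK OK           # SMART not enabled
smart_colour 1 FAILED OK        # bad health status
```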

The script also sends a data report for Xymon to turn into RRD files for graphing. The data points reported are:

  • corrected read errors
  • corrected write errors
  • uncorrected read errors
  • uncorrected write errors
  • non-medium errors
  • disk temperature
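The data report consists of a "[smart.<address>.rrd]" header followed by one DS line per data point. The sketch below shows how one "read:" or "write:" row of the smartctl error counter log could be turned into such lines, assuming (as the script does) that field 5 holds the total corrected errors and field 8 the total uncorrected errors. The helper name counter_to_ds and the sample numbers are made up for illustration.

```shell
#!/bin/sh
# Sketch only: turn one "read:"/"write:" row of the error counter log into
# the DS lines that follow a "[smart.<address>.rrd]" header in the data report.
# counter_to_ds is a hypothetical helper; the script does this in get_rrd_data.
counter_to_ds() {
        set -- $1               # split the row on whitespace (intentionally unquoted)
        case "$1" in
                read:)  KIND=r ;;
                write:) KIND=w ;;
                *)      return 1 ;;
        esac
        echo "DS:err_${KIND}_c:COUNTER:600:0:U $5"      # total corrected errors
        echo "DS:err_${KIND}_u:COUNTER:600:0:U $8"      # total uncorrected errors
}

# Sample row with made-up numbers, for illustration only
SAMPLE="read:   12   3   0   15   15   1024.500   2"
echo "[smart.cciss,0.rrd]"
counter_to_ds "$SAMPLE"
```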

Client side

1) Copy the script into a suitable location, such as /usr/lib/xymon/client/ext/xymon-SMART.sh

2) Create a crontab entry (e.g. /etc/cron.d/xymon-SMART.cron) containing this:

*/5 * * * * root ( umask 002; XYMONCLIENTHOME=/usr/lib/xymon/client \
    CONTROLLER=cciss COUNT=0 DEVICE=cciss/c0d0 \
    /path/to/xymon-SMART.sh -w /tmp/SMART.status ) 2>/tmp/SMART.status.err

Adjust for your requirements. Use “cat /proc/partitions” to find a suitable DEVICE value. Test out values with:

smartctl -d $CONTROLLER,$COUNT -i /dev/$DEVICE

For multiple devices, specify a comma-separated list of numbers in the COUNT variable, such as:

... COUNT=0,1 ...

Note: smartctl itself does not accept a list here. The script splits COUNT on commas and runs smartctl once for each number.
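To check every value in a comma-separated COUNT before putting it into cron, something like the following sketch may help. It only prints the smartctl test command for each number and does not run smartctl; list_smartctl_tests is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print the smartctl test command for each number in COUNT.
# list_smartctl_tests is a hypothetical helper; it prints commands, it does not run them.
list_smartctl_tests() {
        CONTROLLER=$1 COUNTS=$2 DEVICE=$3
        OLDIFS=$IFS
        IFS=,
        set -- $COUNTS          # split "0,1" into "0" "1" (intentionally unquoted)
        IFS=$OLDIFS
        for C in "$@"; do
                echo "smartctl -d $CONTROLLER,$C -i /dev/$DEVICE"
        done
}

list_smartctl_tests cciss 0,1 cciss/c0d0
```

Each printed line can then be run by hand as root to confirm that the drive responds.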

3) Create a Xymon client tasks entry like this:

[smart]
     ENVFILE $XYMONCLIENTHOME/etc/xymonclient.cfg
     CMD /path/to/xymon-SMART.sh -r /tmp/SMART.status
     LOGFILE $XYMONCLIENTLOGS/xymonclient.log
     INTERVAL 5m

Server side

4) Create entries in graphs.cfg like so:

  [smart]
      # total read/write errors
      TITLE S.M.A.R.T. Total Media Errors
      YAXIS errors per second
      FNPATTERN ^smart.(.*).rrd
      DEF:rc@RRDIDX@=@RRDFN@:err_r_c:AVERAGE
      DEF:ru@RRDIDX@=@RRDFN@:err_r_u:AVERAGE
      DEF:wc@RRDIDX@=@RRDFN@:err_w_c:AVERAGE
      DEF:wu@RRDIDX@=@RRDFN@:err_w_u:AVERAGE
      CDEF:re@RRDIDX@=rc@RRDIDX@,ru@RRDIDX@,+
      CDEF:we@RRDIDX@=wc@RRDIDX@,wu@RRDIDX@,+
      COMMENT:@RRDPARAM@\:\n
      LINE1:re@RRDIDX@#@COLOR@:Read Errors         :
      GPRINT:re@RRDIDX@:LAST:\: %5.1lf %s (cur)
      GPRINT:re@RRDIDX@:MAX: %5.1lf %s (max)
      GPRINT:re@RRDIDX@:MIN: %5.1lf %s (min)
      GPRINT:re@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
      LINE1:we@RRDIDX@#@COLOR@:Write Errors        :
      GPRINT:we@RRDIDX@:LAST:\: %5.1lf %s (cur)
      GPRINT:we@RRDIDX@:MAX: %5.1lf %s (max)
      GPRINT:we@RRDIDX@:MIN: %5.1lf %s (min)
      GPRINT:we@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
  
  [smart_temp]
      TITLE S.M.A.R.T. Disk Temperature
      YAXIS Celsius
      FNPATTERN ^smart.(.*).rrd
      DEF:temp@RRDIDX@=@RRDFN@:temp:AVERAGE
      LINE1:temp@RRDIDX@#@COLOR@:@RRDPARAM@ temperature:
      GPRINT:temp@RRDIDX@:LAST:\: %5.1lf°C (cur)
      GPRINT:temp@RRDIDX@:MAX: %5.1lf°C (max)
      GPRINT:temp@RRDIDX@:MIN: %5.1lf°C (min)
      GPRINT:temp@RRDIDX@:AVERAGE: %5.1lf°C (avg)\n
  
  [smart_nonmedium]
      TITLE S.M.A.R.T. Non-Medium Errors
      YAXIS errors per second
      FNPATTERN ^smart.(.*).rrd
      DEF:nmec@RRDIDX@=@RRDFN@:err_nmec:AVERAGE
      LINE1:nmec@RRDIDX@#@COLOR@:@RRDPARAM@ non-medium errors:
      GPRINT:nmec@RRDIDX@:LAST:\: %5.1lf %s (cur)
      GPRINT:nmec@RRDIDX@:MAX: %5.1lf %s (max)
      GPRINT:nmec@RRDIDX@:MIN: %5.1lf %s (min)
      GPRINT:nmec@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n

Add further graph definitions as desired. The RRD files produce the following DS names:

  • err_r_c = corrected read errors
  • err_w_c = corrected write errors
  • err_r_u = uncorrected read errors
  • err_w_u = uncorrected write errors
  • err_nmec = non-medium errors
  • temp = disk temperature

5) Add “smart” to the TEST2RRD and GRAPHS variables in xymonserver.cfg, to have the graphs included on the smart status page and the trends page.

6) Add “TRENDS:*,smart:smart|smart_temp” to the relevant entries in hosts.cfg, or the “_default_” entry.

xymon-SMART.sh

#!/bin/sh

# SMART disk monitor
# Jeremy Laidman, 2012
#
# Version 0.5 - August 2012
#    - initial public release
#
# Initially based on Michael Adelmann's "smart" script
# (see: http://xymonton.org/monitors:smart), the main
# improvements are to support multiple disks, and to
# send error counts for graphing.
#
# This script queries the SMART parameters of the drives
# on a system, and returns the status of those drives
# as well as reporting various metrics available from
# the SMART data.
#
# How it Works
# ------------
#
# The script gets its configuration from the environment
# or from a configuration file.
#
# The script runs in write mode (with a "-w" switch) to
# create the status file from the output of the smartctl
# command.  Typically this is done every 5 minutes from cron.
#
# The script also runs in read mode (with a "-r" switch)
# to read in the status file and parse it for sending data
# and status reports to Xymon.  Typically this is done
# every 5 minutes from a xymonlaunch configuration file
# (tasks.cfg on a Xymon server, or clientlaunch.cfg on
# a Xymon client).
#
# In read mode, the script constructs a status report
# for Xymon to warn if any of the following problems is
# detected:
#     - SMART is not enabled on the drive
#     - SMART self-test is not "OK"
#     - SMART health status is not "OK"
#
# The script also sends a data report for Xymon to turn
# into RRD files for graphing.  The data points reported
# are:
#    - corrected read errors
#    - corrected write errors
#    - uncorrected read errors
#    - uncorrected write errors
#    - non-medium errors
#    - disk temperature
#
#
# To Install
# ----------
#
# Client-side:
#
# 1) Copy the script into a suitable location,
#    such as /usr/lib/xymon/client/ext/xymon-SMART.sh
#
# 2) Create a crontab entry (eg /etc/cron.d/xymon-SMART.cron) containing this:
#
#    */5 * * * * root ( umask 002; XYMONCLIENTHOME=/usr/lib/xymon/client \
#       CONTROLLER=cciss COUNT=0 DEVICE=cciss/c0d0 \
#       /path/to/xymon-SMART.sh -w /tmp/SMART.status ) 2>/tmp/SMART.status.err
#
#    Adjust for your requirements.  Use "cat /proc/partitions" to
#    find a suitable DEVICE value.  Test out values with:
#
#        smartctl -d $CONTROLLER,$COUNT -i /dev/$DEVICE
#
#    For multiple devices, specify a comma-separated list of numbers
#    in the COUNT variable, such as:
#       ... COUNT=0,1 ...
#    smartctl itself does not accept a list here; the script
#    splits COUNT on commas and runs smartctl once per number.
#
# 3) Create a Xymon client tasks entry like this:
#
#    [smart]
#           ENVFILE $XYMONCLIENTHOME/etc/xymonclient.cfg
#           CMD /path/to/xymon-SMART.sh -r /tmp/SMART.status
#           LOGFILE $XYMONCLIENTLOGS/xymonclient.log
#           INTERVAL 5m
#
# Server-side:
#
# 4) Create entries in graphs.cfg like so:
#
#    [smart]
#        # total read/write errors
#        TITLE S.M.A.R.T. Total Media Errors
#        YAXIS errors per second
#        FNPATTERN ^smart.(.*).rrd
#        DEF:rc@RRDIDX@=@RRDFN@:err_r_c:AVERAGE
#        DEF:ru@RRDIDX@=@RRDFN@:err_r_u:AVERAGE
#        DEF:wc@RRDIDX@=@RRDFN@:err_w_c:AVERAGE
#        DEF:wu@RRDIDX@=@RRDFN@:err_w_u:AVERAGE
#        CDEF:re@RRDIDX@=rc@RRDIDX@,ru@RRDIDX@,+
#        CDEF:we@RRDIDX@=wc@RRDIDX@,wu@RRDIDX@,+
#        COMMENT:@RRDPARAM@\:\n
#        LINE1:re@RRDIDX@#@COLOR@:Read Errors         :
#        GPRINT:re@RRDIDX@:LAST:\: %5.1lf %s (cur)
#        GPRINT:re@RRDIDX@:MAX: %5.1lf %s (max)
#        GPRINT:re@RRDIDX@:MIN: %5.1lf %s (min)
#        GPRINT:re@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#        LINE1:we@RRDIDX@#@COLOR@:Write Errors        :
#        GPRINT:we@RRDIDX@:LAST:\: %5.1lf %s (cur)
#        GPRINT:we@RRDIDX@:MAX: %5.1lf %s (max)
#        GPRINT:we@RRDIDX@:MIN: %5.1lf %s (min)
#        GPRINT:we@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#
#    [smart_temp]
#        TITLE S.M.A.R.T. Disk Temperature
#        YAXIS Celsius
#        FNPATTERN ^smart.(.*).rrd
#        DEF:temp@RRDIDX@=@RRDFN@:temp:AVERAGE
#        LINE1:temp@RRDIDX@#@COLOR@:@RRDPARAM@ temperature:
#        GPRINT:temp@RRDIDX@:LAST:\: %5.1lf°C (cur)
#        GPRINT:temp@RRDIDX@:MAX: %5.1lf°C (max)
#        GPRINT:temp@RRDIDX@:MIN: %5.1lf°C (min)
#        GPRINT:temp@RRDIDX@:AVERAGE: %5.1lf°C (avg)\n
#
#    [smart_nonmedium]
#        TITLE S.M.A.R.T. Non-Medium Errors
#        YAXIS errors per second
#        FNPATTERN ^smart.(.*).rrd
#        DEF:nmec@RRDIDX@=@RRDFN@:err_nmec:AVERAGE
#        LINE1:nmec@RRDIDX@#@COLOR@:@RRDPARAM@ non-medium errors:
#        GPRINT:nmec@RRDIDX@:LAST:\: %5.1lf %s (cur)
#        GPRINT:nmec@RRDIDX@:MAX: %5.1lf %s (max)
#        GPRINT:nmec@RRDIDX@:MIN: %5.1lf %s (min)
#        GPRINT:nmec@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#
#    Add further graph definitions as desired.
#    The RRD files produce the following DS names:
#    - err_r_c  = corrected read errors
#    - err_w_c  = corrected write errors
#    - err_r_u  = uncorrected read errors
#    - err_w_u  = uncorrected write errors
#    - err_nmec = non-medium errors
#    - temp     = disk temperature
#
# 5) Add "smart" to the TEST2RRD and GRAPHS variables in
#    xymonserver.cfg, to have the graphs included on the
#    smart status page and the trends page.
#
# 6) Add "TRENDS:*,smart:smart|smart_temp" to the relevant
#    entries in hosts.cfg, or the "_default_" entry.
#
#
# Troubleshooting
# ---------------
#
# * Check the cron output in /tmp/SMART.status.err and look
#   for errors that indicate where the problem might be.
#
# * Check that the file /tmp/SMART.status is being updated.
#   If not, ensure that the script is being run by cron.
#
# * Ensure that the crontab entry is being run.  On some
#   systems, simply creating a file in /etc/cron.d/ will
#   not tell crond that there has been a change to its
#   configuration.  If this appears to be a problem, simply
#   touch the directory containing the crontabs, such as
#
#      sudo touch /var/spool/cron/tabs
#
# * If the status file appears correct, manually run the
#   script in read (-r) mode with debugging and dry-run:
#
#      xymoncmd /path/to/xymon-SMART.sh -r -d 1 -n /tmp/SMART.status
#
# * Check the Xymon log files, particularly xymonclient.log,
#   xymonlaunch.log and rrd-status.log.
#
#
# A note about compatibility
# --------------------------
#
# This script makes use of features of GNU "ls" and
# GNU "date" to determine if a status file is fresh.
# This probably won't work on systems that don't have
# GNU "ls" and GNU "date".  However such a scenario
# is unlikely on systems where smartctl is functioning.

die() { echo "$@" >&2; exit 1; }

VERSION=0.5

NL="
"       # newline character


if [ "$DEBUG" ]; then
        BB="echo"
        [ "$XYMONCLIENTHOME" ] || XYMONCLIENTHOME="/usr/lib/xymon/client"
        [ "$BBDISP" ] || BBDISP="0.0.0.0"
        [ "$MACHINE" ] || MACHINE="machine"
fi

[ "$XYMON" ] || XYMON="$BB"
[ "$XYMSRV" ] || XYMSRV="$BBDISP"

COLOR="clear"
COLUMN="smart"
CONFIG="${XYMONCLIENTHOME}/etc/smart.conf"
MSG="No S.M.A.R.T. device detected."
RAID=""
RAIDADDR=""
SMARTCTL="/usr/sbin/smartctl"
SUDO="/usr/bin/sudo"

setup_config() {
        # read config file
        if [ -f $CONFIG ]; then
                . "$CONFIG"     # POSIX sh spelling of "source"
        else
                [ "$CONTROLLER" -a "$COUNT" -a "$DEVICE" ] ||
                        die "Configuration file not found: $CONFIG"
        fi

        if [ -n "$CONTROLLER" ]; then
                RAIDADDR="$CONTROLLER,$COUNT"
                RAID="-d $RAIDADDR"
                [ 0$DEBUG -gt 1 ] && echo "debug: RAID set to '$RAID'"
        fi

        [ -b "/dev/$DEVICE" ] || die "Invalid device: /dev/$DEVICE"

        RESULT="Device:\n\t$DEVICE\n\nStatus:\n\n"
}

get_smart_status() {
        # we parse the output and set some flags
        echo "$@" | while read LINE; do
                case $LINE in
                        "Device Address:"*)
                                COUNTER=`expr 0$COUNTER + 1`
                                set - $LINE""
                                DEVADDR=$3
                                echo "DEVADDR_$COUNTER=$DEVADDR"
                                echo "DEVICES=\"\$DEVICES $COUNTER\""
                                ;;
                        "Self Test returned without error")
                                echo "SMART_SELFTEST_$COUNTER=OK"
                                ;;
                        "SMART Health Status:"*)
                                set - $LINE""
                                echo "SMART_HEALTH_$COUNTER=$4"
                                ;;
                        "Device supports SMART and is Enabled")
                                set - $LINE""
                                echo "SMART_ENABLED_$COUNTER=1"
                                echo "SMART_ENABLED=1"
                                ;;
                esac
        done
}

get_rrd_data() {
        # we parse the output and show some numbers
        echo "$@" | while read LINE; do
                case $LINE in
                        "Device Address:"*)
                                set - $LINE""
                                [ "$FIRST" ] && echo ""
                                echo "[smart.$3.rrd]"
                                FIRST=1
                                [ 0$DEBUG -gt 0 ] && echo "Found device $3" >&2
                                ;;
                        read:*)
                                set - $LINE""
                                echo "DS:err_r_c:COUNTER:600:0:U $5"
                                echo "DS:err_r_u:COUNTER:600:0:U $8"
                                ;;
                        write:*)
                                set - $LINE""
                                echo "DS:err_w_c:COUNTER:600:0:U $5"
                                echo "DS:err_w_u:COUNTER:600:0:U $8"
                                ;;
                        "Non-medium error count:"*)
                                set - $LINE""
                                echo "DS:err_nmec:COUNTER:600:0:U $4"
                                ;;
                        "Current Drive Temperature:"*)
                                set - $LINE""
                                echo "DS:temp:GAUGE:600:U:U $4"
                                ;;
                esac
        done
}

show_version() {
        echo "Version: $VERSION"
}

show_usage() {
        echo "Usage: $0 [-w writefile|-r readfile|-n|-d|-d N|-h|-V]"
        show_version;
        echo "Specify -w filename (or --write) to write to file (use '-' for STDOUT)"
        echo "Specify -r filename (or --read) to read from a file (use '-' for STDIN)"
        echo "Specify -d [N] (or --debug [N]) to enable debug mode, optionally with a debug level"
        echo "Specify -n (or --dryrun) to stop short of updating Xymon (typically used with -d)"
        echo "Typically, run as root: '$0 -w tmpfile' and then as Xymon: '$0 -r tmpfile'."
        echo "If no switches are given, Xymon must have sudo rights to run the script with no password."
}

# Handle CLI modifiers
while [ "$1" ]; do
        case "$1" in
                "")             ;;
                -d|--debug)     DEBUG=1
                                test 0$2 -gt 0 2>/dev/null && { DEBUG=$2; shift; }
                                echo "debug: Debug level $DEBUG"
                                ;;
                -q|--quiet)     QUIET=1
                                ;;
                -r|--read)      READ=1
                                [ 0$DEBUG -gt 0 ] && echo "debug: read mode"
                                [ "$2" ] || die "Specify file to read"
                                READFILE="$2"
                                shift
                                if [ -f "$READFILE" ]; then
                                        [ -r "$READFILE" ] || die "Unable to read file: $READFILE"
                                else
                                        [ 0$QUIET -gt 0 ] && exit
                                        die "File not found: $READFILE"
                                fi
                                ;;
                -w|--write)
                                [ 0$DEBUG -gt 0 ] && echo "debug: write mode"
                                [ "$2" ] || die "Specify file to write"
                                WRITEFILE="$2"
                                shift
                                > $WRITEFILE
                                for C in `IFS=,; set - ""$COUNT; echo $@`; do
                                        COUNT=$C setup_config
                                        if [ "$WRITEFILE" = "-" ]; then
                                                [ "$RAIDADDR" ] && echo "Device Address: $RAIDADDR"
                                                $SMARTCTL /dev/$DEVICE $RAID --all -X
                                        else
                                                # assume that $SMARTCTL or ">" will output any errors
                                                # so we just bail silently with RC=1
                                                {
                                                        if [ "$RAIDADDR" ]; then
                                                                [ -s $WRITEFILE ] && echo ""
                                                                echo "Device Address: $RAIDADDR"
                                                        fi
                                                        $SMARTCTL /dev/$DEVICE $RAID --all -X
                                                } >> $WRITEFILE || exit 1
                                        fi
                                        [ 0$DEBUG -eq 0 -o "$WRITEFILE" = "-" ] || cat $WRITEFILE | sed 's/^/debug: /'
                                done
                                exit
                                ;;
                -n|--dryrun)    DRYRUN=1
                                ;;
                -V|--version)
                                show_version
                                exit
                                ;;
                -h|--help)
                                show_usage
                                exit
                                ;;
                *)              die "Unexpected parameter: $1"  ;;
        esac
        shift
done

if [ 0$READ -gt 0 ]; then
        [ 0$DEBUG -gt 0 ] && echo "debug: reading status from file '$READFILE'"
        # bail if the file is older than 5 minutes
        if [ "$READFILE" = "-" ]; then
                FILETIME=`ls -lL --time-style "+%s" /dev/stdin | { read X X X X X B X; echo $B; }`
        else
                FILETIME=`ls -lL --time-style "+%s" $READFILE | { read X X X X X B X; echo $B; }`
        fi
        TIMENOW=`date "+%s"`
        TIMEDIFF=`expr $TIMENOW - $FILETIME`
        [ 0$TIMEDIFF -lt 0 ] && die "Invalid timestamp"
        [ 0$TIMEDIFF -gt 600 ] && die "Stale SMART file is $TIMEDIFF seconds old"
        if [ "$READFILE" = "-" ]; then
                TMP=`cat`
        else
                TMP=`cat $READFILE`
        fi
else
        TMP=""
        for C in `IFS=,; set - ""$COUNT; echo $@`; do
                COUNT=$C setup_config
                [ "$RAIDADDR" ] && TMP="$TMP${NL}`echo Device Address: $RAIDADDR`"
                TMP="$TMP${NL}`$SUDO $SMARTCTL /dev/$DEVICE $RAID --all -X`"
        done
fi

SMARTSTATUS=`get_smart_status "$TMP"`
[ 0$DEBUG -gt 1 ] && echo "$SMARTSTATUS"
eval $SMARTSTATUS

RRDDATA=`get_rrd_data "$TMP"`

[ "$SMART_ENABLED" ] && SMART=1

[ "$XYMON" ] || die "Xymon environment is not setup"

MSG="$TMP"
for DEVINDEX in $DEVICES; do
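        # start from green, but do not reset a yellow/red raised by an earlier device
        [ "$COLOR" = "clear" ] && COLOR="green"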

        eval DEVNAME=\$DEVADDR_$DEVINDEX
        [ 0$DEBUG -gt 0 ] && echo "Checking SMART for $DEVNAME"

        eval SMART_ENABLED=\$SMART_ENABLED_$DEVINDEX
        if [ "$SMART_ENABLED" ]; then
                RESULT="$RESULT\t&green $DEVNAME supports SMART and is enabled\n"
        else
                COLOR="yellow"
                RESULT="$RESULT\t&yellow $DEVNAME does not support SMART or is not enabled\n"
        fi

        eval SMART_HEALTH=\$SMART_HEALTH_$DEVINDEX
        if [ "$SMART_HEALTH" = "OK" ]; then
                RESULT="$RESULT\t&green $DEVNAME SMART Health Status: OK\n"
        else
                COLOR="red"
                RESULT="$RESULT\t&red $DEVNAME SMART Health Status: $SMART_HEALTH\n"
        fi

        SELF=`echo "$TMP" | grep "Self Test returned without error"`
        eval SMART_SELFTEST=\$SMART_SELFTEST_$DEVINDEX
        if [ "$SMART_SELFTEST" = "OK" ]; then
                RESULT="$RESULT\t&green $DEVNAME Self Test returned without error\n"
        else
                COLOR="red"
                RESULT="$RESULT\t&red $DEVNAME Self Test returned with error: $SMART_SELFTEST\n"
        fi
done

MSG=`echo -e "\n$RESULT\n\n$MSG\n"`

if [ 0$DEBUG -gt 0 ]; then
        echo "Messages to Xymon:"
        echo
        echo $XYMON $BBDISP "status $MACHINE.$COLUMN $COLOR `date` $MSG"
        echo
        echo $XYMON $BBDISP "data $MACHINE.trends${NL}$RRDDATA"
fi
if [ 0$DRYRUN -eq 0 ]; then
        $XYMON $BBDISP "status $MACHINE.$COLUMN $COLOR `date` $MSG"
        $XYMON $BBDISP "data $MACHINE.trends${NL}$RRDDATA"
fi
  • 2012-08-30
    • Initial release