Author | Jeremy Laidman |
Compatibility | Xymon 4.3.3 |
Requirements | smartmontools (provides smartctl), GNU ls, GNU date |
Download | None |
Last Update | 2012-08-30 |
This script queries the SMART parameters of the drives on a system, and returns the status of those drives as well as reporting various metrics available from the SMART data.
The script gets its configuration from the environment or from a configuration file.
The script runs in write mode (with a “-w” switch) to create the status file from the output of the smartctl command. Typically this is done every 5 minutes from cron.
The script also runs in read mode (with a “-r” switch) to read in the status file and parse it for sending data and status reports to Xymon. Typically this is done every 5 minutes from a xymonlaunch configuration file (tasks.cfg on a Xymon server, or xymonlaunch.cfg on a Xymon client).
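In read mode, the script constructs a status report for Xymon that warns if any of the following problems is detected:
- SMART is not enabled on the drive
- SMART self-test is not “OK”
- SMART health status is not “OK”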
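A minimal sketch of how the status color could be derived from the three per-drive checks the script performs (SMART enabled, health status, self-test). The helper name `smart_color` and its argument order are hypothetical; the escalation (yellow when SMART is not enabled, red when health or self-test is not OK) follows the read-mode loop in the script.

```shell
#!/bin/sh
# Hypothetical helper: derive a Xymon color from three per-drive flags.
# $1 = "1" if SMART is enabled, $2 = health status, $3 = self-test status.
smart_color() {
    color="green"
    [ "$1" = "1" ] || color="yellow"   # SMART missing or disabled
    [ "$2" = "OK" ] || color="red"     # health status is not OK
    [ "$3" = "OK" ] || color="red"     # self-test is not OK
    echo "$color"
}

smart_color 1 OK OK        # all checks pass
smart_color "" OK OK       # SMART not enabled
smart_color 1 FAILED OK    # bad health status
```

The later checks overwrite the earlier ones, so a red condition always takes precedence over yellow.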
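The script also sends a data report for Xymon to turn into RRD files for graphing. The data points reported are:
- corrected read errors
- corrected write errors
- uncorrected read errors
- uncorrected write errors
- non-medium errors
- disk temperature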
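The extraction of those data points can be sketched as below. The sample input is invented for illustration and is not real smartctl output; only the line prefixes and field positions (fields 5 and 8 of the read and write lines, field 4 of the other two) are taken from the script's get_rrd_data function. The helper name `parse_points` is hypothetical.

```shell
#!/bin/sh
# Sample input invented for illustration; only the line prefixes and
# field positions mirror what the script's get_rrd_data function matches.
SAMPLE='read: 3 0 0 3 3 120.5 1
write: 0 0 0 0 0 80.2 0
Non-medium error count: 12
Current Drive Temperature: 31 C'

# Hypothetical helper: print name=value pairs for the six data points.
parse_points() {
    while read -r LINE; do
        set -- $LINE            # split the line into positional fields
        case $LINE in
            read:*)  echo "err_r_c=$5"; echo "err_r_u=$8" ;;
            write:*) echo "err_w_c=$5"; echo "err_w_u=$8" ;;
            "Non-medium error count:"*)    echo "err_nmec=$4" ;;
            "Current Drive Temperature:"*) echo "temp=$4" ;;
        esac
    done
}

POINTS=$(echo "$SAMPLE" | parse_points)
echo "$POINTS"
```

The script itself emits these values as RRD data-source definitions (for example `DS:err_r_c:COUNTER:600:0:U` followed by the value) rather than plain name=value pairs.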
Client side
1) Copy the script into a suitable location, such as /usr/lib/xymon/client/ext/xymon-SMART.sh
2) Create a crontab entry (eg /etc/cron.d/xymon-SMART.cron) containing this:
*/5 * * * * root ( umask 002; XYMONCLIENTHOME=/usr/lib/xymon/client \
CONTROLLER=cciss COUNT=0 DEVICE=cciss/c0d0 \
/path/to/xymon-SMART.sh -w /tmp/SMART.status ) 2>/tmp/SMART.status.err
Adjust for your requirements. Use “cat /proc/partitions” to
find a suitable DEVICE value. Test out values with:
smartctl -d $CONTROLLER,$COUNT -i /dev/$DEVICE
For multiple devices, specify a comma-separated list of numbers
in the COUNT variable, such as:
... COUNT=0,1 ...
Note: The comma-separated list is handled by this script, which runs smartctl once per number. smartctl itself accepts only a single number here.
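As a rough illustration of how a comma-separated COUNT expands into one smartctl call per disk, the sketch below prints (but does not run) the test command shown in step 2 for each number. The variable values are examples and the helper name `list_commands` is hypothetical.

```shell
#!/bin/sh
# Example values; substitute the controller and device for your system.
CONTROLLER=cciss
DEVICE=cciss/c0d0
COUNT=0,1

# Split COUNT on commas and print (not run) one smartctl command per disk.
list_commands() {
    OLDIFS=$IFS
    IFS=,
    set -- $COUNT           # each comma-separated number becomes a field
    IFS=$OLDIFS
    for C in "$@"; do
        echo "smartctl -d $CONTROLLER,$C -i /dev/$DEVICE"
    done
}

list_commands
```

Running each printed command by hand is a quick way to confirm that every disk number is valid before adding it to the crontab entry.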
3) Create a Xymon client tasks entry like this:
[smart]
ENVFILE $XYMONCLIENTHOME/etc/xymonclient.cfg
CMD /path/to/xymon-SMART.sh -r /tmp/SMART.status
LOGFILE $XYMONCLIENTLOGS/xymonclient.log
INTERVAL 5m
Server side
4) Create entries in graphs.cfg like so:
[smart]
# total read/write errors
TITLE S.M.A.R.T. Total Media Errors
YAXIS errors per second
FNPATTERN ^smart.(.*).rrd
DEF:rc@RRDIDX@=@RRDFN@:err_r_c:AVERAGE
DEF:ru@RRDIDX@=@RRDFN@:err_r_u:AVERAGE
DEF:wc@RRDIDX@=@RRDFN@:err_w_c:AVERAGE
DEF:wu@RRDIDX@=@RRDFN@:err_w_u:AVERAGE
CDEF:re@RRDIDX@=rc@RRDIDX@,ru@RRDIDX@,+
CDEF:we@RRDIDX@=wc@RRDIDX@,wu@RRDIDX@,+
COMMENT:@RRDPARAM@\:\n
LINE1:re@RRDIDX@#@COLOR@:Read Errors :
GPRINT:re@RRDIDX@:LAST:\: %5.1lf %s (cur)
GPRINT:re@RRDIDX@:MAX: %5.1lf %s (max)
GPRINT:re@RRDIDX@:MIN: %5.1lf %s (min)
GPRINT:re@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
LINE1:we@RRDIDX@#@COLOR@:Write Errors :
GPRINT:we@RRDIDX@:LAST:\: %5.1lf %s (cur)
GPRINT:we@RRDIDX@:MAX: %5.1lf %s (max)
GPRINT:we@RRDIDX@:MIN: %5.1lf %s (min)
GPRINT:we@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
[smart_temp]
TITLE S.M.A.R.T. Disk Temperature
YAXIS Celsius
FNPATTERN ^smart.(.*).rrd
DEF:temp@RRDIDX@=@RRDFN@:temp:AVERAGE
LINE1:temp@RRDIDX@#@COLOR@:@RRDPARAM@ temperature:
GPRINT:temp@RRDIDX@:LAST:\: %5.1lf°C (cur)
GPRINT:temp@RRDIDX@:MAX: %5.1lf°C (max)
GPRINT:temp@RRDIDX@:MIN: %5.1lf°C (min)
GPRINT:temp@RRDIDX@:AVERAGE: %5.1lf°C (avg)\n
[smart_nonmedium]
TITLE S.M.A.R.T. Non-Medium Errors
YAXIS errors per second
FNPATTERN ^smart.(.*).rrd
DEF:nmec@RRDIDX@=@RRDFN@:err_nmec:AVERAGE
LINE1:nmec@RRDIDX@#@COLOR@:@RRDPARAM@ non-medium errors:
GPRINT:nmec@RRDIDX@:LAST:\: %5.1lf %s (cur)
GPRINT:nmec@RRDIDX@:MAX: %5.1lf %s (max)
GPRINT:nmec@RRDIDX@:MIN: %5.1lf %s (min)
GPRINT:nmec@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
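Add further graph definitions as desired. The RRD files produce the following DS names:
- err_r_c = corrected read errors
- err_w_c = corrected write errors
- err_r_u = uncorrected read errors
- err_w_u = uncorrected write errors
- err_nmec = non-medium errors
- temp = disk temperature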
5) Add “smart” to the TEST2RRD and GRAPHS variables in xymonserver.cfg, to have the graphs included on the smart status page and the trends page.
6) Add “TRENDS:*,smart:smart|smart_temp” to the relevant entries in hosts.cfg, or the “_default_” entry.
#!/bin/sh
# SMART disk monitor
# Jeremy Laidman, 2012
#
# Version 0.5 - August 2012
# - initial public release
#
# Initially based on Michael Adelmann's "smart" script
# (see: http://xymonton.org/monitors:smart), the main
# improvements are to support multiple disks, and to
# send error counts for graphing.
#
# This script queries the SMART parameters of the drives
# on a system, and returns the status of those drives
# as well as reporting various metrics available from
# the SMART data.
#
# How it Works
# ------------
#
# The script gets its configuration from the environment
# or from a configuration file.
#
# The script runs in write mode (with a "-w" switch) to
# create the status file from the output of the smartctl
# command. Typically this is done every 5 minutes from cron.
#
# The script also runs in read mode (with a "-r" switch)
# to read in the status file and parse it for sending data
# and status reports to Xymon. Typically this is done
# every 5 minutes from a xymonlaunch configuration file
# (tasks.cfg on a Xymon server, or xymonlaunch.cfg on
# a Xymon client).
#
# In read mode, the script constructs a status report
# for Xymon to warn if any of the following problems is
# detected:
# - SMART is not enabled on the drive
# - SMART self-test is not "OK"
# - SMART health status is not "OK"
#
# The script also sends a data report for Xymon to turn
# into RRD files for graphing. The data points reported
# are:
# - corrected read errors
# - corrected write errors
# - uncorrected read errors
# - uncorrected write errors
# - non-medium errors
# - disk temperature
#
#
# To Install
# ----------
#
# Client-side:
#
# 1) Copy the script into a suitable location,
# such as /usr/lib/xymon/client/ext/xymon-SMART.sh
#
# 2) Create a crontab entry (eg /etc/cron.d/xymon-SMART.cron) containing this:
#
# */5 * * * * root ( umask 002; XYMONCLIENTHOME=/usr/lib/xymon/client \
# CONTROLLER=cciss COUNT=0 DEVICE=cciss/c0d0 \
# /path/to/xymon-SMART.sh -w /tmp/SMART.status ) 2>/tmp/SMART.status.err
#
# Adjust for your requirements. Use "cat /proc/partitions" to
# find a suitable DEVICE value. Test out values with:
#
# smartctl -d $CONTROLLER,$COUNT -i /dev/$DEVICE
#
# For multiple devices, specify a comma-separated list of numbers
# in the COUNT variable, such as:
# ... COUNT=0,1 ...
# The comma-separated list is handled by this script;
# smartctl itself accepts only a single number here.
#
# 3) Create a Xymon client tasks entry like this:
#
# [smart]
# ENVFILE $XYMONCLIENTHOME/etc/xymonclient.cfg
# CMD /path/to/xymon-SMART.sh -r /tmp/SMART.status
# LOGFILE $XYMONCLIENTLOGS/xymonclient.log
# INTERVAL 5m
#
# Server-side:
#
# 4) Create entries in graphs.cfg like so:
#
# [smart]
# # total read/write errors
# TITLE S.M.A.R.T. Total Media Errors
# YAXIS errors per second
# FNPATTERN ^smart.(.*).rrd
# DEF:rc@RRDIDX@=@RRDFN@:err_r_c:AVERAGE
# DEF:ru@RRDIDX@=@RRDFN@:err_r_u:AVERAGE
# DEF:wc@RRDIDX@=@RRDFN@:err_w_c:AVERAGE
# DEF:wu@RRDIDX@=@RRDFN@:err_w_u:AVERAGE
# CDEF:re@RRDIDX@=rc@RRDIDX@,ru@RRDIDX@,+
# CDEF:we@RRDIDX@=wc@RRDIDX@,wu@RRDIDX@,+
# COMMENT:@RRDPARAM@\:\n
# LINE1:re@RRDIDX@#@COLOR@:Read Errors :
# GPRINT:re@RRDIDX@:LAST:\: %5.1lf %s (cur)
# GPRINT:re@RRDIDX@:MAX: %5.1lf %s (max)
# GPRINT:re@RRDIDX@:MIN: %5.1lf %s (min)
# GPRINT:re@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
# LINE1:we@RRDIDX@#@COLOR@:Write Errors :
# GPRINT:we@RRDIDX@:LAST:\: %5.1lf %s (cur)
# GPRINT:we@RRDIDX@:MAX: %5.1lf %s (max)
# GPRINT:we@RRDIDX@:MIN: %5.1lf %s (min)
# GPRINT:we@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#
# [smart_temp]
# TITLE S.M.A.R.T. Disk Temperature
# YAXIS Celsius
# FNPATTERN ^smart.(.*).rrd
# DEF:temp@RRDIDX@=@RRDFN@:temp:AVERAGE
# LINE1:temp@RRDIDX@#@COLOR@:@RRDPARAM@ temperature:
# GPRINT:temp@RRDIDX@:LAST:\: %5.1lf°C (cur)
# GPRINT:temp@RRDIDX@:MAX: %5.1lf°C (max)
# GPRINT:temp@RRDIDX@:MIN: %5.1lf°C (min)
# GPRINT:temp@RRDIDX@:AVERAGE: %5.1lf°C (avg)\n
#
# [smart_nonmedium]
# TITLE S.M.A.R.T. Non-Medium Errors
# YAXIS errors per second
# FNPATTERN ^smart.(.*).rrd
# DEF:nmec@RRDIDX@=@RRDFN@:err_nmec:AVERAGE
# LINE1:nmec@RRDIDX@#@COLOR@:@RRDPARAM@ non-medium errors:
# GPRINT:nmec@RRDIDX@:LAST:\: %5.1lf %s (cur)
# GPRINT:nmec@RRDIDX@:MAX: %5.1lf %s (max)
# GPRINT:nmec@RRDIDX@:MIN: %5.1lf %s (min)
# GPRINT:nmec@RRDIDX@:AVERAGE: %5.1lf %s (avg)\n
#
# Add further graph definitions as desired.
# The RRD files produce the following DS names:
# - err_r_c = corrected read errors
# - err_w_c = corrected write errors
# - err_r_u = uncorrected read errors
# - err_w_u = uncorrected write errors
# - err_nmec = non-medium errors
# - temp = disk temperature
#
# 5) Add "smart" to the TEST2RRD and GRAPHS variables in
# xymonserver.cfg, to have the graphs included on the
# smart status page and the trends page.
#
# 6) Add "TRENDS:*,smart:smart|smart_temp" to the relevant
# entries in hosts.cfg, or the "_default_" entry.
#
#
# Troubleshooting
# ---------------
#
# * Check the cron output in /tmp/SMART.status.err and look
# for errors that indicate where the problem might be.
#
# * Check that the file /tmp/SMART.status is being updated.
# If not, ensure that the script is being run by cron.
#
# * Ensure that the crontab entry is being run. On some
# systems, simply creating a file in /etc/cron.d/ will
# not tell crond that there has been a change to its
# configuration. If this appears to be a problem, simply
# touch the directory containing the crontabs, such as
#
# sudo touch /var/spool/cron/tabs
#
# * If the status file appears correct, manually run the
# script in read (-r) mode with debugging and dry-run:
#
# xymoncmd /path/to/xymon-SMART.sh -r -d 1 -n /tmp/SMART.status
#
# * Check the Xymon log files, particularly xymonclient.log,
# xymonlaunch.log and rrd-status.log.
#
#
# A note about compatibility
# --------------------------
#
# This script makes use of features of GNU "ls" and
# GNU "date" to determine if a status file is fresh.
# This probably won't work on systems that don't have
# GNU "ls" and GNU "date". However such a scenario
# is unlikely on systems where smartctl is functioning.
die() { echo "$@" >&2; exit 1; }
VERSION=0.5
NL="
" # newline character
if [ "$DEBUG" ]; then
BB="echo"
[ "$XYMONCLIENTHOME" ] || XYMONCLIENTHOME="/usr/lib/xymon/client"
[ "$BBDISP" ] || BBDISP="0.0.0.0"
[ "$MACHINE" ] || MACHINE="machine"
fi
[ "$XYMON" ] || XYMON="$BB"
[ "$XYMSRV" ] || XYMSRV="$BBDISP"
COLOR="clear"
COLUMN="smart"
CONFIG="${XYMONCLIENTHOME}/etc/smart.conf"
MSG="No S.M.A.R.T. device detected."
RAID=""
RAIDADDR=""
SMARTCTL="/usr/sbin/smartctl"
SUDO="/usr/bin/sudo"
setup_config() {
# read config file
if [ -f $CONFIG ]; then
. $CONFIG # "." is the POSIX form; "source" is not available in every /bin/sh
else
[ "$CONTROLLER" -a "$COUNT" -a "$DEVICE" ] ||
die "Configuration file not found: $CONFIG"
fi
if [ -n "$CONTROLLER" ]; then
RAIDADDR="$CONTROLLER,$COUNT"
RAID="-d $RAIDADDR"
[ 0$DEBUG -gt 1 ] && echo "debug: RAID set to '$RAID'"
fi
[ -b "/dev/$DEVICE" ] || die "Invalid device: /dev/$DEVICE"
RESULT="Device:\n\t$DEVICE\n\nStatus:\n\n"
}
get_smart_status() {
# we parse the output and set some flags
echo "$@" | while read LINE; do
case $LINE in
"Device Address:"*)
COUNTER=`expr 0$COUNTER + 1`
set - $LINE""
DEVADDR=$3
echo "DEVADDR_$COUNTER=$DEVADDR"
echo "DEVICES=\"\$DEVICES $COUNTER\""
;;
"Self Test returned without error")
echo "SMART_SELFTEST_$COUNTER=OK"
;;
"SMART Health Status:"*)
set - $LINE""
echo "SMART_HEALTH_$COUNTER=$4"
;;
"Device supports SMART and is Enabled")
set - $LINE""
echo "SMART_ENABLED_$COUNTER=1"
echo "SMART_ENABLED=1"
;;
esac
done
}
get_rrd_data() {
# we parse the output and show some numbers
echo "$@" | while read LINE; do
case $LINE in
"Device Address:"*)
set - $LINE""
[ "$FIRST" ] && echo ""
echo "[smart.$3.rrd]"
FIRST=1
[ 0$DEBUG -gt 0 ] && echo "Found device $3" >&2
;;
read:*)
set - $LINE""
echo "DS:err_r_c:COUNTER:600:0:U $5"
echo "DS:err_r_u:COUNTER:600:0:U $8"
;;
write:*)
set - $LINE""
echo "DS:err_w_c:COUNTER:600:0:U $5"
echo "DS:err_w_u:COUNTER:600:0:U $8"
;;
"Non-medium error count:"*)
set - $LINE""
echo "DS:err_nmec:COUNTER:600:0:U $4"
;;
"Current Drive Temperature:"*)
set - $LINE""
echo "DS:temp:GAUGE:600:U:U $4"
;;
esac
done
}
show_version() {
echo "Version: $VERSION"
}
show_usage() {
echo "Usage: $0 [-w writefile|-r readfile|-n|-d|-d N|-h|-V]"
show_version;
echo "Specify -w filename (or --write) to write to file (use '-' for STDOUT)"
echo "Specify -r filename (or --read) to read from a file (use '-' for STDIN)"
echo "Specify -d [N] (or --debug [N]) to enable debug mode, optionally with a debug level"
echo "Specify -n (or --dryrun) to stop short of updating Xymon (typically used with -d)"
echo "Typically, run as root: '$0 -w tmpfile' and then as Xymon: '$0 -r tmpfile'."
echo "If neither -w nor -r is given, Xymon must have sudo rights to run smartctl with no password."
}
# Handle CLI modifiers
while [ "$1" ]; do
case "$1" in
"") ;;
-d|--debug) DEBUG=1
test 0$2 -gt 0 2>/dev/null && { DEBUG=$2; shift; }
echo "debug: Debug level $DEBUG"
;;
-q|--quiet) QUIET=1
;;
-r|--read) READ=1
[ 0$DEBUG -gt 0 ] && echo "debug: read mode"
[ "$2" ] || die "Specify file to read"
READFILE="$2"
shift
if [ -f "$READFILE" ]; then
[ -r "$READFILE" ] || die "Unable to read file: $READFILE"
else
[ 0$QUIET -gt 0 ] && exit
die "File not found: $READFILE"
fi
;;
-w|--write)
[ 0$DEBUG -gt 0 ] && echo "debug: write mode"
[ "$2" ] || die "Specify file to write"
WRITEFILE="$2"
shift
> $WRITEFILE
for C in `IFS=,; set - ""$COUNT; echo $@`; do
COUNT=$C setup_config
if [ "$WRITEFILE" = "-" ]; then
[ "$RAIDADDR" ] && echo "Device Address: $RAIDADDR"
$SMARTCTL /dev/$DEVICE $RAID --all -X
else
# assume that $SMARTCTL or ">" will output any errors
# so we just bail silently with RC=1
{
if [ "$RAIDADDR" ]; then
[ -s $WRITEFILE ] && echo ""
echo "Device Address: $RAIDADDR"
fi
$SMARTCTL /dev/$DEVICE $RAID --all -X
} >> $WRITEFILE || exit 1
fi
[ 0$DEBUG -eq 0 -o "$WRITEFILE" = "-" ] || cat $WRITEFILE | sed 's/^/debug: /'
done
exit
;;
-n|--dryrun) DRYRUN=1
;;
-V|--version)
show_version
exit
;;
-h|--help)
show_usage
exit
;;
*) die "Unexpected parameter: $1" ;;
esac
shift
done
if [ 0$READ -gt 0 ]; then
[ 0$DEBUG -gt 0 ] && echo "debug: reading status from file '$READFILE'"
# bail if the file is older than 10 minutes (600 seconds)
if [ "$READFILE" = "-" ]; then
FILETIME=`ls -lL --time-style "+%s" </dev/stdin | { read X X X X X B X; echo $B; }`
else
FILETIME=`ls -lL --time-style "+%s" $READFILE | { read X X X X X B X; echo $B; }`
fi
TIMENOW=`date "+%s"`
TIMEDIFF=`expr $TIMENOW - $FILETIME`
[ 0$TIMEDIFF -lt 0 ] && die "Invalid timestamp"
[ 0$TIMEDIFF -gt 600 ] && die "Stale SMART file is $TIMEDIFF seconds old"
if [ "$READFILE" = "-" ]; then
TMP=`cat`
else
TMP=`cat $READFILE`
fi
else
TMP=""
for C in `IFS=,; set - ""$COUNT; echo $@`; do
COUNT=$C setup_config
[ "$RAIDADDR" ] && TMP="$TMP${NL}`echo Device Address: $RAIDADDR`"
TMP="$TMP${NL}`$SUDO $SMARTCTL /dev/$DEVICE $RAID --all -X`"
done
fi
SMARTSTATUS=`get_smart_status "$TMP"`
[ 0$DEBUG -gt 1 ] && echo "$SMARTSTATUS"
eval $SMARTSTATUS
RRDDATA=`get_rrd_data "$TMP"`
[ "$SMART_ENABLED" ] && SMART=1
[ "$XYMON" ] || die "Xymon environment is not set up"
MSG="$TMP"
for DEVINDEX in $DEVICES; do
COLOR="green"
eval DEVNAME=\$DEVADDR_$DEVINDEX
[ 0$DEBUG -gt 0 ] && echo "Checking SMART for $DEVNAME"
eval SMART_ENABLED=\$SMART_ENABLED_$DEVINDEX
if [ "$SMART_ENABLED" ]; then
RESULT="$RESULT\t&green $DEVNAME supports SMART and is enabled\n"
else
COLOR="yellow"
RESULT="$RESULT\t&yellow $DEVNAME does not support SMART or is not enabled\n"
fi
eval SMART_HEALTH=\$SMART_HEALTH_$DEVINDEX
if [ "$SMART_HEALTH" = "OK" ]; then
RESULT="$RESULT\t&green $DEVNAME SMART Health Status: OK\n"
else
COLOR="red"
RESULT="$RESULT\t&red $DEVNAME SMART Health Status: $SMART_HEALTH\n"
fi
SELF=`echo "$TMP" | grep "Self Test returned without error"`
eval SMART_SELFTEST=\$SMART_SELFTEST_$DEVINDEX
if [ "$SMART_SELFTEST" = "OK" ]; then
RESULT="$RESULT\t&green $DEVNAME Self Test returned without error\n"
else
COLOR="red"
RESULT="$RESULT\t&red $DEVNAME Self Test returned with error: $SMART_SELFTEST\n"
fi
done
MSG=`echo -e "\n$RESULT\n\n$MSG\n"`
if [ 0$DEBUG -gt 0 ]; then
echo "Messages to Xymon:"
echo
echo $XYMON $BBDISP "status $MACHINE.$COLUMN $COLOR `date` $MSG"
echo
echo $XYMON $BBDISP "data $MACHINE.trends${NL}$RRDDATA"
fi
if [ 0$DRYRUN -eq 0 ]; then
$XYMON $BBDISP "status $MACHINE.$COLUMN $COLOR `date` $MSG"
$XYMON $BBDISP "data $MACHINE.trends${NL}$RRDDATA"
fi
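The freshness test in read mode depends on GNU `ls --time-style` and `date +%s`. A minimal sketch of the same check in isolation, assuming GNU coreutils are installed; the helper name `file_age` is hypothetical and a temporary file stands in for /tmp/SMART.status.

```shell
#!/bin/sh
# Hypothetical helper: print the age of a file in seconds.
# Relies on GNU ls (--time-style) and date +%s, as the script does.
file_age() {
    MTIME=$(ls -lL --time-style=+%s "$1" | { read -r X X X X X T X; echo "$T"; })
    NOW=$(date +%s)
    echo $((NOW - MTIME))
}

STATUS=$(mktemp)            # stand-in for /tmp/SMART.status
AGE=$(file_age "$STATUS")
if [ "$AGE" -gt 600 ]; then
    echo "stale: $AGE seconds old"
else
    echo "fresh: $AGE seconds old"
fi
rm -f "$STATUS"
```

If this fails on a given system, the read mode of the script is likely to fail there for the same reason.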
Known Bugs and Issues