| — | monitors:xymon-smart [2012/08/30 05:14] (current) – created - external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== xymon-SMART.sh ====== | ||
| + | ^ Author | [[ jlaidman+xymon-smart@rebel-it.com.au | Jeremy Laidman ]] | | ||
| + | ^ Compatibility | Xymon 4.3.3 | | ||
| + | ^ Requirements | smartmontools, GNU ls, GNU date | | ||
| + | ^ Download | None | | ||
| + | ^ Last Update | 2012-08-30 | | ||
| + | |||
| + | ===== Description ===== | ||
| + | This script queries the SMART parameters of the drives on a system, and returns the status of those drives as well as reporting various metrics available from the SMART data. | ||
| + | |||
| + | The script gets its configuration from the environment or from a configuration file. | ||
| + | |||
| + | The script runs in write mode (with a " | ||
| + | |||
| + | The script also runs in read mode (with a " | ||
| + | |||
| + | In read mode, the script constructs a status report for Xymon to warn if one of the following problems is detected: | ||
| + | * SMART is not enabled on the drive | ||
| + | * SMART self-test is not " | ||
| + | * SMART health status is not " | ||
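The three checks above can be sketched as a small shell function. This is only an illustration, not the script's actual logic: the function name `smart_color`, its arguments, the expected values `passed` and `OK`, and the choice of `red` as the warning colour are all assumptions made for the example, since the strings the real script compares against are not shown here.

```shell
#!/bin/sh
# Hypothetical helper: print "green" only when all three checks pass.
# $1 = "yes" if SMART is enabled, $2 = self-test result, $3 = health status.
# The expected values "passed" and "OK" are placeholders for illustration.
smart_color() {
    enabled=$1
    selftest=$2
    health=$3
    color=green
    # Any single failed check is enough to change the colour.
    [ "$enabled" = "yes" ] || color=red
    [ "$selftest" = "passed" ] || color=red
    [ "$health" = "OK" ] || color=red
    echo "$color"
}

smart_color yes passed OK       # all three checks pass
smart_color yes passed FAILED   # health check fails
```

Starting from green and only ever downgrading means the order of the checks does not matter, which keeps a per-device loop simple.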
| + | |||
| + | The script also sends a data report for Xymon to turn into RRD files for graphing. The metrics reported are: | ||
| + | * corrected read errors | ||
| + | * corrected write errors | ||
| + | * uncorrected read errors | ||
| + | * uncorrected write errors | ||
| + | * non-medium errors | ||
| + | * disk temperature | ||
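As a rough sketch of what such a data report could look like, the function below prints one RRD section for a device. It assumes the Xymon trends format of a `[file.rrd]` header followed by `DS:name:type:heartbeat:min:max value` lines; the function name `smart_trends_section`, the `DERIVE`/`GAUGE` types and the 600-second heartbeat are assumptions for illustration, not values taken from the script. The DS names `err_r_c` and `temp` and the `smart.*.rrd` file naming follow the graph definitions on this page.

```shell
#!/bin/sh
# Hypothetical helper: print one RRD section for a device.
# Assumed format: "[file.rrd]" then "DS:name:type:heartbeat:min:max value".
# $1 = device label, $2 = corrected read errors, $3 = temperature.
smart_trends_section() {
    echo "[smart.$1.rrd]"
    echo "DS:err_r_c:DERIVE:600:0:U $2"   # counter, graphed as a rate
    echo "DS:temp:GAUGE:600:0:U $3"       # instantaneous reading
}

smart_trends_section sda 12 34
```

Output of this kind would form the body of a `data $MACHINE.trends` message, which is how the script's final lines pass `$RRDDATA` to Xymon.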
| + | |||
| + | {{: | ||
| + | |||
| + | {{: | ||
| + | |||
| + | ===== Installation ===== | ||
| + | === Client side === | ||
| + | 1) Copy the script into a suitable location, such as ''/ | ||
| + | |||
| + | 2) Create a crontab entry (eg / | ||
| + | |||
| + | < | ||
| + | */5 * * * * root ( umask 002; XYMONCLIENTHOME=/ | ||
| + | CONTROLLER=cciss COUNT=0 DEVICE=cciss/ | ||
| + | / | ||
| + | </ | ||
| + | |||
| + | Adjust for your requirements. Use a command like the following to | ||
| + | find a suitable DEVICE value. | ||
| + | |||
| + | smartctl -d $CONTROLLER, | ||
| + | |||
| + | For multiple devices, specify a comma-separated list of numbers | ||
| + | in the COUNT variable, such as: | ||
| + | ... COUNT=0,1 ... | ||
| + | Note: This usage of COUNT is not supported by smartctl. | ||
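Because smartctl itself takes a single device number, a comma-separated COUNT has to be split by the calling script. The following is a minimal sketch of one way to do that in POSIX sh; the helper name `split_count` is hypothetical and the loop body is a stand-in for the real per-device work.

```shell
#!/bin/sh
# Hypothetical helper: print each number from a comma-separated list
# on its own line, e.g. "0,1" becomes "0" and "1".
split_count() {
    (
        set -f      # disable globbing while the list is expanded unquoted
        IFS=,       # split on commas only; the subshell keeps this local
        for n in $1; do
            echo "$n"
        done
    )
}

for C in $(split_count "0,1"); do
    echo "would query device number $C"
done
```

Running the split in a subshell avoids having to save and restore IFS in the caller.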
| + | |||
| + | 3) Create a Xymon client tasks entry like this: | ||
| + | |||
| + | [smart] | ||
| + | | ||
| + | CMD / | ||
| + | | ||
| + | | ||
| + | |||
| + | === Server side === | ||
| + | 4) Create entries in graphs.cfg like so: | ||
| + | |||
| + | [smart] | ||
| + | # total read/write errors | ||
| + | TITLE S.M.A.R.T. Total Media Errors | ||
| + | YAXIS errors per second | ||
| + | FNPATTERN ^smart.(.*).rrd | ||
| + | DEF: | ||
| + | DEF: | ||
| + | DEF: | ||
| + | DEF: | ||
| + | CDEF: | ||
| + | CDEF: | ||
| + | COMMENT: | ||
| + | LINE1: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | LINE1: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | | ||
| + | [smart_temp] | ||
| + | TITLE S.M.A.R.T. Disk Temperature | ||
| + | YAXIS Celsius | ||
| + | FNPATTERN ^smart.(.*).rrd | ||
| + | DEF: | ||
| + | LINE1: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | | ||
| + | [smart_nonmedium] | ||
| + | TITLE S.M.A.R.T. Non-Medium Errors | ||
| + | YAXIS errors per second | ||
| + | FNPATTERN ^smart.(.*).rrd | ||
| + | DEF: | ||
| + | LINE1: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | GPRINT: | ||
| + | |||
| + | Add further graph definitions as desired. The RRD files produce the following DS names: | ||
| + | * err_r_c | ||
| + | * err_w_c | ||
| + | * err_r_u | ||
| + | * err_w_u | ||
| + | * err_nmec = non-medium errors | ||
| + | * temp = disk temperature | ||
| + | |||
| + | 5) Add " | ||
| + | |||
| + | 6) Add " | ||
| + | ===== Source ===== | ||
| + | ==== xymon-SMART.sh ==== | ||
| + | |||
| + | <hidden onHidden=" | ||
| + | < | ||
| + | #!/bin/sh | ||
| + | |||
| + | # SMART disk monitor | ||
| + | # Jeremy Laidman, 2012 | ||
| + | # | ||
| + | # Version 0.5 - August 2012 | ||
| + | # - initial public release | ||
| + | # | ||
| + | # Initially based on Michael Adelmann' | ||
| + | # (see: http:// | ||
| + | # improvements are to support multiple disks, and to | ||
| + | # send error counts for graphing. | ||
| + | # | ||
| + | # This script queries the SMART parameters of the drives | ||
| + | # on a system, and returns the status of those drives | ||
| + | # as well as reporting various metrics available from | ||
| + | # the SMART data. | ||
| + | # | ||
| + | # How it Works | ||
| + | # ------------ | ||
| + | # | ||
| + | # The script gets its configuration from the environment | ||
| + | # or from a configuration file. | ||
| + | # | ||
| + | # The script runs in write mode (with a " | ||
| + | # create the status file from the output of the smartctl | ||
| + | # command. | ||
| + | # | ||
| + | # The script also runs in read mode (with a " | ||
| + | # to read in the status file and parse it for sending data | ||
| + | # and status reports to Xymon. This mode is run | ||
| + | # every 5 minutes from a xymonlaunch configuration file | ||
| + | # (tasks.cfg on a Xymon server, or xymonlaunch.cfg on | ||
| + | # a Xymon client). | ||
| + | # | ||
| + | # In read mode, the script constructs a status report | ||
| + | # for Xymon to warn if one of the following problems is | ||
| + | # detected: | ||
| + | # - SMART is not enabled on the drive | ||
| + | # - SMART self-test is not " | ||
| + | # - SMART health status is not " | ||
| + | # | ||
| + | # The script also sends a data report for Xymon to turn | ||
| + | # into RRD files for graphing. The metrics reported | ||
| + | # are: | ||
| + | # - corrected read errors | ||
| + | # - corrected write errors | ||
| + | # - uncorrected read errors | ||
| + | # - uncorrected write errors | ||
| + | # - non-medium errors | ||
| + | # - disk temperature | ||
| + | # | ||
| + | # | ||
| + | # To Install | ||
| + | # ---------- | ||
| + | # | ||
| + | # Client-side: | ||
| + | # | ||
| + | # 1) Copy the script into a suitable location, | ||
| + | # such as / | ||
| + | # | ||
| + | # 2) Create a crontab entry (eg / | ||
| + | # | ||
| + | # */5 * * * * root ( umask 002; XYMONCLIENTHOME=/ | ||
| + | # | ||
| + | # / | ||
| + | # | ||
| + | # Adjust for your requirements. Use a command like the following to | ||
| + | # find a suitable DEVICE value. | ||
| + | # | ||
| + | # smartctl -d $CONTROLLER, | ||
| + | # | ||
| + | # For multiple devices, specify a comma-separated list of numbers | ||
| + | # in the COUNT variable, such as: | ||
| + | # ... COUNT=0,1 ... | ||
| + | # This usage of COUNT is not supported by smartctl. | ||
| + | # | ||
| + | # 3) Create a Xymon client tasks entry like this: | ||
| + | # | ||
| + | # [smart] | ||
| + | # | ||
| + | # CMD / | ||
| + | # | ||
| + | # | ||
| + | # | ||
| + | # Server-side: | ||
| + | # | ||
| + | # 4) Create entries in graphs.cfg like so: | ||
| + | # | ||
| + | # [smart] | ||
| + | # # total read/write errors | ||
| + | # TITLE S.M.A.R.T. Total Media Errors | ||
| + | # YAXIS errors per second | ||
| + | # FNPATTERN ^smart.(.*).rrd | ||
| + | # DEF: | ||
| + | # DEF: | ||
| + | # DEF: | ||
| + | # DEF: | ||
| + | # CDEF: | ||
| + | # CDEF: | ||
| + | # COMMENT: | ||
| + | # LINE1: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # LINE1: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # | ||
| + | # [smart_temp] | ||
| + | # TITLE S.M.A.R.T. Disk Temperature | ||
| + | # YAXIS Celsius | ||
| + | # FNPATTERN ^smart.(.*).rrd | ||
| + | # DEF: | ||
| + | # LINE1: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # | ||
| + | # [smart_nonmedium] | ||
| + | # TITLE S.M.A.R.T. Non-Medium Errors | ||
| + | # YAXIS errors per second | ||
| + | # FNPATTERN ^smart.(.*).rrd | ||
| + | # DEF: | ||
| + | # LINE1: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # GPRINT: | ||
| + | # | ||
| + | # Add further graph definitions as desired. | ||
| + | # The RRD files produce the following DS names: | ||
| + | # - err_r_c | ||
| + | # - err_w_c | ||
| + | # - err_r_u | ||
| + | # - err_w_u | ||
| + | # - err_nmec = non-medium errors | ||
| + | # - temp = disk temperature | ||
| + | # | ||
| + | # 5) Add " | ||
| + | # xymonserver.cfg, | ||
| + | # smart status page and the trends page. | ||
| + | # | ||
| + | # 6) Add " | ||
| + | # entries in hosts.cfg, or the " | ||
| + | # | ||
| + | # | ||
| + | # Troubleshooting | ||
| + | # --------------- | ||
| + | # | ||
| + | # * Check the cron output in / | ||
| + | # for errors that indicate where the problem might be. | ||
| + | # | ||
| + | # * Check that the file / | ||
| + | # If not, ensure that the script is being run by cron. | ||
| + | # | ||
| + | # * Ensure that the crontab entry is being run. On some | ||
| + | # | ||
| + | # not tell crond that there has been a change to its | ||
| + | # | ||
| + | # touch the directory containing the crontabs, such as | ||
| + | # | ||
| + | # sudo touch / | ||
| + | # | ||
| + | # * If the status file appears correct, manually run the | ||
| + | # | ||
| + | # | ||
| + | # xymoncmd / | ||
| + | # | ||
| + | # * Check the Xymon log files, particularly xymonclient.log, | ||
| + | # | ||
| + | # | ||
| + | # | ||
| + | # A note about compatibility | ||
| + | # -------------------------- | ||
| + | # | ||
| + | # This script makes use of features of GNU " | ||
| + | # GNU " | ||
| + | # This probably won't work on systems that don't have | ||
| + | # GNU " | ||
| + | # is unlikely on systems where smartctl is functioning. | ||
| + | |||
| + | die() { echo " | ||
| + | |||
| + | VERSION=0.5 | ||
| + | |||
| + | NL=" | ||
| + | " | ||
| + | |||
| + | |||
| + | if [ " | ||
| + | BB=" | ||
| + | [ " | ||
| + | [ " | ||
| + | [ " | ||
| + | fi | ||
| + | |||
| + | [ " | ||
| + | [ " | ||
| + | |||
| + | COLOR=" | ||
| + | COLUMN=" | ||
| + | CONFIG=" | ||
| + | MSG=" | ||
| + | RAID="" | ||
| + | RAIDADDR="" | ||
| + | SMARTCTL="/ | ||
| + | SUDO="/ | ||
| + | |||
| + | setup_config() { | ||
| + | # read config file | ||
| + | if [ -f "$CONFIG" ]; then | ||
| + | . "$CONFIG" | ||
| + | else | ||
| + | [ " | ||
| + | die " | ||
| + | fi | ||
| + | |||
| + | if [ -n " | ||
| + | RAIDADDR=" | ||
| + | RAID=" | ||
| + | [ 0$DEBUG -gt 1 ] && echo " | ||
| + | fi | ||
| + | |||
| + | [ -b "/ | ||
| + | |||
| + | RESULT=" | ||
| + | } | ||
| + | |||
| + | get_smart_status() { | ||
| + | # we parse the output and set some flags | ||
| + | echo " | ||
| + | case $LINE in | ||
| + | " | ||
| + | COUNTER=`expr 0$COUNTER + 1` | ||
| + | set - $LINE"" | ||
| + | DEVADDR=$3 | ||
| + | echo " | ||
| + | echo " | ||
| + | ;; | ||
| + | "Self Test returned without error" | ||
| + | echo " | ||
| + | ;; | ||
| + | "SMART Health Status:" | ||
| + | set - $LINE"" | ||
| + | echo " | ||
| + | ;; | ||
| + | " | ||
| + | set - $LINE"" | ||
| + | echo " | ||
| + | echo " | ||
| + | ;; | ||
| + | esac | ||
| + | done | ||
| + | } | ||
| + | |||
| + | get_rrd_data() { | ||
| + | # we parse the output and show some numbers | ||
| + | echo " | ||
| + | case $LINE in | ||
| + | " | ||
| + | set - $LINE"" | ||
| + | [ " | ||
| + | echo " | ||
| + | FIRST=1 | ||
| + | [ 0$DEBUG -gt 0 ] && echo "Found device $3" >&2 | ||
| + | ;; | ||
| + | read:*) | ||
| + | set - $LINE"" | ||
| + | echo " | ||
| + | echo " | ||
| + | ;; | ||
| + | write:*) | ||
| + | set - $LINE"" | ||
| + | echo " | ||
| + | echo " | ||
| + | ;; | ||
| + | " | ||
| + | set - $LINE"" | ||
| + | echo " | ||
| + | ;; | ||
| + | " | ||
| + | set - $LINE"" | ||
| + | echo " | ||
| + | ;; | ||
| + | esac | ||
| + | done | ||
| + | } | ||
| + | |||
| + | show_version() { | ||
| + | echo " | ||
| + | } | ||
| + | |||
| + | show_usage() { | ||
| + | echo " | ||
| + | show_version; | ||
| + | echo " | ||
| + | echo " | ||
| + | echo " | ||
| + | echo " | ||
| + | echo " | ||
| + | echo "If no switches are given, Xymon must have sudo rights to run the script with no password." | ||
| + | } | ||
| + | |||
| + | # Handle CLI modifiers | ||
| + | while [ " | ||
| + | case " | ||
| + | "" | ||
| + | -d|--debug) | ||
| + | test 0$2 -gt 0 2>/ | ||
| + | echo " | ||
| + | ;; | ||
| + | -q|--quiet) | ||
| + | ;; | ||
| + | -r|--read) | ||
| + | [ 0$DEBUG -gt 0 ] && echo " | ||
| + | [ " | ||
| + | READFILE=" | ||
| + | shift | ||
| + | if [ -f " | ||
| + | [ -r " | ||
| + | else | ||
| + | [ 0$QUIET -gt 0 ] && exit | ||
| + | die "File not found: $READFILE" | ||
| + | fi | ||
| + | ;; | ||
| + | -w|--write) | ||
| + | [ 0$DEBUG -gt 0 ] && echo " | ||
| + | [ " | ||
| + | WRITEFILE=" | ||
| + | shift | ||
| + | > $WRITEFILE | ||
| + | for C in `IFS=,; set - "" | ||
| + | COUNT=$C setup_config | ||
| + | if [ " | ||
| + | [ " | ||
| + | $SMARTCTL / | ||
| + | else | ||
| + | # assume that $SMARTCTL or ">" | ||
| + | # so we just bail silently with RC=1 | ||
| + | { | ||
| + | if [ " | ||
| + | [ -s $WRITEFILE ] && echo "" | ||
| + | echo " | ||
| + | fi | ||
| + | $SMARTCTL / | ||
| + | } >> $WRITEFILE || exit 1 | ||
| + | fi | ||
| + | [ 0$DEBUG -eq 0 -o " | ||
| + | done | ||
| + | exit | ||
| + | ;; | ||
| + | -n|--dryrun) | ||
| + | ;; | ||
| + | -V|--version) | ||
| + | show_version | ||
| + | exit | ||
| + | ;; | ||
| + | -h|--help) | ||
| + | show_usage | ||
| + | exit | ||
| + | ;; | ||
| + | *) die " | ||
| + | esac | ||
| + | shift | ||
| + | done | ||
| + | |||
| + | if [ 0$READ -gt 0 ]; then | ||
| + | [ 0$DEBUG -gt 0 ] && echo " | ||
| + | # bail if the file is dated in the future or is more than 600 seconds (10 minutes) old | ||
| + | if [ " | ||
| + | FILETIME=`ls -lL --time-style " | ||
| + | else | ||
| + | FILETIME=`ls -lL --time-style " | ||
| + | fi | ||
| + | TIMENOW=`date " | ||
| + | TIMEDIFF=`expr $TIMENOW - $FILETIME` | ||
| + | [ 0$TIMEDIFF -lt 0 ] && die " | ||
| + | [ 0$TIMEDIFF -gt 600 ] && die "Stale SMART file is $TIMEDIFF seconds old" | ||
| + | if [ " | ||
| + | TMP=`cat` | ||
| + | else | ||
| + | TMP=`cat $READFILE` | ||
| + | fi | ||
| + | else | ||
| + | TMP="" | ||
| + | for C in `IFS=,; set - "" | ||
| + | COUNT=$C setup_config | ||
| + | [ " | ||
| + | TMP=" | ||
| + | done | ||
| + | fi | ||
| + | |||
| + | SMARTSTATUS=`get_smart_status " | ||
| + | [ 0$DEBUG -gt 1 ] && echo " | ||
| + | eval $SMARTSTATUS | ||
| + | |||
| + | RRDDATA=`get_rrd_data " | ||
| + | |||
| + | [ " | ||
| + | |||
| + | [ " | ||
| + | |||
| + | MSG=" | ||
| + | for DEVINDEX in $DEVICES; do | ||
| + | COLOR=" | ||
| + | |||
| + | eval DEVNAME=\$DEVADDR_$DEVINDEX | ||
| + | [ 0$DEBUG -gt 0 ] && echo " | ||
| + | |||
| + | eval SMART_ENABLED=\$SMART_ENABLED_$DEVINDEX | ||
| + | if [ " | ||
| + | RESULT=" | ||
| + | else | ||
| + | COLOR=" | ||
| + | RESULT=" | ||
| + | fi | ||
| + | |||
| + | eval SMART_HEALTH=\$SMART_HEALTH_$DEVINDEX | ||
| + | if [ " | ||
| + | RESULT=" | ||
| + | else | ||
| + | COLOR=" | ||
| + | RESULT=" | ||
| + | fi | ||
| + | |||
| + | SELF=`echo " | ||
| + | eval SMART_SELFTEST=\$SMART_SELFTEST_$DEVINDEX | ||
| + | if [ " | ||
| + | RESULT=" | ||
| + | else | ||
| + | COLOR=" | ||
| + | RESULT=" | ||
| + | fi | ||
| + | done | ||
| + | |||
| + | MSG=`echo -e " | ||
| + | |||
| + | if [ 0$DEBUG -gt 0 ]; then | ||
| + | echo " | ||
| + | echo | ||
| + | echo $XYMON $BBDISP " | ||
| + | echo | ||
| + | echo $XYMON $BBDISP "data $MACHINE.trends${NL}$RRDDATA" | ||
| + | fi | ||
| + | if [ 0$DRYRUN -eq 0 ]; then | ||
| + | $XYMON $BBDISP " | ||
| + | $XYMON $BBDISP "data $MACHINE.trends${NL}$RRDDATA" | ||
| + | fi | ||
| + | </ | ||
| + | </ | ||
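The read-mode staleness check in the script compares the status file's modification time with the current time and gives up when the difference exceeds 600 seconds. The sketch below shows the same idea in isolation. It is not the script's own method: it swaps in GNU `date -r FILE` to read the modification time, and the helper name `file_age` and the temporary file are assumptions used only for the demonstration.

```shell
#!/bin/sh
# Hypothetical helper: print the age of a file in seconds.
# Relies on GNU date, where "-r FILE" reports the file's modification time.
file_age() {
    now=$(date +%s)
    mtime=$(date -r "$1" +%s) || return 1
    echo $((now - mtime))
}

# Placeholder file for the demonstration only.
status_file=$(mktemp)
age=$(file_age "$status_file")
if [ "$age" -gt 600 ]; then
    echo "stale: $age seconds old"
else
    echo "fresh: $age seconds old"
fi
rm -f "$status_file"
```

With a cron job writing the file every 5 minutes, a 600-second limit tolerates one missed run before the check reports a stale file.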
| + | |||
| + | ===== Known Bugs and Issues ===== | ||
| + | |||
| + | ===== To Do ===== | ||
| + | |||
| + | ===== Credits ===== | ||
| + | |||
| + | ===== Changelog ===== | ||
| + | |||
| + | * **2012-08-30** | ||
| + | * Initial release | ||