====== Linux Software RAID monitoring ======

^ Author | [[ doctor@makelofine.org | Damien Martins ]] |
^ Compatibility | Xymon 4.2.2/4.2.3 - Kernel Linux 2.2/2.4/2.6 |
^ Requirements | MDADM, unix, shell |
^ Download | Part of https://github.com/doktoil-makresh/xymon-plugins.git |
^ Last Update | 2010-01-15 |

===== Description =====

Linux software RAID monitoring (using MDADM):
  * Status of any RAID device
  * Resync/recovery detection

===== Installation =====

=== Client side ===

Copy bb-mdstat.sh into the Hobbit/Xymon client ext directory (usually HOBBITCLIENTHOME/ext).

Add the following lines to HOBBITCLIENTHOME/etc/clientlaunch.cfg:

  [raid]
          #DISABLED
          ENVFILE $HOBBITCLIENTHOME/etc/hobbitclient.cfg
          CMD $HOBBITCLIENTHOME/ext/bb-mdstat.sh
          LOGFILE $HOBBITCLIENTHOME/logs/bb-mdstat.log
          INTERVAL 5m

=== Server side ===

Add the "raid" tag to the appropriate hosts in HOBBITSERVERHOME/etc/bb-hosts, for example:

  123.234.123.234 toto # raid

===== Source =====

  #!/bin/sh
  ################################ bb-mdstat.sh ###################################
  # This script was based on bb-raid.sh. It worked for me, but I only have a
  # single raid5 array, so your mileage may vary. Apparently the /proc/mdstat
  # file format changed from Linux 2.0 to 2.2 to 2.4; this script works on Linux
  # 2.4.x, and may work on other versions with the appropriate patches.
  #
  # Due to the bb-raid.sh license, this whole script is still one hundred per cent GPL
  #
  # 15/01/2010 - Stuart dot Carmichael at iinet dot net dot au
  # Version 1.3.1
  #       - Different bug fixes, RAID1 and RAID5 support confirmed
  # 05/01/2010 - Damien Martins - doctor hat makelofine d0t org
  # Version 1.3-alpha
  #       - Getting working test for failed (F) device
  # 18/12/2009 - Damien Martins - doctor hat makelofine d0t org
  # Version 1.2-alpha
  #       Major bugfix to find device failures (indicated with (F) in
  #       /proc/mdstat) and identify which device has failed
  #       - Stuart Carmichael - new code block to test RAID1/5 md for
  #         failed/removed devices
  # 1/12/2009 - Stuart dot Carmichael at iinet dot net dot au
  # Version 1.1-alpha
  #       Minor bugfix to rectify red alerts on similarly named md's,
  #       e.g. a server with md1 and md10 would error on md1 (non-unique
  #       grep returned from /proc/mdstat)
  #
  # 10/10/2009 - Damien Martins - doctor hat makelofine d0t org
  # Version 1.0-alpha - Major code rewrite to decrease CPU usage by using fewer
  #                     commands and get a faster result
  #
  # 17/09/2009 - Damien Martins - doctor hat makelofine d0t org
  # Version 0.6.1 - Minor code rewrites to increase debug and correct some bugs
  #
  # 28/08/2009 - Damien Martins - doctor hat makelofine d0t org
  # Version 0.6 - Minor code rewrites in order to ease debug and new features
  #
  # 27/07/2009 - Damien Martins - doctor hat makelofine d0t org
  # Version 0.5 - Support any name of RAID devices. Tested compatibility for
  #               Linux kernel 2.6 and wider resync detection. Higher
  #               compatibility with Xymon.
  #
  # 03/10/2003
  # Version 0.4 - Automatically detect number of raid devices.
  #
  # 25/09/2003
  # Version 0.3 - Set to support more than four raid devices.
  #
  # 16/09/2001
  # Version 0.2 - Significant bug fix for non-green detection. Added resync
  #               detection to change to yellow. Various other minor cosmetic bug
  #               fixes.
  #
  # 16/06/2001
  # Version 0.1 - Initial write, so far it is confirmed to be green when
  #               everything is OK, no other testing has been done !
  #################################################################################

  export BBPROG=bb-mdstat.sh
  TEST="raid"

  unset DEBUG
  # "=" rather than "==" keeps the test valid under a plain /bin/sh
  if [ "$1" = "debug" ] ; then
          DEBUG=1
          BB=echo
          MACHINE=xymon_client
          BBDISP=xymon_server
          BBHOME=/tmp
          BBTMP=$(pwd)
          AWK=/bin/awk
          CAT=/bin/cat
          DATE=/bin/date
          GREP=/bin/grep
          HEAD=/bin/head
          RM=/bin/rm
          TAIL=/bin/tail
          SED=/bin/sed
  fi

  if [ -z "$BBHOME" ] ; then
          echo "BBHOME is not set... exiting"
          exit 1
  fi

  if [ ! -d "$BBTMP" ] ; then
          # GET DEFINITIONS IF NEEDED
          echo "*** LOADING HOBBITCLIENT.CFG ***"
          . "$BBHOME/etc/hobbitclient.cfg"   # INCLUDE STANDARD DEFINITIONS
  fi

  #
  # NOW COLLECT SOME DATA
  #
  # md: syncing RAID array
  # md: updating
  # md: removing former faulty
  # md: active

  # Create a list of MD devices, and for each one, do the following
  for MD_DEVICE in $($GREP ^md /proc/mdstat | $AWK '{print $1}') ; do
          STATUS="success"
          # Erase the temporary files created previously for this device only
          # (a glob such as bb-mdstat_md1* would also match the files of md10)
          $RM -f "$BBTMP/bb-mdstat_$MD_DEVICE" "$BBTMP/bb-mdstat_$MD_DEVICE.out"
          # Create a temporary file holding the status line of the MD device
          $GREP "^${MD_DEVICE} :" /proc/mdstat > "$BBTMP/bb-mdstat_$MD_DEVICE"
          if [ $? -ne 0 ] ; then
                  LINE_COLOR=red
                  TMPLINE="Disk failed"
          fi
          if [ -n "$DEBUG" ] ; then
                  echo "Debug : MD_DEVICE : $MD_DEVICE ; STATUS_LINE : $($CAT "$BBTMP/bb-mdstat_$MD_DEVICE")"
          fi
          # Look for a failed "(F)" member in the status line
          $GREP -q "(F)" "$BBTMP/bb-mdstat_$MD_DEVICE"
          if [ $? -eq 0 ] ; then
                  # Strip the "mdX : active <personality>" prefix, then find which device is in (F) status
                  for DEVICE in $($SED -e "s/${MD_DEVICE} : active raid[0-5] //" \
                                       -e "s/${MD_DEVICE} : active linear//" \
                                       -e "s/${MD_DEVICE} : active multipath//" \
                                       -e "s/${MD_DEVICE} : active faulty//" \
                                       "$BBTMP/bb-mdstat_$MD_DEVICE") ; do
                          echo "$DEVICE" | $GREP -q "(F)"
                          if [ $? -eq 0 ] ; then
                                  # Mark the device as failed so the message below is not
                                  # overwritten by the "active" case further down
                                  STATUS="failed"
                                  LINE_COLOR=red
                                  RED=1
                                  TMPLINE=" Device $DEVICE used for $MD_DEVICE is KO"
                          fi
                  done
          fi
          # The following test is limited to Linux only. Other systems still to be tested (e.g. Solaris)
          # Additional testing added SC 17/12/09
          if [ "$(uname -s)" = "Linux" ] ; then
                  # Test that the metadevice has all expected devices active.
                  # Only check RAID-1 and RAID-5 devices: the test is not valid for RAID-0
                  raid_level="$($GREP "^${MD_DEVICE} :" /proc/mdstat | $AWK '{ print $4 }')"
                  if [ "$raid_level" = "raid1" ] || [ "$raid_level" = "raid5" ] ; then
                          # Last line of the md entry, e.g. "... blocks [2/2] [UU]"
                          data="$($GREP -A 1 "^${MD_DEVICE} :" /proc/mdstat | $TAIL -1)"
                          # Second-to-last field (NF-1) holds the "[expected/active]" device counts
                          device_counts="$(echo "$data" | $AWK '{ print $(NF - 1) }')"
                          num_expected_devices="$(echo "$device_counts" | $SED 's/\[//g' | $SED 's/\]//g' | $AWK -F/ '{ print $1 }')"
                          num_active_devices="$(echo "$device_counts" | $SED 's/\[//g' | $SED 's/\]//g' | $AWK -F/ '{ print $2 }')"
                          if [ "$num_expected_devices" -ne "$num_active_devices" ] ; then
                                  STATUS="failed"
                                  LINE_COLOR=red
                                  RED=1
                                  TMPLINE=" Expected device count does not equal active device count ($device_counts)"
                          fi
                  fi # end raid1/raid5 test
          fi # end if GNU/Linux SC 17/12/09
          if [ "$STATUS" = "failed" ] ; then
                  # Write the result to the temporary output file
                  echo "&$LINE_COLOR $MD_DEVICE $TMPLINE" > "$BBTMP/bb-mdstat_$MD_DEVICE.out"
          else # SC 11/1/10
                  if [ -n "$DEBUG" ] ; then
                          echo "Debug : TMPLINE : $TMPLINE ; LINE_COLOR : $LINE_COLOR ; MD_DEVICE= : $MD_DEVICE"
                  fi
                  # Read the status of the MD device (third field) and, depending on the result, do the following
                  STATUS="$($AWK '{print $3}' "$BBTMP/bb-mdstat_$MD_DEVICE")"
                  case $STATUS in
                          active) LINE_COLOR=green ; GREEN=1 ; TMPLINE="Status OK" ;;
                          failed) LINE_COLOR=red ; RED=1 ; TMPLINE="Status Failed" ;;
                          updating) LINE_COLOR=yellow ; YELLOW=1 ; TMPLINE="Status updating" ;;
                          *) LINE_COLOR=red ; RED=1 ; TMPLINE="Status KO : Status : $STATUS
  $BBTMP/bb-mdstat_$MD_DEVICE : $($CAT $BBTMP/bb-mdstat_$MD_DEVICE)
  /proc/mdstat : $($CAT /proc/mdstat)" ;;
                  esac
                  # Now check for a resync or recovery on this device. Only the matching
                  # keyword is printed, so the result is either empty or a single word.
                  RESYNC=$($GREP -A 3 "^${MD_DEVICE} :" /proc/mdstat | $AWK '$2 == "resync" || $2 == "recovery" { print $2 ; exit }')
                  if [ -n "$RESYNC" ] && [ "$LINE_COLOR" != "red" ] ; then
                          LINE_COLOR=yellow
                          YELLOW=1
                          TMPLINE="Resync in progress"
                  fi
                  echo "&$LINE_COLOR $MD_DEVICE $TMPLINE" >> "$BBTMP/bb-mdstat_$MD_DEVICE.out"
          fi # end of the failed / not failed branches
          if [ -n "$DEBUG" ] ; then
                  echo "Debug : TMPLINE : $TMPLINE ; LINE_COLOR : $LINE_COLOR ; MD_DEVICE= : $MD_DEVICE"
          fi
  done

  # The worst colour seen on any device becomes the colour of the whole test
  if [ -n "$RED" ] ; then
          COLOR=red
  elif [ -n "$YELLOW" ] ; then
          COLOR=yellow
  elif [ -n "$GREEN" ] ; then
          COLOR=green
  else
          COLOR=grey
  fi

  LINE="status $MACHINE.$TEST $COLOR $($DATE)
  "
  for MD_DEVICE in $($GREP ^md /proc/mdstat | $AWK '{print $1}') ; do
          if [ -f "$BBTMP/bb-mdstat_$MD_DEVICE.out" ] ; then
                  LINE="$LINE
  $($CAT "$BBTMP/bb-mdstat_$MD_DEVICE.out")"
          fi
  done
  LINE="$LINE
  ============================ /proc/mdstat ===========================
  $($CAT /proc/mdstat)
  ============================ End of file ============================"

  if [ -z "$DEBUG" ] ; then
          $RM -f "$BBTMP"/bb-mdstat_*
  fi
  $BB $BBDISP "$LINE"   # SEND TO BBDISPLAY

===== Known Bugs and Issues =====

None

===== To Do =====

None

===== Credits =====

Reimplementation of [[http://www.deadcat.net/viewfile.php?fileid=731|deadcat's]] bb-mdstat.sh\\
Several updates/bug fixes by Stuart Carmichael, who tested on more configurations than I did.

===== Changelog =====

  * **2001-06-16 v0.1**
    * Initial release
  * **2001-09-16 v0.2**
    * Significant bug fix for non-green detection.
    * Added resync detection to change to yellow.
    * Various other minor cosmetic bug fixes.
  * **2003-09-25 v0.3**
    * Set to support more than four raid devices.
  * **2003-10-03 v0.4**
    * Automatically detect number of raid devices.
  * **2009-07-27 v0.5**
    * Support any name for RAID devices.
    * Tested compatibility for Linux kernel 2.6 and wider resync detection.
    * Higher compatibility with Xymon.
  * **2009-08-28 v0.6**
    * Minor code rewrites in order to ease debug and new features.
  * **2009-09-17 v0.6.1**
    * Minor code rewrites to increase debug and correct some bugs.
  * **2009-10-11 v1.0alpha**
    * Major code rewrite to decrease CPU usage by using fewer commands and get a faster result.
  * **2009-12-01 v1.1alpha**
    * Minor bugfix to rectify red alerts on similarly named md's, e.g. a server with md1 and md10 would error on md1 (non-unique grep returned from /proc/mdstat).
  * **2010-01-15 v1.3.1**
    * Several bug fixes and new tests. Confirmed on RAID1 and RAID5.
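
===== Example: reading the device counts =====

The RAID1/RAID5 test in the script compares the two numbers in the bracketed counts field of /proc/mdstat (for instance ''[2/2]''). The sketch below shows that comparison in isolation, run against a sample file instead of the live /proc/mdstat. The function name ''md_state'' and the sample content are illustrative assumptions and are not part of bb-mdstat.sh; like the script, it relies on ''grep -A'', which is available in GNU and BSD grep.

```shell
#!/bin/sh
# Hypothetical helper (not part of bb-mdstat.sh): prints "ok" or "degraded"
# for one md device, reading an mdstat-formatted file given as $2.
md_state() {
    md_device=$1
    mdstat_file=$2
    # The line following "mdX : ..." ends with "[expected/active] [UU_]";
    # take the second-to-last field and strip the brackets.
    counts=$(grep -A 1 "^${md_device} :" "$mdstat_file" | tail -1 |
             awk '{ print $(NF - 1) }' | sed -e 's/\[//' -e 's/\]//')
    expected=${counts%/*}   # number before the slash
    active=${counts#*/}     # number after the slash
    if [ "$expected" -ne "$active" ]; then
        echo "degraded"
    else
        echo "ok"
    fi
}

# Illustrative sample only; real content comes from /proc/mdstat.
sample=$(mktemp)
printf '%s\n' \
    'Personalities : [raid1]' \
    'md1 : active raid1 sdb2[1] sda2[0]' \
    '      1048576 blocks [2/2] [UU]' \
    '' \
    'md10 : active raid1 sdd1[1](F) sdc1[0]' \
    '      2097152 blocks [2/1] [U_]' > "$sample"

md_state md1 "$sample"    # prints "ok"
md_state md10 "$sample"   # prints "degraded"
rm -f "$sample"
```

Anchoring the pattern on ''"^md1 :"'' (device name, space, colon) is what keeps md1 from also matching md10, the problem addressed in version 1.1-alpha.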