====== DiskStat ======

^ Author | [[ everett.vernon@gmail.com | Vernon Everett ]] |
^ Compatibility | Tested on Solaris 10 |
^ Requirements | Nothing special |
^ Download | None |
^ Last Update | 2010-09-21 |
===== Description =====
Graphs of iostat output, designed to appear on the trends page.
Really useful for seeing which disks are being hit hard and for getting an idea of where your bottlenecks are.

I called it diskstat, instead of iostat, for two reasons.

1. There was already an iostat graph definition, and I had no idea what it was for.

2. Since it appears in the trends, it really makes no difference what it's called.


By default, it ignores NFS disks, but you can change that with the following in the appropriate section of clientlocal.cfg (or just hack the code):
  DISKSTAT:SHOW_NFS=yes

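On the client, the clientlocal.cfg section arrives in a local logfetch file, and the script simply executes every DISKSTAT: line it finds there, so any variable set at the top of the script can be overridden this way. A minimal sketch of the extraction step (the file name and values here are illustrative; the real file is $BBTMP/logfetch.$(uname -n).cfg):

```shell
# Illustrative clientlocal.cfg section as delivered to the client.
cat > /tmp/logfetch.demo.cfg <<'EOF'
DISKSTAT:SHOW_NFS=yes
DISKSTAT:DURATION=30
EOF

# Strip the "DISKSTAT:" tag to get plain variable assignments;
# the script then executes each resulting line to override its defaults.
grep "^DISKSTAT:" /tmp/logfetch.demo.cfg | cut -d":" -f2
```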
===== Installation =====
=== Client side ===
1. Copy diskstat.ksh to $HOME/client/ext

2. Edit client/etc/clientlaunch.cfg and insert the following text:
  [diskstat]
        ENVFILE $HOBBITCLIENTHOME/etc/hobbitclient.cfg
        CMD $HOBBITCLIENTHOME/ext/diskstat.ksh
        LOGFILE $HOBBITCLIENTHOME/logs/diskstat.log
        INTERVAL 5m

=== Server side ===
3. Add this to TEST2RRD= in hobbitserver.cfg:
  diskstat-reads=ncv,diskstat-writes=ncv,diskstat-kreads=ncv,diskstat-kwrites=ncv,diskstat-wait=ncv,diskstat-actv=ncv,diskstat-svct=ncv,diskstat-wsvc=ncv,diskstat-pw=ncv,diskstat-pb=ncv

4. Add this to GRAPHS= in hobbitserver.cfg:
  diskstat-reads::7,diskstat-writes::7,diskstat-kreads::7,diskstat-kwrites::7,diskstat-wait::7,diskstat-actv::7,diskstat-svct::7,diskstat-wsvc::7,diskstat-pw::7,diskstat-pb::7
  # ::7 indicates the number of lines per graph. (Default is 4.) Flavour to taste.

5. Add this to hobbitserver.cfg:
  SPLITNCV_diskstat-pb="*:GAUGE"
  SPLITNCV_diskstat-reads="*:GAUGE"
  SPLITNCV_diskstat-writes="*:GAUGE"
  SPLITNCV_diskstat-kreads="*:GAUGE"
  SPLITNCV_diskstat-kwrites="*:GAUGE"
  SPLITNCV_diskstat-wait="*:GAUGE"
  SPLITNCV_diskstat-actv="*:GAUGE"
  SPLITNCV_diskstat-wsvc="*:GAUGE"
  SPLITNCV_diskstat-svct="*:GAUGE"
  SPLITNCV_diskstat-pw="*:GAUGE"

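With SPLITNCV, the RRD backend splits each NCV message into one RRD file per variable (here, per device), each holding a single dataset named lambda; that is why step 6 matches files with FNPATTERN and reads them with DEF:...:lambda:AVERAGE. Assuming two local disks (device names here are made up), the files for the reads test would look something like:

  diskstat-reads,c0t0d0.rrd
  diskstat-reads,c0t1d0.rrd
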
6. Add this to hobbitgraph.cfg:
  [diskstat-reads]
    FNPATTERN diskstat-reads,(.*).rrd
    TITLE Disk Reads per Second
    YAXIS Reads
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-writes]
    FNPATTERN diskstat-writes,(.*).rrd
    TITLE Disk Writes per Second
    YAXIS Writes
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-kreads]
    FNPATTERN diskstat-kreads,(.*).rrd
    TITLE Disk Reads per Second in Kb
    YAXIS Kb
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-kwrites]
    FNPATTERN diskstat-kwrites,(.*).rrd
    TITLE Disk Writes per Second in Kb
    YAXIS Kb
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-wait]
    FNPATTERN diskstat-wait,(.*).rrd
    TITLE Average Number of Transactions Waiting
    YAXIS Total
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-actv]
    FNPATTERN diskstat-actv,(.*).rrd
    TITLE Average Number of Transactions Active
    YAXIS Total
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-svct]
    FNPATTERN diskstat-svct,(.*).rrd
    TITLE Average Response Time of Transaction
    YAXIS Milliseconds
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-wsvc]
    FNPATTERN diskstat-wsvc,(.*).rrd
    TITLE Average Wait Queue Time
    YAXIS Milliseconds
    -l 0
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-pw]
    FNPATTERN diskstat-pw,(.*).rrd
    TITLE Percent of Time Waiting
    YAXIS %
    -l 0
    -u 100
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

  [diskstat-pb]
    FNPATTERN diskstat-pb,(.*).rrd
    TITLE Percent of Time Disk Busy
    YAXIS %
    -l 0
    -u 100
    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n


===== Source =====
==== diskstat.ksh ====

<hidden onHidden="Show Code ⇲" onVisible="Hide Code ⇱">
<code>
#!/bin/ksh
TEMPFILE=$BBTMP/diskstat.tmp
SHOW_NFS=no   # Set this to yes in the server-side clientlocal.cfg to change it
              # DISKSTAT:SHOW_NFS=yes
DURATION=10   # The duration of the iostat sample
              # This can be updated in the same way as above

# Now we redefine some variables, if they are set in clientlocal.cfg
LOGFETCH=${BBTMP}/logfetch.$(uname -n).cfg
if [ -f $LOGFETCH ]
then
   grep "^DISKSTAT:" $LOGFETCH | cut -d":" -f2 \
                               | while read NEW_DEF
                                 do
                                    $NEW_DEF
                                 done
fi

> $TEMPFILE  # Make sure it's empty
/usr/bin/iostat -xrn $DURATION 2 > $TEMPFILE.raw  # And collect some data to work with.
# We have to collect 2 sets, because the first set is the average since boot.

# Find where the second set of data starts
LINE=$(grep -n ",device$" $TEMPFILE.raw | tail -1 | cut -d":" -f1)
# Take the second set, and massage it into usable data
awk "NR>$LINE" $TEMPFILE.raw \
    | sed "s/,/ /g" \
    | awk '{ print $NF" "$0 }' \
    | awk '{ $NF="";print }' > $TEMPFILE.data
rm $TEMPFILE.raw
count=1
# Now we format the data and send it off to the server
for subtest in reads writes kreads kwrites wait actv wsvc svct pw pb
do
   ((count=count+1))
   echo "" >> $TEMPFILE
   cut -d" " -f1,$count $TEMPFILE.data \
       | while read DEVICE VAL
         do
            echo "$DEVICE" | grep ":/" > /dev/null
            if [ $? -eq 0 -a "$SHOW_NFS" = "no" ]
            then
               continue   # Skip NFS devices (continue, not break, in case more disks follow)
            else
               DEVICE=$(echo $DEVICE | tr : - )
            fi
            echo "${DEVICE}:${VAL}" >> $TEMPFILE
         done
   echo "" >> $TEMPFILE
   $BB $BBDISP "data $MACHINE.diskstat-${subtest} $(echo; cat $TEMPFILE ;echo "" ;echo "ignore this" )"
   # Without the last echo "ignore this", it seems to not graph the last entry.
   # Odd really, but that seems to fix it.
   rm $TEMPFILE
done
rm $TEMPFILE.data

</code>
</hidden>
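The trickiest part of the script is massaging the comma-separated iostat output so the device name leads each line. A standalone sketch of that pipeline on a fabricated two-disk sample (device names and numbers are made up; a real run strips everything up to the last ",device" header line first, which is what the grep -n does):

```shell
# Fabricated second sample of `iostat -xrn` output.
cat > /tmp/diskstat.demo.raw <<'EOF'
r/s,w/s,kr/s,kw/s,wait,actv,wsvc_t,asvc_t,%w,%b,device
0.2,1.3,1.5,12.1,0.0,0.0,0.1,5.2,0,1,c0t0d0
4.0,0.5,32.0,4.0,0.0,0.1,0.0,8.3,0,3,c0t1d0
EOF

# Same massaging as diskstat.ksh: find the last header line, keep only
# the rows after it, turn commas into spaces, copy the device name
# (last field) to the front, then blank the trailing copy.
LINE=$(grep -n ",device$" /tmp/diskstat.demo.raw | tail -1 | cut -d":" -f1)
awk "NR>$LINE" /tmp/diskstat.demo.raw \
    | sed "s/,/ /g" \
    | awk '{ print $NF" "$0 }' \
    | awk '{ $NF=""; print }'
```

Each output line now starts with the device name, which is why the script can later pair a device with any single statistic column using cut -d" " -f1,$count.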

===== Known Bugs and Issues =====
2010-09-21 - Found and fixed a bug. (Left out the wsvc_t stat.)


All bugs are currently unknown.

If you find any, let me know, and I will see what I can do to fix them.

===== To Do =====
I was toying with the idea of having some of the values appear as alerts, with standard red/yellow/green alert thresholds and all the rest, but I'm not sure it's worth it.

It might be useful to watch the average service time.

However, to be of concern, high iostat figures need to be sustained. Disk usage is expected to peak from time to time, so is it really suitable for alerts?
And even if a peak is sustained, what exactly can you do about it?

Your comments on the back of $100 bills only.

===== Credits =====
This all started because a piece of software crashes on one of my servers every month or so. The application admin blames me (and my server).

I said it's not the server, but after some constructive googling, I found a link which hinted that it might be disk performance.

I decided to monitor disk performance and get some graphs for when it crashes again.

So all credit for this goes to a really poorly written mail server that doesn't do single instancing. (Name of application withheld to protect the guilty.)
===== Changelog =====

  * **2010-09-09**
    * Initial release

  * **2010-09-21**
    * Fairly major bug fix. (Left out the wsvc_t stats.)