monitors:diskstat.ksh

monitors:diskstat.ksh [2010/10/12 03:43] (current)
 +====== DiskStat ======
  
 +^ Author | [[ everett.vernon@gmail.com | Vernon Everett ]] |
 +^ Compatibility | Tested on Solaris 10 |
 +^ Requirements | Nothing special |
 +^ Download | None |
 +^ Last Update | 2010-09-21 |
 +===== Description =====
 +Graphs of iostat output designed to appear on the trends page.
 +Really useful for seeing which disks are being hit hard, and for getting an idea of where your bottlenecks are.
 +
 +I called it diskstat, instead of iostat, for two reasons. 
 +
 +1. There was already an iostat graph definition, and I had no idea what it was for.
 +
 +2. Since it appears in the trends, it really makes no difference what it's called.
 +
 +
 +By default, it ignores NFS disks, but you can change that by adding the following to the appropriate section of clientlocal.cfg (or just hack the code):
 +  DISKSTAT:SHOW_NFS=yes
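
As a rough sketch of how such an override reaches the script: the client receives its clientlocal.cfg section in the logfetch file, and the script picks out the lines that start with DISKSTAT:. The snippet below illustrates that idea in portable shell. The function name apply_diskstat_overrides and the restriction to SHOW_NFS and DURATION are assumptions made for this example, not part of the published script.

```shell
#!/bin/sh
# Defaults, as in the script.
SHOW_NFS=no
DURATION=10

# Hypothetical helper: apply DISKSTAT:NAME=value lines found in the file $1.
# Only the two settings the script documents are accepted.
apply_diskstat_overrides() {
    cfg=$1
    [ -f "$cfg" ] || return 0
    # Feeding the loop from a here-document keeps it in the current shell,
    # so the assignments survive in any POSIX shell, not only in ksh.
    while read -r def
    do
        case $def in
            SHOW_NFS=yes|SHOW_NFS=no) SHOW_NFS=${def#SHOW_NFS=} ;;
            DURATION=|DURATION=*[!0-9]*) ;;      # ignore empty or non-numeric values
            DURATION=*) DURATION=${def#DURATION=} ;;
        esac
    done <<EOF
$(sed -n 's/^DISKSTAT://p' "$cfg")
EOF
}
```

Inside the script, this would be called as apply_diskstat_overrides "$BBTMP/logfetch.$(uname -n).cfg" before iostat is run.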
 +
 +  
 +
 +===== Installation =====
 +=== Client side ===
 +1. Copy diskstat.ksh to $HOBBITCLIENTHOME/ext (typically ~/client/ext) and make sure it is executable.
 +
 +2. Edit the client/etc/clientlaunch.cfg and insert the following text: 
 +  [diskstat]
 +        ENVFILE $HOBBITCLIENTHOME/etc/hobbitclient.cfg
 +        CMD $HOBBITCLIENTHOME/ext/diskstat.ksh
 +        LOGFILE $HOBBITCLIENTHOME/logs/diskstat.log
 +        INTERVAL 5m      
 +
 +=== Server side ===
 +3. Add this to TEST2RRD= in hobbitserver.cfg 
 +  diskstat-reads=ncv,diskstat-writes=ncv,diskstat-kreads=ncv,diskstat-kwrites=ncv,diskstat-wait=ncv,diskstat-actv=ncv,diskstat-svct=ncv,diskstat-wsvc=ncv,diskstat-pw=ncv,diskstat-pb=ncv
 +
 +4. Add this to GRAPHS= in hobbitserver.cfg 
 +  diskstat-reads::7,diskstat-writes::7,diskstat-kreads::7,diskstat-kwrites::7,diskstat-wait::7,diskstat-actv::7,diskstat-svct::7,diskstat-wsvc::7,diskstat-pw::7,diskstat-pb::7
 +  # ::7 indicates the number of lines per graph (default 4). Flavour to taste.
 +
 +5. Add this to hobbitserver.cfg 
 +  SPLITNCV_diskstat-pb="*:GAUGE"
 +  SPLITNCV_diskstat-reads="*:GAUGE"
 +  SPLITNCV_diskstat-writes="*:GAUGE"
 +  SPLITNCV_diskstat-kreads="*:GAUGE"
 +  SPLITNCV_diskstat-kwrites="*:GAUGE"
 +  SPLITNCV_diskstat-wait="*:GAUGE"
 +  SPLITNCV_diskstat-actv="*:GAUGE"
 +  SPLITNCV_diskstat-wsvc="*:GAUGE"
 +  SPLITNCV_diskstat-svct="*:GAUGE"
 +  SPLITNCV_diskstat-pw="*:GAUGE"
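
Because the same ten test names appear in steps 3, 4 and 5, the three settings can be generated from one list instead of being typed by hand. The following is a small sketch in portable shell. The function name diskstat_settings is made up for this example, and the output still has to be merged into hobbitserver.cfg manually.

```shell
#!/bin/sh
# Subtests, in the order the script sends them.
SUBTESTS="reads writes kreads kwrites wait actv wsvc svct pw pb"

# Print the TEST2RRD fragment, the GRAPHS fragment and the SPLITNCV lines.
# $1 = number of lines per graph (defaults to 7, as used above).
diskstat_settings() {
    lines=${1:-7}
    test2rrd=
    graphs=
    for s in $SUBTESTS
    do
        # ${var:+$var,} adds a comma separator only after the first entry.
        test2rrd="${test2rrd:+$test2rrd,}diskstat-$s=ncv"
        graphs="${graphs:+$graphs,}diskstat-$s::$lines"
    done
    echo "TEST2RRD: $test2rrd"
    echo "GRAPHS:   $graphs"
    for s in $SUBTESTS
    do
        echo "SPLITNCV_diskstat-$s=\"*:GAUGE\""
    done
}

diskstat_settings 7
```

Keeping the list in one place also means a new subtest only has to be added once.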
 +
 +6. Add this to hobbitgraph.cfg
 +  [diskstat-reads]
 +    FNPATTERN diskstat-reads,(.*).rrd
 +    TITLE Disk Reads per Second
 +    YAXIS Reads
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-writes]
 +    FNPATTERN diskstat-writes,(.*).rrd
 +    TITLE Disk Writes per Second
 +    YAXIS Writes
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-kreads]
 +    FNPATTERN diskstat-kreads,(.*).rrd
 +    TITLE Disk Reads per Second in KB
 +    YAXIS KB
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-kwrites]
 +    FNPATTERN diskstat-kwrites,(.*).rrd
 +    TITLE Disk Writes per Second in KB
 +    YAXIS KB
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-wait]
 +    FNPATTERN diskstat-wait,(.*).rrd
 +    TITLE Average Number of Transactions Waiting
 +    YAXIS Total
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-actv]
 +    FNPATTERN diskstat-actv,(.*).rrd
 +    TITLE Average Number of Transactions Active
 +    YAXIS Total
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-svct]
 +    FNPATTERN diskstat-svct,(.*).rrd
 +    TITLE Average Response Time of Transaction
 +    YAXIS Milliseconds
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-wsvc]
 +    FNPATTERN diskstat-wsvc,(.*).rrd
 +    TITLE Average Service Time in Wait Queue
 +    YAXIS Milliseconds
 +    -l 0
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-pw]
 +    FNPATTERN diskstat-pw,(.*).rrd
 +    TITLE Percent of Time Waiting
 +    YAXIS %
 +    -l 0
 +    -u 100
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
 +  
 +  [diskstat-pb]
 +    FNPATTERN diskstat-pb,(.*).rrd
 +    TITLE Percent of Time Disk Busy
 +    YAXIS %
 +    -l 0
 +    -u 100
 +    DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
 +    LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
 +    GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
 +    GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
 +    GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
 +    GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
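
The ten definitions above differ only in name, title, axis label and the optional upper limit, so they could also be generated rather than copied and pasted. This is a sketch under that assumption, in portable shell. graph_stanza is a hypothetical helper, and its output should be compared against the hand-written definitions before use.

```shell
#!/bin/sh
# Print one hobbitgraph.cfg definition.
# $1 = subtest, $2 = title, $3 = y-axis label, $4 = optional upper limit
graph_stanza() {
    printf '[diskstat-%s]\n' "$1"
    printf '  FNPATTERN diskstat-%s,(.*).rrd\n' "$1"
    printf '  TITLE %s\n' "$2"
    printf '  YAXIS %s\n' "$3"
    printf '  -l 0\n'
    if [ -n "${4:-}" ]
    then
        printf '  -u %s\n' "$4"
    fi
    # The quoted delimiter stops the shell from touching \: and %5.1lf.
    cat <<'EOF'
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

EOF
}

graph_stanza reads "Disk Reads per Second" Reads
graph_stanza pb "Percent of Time Disk Busy" % 100
```

Since the subtest name is passed only once per call, the section name and the FNPATTERN cannot drift apart.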
 +
 +
 +
 +===== Source =====
 +==== diskstat.ksh ====
 +
 +<hidden onHidden="Show Code ⇲" onVisible="Hide Code ⇱">
 +<code>
 +#!/bin/ksh
 +TEMPFILE=$BBTMP/diskstat.tmp
 +SHOW_NFS=no   # Set this to yes on server side clientlocal.cfg to change it
 +              # DISKSTAT:SHOW_NFS=yes
 +DURATION=10   # The duration of the iostat sample
 +              # This can be updated in the same way as above
 +
 +# Now we redefine some variables, if they are set in clientlocal
 +LOGFETCH=${BBTMP}/logfetch.$(uname -n).cfg
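 +if [ -f "$LOGFETCH" ]
 +    then
 +       # A bare $NEW_DEF would be run as a command, not as an assignment,
 +       # so the known settings are assigned explicitly.
 +       grep "^DISKSTAT:" "$LOGFETCH" | cut -d":" -f2 \
 +                                   | while read NEW_DEF
 +                                     do
 +                                        case $NEW_DEF in
 +                                           SHOW_NFS=*) SHOW_NFS=${NEW_DEF#SHOW_NFS=} ;;
 +                                           DURATION=*) DURATION=${NEW_DEF#DURATION=} ;;
 +                                        esac
 +                                     done
 +fi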
 +
 +> $TEMPFILE  # Make sure it's empty
 +/usr/bin/iostat -xrn $DURATION 2 > $TEMPFILE.raw  # And collect some data to work with.
 +# We have to collect 2 sets, because the first set is the average since boot.
 +
 +# Define where the second set of data starts
 +LINE=$(grep -n ",device$" $TEMPFILE.raw | tail -1 | cut -d":" -f1)
 +# Take the second set, and massage it into usable data:
 +# commas become spaces, and the device name moves from the last column to the first
 +awk "NR>$LINE" $TEMPFILE.raw \
 +                  | sed "s/,/ /g" \
 +                  | awk '{ print $NF" "$0 }' \
 +                  | awk '{ $NF="";print }' > $TEMPFILE.data
 +rm $TEMPFILE.raw
 +count=1
 +# Now we format the data and send it off to the server
 +for subtest in reads writes kreads kwrites wait actv wsvc svct pw pb
 +do
 +   ((count=count+1))
 +   echo "" >> $TEMPFILE
 +   cat $TEMPFILE.data | cut -d" " -f1,$count \
 +                      | while read DEVICE VAL
 +                        do
 +                           echo "$DEVICE" | grep ":/" > /dev/null
 +                           if [ $? -eq 0 -a "$SHOW_NFS" = "no" ]
 +                           then
 +                              continue   # Skip NFS mounts, but keep going with the remaining devices
 +                           else
 +                              DEVICE=$(echo $DEVICE | tr : - )
 +                           fi
 +                           echo "${DEVICE}:${VAL}" >> $TEMPFILE
 +                        done
 +                        echo "" >> $TEMPFILE
 +                        $BB $BBDISP "data $MACHINE.diskstat-${subtest} $(echo; cat $TEMPFILE ;echo "" ;echo "ignore this" )"
 +                        # Without the last echo "ignore this", it seems to not graph the last entry.
 +                        # Odd really, but that seems to fix it.
 +                        rm $TEMPFILE
 +done
 +rm $TEMPFILE.data
 +
 +</code>
 +</hidden>
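
To make the data handling easier to follow, here is a sketch of the same transformation applied to a canned sample. The iostat figures are invented for illustration, the function names last_report and ncv_lines are hypothetical, and the code is a portable-shell rewrite of the idea rather than a copy of the script. On Solaris 10 it would likely need nawk or /usr/xpg4/bin/awk, since the legacy /usr/bin/awk does not accept -v.

```shell
#!/bin/sh
# Read `iostat -xrn <interval> 2` output on stdin, keep only the last report,
# and print one line per device with the device name moved to the front:
#   device r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b
last_report() {
    awk -F, '
        /,device$/ { n = 0; next }     # each header line starts a new report
        NF == 11   { row[++n] = $11; for (i = 1; i <= 10; i++) row[n] = row[n] " " $i }
        END        { for (i = 1; i <= n; i++) print row[i] }
    '
}

# Turn one column of last_report output into "device:value" lines.
# $1 = column number (2..11), $2 = "yes" to include NFS mounts (default no)
ncv_lines() {
    awk -v col="$1" -v nfs="${2:-no}" '
        index($1, ":/") && nfs != "yes" { next }   # NFS names contain ":/"
        { gsub(/:/, "-", $1); print $1 ":" $col }  # ":" would confuse the NCV parser
    '
}

# Made-up sample: the first report is the since-boot average and is discarded.
sample='                    extended device statistics
r/s,w/s,kr/s,kw/s,wait,actv,wsvc_t,asvc_t,%w,%b,device
1.0,2.0,8.0,16.0,0.0,0.0,0.0,5.0,0,1,c0t0d0
                    extended device statistics
r/s,w/s,kr/s,kw/s,wait,actv,wsvc_t,asvc_t,%w,%b,device
3.5,0.5,28.0,4.0,0.0,0.1,0.0,7.2,0,2,c0t0d0
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,nfshost:/export/home
12.0,4.0,96.0,32.0,0.2,0.4,1.5,9.8,3,11,c0t1d0'

# Column 2 is r/s, which is what the "reads" subtest sends.
printf '%s\n' "$sample" | last_report | ncv_lines 2
```

With the sample above, the NFS mount is skipped and both local disks are reported, including the one listed after the NFS entry.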
 +
 +===== Known Bugs and Issues =====
 +2010-09-21 - Found and fixed a bug. (Left out the wsvc_t stat.)
 +
 +
 +All bugs are currently unknown.
 +
 +If you find any, let me know, and I will see what I can do to fix them.
 +
 +===== To Do =====
 +I was toying with the idea of having some of the values appear as alerts, with standard red/yellow/green alert thresholds and all the rest, but I'm not sure it is worth it.
 +
 +It might be useful to watch the average service time.
 +
 +However, to be of concern, high iostat figures need to be sustained. Disk usage is expected to peak from time to time, so is it really suitable for alerts?
 +And even if the high load is sustained, what exactly can you do about it?
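
One way to approach the "sustained" concern, sketched purely as an idea and not something the script does: only raise a colour when every one of the last few samples exceeds a threshold. The function name, the thresholds and the sample values below are all invented for the example.

```shell
#!/bin/sh
# Hypothetical: print green, yellow or red for a series of service-time samples (ms).
# A colour is raised only if ALL samples exceed the corresponding threshold,
# which is the same as checking the smallest sample.
# $1 = yellow threshold, $2 = red threshold, remaining arguments = samples
sustained_colour() {
    yellow=$1
    red=$2
    shift 2
    if [ $# -eq 0 ]
    then
        echo green          # no samples, nothing to complain about
        return 0
    fi
    printf '%s\n' "$@" | awk -v y="$yellow" -v r="$red" '
        NR == 1 || $1 + 0 < min { min = $1 + 0 }
        END {
            if (min > r)       print "red"
            else if (min > y)  print "yellow"
            else               print "green"
        }'
}

sustained_colour 20 50 25 30 22   # every sample is above 20, none of them all above 50
sustained_colour 20 50 25 90 5    # a single spike to 90 is ignored
```

A single peak therefore never changes the colour. Whether that makes the result worth alerting on is still the open question.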
 +
 +Your comments on the back of $100 bills only.
 +
 +===== Credits =====
 +This all started because a piece of software is crashing on one of my servers every month or so. The application admin is blaming me (and my server).
 +
 +I said it's not the server, but after some constructive googling, I found a link which hinted that it might be disk performance.
 +
 +I decided to monitor disk performance, and get some graphs for when it crashes again.
 +
 +So all credit for this goes to a really poorly written mail server that doesn't do single instancing. (Name of application withheld to protect the guilty.)
 +===== Changelog =====
 +
 +  * **2010-09-09**
 +    * Initial release
 +
 +  * **2010-09-21**
 +    * Fairly major bug fix. (Left out the wsvc_t stats)
  • Last modified: 2010/10/12 03:43