====== DiskStat ======

^ Author | [[ everett.vernon@gmail.com | Vernon Everett ]] |
^ Compatibility | Tested on Solaris 10 |
^ Requirements | Nothing special |
^ Download | None |
^ Last Update | 2010-09-21 |

===== Description =====

Graphs of iostat output, designed to appear on the trends page. Really useful for seeing which disks are being hit hard and for getting an idea of where your bottlenecks are.

I called it diskstat, instead of iostat, for two reasons:

1. There was already an iostat graph definition, and I had no idea what it was for.

2. Since it appears in the trends, it really makes no difference what it's called.

By default it ignores NFS disks, but you can change that with the following line in the appropriate section of clientlocal.cfg (or just hack the code):

<code>
DISKSTAT:SHOW_NFS=yes
</code>

===== Installation =====

=== Client side ===

1. Copy diskstat.ksh to ~$HOME/client/ext

2. Edit client/etc/clientlaunch.cfg and insert the following text:

<code>
[diskstat]
        ENVFILE $HOBBITCLIENTHOME/etc/hobbitclient.cfg
        CMD $HOBBITCLIENTHOME/ext/diskstat.ksh
        LOGFILE $HOBBITCLIENTHOME/logs/diskstat.log
        INTERVAL 5m
</code>

=== Server side ===

3. Add this to TEST2RRD= in hobbitserver.cfg:

<code>
diskstat-reads=ncv,diskstat-writes=ncv,diskstat-kreads=ncv,diskstat-kwrites=ncv,diskstat-wait=ncv,diskstat-actv=ncv,diskstat-svct=ncv,diskstat-wsvc=ncv,diskstat-pw=ncv,diskstat-pb=ncv
</code>

4. Add this to GRAPHS= in hobbitserver.cfg:

<code>
diskstat-reads::7,diskstat-writes::7,diskstat-kreads::7,diskstat-kwrites::7,diskstat-wait::7,diskstat-actv::7,diskstat-svct::7,diskstat-wsvc::7,diskstat-pw::7,diskstat-pb::7
</code>

The ''::7'' indicates the number of lines per graph (the default is 4). Flavour to taste.

5. Add this to hobbitserver.cfg:

<code>
SPLITNCV_diskstat-reads="*:GAUGE"
SPLITNCV_diskstat-writes="*:GAUGE"
SPLITNCV_diskstat-kreads="*:GAUGE"
SPLITNCV_diskstat-kwrites="*:GAUGE"
SPLITNCV_diskstat-wait="*:GAUGE"
SPLITNCV_diskstat-actv="*:GAUGE"
SPLITNCV_diskstat-wsvc="*:GAUGE"
SPLITNCV_diskstat-svct="*:GAUGE"
SPLITNCV_diskstat-pw="*:GAUGE"
SPLITNCV_diskstat-pb="*:GAUGE"
</code>
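A quick note on what the SPLITNCV_ settings in step 5 do: they make hobbitd_rrd split each diskstat data message into one RRD file per device, each holding a single dataset named ''lambda''. As a rough illustration (the device names here are invented), a diskstat-reads report containing

<code>
sd0:12.3
sd1:0.5
</code>

ends up as ''diskstat-reads,sd0.rrd'' and ''diskstat-reads,sd1.rrd'' in that host's RRD directory on the Hobbit server. This is exactly what the FNPATTERN lines in the next step match on, and why their DEF lines read the ''lambda'' dataset.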
6. Add this to hobbitgraph.cfg:

<code>
[diskstat-reads]
        FNPATTERN diskstat-reads,(.*).rrd
        TITLE Disk Reads per Second
        YAXIS Reads
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-writes]
        FNPATTERN diskstat-writes,(.*).rrd
        TITLE Disk Writes per Second
        YAXIS Writes
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-kreads]
        FNPATTERN diskstat-kreads,(.*).rrd
        TITLE Disk Reads per Second in Kb
        YAXIS Kb
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-kwrites]
        FNPATTERN diskstat-kwrites,(.*).rrd
        TITLE Disk Writes per Second in Kb
        YAXIS Kb
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-wait]
        FNPATTERN diskstat-wait,(.*).rrd
        TITLE Average Number of Transactions Waiting
        YAXIS Total
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-actv]
        FNPATTERN diskstat-actv,(.*).rrd
        TITLE Average Number of Transactions Active
        YAXIS Total
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-svct]
        FNPATTERN diskstat-svct,(.*).rrd
        TITLE Average Response Time of Transaction
        YAXIS Milliseconds
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-wsvc]
        FNPATTERN diskstat-wsvc,(.*).rrd
        TITLE Average Time Spent in Wait Queue
        YAXIS Milliseconds
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-pw]
        FNPATTERN diskstat-pw,(.*).rrd
        TITLE Percent of Time Waiting
        YAXIS %
        -l 0
        -u 100
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-pb]
        FNPATTERN diskstat-pb,(.*).rrd
        TITLE Percent of Time Disk Busy
        YAXIS %
        -l 0
        -u 100
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
</code>
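The script in the Source section below works on the output of ''/usr/bin/iostat -xrn'', taking two samples and discarding the first (which is only the average since boot). For orientation, the second sample looks roughly like this — the numbers and device names are invented, and the exact spacing may differ on your system:

<code>
    r/s,    w/s,   kr/s,   kw/s,  wait,  actv,  wsvc_t,  asvc_t,  %w,  %b, device
    0.4,    2.1,    3.2,   16.8,   0.0,   0.0,     0.1,     4.9,   0,   1, sd0
    0.0,    0.0,    0.0,    0.0,   0.0,   0.0,     0.0,     0.0,   0,   0, nfssrv:/export/home
</code>

The header line ending in ''device'' is what the script's ''grep -n ",device$"'' uses to find where the second sample starts; the device name in the last column is moved to the front before the per-column cut, and any entry containing '':/'' is treated as an NFS mount and skipped unless SHOW_NFS=yes.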
===== Source =====

==== diskstat.ksh ====

<code>
#!/bin/ksh

TEMPFILE=$BBTMP/diskstat.tmp

SHOW_NFS=no     # Set this to yes in the server-side clientlocal.cfg to change it:
                # DISKSTAT:SHOW_NFS=yes
DURATION=10     # The duration of the iostat sample.
                # This can be updated in the same way as above.

# Now we redefine some variables, if they are set in clientlocal.cfg
LOGFETCH=${BBTMP}/logfetch.$(uname -n).cfg
if [ -f $LOGFETCH ]
then
   grep "^DISKSTAT:" $LOGFETCH | cut -d":" -f2 \
   | while read NEW_DEF
     do
        eval $NEW_DEF   # eval so the assignment takes effect in this shell
     done
fi

> $TEMPFILE     # Make sure it's empty

# Collect some data to work with.
# We have to collect 2 sets, because the first set is the average since boot.
/usr/bin/iostat -xrn $DURATION 2 > $TEMPFILE.raw

# Define where the second set of data starts
LINE=$(cat $TEMPFILE.raw | grep -n ",device$" | tail -1 | cut -d":" -f1)

# Take the second set, and massage it into usable data:
# drop the commas, move the device name to the front, and remove it from the end.
cat $TEMPFILE.raw | awk "NR>$LINE" \
   | sed "s/,/ /g" \
   | awk '{ print $NF" "$0 }' \
   | awk '{ $NF="";print }' > $TEMPFILE.data
rm $TEMPFILE.raw

count=1
# Now we format the data and send it off to the server
for subtest in reads writes kreads kwrites wait actv wsvc svct pw pb
do
   ((count=count+1))
   echo "" >> $TEMPFILE
   cat $TEMPFILE.data | cut -d" " -f1,$count \
   | while read DEVICE VAL
     do
        echo "$DEVICE" | grep ":/" > /dev/null
        if [ $? -eq 0 -a "$SHOW_NFS" = "no" ]
        then
           continue     # Skip NFS mounts (a "break" here would also drop any devices after them)
        else
           DEVICE=$(echo $DEVICE | tr : - )
        fi
        echo "${DEVICE}:${VAL}" >> $TEMPFILE
     done
   echo "" >> $TEMPFILE
   $BB $BBDISP "data $MACHINE.diskstat-${subtest} $(echo; cat $TEMPFILE ;echo "" ;echo "ignore this" )"
   # Without the last echo "ignore this", it seems to not graph the last entry.
   # Odd really, but that seems to fix it.
   rm $TEMPFILE
done
rm $TEMPFILE.data
</code>

===== Known Bugs and Issues =====

2010-09-21 - Found and fixed a bug. (Left out the wsvc_t stat.)

All other bugs are currently unknown. If you find any, let me know, and I will see what I can do to fix them.

===== To Do =====

I was toying with the idea of having some of the values appear as alerts, with standard red/yellow/green alert thresholds and all the rest, but I'm not sure why. It might be useful to watch the average service time. However, to be of concern, high iostat figures need to be sustained; disk usage is expected to peak from time to time, so is it really suitable for alerts? And even if it does stay high for a sustained period, what exactly can you do about it? Your comments on the back of $100 bills only.

===== Credits =====

This all started because a piece of software was crashing on one of my servers every month or so. The application admin blamed me (and my server). I said it wasn't the server, but after some constructive googling I found a link hinting that it might be disk performance, so I decided to monitor disk performance and have some graphs ready for when it crashes again. So all credit for this goes to a really poorly written mail server that doesn't do single instancing. (Name of application withheld to protect the guilty.)

===== Changelog =====

  * **2010-09-09**
    * Initial release
  * **2010-09-21**
    * Fairly major bug fix. (Left out the wsvc_t stats)