DiskStat

Author: Vernon Everett
Compatibility: Tested on Solaris 10
Requirements: Nothing special
Download: None
Last Update: 2010-09-21

Graphs of iostat output, designed to appear on the trends page. Really useful for seeing which disks are being hit hard and for getting an idea of where your bottlenecks are.

I called it diskstat, instead of iostat, for two reasons.

1. There was already an iostat graph definition, and I had no idea what it was for.

2. Since it appears in the trends, it really makes no difference what it's called.

By default, it ignores NFS disks, but you can change that by putting the following in the appropriate section of clientlocal.cfg (or just hack the code):

DISKSTAT:SHOW_NFS=yes
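
For example, you could put it in whatever section of clientlocal.cfg matches your Solaris clients (the section name below is just an example; use whatever host or OS section you already have). The script picks the line up from the logfetch file the client saves. DURATION, the length of the iostat sample, can be overridden the same way:

[sunos]
DISKSTAT:SHOW_NFS=yes
DISKSTAT:DURATION=30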

Client side

1. Copy diskstat.ksh to $HOBBITCLIENTHOME/ext (the client's ext directory)

2. Edit the client/etc/clientlaunch.cfg and insert the following text:

[diskstat]
      ENVFILE $HOBBITCLIENTHOME/etc/hobbitclient.cfg
      CMD $HOBBITCLIENTHOME/ext/diskstat.ksh
      LOGFILE $HOBBITCLIENTHOME/logs/diskstat.log
      INTERVAL 5m      

Server side

3. Add this to TEST2RRD= in hobbitserver.cfg

diskstat-reads=ncv,diskstat-writes=ncv,diskstat-kreads=ncv,diskstat-kwrites=ncv,diskstat-wait=ncv,diskstat-actv=ncv,diskstat-svct=ncv,diskstat-wsvc=ncv,diskstat-pw=ncv,diskstat-pb=ncv

4. Add this to GRAPHS= in hobbitserver.cfg

diskstat-reads::7,diskstat-writes::7,diskstat-kreads::7,diskstat-kwrites::7,diskstat-wait::7,diskstat-actv::7,diskstat-svct::7,diskstat-wsvc::7,diskstat-pw::7,diskstat-pb::7
# ::7 indicates the number of lines per graph (default 4). Flavour to taste.

5. Add this to hobbitserver.cfg

SPLITNCV_diskstat-pb="*:GAUGE"
SPLITNCV_diskstat-reads="*:GAUGE"
SPLITNCV_diskstat-writes="*:GAUGE"
SPLITNCV_diskstat-kreads="*:GAUGE"
SPLITNCV_diskstat-kwrites="*:GAUGE"
SPLITNCV_diskstat-wait="*:GAUGE"
SPLITNCV_diskstat-actv="*:GAUGE"
SPLITNCV_diskstat-wsvc="*:GAUGE"
SPLITNCV_diskstat-svct="*:GAUGE"
SPLITNCV_diskstat-pw="*:GAUGE"
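
For reference, SPLITNCV tells the NCV module to split each NAME:VALUE line of a data message into its own RRD file, and "*:GAUGE" makes every dataset a GAUGE. So a data message like this (device names are just examples):

data myhost.diskstat-reads

sd0:12.3
sd1:0.4

ends up as diskstat-reads,sd0.rrd and diskstat-reads,sd1.rrd, which is what the FNPATTERN lines in hobbitgraph.cfg match on.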

6. Add this to hobbitgraph.cfg

[diskstat-reads]
  FNPATTERN diskstat-reads,(.*).rrd
  TITLE Disk Reads per Second
  YAXIS Reads
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-writes]
  FNPATTERN diskstat-writes,(.*).rrd
  TITLE Disk Writes per Second
  YAXIS Writes
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-kreads]
  FNPATTERN diskstat-kreads,(.*).rrd
  TITLE Disk Reads per Second in Kb
  YAXIS Kb
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-kwrites]
  FNPATTERN diskstat-kwrites,(.*).rrd
  TITLE Disk Writes per Second in Kb
  YAXIS Kb
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-wait]
  FNPATTERN diskstat-wait,(.*).rrd
  TITLE Average Number of Transactions Waiting
  YAXIS Total
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-actv]
  FNPATTERN diskstat-actv,(.*).rrd
  TITLE Average Number of Transactions Active
  YAXIS Total
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-svct]
  FNPATTERN diskstat-svct,(.*).rrd
  TITLE Average Response Time of Transaction
  YAXIS Milliseconds
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-wsvc]
  FNPATTERN diskstat-wsvc,(.*).rrd
  TITLE Average Wait Queue Time of Transaction
  YAXIS Milliseconds
  -l 0
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-pw]
  FNPATTERN diskstat-pw,(.*).rrd
  TITLE Percent of Time Waiting
  YAXIS %
  -l 0
  -u 100
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-pb]
  FNPATTERN diskstat-pb,(.*).rrd
  TITLE Percent of Time Disk Busy
  YAXIS %
  -l 0
  -u 100
  DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
  LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
  GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
  GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
  GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
  GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

diskstat.ksh

#!/bin/ksh
TEMPFILE=$BBTMP/diskstat.tmp
SHOW_NFS=no   # Set this to yes on server side clientlocal.cfg to change it
              # DISKSTAT:SHOW_NFS=yes
DURATION=10   # The duration of the iostat sample
              # This can be updated in the same way as above

# Now we redefine some variables, if they are set in clientlocal
LOGFETCH=${BBTMP}/logfetch.$(uname -n).cfg
if [ -f $LOGFETCH ]
    then
       grep "^DISKSTAT:" $LOGFETCH | cut -d":" -f2 \
                                   | while read NEW_DEF
                                     do
                                        # eval is needed here: a "VAR=value" word produced
                                        # by expansion is not treated as an assignment
                                        eval "$NEW_DEF"
                                     done
fi

> $TEMPFILE  # Make sure it's empty
/usr/bin/iostat -xrn $DURATION 2 > $TEMPFILE.raw  # And collect some data to work with.
# We have to collect 2 sets, because the first set is the average since boot.

# Define where the second set of data starts
LINE=$(cat $TEMPFILE.raw | grep -n ",device$" | tail -1 | cut -d":" -f1)
# take the second set, and massage it into usable data
cat $TEMPFILE.raw | awk "NR>$LINE" \
                  | sed "s/,/ /g" \
                  | awk '{ print $NF" "$0 }' \
                  | awk '{ $NF="";print }' > $TEMPFILE.data
rm $TEMPFILE.raw
count=1
# Now we format the data and send it off to the server
for subtest in reads writes kreads kwrites wait actv wsvc svct pw pb
do
   ((count=count+1))
   echo "" >> $TEMPFILE
   cat $TEMPFILE.data | cut -d" " -f1,$count \
                      | while read DEVICE VAL
                        do
                           echo "$DEVICE" | grep ":/" > /dev/null
                           if [ $? -eq 0 -a "$SHOW_NFS" = "no" ]
                           then
                              continue   # skip this NFS mount, but keep processing other devices
                           else
                              DEVICE=$(echo $DEVICE | tr : - )
                           fi
                           echo "${DEVICE}:${VAL}" >> $TEMPFILE
                        done
   echo "" >> $TEMPFILE
   $BB $BBDISP "data $MACHINE.diskstat-${subtest} $(echo; cat $TEMPFILE ;echo "" ;echo "ignore this" )"
   # Without the last echo "ignore this", it seems to not graph the last entry.
   # Odd really, but that seems to fix it.
   rm $TEMPFILE
done
rm $TEMPFILE.data
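
If you want to sanity-check the data-massaging pipeline without a Solaris box handy, you can feed it canned iostat -xrn output. The figures and device name below are invented:

```shell
# Canned two-sample "iostat -xrn" output (comma-separated, as -r produces).
cat > /tmp/diskstat.raw <<'EOF'
extended device statistics
r/s,w/s,kr/s,kw/s,wait,actv,wsvc_t,asvc_t,%w,%b,device
0.1,0.2,1.6,3.2,0.0,0.0,0.1,0.2,0,0,sd0
extended device statistics
r/s,w/s,kr/s,kw/s,wait,actv,wsvc_t,asvc_t,%w,%b,device
3.0,4.0,24.0,32.0,0.0,0.1,0.2,0.3,1,2,sd0
EOF

# Find where the second (interval) sample starts -- the first sample is
# the average since boot and gets thrown away.
LINE=$(grep -n ",device$" /tmp/diskstat.raw | tail -1 | cut -d":" -f1)

# Split on commas and move the device name from the last field to the front.
awk "NR>$LINE" /tmp/diskstat.raw \
    | sed "s/,/ /g" \
    | awk '{ print $NF" "$0 }' \
    | awk '{ $NF="";print }' > /tmp/diskstat.data

cat /tmp/diskstat.data
```

With the device name leading each line, `cut -d" " -f1,$count` can then pair it with any single statistic, e.g. `sd0 3.0` for reads.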

2010-09-21 - Found and fixed a bug. (Left out the wsvc_t stat.)

All bugs are currently unknown.

If you find any, let me know, and I will see what I can do to fix them.

I was toying with the idea of having some of the values appear as alerts, with standard red/yellow/green alert thresholds and all the rest, but I'm not sure it's worth it.

Might be useful to watch the average service time?

However, to be of concern, high iostat figures need to be sustained. Disk usage is expected to peak from time to time, so is it really suitable for alerts? And even if it does peak, sustained, what exactly can you do about it?
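
For what it's worth, if anyone does want to experiment, a minimal colour-picking sketch might look like the following. The thresholds, file name and sample values are all made up, and a real check would want the condition sustained over several runs before alerting:

```shell
# Sample per-device average service times (DEVICE:VALUE), figures invented.
cat > /tmp/diskstat.svct <<'EOF'
sd0:12.4
sd1:87.9
EOF

YELLOW=50   # hypothetical warning threshold, milliseconds
RED=200     # hypothetical critical threshold, milliseconds

COLOR=green
while IFS=: read DEVICE VAL
do
   # awk does the floating-point comparisons the shell can't.
   if awk "BEGIN { exit !($VAL >= $RED) }"; then
      COLOR=red
   elif [ "$COLOR" = green ] && awk "BEGIN { exit !($VAL >= $YELLOW) }"; then
      COLOR=yellow
   fi
done < /tmp/diskstat.svct

echo $COLOR    # yellow for the sample data above
# A real version would send a status instead of echoing, something like:
#    $BB $BBDISP "status $MACHINE.diskstat $COLOR ..."
```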

Your comments on the back of $100 bills only.

This all started because a piece of software was crashing on one of my servers every month or so. The application admin was blaming me (and my server).

I said it's not the server, but after some constructive googling, I found a link which hinted that it might be disk performance.

I decided to monitor disk performance, and get some graphs for when it crashes again.

So all credit for this goes to a really poorly written mail server that doesn't do single instancing. (Name of application withheld to protect the guilty.)

  • 2010-09-09
    • Initial release
  • 2010-09-21
    • Fairly major bug fix. (Left out the wsvc_t stats)