====== DiskStat ======

^ Author | [[ everett.vernon@gmail.com | Vernon Everett ]] |
^ Compatibility | Tested on Solaris 10 |
^ Requirements | Nothing special |
^ Download | None |
^ Last Update | 2010-09-21 |

===== Description =====

Graphs of iostat output, designed to appear on the trends page. Really useful for seeing which disks are being hit hard and for getting an idea of where your bottlenecks are.

I called it diskstat, instead of iostat, for two reasons:

1. There was already an iostat graph definition, and I had no idea what it was for.

2. Since it appears in the trends, it really makes no difference what it's called.

By default it ignores NFS disks, but you can change that with the following line in the appropriate section of clientlocal.cfg (or just hack the code):

<code>
DISKSTAT:SHOW_NFS=yes
</code>

===== Installation =====

=== Client side ===

1. Copy diskstat.ksh to ~$HOME/client/ext

2. Edit client/etc/clientlaunch.cfg and insert the following text:

<code>
[diskstat]
        ENVFILE $HOBBITCLIENTHOME/etc/hobbitclient.cfg
        CMD $HOBBITCLIENTHOME/ext/diskstat.ksh
        LOGFILE $HOBBITCLIENTHOME/logs/diskstat.log
        INTERVAL 5m
</code>

=== Server side ===

3. Add this to TEST2RRD= in hobbitserver.cfg:

<code>
diskstat-reads=ncv,diskstat-writes=ncv,diskstat-kreads=ncv,diskstat-kwrites=ncv,diskstat-wait=ncv,diskstat-actv=ncv,diskstat-svct=ncv,diskstat-wsvc=ncv,diskstat-pw=ncv,diskstat-pb=ncv
</code>

4. Add this to GRAPHS= in hobbitserver.cfg:

<code>
diskstat-reads::7,diskstat-writes::7,diskstat-kreads::7,diskstat-kwrites::7,diskstat-wait::7,diskstat-actv::7,diskstat-svct::7,diskstat-wsvc::7,diskstat-pw::7,diskstat-pb::7
</code>

The ''::7'' indicates the number of lines per graph (the default is 4). Flavour to taste.

5. Add this to hobbitserver.cfg:

<code>
SPLITNCV_diskstat-reads="*:GAUGE"
SPLITNCV_diskstat-writes="*:GAUGE"
SPLITNCV_diskstat-kreads="*:GAUGE"
SPLITNCV_diskstat-kwrites="*:GAUGE"
SPLITNCV_diskstat-wait="*:GAUGE"
SPLITNCV_diskstat-actv="*:GAUGE"
SPLITNCV_diskstat-wsvc="*:GAUGE"
SPLITNCV_diskstat-svct="*:GAUGE"
SPLITNCV_diskstat-pw="*:GAUGE"
SPLITNCV_diskstat-pb="*:GAUGE"
</code>
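A quick note on what the SPLITNCV_ settings in step 5 do: they make hobbitd_rrd split each diskstat data message into one RRD file per device, each holding a single dataset named ''lambda''. As a rough illustration (the device names here are invented), a diskstat-reads report containing

<code>
sd0:12.3
sd1:0.5
</code>

ends up as ''diskstat-reads,sd0.rrd'' and ''diskstat-reads,sd1.rrd'' in that host's RRD directory on the Hobbit server. This is exactly what the FNPATTERN lines in the next step match on, and why their DEF lines read the ''lambda'' dataset.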
6. Add this to hobbitgraph.cfg:

<code>
[diskstat-reads]
        FNPATTERN diskstat-reads,(.*).rrd
        TITLE Disk Reads per Second
        YAXIS Reads
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-writes]
        FNPATTERN diskstat-writes,(.*).rrd
        TITLE Disk Writes per Second
        YAXIS Writes
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-kreads]
        FNPATTERN diskstat-kreads,(.*).rrd
        TITLE Disk Reads per Second in Kb
        YAXIS Kb
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-kwrites]
        FNPATTERN diskstat-kwrites,(.*).rrd
        TITLE Disk Writes per Second in Kb
        YAXIS Kb
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-wait]
        FNPATTERN diskstat-wait,(.*).rrd
        TITLE Average Number of Transactions Waiting
        YAXIS Total
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-actv]
        FNPATTERN diskstat-actv,(.*).rrd
        TITLE Average Number of Transactions Active
        YAXIS Total
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-svct]
        FNPATTERN diskstat-svct,(.*).rrd
        TITLE Average Response Time of Transaction
        YAXIS Milliseconds
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-wsvc]
        FNPATTERN diskstat-wsvc,(.*).rrd
        TITLE Average Time Spent in Wait Queue
        YAXIS Milliseconds
        -l 0
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-pw]
        FNPATTERN diskstat-pw,(.*).rrd
        TITLE Percent of Time Waiting
        YAXIS %
        -l 0
        -u 100
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

[diskstat-pb]
        FNPATTERN diskstat-pb,(.*).rrd
        TITLE Percent of Time Disk Busy
        YAXIS %
        -l 0
        -u 100
        DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
        LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
        GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
        GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
        GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
        GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n
</code>
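The script in the Source section below works on the output of ''/usr/bin/iostat -xrn'', taking two samples and discarding the first (which is only the average since boot). For orientation, the second sample looks roughly like this — the numbers and device names are invented, and the exact spacing may differ on your system:

<code>
    r/s,    w/s,   kr/s,   kw/s,  wait,  actv,  wsvc_t,  asvc_t,  %w,  %b, device
    0.4,    2.1,    3.2,   16.8,   0.0,   0.0,     0.1,     4.9,   0,   1, sd0
    0.0,    0.0,    0.0,    0.0,   0.0,   0.0,     0.0,     0.0,   0,   0, nfssrv:/export/home
</code>

The header line ending in ''device'' is what the script's ''grep -n ",device$"'' uses to find where the second sample starts; the device name in the last column is moved to the front before the per-column cut, and any entry containing '':/'' is treated as an NFS mount and skipped unless SHOW_NFS=yes.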
===== Source =====

==== diskstat.ksh ====

<code>
#!/bin/ksh

TEMPFILE=$BBTMP/diskstat.tmp

SHOW_NFS=no     # Set this to yes in the server-side clientlocal.cfg to change it:
                # DISKSTAT:SHOW_NFS=yes
DURATION=10     # The duration of the iostat sample.
                # This can be updated in the same way as above.

# Now we redefine some variables, if they are set in clientlocal.cfg
LOGFETCH=${BBTMP}/logfetch.$(uname -n).cfg
if [ -f $LOGFETCH ]
then
   grep "^DISKSTAT:" $LOGFETCH | cut -d":" -f2 \
   | while read NEW_DEF
     do
        eval $NEW_DEF   # eval so the assignment takes effect in this shell
     done
fi

> $TEMPFILE     # Make sure it's empty

# Collect some data to work with.
# We have to collect 2 sets, because the first set is the average since boot.
/usr/bin/iostat -xrn $DURATION 2 > $TEMPFILE.raw

# Define where the second set of data starts
LINE=$(cat $TEMPFILE.raw | grep -n ",device$" | tail -1 | cut -d":" -f1)

# Take the second set, and massage it into usable data:
# drop the commas, move the device name to the front, and remove it from the end.
cat $TEMPFILE.raw | awk "NR>$LINE" \
   | sed "s/,/ /g" \
   | awk '{ print $NF" "$0 }' \
   | awk '{ $NF="";print }' > $TEMPFILE.data
rm $TEMPFILE.raw

count=1
# Now we format the data and send it off to the server
for subtest in reads writes kreads kwrites wait actv wsvc svct pw pb
do
   ((count=count+1))
   echo "" >> $TEMPFILE
   cat $TEMPFILE.data | cut -d" " -f1,$count \
   | while read DEVICE VAL
     do
        echo "$DEVICE" | grep ":/" > /dev/null
        if [ $? -eq 0 -a "$SHOW_NFS" = "no" ]
        then
           continue     # Skip NFS mounts (a "break" here would also drop any devices after them)
        else
           DEVICE=$(echo $DEVICE | tr : - )
        fi
        echo "${DEVICE}:${VAL}" >> $TEMPFILE
     done
   echo "" >> $TEMPFILE
   $BB $BBDISP "data $MACHINE.diskstat-${subtest} $(echo; cat $TEMPFILE ;echo "" ;echo "ignore this" )"
   # Without the last echo "ignore this", it seems to not graph the last entry.
   # Odd really, but that seems to fix it.
   rm $TEMPFILE
done
rm $TEMPFILE.data
</code>

===== Known Bugs and Issues =====

2010-09-21 - Found and fixed a bug. (Left out the wsvc_t stat.)

All other bugs are currently unknown. If you find any, let me know, and I will see what I can do to fix them.

===== To Do =====

I was toying with the idea of having some of the values appear as alerts, with standard red/yellow/green alert thresholds and all the rest, but I'm not sure why. It might be useful to watch the average service time. However, to be of concern, high iostat figures need to be sustained; disk usage is expected to peak from time to time, so is it really suitable for alerts? And even if it does stay high for a sustained period, what exactly can you do about it? Your comments on the back of $100 bills only.

===== Credits =====

This all started because a piece of software was crashing on one of my servers every month or so. The application admin blamed me (and my server). I said it wasn't the server, but after some constructive googling I found a link hinting that it might be disk performance, so I decided to monitor disk performance and have some graphs ready for when it crashes again. So all credit for this goes to a really poorly written mail server that doesn't do single instancing. (Name of application withheld to protect the guilty.)

===== Changelog =====

  * **2010-09-09**
    * Initial release
  * **2010-09-21**
    * Fairly major bug fix. (Left out the wsvc_t stats)