====== db_cpu.ksh ======

^ Author | [[ everett.vernon@gmail.com | Vernon Everett ]] |
^ Compatibility | Tested on Xymon 4.3.10, Solaris 9, 10 & 11. On Xymon 4.2.3 the stack graph doesn't work |
^ Requirements | Perl, Oracle Database, Solaris |
^ Download | None |
^ Last Update | 2014-02-27 |

===== Description =====
How much CPU is that Oracle database instance using?

This came about from an interesting discussion with our DBAs. \\
They run multiple databases on a single host, and wanted to know which of their databases was taking up the lion's share of the CPU. \\
Memory usage is easy, because you can limit it on the database, but CPU not so much. The database isn't a single process.\\
After playing about, I figured out a way to tally up the CPU usage of all the processes related to a single database instance.\\
Once I had done that, I realised that we could do this for all databases, and present the results to Xymon for graphing.

The usage trends have proven very useful. We are using the results for job scheduling, capacity planning, and license reduction by reducing the number of available CPUs. \\
\\
So, that's what it does.\\
It plots the CPU usage of every database instance it can find.\\
It can also alert if one is using too much, and for fun, it can add the "idle" and "other" CPU figures, to complete the 100% and make a stack graph.

===== Installation =====
=== Client side ===
  - Copy the 2 scripts to ~/client/ext/
  - chown root prustat.pl
  - chmod 5744 prustat.pl
  - Add the following to ~/etc/clientlaunch.cfg

  [db_cpu]
          ENVFILE $XYMONCLIENTHOME/etc/xymonclient.cfg
          CMD $XYMONCLIENTHOME/ext/db_cpu.ksh
          LOGFILE $XYMONCLIENTHOME/logs/db_cpu.log
          INTERVAL 5m

=== Server side ===
==Edit xymonserver.cfg==
Add to the TEST2RRD variable\\
  db-cpu=ncv
Add to GRAPHS variable\\
  db-cpu::100
Add the following config entry.
  SPLITNCV_db-cpu="*:GAUGE"

==Add the following code to graphs.cfg==
This is the code for a stack graph.
Probably works better if you include the idle and other usage figures.
  [db-cpu]
          FNPATTERN db-cpu,(.*).rrd
          TITLE Database %CPU Utilisation
          YAXIS %
          -l 0
          -u 100
          DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
          AREA:p@RRDIDX@#@COLOR@:@RRDPARAM@:@STACKIT@
          GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
          GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
          GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
          GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

For a line graph, use this code. If you are not including the idle and other CPU usage, then this is probably a better option.
  [db-cpu]
          FNPATTERN db-cpu,(.*).rrd
          TITLE Database %CPU Utilisation
          YAXIS %
          -l 0
          -u 100
          DEF:p@RRDIDX@=@RRDFN@:lambda:AVERAGE
          LINE2:p@RRDIDX@#@COLOR@:@RRDPARAM@
          GPRINT:p@RRDIDX@:LAST: \: %5.1lf (cur)
          GPRINT:p@RRDIDX@:MAX: \: %5.1lf (max)
          GPRINT:p@RRDIDX@:MIN: \: %5.1lf (min)
          GPRINT:p@RRDIDX@:AVERAGE: \: %5.1lf (avg)\n

===== Source =====
==== prustat.pl ====
This is a slightly modified version of Brendan Gregg's excellent dtrace script.\\
I take no credit for it at all. (As much as I would love to)\\
Go look [[http://www.brendangregg.com|here]] for more of his awesome work. \\
I changed the script to hard-code its PATH rather than inherit it from the environment; an inherited PATH could cause security issues because the script runs as root.

  #!/usr/bin/perl
  #
  # prustat - Process utilisation stats: %CPU, %Mem, %Disk, %Net. Solaris 10.
  #       Needs to run as root. This is a demonstration release - check for
  #       newer optimised versions. Uses Kstat, DTrace and procfs.
  #
  # 12-Mar-2005, ver 0.50 (demonstration release, http://www.brendangregg.com)
  #
  #
  # USAGE:
  #       prustat [-cehinuwxz] [-p PID] [-s sort] [-t top] [interval] [count]
  #
  #      prustat               # %Utilisation
  #      prustat -i            # + I/O stats
  #      prustat -u            # + USR/SYS times
  #      prustat -x            # + Context Switchs
  #      prustat -c            # Clear screen
  #      prustat -w            # Wide output
  #      prustat -z            # Skip zero lines
  #      prustat -e            # Extra precision
  #      prustat -p PID        # this PID only
  #      prustat -s sort       # sort on pid|cpu|mem|disk|net|utime|vctx|...
  #      prustat -t lines      # print top number of lines only
  #  eg,
  #      prustat 2             # 2 second samples (first is historical)
  #      prustat 10 5          # 5 x 10 second samples
  #      prustat -t 8 10 5     # 5 x 10 second samples, top 8 lines only
  #      prustat -ct 20 5      # 20 lines with screen refresh each 5 seconds
  #      prustat -iuxct 5 10   # multi output, all reports every 10 seconds
  #      prustat -ct 22 -s cpu 5    # 22 lines, sort by cpu, every 5 secs
  #      prustat -ct 22 -s mem 5    # 22 lines, sort by mem, every 5 secs
  #      prustat -ct 22 -s net 5    # 22 lines, sort by network, every 5 secs
  #      prustat -ct 22 -s disk 5   # 22 lines, sort by disk, every 5 secs
  #
  # FIELDS:
  #       PID       Process ID
  #       CPU       Percent CPU
  #       Mem       Percent RAM
  #       Disk      Percent Disk
  #       Net       Percent Network
  #       MAJF      Major Page Faults (disk I/O)
  #       INBLK     In Blocks (disk I/O reads)
  #       OUBLK     Out Blocks (disk I/O writes)
  #       CHAR-kb   Character I/O Kbytes
  #       COMM      Command name
  #       USR       User Time
  #       SYS       System Time
  #       WAIT      Wait for CPU Time
  #       VCTX      Voluntary Context Switches (I/O bound)
  #       ICTX      Involuntary Context Switches (CPU bound)
  #       SYSC      System calls
  #
  # WARNING: This program will run DTrace to gather Disk and Network data.
  #       This has not been fully tested on different environments to study the
  #       impact of these extra measurements. For now this is a demonstration
  #       release - best to run in development for short periods. Check for
  #       newer versions and updates to this message.
  #
  # NOTE: There is no historical values for Disk or Network utilisation percent,
  #       the first sample for these will always show zero.
  #
  # REFERENCES: /usr/include/sys/procfs.h
  #
  # SEE ALSO: iosnoop, psio, prusage    # process Disk I/O
  #           socketsnoop.d             # process TCP
  #           prstat -m                 # USR/SYS times, ...
  #
  # COPYRIGHT: Copyright (c) 2005 Brendan Gregg.
  #
  #  This program is free software; you can redistribute it and/or
  #  modify it under the terms of the GNU General Public License
  #  as published by the Free Software Foundation; either version 2
  #  of the License, or (at your option) any later version.
  #
  #  This program is distributed in the hope that it will be useful,
  #  but WITHOUT ANY WARRANTY; without even the implied warranty of
  #  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  #  GNU General Public License for more details.
  #
  #  You should have received a copy of the GNU General Public License
  #  along with this program; if not, write to the Free Software Foundation,
  #  Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
  #
  #  (http://www.gnu.org/copyleft/gpl.html)
  #
  # Author: Brendan Gregg [Sydney, Australia]
  #
  # 12-Mar-2005 Brendan Gregg Created this.
  use Getopt::Std;
  use Sun::Solaris::Kstat;
  my $Kstat = Sun::Solaris::Kstat->new();
  #
  # --- Default Variables ---
  #
  $INTERVAL = 1;            # seconds to sample
  $MAX = 1;                 # max count of samples
  $NEW = 0;                 # skip summary output (new data only)
  $WIDE = 0;                # print wide output (don't truncate)
  $SCHED = 0;               # print PID 0
  $TOP = 0;                 # print top many only
  $CLEAR = 0;               # clear screen before outputs
  $ZERO = 0;                # if 1, skip zero entries (all 0.00)
  $STYLE_UTIL = 1;          # default output style, utilisation
  $STYLE_IO = 0;            # output style, I/O
  $STYLE_CTX = 0;           # output style, Context Switches
  $STYLE_TIME = 0;          # output style, Times
  $STYLE_EXTRA = 0;         # output style, Extra precision
  $MULTI = 0;               # multi report, multiple styles
  $TARGET_PID = -1;         # target PID, -1 means all
  $TIME_BEGIN = 1;          # start of interval, ns
  $TIME_END = 1;            # end of interval, ns
  $count = 1;               # current iteration
  $NIC_DEF = 100_000_000;   # default NIC speed (100 Mbps)
  ### Network card instance names
  @Network = qw(dmfe bge be ce eri ge hme le ppp qfe rtls sppp iprb);
  $Network{$_} = 1 foreach (@Network);
  #
  # --- Command Line Arguments ---
  #
  &Usage() if $ARGV[0] eq "--help";
  getopts('cehinuwxzp:s:t:') || &Usage();
  &Usage() if $opt_h;
  $NEW = 1 if $opt_n;
  $WIDE = 1 if $opt_w;
  $CLEAR = 1 if $opt_c;
  $ZERO = 1 if $opt_z;
  $STYLE_IO = 1 if $opt_i;
  $STYLE_CTX = 1 if $opt_x;
  $STYLE_TIME = 1 if $opt_u;
  $STYLE_EXTRA = 1 if $opt_e;
  $STYLE_IO = 1 if $opt_i;
  $STYLE_UTIL = 0 if $opt_i
      || $opt_x || $opt_u || $opt_e;
  $TOP = $opt_t if defined $opt_t;
  $SORT = $opt_s if defined $opt_s;
  $TARGET_PID = $opt_p if defined $opt_p;
  $MAX = 2**32 if @ARGV == 1;
  $INTERVAL = shift(@ARGV) || $INTERVAL;
  $MAX = shift(@ARGV) || $MAX;
  $CLEARSTR = `clear` if $CLEAR;
  $MULTI = 1 if ($STYLE_IO + $STYLE_CTX + $STYLE_TIME) > 1;
  #
  # --- Determine Network Capacity ---
  #
  my ($error,$time,$module,$instance,$name);
  my ($bytes,$rbytes,$wbytes);
  my (%Modules,%Instances,%Names);
  $NIC_SPEED = 0;           # sum of Mbps across all NICs
  ### Loop over all NICs
  foreach $module (keys(%$Kstat)) {
      next unless $Network{$module};
      $Modules = $Kstat->{$module};
      foreach $instance (keys(%$Modules)) {
          $Instances = $Modules->{$instance};
          foreach $name (keys(%$Instances)) {
              $Names = $Instances->{$name};
              if (defined $$Names{ifspeed}) {
                  $NIC_SPEED += $$Names{ifspeed};
              } else {
                  $NIC_SPEED += $NIC_SPEED;
              }
          }
      }
  }
  $NIC_SPEED = $NIC_DEF if $NIC_SPEED == 0;
  #
  # --- Open DTrace ---
  #
  @Dscript = <DATA>;
  $dscript = join('',@Dscript);
  $ENV{"PATH"} = "/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/perl5/bin:/usr/openwin/bin:/usr/X11/bin:/usr/ccs/bin:/usr/dt/bin/:/usr/local/samba/bin:/opt/sfw/bin";
  open(DTRACE,"$dscript 2>/dev/null |") || die("ERROR1: Can't open dtrace: $!\n");
  ### Cleanup on signals
  $SIG{INT} = \&Cleanup;
  $SIG{QUIT} = \&Cleanup;
  $SIG{TERM} = \&Cleanup;
  $SIG{PIPE} = \&Cleanup;
  #
  # --- Main ---
  #
  for (;$count <= $MAX; $count++) {
      ### Get CPU and Mem data
      &GetProcStat();
      next if $NEW && $count == 1;
      ### Preprocess PID
      &ProcessPID();
      ### Print data
      print $CLEARSTR if $CLEAR;
      &PrintUtil($SORT) if $STYLE_UTIL;
      &PrintExtra($SORT) if $STYLE_EXTRA;
      &PrintIO($SORT) if $STYLE_IO;
      &PrintCtx($SORT) if $STYLE_CTX;
      &PrintTime($SORT) if $STYLE_TIME;
      ### Cleanup memory
      undef %Comm;
      undef %PID;
      $TIME_BEGIN = $TIME_END;
      ### Get Disk and Net data
      for ($pause = 0; $pause < $INTERVAL; $pause++) {
          &GetDTraceStat();
      }
  }
  close(DTRACE);
  #
  # --- Subroutines ---
  #
  # GetProcStat - Gets /proc usage statistics and saves them in
  #       %PID.
  #       This can be run multiple times, the first time %PID will be
  #       populated with the summary since boot values.
  #       This reads /proc/*/usage and /proc/*/prstat.
  #
  sub GetProcStat {
      my $pid;
      chdir "/proc";
      ### Main PID Loop
      foreach $pid (sort {$a<=>$b} <*>) {
          next if $pid == $$;
          next if $pid == 0 && $SCHED == 0;
          next if $TARGET_PID > -1 && $pid != $TARGET_PID;
          ### Read usage stats
          open(USAGE,"/proc/$pid/usage") || next;
          read(USAGE,$usage,256);
          close USAGE;
          ### Unpack usage values
          ($pr_lwpid, $pr_count, $pr_tstamp, $pr_create, $pr_term,
           $pr_rtime, $pr_utime, $pr_stime, $pr_ttime, $pr_tftime,
           $pr_dftime, $pr_kftime, $pr_ltime, $pr_slptime, $pr_wtime,
           $pr_stoptime, $filltime, $pr_minf, $pr_majf, $pr_nswap,
           $pr_inblk, $pr_oublk, $pr_msnd, $pr_mrcv, $pr_sigs,
           $pr_vctx, $pr_ictx, $pr_sysc, $pr_ioch, $filler) =
           unpack("iia8a8a8a8a8a8a8a8a8a8a8a8a8a8a48LLLLLLLLLLLLa40",$usage);
          ### Process usage values
          $New{$pid}{utime} = timestruct2int($pr_utime);
          $New{$pid}{stime} = timestruct2int($pr_stime);
          $New{$pid}{ttime} = timestruct2int($pr_ttime);
          $New{$pid}{ltime} = timestruct2int($pr_ltime);
          $New{$pid}{wtime} = timestruct2int($pr_wtime);
          $New{$pid}{slptime} = timestruct2int($pr_slptime);
          $New{$pid}{minf} = $pr_minf;
          $New{$pid}{majf} = $pr_majf;
          $New{$pid}{nswap} = $pr_nswap;
          $New{$pid}{inblk} = $pr_inblk;
          $New{$pid}{oublk} = $pr_oublk;
          $New{$pid}{vctx} = $pr_vctx;
          $New{$pid}{ictx} = $pr_ictx;
          $New{$pid}{sysc} = $pr_sysc;
          $New{$pid}{ioch} = $pr_ioch;
          # and a couple of my own,
          $New{$pid}{blks} = $pr_inblk + $pr_oublk;
          $New{$pid}{ctxs} = $pr_vctx + $pr_ictx;
          ### Read psinfo stats
          open(PSINFO,"/proc/$pid/psinfo") || next;
          read(PSINFO,$psinfo,256);
          close PSINFO;
          ### Unpack psinfo values
          ($pr_flag, $pr_nlwp, $pr_pid, $pr_ppid, $pr_pgid, $pr_sid,
           $pr_uid, $pr_euid, $pr_gid, $pr_egid, $pr_addr, $pr_size,
           $pr_rssize, $pr_pad1, $pr_ttydev, $pr_pctcpu, $pr_pctmem,
           $pr_start, $pr_time, $pr_ctime, $pr_fname, $pr_psargs,
           $pr_wstat, $pr_argc, $pr_argv, $pr_envp, $pr_dmodel, $pr_taskid,
           $pr_projid, $pr_nzomb, $pr_poolid, $pr_zoneid, $filler) =
           unpack("iiiiiiiiiiIiiiiSSa8a8a8Z16Z80iiIIaa3iiiiii",$psinfo);
          ### Process psinfo values
          $PID{$pid}{pctcpu} = $pr_pctcpu / 0x8000;
          $PID{$pid}{pctmem} = $pr_pctmem / 0x8000;
          $PID{$pid}{uid} = $pr_uid;
          $New{$pid}{size} = $pr_size;
          $New{$pid}{rssize} = $pr_rssize;
          ### Save command name
          $Comm{$pid} = $pr_fname;
      }
      ### Turn incrementals into values
      foreach $pid (keys %New) {
          # save PID values,
          foreach $key (keys %{$New{$pid}}) {
              $PID{$pid}{$key} = $New{$pid}{$key} - $Old{$pid}{$key};
          }
      }
      undef %Old;
      ### Remember old value
      foreach $pid (keys %New) {
          # save old values,
          foreach $key (keys %{$New{$pid}}) {
              $Old{$pid}{$key} = $New{$pid}{$key};
          }
      }
  }
  # GetDTraceStat - read details from a DTrace connection until a heartbeat
  #       is read (happens every second).
  #
  sub GetDTraceStat {
      my ($line,$cmd,$rest,$uid,$pid,$size,$name,$delta);
      while ($line = <DTRACE>) {
          chomp($line);
          ($cmd,$rest) = split(' ',$line,2);
          ### Start
          $TIME_BEGIN = $rest if $cmd eq "B";
          ### Heartbeat
          if ($cmd eq "T") {
              $TIME_END = $rest;
              last;
          }
          ### Network traffic
          if ($cmd eq "N") {
              ($uid,$pid,$size,$name) = split(' ',$rest);
              next if $TARGET_PID > -1 && $pid != $TARGET_PID;
              $PID{$pid}{netrw} += $size;
              unless (defined $Comm{$pid}) {
                  $Comm{$pid} = $name;
                  $PID{$pid}{uid} = $uid;
              }
          }
          ### Disk traffic
          if ($cmd eq "D") {
              ($uid,$pid,$delta,$size,$name) = split(' ',$rest);
              next if $TARGET_PID > -1 && $pid != $TARGET_PID;
              $PID{$pid}{dtime} += $delta;
              unless (defined $Comm{$pid}) {
                  $Comm{$pid} = $name;
                  $PID{$pid}{uid} = $uid;
              }
          }
      }
  }
  # ProcessPID - pre process %PID before printing.
  #       This calculates values such as sumpct for sorting.
  #
  sub ProcessPID {
      my ($pid,$cpu,$mem,$disk,$net,$sample);
      my ($factorcpu,$factormem,$factordisk,$factornet);
      ### Factors for %util conversions
      $sample = $TIME_END - $TIME_BEGIN || 1;
      $factorcpu = 100;
      $factormem = 100;
      $factordisk = 100 / $sample;
      $factornet = 800 / ($NIC_SPEED * ($sample / 1_000_000_000));
      ### Process %PID
      foreach $pid (keys(%PID)) {
          $cpu = $PID{$pid}{pctcpu} * $factorcpu;
          $mem = $PID{$pid}{pctmem} * $factormem;
          $disk = $PID{$pid}{dtime} * $factordisk;
          $net = $PID{$pid}{netrw} * $factornet;
          $PID{$pid}{cpu} = $cpu;
          $PID{$pid}{mem} = $mem;
          $PID{$pid}{disk} = $disk;
          $PID{$pid}{net} = $net;
          $PID{$pid}{all} = $cpu + $mem + $disk + $net;
      }
  }
  # PrintUtil - print a report on utilisation.
  #
  sub PrintUtil {
      my $sort = shift || "all";
      my $top = $TOP;
      my ($pid,$cpu,$mem,$disk,$net,$all);
      ### Print header
      printf("%5s %6s %6s %6s %6s %s\n","PID",
       "%CPU","%Mem","%Disk","%Net","COMM");
      ### Print report
      foreach $pid (&SortPID("$sort")) {
          # Fetch utilisations
          $cpu = $PID{$pid}{cpu};
          $mem = $PID{$pid}{mem};
          $disk = $PID{$pid}{disk};
          $net = $PID{$pid}{net};
          $all = $PID{$pid}{all};
          # Skip zero lines if needed
          if ($ZERO && ($all < 0.02)) { next; }
          # Print output
          printf("%5s %6.2f %6.2f %6.2f %6.2f %s\n",$pid,
           $cpu,$mem,$disk,$net,trunc($Comm{$pid},33));
          last if --$top == 0;
      }
      print "\n" if $MULTI;
  }
  # PrintExtra - print a report on utilisation, with extra decimal places.
  #
  sub PrintExtra {
      my $sort = shift || "all";
      my $top = $TOP;
      my ($pid,$cpu,$mem,$disk,$net,$all);
      ### Print header
      printf("%5s %8s %8s %8s %8s %s\n","PID",
       "%CPU","%Mem","%Disk","%Net","COMM");
      ### Print report
      foreach $pid (&SortPID("$sort")) {
          # Fetch utilisations
          $cpu = $PID{$pid}{cpu};
          $mem = $PID{$pid}{mem};
          $disk = $PID{$pid}{disk};
          $net = $PID{$pid}{net};
          $all = $PID{$pid}{all};
          # Skip zero lines if needed
          if ($ZERO && ($all < 0.02)) { next; }
          # Print output
          printf("%5s %8.4f %8.4f %8.4f %8.4f %s\n",$pid,
           $cpu,$mem,$disk,$net,trunc($Comm{$pid},33));
          last if --$top == 0;
      }
      print "\n" if $MULTI;
  }
  # PrintIO - print a report with I/O statistics: minf, majf, inblk, oublk, ioch.
  #
  sub PrintIO {
      my $sort = shift || "blks";
      my $top = $TOP;
      my ($pid,$cpu,$mem,$disk,$net,$all);
      ### Print header
      printf("%5s %6s %6s %6s %6s %8s %8s %9s %s\n","PID",
       "%CPU","%Mem","%Disk","%Net","INBLK","OUBLK",
       "CHAR-kb","COMM");
      ### Print report
      foreach $pid (&SortPID("$sort")) {
          # Fetch utilisations
          $cpu = $PID{$pid}{cpu};
          $mem = $PID{$pid}{mem};
          $disk = $PID{$pid}{disk};
          $net = $PID{$pid}{net};
          $all = $PID{$pid}{all};
          # Skip zero lines if needed
          if ($ZERO && ($all < 0.02)) { next; }
          # Print output
          printf("%5s %6.2f %6.2f %6.2f %6.2f %8d %8d %9.0f %s\n",
           $pid,$cpu,$mem,$disk,$net,$PID{$pid}{inblk},
           $PID{$pid}{oublk},$PID{$pid}{ioch}/1024,
           trunc($Comm{$pid},33));
          last if --$top == 0;
      }
      print "\n" if $MULTI;
  }
  # PrintTime - print a report including usr, sys and wait times.
  #
  sub PrintTime {
      my $sort = shift || "cpu";
      my $top = $TOP;
      my ($pid,$cpu,$mem,$disk,$net,$all);
      ### Print header
      printf("%5s %6s %6s %6s %6s %8s %8s %8s %s\n","PID",
       "%CPU","%Mem","%Disk","%Net","USR","SYS",
       "WAIT","COMM");
      ### Print report
      foreach $pid (&SortPID("$sort")) {
          # Fetch utilisations
          $cpu = $PID{$pid}{cpu};
          $mem = $PID{$pid}{mem};
          $disk = $PID{$pid}{disk};
          $net = $PID{$pid}{net};
          $all = $PID{$pid}{all};
          # Skip zero lines if needed
          if ($ZERO && ($all < 0.02)) { next; }
          # Print output
          printf("%5s %6.2f %6.2f %6.2f %6.2f %8d %8d %8d %s\n",
           $pid,$cpu,$mem,$disk,$net,$PID{$pid}{utime},
           $PID{$pid}{stime},$PID{$pid}{wtime},
           trunc($Comm{$pid},33));
          last if --$top == 0;
      }
      print "\n" if $MULTI;
  }
  # PrintCtx - print a report on context switches: vctx, ictx and sys calls.
  #
  sub PrintCtx {
      my $sort = shift || "ctxs";
      my $top = $TOP;
      my ($pid,$cpu,$mem,$disk,$net,$all);
      ### Print header
      printf("%5s %6s %6s %6s %6s %8s %8s %9s %s\n","PID",
       "%CPU","%Mem","%Disk","%Net","VCTX","ICTX",
       "SYSC","COMM");
      ### Print report
      foreach $pid (&SortPID("$sort")) {
          # Fetch utilisations
          $cpu = $PID{$pid}{cpu};
          $mem = $PID{$pid}{mem};
          $disk = $PID{$pid}{disk};
          $net = $PID{$pid}{net};
          $all = $PID{$pid}{all};
          # Skip zero lines if needed
          if ($ZERO && ($all < 0.02)) { next; }
          # Print output
          printf("%5s %6.2f %6.2f %6.2f %6.2f %8d %8d %9d %s\n",
           $pid,$cpu,$mem,$disk,$net,$PID{$pid}{vctx},
           $PID{$pid}{ictx},$PID{$pid}{sysc},
           trunc($Comm{$pid},33));
          last if --$top == 0;
      }
      print "\n" if $MULTI;
  }
  # SortPID - sorts the PID hash by the key given as arg1, returning a sorted
  #       array of PIDs.
  #
  sub SortPID {
      my $sort = shift;
      ### Sort numerically
      if ($sort eq "pid") {
          return sort {$a <=> $b} (keys %PID);
      } else {
          return sort {$PID{$b}{$sort} <=> $PID{$a}{$sort}} (keys %PID);
      }
  }
  # timestruct2int - Convert a timestruct value (64 bits) into an integer
  #       of seconds.
  #
  sub timestruct2int {
      my $timestruct = shift;
      my ($secs,$nsecs) = unpack("LL",$timestruct);
      my $time = $secs + $nsecs * 10**-9;
      return $time;
  }
  # trunc - Returns a truncated string if required.
  #
  sub trunc {
      my $string = shift;
      my $length = shift;
      if ($WIDE) {
          return $string;
      } else {
          return substr($string,0,$length);
      }
  }
  # Cleanup - subroutine for signal management.
  #
  sub Cleanup {
      close(DTRACE);
      exit(0);
  }
  # Usage - print usage message and exit.
  #
  sub Usage {
      print STDERR <dev = args[0]->b_edev;
      this->blk = args[0]->b_blkno;
      start_uid[this->dev,this->blk] = curpsinfo->pr_euid;
      start_pid[this->dev,this->blk] = pid;
      start_comm[this->dev,this->blk] = (char *)curpsinfo->pr_fname;
      last = timestamp;
  }
  /*
  ** Process completion
  */
  io:::done
  {
      /* fetch entry values */
      this->dev = args[0]->b_edev;
      this->blk = args[0]->b_blkno;
      this->delta = timestamp - last;
      this->suid = start_uid[this->dev,this->blk];
      this->spid = start_pid[this->dev,this->blk];
      this->scomm = start_comm[this->dev,this->blk];
      /* memory cleanup */
      start_uid[this->dev,this->blk] = 0;
      start_pid[this->dev,this->blk] = 0;
      start_comm[this->dev,this->blk] = 0;
      last = timestamp;
  }
  /*
  ** Print event details
  */
  io:::done
  {
      printf("D %d %d %d %d %s\n",
       this->suid,this->spid,this->delta,args[0]->b_bcount,
       this->scomm == 0 ? "." : stringof(this->scomm));
  }
  /*
  ** --- NETWORK ----
  */
  /*
  ** Store Write Values
  */
  fbt:ip:tcp_output:entry
  {
      self->uid = curpsinfo->pr_euid;
      self->pid = pid;
      self->comm = (char *)curpsinfo->pr_fname;
      self->size = msgdsize(args[1]);
      self->ok = 1;
  }
  /*
  ** Store Read Values
  */
  fbt:sockfs:sotpi_recvmsg:entry
  {
      self->uid = curpsinfo->pr_euid;
      self->pid = pid;
      self->comm = (char *)curpsinfo->pr_fname;
      /* We track the read request (man uio), */
      self->uiop = (struct uio *) arg2;
      self->residual = self->uiop->uio_resid;
      /* The following ensures the type is AF_INET (sys/socket.h), */
      this->sonode = (struct sonode *)arg0;
      self->ok = (int)this->sonode->so_type == 2 ?
       1 : 0;
  }
  fbt:sockfs:sotpi_recvmsg:return
  /arg0 != 0 && self->ok/
  {
      /* calculate successful read size */
      self->size = self->residual - self->uiop->uio_resid;
  }
  /*
  ** Print output
  */
  fbt:ip:tcp_output:entry,
  fbt:sockfs:sotpi_recvmsg:return
  /self->ok/
  {
      printf("N %d %d %d %s\n",self->uid,self->pid,
       self->size,stringof(self->comm));
      self->ok = 0;
      self->uid = 0;
      self->pid = 0;
      self->comm = 0;
      self->size = 0;
      self->residual = 0;
      self->uiop = 0;
  }
  '

==== db_cpu.ksh ====
This is my contribution.\\
It calls the Perl script above.

  #!/usr/bin/ksh
  COLOUR=green
  type top >> /dev/null
  ERR=$?
  if [ $ERR -eq 0 ]
  then
     TOP=$(type top| cut -d\( -f2 | cut -d\) -f1)
  else
     [ -x /usr/local/bin/top ] && TOP=/usr/local/bin/top
  fi
  CPU_tmp=$XYMONTMP/db_cpu_list
  TEMPFILE=$XYMONTMP/db_cpu
  NO_DB_ALERT=true
  NO_DB_COLOUR=yellow
  # Some typesets to make formatting easier.
  typeset -L15 DB
  typeset -R10 USAGE
  # To include the idle and "other" CPU utilisations in your graph, set this
  # to a value of 1. Set this to anything else to only show the database usage
  SHOW_IDLE=1
  # To alert on excessive CPU usage, set this to 1, or anything else if you don't
  EXCESS_ALERT=1
  # And set the alert levels. Salt to taste.
  # These will be ignored if EXCESS_ALERT is not set to 1
  RED=75
  YELLOW=50
  # Start making the page.
  echo "$(date +%a" "%b" "%d" "%T" "%Z" "%Y) - Database CPU Usage" > $TEMPFILE
  echo >> $TEMPFILE
  echo "Database CPU % Usage" >> $TEMPFILE
  echo "-------- -----------" >> $TEMPFILE
  # Get the idle figure. Run top -d4, and take the last value to prevent
  # a possible skewed result from the other xymon tasks that have just kicked off.
  # $TOP is quoted so that an empty value fails the test instead of passing it.
  if [ -x "$TOP" ]
  then
     # We have a working version of top
     IDLE=$($TOP -n -d4 | grep "^CPU states" | cut -d"%" -f1 | sed "s/CPU states: //g" | tail -1)
     IDLE="${IDLE}000"
  else
     # No working top. We will have to resort to iostat.
     # Pity. Top is at least accurate to 1 decimal place.
     IDLE=$(iostat -p 2 5 | awk '{ print $NF }' | tail -1)
     IDLE="${IDLE}.0000"
  fi
  # Set DBID to your oracle identifier. You can use the userID or the username
  # or any value that will identify your oracle processes in a ps | grep
  DBID="^ 0000100"
  # Get a list of all Databases running on the system
  DBLIST=$(ps -efa | grep "$DBID" | grep pmon | cut -d"_" -f3 | sort | uniq)
  if [ -z "$DBLIST" ]
  then
     echo "No databases found!" >> $TEMPFILE
     [ "$NO_DB_ALERT" = "true" ] && COLOUR=$NO_DB_COLOUR
     DB_USED="0.0000"   # A reasonable assumption if there are no databases
  else
     # Get a list of all the processes and their CPU utilisation.
     # This is a slightly modified version of Brendan Gregg's excellent dtrace script.
     # I take no credit for it at all. (As much as I would love to)
     # Go look here for more of his awesome work. http://www.brendangregg.com
     $XYMONCLIENTHOME/ext/prustat.pl -we 2>/dev/null | awk '{ print $1" x " $2 }' > $CPU_tmp 2>/dev/null
     DB_USED=0
     OTHER=0
     for tDB in $DBLIST
     do
        [ -z "$tDB" ] && continue   # Do not bother if there are empty strings
        LIST="0"
        DB=$tDB
        # We can't use $DB for the egrep, because of the formatting. It adds extra spaces.
        for PROC in $(ps -efa | egrep "oracle${tDB} |ora_...._${tDB}$"| grep -v " grep " | awk '{ print $2 }')
        do
           LIST=$(grep "^${PROC} x " $CPU_tmp | cut -d" " -f3)+$LIST
        done
        # We need to use bc for our calculations, because shell maths is limited
        # to integers, and we need a higher level of precision for the
        # results to "add up".
        USAGE=$(echo $LIST | sed 's/++*/+/g' | sed 's/^+//g' | bc)
        USAGE=$(echo $USAGE | sed 's/ \./0\./g')
        [ $USAGE -eq 0 ] && USAGE="0.0000"
        if [ $EXCESS_ALERT -eq 1 ]
        then
           # We want to alert on excessive usage.
           [ $USAGE -gt $YELLOW ] && COLOUR=yellow
           [ $USAGE -gt $RED ] && COLOUR=red
        fi
        DB_USED=$(echo "$DB_USED+$USAGE" | bc)
        USAGE=$(echo $USAGE | sed 's/ \./0\./g')
        echo "$DB $USAGE" >> $TEMPFILE.2
     done
  fi
  if [ $SHOW_IDLE -eq 1 ]
  then
     OTHER=$(echo "100-(${DB_USED}+${IDLE})" | bc)
     # Sometimes, with low "Other" usage, and because we collect the samples at
     # different times, we can end up with negative values for "OTHER"
     # This is obviously incorrect, so we do a bit of magic to make it "look" right.
     # We only change the IDLE and OTHER values though, not the database values.
     # Those are real data.
     if [ $OTHER -lt 0 ]
     then
        IDLE=$(echo ${IDLE}+${OTHER} | bc)
        OTHER="0.0000"
     fi
     [ $IDLE -eq 0 ] && IDLE="0.0000"
     echo "-------- -----------" >> $TEMPFILE.2
     DB="Other "
     USAGE=$OTHER
     USAGE=$(echo "$USAGE" | sed "s/ \./0\./g")
     echo "$DB $USAGE" >> $TEMPFILE.2
     DB="Idle"
     USAGE=$IDLE
     USAGE=$(echo "$USAGE" | sed "s/ \./0\./g")
     echo "$DB $USAGE" >> $TEMPFILE.2
  fi
  cat $TEMPFILE.2 >> $TEMPFILE
  # Send the status
  $XYMON $XYMSRV "status $MACHINE.db-cpu $COLOUR $(cat $TEMPFILE)"
  # Clear the TEMPFILE again, so we can send just the data for graphing.
  > $TEMPFILE
  grep -v -- "^-----" $TEMPFILE.2 | while read a b
  do
     echo "$a : $b" >> $TEMPFILE
  done
  # Now send the graph details
  $XYMON $XYMSRV "data $MACHINE.db-cpu $(echo; cat $TEMPFILE; echo; echo "Ignore this")"
  # -f, because $CPU_tmp is never created when no databases were found
  rm -f $CPU_tmp
  rm -f $TEMPFILE $TEMPFILE.2

===== Known Bugs and Issues =====
==The Numbers==
The numbers are not 100% accurate. They can't be. \\
I am getting the database numbers and the idle numbers at different times. Only seconds apart, but a lot can change in a second.\\
The database numbers are about as accurate as I can get them, and they are all collected at the same time, so relative to each other, they are accurate.\\
Accurate to 4 decimal places of a percent. But this means that anything using less than 0.0001% of the CPU will not be counted, and enough of these could add up.\\
That said, I think 4 decimal places should be OK.
If not, maybe we can get Brendan Gregg to write a 6 decimal place version of his script.\\
I am also calculating the value for "other", and then fudging it a little when I need to.\\
The numbers are not 100% accurate!\\
However, within a 1% uncertainty range, I think it's good enough.\\

==Graphs==
I have provided 2 graph types. Line2 and stack.\\
The line2 graph doesn't really make sense if you have SHOW_IDLE=1 in the script.\\
The stack graph doesn't work with older versions of Xymon.\\
However, I would suggest you try the options, and see what works best for you.\\
Just remember, if you have SHOW_IDLE=1 and then change it, you will have to manually remove the "idle" and "other" rrd files from the server to prevent them from hanging about and potentially making your graphs look crappy.
  rm ~/data/rrd/<$HOST>/db-cpu,Other.rrd
  rm ~/data/rrd/<$HOST>/db-cpu,Idle.rrd

==Real Bugs==
The following bugs have been fixed.\\
  * Bug with grep causing additional processes to be included under specific conditions where a database instance name is a sub-string of another instance name.
  * Top is not part of a standard Solaris install. Updated to use sar if top is not available (iostat since 2014-02-27).
  * Now it can handle the case of no databases running.
\\
At present I am unaware of any further bugs, but if you find any, let me know, and I will do my best to fix them.

===== To Do =====
Rest.\\
Where are my laurels?

===== Credits =====
Much of the credit for this must go to [[http://www.brendangregg.com/|Brendan Gregg]].\\
Without his magic, I would never have been able to create this script.\\
\\
Buy Brendan's books. [[http://www.bookdepository.com/search?searchTerm=brendan+gregg&search=Find+book|Available here]]\\
\\
\\
And some credit goes to a curious pair of DBAs who actually seem to care what their database is doing to the system.
Or maybe they just love pretty graphs.\\
\\
Thanks to Andrey Chervonets for pointing out a possible issue with the grep command, which could cause a dramatic skewing of the CPU usage.\\
\\
Credit also to Nick Pettefar for reminding me that SMCtop is not part of a standard Solaris install, and is not available everywhere.

===== Changelog =====
  * **2013-03-01**
    * Initial release
  * **2013-03-07**
    * Updated grep to ensure databases are correctly identified when you have an instance name that is a sub-string of another instance name, e.g. testdb and testdb2
    * Updated to use sar if top is unavailable
  * **2013-04-23**
    * Updated to handle the case of no databases running. Previously, it just didn't bother doing anything, but now it will report there are no databases, and can trigger a colour change if there are no databases found.
  * **2014-02-27**
    * Updated to run with Solaris 11.
    * If top isn't available (Solaris 10 and earlier), will use iostat to get idle stats.
    * Fixed a few display bugs.
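
To illustrate the 2013-03-07 grep fix listed above, here is a minimal sketch of how the anchored pattern keeps testdb and testdb2 apart. It is not part of db_cpu.ksh. The function names ''instance_pids'' and ''sample_ps'' and the sample process list are made up for the demonstration; only the egrep pattern is taken from the script.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: read "ps -ef" style lines on
# stdin and print the PIDs (2nd column) that belong to the instance in $1.
# The pattern is the one db_cpu.ksh uses: a server process "oracle<SID> "
# followed by a space, or a background process "ora_xxxx_<SID>" at end of line.
instance_pids() {
    inst=$1
    grep -E "oracle${inst} |ora_...._${inst}\$" | awk '{ print $2 }'
}

# Made-up sample of "ps -ef" output with two instances, testdb and testdb2.
sample_ps() {
    cat <<'EOF'
  oracle  1001     1   0 10:00:00 ?           0:01 ora_pmon_testdb
  oracle  1002     1   0 10:00:00 ?           0:01 ora_pmon_testdb2
  oracle  1003     1   0 10:00:00 ?           0:05 oracletestdb (LOCAL=NO)
  oracle  1004     1   0 10:00:00 ?           0:05 oracletestdb2 (LOCAL=NO)
EOF
}

# Prints 1001 and 1003 only. A plain "grep testdb" would match all four lines.
sample_ps | instance_pids testdb
```

The trailing space after ''oracle${inst}'' and the end-of-line anchor after ''ora_...._${inst}'' are what stop testdb from also matching the testdb2 processes, which is the skew Andrey Chervonets pointed out.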