Tuesday, January 3, 2012

The all-seeing eye of DTrace

I was recently involved with a problem related to backup software running on Solaris, as part of a general health check of the system I stumbled on something interesting that was not visible using conventional tools.

This tuned out to be a good opportunity to put my DTrace skill to work together with a few finished scripts. Once again it struck what how amazing this tool is, you can really see everything that is going on in your system and as it turns out, you can even see problems that does not even exist. Since this was so much fun and a good example I will walk through the steps again:

The thing that caught my eye was the output from errinfo of the DTrace toolkit. There are a very high rate of system calls returning in error, namely close() with -9 "Bad file number", as seen with errinfo:
whoami        ioctl   22     13  Invalid argument                 
init ioctl 25 212 Inappropriate ioctl for device
awk stat64 2 520 No such file or directory
java lwp_cond_wait 62 3492 timer expired
processx close 9 102073391 Bad file number
Syscall errors are in itself normal can be seen on any systems, but usually not several thousand per second. As the error message indicates this happens when a close() is issued on a file handle (Integer) that does not represent an open file for that process, which at first look seems like a quite useless operation.

We can also see that close() is by far the most used system call here:
# dtrace -q -n 'BEGIN { close=0;total=0 } syscall::close:entry \
{ close = close + 1 } syscall:::entry { total = total + 1 } END \
{ printf("%d close calls of %d total calls\n",close,total) }'

309530 close calls of 426212 total calls
Looking at which file descriptor the process is trying to close shows that there is an even distribution of close between 0 and 65536 and the only successful calls where to numbers lower than 1024 where numbers normally used unless a process has a very high amount of open files.
# dtrace -n 'syscall::close:entry { this->fd = arg0 } syscall::close:return \
/ errno!= 0 / { @failed = lquantize(this->fd,0,65536,16384) } syscall::close:return \
/errno == 0/ { @good = lquantize(this->fd,0,65535,1024) }
dtrace: description 'syscall::close:entry ' matched 3 probes
value ------------- Distribution ------------- count
< 0 | 7
0 |@@@@@@@@@@@@@ 414811
16384 |@@@@@@@@@ 294912
32768 |@@@@@@@@@ 294912
49152 |@@@@@@@@@ 294912
>= 65536 | 0

value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12459
1024 | 0
The processes responsible only lives only a short while, but by using dtruss i could trace system calls based on the process name:
 15889/1:  fork1()   = 0 0
15889/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: getpid(0x0, 0x1, 0x1CD0) = 15889 0
11385/1: lwp_self(0x0, 0x1, 0x40) = 1 0
11385/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: fcntl(0xA, 0x9, 0x0) = 0 0
11385/1: schedctl(0xFFFFFFFF7F7361B8, 0xFFFFFFFF7F738D60, 0x11A340)
= 2139062272 0
11385/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: sigaction(0x12, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: sigaction(0x2, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: sigaction(0xF, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: getrlimit(0x5, 0xFFFFFFFF7FFDED90, 0x0) = 0 0
11385/1: close(0x3) = 0 0
11385/1: close(0x4) = 0 0
11385/1: close(0x5) = 0 0
11385/1: close(0x6) = 0 0
11397/1: close(0xFFFF) = -1 Err#9
The process is issuing close on all numbers between 0x3 to 0xFFFF in a loop, as expected the first few are actually open and closed correctly but the other was majority is returning error -9.

If we look in the beginning of the trace we can see a fork, followed a little later by getrlimit(0x5,...), if we look at what that arguments to getrlimit means:
# egrep "RLIMIT.*5" /usr/include/sys/resource.h
#define RLIMIT_NOFILE 5 /* file descriptors */
The process is checking the limit of file descriptors and then closes the whole possible range which seems a little unnecessary since almost none of them are open. But this was just after a fork, and a forked process inherits all the open files of it's parent, this might not be what you want so a close is in order. There are however no easy way of getting a list of all used file descriptors so what we see here is a brute-force approach of making sure none are open before continuing. This would probably not have been noticed if it weren't for the unusual high limit of file descriptors.
# plimit $(pgrep processx|head -1) | grep nofiles
nofiles(descriptors) 65536 65536
Perhaps a iteration with close on the contents of /proc/${PID}/fd would have been less resource consuming in this scenario.

All of this was done in a production system without impact to applications which is crucial, you must be able to trust that it will never bring your system down. This is something DTrace can be trusted with where some platforms lacking it but tries to provide somewhat similar observability fails, read Brendans blog: using systemtap or the older but entertaining DTrace knockoffs.

Download the DTracetoolkit here.

ZFSSA/S7000 major update

The first major software update of S7000/ZFSSA/Fishwork in over a year is now available. With the original version "2011.Q1" it seems a bit delayed, perhaps due to the departure several key persons behind the software post Oracle acquisition of Sun. New features in this release:
  • ZFS RAIDZ read performance improvements
  • Significant fairness improvements during ZFS resilver operations
  • Significant zpool import speed improvements
  • Replication enhancements - including self-replication
  • Seval more including bug fixes.
There are alsoa few integration features for Oracle including RMAN and Infiniband if you happen to have a Exadata around.

The improved RAIDZ performances is the hybrid raidz/mirror allocator in zpool version 29.

The ZFSSA is a fantastic product with probably the best interface and analytics available. But the development seems to have stagnated a bit the last year, so have the blog post with useful information and performance comparison by the people behind it. And I still miss one feature badly; synchronous replication of datasets, continuous replication is not always good enough.

Release Notes