Tuesday, January 3, 2012

The all-seeing eye of DTrace

I was recently involved with a problem related to backup software running on Solaris. As part of a general health check of the system, I stumbled on something interesting that was not visible using conventional tools.

This turned out to be a good opportunity to put my DTrace skills to work, together with a few finished scripts. Once again it struck me how amazing this tool is: you can really see everything that is going on in your system and, as it turns out, you can even see problems that do not actually exist. Since this was so much fun and makes a good example, I will walk through the steps again:

The thing that caught my eye was the output from errinfo in the DTrace toolkit. There was a very high rate of system calls returning in error, namely close() failing with error 9, "Bad file number", as seen with errinfo:
whoami     ioctl          22         13  Invalid argument
init       ioctl          25        212  Inappropriate ioctl for device
awk        stat64          2        520  No such file or directory
java       lwp_cond_wait  62       3492  timer expired
processx   close           9  102073391  Bad file number
Syscall errors are in themselves normal and can be seen on any system, but usually not at a rate of several thousand per second. As the error message indicates, this happens when close() is issued on a file descriptor (an integer) that does not represent an open file for that process, which at first glance seems like a quite useless operation.

We can also see that close() is by far the most used system call here:
# dtrace -q -n 'BEGIN { close=0;total=0 } syscall::close:entry \
{ close = close + 1 } syscall:::entry { total = total + 1 } END \
{ printf("%d close calls of %d total calls\n",close,total) }'

309530 close calls of 426212 total calls
Looking at which file descriptors the process is trying to close shows an even distribution of close() between 0 and 65536, while the only successful calls were to numbers lower than 1024 — the range normally used unless a process has a very large number of open files.
# dtrace -n 'syscall::close:entry { this->fd = arg0 } syscall::close:return \
/ errno != 0 / { @failed = lquantize(this->fd,0,65536,16384) } syscall::close:return \
/ errno == 0 / { @good = lquantize(this->fd,0,65535,1024) }'
dtrace: description 'syscall::close:entry ' matched 3 probes
value ------------- Distribution ------------- count
< 0 | 7
0 |@@@@@@@@@@@@@ 414811
16384 |@@@@@@@@@ 294912
32768 |@@@@@@@@@ 294912
49152 |@@@@@@@@@ 294912
>= 65536 | 0

value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12459
1024 | 0
The processes responsible live only a short while, but by using dtruss I could trace system calls based on the process name:
 15889/1:  fork1()   = 0 0
15889/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: getpid(0x0, 0x1, 0x1CD0) = 15889 0
11385/1: lwp_self(0x0, 0x1, 0x40) = 1 0
11385/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: fcntl(0xA, 0x9, 0x0) = 0 0
11385/1: schedctl(0xFFFFFFFF7F7361B8, 0xFFFFFFFF7F738D60, 0x11A340) = 2139062272 0
11385/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: sigaction(0x12, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: sigaction(0x2, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: sigaction(0xF, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: getrlimit(0x5, 0xFFFFFFFF7FFDED90, 0x0) = 0 0
11385/1: close(0x3) = 0 0
11385/1: close(0x4) = 0 0
11385/1: close(0x5) = 0 0
11385/1: close(0x6) = 0 0
11397/1: close(0xFFFF) = -1 Err#9
The process is issuing close() on every number from 0x3 to 0xFFFF in a loop. As expected, the first few descriptors are actually open and close correctly, but the vast majority return error 9.

If we look at the beginning of the trace we can see a fork, followed a little later by getrlimit(0x5, ...). Looking up what that argument to getrlimit means:
# egrep "RLIMIT.*5" /usr/include/sys/resource.h
#define RLIMIT_NOFILE 5 /* file descriptors */
The process is checking the limit of file descriptors and then closes the whole possible range, which seems a little unnecessary since almost none of them are open. But this happens just after a fork, and a forked process inherits all the open files of its parent. That might not be what you want, so a close is in order. There is, however, no easy way of getting a list of all used file descriptors, so what we see here is a brute-force approach to making sure none are open before continuing. This would probably never have been noticed if it weren't for the unusually high limit of file descriptors:
# plimit $(pgrep processx|head -1) | grep nofiles
nofiles(descriptors) 65536 65536
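The pattern the trace shows can be sketched in C (the helper name is hypothetical, not the actual code of the backup product):

```c
#include <sys/resource.h>
#include <unistd.h>

/* Brute-force cleanup after fork(): close every possible descriptor up
 * to the RLIMIT_NOFILE soft limit, open or not.  With a 65536 limit
 * this issues tens of thousands of close() calls that simply fail with
 * EBADF -- exactly the storm errinfo revealed. */
static void close_all_from(int lowfd)
{
    struct rlimit rl;
    rlim_t limit = 1024;                /* conservative fallback */

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        limit = rl.rlim_cur;
    for (rlim_t fd = (rlim_t)lowfd; fd < limit; fd++)
        close((int)fd);                 /* mostly returns -1 / EBADF */
}
```

With the limit raised to 65536, this loop alone accounts for tens of thousands of failing syscalls per fork, even though the work it does per successful close is tiny.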
Perhaps an iteration with close() over the contents of /proc/${PID}/fd would have been less resource-consuming in this scenario.
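A sketch of that alternative, assuming a /proc filesystem (both Solaris and Linux expose open descriptors under /proc/&lt;pid&gt;/fd):

```c
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Close only the descriptors that are actually open, by listing
 * /proc/self/fd instead of sweeping the whole 0..65535 range.
 * Skips the directory stream's own descriptor. */
static void close_open_fds(int lowfd)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *ent;
    int dfd;

    if (dir == NULL)
        return;                         /* no /proc: caller must fall back */
    dfd = dirfd(dir);
    while ((ent = readdir(dir)) != NULL) {
        int fd = atoi(ent->d_name);     /* "." and ".." parse to 0 */
        if (fd >= lowfd && fd != dfd)
            close(fd);
    }
    closedir(dir);
}
```

For a process with a handful of open files and a 65536-descriptor limit, this does a dozen close() calls instead of sixty-five thousand.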

All of this was done on a production system without impact to the applications, which is crucial: you must be able to trust that the tool will never bring your system down. This is something DTrace can be trusted with, where some platforms that lack it but try to provide somewhat similar observability fail — read Brendan's blog posts: using SystemTap, or the older but entertaining DTrace knockoffs.

Download the DTraceToolkit here.


Brian Utterback said...

I've run into this scenario a couple of times in the past. It was not uncommon in times past to request the highest fileno and then close them all, iterating from one past the highest you wanted to keep all the way up to the largest. That used to be the only way, in fact. Then at some later date the application runs out of file descriptors, so the limit is increased to 64K, and now it takes a very long time just to start.

Even better than iterating over /proc yourself is to use the closefrom(3C) function call. It does the iteration for you. Why re-invent the wheel?

Anonymous said...

"you can even see problems that does not even exist."

Yes, it's always good to see problems that don't even exist. :)

"Perhaps a iteration with close on the contents of /proc/${PID}/fd would have been less resource consuming in this scenario."

That's not portable. If you want a non-portable approach, just use closefrom(). Otherwise, for portability, which is the reason tools do this, you have to brute force all fd's.