Tuesday, January 3, 2012

The all-seeing eye of DTrace

I was recently involved with a problem related to backup software running on Solaris, as part of a general health check of the system I stumbled on something interesting that was not visible using conventional tools.

This tuned out to be a good opportunity to put my DTrace skill to work together with a few finished scripts. Once again it struck what how amazing this tool is, you can really see everything that is going on in your system and as it turns out, you can even see problems that does not even exist. Since this was so much fun and a good example I will walk through the steps again:

The thing that caught my eye was the output from errinfo of the DTrace toolkit. There are a very high rate of system calls returning in error, namely close() with -9 "Bad file number", as seen with errinfo:
whoami        ioctl   22     13  Invalid argument                 
init ioctl 25 212 Inappropriate ioctl for device
awk stat64 2 520 No such file or directory
java lwp_cond_wait 62 3492 timer expired
processx close 9 102073391 Bad file number
Syscall errors are in itself normal can be seen on any systems, but usually not several thousand per second. As the error message indicates this happens when a close() is issued on a file handle (Integer) that does not represent an open file for that process, which at first look seems like a quite useless operation.

We can also see that close() is by far the most used system call here:
# dtrace -q -n 'BEGIN { close=0;total=0 } syscall::close:entry \
{ close = close + 1 } syscall:::entry { total = total + 1 } END \
{ printf("%d close calls of %d total calls\n",close,total) }'

309530 close calls of 426212 total calls
Looking at which file descriptor the process is trying to close shows that there is an even distribution of close between 0 and 65536 and the only successful calls where to numbers lower than 1024 where numbers normally used unless a process has a very high amount of open files.
# dtrace -n 'syscall::close:entry { this->fd = arg0 } syscall::close:return \
/ errno!= 0 / { @failed = lquantize(this->fd,0,65536,16384) } syscall::close:return \
/errno == 0/ { @good = lquantize(this->fd,0,65535,1024) }
dtrace: description 'syscall::close:entry ' matched 3 probes
value ------------- Distribution ------------- count
< 0 | 7
0 |@@@@@@@@@@@@@ 414811
16384 |@@@@@@@@@ 294912
32768 |@@@@@@@@@ 294912
49152 |@@@@@@@@@ 294912
>= 65536 | 0

value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 12459
1024 | 0
The processes responsible only lives only a short while, but by using dtruss i could trace system calls based on the process name:
 15889/1:  fork1()   = 0 0
15889/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: getpid(0x0, 0x1, 0x1CD0) = 15889 0
11385/1: lwp_self(0x0, 0x1, 0x40) = 1 0
11385/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: fcntl(0xA, 0x9, 0x0) = 0 0
11385/1: schedctl(0xFFFFFFFF7F7361B8, 0xFFFFFFFF7F738D60, 0x11A340)
= 2139062272 0
11385/1: lwp_sigmask(0x3, 0x0, 0x0) = 0xFFFF 0
11385/1: sigaction(0x12, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: sigaction(0x2, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: sigaction(0xF, 0xFFFFFFFF7FFDEE20, 0x0) = 0 0
11385/1: getrlimit(0x5, 0xFFFFFFFF7FFDED90, 0x0) = 0 0
11385/1: close(0x3) = 0 0
11385/1: close(0x4) = 0 0
11385/1: close(0x5) = 0 0
11385/1: close(0x6) = 0 0
....
11397/1: close(0xFFFF) = -1 Err#9
The process is issuing close on all numbers between 0x3 to 0xFFFF in a loop, as expected the first few are actually open and closed correctly but the other was majority is returning error -9.

If we look in the beginning of the trace we can see a fork, followed a little later by getrlimit(0x5,...), if we look at what that arguments to getrlimit means:
# egrep "RLIMIT.*5" /usr/include/sys/resource.h
#define RLIMIT_NOFILE 5 /* file descriptors */
The process is checking the limit of file descriptors and then closes the whole possible range which seems a little unnecessary since almost none of them are open. But this was just after a fork, and a forked process inherits all the open files of it's parent, this might not be what you want so a close is in order. There are however no easy way of getting a list of all used file descriptors so what we see here is a brute-force approach of making sure none are open before continuing. This would probably not have been noticed if it weren't for the unusual high limit of file descriptors.
# plimit $(pgrep processx|head -1) | grep nofiles
nofiles(descriptors) 65536 65536
Perhaps a iteration with close on the contents of /proc/${PID}/fd would have been less resource consuming in this scenario.

All of this was done in a production system without impact to applications which is crucial, you must be able to trust that it will never bring your system down. This is something DTrace can be trusted with where some platforms lacking it but tries to provide somewhat similar observability fails, read Brendans blog: using systemtap or the older but entertaining DTrace knockoffs.

Download the DTracetoolkit here.

ZFSSA/S7000 major update

The first major software update of S7000/ZFSSA/Fishwork in over a year is now available. With the original version "2011.Q1" it seems a bit delayed, perhaps due to the departure several key persons behind the software post Oracle acquisition of Sun. New features in this release:
  • ZFS RAIDZ read performance improvements
  • Significant fairness improvements during ZFS resilver operations
  • Significant zpool import speed improvements
  • Replication enhancements - including self-replication
  • Seval more including bug fixes.
There are alsoa few integration features for Oracle including RMAN and Infiniband if you happen to have a Exadata around.

The improved RAIDZ performances is the hybrid raidz/mirror allocator in zpool version 29.

The ZFSSA is a fantastic product with probably the best interface and analytics available. But the development seems to have stagnated a bit the last year, so have the blog post with useful information and performance comparison by the people behind it. And I still miss one feature badly; synchronous replication of datasets, continuous replication is not always good enough.

Release Notes

Tuesday, December 27, 2011

Solaris 11, illumos and the source

Over a year ago Oracle closed the source for OpenSolaris, leaving contributors outside of Oracle left in the cold. That was a huge problem for the adoption of Solaris which had finally begun to rise again. Recently Solaris 11 was released, but without the source, this was likewise a huge problem, but now also for enterprise customers who are using and paying for Solaris. DTrace have lost part of it's value for Solaris 11 compared to the now dead OpenSolaris.

A perhaps even large problem is that when Oracle closed Solaris, many, many of the core developers left Oracle. Several of them now works outside of Oracle contributing to illumos, but these changes can not be ported back into Solaris 11 as long as Oracle keeps the source closed.

Solaris 11 has features not available in illlumos, but I chose to use illumos instead of Solaris 11 since I have access to the source and I am not locked to one OS-distribution. Also the licensing for Solaris 11 does not allow me to use it for small deployments without buying a whole support contract. If I "upgrade" a zpool to use new features available only in Solaris 11 I will be unable to import the pool using the free ZFS implementation that in illumos based distributions such as OpenIndiana or Nexenta or other operating systems such as FreeBSD.

I think this is a terrible move by Oracle, not only are the alienating new customers, they are also locking out great engineers who have implemented revolutionary features into Solaris. As Bryan Cantrill pointed out in his LISA '11 speech, Oracle has not made any official announcement about what they have done to OpenSolaris or what their future plans for the source are, this is very troubling and ignorant.

Solaris 11 is a great OS but it being treated terribly by Oracle, Oracle seems to think that the best way to make a profit out of Solaris is to keep it closed for everyone else, I don't agree.

Saturday, December 10, 2011

Solaris history and illumos

A great presentation by Bryan Cantrill from LISA11 is now available. Besides describing the illumos project it is also a good insight in the history of Solaris and what happened after Oracle acquired Sun. A must read for anyone interested in illumos or Solaris in general.



Update: The video is now also available.

Saturday, December 3, 2011

LDOM supported by Oracle

Oracle VM for SPARC is finally supported by Oracle, you can now partition a system with with logical domains and license hardware available to the domain. This is something Oracle customers have been wanting for a while now and I must say I believed it would have been solved the day Oracle acquired Sun. Nevertheless it's now finally supported at the same time the first T4 systems are being delivered to customers, which must be considered good timing.

Server/Hardware Partitioning
Important change regarding OVM/SPARC and Licensing

Thursday, November 10, 2011

First impressions of Solaris 11 11/11

I have had a few hours to try the final Solaris 11 release, overall I think it is far more stable and polished than the previous "Early Adaptors" release. Besides the fact that I am unable to use semi-old SPARC gear to test the release since only the latest generations of hardware are supported I have found few real problem so far.

The new packaging system finally fixes what I believe have been the biggest problem for Solaris the last five years or so, it is now repository based and it is simple to install software and dependencies are automatically solved. No more hassle of downloading and installing multiple software packages from SunFreeware to resolve dependencies. This also makes packaging faster and safer, brining the whole system to a known level and always with a safe recover option since it is used in conjunction with ZFS clones.

Zones are a fantastic tool for security/workload separation and virtualization so it's good to see that so many enhancements have been done in this area. The perhaps most noticeable is of corse that they now also use the new IPS system for packages and that makes a vanilla zones very lightweight without the hassle of a sparse zone. NFS service can now finally be provided from inside a zone. There is a tight integration with the new crossbow network virtualization making is possible to limit bandwidth to zones, use DHCP in a zone without having a separate NIC and build internal networks between zones inside a single Solaris 11 instance.

Imagine the power and flexibility of an T4-4 with 256 CPU threads and 1TB of memory running 50 zones with several high bandwidth/low latency networks inside the machine with no latency or overhead caused by virtualization.

Unfortunately the X86 version with the graphical desktop seems to be somewhat unstable compared to the Express release, I think it's related to the upgrade of the X server. I have been unable to use my laptop with to displays with the final release.

Solaris 11 is however focused on usage in servers and it seems stable for that, I have only found one disturbing problem so far, sharing ZFS filesystems does not seem to work ( zfs set -o sharenfs=on), but you can share each individual filesystem with the share command. Sadly if you are evaluating this you will probably have to wait until next year when there is a full release of Solaris since no updates are provided without a service contract. If you work for Oracle this is something you might want to fix for everyone, or tell me what I'm doing wrong.

Solaris 11 have many other new features such as per-user encryption of home directories with ZFS crypto, a new mirror/raidz hybrid bloc allocator for ZFS, numerous security enhancements among other thinks. I have only named a few of the changes I will probably keep posting Solaris 11 stuff as I find something interesting that is not directly highlighted it Oracle own what's new documents.

Update: As pointed out in the comments sharing of NFS together with ZFS works a bit differently now. If you share an existing dataset you have to set the share property. However if you set the sharenfs property when creating the dataset it works as in previous versions of Solaris 11 Express, OpenSolaris etc. Move information available in the documentation here.

Wednesday, November 9, 2011

Solaris 11 released

Solaris 11 is available for download "SunOS Release 5.11 Version 11.0", based on build snv_175b.

There are of course many changes since Solaris 10, most of them have been available in the latest build of OpenSolaris but there are some new that are unique to the final release of Solaris 11.

Install images are available for download and works on all current SPARC machines which is the T and M-series. There are also images available for X86-based machines which also can be used in VirtualBox. Here is a quick reference for the brand new packaging system: IPS one liners.

I will post more detailed follow-up after I've had time to test it for more than a few hours.

Oracle Solaris 11 11/11 – What’s new
Download Oracle Solaris 11
Future features of Solaris 11