I've recently encountered some strange ZFS behavior on my OpenSolaris laptop. It was installed about a year ago with OpenSolaris 2008.11, I have since then upgraded to about every developer release available. Every time it's been upgraded lots of updated packages are downloaded. The Image Packaging System caches downloaded data in a directory that can become quite large over time. In my case the "/var/pkg/download" directory had half a million files in it consuming 6.6 gigabytes of disk space.
Traversing trough all these files using du(1B) took about 90 minutes, a terrible long time even for half a million files. Performing the same operation once more took about the same time, which raised two questions. First, why is it so terribly slow to begin with? Second, why doesn't the ARC cache the metadata so that the second run is much faster?
Looking at the activity on the machine the CPU was almost idle but the disk was close to 100 percent busy for the whole run. Arcstat reported that the space used by the ARC was only 1/3 of the target.
ZFS can get problems with fragmentation if a pool is allowed to get close to full. This pool had been upgraded 12 times the last year, every upgrade creates a clone of the filesystem and performs a lot of updates to the /var/pkg/download directory structure. Fragmentation could explain the extremely slow initial run, but the ARC should cache the data for a fast second pass.
Replicating the directory to another machine running the very same OSOL development release(b124) and doing the same test performs much better:
Initial runtime: source ~ 90m repl ~3m
Second runtime: source ~90m repl ~15s
Reads/s initial: source: ~ 200 repl: 5-6K
Reads/s second run: source: ~200 repl: 46k
If we presume that we are correct about the fragmentation problem, all data has now been rewritten at the same time to a pool with plenty of free space. This can explain why the initial pass is now much faster. But why does the ARC speed up the second run on this machine? Both machines are of the same class (~2GHz x86, 2.5" SATA boot disk) but the second machine has more memory. But this shouldn't matter since we have plenty of room left in the ARC even on the slow machine? Some digging shows that there is a limit for the amount of metadata cached in the ARC.
# echo "::arc" | mdb -k |grep arc_meta
arc_meta_used = 854 MB
arc_meta_limit = 752 MB
arc_meta_max = 855 MB
This is something to look into. What's happening is that the ARC is of no use at all in this situation, it is based on Least Recently Used(LRU) and most frequently used lists. Everything under this directory is read the same number of times each pass and in the same order, filling the ARC pushing out entries before they can be used. The arc_metadata_max is set to 1/4 of the ARC which is to small on this 4GB system, lets try raising the limit to 1GB and run the tests again:
# echo "arc_meta_limit/Z 0x4000000" | mdb -kw
Traversing the directory now takes about 10 minutes after the initial run, better but still terribly slow for cached data and the machine is still using the disks extensively. This is caused by access time updates on all files and directories, and remember that the the filesystem is terribly fragmented which of course is a problem for this operation also. We can turn off the atime updates for a filesystem in ZFS:
# zfs set atime=off rpool/ROOT/opensolaris-12
Now, a second pass over all the data takes well under a minute, the ARC problem is solved and the fragmentation problem can be fixed by making some room in the pool and copying the directory so that all data is rewritten. Note that this data could also be removed without any problems, more on this here.
There is no defragmentation tool or function in ZFS (yet? It would depend on bp-rewrite as everything else) and for most of the time it's not needed. ZFS is a Copy On Write(COW) filesystem that is fragmented by design and should deal with it in a good way. It works perfectly fine for most of the time, i've never had any issues with fragmentation on ZFS before. In the future I will make sure to keep some more free space in my pools to minimize risks of something similar happening again.
I plan to write more in detail about the fragmentation part of this issue in a later post, stay tuned.
Talks I have given
2 weeks ago