Monday, November 3, 2008

Follow the white rabbit

Yesterday i encountered a strange problem which was very entertaining to debug and dtrace came to the rescue as as always. I had done a fresh install of Solaris Nevada build 101 on a AMD64 machine at home. The installation went fine and i started to configure on the host, but after a while i begun to get error messages like: "couldn't set locale correctly". I checked the SYSV package containing my locale, en_US.8859-1:

# pkgchk SUNWlang-en-extra
ERROR: /usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
pathname does not exist


The installation logs showed that the package had been installed correctly. I tried to reinstall the package which temporarily fixed the problem but after 10 minutes or so it the file was missing again. I repeated this twice to see if there was any time pattern, but it gave me no clue to why the files was removed. I decided to use both BSM auditing and dtrace to see what might be causing the removal of the file. I waited and waited, but the file stayed there so i started to use the system as ordinary, that somehow triggered the removal of the file!?

A dtrace script created by Chris Gerhard showed me what caused the removal of the file:

# ./delete.d
dtrace: script './trace.d' matched 2 probes
CPU ID FUNCTION:NAME
0 75683 unlink:return man prctl
UID 0 PPID 11244 sh


Audit also logged the unlink of the file, following the trail it could have shoved me that man was responsible:

header,242,2,unlink(2),,ollespappa,2008-11-03 02:52:01.096 +01:00
path,/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3
attribute,100555,root,bin,65538,320199,18446754073709451615
subject,arne,root,root,root,root,15984,2799982011,1330 5632 xx.xx.xx.xx
return,success,0


So, man(1) unlinked the file, that did not make any sense at all. A dtrace one liner gave me more information:

# dtrace -n 'pid$target:libc:unlink:entry { printf("unlink(%s)\n",copyinstr(arg0)); ustack(); }' -c 'man ls'
CPU ID FUNCTION:NAME
1 76165 unlink:entry unlink(/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3)
libc.so.1`unlink
man`format+0xd54
man`searchdir+0x14d
man`mandir+0x194
man`manual+0x1c1
man`main+0x5d4
man`_start+0x7d


The unlink was called from the format function of man, Looking at the source I identified which line was responsible for the unlink:
2754 (void) unlink(tmpdir);

After some more investigation I realized that the argument string to the unlink command might not get initialized, it then points out in space. For this system it pointed at a string which represented the path to a file which could be unlinked. This occurs only if the manual page is in another format than SGML since the variables are initialized inside an SGML-related if-statement :

2612 if (sgml_flag == 1) {
2613 if (check_flag == 0) {
2614 strcpy(tmpdir, "/tmp/sman_XXXXXX");


This bug seems to have been in Solaris for ages, but the uninitialized string might never have come to point to anything useful to unlink, it could perhaps be the switch to SunStudio 12 in snv100 that triggered this behavior.

This is now CR 6767074.

No comments: