Tuesday, March 15, 2011

Sort memory capping out

I had a curious problem today: on an otherwise lightly loaded machine with plenty of free memory, logging in as the Oracle user took tens of seconds to complete. Since it was the sourcing of a shell script that was stalling, I used set -x to identify where the time was spent:

$ set -x
$ . /path/to/oraenv
+ nawk { print $2 }
+ sort

The whole delay was spent in sort(1), for no obvious reason. truss(1) shows the offending system call:
24.8167 sysconfig(_CONFIG_AVPHYS_PAGES) = 1500736
0.0011 sysconfig(_CONFIG_PAGESIZE) = 8192
0.0008 getpid() = 24178 [24177]
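
The delay is easy to reproduce outside the login scripts. Here is a minimal sketch that just times the same call sort makes; sysconf(_SC_AVPHYS_PAGES) is the libc side of the sysconfig(_CONFIG_AVPHYS_PAGES) call seen above. Run it once in the capped zone and once in the global zone to compare:

/*
 * Minimal reproducer: time sysconf(_SC_AVPHYS_PAGES), the libc call
 * behind the slow sysconfig(_CONFIG_AVPHYS_PAGES) seen in truss.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int
main(void)
{
        struct timeval t0, t1;
        long pages;

        (void) gettimeofday(&t0, NULL);
        pages = sysconf(_SC_AVPHYS_PAGES);
        (void) gettimeofday(&t1, NULL);

        (void) printf("_SC_AVPHYS_PAGES = %ld, took %.3f seconds\n",
            pages, (t1.tv_sec - t0.tv_sec) +
            (t1.tv_usec - t0.tv_usec) / 1e6);
        return (0);
}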

It took 24 seconds to get the number of available memory pages, an operation that worked fine in the global zone. The sysconfig source shows that the call behaves very differently in a global zone and in a memory-capped zone:

if (!INGLOBALZONE(curproc) &&
    curproc->p_zone->zone_phys_mcap != 0) {
        pgcnt_t cap, rss, free;
        vmusage_t in_use;
        size_t cnt = 1;

        cap = btop(curproc->p_zone->zone_phys_mcap);
        if (cap > physinstalled)
                return (freemem);

        if (vm_getusage(VMUSAGE_ZONE, 1, &in_use, &cnt,
            FKIOCTL) != 0)

If a physical memory cap is set for the zone and it is smaller than the amount of physical memory in the machine, vm_getusage will be called. It in turn looks at every memory segment of every process in the zone, which can take quite a while if the zone is a heavy allocator of memory; in this case the zone was using about 50GB. This is not something you want to do every time a shell script calls $(sort). If you have ever used prstat -Z with large local zones you have seen the effects of this: it can take a long time.

Comment from the source:
"This file implements the getvmusage() private system call.
getvmusage() counts the amount of resident memory pages and swap
reserved by the specified process collective. A "process collective" is
the set of processes owned by a particular, zone, project, task, or user."
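
To get a feel for how expensive that scan is, you can trigger it directly from userland. This is only a sketch: getvmusage() is a private interface, and both the prototype and the vmusage_t layout here are taken from <sys/vm_usage.h> in the OpenSolaris source, so they may differ between releases:

/*
 * Sketch only: time a fresh zone RSS scan through the private
 * getvmusage() interface. Assumes <sys/vm_usage.h> is installed;
 * being a private call, it may change between releases.
 */
#include <stdio.h>
#include <time.h>
#include <sys/vm_usage.h>

int
main(void)
{
        vmusage_t usage;
        size_t nres = 1;
        time_t t0 = time(NULL);

        /* age 0 forces a fresh scan instead of reusing cached results */
        if (getvmusage(VMUSAGE_ZONE, 0, &usage, &nres) != 0) {
                perror("getvmusage");
                return (1);
        }
        (void) printf("vmu_rss_all = %llu, scan took ~%ld seconds\n",
            (unsigned long long)usage.vmu_rss_all,
            (long)(time(NULL) - t0));
        return (0);
}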

The source of the problem in sort was in utility.c:
size_t phys_total = sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);
It seems like no other utility in Solaris uses sysconf(_SC_PHYS_PAGES), which is why we saw no other problems.
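
A quick, unscientific way to check that, assuming you have a copy of the OpenSolaris source tree; if the claim holds, only sort's utility.c should turn up:

$ find usr/src/cmd -name '*.c' | xargs grep -l _SC_PHYS_PAGES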

The short-term solution was to disable the physical memory cap for these zones:
# rcapadm -z zone01 -m 0
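
rcapstat(1) can be used to confirm that the zone no longer has a cap:

# rcapstat -z 5 1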


Brian Utterback said...

Something seems wrong here. Having a memory cap on the zone implies that every time the system wants to allocate a new physical page to a zone, it must check if the cap has been reached. It cannot possibly do the call used here, since a 24 second delay on each page allocation would be untenable. There has to be a counter somewhere that keeps track already.

Henkis said...

Yes, something is wrong, we don't have VM2.0 yet ;)

What you say is true for capping of the whole virtual address space for a zone (or project), but the physical capping is asynchronous and does no checks when allocating memory; rcapd can later push memory out to swap if the limit has been exceeded. Solaris has no counter for how much of a zone's memory is resident, it has to be calculated, which is usually not a problem for user applications, except when they make this call and trigger the expensive computation.

SUNWfrk said...

We also saw this bug on our production systems. On Solaris 10 update 9 we don't have this problem anymore.