Tuesday, February 16, 2010

Thread performance on modern SPARC processors

All currently sold and supported Sun systems have some kind of multithreading on the chip. The T2 (Niagara 2) uses Chip Multi Threading (CMT), which currently has 8 strands (hardware threads) per core, four strands per integer execution unit. The SPARC64-VII uses Vertical Multi Threading (VMT) with two strands per core.

A strand cannot execute anything by itself; it is only the set of registers a software thread needs in order to execute on a CPU. But all strands can be executed on the CPU without the cost of a context switch: every strand can have a thread bound to it, and the core can let them execute on alternating instructions without overhead.

One reason I am writing this is that I have several times encountered the misunderstanding that processors with multiple strands per core can only run threads at the clock frequency divided by the number of strands, which would severely impact single-thread performance. For example, a 1.6 GHz T2+ could only run a thread at 400 MHz, or a thread on a 2.52 GHz SPARC64-VII, which has two strands per core, could only run at 1.26 GHz.

This is not true. If only one thread is running on the strands of a core, it will get all the cycles it can consume. CMT is based on the assumption that threads spend some of their execution time just waiting for the slower memory; during this time it is more efficient to execute another thread, given that no context switch is needed, which is why each strand has its own registers.

But what happens if we execute several threads on one CMT CPU? Each thread will not be able to run at full speed, but that would more or less be the case anyway: on a single hardware thread, the Solaris scheduler would switch the threads on and off the CPU. Sure, in the latter case a thread runs at full speed for short fractions of time, but it is stalled and moved off the CPU if it has not finished. There are of course some workloads where this is better, for example if a transaction or job can be finished within this short time. In the first case we are able to take advantage of the CMT design: when one of the threads is waiting for memory, the others can possibly execute instructions instead of the cycles being wasted.
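The cycle sharing described above can be illustrated with a toy model. This is only a sketch under made-up assumptions: one pipeline that issues one instruction per cycle, round-robin among the strands that are ready, and threads that alternate between 3 instructions and a 5-cycle memory stall. The function name and all numbers are hypothetical and are not measured T2 behaviour.

```python
def simulate(n_threads, compute=3, stall=5, cycles=8000):
    """Toy model of one integer pipeline shared by hardware strands.

    Each thread repeats: `compute` instructions, then a memory stall of
    `stall` cycles. Every cycle the pipeline issues one instruction from
    a ready strand, chosen round-robin. Returns instructions per thread.
    """
    remaining = [compute] * n_threads  # instructions left before next stall
    ready_at = [0] * n_threads         # first cycle the strand can issue again
    done = [0] * n_threads
    nxt = 0                            # strand to try first this cycle
    for cycle in range(cycles):
        for i in range(n_threads):
            t = (nxt + i) % n_threads
            if ready_at[t] <= cycle:
                done[t] += 1
                remaining[t] -= 1
                if remaining[t] == 0:  # thread now waits for memory
                    remaining[t] = compute
                    ready_at[t] = cycle + 1 + stall
                nxt = (t + 1) % n_threads
                break                  # only one instruction per cycle
    return done


alone = simulate(1)
shared = simulate(4)
print("one thread alone:      ", alone)
print("mistaken 1/4 share:    ", 8000 // 4)
print("four threads, each:    ", shared)
print("four threads, in total:", sum(shared))
```

In this model the lone thread is limited only by its own memory stalls, not by a fixed share of the cycles. With four threads each one gets fewer instructions through, but the pipeline as a whole does more work because stalls of one thread are filled by the others.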

Solaris normally handles the scheduling of threads on different strands quite well: if there are more CPU resources available than the running threads can consume, the threads are spread out on strands on different physical cores. There is then no performance impact on the threads compared to running on a system with one thread per core.
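To see which strands (virtual CPUs) share a physical core, one can group the CPU ids by core id. The sketch below assumes input in the parseable `kstat -p` style, `module:instance:name:statistic<TAB>value`, for the `core_id` statistic of `cpu_info`. The sample text is invented for illustration and not captured from a real system, and the helper name is hypothetical.

```python
from collections import defaultdict


def strands_by_core(kstat_text):
    """Group virtual CPU ids by core id from kstat -p style lines."""
    cores = defaultdict(list)
    for line in kstat_text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(None, 1)
        module, instance, _name, stat = key.split(":")
        if module == "cpu_info" and stat == "core_id":
            cores[int(value)].append(int(instance))
    return {core: sorted(cpus) for core, cpus in cores.items()}


# Made-up sample: four virtual CPUs on two cores, two strands each.
SAMPLE = """cpu_info:0:cpu_info0:core_id\t0
cpu_info:1:cpu_info1:core_id\t0
cpu_info:2:cpu_info2:core_id\t2
cpu_info:3:cpu_info3:core_id\t2
"""

print(strands_by_core(SAMPLE))  # {0: [0, 1], 2: [2, 3]}
```

Two busy threads sitting on CPU ids that map to different cores do not compete for the same pipeline; two on the same core do.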

One thing that has probably contributed to this misconception is that the Niagara design (T2, T2+, ...) is built from simpler and weaker individual cores (non-superscalar) that also run at a lower frequency. This was done to allow more strands per core, more cores per chip, and lower power consumption. So while the claim of a divided clock frequency is not true, the Niagara class of processors does have weaker single-thread performance, just not for that reason. That said, for many throughput workloads and for encryption, the Niagara processors can excel and perform better than any SPARC64 processor. So as always, it depends on your workload, but in general application servers and web servers work well with CMT, as do many database or otherwise highly threaded workloads.

A quick real-life test on an M3000 (2 strands per core), using a simple single-threaded workload (md5). We see that restricting the core to a single strand only has a negative impact on the execution time, and that a single job runs just as fast with both strands online. If each strand really had its own dedicated cycles, adding another job that runs on the second strand would have no big impact on the first job; instead both jobs take longer, because they share the cycles of the core.
1 core, 2 strands, 1 job: 6.48s
1 core, 1 strand, 1 job: 6.52s
1 core, 1 strand, 2 jobs: 10.33s, 13.07s
1 core, 2 strands, 2 jobs: 10.85s, 10.92s
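The timings above can be checked with a little arithmetic. This is only a sketch using the four measurements listed. The variable names are mine, and "back to back" simply means running the same job twice in a row on an otherwise idle core.

```python
# Measured wall-clock times in seconds, copied from the list above.
one_job_two_strands = 6.48
one_job_one_strand = 6.52
two_jobs_two_strands = (10.85, 10.92)

# If a strand only got half of the core's cycles, this ratio would be near 2.
ratio = one_job_two_strands / one_job_one_strand

# Time to finish two jobs: one after the other versus one per strand.
back_to_back = 2 * one_job_two_strands
side_by_side = max(two_jobs_two_strands)
gain = back_to_back / side_by_side

print(f"1 job, 2 strands vs 1 strand: {ratio:.2f}x")
print(f"2 jobs back to back:  {back_to_back:.2f}s")
print(f"2 jobs on 2 strands:  {side_by_side:.2f}s")
print(f"throughput gain:      {gain:.2f}x")
```

The single job is not slowed down by the second strand being online, and running one job per strand finishes both sooner than running them one after the other, even though each individual job takes longer.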
Here is an example where we can see that a single thread is consuming more than one fourth of the cycles available on a T2 processor core:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 137 241 32 18 18 3 17 1 12385 74 26 0 0
Core Utilization for Integer pipeline
Core,Int-pipe %Usr %Sys %Usr+Sys
------------- ----- ----- --------
0,0 56.07 4.37 60.44
And two threads consuming 80 percent of the cycles:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 76 260 43 263 73 10 33 1 9345 80 20 0 0
1 0 0 17 74 0 284 74 9 30 0 8453 79 21 0 0
Core Utilization for Integer pipeline
Core,Int-pipe %Usr %Sys %Usr+Sys
------------- ----- ----- --------
0,0 76.42 5.15 81.57

Worth a read:
The UltraSPARC T2 Processor and the Solaris Operating System
SPARC Enterprise architecture
Niagara 2 opens the floodgates
Glenn Fawcett's Weblog, guidelines for Oracle on CMT.

A utility to show the core utilization on CMT/VMT processors:
