Analyzing Manycore Scaling with Simics

By Jakob Engblom

In my previous blog post on multicore scaling investigations with Simics, I tested a simple parallel program on a variety of machines. The scaling obtained was not particularly impressive, especially not on a 60-core target machine. In this post, we will use the Simics timeline view to look a bit closer at what is going on inside the target machines, in particular with respect to operating system scheduling of the target threads.

Before we look at the runs that indicated a lack of scaling, we should look at a well-behaved case to make sure we have something to compare to. For this, I brought up the 5-core heavy load experiment again and reran it. Some things in the target setup had changed compared to the last blog post, so the scaling of the length 100 line is a bit different (once again showing the sensitivity of this program to noise and initial conditions, just as it would be on physical hardware). If we take a line that scales nicely and compare it to the timeline view plotting the active threads, we can see that nice scaling does indeed correspond to good parallelism.

Qsp-multicore-scale 5core40 200 pkt len in timeline

In this plot, each processor core in the target has its own color, and we see which threads run when on which core. In the picture, we see the cases with two to five worker threads. For two and three threads, we have each thread using a single processor core for the duration of the program. For four threads, where the scaling is a bit less ideal, we can see that two threads share a single processor. Finally, for five threads, we can see that we use four target cores on average across the run, and that the OS decides to have two threads share a core. Thus, we can conclude that scaling in the overall throughput graph does indeed correspond to parallel execution (no great surprise).

Next, I looked at the behavior on the 60-core target for the thread counts where scaling was essentially flat. Here, we have a rather different picture from the above. At six worker threads, we see a schedule that only makes use of two cores, in a fairly erratic manner. It looks quite impressive, but it is not what you want from a scalable program.

Rule30 6 worker threads two cores

Things get even more interesting as we go and look at many cores. When we use 19 worker threads, we see a regular pattern develop in the execution:

Rule30 19 worker threads totally serial

This was quite surprising. Note how each thread gets to run in order, and how sometimes we get two threads running in parallel and sometimes just a single one. It appears that the handling of the shared lock on the work queue is mostly FIFO, where a thread gets a unit of work and then goes to the end of the queue. This gives rise to the regular pattern seen, as each thread gets to run in turn (with some exceptions, indicating that the system does suffer from a little noise). Clearly, we are not getting anywhere near the theoretical parallelism that this problem offers.
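To make the FIFO hypothesis above concrete, here is a minimal sketch of a model, not the actual test program. It assumes that each worker holds the shared lock for a fixed time while taking a unit of work, then processes that unit outside the lock, then rejoins the back of the lock queue. The function name and the parameters `lock_hold` and `work` are hypothetical, and the time values are arbitrary units chosen for illustration.

```python
from collections import deque


def simulate_fifo_queue(num_workers, units, lock_hold, work):
    """Model workers sharing one FIFO-granted lock on a work queue.

    Returns (makespan, average number of busy threads, grant order).
    All parameters are illustrative assumptions, not measured values.
    """
    # Each entry is (time the worker starts waiting for the lock, worker id).
    waiting = deque((0.0, w) for w in range(num_workers))
    lock_free_at = 0.0
    finish = 0.0
    order = []
    for _ in range(units):
        arrival, worker = waiting.popleft()
        start = max(arrival, lock_free_at)   # wait until the lock is free
        lock_free_at = start + lock_hold     # hold the lock to take one unit
        done = lock_free_at + work           # process the unit outside the lock
        order.append(worker)
        waiting.append((done, worker))       # rejoin at the back of the queue
        finish = max(finish, done)
    busy_time = units * (lock_hold + work)
    return finish, busy_time / finish, order


makespan, avg_busy, order = simulate_fifo_queue(19, 190, lock_hold=1.0, work=1.0)
print(avg_busy)
print(order[:19])
```

In this model, when the time spent under the lock is as long as the work itself, 19 workers run in strict round-robin order and the average number of busy threads stays just below two, which is similar in shape to the pattern in the timeline. Whether the real program behaves this way for the same reason would have to be confirmed on the target.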

All this indicates that the communication between threads needs to be fixed in order to increase performance. This is completely expected – as we expand the available hardware parallelism, software needs to be rewritten to minimize communication between threads and maximize independent execution.
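One common way to reduce communication of this kind is to let each thread take several units of work per lock acquisition. The sketch below illustrates that idea only; it is an assumption about what a rewrite could look like, not the change made to the program discussed here. The function `run_workers`, the `batch_size` parameter, and the squaring used as stand-in work are all hypothetical. Note that CPython threads do not execute Python code in parallel, so this shows the structure and the reduced number of lock acquisitions rather than an actual speedup.

```python
import threading
from collections import deque


def run_workers(items, num_workers, batch_size):
    """Process items with threads that take batch_size items per lock grab.

    Returns (results, number of lock acquisitions on the shared queue).
    """
    queue = deque(items)
    lock = threading.Lock()
    acquisitions = 0
    per_worker = [[] for _ in range(num_workers)]  # private result lists

    def worker(index):
        nonlocal acquisitions
        while True:
            with lock:  # the only shared state is touched under the lock
                acquisitions += 1
                count = min(batch_size, len(queue))
                batch = [queue.popleft() for _ in range(count)]
            if not batch:
                return  # queue is empty, this worker is done
            # Stand-in for real work, done without holding the lock.
            per_worker[index].extend(x * x for x in batch)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    results = [r for chunk in per_worker for r in chunk]
    return results, acquisitions


_, single = run_workers(range(1000), num_workers=4, batch_size=1)
_, batched = run_workers(range(1000), num_workers=4, batch_size=50)
print(single, batched)
```

With larger batches, the threads meet at the shared lock far less often, at the cost of coarser load balancing toward the end of the run.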

If we plot the behavior of our program as a speedup over the case with one worker thread, we can see that it actually does not get worse as we add threads; it just plateaus at a speedup between 3 and 4. Thus, this program in its current form would not benefit from a platform that is any wider than quad-core.
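A plateau like this can be roughly related to Amdahl's law, which caps speedup at the inverse of the serial fraction of the work. The sketch below is a simplified model and not something derived from the measurements in this post: lock contention does not behave exactly like a fixed serial fraction, and the value 0.28 is a hypothetical number picked only because its inverse falls between 3 and 4.

```python
def amdahl_speedup(workers, serial_fraction):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Hypothetical serial fraction, chosen so that the cap 1/0.28 lies between 3 and 4.
SERIAL_FRACTION = 0.28

for workers in (1, 2, 4, 8, 19, 60):
    print(workers, round(amdahl_speedup(workers, SERIAL_FRACTION), 2))
```

Under this model, adding workers never makes things slower, but the predicted speedup approaches the cap and stays there, however many cores the machine has.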

Qsp-multicore-60-load-10 speedup

So what does this give us?

We have determined that the program in its current incarnation, on a Linux OS, gains nothing from going beyond four worker threads. Thus, if we want to run this program on a wider target machine (either to run bigger workloads or to use a larger number of slower processors in order to save power), we first need to re-architect the software. Throwing hardware at the problem will have no positive effect.

We reached this conclusion by exploiting the configurability of a virtual platform to change the number of cores, and the insight a virtual platform provides to gather statistics and information about how the program executes without changing its behavior.

