Testing Manycore Scaling with Simics

A few years ago, I did a Simics demo where I tested the
scalability of a multithreaded program as the target hardware went from two to
four to eight cores. Unfortunately, I could not take it beyond that point,
since the hardware platform that I used simply did not allow for more than
eight cores. Now, with the Simics Quick Start Platforms (QSP), the situation is
different. I picked up the demo again, and pushed it to sixty (60) cores
with ease.

The QSP that comes with Simics scales out to 128
processors by default (and further if you are willing to modify the model).
It also comes with a Linux image that supports 128 processors, providing the
infrastructure needed for testing threaded programs. Armed with this, I compiled my original
pthread-based Linux program for the QSP and ported my old demo scripts to it –
taking the chance to generalize things and add some new bells and
whistles to the setup in the process.

To understand what I am going to show here, you need to know a bit about the program being tested. The program uses a worker-pool architecture, with a single
data-generator thread that pushes work units into a shared queue, and multiple worker threads pulling work units from this queue and computing on them. The amount of computation needed to process a unit of
work is highly tunable; it can easily be increased 100-fold from one run to
another. This allows us to explore the trade-off between communication and
computation in the program. In the graphs below, we scale one of the
workload parameters from 100 to 1000, but there is also a hidden
parameter inside the experiment, and this hidden parameter is multiplied by the visible parameter to produce the total amount of computation a worker performs for each unit of work.

If
you know anything about parallel programming, that single queue probably sounds
like a bad design – and bear with me, we will see how this works out in the
experiments.
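
To make the structure concrete, here is a minimal pthread sketch of that worker-pool pattern. This is my own illustration and not the actual demo program: the queue size, the hidden load factor, and the busy-loop compute kernel are placeholder assumptions.

    /* Minimal worker-pool sketch: one generator thread feeds a single
     * mutex-protected queue, N worker threads drain it and burn CPU in
     * proportion to a tunable load.  Build with: gcc -O2 -pthread pool.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define QUEUE_CAP     1024
    #define HIDDEN_FACTOR 10          /* the "hidden" multiplier on the load  */
    #define MAX_WORKERS   64

    static int queue[QUEUE_CAP];
    static int q_head, q_tail, q_count;
    static int done;                  /* generator has produced everything    */
    static pthread_mutex_t q_lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;

    static long visible_load = 100;   /* swept from 100 to 1000 in the post   */
    static long total_units  = 20000; /* work units generated per run         */

    /* Data generator: push work units into the shared queue. */
    static void *generator(void *arg) {
        (void)arg;
        for (long i = 0; i < total_units; i++) {
            pthread_mutex_lock(&q_lock);
            while (q_count == QUEUE_CAP)
                pthread_cond_wait(&q_not_full, &q_lock);
            queue[q_tail] = (int)(i & 0xff);
            q_tail = (q_tail + 1) % QUEUE_CAP;
            q_count++;
            pthread_cond_signal(&q_not_empty);
            pthread_mutex_unlock(&q_lock);
        }
        pthread_mutex_lock(&q_lock);
        done = 1;
        pthread_cond_broadcast(&q_not_empty);
        pthread_mutex_unlock(&q_lock);
        return NULL;
    }

    /* Worker: pop a unit, then compute on it for visible * hidden steps. */
    static void *worker(void *arg) {
        volatile long sink = 0;
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&q_lock);
            while (q_count == 0 && !done)
                pthread_cond_wait(&q_not_empty, &q_lock);
            if (q_count == 0 && done) {
                pthread_mutex_unlock(&q_lock);
                break;
            }
            int unit = queue[q_head];
            q_head = (q_head + 1) % QUEUE_CAP;
            q_count--;
            pthread_cond_signal(&q_not_full);
            pthread_mutex_unlock(&q_lock);

            for (long n = 0; n < visible_load * HIDDEN_FACTOR; n++)
                sink += unit * n;     /* stand-in for the real computation    */
        }
        return NULL;
    }

    int main(int argc, char **argv) {
        int nworkers = (argc > 1) ? atoi(argv[1]) : 4;
        if (argc > 2)
            visible_load = atol(argv[2]);
        if (nworkers > MAX_WORKERS)
            nworkers = MAX_WORKERS;

        pthread_t gen, w[MAX_WORKERS];
        pthread_create(&gen, NULL, generator, NULL);
        for (int i = 0; i < nworkers; i++)
            pthread_create(&w[i], NULL, worker, NULL);

        pthread_join(gen, NULL);
        for (int i = 0; i < nworkers; i++)
            pthread_join(w[i], NULL);
        printf("processed %ld units with %d workers at load %ld\n",
               total_units, nworkers, visible_load);
        return 0;
    }

Every push and pop goes through the single q_lock mutex, which is exactly the kind of shared point whose cost the experiments below expose.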

In my experiments, I used a five-core, a sixty-core, and
later a twenty-core machine. I booted them, loaded the test application over the simicsfs
host-access system, and saved a checkpoint of each prepared machine. I then
created a set of scripts that start from a checkpoint and run experiments
with various parameters:

  • Which machine to run on (five-core,
    twenty-core, or sixty-core)
  • The thread counts to test
  • The computational workload per unit of work

The experiment scripts use Simics OS awareness to detect
the start and end of each run of the test program, and compute the performance
of each run as the number of work units (packets) processed per (virtual) second.
They then plot the values in a graph, with the number of worker threads on the
horizontal axis and the performance in packets per second on the vertical axis. With ideal scaling, you would expect a nice straight line that ends up at N times the initial value when N worker threads are used.
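
To spell out the metric: the scripts do this calculation on the host side, driven by the OS-awareness events and virtual time stamps, but the arithmetic boils down to the following stand-alone sketch (the command-line interface here is purely for illustration and not part of the demo).

    /* Compute packets per (virtual) second for a run, and compare it to the
     * ideal linear-scaling value of N times the one-worker throughput. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 5) {
            fprintf(stderr,
                    "usage: %s <packets> <virtual-seconds> <workers> <one-worker-throughput>\n",
                    argv[0]);
            return 1;
        }
        long   packets = atol(argv[1]);  /* work units processed in the run    */
        double seconds = atof(argv[2]);  /* elapsed virtual time of the run    */
        int    workers = atoi(argv[3]);  /* worker threads used in the run     */
        double base    = atof(argv[4]);  /* measured packets/s with one worker */

        double throughput = (double)packets / seconds;  /* the plotted metric  */
        double ideal      = workers * base;             /* perfect scaling     */
        printf("throughput = %.1f packets/s, ideal for %d workers = %.1f (%.0f%% of ideal)\n",
               throughput, workers, ideal, 100.0 * throughput / ideal);
        return 0;
    }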

The first experiment used the five-core machine,
testing from one to five worker threads. Note that this means that the program
contains at most six active threads – the five workers plus the data generator
thread. The resulting graph looked like this:

[Figure: Qsp-multicore-5-load-1]

A bit jumpy, but at least things are improving from left to right. The jumpiness can be attributed to two
factors. First, we only do a single run for each data point. Second, we are
using a very light load, which increases variability as overheads and
OS fluctuations dominate over the core program work. Still, what we can clearly see here is that the Simics simulation does not behave like an ideal system. Rather, it behaves like a real machine would, showing variations induced by the environment in which a program runs. As a point of comparison, I once built an AMP (separate OS on each core) setup for this program, and there the scaling was totally linear and
ideal, since there was no threading package, no synchronization of threads, and no OS parallelism involved.

The non-ideality here is really a feature, indicating that
I have to improve my experiments. The improved experiment increases the computation per packet
by a factor of 40 (using the hidden parameter), to emphasize the scalability of the core algorithms over the
system noise. The result looks like this:

[Figure: Qsp-multicore-5-load-40]

This does indeed result in much smoother graphs. The crossing between the 1000
and 800 lines does not matter much; they are pretty close to
each other in load. Overall, it seems that we scale about 3x to 4x when
going out to five cores, which is pretty decent.

But what happens if we try to push this program to the
limit? To try this, I used the QSP with 60 cores (a nice, round number). For most
hardware that I know of, going to 60 cores is either infeasible (there is a strict limit on
how many cores an SoC or interrupt architecture can support) or prohibitively
expensive (exotic multi-socket motherboards with highest-end Xeons). In Simics,
we just turn a virtual knob. The result looks like this, using a compute load per packet that is ten times the base load, so we should expect pretty smooth scaling. (I skipped the length-100 workload in this
experiment, as it does not add much information.)

[Figure: Qsp-multicore-60-load-10]

However, this plot indicates that my program is
not all that scalable once we move beyond the simple world of a few cores. What we seem to
get is about a 3x speedup from 60 worker threads on 60 cores. We get the best
performance around five to six threads, after which performance drops and flattens out. This
program clearly needs to be rewritten to work for manycore machines – which is
what we wanted to test.

That said, maybe not everything can be blamed on poor coding
in the program. To investigate the impact of the operating system itself, I
brought up a 20-core QSP and reran the experiments. This did indeed shed some
new light on the behavior of the program, as you can see:

[Figure: Qsp-multicore-20-load-10]

This plot does not show the step
effects of the 60-core run, and it is much less jumpy (apart from the already-jumpy
length-100 line). Note that this is the same program, with the same number of
threads as above. But the hardware size is different, and therefore the OS
behavior can be different for operations that involve scheduling and synchronization between cores. It indicates that to make a scalable program, I might have to both update my own code and check whether the operating system itself can be tuned in some way to work better with my workload.

To conclude, this blog post has shown how flexible virtual
hardware can help you understand the scalability of software across multiple cores.
Having
an easily configured virtual platform that is unconstrained by the
limits of physical hardware designs is a very useful tool for exploring the
fundamental behavior of software – both operating system code and user-level code.

Final note: If you wonder what limited the original demo, it was the
use of an eight-bit field at the top of a 32-bit word to enable and disable
processors. Once those eight bits were used, there was simply no way to add a
ninth bit, and thus a ninth core, without redesigning the hardware and the BSP.
At the time, that felt like serious overkill for what we were trying to
accomplish.
