By Jakob Engblom
A few years ago, I did a Simics demo where I tested the scalability of a multithreaded program as the target hardware went from two to four to eight cores. Unfortunately, I could not take it beyond that point, since the hardware platform that I used simply did not allow for more than eight cores. Now, with the Simics Quick Start Platforms (QSP), the situation is different. I picked up the demo again, and pushed it to sixty (60) cores with ease.
The QSP platform that comes with Simics scales out to 128 processors by default (and more if you are willing to do some modifications to the model). It also comes with a Linux image that works for 128 processors, providing the infrastructure for testing threaded programs. Armed with this, I compiled my original pthread-based Linux program for the QSP, and ported my old demo scripts to the QSP – taking the chance to generalize things and add some new bells and whistles to the setup in the process.
To understand what I am going to show here, you need some understanding about the program I am testing. The program uses a worker pool architecture, with a single data generator thread that pushes work units into a shared queue, and multiple worker threads pulling work units from this queue and computing on them. The amount of computation needed to process a unit of work is highly scalable and tune-able, it can easily be increased 100-fold from one run to another. This allows us to explore the trade-off between communication and computation in the program. In the graphs that you will see below, we scale one of the parameters for the workload from 100 to 1000, but there is also a hidden parameter inside the experiment, and this hidden parameter is multiplied by the visible parameter to produce the total amount of computation needed for each unit of work by a worker.
If you know anything about parallel programming, that single queue probably sounds like a bad design – and bear with me, we will see how this works out in the experiments.
In my experiments, I used a five-core, a sixty-core, and later a twenty-core machine. I booted them, loaded the test application over the simicsfs host-access system, and saved a checkpoint of the prepared machines. I then created a set of scripts that started with a checkpoint, and ran experiments with various parameters:
- Which machine to run on (five-core, twenty-core, or sixty-core)
- The thread counts to test
- The computational workload per unit of work
The experimental scripts use Simics OS awareness to detect the start and end of each run of the test program, and computes the performance of each run as the number of work units (packets) processed per (virtual) second. It then plots the values in a graph, with the number of worker threads on the horizontal axis and the performance in packets per second on the vertical axis. With ideal scaling, you would expect a nice straight line that ends up being N times the initial value for Nworker threads being used.
The first experiment used the five-core machine, testing from one to five worker threads. Note that this means that the program contains at most six active threads – the five workers plus the data generator thread. The resulting graph looked like this:
A bit jumpy, but at least things are improving from left to right. The jumpiness can be attributed to two factors. First, we only do a single run for each data point. Second, we are using a very light load, which increases the program variability as overheads and OS fluctuations dominate over the core program functionality. Still, what clearly can see here is that the Simics simulation is not behaving like an ideal system. Rather, it behaves like a real machine would, showing variations induced by the environment in which a program runs. As a point of comparison, I once built an AMP (separate OS on each core) setup for this program, and there the scaling was totally linear and ideal since there was no threading package, no synchronization of threads needed, and no OS parallelism involved.
The non-ideality here is really a feature, indicating that I have to improve my experiments. The improved experiment increases the computation per packet by a factor of 40 (using the hidden parameter), to emphasize the scalability of the core algorithms over the system noise. The result looks like this:
This does indeed result in much smoother graphs. The crossing between the 1000 and 800 lines does not matter that much, they are pretty close to each other in load. Overall, it seems that we scale about 3x to 4x when going out to five cores, which is pretty decent.
But what happens if we try to push this program to the limit? To try this, I used the QSP with 60 cores (a nice, round number). For most hardware that I know, going to 60 cores is either infeasible (strict limit on how many cores an SoC or interrupt architecture can support) or prohibitively expensive (exotic multi-socket motherboards with highest-end Xeons). In Simics, we just turn a virtual knob. The result looks like this, using a compute load per packet that is ten times the base load, so we should expect pretty smooth scaling (I skipped the 100 packet length in this experiment as it does not add much information).
However, this plot indicates that my program is not that scalable at all once we move beyond the simple world of a few cores. What we seem to get is about a 3x speedup from 60 worker threads on 60 cores. We get the best performance around five to six threads, and then goes down and flattens out. This program clearly needs to be rewritten to work for manycore machines – which is what we wanted to test.
However, maybe not everything can be blamed on poor coding in the program. To investigate the impact of the operating system itself, I brought up a 20-core QSP and reran the experiments. This did indeed shine some new light on the behavior of the program, as you can see:
This plot does not show the step effects of the 60-core run, and much less jumpiness (except the known jumpy length 100 line). Note that this is the same program, with the same number of threads as above. But the hardware size is different, and therefore the OS behavior can be different for operations that involve scheduling and sychronization between cores. It indicates that make a scalable program, I might have to both update my own code, as well as check if the operating system itself can be tuned in some way to work better with my workload.
To conclude, this blog post has shown how flexible virtual hardware can help you understand the scalability of software to multiple cores. Having a flexible easily-configurable virtual platform that is unconstrained by the limits of physical hardware designs is a very useful tool to explore the fundamental behavior of software – both for operating system code and user-level code.
Final note: If you wonder what limited the original demo, it was the use of an eight-bit field at the top of a 32-bit word to enable and disable processors. After we used these eight bits, there simply was no way to add a ninth bit and thus a ninth core, without redesigning the hardware and the BSP. At the time, that felt like serious overkill for what we were trying to accomplish.