Simics 5 Multicore Accelerator - Boosting Performance, Adding Parallelism

One of the headline features of Simics 5 is Multicore Accelerator. With Multicore Accelerator, we can use a multicore host to simulate a tightly coupled multicore target. Since the launch of Simics 4.0 in 2008, Simics has been able to simulate multiple discrete target machines or racks on multiple host cores, and Simics 5 takes this to the logical next step of splitting apart the boards and SoCs themselves.

The logical progression of execution methods is illustrated in the picture below (and yes, this picture was also shown in the original announcement of Simics 5):

[Figure: the logical progression of Simics execution methods, from serial execution through multimachine acceleration to multicore acceleration]

As we can see, the key is splitting up the target system into progressively smaller pieces in order to get through the most computationally intense parts of the workload faster. Multimachine accelerator works very well in many cases, but if a single heavily loaded manycore SoC is on the critical path, it essentially falls back to serial execution. With multicore accelerator, that workload can now be split across multiple host threads, moving the execution ahead faster. It is a very simple idea, but one that has been fairly difficult to implement efficiently and correctly.

It is worth pointing out that even though a multicore host is a key enabler for Multicore Accelerator, we are really running Simics on multiple host threads in a worker-pool arrangement. There is no static assignment of target cores to host cores, but rather dynamic scheduling of computational work onto the available host threads. The system has quite a few layers to it, as the picture below illustrates:

[Figure: the layers of Multicore Accelerator – target cores generate simulation work, the Simics dynamic scheduler maps the work onto host threads, and the host operating system runs the threads on hardware cores and hyperthreads]

Note how the target cores generate simulation work that can vary greatly in size depending on their current load, and how this work is then scheduled onto host threads by the Simics dynamic scheduling mechanism. The host threads are in turn executed on hardware cores and hyperthreads by the host operating system.
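To make the scheduling idea concrete, here is a minimal Python sketch of the worker-pool pattern: variable-sized work items generated by target cores are dynamically pulled by a fixed set of host threads. This is my own illustration of the concept, not Simics code, and because of Python's global interpreter lock it demonstrates the scheduling structure rather than actual parallel speedup:

```python
import queue
import threading

NUM_HOST_THREADS = 4

def simulate_quantum(core_id, load):
    # Stand-in for executing one time quantum of one target core.
    sum(range(load * 10000))
    print("target core %d: simulated %d units of work" % (core_id, load))

# Work items are (target_core_id, amount_of_work); the amount varies
# with how heavily loaded each target core currently is.
work_queue = queue.Queue()
for core_id, load in enumerate([100, 5, 60, 5, 80, 10, 90, 20]):
    work_queue.put((core_id, load))

def host_thread():
    # Each host thread dynamically pulls whatever work is available;
    # there is no static assignment of target cores to host threads.
    while True:
        try:
            core_id, load = work_queue.get_nowait()
        except queue.Empty:
            return
        simulate_quantum(core_id, load)

threads = [threading.Thread(target=host_thread)
           for _ in range(NUM_HOST_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the queue-based design is that a heavily loaded target core simply contributes bigger work items, and whichever host thread is free picks them up, keeping all host threads busy.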

So, how well does it scale? Pretty well, actually. There are many types of workloads, targets, target operating systems, and memory systems still to be explored, but for compute-bound workloads we have seen almost linear scaling (7x speedup) when simulating an 8-core Linux SMP target on a host with 8 real cores available to run Simics. In my personal experiments, I often saw 2x speedups on a humble Haswell Core i5 (2 cores with 2 threads each), and close to 4x on a Core i7 (4 cores with 2 threads each).

Thus, we can see that Multicore Accelerator (MCA) provides the ability to increase simulation performance and scaling, tackling workloads that used to be bottlenecks for Simics. It is not a panacea, but it is a very useful additional performance technique to apply when multicore targets are on the critical simulation path. Even for mostly serial workloads like operating system boots, we have seen spots of parallelism where multicore accelerator provides speedups of 25% or so, which is still quite useful.
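As a back-of-the-envelope check on numbers like these, we can invert Amdahl's law to estimate the parallel fraction implied by an observed speedup. A small Python sketch (the thread count for the 25% boot case is my assumption):

```python
def parallel_fraction(speedup, n_threads):
    """Invert Amdahl's law: speedup = 1 / ((1 - p) + p / n_threads)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_threads)

# Compute-bound SMP workload: ~7x speedup on 8 host threads.
print(parallel_fraction(7.0, 8))   # ~0.98: almost entirely parallel

# OS boot with spots of parallelism: ~1.25x speedup, assuming 4 host threads.
print(parallel_fraction(1.25, 4))  # ~0.27: mostly serial
```

The second figure matches the intuition that an OS boot is mostly serial: even a modest overall speedup is consistent with only about a quarter of the work being parallelizable.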

Playing around with MCA, you learn some interesting things about performance. Parallelism and parallel performance really aren't all that easy to understand, and especially not to predict.

What was particularly striking was that the better a target runs on the basic Simics JIT compiler, the less value the hyperthreads on the (Intel Architecture) host provide. It seems that something like this happens: when a target workload works well with the JIT compiler, the JIT compiler creates code that has very few pipeline stalls and cache misses, and which keeps the processor pipeline busy. With many such threads running in parallel, there is no slack for the hyperthreads to exploit. Remember that in current hardware, the hyperthreads are essentially two streams of instructions sharing the same execution core. This usually works really well, since most code suffers cache misses and other blocking events during which the other hyperthread can step in and keep the pipeline working (thus improving performance significantly for a small additional investment in hardware resources). With Simics, this beneficial case seems to be fairly rare. The hyperthreads sometimes speed things up, but just as often they appear to trip over each other's feet in their rush to get things through the pipeline: running a 4-core target using 2 host threads was faster than using 4 host threads on a 2-core, 2-threads-per-core Core i5, for example. The paradoxical fact here is that if the underlying JIT had been doing a worse job, the MCA scaling would have been better, as there would have been more slack for the hyperthreads to exploit – but overall performance would have been worse, of course. Fun stuff.

From the user perspective, it is necessary to note that Multicore Accelerator needs to be enabled for each processor architecture individually, since the target's parallel synchronization semantics have to be mapped to host primitives. At launch, Simics 5 supports the Intel, ARM, and Power architectures.

Multicore Accelerator is not deterministic, unlike the established multimachine accelerator. This is a necessary trade-off to maximize performance, since a deterministic implementation suffered a noticeable performance penalty. In practice, this means that in order to use Simics reverse debugging and replay debugging, you need to run without MCA: use MCA to position the workload and get through heavy code, and then use standard Simics to analyze and debug.
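In script form, that workflow might look something like the sketch below. This is a hypothetical Simics Python fragment: SIM_run_command is part of the Simics API, but the exact command names (enable-multicore-accelerator, disable-multicore-accelerator, run, enable-reverse-execution) are assumptions that should be verified against the Simics 5 documentation for your installation:

```python
# Hypothetical Simics script fragment; command names are assumptions.
import simics

# Phase 1: use Multicore Accelerator to get through the heavy,
# parallel part of the workload quickly (not deterministic).
simics.SIM_run_command("enable-multicore-accelerator")
simics.SIM_run_command("run")  # run until a previously set breakpoint hits

# Phase 2: switch back to standard deterministic execution before
# replaying or reverse debugging the interesting code.
simics.SIM_run_command("disable-multicore-accelerator")
simics.SIM_run_command("enable-reverse-execution")
```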

Increased performance is always important in Simics, since performance determines what kinds of software loads can be usefully run in simulation within the time frames given by interactive usage, automated testing, or continuous integration turnaround times.