Investigating Uncore Error Resilience using a Million Simics Runs: An Interview with Hyungmin Cho

One great thing about working with Simics is learning about new and incredible ways that the tool has been used. The interview in this blog post provides such an example – Hyungmin Cho, a PhD student at Stanford, is using Simics to systematically study error resilience in hardware and software, and has so far performed more than one million Simics runs. That is an amazing number. The Simics setup is also very interesting, involving an RTL simulator used to simulate the hardware at the gate level, checkpointing to get to the interesting part of a run, and extensive automation. Fascinating stuff, and I hope you will enjoy learning about it just as much as I did.

Jakob Engblom: Please introduce yourself!

Hyungmin Cho: I am Hyungmin Cho, a Ph.D. student at Stanford University. I’m a member of the Robust Systems Group, and my adviser is Professor Subhasish Mitra.

JE: What is the topic of your current research?

HC: We are studying various hardware reliability issues, including soft errors, voltage instability, and circuit aging. In this work, we focus on soft errors in hardware logic components. We first characterize the application-level effects of these soft errors and then provide efficient solutions to protect the system against them. To keep the solutions cost-effective, we explore cross-layer approaches that combine error protection techniques across multiple layers of the system design (e.g., circuit, architecture, and application level).

There is a large body of existing work on processor core errors. However, little attention has been paid to the so-called ‘uncore’ components, such as the logic of the memory subsystem or the I/O controllers. In modern systems-on-chip (SoCs), these uncore components occupy a large share of the chip area and consume power comparable to that of the processor cores. Using the simulation capabilities of Simics, we study soft errors in uncore components in large-scale SoCs.

JE: Could not agree more on the uncore part. That is clearly the direction we are going, with an ever-increasing aggregation of functionality onto a single chip. Anyway, do you have any interesting results so far?

HC: We have preliminary results on memory subsystem errors. We found that soft errors in uncore components need special attention because 1) they have significant effects on system-level reliability and 2) existing error protection techniques are not adequate for uncore components (e.g., they fail to meet safety requirements). We are also working on expanding our research focus to cover other types of uncore components, especially I/O components.

JE: Just to be clear, a soft error is an error that changes the state of the system once, but not permanently?

HC: Yes, the main cause of such errors is radiation, such as alpha particles or cosmic rays. For large data arrays, such as cache SRAM or DRAM, efficient error-correcting code (ECC) techniques are widely used to address soft errors. For logic components, however, it is difficult to provide low-cost error protection. Therefore, it is important to correctly understand the application-level effects of errors in order to explore cross-layer error resilience techniques.

While our main focus is soft errors, we investigate other types of hardware errors as well. Using the same platform, we also model various failure modes such as voltage droops, delay faults, and even permanent failures.

JE: How do you use Simics in your research?

HC: One of the difficulties in studying uncore components is modeling a full large-scale system while maintaining the accuracy of the simulation. Uncore components, such as a shared L2 cache controller or a crossbar interconnect, are connected to multiple processor cores and to other uncore components, and erroneous behavior in one uncore component may affect many other components in the system. Therefore, we need to model a full-scale system to correctly observe the effects of uncore component errors. As an example, we simulate the OpenSPARC T2 SoC design, which supports 64 hardware threads across its 8 processor cores, along with a variety of uncore components. Modeling such a large-scale system running real-world applications requires fast simulation. However, existing simulation techniques for large-scale systems are not capable of accurately modeling low-level errors, such as bit flips in hardware components, while RTL simulation is far too slow to run real-world applications on a full-scale SoC. For example, RTL simulation of the OpenSPARC T2 design achieves only about 100 cycles per second.

Our solution is to connect Simics to a low-level RTL simulator. The bulk of the system is simulated in Simics, while the error injection target is simulated in the RTL simulator. Any access to the target uncore component is captured and redirected to the RTL simulator, and the response from the uncore component is sent back to Simics. Once the error injection modeling is finished, Simics disconnects from the slow RTL simulation and continues to run the application to completion, so we can observe the application-level effects of the injected errors.
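That control flow can be pictured with a small sketch. All class and method names below are hypothetical stand-ins for illustration only; this is not the real Simics module API or any particular RTL simulator’s interface.

```python
# Illustrative sketch of the Simics <-> RTL co-simulation handshake.
# All class and method names are hypothetical stand-ins; this is not
# the real Simics module API or any RTL simulator's interface.

class RTLBridge:
    """Redirects accesses to the error injection target into an RTL simulator."""

    def __init__(self, rtl_sim, fast_model):
        self.rtl = rtl_sim        # handle to the gate-level RTL simulation
        self.fast = fast_model    # high-level functional model of the same unit
        self.attached = True      # detach once error effects have propagated

    def handle_access(self, transaction):
        if not self.attached:
            # RTL phase is over: serve the access from the fast model.
            return self.fast.service(transaction)
        # Drive the access into the RTL model and wait for its response.
        self.rtl.apply_inputs(transaction)
        return self.rtl.run_until_response()

    def detach(self):
        # Map the (possibly corrupted) propagated RTL state onto the
        # high-level model, then drop the slow RTL simulation entirely.
        self.fast.load_state(self.rtl.extract_state())
        self.attached = False
```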

[Figure: the error injection setup, with the rest of the system running in Simics and the target uncore component simulated in RTL]

JE: So essentially, you run the simulation inside of RTL until all effects are propagated to memory contents, and then switch back to fast simulation?

HC: Yes, we created Simics modules to perform fast high-level simulation of the uncore components. These high-level models are functionally identical to the actual RTL modules, but they do not model the low-level details needed for error simulation. Once we can map the effects of the propagated errors onto the simulated high-level states, the slow RTL simulation is no longer needed and we can continue with fast simulation.

JE: What effects can an injected error have on the executing application?

HC: Depending on the application domain, the effects of an error can have different meanings. As a general rule, we classify the application-level outcomes into five categories.

  1. Vanished: The injected error does not affect the final output of the application.
  2. Output Mismatch: Due to the injected error, the application produces an incorrect output, but no explicit error indication is reported to the user. This type of error is also known as Silent Data Corruption (SDC). Since the user is not aware that the output is corrupted, this can be the most critical type of error and the one that most needs attention. For example, IBM targets less than 1 undetected error per 1,000 years per system.
  3. Output Not Affected: The injected error corrupts some architectural state of the system, but the final output of the application is error-free. However, this type of error has the potential to create unwanted side effects.
  4. Unexpected Termination: The application fails with an explicit error indication, such as an application crash, an OS kernel panic, or the application exiting with an error code.
  5. Hang: The application does not produce an output within the specified timeout limit.

This classification is just one example; soft errors can affect the reliability of the system in various other ways. For example, an Output Not Affected error may still alter the user experience of an interactive application even though the final output is correct.
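As an illustration of how such a classification might be scripted, here is a minimal Python sketch that buckets a finished run into the five categories. The log markers, file paths, and comparison scheme are invented for the example; a real harness would match whatever the benchmark, OS, and checking scripts actually emit.

```python
# Minimal sketch of bucketing a finished run into the five outcome
# categories. The log markers, file paths, and comparison scheme are
# invented for illustration, not taken from the actual harness.

def classify_run(log_text, output_path, golden_path, timed_out, state_corrupted):
    if timed_out:
        return "Hang"
    if "kernel panic" in log_text or "application crash" in log_text:
        return "Unexpected Termination"
    with open(output_path, "rb") as out, open(golden_path, "rb") as golden:
        if out.read() != golden.read():
            return "Output Mismatch"  # silent data corruption (SDC)
    # Output is correct; did the error still corrupt architectural state?
    return "Output Not Affected" if state_corrupted else "Vanished"
```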

JE: I got the impression you are doing a very large number of runs with Simics?

HC: Yes, to obtain statistically significant results, we need a large number of error injection samples. We cannot simulate all possible error scenarios (error injection target x error injection time), which amount to well over 10^14 cases. Instead, we randomly sample a large number of error injection scenarios, enough to give us reasonable confidence in the results. For each error injection target component, we run at least 40,000 random error injections for each application in a benchmark suite. In total, we have performed more than a million error injection runs so far.
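A quick back-of-the-envelope calculation shows why a sample size on that order is sufficient: for an outcome rate p estimated from n Bernoulli samples, the 95% confidence half-width is 1.96·sqrt(p(1-p)/n), which is at its worst at p = 0.5. The sketch below computes that bound and shows the basic sampling step; the population sizes are placeholders, not the actual OpenSPARC T2 numbers.

```python
import math
import random

# Back-of-the-envelope check of why ~40,000 samples per target suffice:
# the 95% confidence half-width is 1.96 * sqrt(p * (1 - p) / n),
# which is worst-case at p = 0.5.
n = 40_000
half_width = 1.96 * math.sqrt(0.25 / n)
print(f"95% CI half-width with n = {n}: +/-{half_width:.2%}")  # ~ +/-0.49%

# Sampling one scenario: a random (target bit, injection cycle) pair,
# instead of enumerating the > 10^14 possible combinations. These
# population sizes are placeholders, not the real OpenSPARC T2 figures.
num_target_bits = 10**6        # hypothetical count of injectable flip-flops
num_cycles = 10**9             # hypothetical length of the benchmark run
scenario = (random.randrange(num_target_bits), random.randrange(num_cycles))
print("inject into bit %d at cycle %d" % scenario)
```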

JE: One million experimental runs… that’s a lot of runs. How long is each run, typically?

HC: An error injection run with a PARSEC benchmark application takes more than 24 hours if we run it from beginning to completion. It takes even longer in our setup, since we simulate the uncore models in Simics as well. To reduce this overhead, we use several strategies. For example, since we only need to model how the injected error affects the application, we do not need to simulate the rest of the application once we observe that the injected error has vanished without propagating. Also, we do not have to simulate the execution from the beginning of the application every time. Using one of the Simics checkpoints that we create during a one-time error-free execution, we can skip all simulation up to the actual error injection time of each run. As a result, an error injection run takes only a few minutes if the error has limited effects, but it may take several hours to completely model the application-level effects for some error injection instances. On average, an error injection run takes about 10 minutes.

JE: So you generate a series of checkpoints during a pilot run, and then start each experiment from one of these checkpoints? That is a nice technique! How many checkpoints do you lay down for each benchmark?

HC: We create a checkpoint every million cycles, and a benchmark application has about a thousand checkpoints on average. There is a trade-off between the storage requirement and the overhead of each individual error injection run. We found that going below one million cycles gives only a small benefit per run, since other sources of overhead start to dominate, such as starting and connecting to the RTL simulator.
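The checkpoint-selection step is straightforward to sketch: given a randomly chosen injection cycle, load the latest checkpoint taken at or before that cycle and simulate forward only the remainder. The names and numbers below are illustrative, following the one-checkpoint-per-million-cycles scheme.

```python
import bisect

# Sketch of picking the starting checkpoint for one injection run: load
# the latest checkpoint taken at or before the injection cycle, so the
# earlier part of the run never has to be re-simulated.

CHECKPOINT_INTERVAL = 1_000_000  # one checkpoint per million cycles

def checkpoint_before(injection_cycle, checkpoint_cycles):
    """Return the latest checkpoint cycle <= injection_cycle."""
    i = bisect.bisect_right(checkpoint_cycles, injection_cycle)
    return checkpoint_cycles[i - 1] if i > 0 else None

checkpoints = list(range(0, 50_000_000, CHECKPOINT_INTERVAL))
inject_at = 37_450_123  # illustrative injection cycle
start = checkpoint_before(inject_at, checkpoints)
print(f"load checkpoint at cycle {start}, "
      f"simulate {inject_at - start} cycles forward, then inject")
```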

JE: How do you start these runs? I assume that you have not done a million runs by hand?

HC: We use Bash and Python scripts to launch multiple error injection runs in parallel, as well as to process the output logs.
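A launcher along those lines could look like the Python sketch below. The simics command line, script name, and log naming are hypothetical placeholders; only the fan-out-and-collect pattern is the point.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Sketch of fanning out error injection runs in parallel. The simics
# command line, script name, and log naming are hypothetical
# placeholders, not the actual invocation used in this work.

def run_injection(run_id):
    cmd = ["simics", "-batch-mode", "inject.simics", f"run_id={run_id}"]
    with open(f"run_{run_id}.log", "w") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode

# Cap parallelism at the available license count (150 in this setup).
with ThreadPoolExecutor(max_workers=150) as pool:
    exit_codes = list(pool.map(run_injection, range(40_000)))
print(f"{sum(code != 0 for code in exit_codes)} runs exited with a nonzero code")
```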

JE: How many experiments can be running in parallel?

HC: Thanks to the Simics university program, we were able to get enough licenses for the error injection runs. Currently, we are running 150 instances in parallel on our servers. The current bottleneck is the number of licenses for the RTL simulator.

JE: Where could people go if they want to read more about your results?

HC: We don’t have a published paper on this project yet, but here are some relevant links:

  1. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6560694 – This paper is prior work of ours that discusses error injection techniques for processor cores (RTL simulation only).
  2. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6651907 – This paper is another example from our group that uses Simics to run applications on large-scale CMPs.

JE: Thank you, that was very interesting!