HAL Id: hal-02927455
https://hal.archives-ouvertes.fr/hal-02927455
Submitted on 29 Jan 2021
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Fast Cross-Layer Vulnerability Analysis of Complex Hardware Designs
Joseph Paturel, Angeliki Kritikakou, Olivier Sentieys
To cite this version:
Joseph Paturel, Angeliki Kritikakou, Olivier Sentieys. Fast Cross-Layer Vulnerability Analysis of
Complex Hardware Designs. ISVLSI 2020 - IEEE Computer Society Annual Symposium on VLSI,
Jul 2020, Limassol, Cyprus. pp.328-333, �10.1109/ISVLSI49217.2020.00067�. �hal-02927455�
Fast Cross-Layer Vulnerability Analysis of Complex Hardware Designs
Joseph Paturel
Univ Rennes, Inria, CNRS, IRISA
Angeliki Kritikakou Univ Rennes, Inria, CNRS, IRISA
Olivier Sentieys
Univ Rennes, Inria, CNRS, IRISA
Abstract—Simulation-based fault injection is commonly used to estimate system vulnerability. Existing approaches either partially model the fault masking capabilities of the system under study, losing accuracy, or require prohibitive estimation times. This work proposes a vulnerability analysis approach that combines gate-level fault injection with microarchitecture- level Cycle-Accurate and Bit-Accurate simulation, achieving low estimation times. Single and multi-bit faults both in sequential and combinational logic are considered and fault masking is mod- eled at gate-level, microarchitecture-level and application-level, maintaining accuracy. Our case-study is a RISC-V processor.
Obtained results show a more than 8% reduction in masked errors, increasing more than 55% system failures compared to standard a fault injection approach.
Index Terms—Fault injection, vulnerability analysis, RISC-V I. I NTRODUCTION
Over the past few years, the Soft Error Rate (SER) of hardware designs has significantly increased due to reduced technology node sizes and lower supply voltages [1]. Circuits designed using modern deep-submicron technologies have become more and more sensitive to environmental sources [2], such as ionization, radiation, and high-energy electromagnetic interferences, leading to temporary reliability violations, called soft-errors. These sources can affect a memory cell of a design, corrupting its contents (flip the bit it contains). This type of fault is known as Single-Event-Upset (SEU). As technology node sizes have significantly reduced, several memory cells can be corrupted at the same time, leading to Multi-Bit- Upsets (MBUs) [3]. Moreover, when a combinational cell is affected, a logic transient, known as Single-Event-Transient (SET), is generated. The SET is propagated in the forward cone of the impacted cell, and it can eventually reach one or several flip-flops and corrupt their values. Until recently, soft errors occurring in memory circuits have been considered as the major contributors to the SER due to their large sizes and high-density, making them statistically more likely to be affected. However, soft errors occurring within logic circuits have become more and more important and combinational logic cannot be considered negligible anymore [4], [5].
As a result, the vulnerability of a design towards faults has to be evaluated as part of the development process. To do so, techniques that force the appearance of faults (fault injection) in different structures of the design are often employed. To obtain an accurate vulnerability analysis, the masking effects occurring at each layer of the design must be taken into account: from the logical and timing masking occurring at the
gate-level, the microarchitectural masking introduced by the way computational blocks are arranged, up to the application masking resulting from the application tolerance to errors. To achieve that, fault injection has to be performed at a low layer, such as the gate-level. However, when complex designs are considered, prohibitive estimation times are required. As shown by our experimental results (Table VI-Section III), a gate-level fault injection on a RISC-V processor running an 4x4 matrix multiplication requires more than four months CPU time for a statistically representative vulnerability analysis.
To speed up the process, some approaches ignore the im- plementation details of the design. For instance, fault injection at the application level [6] is only able to alter the data held in the variables of the application under study. However, by ignoring the underlying executing hardware, this leads to less accurate vulnerability analysis [7]. The majority of approaches considering the hardware of the design focus only on single-bit faults occurring in the sequential logic [8]–[11]. For example, in the cross-layer vulnerability analysis flow of [11], the appearance and propagation of faults are modelled as SEUs, occurring in selected design regions. The flow uses an RTL simulator for a predefined number of cycles. Then, the state of the design is handled to a functional simulator for faster analysis. However, these approaches discard the presence of faults in combinational logic and the possibility that several bits of a design can be affected at the same time. In [12], several flip-flops of the design are simultaneously flipped through an almost exhaustive fault injection in the design registers, which requires high estimation time. Applying this approach to estimate the vulnerability of only the register file of the RISC-V processor for the 4x4 matrix multiplication ap- plication requires 4 hours 18 minutes. The technique presented in [13] considers faults occurring both in the sequential and the combinational logic. Authors use an Instruction-Set Simulator (ISS) to execute the application. To inject a fault, the ISS invokes a gate-level simulator, that injects the fault in a cell of the design. The corruptions, induced by the fault, are obtained and given back to the ISS. As a gate-level simulation is not used for each application cycle, the estimation time is reduced.
However, ISSs do not consider the hardware implementing the instruction set and are unaware of the design registers. There- fore, they cannot model the microarchitecture-level masking of the design, reducing accuracy. Table I summarises the aforementioned representative works.
This work addresses the above limitations by proposing
Reference Masking type Fault model Gate-level µArch Application SEU MBU
[6] - -
[8]–[11] - -
[12] -
[13] -
Proposed
TABLE I: Comparison of representative state-of-the-art a fast cross-layer vulnerability analysis method, applicable to entire complex hardware designs, providing statistically representative vulnerability analysis. It considers faults both in the sequential and combinational logic, as well as masking at the gate-level, microarchitecture-level and application-level.
More precisely, the contributions of this work are:
• A vulnerability analysis method combining two fault in- jection campaigns at different hardware levels, i.e, gate- level and microarchitectural-level. The gate-level campaign is applied once per design. It statistically analyses re- alistic faults both at the combinational and sequential logic, providing a higher level of accuracy. It produces error patterns that depict the impact of gate-level mask- ing. The microarchitectural-level campaign is applied per application under study. It is based on a fast Cycle- Accurate and Bit-Accurate (CABA) simulator, able to eval- uate microarchitectural-level and application-level masking.
The injection is driven by the gate-level error patterns.
• The population of SEUs and MBUs occurring due to re- alistic SETs on the combinational and sequential logic is analyzed by tuning the design operating frequency and SET pulse width. Results highlight the notable impact of MBUs on vulnerability analysis.
• A vulnerability analysis of a RISC-V processor is provided.
The obtained results show, on average, a 8.32% reduction in masked faults, which leads to an increase in the number of crashes (57.81%), hangs (24.76%), and data corruptions (5.78%) compared to standard SEU-based approaches.
The paper is organized as follows. Section II presents the proposed vulnerability analysis method. Section III analyzes the experimental analysis results on a RISC-V processor.
Finally, a discussion on the accuracy of this work as well as a conclusion and future works are presented in Section V.
II. C ROSS - LAYER V ULNERABILITY A NALYSIS
To obtain a fast and accurate vulnerability estimation, we propose a cross-layer method based on two fault injection campaigns, at different hardware abstraction levels of the Design Under Test (DUT). Figure 1 depicts the overview of our method. The first campaign is performed at the gate level to statistically obtains error patterns independently of the application to be characterized. The error patterns describe the number of occurrences and the geometry of errors, due to SETs in the combinational logic and SEUs directly on the flip- flops of the DUT, considering gate-level masking. To evaluate the impact of these patterns on an application, considering microarchitectural-level and application-level masking, the
Fig. 1: Overview of proposed cross-layer vulnerability analysis second campaign is performed at the microarchitecture level, using a fast CABA simulator of the DUT microarchitecture.
A. Single-Cycle Gate-Level Analysis
The aim of the gate-level analysis is to create statistical models of the SEU and MBU occurring due to soft-errors, in both combinational logic and flip-flops of the DUT. To do so, the netlist of the DUT is enhanced with an injection block (built with a XOR gate and a multiplexer), attached to the output of each netlist cell, allowing the modification of the cell’s outputs. These injection blocks allow for the appearance of an SET anywhere and at any time in the netlist. To not affect the timing characteristics of the DUT, the propagation delays of all the injection blocks are set to zero.
Once the netlist modified, it is used by the fault injection campaign described in Algorithm 1. Since the emergence of SEUs and MBUs, propagated from SETs through the combi- national logic, depends on the values of the DUT inputs, we explore N input different sets of values as inputs. For instance, assuming that the DUT is the execution pipeline stage of a processor, these sets of values consist of instructions randomly selected from the processor’s Instruction Set Architecture with random operands. For each new set of values (L.2), the DUT is executed without any fault (L.3) in order to obtain the golden reference (L.4), which will be used to detect errors.
Then, N gl faults are forced into the netlist for this input set (L.5). The number of faults N to be injected is defined based on the required confidence level of the statistical analysis as
Algorithm 1: Gate-level fault injection campaign
L1
for inputData in inputDataSets[0...N
input-1] do
L2
newInputSet();
L3
executeClockCycle();
L4
goldenReference = collectOutput();
L5
for ninj in [0...N
gl-1] do
L6
targetCell = selectNextCell();
L7
offset = random(0, Tclk);
L8
duration = getWidthFromDistribution(targetCell);
L9
wait(offset);
L10
flipCellOutput(targetCell);
L11
if isCellCombinational(targetCell) == true then
L12
wait(duration);
L13
flipCellOutput(targetCell);
L14
else
// The target cell is a flipflop, the error is latched
L15
waitForClockRisingEdge();
L16
error = compareDUTOutput();
L17
log(error);
Confidence 95% (t=1.96) 99% (t=2.5758) 99.8% (t=3.0902)
e = 5% 385 664 955
e = 1% 9604 16587 23874
e = 0.1% 960400 1658687 2387335
TABLE II: Number of faults required to be injected.
N = t
2×p×(1−p) e
2(1), where t is the critical value related to the statistical confidence interval, e the error margin, and p the percentage of the possible fault population individuals that are assumed to lead to errors. p is usually set to 0.5 to maximize the sample size N. Eq. (1) is derived from [14], where the population size is set to +∞ [15]. Table II presents the evolution of the number of faults to inject in the DUT in order to meet certain statistical characteristics. Hence, at the gate level, the fault injection campaign performs N = N input ×N gl injections. During a fault injection cycle, the netlist cell (L.6), the time offset (L.7), and the duration of the fault to be injected (L.8), are selected. The cell that will be subject to injection is chosen function of its area – the bigger the cell, the higher its probability to be selected. The SET start time is selected randomly inside one clock cycle. The duration can be fixed, random, or provided by tools, that characterize the particles’
impact over the netlist cells, e.g., MUSCA [16]. The campaign waits until the SET start is reached (L.9) and the injection block of the selected cell is activated, flipping its output (L.10).
If the selected cell is combinational (L.11), an SET has been injected and has to remain active for the selected duration (L.12). After this duration has passed, it is deactivated (L.13).
Otherwise, the injected fault is an SEU affecting directly a flip- flop cell (L.14). At the end of the cycle (L.15), the contents of the flip-flops provide the output, which is compared (through a XOR operation) with the golden reference (L.16). The errors and the number of times they have been observed are logged (L.17). With this information, we obtain the error patterns and their frequencies. They will be used at the microarchitecture- level injection campaign to perform an accurate vulnerability analysis. If the affected cell has several flip-flops in its forward cone, the SET propagation can lead to MBUs.
The gate-level analysis models the propagation of SETs during a single cycle. Hence, complex designs, that behave on a multi-cycle basis (such as pipeline circuits), have to be split into sections. Each section is a part of the design, where the output of combinational logic is computed and stored to flip-flops in a single cycle. To estimate the vulnerability of the complete design, the impact of the gate-level error patterns needs to be simulated on a platform, that is aware of the microarchitecture and able to accurately execute the applica- tion. To achieve that, we perform fault injection campaign at microarchitecture level, as described in next section.
B. Microarchitecture-Level Analysis
This analysis estimates the vulnerability of the com- plete DUT, taking into account microarchitecture-level and application-level masking. To achieve that, we extend a CABA simulator in order to inject the error patterns, collected at the gate level per DUT section, at the microarchitecture level.
The microarchitectural fault injection campaign is summa- rized in Algorithm 2. Initially, the application under study is executed on the DUT without faults in order to collect a set of golden references (L.1): i) application output, ii) system state (memory and registers), and iii) number of execution cycles.
The campaign is executed N ml times per application (L.2), each time injecting a fault in the DUT. In order to obtain statistically correct vulnerability estimations, the number of faults required to be injected is computed using Eq. (1).
During the application execution, the cycle to inject the fault is chosen randomly between the first cycle and the total number of cycles considering the fault-free execution (L.3).
Then, the type of logic, where the fault is injected, is selected based on the area of the logic (L.4). For instance, if 40% of the DUT area is combinational logic, then it has a 40% chance of being chosen, and thus a 60% chance for the sequential logic. If the selected logic is sequential (L.5), an SEU is injected in a random flip-flop of the complete DUT (L.6), similar to common fault injection campaigns [8]–[10]. If the selected logic is combinational (L.7), the injected fault is a gate-level error pattern of a DUT section. The selection of the DUT section is area driven (L.8) – the larger the DUT section, the more chances it has to be selected. Then, an error pattern is selected as a function of its frequency – the more often a pattern has been observed during gate-level analysis, the higher the chances of being selected. The selected error pattern is injected (L.9) and the simulation is resumed. When simulation ends (L.10), the outcome belongs into one of the following error classes (L.11):
• Masked: The application has executed without any error or mismatch compared to the golden reference.
• Hang: The application has entered an infinite loop. A cycle counter is used to stop the current injection simulation, if the counted cycles exceed a threshold.
• Crash: The execution of the application has terminated unexpectedly and an exception has been raised (out of bound memory access, misaligned PC, hardware trap, etc.)
• Application Output Mismatch (AOM): Only the output of the application differs from the golden reference.
• Internal State Mismatch (ISM): Only the system state (mem- ory and registers) differs from the golden reference.
• ISM & AOM: Both the application output and the system
Algorithm 2: Microarchitecture-level fault injection
L1
output, state, cycles = collectGoldenReference();
L2
for fault in [0...N
ml-1] do
L3
waitUntilInjectionCycle();
L4
logicType = selectLogicTypeWithArea();
L5
if logicType == sequential then
L6
injectSEUInRandomSeqStructure();
L7
else
L8
combSection = selectCombSectionWithArea();
L9
injectAnErrorPattern(combSection);
L10
waitForSimulationEnd(timeout);
L11