HAL Id: tel-01208121
https://hal.archives-ouvertes.fr/tel-01208121
Submitted on 1 Oct 2015
Reliability of probabilistic circuits
Kaikai Liu
To cite this version:
Kaikai Liu. Reliability of probabilistic circuits. Engineering Sciences [physics]. TELECOM ParisTech, 2014. English. ⟨tel-01208121⟩
2014-ENST-0065
EDITE - ED 130
Doctorat ParisTech
THESIS
submitted for the degree of Doctor awarded by
TELECOM ParisTech
Specialty: Electronics and Communications
presented and publicly defended by
Kaikai LIU
on 15 October 2014
Reliability of probabilistic circuits
Thesis supervisor: Lirida NAVINER
Thesis co-supervisor: Jean-François NAVINER
Jury
Reviewers
Mr. Walter STECHELE, Professor, Technische Universität München
Mr. Ivan SARAIVA SILVA, Professor, Universidade Federal do Piauí
Examiners
Mr. Habib MEHREZ, Université Pierre et Marie CURIE, LIP6
Mr. Jean-Marc DAVEAU, Doctor, STMicroelectronics/Crolles2
Thesis supervisors
Mrs. Lirida NAVINER, Professor, Télécom ParisTech
Mr. Jean-François NAVINER, Associate Professor (Maître de conférences), Télécom ParisTech
Télécom ParisTech
a school of the Institut Mines-Télécom, member of ParisTech
Département Communications et Électronique (COMELEC), Institut Mines-Télécom, Télécom ParisTech, LTCI-CNRS-UMR 5141, 46 Rue Barrault, 75634 Paris CEDEX 13, France
This thesis is set in Computer Modern 11pt, with the LaTeX Documentation System. © Kaikai Liu 2014
Acknowledgements
I would like to acknowledge everyone who helped me during this three-year thesis project. Without their help and support, this thesis would never have been accomplished.
First and foremost, Professor Lirida NAVINER introduced me to the interesting world of the reliability of probabilistic circuits. She helped me a great deal, from the literature review, the set-up of the different models and the realization of the simulations to the text of the publications. I really appreciated her help, enthusiasm, patience and trust in my work. She taught me a lot, not only in the academic field but also about the attitude to work, from which I will benefit for my whole life.
I am very thankful to Professor Jean-François NAVINER and Professor Hervé PETIT, who helped me a lot with my thesis. During the thesis project, they helped me with the simulation set-up and with some of the details of the reliability field. Without their help, the thesis would have been much harder for me.
Thanks also to my dear colleagues who accompanied me all these years. Thank you Hao Cai, Ting An, Arwa Ben Dhia, Yi-meng Zhao, Yao Li, Yao Wu, Yunfei Wei, Wenlong Gupeng, Youlong Wu, Mengxing Li and Fengguo Zhang. Your kindness and support made these three years of PhD more interesting. In particular, thank you Hao, Yi-meng and Wenlong: thanks to your company while jogging in “Parc Montsouris”, I kept a healthy body.
Finally, my deepest gratitude goes to my father and my mother, Huangong Liu and Jinhua Liu, in Shandong, China, and my sincere gratitude and love go to my dear wife, Wei Wang, in Sheffield, UK. Without their love and support, I would never have finished my thesis.
Abstract
Moore’s Law has benefited us greatly since the 1970s. As conceptualized by R. Dennard [1], one can obtain transistors with higher speed and lower energy consumption by reducing the critical dimensions while keeping the electric field constant. This concept made possible the continuous increase in the integration density of CMOS (Complementary Metal-Oxide-Semiconductor) technology. Since the 1970s, the number of components per chip has doubled every two years [2]. According to the ITRS (International Technology Roadmap for Semiconductors) 2007 [3], the operating frequency of a transistor will reach 12 GHz and the number of transistors in a single chip may reach 12 billion in 2020.
With dimensions shrinking into the nanometer, or deep-submicron (DSM), regime, many challenges have arisen, one of which is the dramatically increasing probability of soft errors induced by Single Event Upsets (SEUs) or Single Event Transients (SETs). Historically, at ground level such soft errors occurred mostly in memory cells and rarely in logic circuits. However, the increased operating frequency, the lower supply voltage and the reduced noise margin have made the transient current pulses induced by alpha particles or high-energy neutrons much more likely to be captured by the flip-flops at the outputs of logic circuits. Consequently, the reliability of logic circuits in the nanometer regime (including its evaluation methodologies) has become an important criterion in IC design [4].
Another challenge accompanying the scaling of CMOS dimensions into the nanometer regime is that thermal noise is no longer negligible. As pointed out in [2], the RMS value of thermal noise increases with miniaturization. The probability of thermal noise crossing the voltage threshold also increases steadily as the noise margin is reduced [2]. Thus the noise immunity of logic circuits is also an important concern in IC design.
Therefore, with CMOS technology scaling into the DSM or nanometer regime, the behaviour of logic gates, the outputs of logic functions and even the services of systems are no longer deterministic but probabilistic. The reliability of probabilistic circuits, as affected by soft errors and noise, draws increasing attention in modern IC design.
This thesis concentrates on the reliability evaluation methodology improvements and the general cost-effective noise-tolerant circuit designs.
An efficient evaluation methodology based on Probability Transfer Matrix (PTM) is proposed to obtain the accurate reliability of a combinational circuit. Compared with the traditional PTM, the proposed PTM (denoted as ECPTM) can dramatically reduce the penalty in time consumption and memory usage. A general analytic tool has been developed to get the reliability of a circuit with its netlist file. The efficiency improvement of the proposed method is verified with practical bench circuits.
This thesis also presents a general cost-effective noise-immune design structure of logic functions. This design structure is based on Markov Random Fields (MRF) and is suitable for all the basic logic gates as well as complex logic functions involving several logic gates. This structure has been verified through simulations done in SPICE using the Berkeley Predictive Technology Model (BPTM) 65nm CMOS Technology [5] and in Spectre based on ST 65nm CMOS models.
Furthermore, this thesis proposes a model to analyze the reliability issues induced by both SETs and noise. Mathematical conditions are derived in order to improve SET robustness and noise immunity at the same time.
Contents
Acknowledgements
Abstract
List of Tables
List of Figures
List of Acronyms
1 Introduction
  1.1 Motivations
  1.2 Objectives
  1.3 Thesis Organization
2 Reliability Evaluation Methodologies and Improvement Techniques
  2.1 Preliminaries on Reliability
    2.1.1 Basic Concepts
    2.1.2 Radiation-induced Soft Errors
    2.1.3 The Influence of Thermal Noise
  2.2 Reliability Evaluation Methodologies
    2.2.1 Simulation-based Methodologies
    2.2.2 Analytical Methodologies
    2.2.3 Hybrid Methods of Reliability Analysis
  2.3 Reliability Improvement Technologies
    2.3.1 Space Redundancy
    2.3.2 Time Redundancy
    2.3.3 Other Methodologies
  2.4 Conclusions
3 Efficient Evaluation Methodology Based on PTM
  3.1 Probabilistic Transfer Matrix
    3.1.1 PTM Computation of Gates Connected in Series
    3.1.2 PTM Computation for Gates Connected in Parallel
  3.2 Efficient Computation of PTM (ECPTM)
    3.2.1 PTM Calculation of Complex Circuits
    3.2.2 Efficient Tensor Product Calculation
  3.3 The Efficiency of ECPTM
    3.3.1 A General Testbench Circuit
    3.3.3 More Simulation Results of Testbench Circuits
  3.4 Conclusions
4 CENT-MRF
  4.1 Preliminaries and Review
    4.1.1 Preliminaries on MRF Theory
    4.1.2 Previous Noise-immune Circuit Design Structures
  4.2 Proposed General Logic Gates Design Structure
  4.3 Noise-immunity Simulation Results of Different Design Structures
    4.3.1 Quantifying the Noise-immunity
    4.3.2 Simulation Results (SPICE)
    4.3.3 Simulation Results (SPECTRE)
  4.4 Discussions and Conclusions
5 Analysis of Noise and SETs
  5.1 Reliability Analysis Model of CENT-MRF
    5.1.1 Modeling the Noise Analysis
    5.1.2 Modeling the Noise Analysis of CENT-MRF Structures
  5.2 Simulations With Hspice and Matlab
    5.2.1 Simulation Based on Hspice
    5.2.2 Simulation Based on Matlab
    5.2.3 Simulation Results
  5.3 Conclusions
6 Reliability Analysis of Reed-Solomon Decoder
  6.1 Preliminaries on RS Decoder
  6.2 Probabilistic Modeling Process
    6.2.1 Modeling Probabilistic Gates
    6.2.2 Modeling Arithmetic Operations of Galois Field
  6.3 Simulation Results
    6.3.1 Brief of Monte Carlo Method
    6.3.2 Simulations of the Probabilistic Gates
    6.3.3 Simulations of the Arithmetic Operations in Galois Field
    6.3.4 Simulations of the Decoding Modules
  6.4 Conclusions
7 Conclusions and Perspectives
  7.1 Conclusions
  7.2 Contributions
  7.3 Perspectives
A Mathematical Derivations
  .1 The mathematical derivations
List of Tables
2.1 Contribution ratio of neutrons to alpha particles for SER for different CMOS elements [6].
2.2 The trend of supply voltage and frequency according to ITRS reports 2013 version [7].
2.3 Characteristics of reliability evaluation methodologies
2.4 Characteristics of reliability enhancement techniques
3.1 Time consumption comparisons of PTM and ECPTM
3.2 Memory usage comparisons of PTM and ECPTM
4.1 All the possible input-output states of a two-input NAND gate
4.2 Transistor numbers for different bench circuits
List of Figures
2.1 Different levels of unreliability.
2.2 Relationship of R(t), F(t) and f(t).
2.3 Critical charge of different technology generations.
2.4 SER contributions of individual elements.
2.5 Bit-flip rate due to thermal noise.
2.6 The probabilistic gate model.
2.7 A generic combinational logic circuit.
2.8 General scheme of a saboteur.
2.9 A generalized TMR system with MVC.
2.11 Process of reliability analysis
3.1 Logic gate NAND with its PTM and ITM.
3.2 Structure of combinational circuits.
3.3 PTM matrices, M^l_k.
3.4 A general structure of bench circuit.
3.5 Time consumption of PTM and ECPTM.
3.6 Memory usage of PTM and ECPTM.
3.7 Time consumption efficiency.
3.8 Memory usage efficiency.
4.1 Markov random field graph.
4.2 Logic NAND gate and its MRF graph.
4.3 The two-input MRF NAND gate implementation.
4.4 The MAS MRF design structure.
4.5 The two-input MAS MRF NAND gate implementation.
4.6 The general proposed CENT-MRF circuit design.
4.7 The proposed CENT-MRF circuit design for a NAND gate.
4.8 The simulation process of a general logic combinational circuit.
4.9 The simulation result of a two-input NAND gate.
4.10 The KLDs of different design structures of a two-input NAND gate.
4.11 The KLDs of different design structures of a one-bit FA.
4.12 The KLDs of different design structures of a four-bit RCA.
4.13 The KLDs of different design structures of a two-input XOR gate.
4.14 The KLDs of different design structures of (8, 4) Hamming Decoder.
5.1 The noise model.
5.2 The noise model with noisy input signal.
5.3 The noise model of CENT-MRF.
5.5 Simulation scheme: Hspice.
5.6 Simulation scheme: Matlab.
5.7 The probability of noise being logic “1” of a FA and NI module.
5.8 The simulated RSN of FA with P1 ≥ 1/2.
5.9 The simulated RSN of FA with P1 ≤ 1/2.
6.1 The structure of the RS Decoder
6.2 The Probabilistic Gate Model
6.3 The Simulation Results of Probabilistic Gates (2 × 10^4 times)
6.4 The Simulation Results of Arithmetic Operations (2 × 10^4 times)
6.5 The Simulation of the decoding process
6.6 The Simulation Reliabilities of Different Modules (2 × 10^4 times)
List of Acronyms
SER soft error rate
CENT-MRF Cost-Effective Noise-Tolerant Markov Random Fields
CMOS Complementary Metal Oxide Semiconductor
FIT Failure in Time
IC Integrated Circuit
IEEE Institute of Electrical and Electronics Engineers
ITRS International Technology Roadmap for Semiconductors
LUT Lookup Table
MAS-MRF Master-and-Slave MRF
MC Monte-Carlo
MOSFET Metal Oxide Semiconductor Field Effect Transistor
MRF Markov Random Field
MTTF Mean Time to Failure
NMOS N-Channel Metal Oxide Semiconductor
NMR N-tuple Modular Redundancy
PMOS P-Channel Metal Oxide Semiconductor
SET Single Event Transient
SEU Single Event Upset
SNR Signal-to-Noise Ratio
SoCs Systems-on-Chip
TMR Triple Modular Redundancy
TTF Time to Failure
W transistor Width
DRAM dynamic random access memory
SRAM static random access memory
PTM Probability Transfer Matrix
MAS master and slave
ECC error correcting codes
CTMR Cascade TMR
DCVS Differential Cascode Voltage Switch
ECPTM efficient computation based on PTM
MeV mega-electron-volt
SEE single event effect
IBM International Business Machines
PDF probability density function
BPSG borophosphosilicate glass
GeV giga-electron-volt
FO4 fan-out-of-4
RMS root-mean-square
SPR signal probability reliability analysis
PBM probability binomial model
PGM probabilistic gate model
BDEC Boolean Difference-based Error Calculator
FIFA fault-injection-fault-analysis
ADD Algebraic Decision Diagrams
SPR-MP Multi Pass SPR
PBR Probabilistic Binomial Reliability
RTL register transfer level
PMC probabilistic model checking
BN Bayesian Network
RSN reliability induced by both SETs and noise
RS Reed-Solomon
Chapter 1
Introduction
1.1 Motivations
Moore’s Law has been the driving force of the integrated circuit (IC) industry since the 1970s. The number of components (transistors or bits) per chip has increased dramatically for decades. This continuous increase in the integration density of CMOS (Complementary Metal-Oxide-Semiconductor) technology was made possible by the dimension scaling conceptualized by R. Dennard [1]: by reducing the critical dimensions while keeping the electric field constant, one can obtain transistors with higher speed and lower energy consumption. According to the International Technology Roadmap for Semiconductors (ITRS) 2007 [3], the operating frequency of a transistor will reach 12 GHz and the number of transistors in a single chip may reach 12 billion in 2020. The mainstream feature size was 22 nanometers in 2013 and will reach its theoretical limit within years.
With dimensions shrinking into the nanometer regime, many challenges have arisen, one of which is the reliability of nanometer-scale devices and circuits. The industry has benefited greatly from dimension scaling: devices and systems attain higher density, faster operating speed, lower supply voltage, etc. However, in order to keep the energy consumption constant or at a relatively low level, the supply voltage of the transistors must shrink linearly, which reduces the noise margin. The reduced noise margin makes transistors and devices more sensitive to noise perturbations and consequently makes the logic output of a single gate or circuit no longer deterministic but probabilistic. Historically, soft errors occurred mostly in storage cells (e.g., flip-flops, latches and memory cells) or in aerospace devices. The alpha particles and atmospheric neutrons present in the Earth's atmosphere can generate a current pulse [8], which induces a soft error if the pulse is sufficient in amplitude and duration. Normally, these current pulses cannot introduce soft errors in logic circuits because of masking effects: logical masking, electrical masking and latching-window masking [9] [10] [11] [12]. However, with dimensions scaling into the nanometer domain, the drastic reduction of the noise margin, the much higher operating frequency and the lower threshold voltage [13] have increased the probability that a current pulse is captured by a latch and consequently introduces a soft error in logic circuits, even at ground level [14]. As a result, reliability has become an important criterion in IC design [15].
In order to design reliable electronic systems with unreliable transistors, we must first evaluate the reliability of a circuit. To achieve this, effective and efficient methodologies are needed, one of which is the Probability Transfer Matrix (PTM) [16]. PTM is widely used as a reference methodology because of the accuracy of its results. However, it is not suitable for large circuits because of its high time and storage consumption. Thus we need to optimize PTM to improve its efficiency.
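To make the scalability concern concrete, the following minimal sketch builds the PTM of a two-input NAND gate and composes a small three-gate circuit. It assumes the usual PTM conventions (rows indexed by input combinations, columns by output values, Kronecker product for blocks in parallel, matrix product for blocks in series) and a single gate error probability `eps`. The function names and the value of `eps` are illustrative and are not taken from the tool developed in this thesis.

```python
from itertools import product


def nand_ptm(eps):
    """PTM of a 2-input NAND gate whose output is wrong with probability eps.

    Rows follow the input order 00, 01, 10, 11; columns are output 0 and 1.
    """
    rows = []
    for a, b in product((0, 1), repeat=2):
        correct = 0 if (a and b) else 1
        row = [0.0, 0.0]
        row[correct] = 1.0 - eps
        row[1 - correct] = eps
        rows.append(row)
    return rows


def parallel(a, b):
    """Kronecker (tensor) product: two independent blocks side by side."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def series(a, b):
    """Matrix product: the outputs of block a feed the inputs of block b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def ptm_entries(n_inputs, n_outputs):
    """Number of entries of the PTM of a block with the given port counts."""
    return 2 ** (n_inputs + n_outputs)


eps = 0.05                       # hypothetical gate error probability
gate = nand_ptm(eps)
level1 = parallel(gate, gate)    # two NAND gates in parallel: 16 x 4
circuit = series(level1, gate)   # feeding a third NAND gate: 16 x 2

# For input 1111 the fault-free output is 1, so column 1 holds P(correct).
print("P(correct | input 1111) =", circuit[15][1])
print("entries for 32 inputs, 1 output:", ptm_entries(32, 1))
```

Under these assumptions a block with n inputs and m outputs needs 2^(n+m) matrix entries, so a 32-input, single-output block already requires 2^33 entries (64 GiB with 8-byte entries), which illustrates why the plain PTM approach does not scale to large circuits.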
Another issue in IC design is noise immunity. Because of the intrinsically random nature of noise, traditional fault-tolerant design methodologies based on hardware redundancy, e.g. Triple-Modular-Redundancy (TMR) [17] and Cascade TMR (CTMR) [18], are not able to provide noise immunity: the noise disturbs the input signal of each module and degrades the decision of the majority voter. The NAND-multiplexing methodology proposed by Von Neumann [19] can produce reliable results using unreliable components. However, it needs an extremely high degree of redundancy [20] [21]. Reconfiguration technology is more effective at dealing with manufacturing defects or permanent faults and requires enormous amounts of redundancy [22]. Therefore, the traditional approaches are not effective at attaining noise immunity, given the random and dynamic nature of noise. Probabilistic-based technologies are more suitable for this problem [23–25]. One promising noise-tolerant probabilistic-based design, proposed in [26], is based on Markov random fields (MRF) [27]. According to Nepal et al. [26], the reliability or noise immunity of a circuit can be improved by maximizing the joint probability of valid input-output pairs, at the cost of hardware redundancy. This design was further optimized in [28] in order to reduce its area penalty. In [29], Wey et al. proposed a Master-and-Slave MRF (MAS-MRF) design structure, which obtains nearly the same noise immunity as the structures in [26,28] but with fewer transistors. In addition to the area overhead, another disadvantage of the approaches proposed in [26,28,29] is that they do not provide a general design structure applicable to all the basic logic gates. In other words, every single logic gate must be designed specially. Another MRF-based approach, built on Differential Cascode Voltage Switch (DCVS) logic, was proposed in [30]. However, this methodology is only suitable for an inverter. A general noise-tolerant design with less area penalty is therefore needed.
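As a rough illustration of why majority voting alone offers little protection when every module is noisy, the sketch below evaluates the textbook TMR expression R_TMR = R_v (3R^2 − 2R^3). It assumes three independent modules, each correct with probability R, and a voter that is correct with probability R_v. These symbols and the function name are illustrative; the model is a simplification and is not the analysis carried out in [17] or [18].

```python
def tmr_reliability(r_module: float, r_voter: float = 1.0) -> float:
    """Probability that a TMR block delivers the correct output.

    Assumes three independent modules, each correct with probability
    r_module, and treats any voter error (probability 1 - r_voter)
    as a failure of the whole block.
    """
    # At least two of the three modules are correct.
    majority_correct = 3 * r_module**2 - 2 * r_module**3
    return r_voter * majority_correct


for r in (0.99, 0.9, 0.6, 0.5, 0.4):
    print(f"R = {r:.2f} -> R_TMR = {tmr_reliability(r):.4f}")
```

With a perfect voter, the gain over a single module shrinks as R approaches 0.5, vanishes at R = 0.5, and turns into a loss below it, which is consistent with the observation that redundancy with voting is poorly suited to modules that are all disturbed by random noise.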
Moreover, in prior works, the reliability and the noise-immunity criteria are considered separately [17–19,26,28,29]. In other words, these methodologies consider only the influence of single event transients (SETs) or only the noise immunity. A model that takes both SETs and noise into account at the same time should be proposed, and the trade-off between the reliability with respect to SETs alone and the noise immunity should be obtained.
1.2 Objectives
The main objectives of this thesis are a novel reliability evaluation methodology and reliability enhancement techniques for probabilistic circuits:
• Optimize the existing reliability analysis methodology (PTM) for better performance (less time and storage consumption);
• Develop efficient and effective methods to improve the reliability of probabilistic circuits;
• Set up a model to analyse soft errors and noise immunity at the same time.
1.3 Thesis Organization
The thesis is organized as follows:
Chapter 2 presents the basic definitions and concepts related to reliability in combinational logic circuits: the increasing SER (soft-error rate) at ground level, methods to evaluate the reliability of a combinational circuit, design structures to enhance reliability, etc. This chapter focuses on the PTM methodology and discusses its drawbacks. It also studies reliable circuit design with unreliable transistors, or noise-tolerant circuit design with noisy transistors. It reviews the basic design concept and structure based on Markov Random Fields (MRF). Some improved design structures are also discussed.
Chapter 3 proposes some efficient methods to evaluate the reliability of combinational circuits. These methods are mostly compared with PTM [31] [32] [33], and their efficiency is discussed mainly in terms of time consumption and memory usage. A general tool is designed to evaluate the reliability of a circuit from its netlist. A general circuit is designed to verify the efficiency. Some practical bench circuits are also used to demonstrate the efficiency improvements of the proposed methods.
Chapter 4 proposes a general cost-effective noise-tolerant MRF (CENT-MRF) method for a better trade-off between area penalty and noise immunity. It is a general design structure that can easily be applied to all the basic logic gates and functional blocks. Its noise immunity is assessed with the KLD for the ST 65 nm CMOS library. Simulations are done in SPICE using the Berkeley Predictive Technology Model (BPTM) 65nm CMOS Technology [5] and in Spectre based on ST 65nm CMOS models.
Chapter 5 introduces a model to analyze the reliability issues induced by both SETs and noise. It derives constraints for the reliability enhancement of logic circuits, allowing circuits to be designed with both better noise immunity and higher tolerance to soft errors. Simulations combining Hspice and Matlab are given to verify the proposed constraints.
Chapter 6 describes the reliability evaluation of a specific application operator, the Reed-Solomon (RS) decoder. It uses the Monte Carlo method to set up a model that simulates the probabilistic gate behavior, the arithmetic operations of the Galois field GF(2^3) and the decoding modules of the Reed-Solomon decoder, with the gate error probability ranging from 0 to 10^-1. From the simulation results, we can determine which modules are the most sensitive to the gate error probability and therefore the most effective ones to protect if we want to improve the reliability of the decoder.
Chapter 7 presents the conclusions of this work, along with perspectives and future research directions.
Chapter 2
Reliability Evaluation Methodologies and Improvement Techniques
Reliability becomes an important design criterion when CMOS technology scales into the deep-submicron (DSM) or even nanometer regime [4,8]. With higher operating frequency, lower supply voltage and a steady reduction in noise margin, transistors are more fault-prone and produce the right output only with a certain probability. Thus, how to evaluate the reliability of a logic circuit and how to design reliable devices from unreliable or probabilistic components have been ongoing tasks in recent years.
This chapter first presents the preliminaries on reliability. It discusses the mechanisms of soft errors induced by alpha particles or high-energy neutrons in logic circuits. The impact of thermal noise is also presented. It then introduces the reliability evaluation methods and improvement techniques for combinational logic circuits, with their corresponding merits and drawbacks.
2.1 Preliminaries on Reliability
2.1.1 Basic Concepts
An electrical system can be characterized by several properties, e.g., cost, performance and dependability [34]. Among these properties, dependability has many attributes (for instance, reliability, safety, etc.). The authors of [34] defined dependability as the ability to deliver service that can justifiably be trusted, or the ability to avoid service failures that are more frequent and more severe than is acceptable to the user(s) [34].
An electrical system suffers from multiple risks which may cause different impairments to its dependability. In [35] [36], the author classified these impairments into six levels (or states). Figure 2.1 gives a hierarchical illustration.
• Defect/component level: impairments of deviant atomic parts.
• Fault/logic level: impairments of deviant signals, paths or nodes.
• Error/information level: impairments of deviant information (data, internal states, etc.).
• Malfunction/system level: impairments of deviant functional behavior.
• Degradation/service level: impairments of deviant performance.
• Failure/result level: impairments of deviant outputs or actions.
Figure 2.1: Different levels of undependability and their transition [35] [36].
These impairments can propagate from one level to another. For example, a transistor may become defective because of aging effects. It can produce a wrong logic output, which is exposed as a fault. If this fault is exercised and propagates successfully to a latch, an erroneous information bit is generated. If the error-tolerant design is imperfect, this error may influence or even dominate other sub-systems and introduce a malfunction. Finally, the malfunction may have a catastrophic effect on the system service and eventually lead to a system failure [35].
In general, faults can be classified into two categories: permanent and temporary faults [37]. Permanent faults are normally found by the manufacturer during off-line testing. Temporary faults, however, are the main concern after the ICs are deployed in their application environment. Temporary faults can be further classified into two kinds: transient faults and intermittent faults [38–40]. They manifest similarly, but several criteria can be used to determine the category of a fault [38]. Firstly, transient faults are induced mainly by environmental parameters, at random locations and with random duration, while intermittent faults occur at the same location from time to time. Secondly, replacing the defective part fixes the errors induced by intermittent faults but cannot remove transient faults. Temporary faults do not necessarily induce errors, whereas permanent faults always do.
While dependability is a concept that is hard to quantify in a single metric, reliability is a property that can be expressed as “the continuity of correct service” [34] and can be measured as the probability that a system executes its expected function in a specific environment within a given period of time [41]. Normally, reliability is denoted as R(t), or R. The given period of time is often expressed as a finite time range [0, t]. Obviously, the initial reliability R(0) equals 1.
The reliability of an electrical system can be evaluated. In order to measure R(t), two other quantities must be defined, i.e., the unreliability F(t) and the fault rate f(t). The unreliability is the probability that a system fails to execute its specified function under given conditions within a time interval [t_0, t_1]. The fault rate f(t) is the probability density function (PDF) describing the rate at which failures occur during a time interval [t_0, t_1]. Normally t_0 = 0. The relationship between these three quantities is illustrated in Figure 2.2.
Figure 2.2: Relationship of R(t), F(t) and f(t).
From probability theory, we obtain (2.1) and (2.2).
\[ F(t) = \int_0^t f(\tau)\,d\tau, \qquad R(t) = 1 - F(t) = 1 - \int_0^t f(\tau)\,d\tau \tag{2.1} \]
\[ f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt} \tag{2.2} \]
The fault rate f(t) is often estimated from sufficient experimental data. For the reliability evaluation of electrical systems, the exponential form of f(t) is often used, as shown in (2.3).
\[ f(t) = \lambda e^{-\lambda t}, \quad t > 0 \tag{2.3} \]
Thus the reliability R(t) of a system with fault rate f(t) can be expressed as
\[ R(t) = e^{-\lambda t} \tag{2.4} \]
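The step from (2.3) to (2.4) can be spelled out by substituting the exponential density into (2.1). The dummy integration variable \tau is introduced here only for clarity:

```latex
\[
R(t) = 1 - \int_0^t \lambda e^{-\lambda \tau}\,d\tau
     = 1 - \Big[ -e^{-\lambda \tau} \Big]_0^t
     = 1 - \left( 1 - e^{-\lambda t} \right)
     = e^{-\lambda t}
\]
```

As a consistency check, this gives R(0) = 1, in agreement with the initial reliability stated earlier in this section.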
For integrated circuits, the fault rate is often characterized in failures in time (FIT), where one FIT stands for one failure in 10^9 hours [42]. Correspondingly, the reliability is often expressed by the mean time to failure, or MTTF. The MTTF is the expected time to failure of a system or, in other words, the mean of the time to failure (TTF). The relationship between them can be expressed with (2.5), assuming that the time to failure T takes values in [0, ∞).
\[ \mathrm{MTTF} = E(T) = \int_0^\infty t f(t)\,dt = -\int_0^\infty t \frac{dR(t)}{dt}\,dt = -tR(t)\Big|_0^\infty + \int_0^\infty R(t)\,dt = \int_0^\infty R(t)\,dt \tag{2.5} \]
Note that \lim_{t \to \infty} tR(t) = 0. The reliability of a system, especially of an integrated circuit, tends to 0 at the end of its lifetime (t → ∞) because of aging effects, radiation-induced soft errors, thermal noise, etc. With (2.4), the MTTF and the reliability can be expressed as
\[ \mathrm{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}, \qquad R(t) = e^{-t/\mathrm{MTTF}} \tag{2.6} \]
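The relations (2.5) and (2.6) can be checked numerically. The sketch below is only an illustration under the exponential model (2.4): it integrates R(t) with the trapezoidal rule and compares the result with 1/λ. The value of `lam`, the truncation horizon and the step count are arbitrary assumptions chosen for the example.

```python
import math


def reliability(t_hours: float, lam: float) -> float:
    """R(t) = exp(-lambda * t), the exponential model of (2.4)."""
    return math.exp(-lam * t_hours)


def mttf_numeric(lam: float, steps: int = 100_000,
                 horizon_factor: float = 40.0) -> float:
    """Approximate MTTF = integral of R(t) from 0 to infinity.

    The integral is truncated at horizon_factor / lam, where R(t) is
    negligible, and evaluated with the trapezoidal rule.
    """
    upper = horizon_factor / lam
    h = upper / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(upper, lam))
    for k in range(1, steps):
        total += reliability(k * h, lam)
    return total * h


lam = 1e-5  # hypothetical failure rate, in failures per hour
print("numeric MTTF :", mttf_numeric(lam))
print("1 / lambda   :", 1.0 / lam)
print("R(MTTF)      :", reliability(1.0 / lam, lam))  # equals exp(-1)
```

One consequence visible in the last line is that, under this model, the probability of surviving up to t = MTTF is e^-1, i.e. only about 37%.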
Two basic metrics of reliability evaluation exist, depending on the parameters of concern: functional reliability and signal reliability [43–45] [46]. Functional reliability concerns the probability of continuous operation without failures, whereas signal reliability concerns the probability of a correct output of a component, e.g. a transistor, a sub-system block or a logic functional circuit.
Many parameters can influence the reliability of modern integrated circuits, e.g. radiation (alpha particles, neutrons, etc.), thermal noise, aging effects and process variations. This thesis focuses on radiation-induced soft errors and the errors induced by thermal noise, which are discussed in detail in Sections 2.1.2 and 2.1.3.
2.1.2 Radiation-induced Soft Errors
Soft errors induced by radiation are now a threat to the service time of modern ICs. Soft errors are events in which information is corrupted without permanent damage to the device. In contrast, an error with a permanent device failure is called a hard error [47]. Soft errors can have different effects on the system. On the one hand, they can corrupt the information bits of a system. On the other hand, they can induce a malfunction or even a device failure. Normally, soft errors can be recovered from by a reset, or mitigated by a suitable fault-tolerant design.
The rate at which soft errors occur in a given environment is denoted as the SER. Usually, the SER is measured in FIT units. In the semiconductor industry, FIT/Mb or FIT/device is also used to express the sensitivity of an individual element. Typical SER values for electronic systems range between a few hundred and about 100,000 FIT [47].
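To give a feel for these orders of magnitude, the sketch below converts a FIT value into a failure rate per hour and, assuming the constant-rate exponential model of (2.6), into a mean time between soft errors. The helper names and the 8760-hour year are assumptions made for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def fit_to_lambda(fit: float) -> float:
    """Failure rate in failures per hour (1 FIT = 1 failure per 10^9 hours)."""
    return fit * 1e-9


def fit_to_mttf_years(fit: float) -> float:
    """MTTF = 1 / lambda, valid only under the exponential model (2.6)."""
    return 1.0 / (fit_to_lambda(fit) * HOURS_PER_YEAR)


for fit in (100, 100_000):
    print(f"{fit:>7} FIT -> lambda = {fit_to_lambda(fit):.1e} per hour, "
          f"MTTF = {fit_to_mttf_years(fit):.2f} years")
```

Under these assumptions, 100 FIT corresponds to roughly one soft error per 1,100 years of operation for a single system, while 100,000 FIT corresponds to about one soft error every 1.1 years.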
Historically, radiation-induced soft errors occurred mostly in aerospace applications and military environments. The first report of soft errors in space applications was published by Binder et al. in 1975 [48]. The first paper on soft errors occurring at sea level, which described a new physical mechanism for soft errors in DRAMs [49], was presented by May and Woods of Intel at the International Reliability Physics Symposium (IRPS) in 1978. In that paper, soft errors were defined as random, nonrecurring, radiation-induced single-bit errors in memory cells [47,49]. They were caused not by electrical noise or electromagnetic interference but by radiation (more precisely, alpha particles). Since then, many researchers have investigated this issue [47] [50] [51] [52]. In 1996, Ziegler et al. of IBM published a milestone paper [50] in which they analyzed the radiation sources, the SER induced by cosmic rays at different altitudes, the sensitivity and SER of different kinds of SRAM and DRAM, the SER of logic circuits, etc. They concluded that high-energy neutrons induced by cosmic rays can also have a severe impact on memories.
Since the 1970s, different mechanisms of radiation-induced soft errors have been studied. As technology dimensions scale and fabrication processes and materials improve, the dominant source of radiation changes. For applications at terrestrial level, radiation-induced soft errors mainly result from the following two sources [47] [51]:
• Alpha particles emitted by radioactive impurities (uranium and thorium) in package materials [49]. These were shown to be the dominant source of soft errors in DRAMs [53] in the late 1970s.
2.1 Preliminaries on Reliability 9
Table 2.1: Contribution ratio of neutrons to alpha particles for SER for different CMOS elements [6].
Technology | SRAM | Latch | Combinational circuits
130 nm | 1.4 | 3.9 | -
90 nm | 1.1 | 5.6 | > 10
65 nm | 1.0 | 1.7 | 1.1
• High-energy neutrons (more than 1 mega-electron-volt (MeV)) generated by high-energy cosmic rays (more than 2 GeV) interacting with the Earth's atmosphere [50] [47]. These were shown to be the main source of soft errors in DRAMs in the mid-1990s [54].
In 1978, May and Woods of Intel discussed the soft errors induced by alpha particles in DRAMs [49]. When alpha particles (emitted by impurities in the package materials) interact with silicon, electron-hole pairs are generated and collected by the depletion layers, so that they can end up in the storage wells. If the accumulated charge exceeds the critical charge $Q_{crit}$ (defined as the number of electrons that differentiates 0 and 1 in DRAMs), a soft error occurs. This kind of soft error is mostly caused by impurities in the package materials, and for a given technology generation the SER varies widely between application systems.
The high-energy neutrons generated by cosmic rays interacting with the atmosphere can strike a sensitive region and deposit electron-hole pairs, which can pass through the p-n junction. Together, these electron-hole pairs can induce a short current pulse in the internal node struck by the particle. If the sensitive region is located in an SRAM and the collected charge is sufficient to invert the value stored in a memory cell, a soft error occurs. The minimum charge needed to flip the stored value is called the critical charge ($Q_{crit}$) of the SRAM cell. Many studies have pointed out that the SER of a constant-area SRAM increases with CMOS technology scaling [55–57].
The contributions of alpha particles and neutrons change across technology generations. The trend is that alpha particles play a more important role as CMOS dimensions scale from 130 nm to 65 nm, as Table 2.1 shows [6].
A model to estimate the SER in SRAM CMOS has been proposed recently by Hazucha & Svensson [58]. The model can be expressed as
$$SER \propto F \times A_{diff} \times \exp\left(-\frac{Q_{crit}}{Q_s}\right) \qquad (2.7)$$
where
$F$ is the neutron flux with energy > 1 MeV, in particles/(cm^2·s),
$A_{diff}$ is the area of the sensitive region, in cm^2,
$Q_{crit}$ is the critical charge of the SRAM CMOS, in fC,
$Q_s$ is the charge collection efficiency of the SRAM, in fC. (2.8)
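The trend implied by (2.7) can be explored numerically. The sketch below evaluates only the right-hand side of the proportionality, so it gives relative values rather than an absolute SER; the function name and all parameter values are hypothetical and chosen purely for illustration.

```python
import math

def relative_ser(flux, a_diff, q_crit, q_s):
    """Quantity proportional to the SER in (2.7); the constant factor is omitted."""
    return flux * a_diff * math.exp(-q_crit / q_s)

# Hypothetical values for an older and a newer technology node (not measured data).
old_node = relative_ser(flux=1.0, a_diff=1.0, q_crit=10.0, q_s=2.0)
new_node = relative_ser(flux=1.0, a_diff=0.5, q_crit=4.0, q_s=1.5)

print(new_node / old_node)  # ratio of the two relative SER values
```

With these illustrative numbers the smaller sensitive area is outweighed by the smaller ratio of critical charge to collected charge, so the relative SER rises.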
The neutron flux $F$ can be assumed constant when estimating the SER of different technologies. The sensitive area $A_{diff}$ decreases with CMOS dimension scaling. The critical charge $Q_{crit}$ depends on the characteristics of the circuit, especially the supply voltage $V_{dd}$ and the capacitance of the drain node. $Q_s$ measures the charge collection efficiency of the node and decreases with decreasing feature size. Because of the exponential dependence, the ratio of $Q_{crit}$ to $Q_s$ plays an important role in the SER. Both quantities decrease with CMOS geometric scaling, and the critical charge approaches the charge collection efficiency as technology generations advance. Shivakumar et al. examined the $Q_{crit}$ of SRAMs, latches and logic circuits for different technology generations and showed this trend clearly in Figure 2.3 [10]. Hazucha et al. used this model to estimate the SER of different technologies for a constant-area SRAM array and pointed out that the SER per chip increases linearly with decreasing feature size [10,58].
Figure 2.3: Critical charge of different technology generations (figure from [10]).
Although alpha particles and neutrons generate soft errors through different mechanisms, both do so by producing a sufficiently strong transient current pulse [8] [59]. If this current pulse occurs in a storage cell (e.g. a flip-flop, a latch or a memory cell) and is strong enough, the logic state held in the cell is reversed, resulting in an SEU. If the transient current pulse occurs in a logic circuit, it is converted into a transient voltage pulse (SET). If this SET is sufficient in amplitude and duration, it is captured by a latch or flip-flop and produces a soft error.
Compared with SEUs in memory cells, SETs induce soft errors in combinational logic circuits through a different mechanism. First, a SET is generated, manifesting itself as a voltage glitch or a current disturbance. The SET must then propagate through sensitive paths and finally be captured as a soft error at a latch or flip-flop. Three masking effects reduce the probability that a SET is captured [10] [9] [11]:
• Logical masking occurs when the influence of a SET at a node is eliminated by the gate it drives or by subsequent gates;
• Electrical masking occurs when a SET does not have sufficient amplitude or duration, so that it is attenuated as it propagates towards the output of the circuit;
• Latching-window masking occurs when a SET propagates to a latch or flip-flop but falls outside the clock window in which it could be captured.
Owing to these masking effects, the SER of combinational logic circuits has remained low enough to be neglected [60]. However, as the feature size scales down to deep sub-micron (DSM) or even nanometer dimensions, transistors become more susceptible to radiation-induced charge. The electrical masking effect is diminished because the lower supply voltage and threshold voltage relax the requirements on the amplitude and duration of a SET. Also, the higher clock frequency increases the probability that a SET is captured by a memory cell and reduces the effect of latching-window masking.
As technology dimensions shrink, these masking effects influence the SER of logic circuits differently. In [10], the authors pointed out that electrical masking has a significant effect on the SER of logic circuits in all technology generations and does not diminish with the scaling of feature size. In contrast, the latching-window masking effect is reduced as dimensions scale and speed increases.
A critical path must exist from the SET location to a latch; otherwise the SET is logically masked by the logic circuit. Moreover, to propagate through the logic, the SET must have sufficient amplitude and duration. A model for simulating transient fault propagation in combinational circuits was proposed in [61] [62], which also give expressions for the amplitude and duration that a transient voltage pulse needs in order to propagate.
Many authors have pointed out that the SER of combinational circuits increases linearly with the operating frequency [63] [47] [64]. A higher clock frequency increases the probability that a SET is captured and relaxes the requirement on the duration of the voltage glitch or current disturbance.
Mitra et al. [63] calculated the SERs of individual elements (memory cells, combinational logic and sequential logic circuits) in modern applications with high clock frequency (Figure 2.4). Later, Gill et al. [64] compared the chip-level SER of combinational and sequential logic circuits at 32 nm and pointed out that the SER of combinational logic is not a dominant contributor at chip level in 32 nm technology.
Figure 2.4: SER contributions of individual elements (redrawn according to [63]).
2.1.3 The Influence of Thermal Noise
Although radiation-induced soft errors are a critical problem, a more serious issue is the increasing probability that thermal noise crosses the threshold voltage $V_{th}$. When the thermal noise crosses $V_{th}$, a bit-flip occurs and may result in a soft error.
Thermal noise is a stationary Gaussian stochastic process with zero mean. In integrated circuits, a transistor can be described by an RC model and, according to the Johnson–Nyquist formula, the root-mean-square (RMS) thermal noise voltage is given by (2.9), where $k$ is the Boltzmann constant, $T$ is the temperature and $C$ is the gate capacitance [2].
$$U_n = \sqrt{\frac{kT}{C}} \qquad (2.9)$$
From (2.9), we can infer that the RMS thermal noise voltage $U_n$ increases steadily with miniaturization. One reason is that the higher integration density and power consumption produce more heat, which raises the local temperature $T$. The other is that the capacitance of a transistor decreases with CMOS dimension scaling.
Table 2.2: The trend of supply voltage and frequency according to the 2013 edition of the ITRS report [7].
Implementation year | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020
Supply voltage (V) | 0.86 | 0.85 | 0.83 | 0.81 | 0.80 | 0.78 | 0.77 | 0.75
On-chip frequency (GHz) | 5.50 | 5.72 | 5.95 | 6.19 | 6.44 | 6.69 | 6.96 | 7.24
Meanwhile, the Rice formula [65,66] gives the mean frequency $v$ at which Gaussian white noise crosses a threshold voltage $V_{th}$ in an integrated circuit, as in (2.10), where $V_{th}$ is the threshold voltage of the transistor, $U_n$ is the RMS thermal noise voltage, and $f_c$ is assumed to be equal to the clock frequency [2].
$$v = \frac{2}{\sqrt{3}} \exp\left[-\frac{1}{2}\left(\frac{V_{th}}{U_n}\right)^2\right] \times f_c \qquad (2.10)$$
From (2.9) and (2.10), the increasing thermal noise voltage $U_n$ and the decreasing threshold voltage that accompany CMOS dimension shrinking make the bit-flip rate $v$ increase rapidly. In [2], the author evaluated the bit-flip rate $v$ due to thermal noise for different numbers of transistors and different clock frequencies. The results shown in Figure 2.5 indicate that the ratio $V_{th}/U_n$ has an exponential influence on the bit-flip rate.
[Figure 2.5 plot. Legend: 2 GHz, 1 transistor; 2 GHz, 10^8 transistors; 2 GHz, 10^9 transistors; 2 GHz, 10^10 transistors; 20 GHz, 10^10 transistors.]
Figure 2.5: Bit-flip rate due to thermal noise (figure from [2]).
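Equations (2.9) and (2.10) can be combined in a few lines of code. The sketch below is only an illustration: the temperature, the capacitance and the threshold-to-noise ratios are hypothetical values, the function names are ours, and scaling the rate by the number of transistors assumes identical, independent transistors, which is an assumption of this sketch rather than a statement from [2].

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K

def thermal_noise_rms(temperature_k, capacitance_f):
    """RMS thermal noise voltage U_n = sqrt(kT/C), as in (2.9)."""
    return math.sqrt(BOLTZMANN * temperature_k / capacitance_f)

def bit_flip_rate(v_th, u_n, f_c, n_transistors=1):
    """Mean threshold-crossing frequency from (2.10).

    Assumption of this sketch: the chip-level rate is the per-transistor
    rate multiplied by the number of (independent, identical) transistors.
    """
    per_transistor = 2.0 / math.sqrt(3.0) * math.exp(-0.5 * (v_th / u_n) ** 2) * f_c
    return n_transistors * per_transistor

# Hypothetical operating point: 300 K, 1 fF gate capacitance, 2 GHz clock.
u_n = thermal_noise_rms(300.0, 1e-15)
for ratio in (8.0, 10.0, 12.0):
    print(ratio, bit_flip_rate(ratio * u_n, u_n, 2e9, n_transistors=10**8))
```

Because the ratio enters through a squared exponent, a small change in $V_{th}/U_n$ changes the rate by many orders of magnitude.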
Table 2.2 shows the main CMOS technology characteristics predicted in the 2013 edition of the ITRS report. Here we focus on the supply voltage and the on-chip local frequency. As the feature size scales, the decreasing supply voltage and the increasing thermal noise approach each other, which makes the bit-flip rate of transistors non-negligible.
2.2 Reliability Evaluation Methodologies
As mentioned above, reliability has become an important concern in IC design. Under the impact of radiation, thermal noise, cross-talk noise, etc., transistors, gates and even whole circuits are no longer deterministic but probabilistic. Denote the error probability of a gate by $\epsilon$, $\epsilon \in [0, 0.5]$. A gate with error probability $\epsilon \in (0.5, 1]$ is not considered, because such a gate would produce the wrong output more often than the right one; in that case, an additional NOT gate at the output node makes it more reliable.
Take the inverter as an example. Assume that the logic function of an ideal inverter is $Y(t_2) = \overline{X(t_1)}$, where $t_1$ is the time at which the CMOS switching starts and $t_2$ the time at which it ends. $X(t_1)$ is the input logic value at time $t_1$ and $Y(t_2)$ is the output logic value at time $t_2$. For an ideal inverter, if $X(t_1) = 0$ then certainly $Y(t_2) = 1$, and vice versa.
However, for a probabilistic inverter, if $X(t_1) = 0$, $Y(t_2)$ takes the logic value 0 with some probability, which is the error probability $\epsilon$ defined above. The function of a probabilistic inverter can be expressed as
$$Y(t_2) = \begin{cases} \overline{X(t_1)} & \text{with probability } 1 - \epsilon \\ X(t_1) & \text{with probability } \epsilon \end{cases} \qquad (2.11)$$
More generally, denote the logic function of an ideal gate by $Y = f(x_1, x_2, \ldots, x_m)$, where $x_i$ ($i = 1, 2, \ldots, m$) are the $m$ inputs of the gate and $Y$ is its ideal output. The behavior of a probabilistic gate can then be given as
$$Y = \begin{cases} \overline{f(x_1, x_2, \ldots, x_m)} & \text{with probability } \epsilon \\ f(x_1, x_2, \ldots, x_m) & \text{with probability } 1 - \epsilon \end{cases} \qquad (2.12)$$
where $\epsilon$ is the error probability of the probabilistic logic gate.
Figure 2.6 presents a model of the behavior of a probabilistic logic gate. The probabilistic gate has the same input I as an ideal gate, but it may change its output to the opposite value (0 → 1, or 1 → 0) with probability q = 1 − p, where p is the probability that no fault occurs. In Figure 2.6, IdealGate produces the correct 0 or 1. Rand produces a pseudorandom number m drawn from the standard uniform distribution on the open interval (0, 1). If m > p, meaning that an error occurs at this gate, the block Com outputs a 1 as one input of the exclusive-OR gate, and the output O is the inverse of the output of IdealGate. Otherwise Com outputs a 0 and the output O is the same as the output of IdealGate.
[Figure 2.6 diagram labels: I, IdealGate (0 or 1), Rand, m, Com (0 or 1), O; enclosing block: Probabilistic Logic Gate]
Figure 2.6: The probabilistic gate model.
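The model of Figure 2.6 translates almost directly into code. The following is a minimal sketch under stated assumptions: the function and variable names are hypothetical, the NAND gate is only an example, and Python's generator draws from [0, 1) rather than the open interval, which makes no practical difference here.

```python
import random

def nand(a, b):
    """Ideal two-input NAND gate."""
    return 1 - (a & b)

def probabilistic_gate(ideal_gate, inputs, p, rng=random):
    """Sketch of Figure 2.6: invert the ideal output when the draw m exceeds p."""
    ideal_out = ideal_gate(*inputs)  # IdealGate: the correct 0 or 1
    m = rng.random()                 # Rand: uniform pseudorandom number
    com = 1 if m > p else 0          # Com: 1 signals an error at this gate
    return ideal_out ^ com           # XOR: invert the output on error

# Estimate the error frequency for one input vector; it should approach q = 1 - p.
rng = random.Random(0)
p = 0.9
trials = 100_000
errors = sum(probabilistic_gate(nand, (1, 1), p, rng) != nand(1, 1)
             for _ in range(trials))
print(errors / trials)
```

Passing the ideal gate as a function keeps the error model independent of the logic function, in the same way that the XOR stage of Figure 2.6 is independent of IdealGate.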
Based on the illustrations of probabilistic gate, the definition of probabilistic circuit can be given as [67]:
Definition 1 Consider a combinational circuit or component C that computes the mapping F : I → O. C is called a probabilistic circuit if each i ∈ I is mapped to each j ∈ O with some probability $P(j|i)$, where $\sum_j P(j|i) = 1$.
Figure 2.7 describes a probabilistic circuit C. Denote the input signals by $x_i$, $i = 1, 2, \ldots, M$, and the output signals by $y_j$, $j = 1, 2, \ldots, N$. $X_k$, $k = 0, 1, 2, \ldots, 2^M - 1$, represent the input vectors, with each input signal at logic state 0 or 1. $Y_t$, $t = 0, 1, 2, \ldots, 2^N - 1$, represent the output vectors, with each output signal at logic state 0 or 1. The signal reliability can be expressed as follows:
$$R_j = \sum_{k=0}^{2^M-1} P\big((y_j = \text{correct}), X_k\big) = \sum_{k=0}^{2^M-1} P\big((y_j = \text{correct}) \mid X_k\big)\, P(X_k) \qquad (2.13)$$
where $R_j$ is the signal reliability of output $y_j$, $j \in \{1, 2, \ldots, N\}$. The circuit reliability $R_C$ can thus be given as
$$R_C = \prod_{j=1}^{N} R_j \qquad (2.14)$$
Figure 2.7: A generic combinational logic circuit.
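As a small worked instance of (2.13) and (2.14), the sketch below computes the signal reliabilities of a circuit with M = 2 inputs and N = 2 outputs. The conditional probabilities are invented for illustration and the function names are hypothetical.

```python
from math import prod

def signal_reliability(p_correct_given_x, p_x):
    """R_j = sum_k P(y_j correct | X_k) * P(X_k), as in (2.13)."""
    return sum(pc * px for pc, px in zip(p_correct_given_x, p_x))

def circuit_reliability(signal_reliabilities):
    """R_C as the product of the signal reliabilities, as in (2.14)."""
    return prod(signal_reliabilities)

# Hypothetical data: four equiprobable input vectors X_0 .. X_3.
p_x = [0.25, 0.25, 0.25, 0.25]
p_correct_y1 = [0.99, 0.98, 0.98, 0.95]  # P(y_1 correct | X_k), illustrative
p_correct_y2 = [0.97, 0.97, 0.96, 0.96]  # P(y_2 correct | X_k), illustrative

r1 = signal_reliability(p_correct_y1, p_x)
r2 = signal_reliability(p_correct_y2, p_x)
print(r1, r2, circuit_reliability([r1, r2]))
```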
Many methods have been proposed to estimate the reliability of logic circuits. They can be divided into three groups: simulation-based methods, analytical methods and hybrids of the two. Simulation-based methods require a large number of input vectors and output samples, which costs considerable time and memory; they are therefore not well suited to realistic circuits, especially large ones. Analytical methods use mathematical models to represent the probabilistic behavior of gates and turn the reliability of a circuit into a mathematical calculation. These methods differ from each other in accuracy, scalability and cost (time consumption, memory usage, etc.). Normally, the more accurate methods need more memory and more time, so the trade-off between accuracy and cost is an important concern. Hybrid methods combine simulation and mathematical analysis to balance these merits and drawbacks.
2.2.1 Simulation-based Methodologies
The simulation-based methodologies [68–71] are statistical methods based on a Monte-Carlo (MC) framework. A large number of pseudo-random vectors are generated as inputs to a circuit simulator. Faults, or a model that reproduces the probabilistic behavior of the circuit, are injected into the simulator. The output vectors of the simulator are then sampled and the reliability of the circuit can be estimated. The complexity of these methods depends on the scale of the circuit, especially the number of input signals. Thus these methods are not practical for large circuits.
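The Monte-Carlo procedure described above can be sketched for a toy circuit. This is an illustration only, not the simulators of [68–71]: the two-gate circuit y = NAND(NAND(a, b), c), the uniform input distribution, the common gate error probability and all names are assumptions of the sketch.

```python
import random

def noisy(value, eps, rng):
    """Invert a gate output with probability eps (injected fault)."""
    return value ^ (rng.random() < eps)

def circuit(a, b, c, eps, rng):
    """Toy circuit y = NAND(NAND(a, b), c) with fault-prone gates."""
    n1 = noisy(1 - (a & b), eps, rng)
    return noisy(1 - (n1 & c), eps, rng)

def mc_reliability(eps, samples, seed=0):
    """Fraction of random input vectors for which the output is correct."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(samples):
        a, b, c = (rng.getrandbits(1) for _ in range(3))
        golden = circuit(a, b, c, 0.0, rng)   # fault-free reference run
        correct += circuit(a, b, c, eps, rng) == golden
    return correct / samples

print(mc_reliability(eps=0.05, samples=200_000))
```

The statistical error of the estimate shrinks only with the square root of the number of samples, which is one reason why these methods become expensive for large circuits.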
2.2.2 Analytical Methodologies
Probabilistic Transfer Matrices (PTM) are a gate-level approach for accurately assessing the reliability of a combinational circuit [16,72] [73,74]. The conditional probability of every output vector given a particular input vector can be obtained exactly. Reliability calculation based on PTM involves matrix multiplications. The main drawbacks of PTM are the large amount of memory and the computational complexity required to manipulate the matrices. For a circuit with N inputs and M outputs, these grow exponentially with the number of inputs and outputs as $O(2^{M+N})$, making PTM impractical for the analysis of large circuits. Several methods have been proposed to reduce the time consumption and memory usage. In [75,76], the authors proposed an approach that compresses the memory usage based on Algebraic Decision Diagrams (ADD). However, it is effective only for square matrices, so non-square matrices must be zero-padded [75,77]. Another improvement is described in [78], which reduces the memory usage and time consumption by eliminating useless but expensive intermediate data.
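A minimal sketch of the PTM idea for two gates in series is given below. It assumes a NAND gate followed by an inverter, the same error probability for both gates and equiprobable inputs; the helper names are hypothetical, and only serial composition (a matrix product) is shown.

```python
def gate_ptm(truth_table, eps):
    """PTM of a single-output gate: one row per input vector,
    columns are P(out = 0) and P(out = 1)."""
    return [[eps, 1 - eps] if out else [1 - eps, eps] for out in truth_table]

def matmul(a, b):
    """Plain matrix product; serial connection of two circuit levels."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def reliability(ptm, itm, input_probs):
    """Probability that the faulty circuit gives the ideal output."""
    return sum(p * sum(i * q for i, q in zip(itm_row, ptm_row))
               for p, itm_row, ptm_row in zip(input_probs, itm, ptm))

NAND = [1, 1, 1, 0]  # outputs for inputs 00, 01, 10, 11
NOT = [1, 0]         # outputs for inputs 0, 1

eps = 0.05
ptm = matmul(gate_ptm(NAND, eps), gate_ptm(NOT, eps))  # faulty circuit
itm = matmul(gate_ptm(NAND, 0.0), gate_ptm(NOT, 0.0))  # ideal circuit (eps = 0)
r = reliability(ptm, itm, [0.25] * 4)
print(r)
```

Even in this toy case the matrix has one row per input vector, which illustrates the exponential growth noted above.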
Compared with PTM, Signal Probability Reliability Analysis (SPR) [79,80] is based not on the conditional probability of an output vector given a particular input vector but on the signal probability, defined as the probability that the value of the signal equals 1. For a fault-prone logic signal, four states exist: correct 0, correct 1, incorrect 0 and incorrect 1. SPR calculates the reliability by propagating the signal probabilities from the inputs to the outputs. In this process, the signal probabilities are modified by the probabilistic logic gates, and at the output the cumulative probability of correct 0 and correct 1 is the signal reliability. The advantage of SPR is that its complexity is linear, which dramatically reduces time consumption and memory usage in comparison with PTM. The main drawback of this model is that signal correlations invalidate the straightforward computation of joint probabilities [80]. It therefore gives only an approximate result when signal correlations or reconvergent fanouts exist in the circuit. Many methods have been proposed to deal with signal correlations or reconvergent fanouts, for example WAA [81] and DWAA [82].
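The four-state view of a signal can be illustrated with a first-principles sketch. It is not the matrix formulation of [79,80]: each signal is stored as a distribution over (ideal value, actual value) pairs, gate inputs are treated as independent, and the names and the example circuit are hypothetical.

```python
from itertools import product

# A signal is a dict over (ideal, actual) pairs:
# (0, 0) correct 0, (1, 1) correct 1, (1, 0) incorrect 0, (0, 1) incorrect 1.

def fault_free_input(p_one):
    """Primary input that is always correct and equals 1 with probability p_one."""
    return {(0, 0): 1 - p_one, (1, 1): p_one, (0, 1): 0.0, (1, 0): 0.0}

def propagate(gate, eps, a, b):
    """Four-state distribution at the output of a two-input gate with error eps."""
    out = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for (sa, pa), (sb, pb) in product(a.items(), b.items()):
        ideal = gate(sa[0], sb[0])    # what a fault-free circuit would produce
        actual = gate(sa[1], sb[1])   # what the gate computes from actual inputs
        joint = pa * pb               # assumption: independent inputs
        out[(ideal, actual)] += joint * (1 - eps)
        out[(ideal, 1 - actual)] += joint * eps
    return out

def signal_reliability(sig):
    """Cumulative probability of correct 0 and correct 1."""
    return sig[(0, 0)] + sig[(1, 1)]

def nand(x, y):
    return 1 - (x & y)

eps = 0.05
a, b, c = (fault_free_input(0.5) for _ in range(3))
n1 = propagate(nand, eps, a, b)
y = propagate(nand, eps, n1, c)
print(signal_reliability(y))
```

One pass over the gates is enough here because the example has no reconvergent fanout; with reconvergence the independence assumption in `propagate` no longer holds and the result becomes approximate.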
Multi-Pass SPR (SPR-MP) [80] is an effective method for solving the problem of signal correlations: it calculates partial reliabilities (each obtained for a particular state of a single signal) and accumulates them to obtain the circuit reliability. SPR-MP gives an accurate estimate of the circuit reliability. Its main drawback is that the time consumption increases exponentially with the number of fanouts: if the number of fanouts is $F$, the complexity of SPR-MP is approximately $O(4^F)$.
Another analytical method that overcomes the impact of signal correlations is the Conditioned Probability Matrix (CPM) [83]. It decorrelates the correlated signals using conditional probabilities and accelerates the estimation with a direct SPR approach. Its complexity depends not on the size of the circuit but on the reconvergent sources, defined as the points at which signals become correlated. CPM can thus also cope with the scalability problem as combinational circuits grow.
Han et al. presented a Probabilistic Gate Model (PGM) in [84]. This model relates the probability of the output node to the probabilities of the input signals and the error probability of the logic gate. Two computational algorithms are proposed, with different merits. The approximate algorithm evaluates the circuit reliability without considering signal correlations, and its complexity increases linearly with the number of gates. The accurate algorithm takes signal dependencies into account and gives an accurate assessment of the reliability of a circuit, but has exponential complexity in the worst case.
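One simple way to write such a relation is sketched below. It is our own illustration and is not taken from [84]: the output of a gate is 1 either when the fault-free output is 1 and no error occurs, or when the fault-free output is 0 and an error occurs. Inputs are treated as independent, as in the approximate algorithm, and the function names are hypothetical.

```python
def pgm_output(p_ideal_one, eps):
    """P(output = 1) for a gate whose fault-free output is 1
    with probability p_ideal_one and whose error probability is eps."""
    return p_ideal_one * (1 - eps) + (1 - p_ideal_one) * eps

def pgm_nand(p_a, p_b, eps):
    """NAND gate; p_a and p_b are the probabilities that the inputs are 1.
    Assumption: the inputs are statistically independent."""
    return pgm_output(1 - p_a * p_b, eps)

# Two cascaded NAND gates, evaluated gate by gate in linear time.
eps = 0.05
p_n1 = pgm_nand(0.5, 0.5, eps)
p_y = pgm_nand(p_n1, 0.5, eps)
print(p_n1, p_y)
```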
Sellers et al. first introduced the concept of analysing errors with the Boolean difference in [85], including its use in the design of error-detecting and error-correcting logic circuits. Mohyuddin et al. extended its application to reliability analysis in [86]. The authors presented a gate-level probabilistic error propagation model, the Boolean Difference-based Error Calculator (BDEC), which takes the Boolean function of the gate, the signal error probabilities of the gate inputs and the gate error probability as its input parameters and produces the error probabilities of the outputs. It achieves reliability estimation with high accuracy and scalability, with a complexity linear in the number of gates of the circuit.
The methodologies mentioned above use Boolean logic to estimate the reliability of circuits: the value of a signal or node is discrete, 0 or 1. However, with dimension scaling, and especially with the reduction in noise margin, thermal noise makes this discrete probability distribution unrealistic. A probability-based methodology for nanoscale architecture design was proposed by Bahar et al. in [23]. The authors discussed a novel nanoscale architecture based on Markov Random Fields (MRF) [27]. Later, Lu et al. extended the concept of MRF-based design and proposed a probabilistic logic to replace Boolean logic for nanoscale devices in [87]. According to MRF theory, the conditional probability of a node can be expressed as a function of its neighborhood. Using statistical physics [88], the Probability Density Function (PDF) of a node can be expressed through the energy level contributed by its neighboring nodes, so the probability of the output nodes can be obtained by integration. Rejimon et al. proposed a probabilistic methodology based on Bayesian Networks (BN) in [89]. The authors proposed a probabilistic error model based on BN and estimated the overall error probability of the output by comparing the fault-prone output with the ideal output. It was shown to be a compact and minimal framework for estimating the reliability of circuits.
2.2.3 Hybrid Methods of Reliability Analysis
Unlike the models based on PTM and signal probability, Probabilistic Binomial Reliability (PBR) [72,90] is based on logical masking coefficients, which determine whether a fault in one gate, or simultaneous faults in multiple gates, can propagate to the output. They express the ability of the circuit to logically mask errors at different gates. The most important issue with this model is finding these logical masking coefficients, which are usually obtained using simulators.
Naviner et al. proposed a fault-injection-fault-analysis-based (FIFA) tool for reliability assessment at register transfer level (RTL) in [91]. The authors proposed a model to simulate fault-free and fault-prone gates. Figure 2.8 describes the hardware implementation. If $e_j = 0$, the node is fault-free. If $e_j = 1$, the node is fault-prone, with a specific kind of fault defined by m[m1 : m0]. The fault can be a SET or multiple transient faults, stuck-at-0 or stuck-at-1. The outputs of the fault-free and fault-prone gates are compared to calculate the logical masking coefficients, from which the reliability can be obtained with PBR.
2.3 Reliability Improvement Technologies
To mitigate the impact of soft errors, many reliability enhancement approaches have been proposed, such as Triple-Modular-Redundancy (TMR) [17], Cascade TMR (CTMR) [18] and the NAND-multiplexing methodology [19]. These methodologies effectively improve reliability at the cost of hardware redundancy. However, they cannot deal with the impact of noise.
Figure 2.8: General scheme of a saboteur.
On the other hand, many noise-immune design approaches have been proposed. Nepal et al. proposed a structure based on Markov Random Fields (MRF), in which the noise-immunity can be improved by maximizing the joint probability of valid states at the cost of hardware redundancy [26]. Later, this solution was optimized in order to reduce the number of transistors [28,29].
2.3.1 Space Redundancy
In 1956, John von Neumann introduced the idea of adding redundant unreliable components to build a reliable system [19]. After that, N-tuple modular redundancy (NMR) [92] was developed and widely used to improve the reliability of a generalized system. In particular, when N = 3, NMR is known as TMR (Triple Modular Redundancy).
The degree of redundancy of a system can be denoted as D. In order to mask E soft errors, D should satisfy the following expression [93]:
$$2E + 1 \le D \le (E + 1)^2. \qquad (2.15)$$
A classic TMR needs a voter to determine the final output from the redundant blocks. Several voting strategies for fault-tolerant systems have been proposed, such as majority, median and plurality voting [94], of which majority voting is the most popular.
Take the TMR system of Figure 2.9 as an example. As mentioned above, TMR can only mask one soft error. The number of combinations generated by selecting m elements out of n is given by
$$N_c = \binom{n}{m} = \frac{n!}{(n-m)!\,m!} \qquad (2.16)$$
Thus the number of combinations of selecting 2 elements from 3 is $N_c = \binom{3}{2} = \frac{3!}{(3-2)!\,2!} = 3$. Assume that the reliability of a single block is $q$. Then, with a fault-free majority voting circuit (MVC), the reliability of the TMR system is given by [19]:
$$R_c = q^3 + 3q^2(1 - q) = 3q^2 - 2q^3. \qquad (2.17)$$
If we assume that the reliability of the MVC is $R_M$, the voter acts in series with the redundant blocks and the reliability of the TMR-MVC system is given by
$$R_{TMR\text{-}MVC} = R_M \,(3q^2 - 2q^3).$$
[Figure 2.9 diagram labels: input I, three blocks B, MVC, output O]
Figure 2.9: A generalized TMR system with MVC.
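The binomial reasoning behind (2.16) and (2.17) extends to any odd number of modules. The sketch below is a hedged illustration: the function name is hypothetical, the blocks are assumed to fail independently with the same reliability q, and a non-ideal voter is modelled as a series element that multiplies the result by its own reliability.

```python
from math import comb

def nmr_reliability(q, n=3, voter_reliability=1.0):
    """Reliability of n-modular redundancy with majority voting (n odd).

    Assumptions of this sketch: independent blocks with identical
    reliability q, and a voter in series with the redundant blocks.
    """
    majority = n // 2 + 1
    blocks_ok = sum(comb(n, k) * q**k * (1 - q)**(n - k)
                    for k in range(majority, n + 1))
    return voter_reliability * blocks_ok

for q in (0.4, 0.5, 0.9, 0.99):
    print(q, nmr_reliability(q), nmr_reliability(q, voter_reliability=0.99))
```

For n = 3 the sum reduces to $3q^2 - 2q^3$, which exceeds q only when q is above 0.5, so triplication helps only if the individual blocks are already more likely right than wrong.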
Although TMR is an effective approach to improving the reliability of a block, it has some drawbacks. The area overhead of TMR is $3 \times A_{block}$ if only the block area $A_{block}$ is taken into account. Many more efficient approaches have been proposed, such as selective TMR (STMR) [95], TMR based on significance [96], progressive TMR (PTMR) [97] and progressive mixed modular redundancy (PMMR) [97]. The basic idea behind these approaches is to identify the critical or unreliable sub-blocks and improve the reliability of only these sub-blocks, thereby reducing the area overhead under a given reliability constraint. Alongside the benefit of a reduced area overhead, these approaches have different disadvantages. STMR considers only the gate properties when judging which gates are critical and ignores the logical masking effect of the circuit. The work reported in [96] proposes a more accurate approach that considers the logical masking effect of a single gate. The example shown in [96] indicates that, for the same area overhead, TMR based on significance gives a larger reliability improvement than STMR. A more efficient approach still is PTMR or PMMR [97], whose objective is to minimize the area overhead while maximizing the reliability improvement.
2.3.2 Time Redundancy
A hardware-based error detection system was proposed in [98]. Time redundancy can only detect that an error has occurred; it cannot correct it. A transient error is assumed to last for at most a time δ. So if we compare the outputs of a circuit at times t and t + δ, we can detect whether an error has occurred. Figure 2.10 shows the structure of a time redundancy system.
[Figure 2.10 diagram labels: input I, Comb Circuit, two Latches clocked by Clk1 and Clk2, Comparator, Output, Error Indication]
Figure 2.10: A generalized structure of time redundancy error-detection system.
In Figure 2.10, two latches or flip-flops store the outputs at the different times CLK1 and CLK2, where CLK1 is the clock of the functional latch and CLK2 = CLK1 + δ. This structure can detect transient soft errors with a duration equal to or shorter than δ. It can also detect some, but not all, of the transient soft errors that affect the output for longer than δ. On the other hand, not all soft errors induced by SETs originating at the internal gates of a circuit are guaranteed to be detected, even if these SETs are shorter than δ, especially when there are reconvergent paths in the circuit: the duration of a SET may increase as it propagates, and it may exceed δ by the time it reaches the circuit outputs. The work reported in [99] also discussed time redundancy and analyzed the scheme with different values of δ. Other designs combining time and space redundancy have also been proposed [100].
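The detection condition can be illustrated with a simple timing model. This is a sketch under strong assumptions and not the circuit of [98]: the transient is modelled as an ideal rectangular pulse that inverts the circuit output while it is present, the two latches sample instantaneously, and all names and numbers are hypothetical.

```python
def output_at(time, golden, pulse_start, pulse_width):
    """Circuit output at a given time: inverted while the transient is present."""
    disturbed = pulse_start <= time < pulse_start + pulse_width
    return golden ^ int(disturbed)

def error_detected(t, delta, golden, pulse_start, pulse_width):
    """Compare the values latched at CLK1 = t and CLK2 = t + delta."""
    first = output_at(t, golden, pulse_start, pulse_width)
    second = output_at(t + delta, golden, pulse_start, pulse_width)
    return first != second

# A pulse shorter than delta that hits the first sampling instant is detected.
print(error_detected(t=1.0, delta=0.5, golden=1, pulse_start=0.8, pulse_width=0.3))
# A pulse longer than delta that covers both sampling instants goes unnoticed.
print(error_detected(t=1.0, delta=0.5, golden=1, pulse_start=0.9, pulse_width=0.7))
```

The second case corresponds to the situation described above in which a SET has broadened beyond δ by the time it reaches the output, so both latches store the same wrong value.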
2.3.3 Other Methodologies
CMOS dimension scaling has dramatically improved the performance of transistors and devices in the past decades. To keep the power dissipation constant or low, the supply voltage should scale linearly with the size [2]. The resulting reduction of noise margin means that transistors and devices work in a noisy signal environment. As a result, nanometer-scale transistors are much more prone to soft errors, and the noise immunity of a logic gate or circuit becomes an important design criterion [4].
Because of the intrinsically random nature of noise, traditional fault-tolerant design methodologies based on hardware redundancy, e.g. Triple-Modular-Redundancy (TMR) [17] and Cascade TMR (CTMR) [18], cannot provide noise immunity. The noise interferes with the input signal of each module and degrades the judgment of the majority voter. The NAND-multiplexing methodology proposed by von Neumann [19] can produce a reliable result from unreliable components. However, it needs an extremely high degree of redundancy [20]. Reconfiguration technology is more effective against manufacturing defects or permanent faults, and requires enormous amounts of redundancy [22].
Therefore, the traditional approaches are not effective in achieving noise immunity, given the random and dynamic nature of noise. Probability-based techniques are better suited to this problem [23–25]. One promising noise-tolerant probabilistic design is proposed in [26], which is based on Markov random fields (MRF) [27]. According to Nepal et al., the reliability or noise immunity of a circuit can be improved by maximizing the joint probability of valid input-output pairs, at the cost of hardware redundancy. The design was further optimized in [28] to reduce its area penalty. In [29], Wey et al. proposed a Master-and-Slave MRF (MAS MRF) design structure, which achieves nearly the same noise immunity as the structures in [26,28] but with fewer transistors. In addition to the area overhead, another disadvantage of the approaches proposed in [26,28,29] is that they do not provide a general design structure applicable to all the basic logic gates; in other words, every logic gate has to be designed individually. Another MRF-based approach was proposed in [30], based on Differential Cascode Voltage Switch (DCVS) logic. However, this methodology is suitable only for an inverter.
2.4
Conclusions
In this section, we reviewed the preliminaries on reliability, the methodologies of reliability evaluation and the reliability improvement techniques. Figure 2.11 illustrates the inner logic for the reliability analysis.
Table 2.3 summarizes the characteristics of the reliability evaluation methodologies, with their merits and bottlenecks.
Table 2.4 summarizes the characteristics of the reliability enhancement techniques, with their advantages and drawbacks.
Figure 2.11: The process of reliability analysis.
Table 2.3: Characteristics of reliability evaluation methodologies
| Category | Method | Accuracy | Speed | Memory consumption | Scalability |
|---|---|---|---|---|---|
| Simulation-based methods | Based on Monte Carlo [68–71] | adaptive | slow | high | no |
| Analytical methods | PTM [16,72] | exact | slow | high | no |
| | SPR [79] | low | fast | low | yes |
| | SPR-MP [80] | exact | slow | high | no |
| | CPM [83] | exact | adaptive | high | no |
| | PGM (approximate) [84] | low | fast | low | yes |
| | PGM (accurate) [84] | exact | adaptive | high | no |
| | BDEC [86] | high | fast | low | yes |
| | BN [89] | exact | medium | high | no |
| Hybrid methods | PBR [90,91] | exact | slow | low | no |
Table 2.4: Characteristics of reliability enhancement techniques
| Technique | Concept | Characteristics | Drawbacks |
|---|---|---|---|
| TMR [92–94], CTMR [18] | replicate the unreliable components and select the right result with a voter | easy to implement | large penalty |
| NAND-multiplexing [19,20] | a bulk of NAND gates whose outputs are randomly connected to the inputs of the next bulk of NAND gates | achieves a reliable system with unreliable components | extremely high redundancy |
| STMR [95], PTMR [97], PMMR [97] | select the critical components and apply TMR to them | reduces the area penalty | hard to choose the critical components |
| MRF [26,28], MAS MRF [29] | improve the reliability or the noise immunity of a circuit by maximizing the joint probability of valid input-output pairs, at the cost of hardware redundancy | probabilistic-based technique suitable for noise-tolerant designs | large area penalty, with specific structures for different logic gates |
| DCVS [30] | generate the complementary signal with a differential cascode voltage switch | small area penalty | only applied to the inverter |
Chapter 3
Efficient Evaluation Methodology Based on PTM
The probabilistic transfer matrix (PTM) is a gate-level approach for accurately assessing the reliability of a combinational circuit [16,72]. Its main drawbacks are the excessive amount of memory and the computational complexity required to manipulate the matrices. For a circuit with N inputs and M outputs, these grow exponentially with the number of inputs and outputs, as $O(2^{M+N})$, which makes PTM impractical for the analysis of large circuits. In this chapter, an efficient reliability evaluation methodology based on PTM (denoted ECPTM) is presented, and its algorithm and complexity are discussed mathematically. To compare the cost (time consumption or memory usage) of traditional PTM and ECPTM, a general test benchmark circuit with a controllable size is designed. Finally, further simulations are carried out on specific benchmark circuits and conclusions are drawn.
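To give a feeling for the exponential growth mentioned above, the following sketch counts the entries of a circuit-level PTM and the memory needed to hold it. Dense storage and 8 bytes per entry (double precision) are assumptions made for illustration, and the helper names `ptm_entries` and `ptm_bytes` are hypothetical.

```python
def ptm_entries(num_inputs: int, num_outputs: int) -> int:
    """Number of entries in a circuit PTM: 2^N rows times 2^M columns."""
    if num_inputs < 0 or num_outputs < 0:
        raise ValueError("input and output counts must be non-negative")
    return (1 << num_inputs) * (1 << num_outputs)


def ptm_bytes(num_inputs: int, num_outputs: int, bytes_per_entry: int = 8) -> int:
    """Memory for a dense PTM, assuming a fixed size per entry (8 bytes here)."""
    return ptm_entries(num_inputs, num_outputs) * bytes_per_entry


if __name__ == "__main__":
    for n, m in ((2, 1), (8, 8), (16, 16)):
        gib = ptm_bytes(n, m) / 2**30
        print(f"N={n:2d}, M={m:2d}: {ptm_entries(n, m)} entries, {gib:.6f} GiB")
```

Under these assumptions, a circuit with 16 inputs and 16 outputs already needs $2^{32}$ entries, i.e. 32 GiB for a single dense matrix.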
3.1
Probabilistic Transfer Matrix
Assume a logic circuit with input x and output y, where $x \in \{x_0, x_1, \cdots, x_i, \cdots, x_{2^m-1}\}$ and $y \in \{y_0, y_1, \cdots, y_j, \cdots, y_{2^n-1}\}$. The corresponding PTM, denoted M, has $2^m \times 2^n$ elements, and each element (i, j) is the probability of getting output $y = y_j$ given the occurrence of input $x = x_i$, noted p(j|i). In the case of an ideal (i.e. fault-free) circuit, the PTM contains only zeros and ones and is named ITM (Ideal Transfer Matrix).
Figure 3.1 shows the PTM of a NAND logic gate with error probability 1 − q, where q represents the probability of getting a correct output. It should be noted that the ITM corresponds to q = 1.
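[Figure 3.1 content: NAND gate with inputs $a_1$, $a_0$ and output $b$. Rows are indexed by $(a_1, a_0) \in \{00, 01, 10, 11\}$ and columns by $b \in \{0, 1\}$, with $p = 1 - q$.]

$$PTM_{NAND} = \begin{bmatrix} p & q \\ p & q \\ p & q \\ q & p \end{bmatrix}, \qquad ITM_{NAND} = \begin{bmatrix} 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}$$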
Figure 3.1: Logic gate NAND with its PTM and ITM.
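The construction of Figure 3.1 can be sketched in a few lines. The sketch assumes a single-output gate that gives the correct value with probability q and the wrong value with probability 1 − q for every input vector; the helper names `gate_ptm` and `nand` are hypothetical and only serve the illustration.

```python
from itertools import product


def gate_ptm(truth_fn, num_inputs: int, q: float) -> list:
    """PTM of a single-output gate that is correct with probability q.

    Rows follow the binary order of the input vectors (00, 01, 10, 11 for
    two inputs); column 0 is output 0 and column 1 is output 1.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    p = 1.0 - q  # error probability
    rows = []
    for bits in product((0, 1), repeat=num_inputs):
        correct = truth_fn(*bits)
        row = [p, p]
        row[correct] = q  # the fault-free output gets probability q
        rows.append(row)
    return rows


def nand(a1: int, a0: int) -> int:
    """Fault-free NAND of two bits."""
    return 1 - (a1 & a0)


if __name__ == "__main__":
    print("PTM:", gate_ptm(nand, 2, q=0.95))
    print("ITM:", gate_ptm(nand, 2, q=1.0))  # q = 1 gives the ITM
```

Setting q = 1 reproduces the ITM, in line with the remark that the ITM is the PTM of the fault-free gate.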
The reliability of a circuit is directly extracted from the respective PTM and ITM, according to (3.1), where p(i) denotes the probability that input x is $x_i$.
R = \sum_{ITM(i,j)=1}