
HAL Id: hal-01354682

https://hal.inria.fr/hal-01354682

Submitted on 23 Aug 2016


A backward/forward recovery approach for the preconditioned conjugate gradient method

Massimiliano Fasi, Julien Langou, Yves Robert, Bora Uçar

To cite this version:

Massimiliano Fasi, Julien Langou, Yves Robert, Bora Uçar. A backward/forward recovery approach for the preconditioned conjugate gradient method. Journal of Computational Science, Elsevier, 2016, 17 (3), pp. 522-534. doi:10.1016/j.jocs.2016.04.008. hal-01354682


A Backward/Forward Recovery Approach for the Preconditioned Conjugate Gradient Method


Massimiliano Fasi (a), Julien Langou (b), Yves Robert (c), Bora Uçar (d,*)

(a) The University of Manchester, UK
(b) University of Colorado Denver, USA
(c) ENS Lyon, France & University of Tennessee Knoxville, USA
(d) LIP, UMR 5668 (CNRS - ENS Lyon - UCBL - Université de Lyon - INRIA), Lyon, France

Abstract

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167-176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c × d iterations. When a silent error is detected by the verification mechanism, one can roll back to and re-execute from the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for error detection and correction, allowing a forward recovery (with no rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the preconditioned conjugate gradient algorithm. Finally, we validate our new approach through a set of simulations.

Keywords: Fault tolerance, Silent errors, Algorithm-based fault tolerance, Checkpointing, Sparse matrix-vector multiplication, Preconditioned conjugate gradient method.

2010 MSC: 68M15, 94C12, 65F08

1. Introduction

Silent errors (or silent data corruptions) have become a significant concern in HPC environments [2]. There are many sources of silent errors, from bit flips in cache caused by cosmic radiation to wrong results produced by the arithmetic logic unit. The latter source becomes relevant when the computation is performed in low-voltage mode to reduce the energy consumption of large-scale computations; but low levels of voltage dramatically reduce the dependability of the system.

† Preliminary version appeared in the 16th Workshop on Parallel and Distributed Scientific and Engineering Computing 2015 (PDSEC 2015) [1].
* Corresponding author.
Email addresses: massimiliano.fasi@postgrad.manchester.ac.uk (Massimiliano Fasi), julien.langou@ucdenver.edu (Julien Langou), yves.robert@inria.fr (Yves Robert), bora.ucar@ens-lyon.fr (Bora Uçar)

The key problem with silent errors is the detection latency: when a silent error strikes, the corrupted data is not identified immediately, but instead only when some numerical anomaly is detected in the application behavior. It is clear that this detection can occur with an arbitrary delay. As a consequence, the de facto standard method for resilience, namely checkpointing and recovery, cannot be used directly. Indeed, the method of checkpointing and recovery applies to fail-stop errors (e.g., hardware crashes): such errors are detected immediately, and one can safely recover from the last saved snapshot of the application state. On the contrary, because of the detection latency induced by silent errors, it is often impossible to know when the error struck, and hence to determine which checkpoint (if any) is valid to safely restore the application state. Even if an unlimited number of checkpoints could be kept in memory, there would remain the problem of identifying a valid one.

In the absence of a resilience method, the only known remedy to silent errors is to re-execute the application from scratch as soon as a silent error is detected. On large-scale systems, the silent error rate grows linearly with the number of components, and several silent errors are expected to strike during the execution of a typical large-scale HPC application [3, 4, 5, 6]. The cost of re-executing the application one or more times becomes prohibitive, and other approaches need to be considered.

Several recent papers have proposed to introduce a verification mechanism to be applied periodically in order to detect silent errors. These papers mostly target iterative methods to solve sparse linear systems, which are natural candidates for periodic detection. If we apply the verification mechanism every, say, d iterations, then we have the opportunity to detect the error earlier, namely at most d − 1 iterations after the actual faulty iteration, thereby stopping the progress of a flawed execution much earlier than without detection. However, the cost of the verification may be non-negligible compared to the cost of one iteration of the application, hence the need to trade off for an adequate value of d. Verification can consist in testing the orthogonality of two vectors (cheap) or recomputing the residual (the cost of a sparse matrix-vector product, more expensive). We survey several verification mechanisms in Section 2. Note that in all these approaches a selective reliability model is enforced, where the parts of the application that are not protected are assumed to execute in a reliable mode.

While verification mechanisms speed up the detection of silent errors, they cannot provide correction, and thus they cannot avoid the re-execution of the application from scratch. A solution is to combine checkpointing with verification. If we apply the verification mechanism every d iterations, we can checkpoint every c × d iterations, thereby limiting the amount of re-execution considerably. A checkpoint is always valid because it is preceded by a verification. If an error occurs, it will be detected by one of the c verifications performed before


the next checkpoint. This is exactly the approach proposed by Chen [7] for a variety of methods based on Krylov subspaces, including the widely used Conjugate Gradient (CG) algorithm. Chen [7] gives an equation for the overhead incurred by checkpointing and verification, and determines the best values of c and d by finding a numerical solution of the equation. In fact, computing the optimal verification and checkpoint intervals is a hard problem. In the case of pure periodic checkpointing, closed-form approximations of the optimal period have been given by Young [8] and Daly [9]. However, when combining checkpointing and verification, the complexity of the problem grows. To the best of our knowledge, there is no known closed-form formula, although a dynamic programming algorithm to compute the optimal repartition of checkpoints and verifications is available [10].

For linear algebra kernels, another widely used technique for silent error detection is algorithm-based fault tolerance (ABFT). The pioneering paper of Huang and Abraham [11] describes an algorithm capable of detecting and correcting a single silent error striking a dense matrix-matrix multiplication by means of row and column checksums. ABFT protection has been successfully applied to dense LU [12], LU with partial pivoting [13], Cholesky [14] and QR [15] factorizations, and more recently to sparse kernels like SpMxV (matrix-vector product) and triangular solve [16]. The overhead induced by ABFT is usually small, which makes it a good candidate for error detection at each iteration of the CG algorithm.

The beauty of ABFT is that it can correct errors in addition to detecting them. This comes at the price of an increased overhead, because several checksums are needed to detect and correct, while a single checksum is enough when only detection is required. Still, being able to correct a silent error on the fly allows for forward recovery. No rollback, recovery nor re-execution is needed when a single silent error is detected at some iteration, because ABFT can correct it, and the execution can be safely resumed from that very same iteration. Only when two or more silent errors strike within an iteration do we need to roll back to the last checkpoint. In many practical situations, it is reasonable to expect no more than one error per iteration, which means that most rollback operations can be avoided. In turn, this leads to less frequent checkpoints, and hence less overhead.

The major contributions of this paper are an ABFT framework to detect multiple errors striking the computation, and a performance model that allows us to compare methods that combine verification and checkpointing. The verification mechanism is capable of error detection, or of both error detection and correction. The model determines the optimal intervals for verification and checkpointing, given the cost of an iteration, the overhead associated with verification, checkpointing and recovery, and the rate of silent errors. Our abstract model provides the optimal answer to this question, as a function of all application and resilience parameters.

We instantiate the model using a CG kernel, preconditioned with a sparse approximate inverse [17], and compare the performance of two ABFT-based verification mechanisms. We call the first scheme, capable of error detection only, ABFT-Detection, and the second scheme, which enhances the first one by providing single error correction as well, ABFT-Correction. Through numerical simulations, we compare the performance of both schemes with Online-Detection, the approach of Chen [7] (which we extend to recover from memory errors by checkpointing the sparse matrix in addition to the current iteration vectors). These simulations show that ABFT-Correction outperforms both Online-Detection and ABFT-Detection for a wide range of fault rates, thereby demonstrating that combining checkpointing with ABFT correcting techniques is more efficient than pure checkpointing for most practical situations.

Our discussion focuses on the sequential execution of iterative methods. Yet, all our techniques extend to parallel implementations based on the message passing paradigm (using, e.g., MPI). In an implementation of SpMxV in such a setting, the processing elements (or processors) hold a part of the matrix and the input vector, and hold a part of the output vector at the end. A recent exposition of different algorithms can be found elsewhere [18]. Typically, the processors perform scalar multiply operations on the local matrix and the input vector elements, once all required vector elements have been received from other processors. Implementations of the MPI standard guarantee correct message delivery, i.e., checksums are incorporated into the messages so as to prevent transmission errors (the receives can be done in-place and hence are protected). However, the receiver will obviously get corrupted data if the sender sends corrupted data. Silent errors can indeed strike at a given processor during local scalar multiply operations. Performing error detection and correction locally implies global error detection and correction for the SpMxV. Note that, in this case, the local matrix elements form a matrix which cannot, in general, be assumed to be square (for some iterative solvers it can be). Furthermore, the mean time between failures (MTBF) decreases linearly with the number of processors. This is well known for memoryless distributions of fault inter-arrival times and remains true for arbitrary continuous distributions of finite mean [19]. Therefore, resilient local matrix-vector multiplies are required for resilience in a parallel setting.

The rest of the paper is organized as follows. Section 2 provides an overview of related work. Section 3 provides background on ABFT techniques for the PCG algorithm, and presents both the ABFT-Detection and ABFT-Correction approaches. Section 4 discusses the proposed ABFT method for the SpMxV. Section 5 is devoted to the abstract performance model. Section 6 reports numerical simulations comparing the performance of ABFT-Detection, ABFT-Correction and Online-Detection. Finally, we outline the main conclusions and directions for future work in Section 7.

2. Related work

We classify related work along the following topics: silent errors in general, verification mechanisms for iterative methods, and ABFT techniques.


2.1. Silent errors

Considerable efforts have been directed at error-checking to reveal silent errors. Error detection is usually very costly. Hardware mechanisms, such as ECC memory, can detect and even correct a fraction of errors, but in practice they are complemented with software techniques. The simplest technique is triple modular redundancy and voting [20], which induces a costly verification. For high-performance scientific applications, process replication (each process is equipped with a replica, and messages are quadruplicated) is proposed in the RedMPI library [21]. Elliot et al. [22] combine partial redundancy and checkpointing, and confirm the benefit of dual and triple redundancy. The drawback is that twice the number of processing resources is required (for dual redundancy). A comprehensive list of general-purpose techniques and references is provided by Lu et al. [23].

Application-specific information can be very useful to enable ad-hoc solutions, which dramatically decrease the cost of detection. Many techniques have been advocated. They include memory scrubbing [24] and ABFT techniques (see below).

As already stated, most papers assume a selective reliability setting [25, 26, 27, 28]. It essentially means that one can choose to perform any operation in reliable or unreliable mode, assuming the former to be error-free but energy consuming, and the latter to be error-prone but preferable from an energy consumption point of view.

2.2. Iterative methods

Iterative methods offer a wide range of ad-hoc approaches. For instance, instead of duplicating the computation, Benson et al. [29] suggest coupling a higher-order with a lower-order scheme for PDEs. Their method only detects an error but does not correct it. Self-stabilizing corrections after error detection in the CG method are investigated by Sao and Vuduc [28]. Heroux and Hoemmen [30] design a fault-tolerant GMRES capable of converging despite silent errors. Bronevetsky and de Supinski [31] provide a comparative study of detection costs for iterative methods.

As already mentioned, a nice instantiation of the checkpoint and verification mechanism that we study in this paper is provided by Chen [7], who deals with sparse iterative solvers. For PCG, the verification amounts to checking the orthogonality of two vectors and to recomputing and checking the residual (see Section 3 for further details).

As already mentioned, our abstract performance model is agnostic of the underlying error-detection technique and takes the cost of verification as an input parameter to the model.

2.3. ABFT

The very first idea of algorithm-based fault tolerance for linear algebra kernels is given by Huang and Abraham [11]. They describe an algorithm capable of detecting and correcting a single silent error striking a matrix-matrix multiplication by means of row and column checksums. This germinal idea is then elaborated by Anfinson and Luk [32], who propose a method to detect and correct up to two errors in a matrix representation using just four column checksums. Despite its theoretical merit, the idea presented in their paper is actually applicable only to relatively small matrices, and is hence out of our scope. Bosilca et al. [33] and Du et al. [12] present two relatively recent surveys.

The problem of algorithm-based fault tolerance for sparse matrices is investigated by Shantharam et al. [16], who suggest a way to detect a single error in an SpMxV at the cost of a few additional dot products. Sloan et al. [34] suggest that this approach can be relaxed using randomization schemes, and propose several checksumming techniques for sparse matrices. These techniques are less effective than the previous ones, not being able to protect the computation from faults striking the memory, but they provide interesting theoretical insight.

A preliminary version of this work appeared [1]. In the preliminary version, we dealt with CG only, without preconditioning. In practice, CG is most often used with preconditioning to accelerate the convergence. In the current work, we focus on preconditioned CG (PCG for short). We provide new results and experiments, together with several extensions on single error correction and multiple error detection.

3. CG-ABFT

We streamline our discussion on the CG method; however, the techniques that we describe are applicable to any iterative solver that uses sparse matrix-vector multiplies and vector operations. This list includes many of the non-stationary iterative solvers such as CGNE, BiCG, and BiCGstab, where sparse matrix transpose vector multiply operations also take place. In particular, we consider a PCG variant where the application of the preconditioner reduces to the computation of two SpMxV with triangular matrices [17], which are a sparse factorization of an approximate inverse of the coefficient matrix. In fact, the model discussed in this paper can be profitably employed for any sparse inverse preconditioner that can be applied by means of one or more SpMxV.

We first provide a background on the CG method and overview both Chen’s stability tests [7] and our ABFT protection schemes.

The code for the variant of the PCG method we use is shown in Algorithm 1. The main loop features three sparse matrix-vector multiplies, two inner products (for p_i^T q and ‖r_{i+1}‖²), and three vector operations of the form axpy. Two of the sparse matrix-vector multiplications are carried out at line 10, where we compute z_{i+1} ← M^T M r_{i+1}. Here the matrix M is the upper triangular factor of the approximate inverse M^T M, where M^T M well approximates A^{−1} [17]. Chen's stability tests [7] amount to checking the orthogonality of the vectors p_{i+1} and q, at the price of computing

    p_{i+1}^T q / ( ‖p_{i+1}‖ ‖q‖ ),

and to checking the residual, at the price of an additional SpMxV operation A x_i − b. The dominant cost of these verifications is thus that of the extra SpMxV.


Algorithm 1 The PCG algorithm.
Require: A, M ∈ R^{n×n}, b, x_0 ∈ R^n, ε ∈ R
Ensure: x ∈ R^n : ‖Ax − b‖ ≤ ε
1:  r_0 ← b − A x_0;
2:  z_0 ← M^T M r_0;
3:  p_0 ← z_0;
4:  i ← 0;
5:  while ‖r_i‖ > ε (‖A‖ · ‖r_0‖ + ‖b‖) do
6:    q ← A p_i;
7:    α_i ← ‖r_i‖² / (p_i^T q);
8:    x_{i+1} ← x_i + α_i p_i;
9:    r_{i+1} ← r_i − α_i q;
10:   z_{i+1} ← M^T M r_{i+1};
11:   β ← ‖r_{i+1}‖² / ‖r_i‖²;
12:   p_{i+1} ← z_{i+1} + β p_i;
13:   i ← i + 1;
14: end while
15: return x_i;
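For concreteness, the following Python sketch mirrors Algorithm 1 (our rendering, not the authors' code): it uses SciPy sparse matrices, takes the Frobenius norm as a stand-in for ‖A‖ in the stopping test, and includes no resilience mechanism yet.

import numpy as np
import scipy.sparse.linalg as spla

def pcg(A, M, b, x0, eps, max_steps=50):
    # Unprotected PCG following Algorithm 1; M is the (upper triangular)
    # factor of the approximate inverse, applied as z = M^T (M r).
    x = x0.copy()
    r = b - A @ x                          # line 1
    z = M.T @ (M @ r)                      # line 2: two triangular SpMxV
    p = z.copy()                           # line 3
    # stopping threshold of line 5 (Frobenius norm used for ||A||)
    tol = eps * (spla.norm(A) * np.linalg.norm(r) + np.linalg.norm(b))
    for _ in range(max_steps):
        if np.linalg.norm(r) <= tol:
            break
        q = A @ p                          # line 6: the SpMxV we protect later
        alpha = (r @ r) / (p @ q)          # line 7
        x = x + alpha * p                  # line 8
        r_new = r - alpha * q              # line 9
        z = M.T @ (M @ r_new)              # line 10
        beta = (r_new @ r_new) / (r @ r)   # line 11
        p = z + beta * p                   # line 12
        r = r_new
    return x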

Our only modification to Chen's original approach is that we also save the sparse matrix A in addition to the current iteration vectors. This is needed when a silent error is detected: if this error comes from a corruption in data memory, we need to recover with a valid copy of the data matrix A. This holds for the three methods under study, Online-Detection, ABFT-Detection and ABFT-Correction, which have exactly the same checkpoint cost.

We now give an overview of our own protection and verification mechanisms. We use ABFT techniques to protect the SpMxV, its computations (hence the vector q), the matrix A, and the input vector p_i. Since ABFT protection for vector operations is as costly as repeated computation, we use triple modular redundancy (TMR) for them, for simplicity.

Although theoretically possible, constructing an ABFT mechanism to detect up to k errors is practically not feasible for k > 2. The same mechanism can be used to correct up to ⌊k/2⌋ errors. Therefore, we focus on detecting up to two errors and correcting the error if there was only one. That is, we detect up to two errors in the computation q ← A p_i (two entries in q are faulty), or in p_i, or in the sparse representation of the matrix A. With TMR, we assume that the errors in the computation are not overly frequent, so that two out of three computed values agree (we assume errors do not strike the vector data here). Our fault-tolerant PCG versions thus have the following ingredients: ABFT to detect up to two errors in the SpMxV and correct the error if there was only one; TMR for vector operations; and checkpoint and roll-back in case errors are not correctable.

We assume the selective reliability model, in which all checksums and checksum-related operations are non-faulty; the tests for the orthogonality checks are also non-faulty.

(9)

4. ABFT-SpMxV

Here, we discuss the proposed ABFT method for the SpMxV (combining ABFT with checkpointing is described later in Section 5.2). The proposed ABFT mechanisms are described for detecting single errors (Section 4.1), multiple errors (Section 4.2), and correcting a single error (Section 4.3).

4.1. Single error detection

The overhead of the standard single error correcting ABFT technique is too high for the sparse matrix-vector product case. Shantharam et al. [16] propose a cheaper ABFT-SpMxV algorithm that guarantees the detection of a single error striking either the computation or the memory representation of the two input operands (matrix and vector). Because their results depend on the sparse storage format adopted, throughout the paper we assume that sparse matrices are stored in the compressed storage format by rows (CSR), that is, by means of three distinct arrays, namely Colid ∈ N^{nnz(A)}, Val ∈ R^{nnz(A)} and Rowindex ∈ N^{n+1}; these arrays correspond to the arrays JA, AA and IA in the more standard notation [35, Sec. 3.4]. Here nnz(A) is the number of non-zero entries in A.
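As a reminder of the format, here is a toy CSR instance together with the plain, unprotected SpMxV loop that the checksums below are meant to guard (0-based indices here, unlike the 1-based pseudocode):

import numpy as np

# CSR arrays for the 3x3 matrix
#     [2 0 1]
# A = [0 3 0]
#     [4 0 5]
Val      = np.array([2.0, 1.0, 3.0, 4.0, 5.0])   # non-zero values, row by row
Colid    = np.array([0, 2, 1, 0, 2])             # column index of each value
Rowindex = np.array([0, 2, 3, 5])                # row i occupies Rowindex[i]:Rowindex[i+1]

def spmxv(Val, Colid, Rowindex, x):
    # Plain, unprotected CSR sparse matrix-vector product y = A x.
    n = len(Rowindex) - 1
    y = np.zeros(n)
    for i in range(n):
        for j in range(Rowindex[i], Rowindex[i + 1]):
            y[i] += Val[j] * x[Colid[j]]
    return y

print(spmxv(Val, Colid, Rowindex, np.array([1.0, 1.0, 1.0])))   # [3. 3. 9.]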

Shantharam et al. can protect y ← Ax, where A ∈ R^{n×n} and x, y ∈ R^n. To perform error detection, they rely on a column checksum vector c defined by

    c_j = Σ_{i=1}^{n} a_{i,j}    (1)

and an auxiliary copy x′ of the x vector. After having performed the actual SpMxV, to validate the result, it suffices to compute Σ_{i=1}^{n} y_i, c^T x and c^T x′, and to compare their values. It can be shown [16] that, in case of no errors, these three quantities carry the same value, whereas if a single error strikes either the memory or the computation, one of them must differ from the other two. Nevertheless, this method requires A to be strictly diagonally dominant, a condition that seems to restrict the practical applicability of their method too much. Shantharam et al. need this condition to ensure detection of errors striking an entry of x corresponding to a zero checksum column of A. We further analyze that case and show how to overcome the issue without imposing any restriction on A.

A nice way to characterize the problem is to express it in geometrical terms. Consider the computation of a single entry of the checksum as

    (w^T A)_j = Σ_{i=1}^{n} w_i a_{i,j} = w^T A_j,

where w ∈ R^n denotes the weight vector and A_j the j-th column of A. Let us now interpret this operation as the result of the scalar product ⟨·, ·⟩ : R^n × R^n → R defined by ⟨u, v⟩ ↦ u^T v. It is clear that a checksum entry is zero if and only if the weight vector is orthogonal to the corresponding column of A. In (1), we have chosen w such that w_i = 1 for 1 ≤ i ≤ n, in order to make the computation easier. Let us now see what happens without this restriction.

The problem reduces to finding a vector w ∈ R^n that is not orthogonal to any vector of a basis B = {b_1, ..., b_n} of R^n — the rows of the input matrix. Each of these n vectors is perpendicular to a hyperplane h_i of R^n, and w does not satisfy the condition

    ⟨w, b_i⟩ ≠ 0    (2)

for a given i if and only if it lies on h_i. Since the Lebesgue measure in R^n of a hyperplane of R^n is zero, the union of these hyperplanes is measurable and has measure 0. Therefore, the probability that a vector w randomly picked in R^n does not satisfy condition (2) for some i is zero.

Nevertheless, there are many reasons to consider zero checksum columns. First of all, when working with finite precision, the number of representable elements of R^n is finite, and the probability of randomly picking a vector that is orthogonal to a given one can be larger than zero. Moreover, a coefficient matrix usually comes from the discretization of a physical problem, and the distribution of its columns cannot be considered random. Finally, using a randomly chosen vector instead of (1, ..., 1)^T increases the number of required floating point operations, causing a growth of both the execution time and the number of rounding errors (see Section 6). Therefore, we would like to keep w = (1, ..., 1)^T as the vector of choice, in which case we need to protect SpMxV with matrices having zero column sums. There are many matrices with this property, for example the Laplacian matrices of graphs [36, Ch. 1].

In Algorithm 2, we propose an ABFT SpMxV method that uses weighted checksums and does not require the matrix to be strictly diagonally dominant. The idea is to compute the checksum vector and then shift it by adding to all entries a constant value chosen so that all elements of the new vector are different from zero. We give the generalized result in Theorem 1.

Theorem 1 (Correctness of Algorithm 2). Let A ∈ R^{n×n} be a square matrix, let x, y ∈ R^n be the input and output vectors respectively, and let x′ = x. Let us assume that the algorithm performs the computation

    ỹ ← Ã x̃,    (3)

where à ∈ R^{n×n} and x̃ ∈ R^n are the possibly faulty representations of A and x respectively, while ỹ ∈ R^n is the possibly erroneous result of the sparse matrix-vector product. Let us also assume that the encoding scheme relies on

1. an auxiliary checksum vector c = [ Σ_{i=1}^{n} a_{i,1} + m, ..., Σ_{i=1}^{n} a_{i,n} + m ], where m is such that

       c_j = Σ_{i=1}^{n} a_{i,j} + m ≠ 0    (4)

   for 1 ≤ j ≤ n,

2. an auxiliary checksum y_{n+1} = m Σ_{i=1}^{n} x̃_i,

3. an auxiliary counter s_r, initialized to 0 and updated at runtime by adding the value of the hit element each time the Rowindex array is accessed (line 18 of Algorithm 2),

4. an auxiliary checksum c_r = Σ_{i=1}^{n} Rowindex_i ∈ N.

Then, a single error in the computation of the SpMxV causes one of the following conditions to fail:

i.  c^T x̃ = Σ_{i=1}^{n+1} ỹ_i,
ii. c^T x′ = Σ_{i=1}^{n+1} ỹ_i,
iii. s_r = c_r.

Algorithm 2 Shifting checksum algorithm.
Require: A ∈ R^{n×n}, x ∈ R^n
Ensure: y ∈ R^n such that y = Ax, or the detection of a single error
1:  x′ ← x;
2:  [w, c, m, c_r] ← computeChecksum(A);
    return SpMxV(A, x, x′, w, c, m, c_r);
3:  function computeChecksum(A)
4:    Generate w ∈ R^{n+1};
5:    w ← w_{1:n};
6:    c ← w^T A;
7:    if min(|c|) = 0 then
8:      Find m that verifies (4);
9:      c ← c + m;
10:   end if
11:   c_r ← w^T Rowindex;
      return w, c, m, c_r;
12: end function
13: function SpMxV(A, x, x′, w, c, m, c_r)
14:   w ← w_{1:n};
15:   s_r ← 0;
16:   for i ← 1 to n do
17:     y_i ← 0;
18:     s_r ← s_r + Rowindex_i;
19:     for j ← Rowindex_i to Rowindex_{i+1} − 1 do
20:       ind ← Colid_j;
21:       y_i ← y_i + Val_j · x_ind;
22:     end for
23:   end for
24:   y_{n+1} ← m · w^T x′;
25:   c_y ← w^T y + y_{n+1};
26:   d_x ← c^T x − c_y;
27:   d_x′ ← c^T x′ − c_y;
28:   d_r ← c_r − s_r;
29:   if d_x = 0 ∧ d_x′ = 0 ∧ d_r = 0 then return y_{1:n};
30:   else
31:     error("Soft error is detected");
32:   end if
33: end function

Proof. There are three possible error sources when computing (3), namely:

a. a faulty arithmetic operation during the computation of y,
b. a bit flip in the sparse representation of A,
c. a bit flip in an element of x.

We consider these three cases in the following discussion.

Case a. Let us assume, without loss of generality, that the error has struck at the pth position of y, which implies ỹ_i = y_i for 1 ≤ i ≤ n with i ≠ p, and ỹ_p = y_p + ε, where ε ∈ R \ {0} represents the value of the error that has occurred. Summing up the elements of ỹ gives

    Σ_{i=1}^{n+1} ỹ_i = Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j} x̃_j + m Σ_{j=1}^{n} x̃_j + ε = Σ_{j=1}^{n} c_j x̃_j + ε = c^T x̃ + ε,

which violates condition (i).

Case b. A single error in the matrix A can strike one of the three arrays that constitute its sparse representation:

• a fault in Val that alters the value of an element a_{i,j} implies an error in the computation of ỹ_i, which leads to the violation of the safety condition (i);

• a variation in Colid can zero out an element in position a_{i,j} while shifting its value to position a_{i,j′}, leading again to an erroneous computation of ỹ_i;

• a transient fault in Rowindex entails an incorrect value of s_r, and hence a violation of condition (iii).

Case c. Let us assume, without loss of generality, an error in position p of x. Hence we have x̃_i = x_i for 1 ≤ i ≤ n with i ≠ p, and x̃_p = x_p + ε, for some ε ∈ R \ {0}. Noting that x′ = x, the sum of the elements of ỹ gives

    Σ_{i=1}^{n+1} ỹ_i = Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j} x̃_j + m Σ_{j=1}^{n} x̃_j
                      = Σ_{i=1}^{n} Σ_{j=1}^{n} a_{i,j} x_j + m Σ_{j=1}^{n} x_j + ε Σ_{i=1}^{n} a_{i,p} + ε m
                      = Σ_{j=1}^{n} c_j x_j + ε ( Σ_{i=1}^{n} a_{i,p} + m )
                      = c^T x′ + ε ( Σ_{i=1}^{n} a_{i,p} + m ),

which violates (ii), since Σ_{i=1}^{n} a_{i,p} + m ≠ 0 by definition of m.

Let us remark that computeChecksum in Algorithm 2 does not require the input vector x of SpMxV as an argument. Therefore, in the common scenario of many SpMxV with the same matrix, it is enough to invoke it once to protect several matrix-vector multiplications. This observation will be crucial when discussing the performance of these checksumming techniques.

Shifting the checksum vector by a constant is probably the simplest deterministic approach to relax the strict diagonal dominance hypothesis, but it is not the only one. An alternative solution is described in Algorithm 3, which basically exploits the distributive property of matrix multiplication over matrix addition. The idea is to split the original matrix A into two matrices of the same size, A and Â, such that no column of either matrix has a zero checksum. Two standard ABFT multiplications, namely y ← Ax and ŷ ← Âx, are then performed. If no error occurs in either computation, the sum of y and ŷ is computed in reliable mode and then returned. Let us note that, as we expect the number of non-zeros of  to be much smaller than n, we store both the checksum vector of  and the vector ŷ sparsely.

We do not write down an extended proof of the correctness of this algorithm, and limit ourselves to a short sketch. We consider the same three cases as in the proof of Theorem 1, without introducing any new idea. An error in the computation of y or ŷ can be detected using the dot product between the corresponding column checksum and the vector x. An error in A can be detected by either c_r or an erroneous entry in y or ŷ, as the loop structure of the sparse multiplication algorithm has not been changed. Finally, an error in the pth component of x would make the sums of the entries of y and ŷ differ from c^T x′ and ĉ^T x′, respectively.

Table 1: Overhead comparison for Algorithm 2 and Algorithm 3. Here n denotes the size of the matrix and n′ the number of zero-sum columns.

                                 Algorithm 2    Algorithm 3
    initialization of y          n              n
    computation of y_{n+1}       n              -
    SpMxV overhead               -              n′
    checksum check               2n             2n + 2n′
    computation of c_y and c_ŷ   n              n + n′
    computation of y + ŷ         -              n′
    Total SpMxV overhead         5n             4n + 5n′

The evaluation of the performance of the two algorithms, though straightforward from the point of view of the computational cost, has to be carefully assessed in order to devise a valid and practical trade-off between Algorithm 2 and Algorithm 3.

In both cases, computeChecksum introduces an overhead of O(nnz(A)), but the shift version should in general be faster, as it contains fewer assignments than its counterpart; this changes the constant factor hidden by the asymptotic notation. Nevertheless, as we are interested in performing many SpMxV with the same matrix, this pre-processing overhead becomes negligible.

The function SpMxV has to be invoked once for each multiplication, and hence more care is needed. Copying x and initializing y both require n operations, and the multiplication is performed in time O(nnz(A)), but the split version pays an extra n′ to read the values of the sparse vector b. The cost of the verification step depends instead on the number of zeros of the original checksum vector, which is also the number of non-zero elements of the sparse vector ĉ; let us call this quantity n′. Then the overhead is 4n for the shifting and 3n + 3n′ for the splitting, which also requires the sum of two sparse vectors to return the result. Hence, as summarized in Table 1, the two methods bring different overheads into the computation. Comparing them, it is immediate to see that the shifting method is cheaper when

    n′ > n / 5,

while it has more operations to perform when the opposite inequality holds. In the equality case, we can just choose the first method, because of its cheaper preprocessing phase. In view of this observation, it is possible to devise a simple algorithm that exploits this trade-off to achieve better performance: it suffices to compute the checksum vector of the input matrix, count the number of non-zeros, and choose the detection method accordingly.

We also note that, by splitting the matrix A into, say, ℓ pieces and checksumming each piece separately, we can possibly protect A from up to ℓ errors, by protecting each piece against a single error (obviously, the multiple errors should hit different pieces).

Algorithm 3 Splitting checksum algorithm.
Require: A ∈ R^{n×n}, x ∈ R^n
Ensure: y ∈ R^n such that y = Ax, or the detection of a single error
1:  [c, ĉ, b, c_r] ← computeChecksum(A);
    return SpMxV(A, x, c, ĉ, b, c_r);
2:  function computeChecksum(A)
3:    c, m ← 0;
4:    for i ← 1 to nnz(A) do
5:      ind ← Colid_i;
6:      c_ind ← c_ind + Val_i;
7:      m_ind ← i;
8:    end for
9:    k ← 0;
10:   for i ← 1 to n do
11:     if c_i = 0 ∧ m_i ≠ 0 then
12:       b_{m_i} ← true;
13:       ĉ_i ← Val_{m_i};
14:       c_i ← c_i − ĉ_i;
15:     end if
16:   end for
17:   c_r ← Σ_{i=1}^{n} Rowindex_i;
      return c, ĉ, b, c_r;
18: end function
19: function SpMxV(A, x, c, ĉ, b, c_r)
20:   x′ ← x;
21:   for i ← 1 to n do
22:     y_i ← 0;
23:   end for
24:   s_r ← 0;
25:   for i ← 1 to n do
26:     s_r ← s_r + Rowindex_i;
27:     for j ← Rowindex_i to Rowindex_{i+1} − 1 do
28:       ind ← Colid_j;
29:       if b_j then
30:         ŷ_i ← ŷ_i + Val_j · x_ind;
31:       else
32:         y_i ← y_i + Val_j · x_ind;
33:       end if
34:     end for
35:   end for
36:   c_y ← Σ_{i=1}^{n} y_i;  c_ŷ ← Σ_{i=1}^{n} ŷ_i;
37:   d_x ← c^T x − c_y;  d̂_x ← ĉ^T x − c_ŷ;
38:   d_x′ ← c^T x′ − c_y;  d̂_x′ ← ĉ^T x′ − c_ŷ;
39:   d_r ← c_r − s_r;
40:   if d_x = 0 ∧ d_x′ = 0 ∧ d̂_x = 0 ∧ d̂_x′ = 0 ∧ d_r = 0 then return y + ŷ;
41:   else
42:     error("Soft error is detected");
43:   end if
44: end function

4.2. Multiple error detection

With some effort, the shifting idea in Algorithm 2 can be extended to detect multiple errors striking a single SpMxV. Let us consider the problem of detecting up to k errors in the computation of y ← Ax while introducing an overhead of O(kn). Let k weight vectors w^(1), ..., w^(k) ∈ R^n be such that any square submatrix of

    W = [ w^(1)  w^(2)  ···  w^(k) ]

has full rank. For k = 2 and k = 3, we give a possible choice for the matrix W explicitly (see the example below). For larger values of k, we suggest the use of rectangular random matrices, which besides having full rank submatrices are also well conditioned with high probability [37, 38].
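For illustration, the constant, linear and quadratic weight columns used later in Section 4.3 give such matrices for k = 2 and k = 3; any square submatrix built from distinct rows is a (generalized) Vandermonde block and hence nonsingular:

W^{(2)} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & n \end{pmatrix},
\qquad
W^{(3)} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ \vdots & \vdots & \vdots \\ 1 & n & n^2 \end{pmatrix}.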

To build our ABFT scheme, let us note that, if no error occurs, for each weight vector w^(ℓ) it holds that

    (w^(ℓ))^T A = [ Σ_{i=1}^{n} w_i^(ℓ) a_{i,1}, ..., Σ_{i=1}^{n} w_i^(ℓ) a_{i,n} ],

and hence that

    (w^(ℓ))^T A x = Σ_{i=1}^{n} w_i^(ℓ) a_{i,1} x_1 + ··· + Σ_{i=1}^{n} w_i^(ℓ) a_{i,n} x_n = Σ_{i=1}^{n} Σ_{j=1}^{n} w_i^(ℓ) a_{i,j} x_j.

Similarly, the sum of the entries of y weighted with the same w^(ℓ) is

    Σ_{i=1}^{n} w_i^(ℓ) y_i = w_1^(ℓ) y_1 + ··· + w_n^(ℓ) y_n
                            = w_1^(ℓ) Σ_{j=1}^{n} a_{1,j} x_j + ··· + w_n^(ℓ) Σ_{j=1}^{n} a_{n,j} x_j
                            = Σ_{i=1}^{n} Σ_{j=1}^{n} w_i^(ℓ) a_{i,j} x_j,

and we can conclude that

    Σ_{i=1}^{n} w_i^(ℓ) y_i = (w^(ℓ))^T A x,

for any w^(ℓ) with 1 ≤ ℓ ≤ k.

To convince ourselves that with these checksums it is actually possible to detect up to k errors, let us suppose that k′ errors, with k′ ≤ k, occur in positions p_1, ..., p_{k′}, and let us denote by ỹ the faulty vector with ỹ_{p_i} = y_{p_i} + ε_{p_i}, for ε_{p_i} ∈ R \ {0} and 1 ≤ i ≤ k′, and ỹ_i = y_i otherwise. Then, for each weight vector we have

    Σ_{i=1}^{n} w_i^(ℓ) ỹ_i − Σ_{i=1}^{n} w_i^(ℓ) y_i = Σ_{j=1}^{k′} w_{p_j}^(ℓ) ε_{p_j}.


Said otherwise, the occurrence of the k′ errors is not detected if and only if, for 1 ≤ ℓ ≤ k, the values ε_{p_i} satisfy

    Σ_{j=1}^{k′} w_{p_j}^(ℓ) ε_{p_j} = 0.    (5)

We claim that there cannot exist a vector (ε_{p_1}, ..., ε_{p_{k′}})^T ∈ R^{k′} \ {0} such that all the conditions in (5) are simultaneously verified. These conditions can be expressed more compactly as the linear system

    [ w_{p_1}^(1)  ···  w_{p_{k′}}^(1) ] [ ε_{p_1}    ]   [ 0 ]
    [     ⋮         ⋱        ⋮         ] [    ⋮       ] = [ ⋮ ]
    [ w_{p_1}^(k)  ···  w_{p_{k′}}^(k) ] [ ε_{p_{k′}} ]   [ 0 ]

Denoting by W* the coefficient matrix of this system, it is clear that the errors cannot be detected if and only if (ε_{p_1}, ..., ε_{p_{k′}})^T ∈ ker(W*) \ {0}. Because of the properties of W, we have rk(W*) = k′. Moreover, it is clear that the rank of the augmented matrix [W* | 0] is k′ as well. Hence, by means of the Rouché-Capelli theorem, the solution of the system is unique and the null space of W* is trivial. Therefore, this construction can detect the occurrence of k′ errors during the computation of y by comparing the values of the weighted sums y^T w^(ℓ) with the results of the dot products ((w^(ℓ))^T A) x, for 1 ≤ ℓ ≤ k.

However, to get a true extension of the algorithm described in the previous section, we also need to make it able to detect errors that strike the sparse representation of A or that of x. The first case is simple: the k errors can strike the Val or Colid arrays, leading to at most k errors in ỹ, or the Rowindex array, where they can be caught using k weighted checksums of the Rowindex vector. Detection in x is much trickier, since neither the algorithm just described nor a direct generalization of Algorithm 2 can manage this case. Nevertheless, a proper extension of the shifting technique is still possible. Let us note that there exists a matrix M ∈ R^{k×n} such that

    W^T A + M = W^T.

The elements of such an M can be easily computed once the checksum rows are known. Let x̃ ∈ R^n be the faulty vector, defined by

    x̃_{p_i} = x_{p_i} + ε_{p_i},  1 ≤ i ≤ k′,
    x̃_i = x_i  otherwise,

for some k′ ≤ k, and let us define ỹ = A x̃. Now, let us consider a checksum vector x′ ∈ R^n such that x′ = x, and let us assume that it cannot be modified by a transient error. For 1 ≤ ℓ ≤ k, it holds that

    Σ_{i=1}^{n} w_i^(ℓ) ỹ_i + Σ_{j=1}^{n} m_{ℓ,j} x̃_j
        = Σ_{i=1}^{n} Σ_{j=1}^{n} w_i^(ℓ) a_{i,j} x_j + Σ_{i=1}^{k′} ε_{p_i} ( Σ_{j=1}^{n} w_j^(ℓ) a_{j,p_i} ) + Σ_{j=1}^{n} m_{ℓ,j} x_j + Σ_{i=1}^{k′} ε_{p_i} m_{ℓ,p_i}
        = Σ_{j=1}^{n} ( Σ_{i=1}^{n} w_i^(ℓ) a_{i,j} + m_{ℓ,j} ) x_j + Σ_{i=1}^{k′} ε_{p_i} ( Σ_{j=1}^{n} w_j^(ℓ) a_{j,p_i} + m_{ℓ,p_i} )
        = Σ_{j=1}^{n} w_j^(ℓ) x_j + Σ_{i=1}^{k′} ε_{p_i} w_{p_i}^(ℓ)
        = (w^(ℓ))^T x′ + Σ_{i=1}^{k′} ε_{p_i} w_{p_i}^(ℓ).

Therefore, an error is not detected if and only if the linear system

    [ w_{p_1}^(1)  ···  w_{p_{k′}}^(1) ] [ ε_{p_1}    ]   [ 0 ]
    [     ⋮         ⋱        ⋮         ] [    ⋮       ] = [ ⋮ ]
    [ w_{p_1}^(k)  ···  w_{p_{k′}}^(k) ] [ ε_{p_{k′}} ]   [ 0 ]

has a non-trivial solution. But we have already seen that such a situation can never happen, and we can thus conclude that our method, whose pseudocode is given in Algorithm 4, can also detect up to k errors occurring in x. Therefore, we have proven the following theorem.

Theorem 2 (Correctness of Algorithm 4). Let us consider the same notation as in Theorem 1. Let W̃ ∈ R^{(n+1)×k} be a matrix such that any square submatrix is full rank, and let us denote by W ∈ R^{n×k} the matrix of its first n rows. Let us assume an encoding scheme that relies on

1. an auxiliary checksum matrix C = (W^T A)^T,
2. an auxiliary checksum matrix M = W − C,
3. a vector of auxiliary counters s_Rowindex, initialized to the null vector and updated at runtime as in lines 15-17 of Algorithm 4,
4. an auxiliary checksum vector c_Rowindex = W̃^T Rowindex.

Then, up to k errors striking the computation of y, or the memory locations that store A or x, cause one of the following conditions to fail:

i.  W^T y = C^T x,
ii. W^T (x′ − y) = M^T x,
iii. s_Rowindex = c_Rowindex.

Let us note that we have just shown that our algorithm can detect up to k errors striking only A, only x, or only the computation. Nevertheless, this result holds even when the errors are distributed among the possible cases, as long as at most k errors rely on the same checksums.

It is clear that the execution time of the algorithm depends on both nnz(A) and k. For the computeChecksums function, the cost is, assuming that the weight matrix W is already known, O(k nnz(A)) for the computation of C, and O(kn) for the computation of M and c_Rowindex. Hence the number of performed operations is O(k nnz(A)). The overhead added to the SpMxV depends instead on the computation of four checksum matrices, which leads to a number of operations that grows asymptotically as kn.

4.3. Single error correction

We now discuss single error correction, using Algorithm 4 as a reference. We describe how a single error striking either memory or computation can be not only detected but also corrected at line 27 (instead of just emitting an error). We use only two checksum vectors; that is, we describe the correction of single errors assuming that two errors cannot strike the same SpMxV. At the end of the section, we generalize this approach and discuss how single error correction and double error detection can be performed concurrently by exploiting three linearly independent checksum vectors. Whenever a single error is detected, regardless of its location (computation or memory), it is corrected by means of a succession of steps. When one or more errors are detected, the correction mechanism tries to determine their multiplicity and, in the case of a single error, which memory locations have been corrupted or which computation has miscarried. Errors are then corrected using the values of the checksums and, if need be, partial recomputations of the result are performed.

As we did for multiple error detection, we require that any 2 × 2 submatrix of W ∈ R^{n×2} has full rank. The simplest example of a weight matrix having this property is probably

    W = [ 1 1
          1 2
          1 3
          ⋮ ⋮
          1 n ].

To detect errors striking Rowindex, we compute the ratio ρ of the second component of d_r to the first one, and check whether its distance from an integer is smaller than a certain threshold parameter ε. If this distance is smaller, the algorithm concludes that the σth element of Rowindex, where σ = round(ρ) is the nearest integer to ρ, is faulty; it performs the correction by subtracting the first component of d_r from Rowindex_σ, and recomputes y_σ and y_{σ−1} if the error in Rowindex_σ is a decrement, or y_{σ+1} if it was an increment. Otherwise, it just emits an error.

Algorithm 4 Shifting checksum algorithm for the detection of k errors.
Require: A ∈ R^{n×n}, x ∈ R^n
Ensure: y ∈ R^n such that y = Ax, or the detection of up to k errors
1:  x′ ← x;
2:  [W̃, C, M, c̃] ← computeChecksums(A, k);
    return SpMxV(A, x, x′, W̃, C, M, k, c̃);
3:  function computeChecksums(A, k)
4:    Generate W̃ = [ w^(1) ··· w^(k) ] ∈ R^{(n+1)×k};
5:    W ← W̃_{1:n,*} ∈ R^{n×k};
6:    C^T ← W^T A;
7:    M ← W − C;
8:    c_Rowindex ← W̃^T Rowindex;
      return W̃, C, M, c_Rowindex;
9:  end function
10: function SpMxV(A, x, x′, W̃, C, M, k, c̃)
11:   W ← W̃_{1:n,*} ∈ R^{n×k};
12:   s̃ ← [0, ..., 0];
13:   for i ← 1 to n do
14:     y_i ← 0;
15:     for j ← 1 to k do
16:       s̃_j ← s̃_j + w̃_{i,j} · Rowindex_i;
17:     end for
18:     for j ← Rowindex_i to Rowindex_{i+1} − 1 do
19:       ind ← Colid_j;
20:       y_i ← y_i + Val_j · x_ind;
21:     end for
22:   end for
23:   d_x ← W^T y − C^T x;
24:   d_x′ ← W^T (x′ − y) − M^T x;
25:   d_r ← c̃ − s̃;
26:   if d_x = 0 ∧ d_x′ = 0 ∧ d_r = 0 then return y;
27:   else
28:     error("Soft errors are detected");
29:   end if
30: end function

The correction of errors striking Val, Colid and the computation of y are performed together. Let now ρ be the ratio of the second component of d_x to the first one. If ρ is near enough to an integer σ, the algorithm computes the checksum matrix C′ = W^T A and considers the number z_C̃ of non-zero columns of the difference matrix C̃ = |C^T − C′|. At this stage, four cases are possible:

• If z_C̃ = 0, then the error is in the computation of y_σ, and can be corrected by simply recomputing this value.

• If z_C̃ = 1, then the error has struck an element of Val. Let us call f the index of the non-zero column of C̃. The algorithm finds the element of Val corresponding to the entry at row σ and column f of A and corrects it by using the column checksums, much as described for Rowindex. Afterwards, y_σ is recomputed to fix the result.

• If z_C̃ = 2, then the error concerns an element of Colid. Let us call f_1 and f_2 the indices of the two non-zero columns, and m_1, m_2 the first and last elements of Colid corresponding to non-zeros in row σ. It is clear that there exists exactly one index m* between m_1 and m_2 such that either Colid_{m*} = f_1 or Colid_{m*} = f_2. To correct the error it suffices to switch the current value of Colid_{m*}, i.e., putting Colid_{m*} = f_2 in the former case and Colid_{m*} = f_1 in the latter. Again, y_σ has to be recomputed.

• If z_C̃ > 2, then errors can be detected but not corrected, and an error is emitted.

To correct errors striking x, the algorithm computes ρ, the ratio of the second component of d_x′ to the first one, and checks that the distance between ρ and the nearest integer σ is smaller than ε. Provided that this condition is verified, the algorithm computes the value of the error, τ = Σ_{i=1}^{n} x_i − c_{x′} (the difference between the sum of the current entries of x and their checksummed value), and corrects x_σ ← x_σ − τ. The result is updated by subtracting from y the vector y^τ = A x^τ, where x^τ ∈ R^n is such that x^τ_σ = τ and x^τ_i = 0 otherwise.

Let us now investigate how detection and correction can be combined and let us give some details about the implementation of ABFT-Correction as defined in Section 3. Indeed, note that double errors could be shadowed when using Algorithm 2, although the probability of such an event is negligible.

Let us restrict ourselves to an easy case, without considering errors in x. As usual, we compute the column checksum matrix

    C = (W^T A)^T,

and then compare the two entries of C^T x ∈ R² with the weighted sums

    ỹ_1^c = Σ_{i=1}^{n} ỹ_i    and    ỹ_2^c = Σ_{i=1}^{n} i ỹ_i,

where ỹ is the possibly faulty vector computed by the algorithm. It is clear that if no error occurs, the computation verifies the condition δ = ỹ^c − C^T x = 0. Furthermore, if exactly one error occurs, we have δ_1, δ_2 ≠ 0 and δ_2 / δ_1 ∈ N, and if two errors strike the vectors protected by the checksums, the algorithm is able to detect them by verifying that δ ≠ 0.

At this point it is natural to ask whether this information is enough to build a working algorithm, or whether some border cases can bias its behavior. In particular, when δ_2 / δ_1 = p ∈ N, it is not clear how to discern between single and double errors. Let ε_1, ε_2 ∈ R \ {0} be the values of two errors occurring at positions p_1 and p_2 respectively, and let ỹ ∈ R^n be such that

    ỹ_i = y_i          for 1 ≤ i ≤ n, i ≠ p_1, p_2,
    ỹ_{p_1} = y_{p_1} + ε_1,
    ỹ_{p_2} = y_{p_2} + ε_2.

Then the conditions

    δ_1 = ε_1 + ε_2,    (6)
    δ_2 = p_1 ε_1 + p_2 ε_2    (7)

hold. Therefore, if ε_1 and ε_2 are such that

    p (ε_1 + ε_2) = p_1 ε_1 + p_2 ε_2,    (8)

it is not possible to distinguish these two errors from a single error of value ε_1 + ε_2 occurring in position p. This issue can be solved by introducing a new set of weights, and hence a new row of column checksums. Let us consider a weight matrix Ŵ ∈ R^{n×3} that includes a quadratic weight vector,

    Ŵ = [ 1 1 1
          1 2 4
          1 3 9
          ⋮ ⋮ ⋮
          1 n n² ],

and the three-dimensional vector

    δ̂ = (Ŵ^T A) x − Ŵ^T ỹ,

whose components can be expressed as

    δ̂_1 = ε_1 + ε_2,
    δ̂_2 = p_1 ε_1 + p_2 ε_2,
    δ̂_3 = p_1² ε_1 + p_2² ε_2.


To be confused with a single error in position p, ε_1 and ε_2 have to be such that

    p (ε_1 + ε_2) = p_1 ε_1 + p_2 ε_2

and

    p² (ε_1 + ε_2) = p_1² ε_1 + p_2² ε_2

hold simultaneously for some p ∈ N. In other words, the possible values of the errors are the solutions of the linear system

    [ (p − p_1)    (p − p_2)   ] [ ε_1 ]   [ 0 ]
    [ (p² − p_1²)  (p² − p_2²) ] [ ε_2 ] = [ 0 ].

It is easy to see that the determinant of the coefficient matrix is

    (p − p_1)(p − p_2)(p_2 − p_1),

which always differs from zero, as long as p, p_1 and p_2 differ pairwise. Thus the matrix is invertible, and the solution space of the linear system is the trivial kernel (ε_1, ε_2) = (0, 0). Thus, using Ŵ as the weight matrix guarantees that it is always possible to distinguish a single error from double errors.
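A small numeric illustration with hypothetical error values: with the quadratic weight column, two simultaneous errors produce discrepancies that no single position p can explain, so they are reported as a double error instead of being miscorrected.

import numpy as np

n = 6
W_hat = np.stack([np.ones(n),
                  np.arange(1, n + 1, dtype=float),
                  np.arange(1, n + 1, dtype=float) ** 2], axis=1)

y = np.zeros(n)                    # reference (error-free) result
y_faulty = y.copy()
y_faulty[1] += 0.5                 # error e1 at position p1 = 2
y_faulty[4] += 1.0                 # error e2 at position p2 = 5

d = W_hat.T @ (y_faulty - y)       # (e1+e2, p1*e1+p2*e2, p1^2*e1+p2^2*e2)
rho = d[1] / d[0]                  # would equal p for a single error
# a single error at integer position rho must also satisfy rho^2 = d3/d1
single = np.isclose(rho, round(rho)) and np.isclose(rho ** 2, d[2] / d[0])
print(rho, single)                 # rho = 4.0, but 16 != d[2]/d[0] = 18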

5. Performance model

In Section 5.1, we introduce the general performance model. Then, in Section 5.2, we instantiate it for the three methods that we are considering, namely Online-Detection, ABFT-Detection and ABFT-Correction.

5.1. General approach

We introduce an abstract performance model to compute the best combination of checkpoints and verifications for iterative methods. We execute T time-units of work followed by a verification, which we call a chunk, and we repeat this scheme s times, i.e., we compute s chunks, before taking a checkpoint. We say that the s chunks constitute a frame. The whole execution is then partitioned into frames. We assume that checkpoint, recovery and verification are error-free operations, and let T_cp, T_rec and T_verif be their respective costs. Finally, assume an exponential distribution of errors and let q be the probability of successful execution for each chunk: q = e^{−λT}, where λ is the fault rate.

The goal of this section is to compute the expected time E(s, T) needed to execute a frame composed of s chunks of size T. We derive the best value of s as a function of T and of the resilience parameters T_cp, T_rec, T_verif, and q, the success probability of a chunk. Each frame is preceded by a checkpoint, except maybe the first one (for which we recover by reading the initial data again). Following earlier work [39], we derive the following recursive equation for the expected completion time of a single frame:

    E(s, T) = q^s ( s (T + T_verif) + T_cp ) + (1 − q^s) ( E(T_lost) + T_rec + E(s, T) ).    (9)


Indeed, the execution is successful if all chunks are successful, which happens with probability q^s; in this case the execution time is simply the sum of the execution times of the chunks plus the final checkpoint. Otherwise, with probability 1 − q^s, we have an error, which we detect after some time E(T_lost), and which forces us to recover (in time T_rec) and restart the frame anew, hence in time E(s, T). The difficult part is to compute E(T_lost).

For 1 ≤ i ≤ s, let f_i be the following conditional probability:

    f_i = P(error strikes at chunk i | there is an error in the frame).    (10)

Given the success probability q of a chunk, we obtain

    f_i = q^{i−1} (1 − q) / (1 − q^s).

Indeed, the first i − 1 chunks were successful (probability q^{i−1}), the ith one had an error (probability 1 − q), and we condition on the probability of an error within the frame, namely 1 − q^s. With probability f_i, we detect the error at the end of the ith chunk, and we have lost the time spent executing the first i chunks. We derive that

    E(T_lost) = Σ_{i=1}^{s} f_i · i (T + T_verif).

We have Σ_{i=1}^{s} i f_i = (1 − q) h(q) / (1 − q^s), where h(q) = 1 + 2q + 3q² + ··· + s q^{s−1}. If m(q) = q + q² + ··· + q^s = (1 − q^{s+1})/(1 − q) − 1, we get by differentiation that m′(q) = h(q), hence

    h(q) = −(s + 1) q^s / (1 − q) + (1 − q^{s+1}) / (1 − q)²,

and finally

    E(T_lost) = (T + T_verif) · ( s q^{s+1} − (s + 1) q^s + 1 ) / ( (1 − q^s)(1 − q) ).

Plugging the expression of E(T_lost) back into (9), we obtain

    E(s, T) = s (T + T_verif) + T_cp + (q^{−s} − 1) T_rec + (T + T_verif) · ( s q^{s+1} − (s + 1) q^s + 1 ) / ( q^s (1 − q) ),

which simplifies into

    E(s, T) = T_cp + (q^{−s} − 1) T_rec + (T + T_verif) · (1 − q^s) / ( q^s (1 − q) ).

We have to determine the value of s that minimizes the overhead of a frame:

    s = argmin_{s ≥ 1}  E(s, T) / (sT).    (11)

The minimization is complicated and should be conducted numerically (because T, the size of a chunk, is still unknown). Luckily, a dynamic programming algorithm to compute the optimal values of T and s is available [10].
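As a sketch of how (11) can be solved numerically — a brute-force grid search rather than the dynamic programming algorithm of [10]; all parameter values below are illustrative:

import numpy as np

def expected_frame_time(s, T, Tcp, Trec, Tverif, lam):
    # E(s,T) = Tcp + (q^{-s} - 1) Trec + (T + Tverif)(1 - q^s)/(q^s (1 - q))
    q = np.exp(-lam * T)
    return Tcp + (q ** (-s) - 1.0) * Trec \
        + (T + Tverif) * (1.0 - q ** s) / (q ** s * (1.0 - q))

def best_configuration(Tcp, Trec, Tverif, lam,
                       s_grid=range(1, 201),
                       T_grid=np.linspace(0.1, 50.0, 500)):
    # argmin over (s, T) of the per-work overhead E(s,T)/(sT), eq. (11)
    return min((expected_frame_time(s, T, Tcp, Trec, Tverif, lam) / (s * T), s, T)
               for s in s_grid for T in T_grid)

print(best_configuration(Tcp=5.0, Trec=5.0, Tverif=0.2, lam=1e-3))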


5.2. Instantiation to PCG

For each of the three methods, Online-Detection, ABFT-Detection and ABFT-Correction, we instantiate the previous model and discuss how to solve (11).

5.2.1. Online-Detection

For Chen's method [7], we have chunks of d iterations, hence T = d T_iter, where T_iter is the raw cost of a PCG iteration without any resilience method. The verification time T_verif is the cost of the orthogonality check operations performed as described in Section 3. As for silent errors, the application is protected from arithmetic errors in the ALU, as in Chen's original method, but also from corruption in data memory (because we also checkpoint the matrix A). Let λ_a be the rate of arithmetic errors, and λ_m be the rate of memory errors. For the latter, we have λ_m = M λ_word if the data memory consists of M words, each susceptible to be corrupted with rate λ_word. Altogether, since the two error sources are independent, they have a cumulative rate λ = λ_a + λ_m, and the success probability for a chunk is q = e^{−λT}.

Plugging these values into (11) gives an optimization formula very similar to that of Chen [7, Sec. 5.2], the only difference being that we assume that the verification is error-free, which is needed for the correctness of the approach.

5.2.2. ABFT-Detection

When using ABFT techniques, we detect possible errors at every iteration, so a chunk is a single iteration, and T = T_iter. For ABFT-Detection, T_verif is the overhead due to the checksums and redundant operations needed to detect a single error in the method.

ABFT-Detection can protect the application from the same silent errors as Online-Detection, and just as before the success probability for a chunk (a single iteration here) is q = e^{−λT}.

5.2.3. ABFT-Correction

In addition to detection, we now correct single errors at every chunk. Just as for ABFT-Detection, a chunk is a single iteration, and T = T_iter, but T_verif corresponds to a larger overhead, mainly due to the extra checksums needed to detect two errors and correct a single one.

The main difference lies in the error rate. An iteration with ABFT-Correction is successful if zero or one error strikes during that iteration, so the success probability is much higher than for Online-Detection and ABFT-Detection. We compute that success probability as follows. We have a Poisson process of rate λ, where λ = λ_a + λ_m as for Online-Detection and ABFT-Detection. The probability of exactly k errors in time T is ((λT)^k / k!) e^{−λT} [40]; hence the probability of no error is e^{−λT}, the probability of exactly one error is λT e^{−λT}, and the success probability of a chunk is q = e^{−λT} (1 + λT).

6. Experiments

6.1. Setup

There are two different sources of advantages in combining ABFT and checkpointing. First, the error detection capability lets us perform a cheap validation of the partial result of each PCG step, recovering as soon as an error strikes. Second, being able to correct single errors makes each step more resilient and increases the expected number of consecutive valid iterations. We say an iteration is valid if it is non-faulty, or if it suffers from a single error that is corrected via ABFT.

For our experiments, we use a set of positive definite matrices from the UFL Sparse Matrix Collection [41], with sizes between 17456 and 74752 and density lower than 10^{−2}. These matrices are listed in Table 2. Among them, "Andrews/Andrews" and "GHS_psdef/jnlbrng1" are diagonally dominant; the others are not. We perform the experiments under Matlab and use the factored approximate inverse preconditioners [17, 42] in the PCG. The application of these preconditioners requires two SpMxV, which are protected against errors using the methods proposed in Section 4 (in all three methods: Online-Detection, ABFT-Detection, and ABFT-Correction).

At each iteration of PCG, faults are injected during vector and matrix-vector operations but, since we are assuming selective reliability, all the checksums and checksum operations are considered non-faulty. Faults are modeled as bit flips occurring independently at each step, under an exponential distribution of parameter λ, as detailed in Section 5.2. These bit flips can strike either the matrix (the elements of Val, Colid and Rowidx), or any entry of the PCG vectors r_i, z_i, q, p_i or x_i. We chose not to inject errors during the computation explicitly, as they are just a special case of the errors we are considering. Moreover, to simplify the injection mechanism, T_iter is normalized to one, meaning that each memory location or operation is given the chance to fail just once per iteration [28]. Finally, to get data that are homogeneous among the test matrices, the fault rate λ is chosen to be inversely proportional to M (memory size) with a proportionality constant α ∈ (0, 1); this makes sense because the larger the memory used by an application, the larger the chance of an error. It follows that the expected number of PCG iterations between two distinct fault occurrences depends neither on the size nor on the sparsity ratio of the matrix.
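As an illustration of this injection mechanism, a single bit flip in a double precision value can be implemented as follows (a minimal Matlab sketch; the helper name flip_random_bit is ours, and the dispatch of faults to Val, Colid, Rowidx and the PCG vectors is omitted):

    % Flip one uniformly random bit of a double precision scalar
    % (to be saved as flip_random_bit.m).
    function y = flip_random_bit(x)
        bits = typecast(x, 'uint64');                 % raw IEEE-754 bit pattern
        k    = randi([0, 63]);                        % position of the bit to corrupt
        bits = bitxor(bits, bitshift(uint64(1), k));  % flip bit k
        y    = typecast(bits, 'double');
    end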

We compare the performance of three algorithms, namely Online-Detection, ABFT-Detection (single detection and rolling back as soon as an error is detected), and ABFT-Correction (correcting single errors during a given step and rolling back only if two errors strike a single operation). We instantiate them by limiting the maximum number of PCG steps to 50 (20 for #924, whose convergence is sublinear) and setting the tolerance parameter ε at line 5 of Algorithm 1 to 10^{-14}. The number of iterations for a non-faulty execution and the achieved accuracy are detailed in Table 2.

Implementing the null checks in Algorithm 2, Algorithm 3 and Algorithm 4 poses a challenge. The comparison d_r = 0 is between two integers, and can therefore be performed exactly.

Table 2: Test matrices used in the experiments. Name and id are from the University of Florida Sparse Matrix Collection.

    name                            id     size   density   steps  residual
    Boeing/bcsstk36                341    23052  2.15e-03      50  6.41e-04
    Mulvey/finan512                752    74752  1.07e-04      25  2.19e-14
    Andrews/Andrews                924    60000  2.11e-04      20  1.59e-04
    GHS_psdef/wathen100           1288    30401  5.10e-04      50  2.55e-13
    GHS_psdef/wathen120           1289    36441  4.26e-04      50  9.16e-14
    GHS_psdef/gridgena            1311    48962  2.14e-04      50  5.61e-05
    GHS_psdef/jnlbrng1            1312    40000  1.24e-04      50  5.83e-13
    UTEP/Dubcova2                 1848    65025  2.44e-04      50  1.16e-05
    JGD_Trefethen/Trefethen_20000 2213    20000  1.39e-03      10  6.00e-16

However, the other two, having floating point operands, are problematic. Since floating point operations are not associative and the distributive property does not hold, we need a tolerance parameter that takes into account the rounding performed by each floating point operation. Here, we give an upper bound on the difference between the two floating point checksums, using the standard model [43, Sec. 2.2], to make sure that errors caught by our algorithms really are errors and not merely inaccuracies due to floating point arithmetic (which are tolerable, as non-faulty executions can give rise to the same inaccuracies).

Theorem 3 (Accuracy of the floating point weighted checksums). Let A ∈ R^{n×n} and x, c ∈ R^n. If all of the sums involved in the matrix operations are performed using some flavor of recursive summation [43, Ch. 4], it holds that

    |fl((c^T A) x) - fl(c^T (A x))| ≤ 2 γ_{2n} |c|^T |A| |x| .   (12)

We refer the reader to the technical report for the proof [44, Theorem 2]. Let us note that if all of the entries of c are positive, as is often the case in our setting, the absolute value of c in (12) can safely be replaced with c itself. It is also clear that these bounds are not computable, since |c|^T |A| |x| is not, in general, a floating point number. This problem can be alleviated by overestimating the bound by means of matrix and vector norms.
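Before resorting to norms, note that the bound of Theorem 3 can already be used as a (more expensive) runtime test by evaluating its right hand side in floating point. The following Matlab sketch assumes that the sparse matrix A and the vectors x and c are in the workspace:

    % ABFT consistency test based on (12): compare the two checksum
    % association orders and flag an error only when their difference
    % exceeds the worst-case rounding bound.
    u     = eps / 2;                          % unit roundoff for doubles
    n     = size(A, 1);
    gamma = (2*n*u) / (1 - 2*n*u);            % gamma_{2n} of the standard model
    lhs   = abs((c' * A) * x - c' * (A * x));
    tau   = 2 * gamma * (abs(c)' * abs(A) * abs(x));
    if lhs > tau
        % discrepancy exceeds worst-case rounding: report a silent error
    end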

Since we are interested in actually computing the bound at runtime, we consider a weaker bound. Recalling that [45, Sec. B.7]

    ‖A‖_1 = max_{1≤j≤n} Σ_{i=1}^{n} |a_{i,j}| ,   (13)

we can upper bound the right hand side of (12) in terms of norms, obtaining the computable bound (14).
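In Matlab, for instance, this norm costs a single pass over the nonzeros of A (a one-line sketch, assuming A is in the workspace):

    % 1-norm of A, cf. (13); computed once per matrix.
    nrmA = full(max(sum(abs(A), 1)));   % equivalent to norm(A, 1), also for sparse A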


Table 3: Experimental validation of the model. Here s̃_i and s*_i represent the best checkpointing interval according to our model and to our simulations respectively, whereas Et(s̃_i) and Et(s*_i) stand for the execution time of the algorithm using these checkpointing intervals.

      id   s̃_1  Et(s̃_1)  s*_1  Et(s*_1)   l_1   s̃_2  Et(s̃_2)  s*_2  Et(s*_2)   l_2
     341     4   305.42     1    293.22   4.16     4   305.45     1    293.16   4.19
     752    30    13.81    24     13.34   3.57    24    12.17    23     11.93   2.01
     924    23    49.82    30     47.53   4.82    23    44.52    26     42.42   4.96
    1288    22    11.12    19     10.82   2.72    22    11.32    19     11.03   2.58
    1289    16    16.56    13     16.38   1.07    23    13.49    23     13.49   0.00
    1311     4   216.70     1    207.97   4.19     4   220.20     1    208.19   5.77
    1312    25    14.41    22     13.86   3.97    23    12.30    22     12.06   1.96
    1848     4   321.70     1    309.28   4.01     4   366.03     1    314.20  16.49
    2213    19     2.31    12      2.19   5.58    24     2.33    23      2.20   5.94

Though the right hand side of (14) is not exactly computable in floating point arithmetic, it requires an amount of operations dramatically smaller than (12): just a few sums for the norm of A. As this norm is usually computed using the identity in (13), any kind of summation yields a relative error of at most n′u [43, Sec. 4.6], where n′ is the maximum number of nonzeros in a column of A, and u is the machine epsilon. Since we are dealing with sparse matrices, we expect n′ to be very small, and hence the computation of the norm to be accurate. Moreover, since the right hand side in (14) does not depend on x, it can be computed just once for a given matrix and weight vector.

Using (14) as the tolerance parameter guarantees no false positives (a computation without any error that is flagged as faulty), but allows false negatives (an iteration during which an error occurs without being detected) when the perturbation of the result is small. Nonetheless, this solution works almost perfectly in practice: though the convergence rate can be slowed down, the algorithm still converges towards the "correct" answer. Though such an outcome could be surprising at first, Elliott et al. [46, 47] showed that bit flips striking the less significant digits of the floating point representation of vector elements during a dot product create small perturbations of the result, and that the magnitude of this perturbation gets smaller as the size of the vectors increases. Hence, we expect errors that are not detected by our tolerance threshold to be too small to impact the solution of the linear solver.

6.2. Simulations

To validate the model, we perform the simulation whose results are illustrated in Table 3. For each matrix, we set λ = 1/(6M) and consider the average execution time of 100 repetitions of both ABFT-Detection (columns 2-6) and ABFT-Correction (columns 7-11). In the table we record the checkpointing interval s*_i that achieves the shortest execution time Et(s*_i), together with the interval s̃_i predicted by our model and its execution time Et(s̃_i). Finally, we evaluate the quality of our guess by means of the quantity

    l_i = (Et(s̃_i) - Et(s*_i)) / Et(s*_i) · 100 ,

which expresses the loss, in terms of execution time, of executing with the checkpointing interval given by our model with respect to the best possible choice.
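The validation procedure itself can be summarized as follows (a schematic Matlab sketch; simulate_pcg_run is a hypothetical helper returning the execution time of one faulty run with checkpointing interval s, and the value of s_model is illustrative):

    % Empirical search for the best checkpointing interval (cf. Table 3).
    smax = 50; reps = 100;
    Et = zeros(1, smax);
    for s = 1:smax
        acc = 0;
        for r = 1:reps
            acc = acc + simulate_pcg_run(s);   % one faulty PCG execution
        end
        Et(s) = acc / reps;                    % average time with interval s
    end
    [~, s_star] = min(Et);                     % empirically best interval s*_i
    s_model = 22;                              % interval predicted by the model
    l = (Et(s_model) - Et(s_star)) / Et(s_star) * 100;   % loss l_i in percent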

From the table, we clearly see that the values of s̃_i and s*_i are close: the time loss reaches just above 5% for l_1 and, except for one outlier at 16.49%, stays below 6% for l_2. These occasionally poor results are due to the small number of repetitions we consider, which leads to outliers: lucky runs in which few errors occur and the computation therefore completes much more quickly. Similar results hold for other values of λ.

We also compare the execution times of the three algorithms to empirically assess how their relative performance depends on the fault rate. The results on our test matrices are shown in Fig. 1, where the y-axis is the execution time (in seconds), and the x-axis is the normalized mean time between failures (the reciprocal of α). Here, the larger x = 1/α, the smaller the corresponding value of λ = α/M, hence the smaller the expected number of errors. For each value of λ, we draw the average execution time of 50 runs of the three algorithms, using the best checkpointing interval predicted in Section 5.1 for ABFT-Detection and ABFT-Correction, and by Chen [7, Eq. 10] for Online-Detection. In terms of execution time, Chen's method is comparable to ours for middle to high fault rates: it clearly outperforms ABFT-Detection in five out of nine cases, and is slightly faster than ABFT-Correction for lower fault rates. Intuitively, this behavior is not surprising. When λ is large, many errors occur but, since α is between zero and one, we always have, in expectation, less than one error per iteration. Thus ABFT-Correction requires fewer checkpoints than ABFT-Detection and almost no rollback, and this compensates for the slightly longer execution time of a single step. When the fault rate is very low, instead, the algorithms perform almost the same number of iterations, but ABFT-Correction takes slightly longer due to the additional dot products at each step.

Altogether, the results show that ABFT-Correction outperforms both Online-Detection and ABFT-Detection for a wide range of fault rates, thereby demonstrating that combining checkpointing with ABFT correcting techniques is more efficient than pure checkpointing for most practical situations.

[Figure 1 appears here: nine panels, (a) through (i), one per test matrix (#341, #752, #924, #1288, #1289, #1311, #1312, #1848, #2213).]

Figure 1: Execution time in seconds (y-axis) of Online-Detection (dotted), ABFT-Detection (solid) and ABFT-Correction (dashed) with respect to the normalized MTBF (mean time between failures, x-axis). The matrix id is given in each subcaption.

7. Conclusion

We consider the problem of silent errors in iterative linear system solvers. At first, we focus our attention on ABFT methods for SpMxV, developing algorithms able to detect and correct errors in both memory and computation using various checksumming techniques. Then, we combine ABFT with replication, in order to develop a resilient PCG kernel that can protect axpy's and dot products as well. We also discuss how to take numerical issues into account when dealing with actual implementations. These methods are a worthy choice for a selective reliability model, since most of the operations can be performed in unreliable mode, whereas only checksum computations need to be performed reliably.

In addition, we examine checkpointing techniques as a tool to improve the resilience of our ABFT PCG and develop a model to choose the checkpointing interval so as to achieve the shortest execution time in expectation. We implement two of the possible combinations, namely an algorithm that rolls back as soon as an error is detected, and one that is able to correct a single error and recovers from a checkpoint only when two errors strike. We validate the model by means of simulations and finally compare our algorithms with Chen's approach, empirically showing that the ABFT overhead is usually smaller than Chen's verification cost.

We expect this combined approach to be interesting for other variants of the preconditioned conjugate gradient algorithm [35]. Triangular preconditioners seem particularly attractive, in that it looks possible to treat them by adapting the techniques described in this paper (Shantharam et al. [16] addressed the triangular case).

We discussed the adequacy of the proposed techniques in the single core and distributed memory settings, but did not address shared memory parallel systems with many cores. In such systems, task-level parallelism necessitates the application of accuracy theorems (in the spirit of Theorem 3) to avoid false positives. We leave the resilience of task-parallel executions as future work.

Acknowledgements

Y. Robert and B. Uçar were partly supported by the French Research Agency (ANR) through the Rescue and SOLHAR (ANR MONU-13-0007) projects. Y. Robert is with Institut Universitaire de France. J. Langou was fully supported by NSF award CCF 1054864.

References

[1] M. Fasi, Y. Robert, B. Uçar, Combining backward and forward recovery to cope with silent errors in iterative solvers, in: Proceedings of the 16th Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2015), 2015, pp. 980–989.

[2] A. Moody, G. Bronevetsky, K. Mohror, B. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, in: High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, 2010, pp. 1–11.

