• Aucun résultat trouvé

Can we avoid rounding-error estimation in HPC codes and still get trustworthy results?

N/A
N/A
Protected

Academic year: 2022

Partager "Can we avoid rounding-error estimation in HPC codes and still get trustworthy results?"

Copied!
27
0
0

Texte intégral

(1)

Can we avoid rounding-error estimation in HPC codes and still get trustworthy results?

Fabienne Jézéquel

1

, Stef Graillat

1

, Daichi Mukunoki

2

, Toshiyuki Imamura

2

, Roman Iakymchuk

1

1LIP6, Sorbonne Université, CNRS, Paris, France 2RIKEN Center for Computational Science, Kobe, Japan

13th International Workshop on Numerical Software Verification 2020

20-21 July 2020

(2)

Introduction

Current computers: a high number of floating-point operations performed ,

Each of them can lead to a rounding error /

Numerical validation is crucial ...but costful /

execution time overhead

development cost induced by the application of numerical validation methods to HPC codes

Can we address this cost problem ...and still get trustworthy results?

Yes, when the input data is affected by rounding and/or measurement errors.

(3)

Introduction

Current computers: a high number of floating-point operations performed ,

Each of them can lead to a rounding error /

Numerical validation is crucial

...but costful /

execution time overhead

development cost induced by the application of numerical validation methods to HPC codes

Can we address this cost problem ...and still get trustworthy results?

Yes, when the input data is affected by rounding and/or measurement errors.

(4)

Introduction

Current computers: a high number of floating-point operations performed ,

Each of them can lead to a rounding error /

Numerical validation is crucial ...but costful /

execution time overhead

development cost induced by the application of numerical validation methods to HPC codes

Can we address this cost problem ...and still get trustworthy results?

Yes, when the input data is affected by rounding and/or measurement errors.

(5)

Introduction

Current computers: a high number of floating-point operations performed ,

Each of them can lead to a rounding error /

Numerical validation is crucial ...but costful /

execution time overhead

development cost induced by the application of numerical validation methods to HPC codes

Can we address this cost problem ...and still get trustworthy results?

Yes, when the input data is affected by rounding and/or measurement errors.

(6)

Definitions

Let

y=f(x)

be an exact result and

yˆ=fˆ(x)

be the associated computed result.

The forward error is the difference between

y

and

yˆ

.

The backward analysis tries to seek for

x

s.t.

yˆ=f(x+∆x)

.

x

is the backward error associated with

yˆ

.

It measures the distance between the problem that is solved and the initial one.

The condition number

C

of the problem is defined as:

C:= lim

ε→0+ sup

|∆x|≤ε

·|f(x+∆x)f(x)|

|f(x)| /|∆x|

|x|

¸ .

It measures the effect on the result of data perturbation.

(7)

Error induced by perturbed data

The relative rounding error is denoted by

u

.

binary64

format (double precision):

u=253 binary32

format (single precision):

u=2−24

.

If the algorithm is backward-stable (i.e. the backward error is of the order of

u

)

|f(x)−fˆ(x)|/|f(x)|.Cu.

If the input data are perturbed,

i.e.

the input data are not

x

but

xˆ=x(1+δ)

, then one computes

fˆ( ˆx)

with

|f(x)−fˆ( ˆx)|/|f(x)|.C(u+ |δ|).

If

|δ| Àu

, the rounding error generated by

fˆ

is negligible w.r.t.

C|δ|

.

Estimating this rounding error may be avoided.

(8)

Error induced by perturbed data

The relative rounding error is denoted by

u

.

binary64

format (double precision):

u=253 binary32

format (single precision):

u=2−24

.

If the algorithm is backward-stable (i.e. the backward error is of the order of

u

)

|f(x)−fˆ(x)|/|f(x)|.Cu.

If the input data are perturbed,

i.e.

the input data are not

x

but

xˆ=x(1+δ)

, then one computes

fˆ( ˆx)

with

|f(x)−fˆ( ˆx)|/|f(x)|.C(u+ |δ|).

If

|δ| Àu

, the rounding error generated by

fˆ

is negligible w.r.t.

C|δ|

.

Estimating this rounding error may be avoided.

(9)

Discrete Stochastic Arithmetic (DSA)

[J. Vignes, 2004]

Rounding error estimation

each operation is executed 3 times with a random rounding mode:

R→(R1,R2,R3)

where each result

Ri

is rounded up or down with the same probability

the number of correct digits in the results is estimated using Student’s test with the confidence level 95%

operations are executed synchronously

⇒ detection of numerical instabilities

(10)

The CADNA library

http://cadna.lip6.fr

CADNA allows one to use DSA in any scientific program written in C, C

++

or Fortran.

CADNA provides new numerical types, the stochastic types, which consist of: 3 floating point variables

an integer variable to store the accuracy.

All operators and mathematical functions are redefined for these types.

CADNA requires only a few modifications in user programs.

Performance overhead:

×4

memory,

≈ ×10

execution time

(11)

The CADNA library

http://cadna.lip6.fr

CADNA allows one to use DSA in any scientific program written in C, C

++

or Fortran.

CADNA provides new numerical types, the stochastic types, which consist of:

3 floating point variables

an integer variable to store the accuracy.

All operators and mathematical functions are redefined for these types.

CADNA requires only a few modifications in user programs.

Performance overhead:

×4

memory,

≈ ×10

execution time

(12)

Combining DSA and standard floating-point arithmetic

Computation routines are executed in a code that is controlled using DSA.

Their input data are affected by errors (rounding errors and/or measurement errors).

NSV2020 Can we avoid rounding-error estimation in HPC codes and still get trustworthy results? 7 / 19

(13)

Combining DSA and standard floating-point arithmetic

Computation routines are executed in a code that is controlled using DSA.

Their input data are affected by errors (rounding errors and/or measurement errors).

Computation with a call to CADNA routines:

stochastic data

D

CADNA routine(s)

stochastic result

R

D

and

R

consist in stochastic arrays (each element is a triplet).

we compare the number of correct digits (estimated by CADNA) in

R

and

R0

(14)

Combining DSA and standard floating-point arithmetic

Computation routines are executed in a code that is controlled using DSA.

Their input data are affected by errors (rounding errors and/or measurement errors).

Computation with 3 calls to classic routines:

stochastic data

D D2

D1

D3

classic routine(s) classic routine(s)

classic routine(s)

R02 R01

R03

stochastic result

R0

input data: 3 classic floating-point arrays

D1,D2,D3

created from the triplets of

D

We get 3 classic floating-point arrays

R10,R20,R03

.

A stochastic array

R0

created from

R10,R20,R03

can be used in the next parts of the code.

we compare the number of correct digits (estimated by CADNA) in

R

and

R0

(15)

Combining DSA and standard floating-point arithmetic

Computation routines are executed in a code that is controlled using DSA.

Their input data are affected by errors (rounding errors and/or measurement errors).

Computation with 3 calls to classic routines:

stochastic data

D D2

D1

D3

classic routine(s) classic routine(s)

classic routine(s)

R02 R01

R03

stochastic result

R0

input data: 3 classic floating-point arrays

D1,D2,D3

created from the triplets of

D

We get 3 classic floating-point arrays

R10,R20,R03

.

A stochastic array

R0

created from

R10,R20,R03

can be used in the next parts of the code.

we compare the number of correct digits (estimated by CADNA) in

R

and

R0

(16)

Accuracy comparison

Experimental setup

Each random input value is perturbed with a relative error

δ

. For

i=1, . . . ,n2

(matrix mult.) or for

i=1, ...,n

(matrix-vector mult.) we analyze:

the accuracy

CRi

of the element

Ri

of

R

the accuracy

CR0i

of the element

R0i

of

R0

i

¯CRiCR0i

¯

¯

(17)

Accuracy comparison

in double precision

accuracy accuracy difference

δ

of

R

between

R

&

R0

mean min-max mean max

Multiplication of matrices of size 500

1.e-14 13.9 9-15 2.5e-02 2

1.e-13 12.8 8-15 5.8e-03 1

1.e-12 11.9 7-14 4.2e-04 1

1.e-11 10.9 6-13 2.4e-05 1

Multiplication of a matrix of size 1000 with a vector

1.e-14 13.9 12-15 4.6e-02 1

1.e-13 12.7 11-14 7.0e-03 1

1.e-12 11.8 10-13 0 0

1.e-11 10.9 9-12 0 0

As the order of magnitude of

δ%

the mean accuracy

&

by 1 digit.

Low difference between the accuracy of

R

&

R0

(18)

Performance comparison

We compare the performance of the CADNA routine with codes using:

a naive floating-point algorithm the Intel MKL implementation.

In both cases: sequential and OpenMP 4 cores

(19)

Performance for matrix multiplication

Execution time including matrix multiplications and array copies:

10-4 10-3 10-2 10-1 100 101 102 103

400 800 1200 1600 2000

Execution time (s)

Matrix size (n)

CADNA naive seq naive OMP MKL seq MKL OMP

The codes using 3 classic matrix multiplications perform better than the CADNA routine.

For matrices of size 2000, the MKL OpenMP implementation outperforms

the CADNA routine by a factor 294 (this gain increases on many-cores).

(20)

Performance for matrix multiplication

Execution time for matrices of size 2000:

0 20 40 60 80 100 120 140

CADNA naive seq naive OMP MKL seq MKL OMP

Execution time (s)

computation array copies

Most of the execution time is spent in matrix multiplication.

(21)

Performance for matrix-vector multiplication

Execution time including matrix-vector multiplications and array copies:

0 0.2 0.4 0.6 0.8 1 1.2 1.4

2000 4000 6000 8000 10000

Execution time (s)

Matrix size (n) CADNA

naive seq naive OMP MKL seq MKL OMP

The CADNA routine performs better than the other sequential codes.

(22)

Performance for matrix-vector multiplication

Execution time for matrices of size 10000:

0 0.2 0.4 0.6 0.8 1 1.2 1.4

CADNA naive seq naive OMP MKL seq MKL OMP

Execution time (s)

computation array copies

Except with the CADNA routine, the main part of the execution time is spent in array copies.

Both computation and array copies are parallelized in the OpenMP codes.

(23)

Conclusions/Perspectives

In a code controlled using CADNA, if computation-intensive routines are run with perturbed data,

classic BLAS routines can be executed 3 times instead of the CADNA routines with almost no accuracy difference on the results

the performance gain can be high with BLAS routines from an optimized library

but we lose the instability detection.

The same conclusions would be valid with an HPC code using MPI.

CADNA-MPI routines

optimized floating-point MPI routines.

Application of our approach to real-life examples with realistic data sets.

(24)

Thanks for your attention!

(25)
(26)

On the number of runs

2

or

3

runs are enough. To increase the number of runs is not necessary.

From the model, to increase by

1

the number of exact significant digits given by

CR

, we need to multiply the size of the sample by

100

.

Such an increase of

N

will only point out the limit of the model and its error without really improving the quality of the estimation.

It has been shown that

N=3

is the optimal value.

[Chesneaux & Vignes, 1988]

(27)

On the probability of the confidence interval

With

β=0.05

and

N=3

,

the probability of overestimating the number of exact significant digits of at least

1

is

0.054%

the probability of underestimating the number of exact significant digits of at least

1

is

29%

.

By choosing a confidence interval at

95%

, we prefer to guarantee a minimal

number of exact significant digits with high probability (

99.946%

), even if we are

often pessimistic by

1

digit.

Références

Documents relatifs

Abstract: Worldwide schistosomiasis remains a serious public health problem with approximately 67 million people infected and 200 million at risk of infection from inhabiting

The fact that the students in Gershon’s project are making sense of science – and note that the word itself is used in the essay without much elaboration or critical scrutiny –

Moreover, we also found some general differences between the perceived trustworthiness of people from different groups, particularly that older people are considered more

In this topical issue “Numerical methods and HPC”, several of these issues are addressed, for example program- ming environments, I/O and storage performance, Big Data and HPC

The original phrase: “The exascale definition limited to machines capable of a rate of 1018 flops is of interest to only few scienti fic domains.”. Must be corrected by:

The functional secu- rity requirements may well serve as input for a (security-related) trustworthiness modeling, whereas the security assurance requirements, as

In this section, we compare the accuracy of results provided by the CADNA version and the classic version of BLAS routines for matrix (xGEMM) and matrix-vector (xGEMV)

CADNA enables one to estimate the numerical quality of any result and detect numerical instabilities. CADNA provides new numerical types, the stochastic types, which consist of: