CADNA demo

(1)

CADNA demo

Fabienne Jézéquel

Sorbonne Université, Laboratoire d’Informatique de Paris 6 (LIP6), France

Workshop on Large-scale Parallel Numerical Computing Technology RIKEN Center for Computational Science, Kobe, Japan, 6-8 June 2019

(2)

gunzip cadna_c-3.1.3.tar.gz tar -xvf cadna_c-3.1.3.tar cd cadna_c-3.1.3

./configure --prefix=‘pwd‘ --enable-fortran make install

Automatic detection of OpenMP, MPI. Up to 10 libraries can be created.

Ex: libcadnaC.a, libcadnaCdebug.a, libcadnaMPICforOpenMP.a, libcadnaMPICdebugforOpenMP.a, libcadnaMPIFortran.a,...

optimized versions: inlining, -O3 debug versions: no inlining, -O0, -g

(3)

Outline

1 CADNA installation

2 Numerical validation of C/C++ codes

3 Numerical validation of Fortran codes

4 The CADTRACE tool

(4)

P=333.75y⁶+x²(11x²y²−y⁶−121y⁴−2)+5.5y⁸+x/(2y) withx=77617andy=33096

Exact result :P≈-0.827396059946821368141165095479816292

§Execution without/with CADNA

(5)

Example 2

The roots of the following second order equation are computed:

0.3x²−2.1x+3.675=0.

The exact values are:

Discriminantd=0 x₁=x₂=3.5

§Execution without/with CADNA:

without CADNA:

wrong branching_⇒the result is false with CADNA:

i f(d==0.) is satisfied ifdis numerical noise

(6)

The determinant of Hilbert’s matrix of size 11 defined by a_i_,j=1/(i+j−1)

is computed without pivoting strategy.

After triangularization, the determinant is the product of the diagonal elements.

(7)

Example 4

[J.-M. Muller, 1987]

The 25 first iterations of the following recurrent sequence are computed:

U_n+1=111−1130/U_n+3000/(U_n∗U_n₋₁)

withU₀=5.5andU₁=61/11. The exact limit is 6.

With CADNA, instabilities related to DSA self-validation are detected.

Then, the accuracy estimation is not reliable.

(8)

A root of the polynomial

f(x)=1.47x³+1.19x²−1.83x+0.45

is computed by Newton’s method. The sequence is initialized withx=0.5. The iterative algorithm

x_n₊₁=x_n−f(x_n)/f⁰(x_n) is stopped by the criterion

|x_n−x_n−1| <10⁻¹².

(9)

Example 5

Without CADNA

stopping criterionfabs(x-y)<1.e-12 x( 35) = +4.285714252078272e-01 x( 36) = +4.285714252078272e-01

The stopping criterion is satisfied by chance.

With CADNA

stopping criterionfabs(x-y)<1.e-12 x(100) = 0.4285714E+000

x(101) = 0.4285714E+000

ifx-yis numerical noise, the test is not satisfied

⇒maximum number of iterations

Because of instable divisions detected by CADNA, a multiple root is suspected.

(10)

With CADNA

stopping criterionfabs(x-y)<=1.e-12or x==y x( 23) = 0.428571437E+000

x( 24) = 0.42857143E+000

ifx-yis numerical noise, the test is satisfied.

we simplify the faction, and so compte a simple root.

stopping criterion x==y

x( 45) = 0.428571428571430E+000 x( 46) = 0.428571428571429E+000

(11)

Example 6

A linear system of size 4 is solved using Gaussian elimination with partial pivoting.

§Execution without/with CADNA Without CADNA

wheni₌2,a[2][2]is 4864.

But that value has actually no correct digits (associate exact value: 0).

a[2][2]is chosen as the pivot and leads to round-off errors in the subsequent computation.

With CADNA

One can observe thata[2][2]has no correct digits.

The test fabsf(a[j][i])_>pmax fails.

a[3][2], that is computed accurately, is chosen as the pivot.

(12)

We compute several times:

x=6.83561e+5; y=6.83560e+5; z=1.00000000007;

r = ((z-x)+y) + ((z-y)+x-2);

Exact result:1.4 10⁻¹⁰

§Execution without/with CADNA Without CADNA:

using IEEE double precision with rounding to nearest r=2.32830643653870E-10

With CADNA:

we perform close evaluations:((z−x)+y)and((z−y)+x−2).

If the same rounding mode is chosen for both parts, the final result appears as

(13)

Example with OpenMP

[Eberhart et al., 2016]

The sum S of an arrayAof size2n withn=10⁶is computed in single precision:

#pragma omp parallel for for (i=0;i<2*n;i=i+2) {

A[i+1]=(float)i+1.;

A[i]=-(float)i;

} S=0.;

#pragma omp parallel for reduction(+:S) schedule(static,1) for (i=0;i<2*n;i++)

S=S+A[i];

Exact result:S=10⁶

(14)

# threads without CADNA with CADNA without CADNA with CADNA 1 1.000000E+06 1.000000E+06 1.000000E+06 1.000000E+06 2 1.000000E+06 1.000000E+06 1.966080E+06 @.0 3 1.000000E+06 1.000000E+06 1.000000E+06 1.000000E+06 32 1.000000E+06 1.000000E+06 1.892352E+06 @.0 64 1.000000E+06 1.000000E+06 1.787904E+06 @.0 128 1.000000E+06 1.000000E+06 1.609728E+06 @.0 239 1.000000E+06 1.000000E+06 1.000000E+06 1.000000E+06 240 1.000000E+06 1.000000E+06 1.617920E+05 @.0

(static): the loop is divided into chunks of size2n/T whereT is the number of threads.

Each thread has in charge contiguous elements of A and alternatively sums positive and negative values: all results are accurate.

(static,1): the chunk size is 1.

(15)

With (static,1) and an even number of threads

With 2 threads:

A: 0 1 2 3 4 5 6 7 8 9 10 11 . . .

thread 0 thread 1 thread 0 computesP(n)=Pn−1

i=0(−2i)= −n²+n and thread 1Q(n)=Pn−1

i=0(2i+1)=n². Fork=2, . . . ,n,

P(k)=P(k−1)−(2k−2)= −((k−1)²+k−1)−(2k−2) andQ(k)=Q(k−1)+2k−1=(k−1)²+2k−1.

The computation ofP(k)andQ(k)involves values with different orders of magnitude and generates round-off errors forksufficiently high.

With an even number of threads_>2: same phenomenon.

The reduction involves inaccurate results with close absolute values and different signs and thus provides numerical noise.

(16)

For example, with 3 threads:

A: 0 1 2 3 4 5 6 7 8 9 10 11 . . .

thread 0 thread 1 thread 2

Each thread has in charge positive and negative elements of A.

No cancellation occurs and all results are accurately computed.

(17)

Example with MPI

[S.M. Rump, 1983]

We compute using 4 MPI processesP=9x⁴−y⁴+2y²withx=10864and y=18817.

Processes 1, 2 and 3:

compute respectively9x⁴,₋y⁴and2y² send their local result to process 0.

Process 0 receives and sums the results.

Exact result :P=1.

§Execution without/with CADNA mpirun -np 4 exampleMPI1

mpirun -np 4 exampleMPI1_cad

(18)

We compute using 4 MPI processesP=9x −y +2y withx=10864and y=18817.

Processes 1, 2 and 3:

compute respectively9x⁴,₋y⁴and2y²in parallel using OpenMP send their local result to process 0.

Process 0 receives and sums the results.

Exact result :P=1.

§Execution without/with CADNA mpirun -np 4 exampleMPI_OMP1 mpirun -np 4 exampleMPI_OMP1_cad

(19)

Outline

4 The CADTRACE tool

(20)

interface operator(+)

module procedure add_double_st_double_st end interface operator(+)

interface

pure function cpp_add_double_st_double_st(a, b) bind(C) import double_st

type(double_st), intent(in) :: a type(double_st), intent(in) :: b

type(double_st) cpp_add_double_st_double_st end function cpp_add_double_st_double_st end interface

in cadna_add_binding.cc:

double_st cpp_add_double_st_double_st( double_st &a, double_st &b ) { return a+b; }

+ specific Fortran sources for overloading of array functions (MATMUL, SUM,

(21)

Outline

4 The CADTRACE tool

(22)

Installation

gunzip cadtrace-2.2.tar.gz tar -xvf cadtrace-2.2.tar cd cadtrace-2.2

./configure --prefix=‘pwd‘

make install

Usegdbwith thegdb_c.infile provided in theextra-filesdirectory and generate agdb.outfile.

§Example with a C program in theexamplesCdirectory:

gdb ex5_cad<gdb_c.in>gdb.out To obtain a detailed list of instabilities:

(23)

§Example with a Fortran program in theexamplesFortrandirectory:

gdb ex5_cad<gdb_c.in>gdb.out To obtain a detailed list of instabilities:

cadtrace_gcc gdb.out

You can also specify the number of function calls that generate each instability.

For instance, to get 3 levels of function calls:

cadtrace_gcc -n 3 gdb.out

Remark: thecadna_enable,cadna_disablefunctions may help for numerical debugging.

(24)

Thanks to Jean-Marie Chesneaux, Julien Brajard, Romuald Carpentier, Patrick Corde, Pacôme Eberhart, Pierre Fortin, Jean-Luc Lamotte, ...

(25)

Thanks to Jean-Marie Chesneaux, Julien Brajard, Romuald Carpentier, Patrick Corde, Pacôme Eberhart, Pierre Fortin, Jean-Luc Lamotte, ...

Thank you for your attention!