• Aucun résultat trouvé

Level 1 Parallel RTN-BLAS: Implementation and Efficiency Analysis

N/A
N/A
Protected

Academic year: 2021

Partager "Level 1 Parallel RTN-BLAS: Implementation and Efficiency Analysis"

Copied!
3
0
0

Texte intégral

(1)

HAL Id: lirmm-01095172

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01095172

Submitted on 27 Jun 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative CommonsAttribution - NoDerivatives| 4.0 International License

Level 1 Parallel RTN-BLAS: Implementation and Efficiency Analysis

Chemseddine Chohra, Philippe Langlois, David Parello

To cite this version:

Chemseddine Chohra, Philippe Langlois, David Parello. Level 1 Parallel RTN-BLAS: Implementation and Efficiency Analysis. SCAN: Scientific Computing, Computer Arithmetic and Validated Numerics, Sep 2014, Wurzburg, Germany. �lirmm-01095172�

(2)

Level 1 Parallel RTN-BLAS:

Implementation and Efficiency Analysis

Chemseddine Chohra,

Philippe Langlois and David Parello

Univ. Perpignan Via Domitia, Digits, Architectures et Logiciels Informatiques, F-66860, Perpignan. Univ. Montpellier II, Laboratoire

d’Informatique Robotique et de Micro´electronique de Montpellier, UMR 5506, F-34095, Montpellier. CNRS, Laboratoire d’Informatique

Robotique et de Micro´electronique de Montpellier, UMR 5506, F-34095, Montpellier.

[email protected]

Keywords: Floating point arithmetic, numerical reproducibility, Round- To-Nearest BLAS, parallelism, summation algorithms.

Modern high performance computation (HPC) performs a huge amount of floating point operations on massively multi-threaded sys- tems. Those systems interleave operations and include both dynamic scheduling and non-deterministic reductions that prevent numerical re- producibility, i.e. getting identical results from multiple runs, even on one given machine. Floating point addition is non-associative and the results depend on the computation order. Of course, numerical repro- ducibility is important to debug, check the correctness of programs and validate the results. Some solutions have been proposed like parallel tree scheme [1] or new Demmel and Nguyen’s reproducible sums [2].

Reproducibility is not equivalent to accuracy: a reproducible result may be far away from the exact result. Another way to guarantee the numerical reproducibility is to calculate the correctly rounded value of the exact result, i.e. extending the IEEE-754 rounding properties to larger computing sequences. When such computation is possible, it is certainly more costly. But is it unacceptable in practice?

We are motivated by round-to-nearest parallel BLAS. We can imple- ment such RTN-BLAS thanks to recent algorithms that compute cor- rectly rounded sums. This work is a first step for the level 1 of the

1

(3)

BLAS routines. We study the efficiency of computing parallel RTN- sums compared to reproducible or classic ones – MKL for instance.

We focus on HybridSum and OnlineExact, two algorithms that smooth the over-cost effect of the condition number for large sums [3,4]. We start with sequential implementations: we describe and analyze some hand-made optimizations to benefit from instruction level parallelism, pipelining and to reduce the memory latency. The optimized over-cost is at least 25% reduced in the sequential case. Then we propose paral- lel RTN versions of these algorithms for shared memory systems. We analyze the efficiency of OpenMP implementations. We exhibit both good scaling properties and less memory effect limitations than exist- ing solutions. These preliminary results justify to continue towards the next levels of parallel RTN-BLAS.

References:

[1] O. Villa, D. G. Chavarr´ıa-Miranda, V. Gurumoorthi, A. M´ arquez, and S. Krishnamoorthy. Effects of floating- point non-associativity on numerical computations on massively multithreaded systems. In CUG Proceedings, (2009), pp. 1–11.

[2] James Demmel and Hong Diep Nguyen , Fast Reproducible Floating-Point Summation. In 21st IEEE Symposium on Com- puter Arithmetic, Austin, TX, USA, April 7-10, (2013), pp. 163—

172.

[3] Yong-Kang Zhu and Wayne. B. Hayes, Correct rounding and a hybrid approach to exact floating-point summation. SIAM J. Sci. Comput., (2009), Vol. 31, No. 4, pp. 2981–3001.

[4] Yong-Kang Zhu and Wayne. B. Hayes, Algorithm 908: On- line exact summation of floating-point streams. ACM Trans. Math.

Software, (2010), 37:1–37:13.

2

Références

Documents relatifs

Performing Arithmetic Operations on Canonically Represented Values The fundamental idea of the canonical radix-2 RN-representation is that it is a binary representation of some

Over all the custom operators, we observe on the ST231 that for the binary32 format and ◦ = RN, the specialized operators introduce speedups ranging from 1.4 to 4.2 compared with

For instance, the arithmetic operations on quaternions as well as conversion algorithms to/from rotation matrices are subject to spurious under/overflow (an intermediate

If an RN-represented number is in canonical encoding, conversion into ordinary two’s complement representation may require a nonzero round-bit to be added in; it simply consists in

sidered as an asset for interval computations: as the exact result always belongs to any computed result, it belongs to the intersection of all the computed results and one could

Finally for dot product we show the performance on a distributed memory parallel system (platform C). In this case we have two levels of parallelism: OpenMP is used for thread

level 1 BLAS routines, FastAccSum or iFastSum are useful for short vectors while larger ones benefit from HybridSum or OnlineExact as we will explain it.. 3 Sequential Level

In practice for the level 1 BLAS routines, FastAccSum or iFastSum are useful for short vectors while larger ones benefit from HybridSum or OnlineExact as we will explain it..