Récemment recherché

Aucun résultat trouvé

Étiquettes

Aucun résultat trouvé

Document

Aucun résultat trouvé

Accueil Écoles Thèmes

Connexion

Level 1 Parallel RTN-BLAS: Implementation and Efficiency Analysis

Partager "Level 1 Parallel RTN-BLAS: Implementation and Efficiency Analysis"

N/A

N/A

Protected

Année scolaire: 2021

Info

Protected

Academic year: 2021

Partager "Level 1 Parallel RTN-BLAS: Implementation and Efficiency Analysis"

Copied!

3

0

0

3

0

0

Chargement.... (Voir le texte intégral maintenant)

Télécharger maintenant ( 3 Page )

Texte intégral

(1)

HAL Id: lirmm-01095172

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01095172

Submitted on 27 Jun 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative CommonsAttribution - NoDerivatives| 4.0 International License

Level 1 Parallel RTN-BLAS: Implementation and Eﬀiciency Analysis

Chemseddine Chohra, Philippe Langlois, David Parello

To cite this version:

Chemseddine Chohra, Philippe Langlois, David Parello. Level 1 Parallel RTN-BLAS: Implementation and Eﬀiciency Analysis. SCAN: Scientific Computing, Computer Arithmetic and Validated Numerics, Sep 2014, Wurzburg, Germany. �lirmm-01095172�

(2)

Level 1 Parallel RTN-BLAS:

Implementation and Efficiency Analysis

Chemseddine Chohra,

Philippe Langlois and David Parello

Univ. Perpignan Via Domitia, Digits, Architectures et Logiciels Informatiques, F-66860, Perpignan. Univ. Montpellier II, Laboratoire

d’Informatique Robotique et de Micro´electronique de Montpellier, UMR 5506, F-34095, Montpellier. CNRS, Laboratoire d’Informatique

Robotique et de Micro´electronique de Montpellier, UMR 5506, F-34095, Montpellier.

[email protected]

Keywords: Floating point arithmetic, numerical reproducibility, Round- To-Nearest BLAS, parallelism, summation algorithms.

Modern high performance computation (HPC) performs a huge amount of floating point operations on massively multi-threaded sys- tems. Those systems interleave operations and include both dynamic scheduling and non-deterministic reductions that prevent numerical re- producibility, i.e. getting identical results from multiple runs, even on one given machine. Floating point addition is non-associative and the results depend on the computation order. Of course, numerical repro- ducibility is important to debug, check the correctness of programs and validate the results. Some solutions have been proposed like parallel tree scheme [1] or new Demmel and Nguyen’s reproducible sums [2].

Reproducibility is not equivalent to accuracy: a reproducible result may be far away from the exact result. Another way to guarantee the numerical reproducibility is to calculate the correctly rounded value of the exact result, i.e. extending the IEEE-754 rounding properties to larger computing sequences. When such computation is possible, it is certainly more costly. But is it unacceptable in practice?

We are motivated by round-to-nearest parallel BLAS. We can imple- ment such RTN-BLAS thanks to recent algorithms that compute cor- rectly rounded sums. This work is a first step for the level 1 of the

1

(3)

BLAS routines. We study the efficiency of computing parallel RTN- sums compared to reproducible or classic ones – MKL for instance.

We focus on HybridSum and OnlineExact, two algorithms that smooth the over-cost effect of the condition number for large sums [3,4]. We start with sequential implementations: we describe and analyze some hand-made optimizations to benefit from instruction level parallelism, pipelining and to reduce the memory latency. The optimized over-cost is at least 25% reduced in the sequential case. Then we propose paral- lel RTN versions of these algorithms for shared memory systems. We analyze the efficiency of OpenMP implementations. We exhibit both good scaling properties and less memory effect limitations than exist- ing solutions. These preliminary results justify to continue towards the next levels of parallel RTN-BLAS.

References:

[1] O. Villa, D. G. Chavarr´ıa-Miranda, V. Gurumoorthi, A. M´ arquez, and S. Krishnamoorthy. Effects of floating- point non-associativity on numerical computations on massively multithreaded systems. In CUG Proceedings, (2009), pp. 1–11.

[2] James Demmel and Hong Diep Nguyen , Fast Reproducible Floating-Point Summation. In 21st IEEE Symposium on Com- puter Arithmetic, Austin, TX, USA, April 7-10, (2013), pp. 163—

172.

[3] Yong-Kang Zhu and Wayne. B. Hayes, Correct rounding and a hybrid approach to exact floating-point summation. SIAM J. Sci. Comput., (2009), Vol. 31, No. 4, pp. 2981–3001.

[4] Yong-Kang Zhu and Wayne. B. Hayes, Algorithm 908: On- line exact summation of floating-point streams. ACM Trans. Math.

Software, (2010), 37:1–37:13.

2

Références

Télécharger maintenant ( PDF - 3 Page - 123.32 KB )

Documents relatifs

Performing Arithmetic Operations on Round-to-Nearest Representations

Performing Arithmetic Operations on Canonically Represented Values The fundamental idea of the canonical radix-2 RN-representation is that it is a binary representation of some

Custom floating-point arithmetic for integer processors : algorithms, implementation, and selection

Over all the custom operators, we observe on the ST231 that for the binary32 format and ◦ = RN, the specialized operators introduce speedups ranging from 1.4 to 4.2 compared with

Algorithms for manipulating quaternions in floating-point arithmetic

For instance, the arithmetic operations on quaternions as well as conversion algorithms to/from rotation matrices are subject to spurious under/overflow (an intermediate

Performing Arithmetic Operations on Round-to-Nearest Representations

If an RN-represented number is in canonical encoding, conversion into ordinary two’s complement representation may require a nonzero round-bit to be added in; it simply consists in

Numerical reproducibility in HPC: issues in interval arithmetic

sidered as an asset for interval computations: as the exact result always belongs to any computed result, it belongs to the intersection of all the computed results and one could

Parallel experiments with RARE-BLAS

Finally for dot product we show the performance on a distributed memory parallel system (platform C). In this case we have two levels of parallelism: OpenMP is used for thread

Efficiency of Reproducible Level 1 BLAS

level 1 BLAS routines, FastAccSum or iFastSum are useful for short vectors while larger ones benefit from HybridSum or OnlineExact as we will explain it.. 3 Sequential Level

Implementation and Efficiency of Reproducible Level 1 BLAS

In practice for the level 1 BLAS routines, FastAccSum or iFastSum are useful for short vectors while larger ones benefit from HybridSum or OnlineExact as we will explain it..

Documents relatifs

Inégalité d'hermite-hadamard pour différents types de convexité

Inégalité d'hermite-hadamard pour différents types de convexité

41

0

0

Basic principles of unconventional gyros

Basic principles of unconventional gyros

125

0

0

An analysis of Sovereign Wealth Funds and international real estate investments

An analysis of Sovereign Wealth Funds and international real estate investments

129

0

0

Preparation and 13C NMR identification of solid cyclodextrin inclusion compounds

Preparation and 13C NMR identification of solid cyclodextrin inclusion compounds

9

0

0

Shareholder value and equilibrium rate of unemployment

Shareholder value and equilibrium rate of unemployment

11

0

0

Reconstruction of normal and pathological human epidermis on polycarbonate filter

Reconstruction of normal and pathological human epidermis on polycarbonate filter

19

0

0

The importance of an integrated genotype-phenotype strategy to unravel the molecular bases of titinopathies

The importance of an integrated genotype-phenotype strategy to unravel the molecular bases of titinopathies

12

0

0