HAL Id: hal-01140280
https://hal.archives-ouvertes.fr/hal-01140280
Submitted on 8 Apr 2015
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
ExBLAS: Reproducible and Accurate BLAS Library
Roman Iakymchuk, Stef Graillat, Sylvain Collange, David Defour
To cite this version:
Roman Iakymchuk, Stef Graillat, Sylvain Collange, David Defour. ExBLAS: Reproducible and Ac-curate BLAS Library. RAIM: Rencontres Arithmétiques de l’Informatique Mathématique, Apr 2015, Rennes, France. 7ème Rencontre Arithmétique de l’Informatique Mathématique, 2015. �hal-01140280�
LATEX TikZposter
ExBLAS: Reproducible and Accurate BLAS Library
Roman Iakymchuk, Stef Graillat, Sylvain Collange, and David Defour
Sorbonne Universit´es, UPMC Univ Paris 06, ICS, UMR 7606, LIP6, F-75005 Paris, France
INRIA – Centre de recherche Rennes – Bretagne Atlantique, F-35042 Rennes Cedex, France
DALI–LIRMM, Universit´e de Perpignan, 52 avenue Paul Alduy, F-66860 Perpignan, France
ExBLAS: Reproducible and Accurate BLAS Library
Roman Iakymchuk, Stef Graillat, Sylvain Collange, and David Defour
Sorbonne Universit´es, UPMC Univ Paris 06, ICS, UMR 7606, LIP6, F-75005 Paris, France
INRIA – Centre de recherche Rennes – Bretagne Atlantique, F-35042 Rennes Cedex, France
DALI–LIRMM, Universit´e de Perpignan, 52 avenue Paul Alduy, F-66860 Perpignan, France
Abstract
We aim at providing algorithms and implementations for fundamental linear algebra oper-ations – like the ones included in the Basic Linear Algebra Subprograms (BLAS) library – that would deliver reproducible and accurate results with reasonable performance overhead compared to the standard non-reproducible implementations on modern parallel architec-tures such as Intel processors, Intel Xeon Phi co-processors, and GPU accelerators.
Introduction
In general, BLAS routines rely on the optimized version of parallel reduction and dot product involving floating-point additions and multiplications operations that are non-associative. Due to non-associativity of these operations and dynamic scheduling on paral-lel architectures, getting a bitwise reproducible floating-point result for multiple executions of the same code on different or even similar parallel architectures is challenging. These discrepancies worsen on heterogeneous architectures – such as clusters composed of stan-dard CPUs in conjunction with GPU accelerators and/or Intel Xeon Phi co-processors –, which combine together different programming environments that may obey various floating-point models and offer different intermediate precision or different operators. Such non-determinism and non-reproducibility of floating-point computations on parallel ma-chines causes validation and debugging issues, and may even lead to deadlocks.
Existing solutions to enhance reproducibility of BLAS routines are
• Fixed reduction scheme such as Intel’s “Conditional Numerical Reproducibility” (CNR) available as an option in the Math Kernel Library (MKL);
• Avoid rounding error by using the Kulisch accumulator [3];
• Mixed solution as the one proposed by Demmel and Nguyen for BLAS level-1.
Our Approach
We introduced in [1] an approach to compute deterministic sums of floating-point numbers. This approach is based on a multi-level algorithm that combines efficiently floating-point expansion [2], which are placed in registers, and Kulisch accumulator.
Fig. 1: Hierarchical superaccumulation scheme.
Performance Results
We provided implementations of the multi-level summation scheme on a range of parallel platforms: desktop and server CPUs, the Intel Xeon Phi many-core accelerator, and both NVIDIA and AMD GPUs. We relied on the parallel summation algorithm as well as exact multiplication to develop the fast, accurate, and reproducible implementations of fundamental linear algebra operations such as dot product, triangular solver, and matrix-matrix multiplication. We verified the accuracy of our implementations by comparing the computed results with the ones produced by the multiple precision MPFR library.
0 5 10 15 20 25
1000 10000 100000 1e+06 1e+07 1e+08 1e+09
Gacc/s Array size Parallel FP sum Serial sum TBB deterministic Demmel fast Superacc FPE2 + Superacc FPE4 + Superacc FPE8EE + Superacc 0 1 2 3 4 5 6
1 1e+20 1e+40 1e+60 1e+80 1e+100 1e+120 1e+140
Gacc/s Dynamic range Parallel FP Sum Superacc FPE2 + Superacc FPE3 + Superacc FPE4 + Superacc FPE8EE + Superacc
Fig. 2: The summation performance results on an 8-core Intel Xeon E5-4650L.
0 2 4 6 8 10
ParallelFPSumDemmelFastTBBDeterministicSuperaccFPE2+SuperaccFPE4+SuperaccFPE8EE+SuperaccParallelFPSumDemmelFastTBBDeterministicSuperaccFPE2+SuperaccFPE4+SuperaccFPE8EE+SuperaccParallelFPSumDemmelFastTBBDeterministicSuperaccFPE2+SuperaccFPE4+SuperaccFPE8EE+SuperaccParallelFPSumDemmelFastTBBDeterministicSuperaccFPE2+SuperaccFPE4+SuperaccFPE8EE+SuperaccParallelFPSumDemmelFastTBBDeterministicSuperaccFPE2+SuperaccFPE4+SuperaccFPE8EE+SuperaccParallelFPSumDemmelFastTBBDeterministicSuperaccFPE2+SuperaccFPE4+SuperaccFPE8EE+SuperaccParallelFPSumDemmelFastTBBDeterministicSuperaccFPE2+SuperaccFPE4+SuperaccFPE8EE+Superacc
Normilized time to parallel summation
MPI Processes Computation Reduction 64 32 16 8 4 2 1
Fig. 3: Performance scaling on the Mesu cluster composed of 64 Intel Xeon E5-4650L nodes.
0 0.2 0.4 0.6 0.8 1 1.2 0 2000 4000 6000 8000 10000 12000 14000 16000 Time [secs] Matrix size [n] Parallel DTRSV Superacc FPE3 + Superacc FPE4 + Superacc FPE6 + Superacc FPE8 + Superacc FPE6EE + Superacc 0 0.5 1 1.5 2 0 500 1000 1500 2000 Time [secs] Matrix size [m = n = k] Parallel DGEMM Superacc FPE3 + Superacc FPE4 + Superacc FPE6 + Superacc FPE8 + Superacc FPE4EE + Superacc
Fig. 4: The DTRSV and DGEMM performance results on NVIDIA K5000 and Tesla K20c, respectively.
Conclusions and Future Work
We presented an approach to achieve correct rounding for the floating-point summation problem, along with implementations on multi- and many-core architectures. This yielded results that are both reproducible and accurate to the last bit at no performance lost. We applied this approach to the other BLAS routines. Even though the performance of trian-gular solver and matrix product can be argued, their output is consistently reproducible and accurate independently of threads scheduling and data partitioning.
We plan to extend this approach to the rest of BLAS routines, derive specific implementa-tions for compute-bound algorithms to be within 10 times performance overhead at most, and perform theoretical analysis of their accuracy.
Acknowledgement
This work undertaken (partially) in the framework of CALSIMLAB is supported by the public grant ANR-11-LABX-0037-01 overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program (reference: ANR-11-IDEX-0004-02). This work was also granted access to the HPC resources of ICS financed by Region ˆIle-de-France and the project Equip@Meso (reference ANR-10-EQPX-29-01).
References
[1] Sylvain Collange, David Defour, Stef Graillat, and Roman Iakymchuk. Full-Speed Deterministic Bit-Accurate Parallel Floating-Point Summation on Multi- and Many-Core Architectures. Technical Report HAL: hal-00949355, INRIA, DALI–LIRMM, LIP6, ICS, February 2014.
[2] Y. Hida, X. S. Li, and D. H. Bailey. Algorithms for quad-double precision floating point arithmetic. In Proceedings of the 15th IEEE Symposium on Computer Arithmetic, pages 155–162. IEEE Computer Society Press, Los Alamitos, CA, USA, 2001. [3] U. W. Kulisch. Computer arithmetic and validity, volume 33 of de Gruyter Studies in Mathematics. Walter de Gruyter &
Co., Berlin, 2nd edition, 2013. Theory, implementation, and applications.