CONTACT NAME
Mickaël Seznec: mickael.seznec@centralesupelec.fr
POSTER
P21960
CATEGORY: ALGORITHMS / NUMERICAL TECHNIQUES - 01
Mickaël Seznec 1,2 (mickael.seznec@centralesupelec.fr), Nicolas Gac 1 , Francois Orieux 1 , Alvin Sashala Naik 2
1 L2S, CentraleSupelec, Gif-Sur-Yvette, France 2 Thales Research and Technology, Palaiseau, France
We designed a new algorithm for 2D convolutions that uses Tensor Cores. Contrary to the cuDNN implementation, it does not rely on batched kernels
for efficiency. Results already beat current libraries in some contexts and could be further optimized.
A New Convolution Algorithm to Leverage Tensor Cores
Context
Tensor Cores promise 12X the throughput for matrix multiplication (GEMM)
• Tensor Cores have shipped with Nvidia GPUs since the Volta architecture (2017).
• Dedicated hardware units for matrix multiplication.
• Handle sub-FP32 precisions (FP16 inputs with FP32 accumulation)
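To make the programming model concrete, here is a minimal WMMA sketch (our own illustration, not code from the poster): one warp multiplies a pair of 16x16 FP16 tiles and accumulates in FP32, the native Tensor Core operation.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16x16 half-precision tile product with FP32
// accumulation. Compile for sm_70 or later; launch one warp per tile.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);            // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // executes on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}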
Convolutions in CNNs already leverage Tensor Cores...
• Convolutions are not natively matrix multiplications
• The im2col algorithm rearranges batched convolutions into a matrix multiplication
• An implementation is available in cuDNN.
• With a single image and a single kernel, im2col degenerates into a vector-matrix multiplication (see the sketch below)
• Low arithmetic intensity = execution limited by memory bandwidth
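To see where the bandwidth goes, here is a hypothetical im2col kernel for the single-image, single-channel case (zero padding and all names are our assumptions): each thread fills one column of the (K*K, H*W) matrix, so every input value is copied up to K*K times.

// Hypothetical im2col kernel: one thread per output pixel fills the
// corresponding column of the (K*K, H*W) matrix (zero padding assumed).
__global__ void im2col(const float *img, float *col, int H, int W, int K) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // output pixel index
    if (idx >= H * W) return;
    int y = idx / W, x = idx % W, r = K / 2;
    for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx) {
            int iy = y + ky - r, ix = x + kx - r;
            bool inside = (iy >= 0 && iy < H && ix >= 0 && ix < W);
            // Row = kernel tap, column = pixel: the same input value is
            // duplicated up to K*K times, hence the bandwidth pressure.
            col[(ky * K + kx) * (H * W) + idx] = inside ? img[iy * W + ix] : 0.0f;
        }
}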
...But poor performance with traditional workloads
Proposed algorithm

Definitions and first approach:
• Does not use the circulant matrix associated with the convolution
• Builds overlapping sub-matrices directly from the image lines
• The convolution kernel is transposed
• The final result is collected with a sum along the diagonals
• Leads to a (K, K) x (K, H x W) matrix multiplication = better arithmetic intensity! (A scalar sketch follows this list.)
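A scalar reference of the scheme (our reconstruction from the description above, FP32 host code for clarity; the actual implementation targets FP16 Tensor Cores): one output line y is the product of the transposed kernel with the K image lines centered on y, followed by a sum along the diagonals.

#include <vector>

// Scalar sketch: out[y][.] = diagonal-sum( ker^T x (K image lines around y) ).
// Zero padding assumed; ker is the KxK convolution kernel, row-major.
void conv_line(const float *img, const float *ker, float *out,
               int H, int W, int K, int y) {
    int r = K / 2;
    std::vector<float> tmp(K * W, 0.0f);  // tmp = ker^T * sub-matrix, (K, W)
    for (int kx = 0; kx < K; ++kx)        // the (K, K) x (K, W) GEMM, naively
        for (int x = 0; x < W; ++x)
            for (int ky = 0; ky < K; ++ky) {
                int iy = y + ky - r;
                if (iy >= 0 && iy < H)
                    tmp[kx * W + x] += ker[ky * K + kx] * img[iy * W + x];
            }
    for (int x = 0; x < W; ++x) {         // diagonal sum across the K taps
        float acc = 0.0f;
        for (int kx = 0; kx < K; ++kx) {
            int ix = x + kx - r;
            if (ix >= 0 && ix < W) acc += tmp[kx * W + ix];
        }
        out[y * W + x] = acc;
    }
}

Stacking the H per-line products gives the (K, K) x (K, H x W) shape quoted above, or equivalently a GEMM broadcast over a 3D tensor (Fig 4).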
Initial implementation with cuBLAS:
• The input matrix is built explicitly by duplicating image lines -> costs time and memory
• But it relies on the highly optimized GEMMs of cuBLAS
• Comparison of the new implementation with the simple version (the time to create the matrix is not counted)
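A hedged sketch of how such a GEMM can request Tensor Cores through cuBLAS (CUDA 10-era cublasGemmEx signature; the poster's exact call is not shown). cuBLAS is column-major, so the row-major (K, K) x (K, H*W) product is issued with the operands swapped:

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Assumed call, not necessarily the poster's exact code. Row-major
// C (K, H*W) = kerT (K, K) * M (K, H*W) is computed as the column-major
// product C^T = M^T * kerT^T by swapping the operands. FP16 inputs with
// FP32 accumulation make the GEMM eligible for Tensor Cores.
void gemm_tensor_op(cublasHandle_t handle, const __half *kerT,
                    const __half *M, float *C, int K, int HW) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 /*m=*/HW, /*n=*/K, /*k=*/K,
                 &alpha,
                 M,    CUDA_R_16F, /*lda=*/HW,
                 kerT, CUDA_R_16F, /*ldb=*/K,
                 &beta,
                 C,    CUDA_R_32F, /*ldc=*/HW,
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}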
Defining the matrix implicitly:
• The matrix does not have to be created explicitly: each CUDA block knows which line to fetch from the image
• Allows image lines to be reused within the same CUDA kernel, either from caches or from shared memory
• Makes the algorithm's code slightly more complex (a sketch of the indexing follows)
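A tiny sketch of the implicit definition (the helper name is hypothetical): any element (row, col) of the (K, H*W) operand is derived on the fly from image coordinates, so the matrix never has to exist in global memory.

// Hypothetical device helper: element (row, col) of the implicit (K, H*W)
// operand for output line y is just an image lookup; the K lines it touches
// can be staged once in shared memory and reused by the whole block.
__device__ float implicit_elem(const float *img, int H, int W, int K,
                               int y, int row, int col) {
    int iy = y + row - K / 2;   // the matrix row selects the image line
    return (iy >= 0 && iy < H) ? img[iy * W + col] : 0.0f;
}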
Using Nsight Compute to assess Tensor Core performance
• Nsight Compute has been available since CUDA 10.0 (2018)
• It replaces nvvp and nvprof
• It can show Tensor Core utilization
• We tried to match the performance of the GEMM example in the CUDA samples
Results

First results:
• Real impact when the kernel is big
• On par with other approaches for smaller kernels
• FFT-based convolution is asymptotically better
• Some optimizations are still needed to outperform other libraries across a larger range of kernel sizes
References:
• Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). cuDNN: Efficient primitives for deep learning.
• Vasudevan, A., Anderson, A., & Gregg, D. (2017, July). Parallel multi channel convolution using general matrix multiplication.
• Seznec, M., Gac, N., Ferrari, A., & Orieux, F. (2018, October). A study on convolution using half-precision floating-point numbers on GPU for radio astronomy deconvolution. (IEEE SiPS)
Plots 1-3: Execution time vs. kernel radius on RTX 2080 (lower is better).
Fig 1: Convolution with a 3x3 kernel
Fig 2: im2col rewrites convolutions as a GEMM
Fig 3: Our algorithm computes one line of the result with a GEMM + a diagonal sum
Fig 4: Our algorithm can be viewed as a GEMM broadcast to a 3D tensor
Fig 5: Using Nsight Compute to profile Tensor Cores