• Aucun résultat trouvé

Short Introduction to the Roofline Model

N/A
N/A
Protected

Academic year: 2021

Partager "Short Introduction to the Roofline Model"

Copied!
9
0
0

Texte intégral

(1)

Short Introduction

to the Roofline Model

UL HPC School

June 2019

Xavier Besseron

University of Luxembourg, RUES Luxembourg XDEM Research Centre

http://luxdem.uni.lu/

(2)

Roofline model Overview

Estimate the performance of an algorithm on a given CPU

○ Also applies to GPUs, TPUs, etc.

● Throughput oriented model

● Identify the bottleneck

● Allow to improve the implementation of an algorithm

Model of the CPU

Model of the algorithm

Model of the performance

(3)

Roofline model

Peak performance limited by

● Compute operations: Gflop/s

● Data bandwidth: GB/s Algorithm characteristics

● Operations: Gflop

● Data: GB

CPU

(compute, flop/s)

FPU FPU

Memory

(data, GB)

Bandwidth

(GB/s)

A

(data, GB)

B

(data, GB)

C

(data, GB)

Operations (flop)

Model of a CPU Model of an algorithm

Arithmetic Intensity AI: flop / Byte

Peak Gflop/s

Attainable

(4)

Compute-bound Memory-bound

Roofline Plot

Performance [Gflop/s] (logscale)

Arithmetic Intensity [flop/Byte] (logscale)

Peak GB/s x AI

Peak Gflop/s Attainable Gflop/s

Algorithm with given AI Measured

performance Maximal attainable performance

(5)

Advanced Roofline Plot

Performance [Gflop/s] (logscale)

Arithmetic Intensity [flop/Byte] (logscale)

DRAM Peak GB/s x AI

Peak Gflop/s with SIMD L2 Cache Peak GB/s x AI

Peak Gflop/s without SIMD

Attainable Gflop/s with cache optimization and SIMD

Attainable Gflop/s without cache optimization and SIMD Algo1 requires

cache optimization

Algo2 requires vectorization

Change algorithm to reduce data access

(6)

Comments about the Roofline Model

In theory

● Gives good insight of the bottleneck of a given algorithm

In practice, use automatic tools

● CPU model can be hard to find

● Algorithm characterization is hard for complex algorithm

Warning

● The Roofline Model tells if an algorithm performs well,

● not if the algorithm is the best for your problem

e.g. Bubble sort O(n

2

) vs Quicksort O(n log n)

(7)

Tools to work with the roofline

CS Roofline Toolkit, Berkeley Lab

https://bitbucket.org/berkeleylab/cs-roofline-toolkit/

LIKWID, RRZE-HPC

https://github.com/RRZE-HPC/likwid

Intel Advisor, Intel

https://software.intel.com/en-us/advisor

(8)

More details

Roofline: An Insightful Visual Performance Model for Multicore

Architectures, Williams et al., CACM, 2009

https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf

Performance Tuning of Scientific Codes with the Roofline Model, Williams et

al., SC’18 Tutorial, 2018

https://crd.lbl.gov/assets/Uploads/SC18-Roofline-1-intro.pdf

Applying the roofline model, Ofenbeck et al., ISPASS, 2014

http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/ispass-2013_177.pdf

(9)

Practical session with Intel Advisor on Iris

Follow the instructions at https://gitlab.uni.lu/xbesseron/tutorial_roofline_model

Références

Documents relatifs

Our contribution in this field of study focuses on implementating reliable solar potential prediction model that able to predict the SENELEC’s capacity of production for

impact of the number of processors, the size of fast memory, and. the fast memory bandwidth, by varying these parameters

The tests consist in measuring time needed to execute one of the six memory operation for different sizes.. are used: in one test, allocated data size is fixed and we vary the

However, the modeller only has access to the internal error of the model: this error is biased and underestimates the generalization error because the same data have been used

The problem can be stated as follows: given (i) an application represented as directed acyclic graph (DAG), and (ii) a many-core platform with P identical processors sharing

methods developed for metazoan phagocytic blood cells to investigate prey recognition by O. Wootton et al., 2007); Keeling and Slamovits have uncovered exciting genetic

A memory transaction (corresponding to either a load or a store operation) between a SRI master and a SRI slave is modeled by a sequence of synchronizations of the

Ainsi au travers d’exemples, nous verrons dans cette partie que les réactions de polymérisation ou réticulation à l’interface eau-air permettent dans certains cas la