Modeling and minimizing memory contention in general-purpose GPUs


Dutch title: Modelleren en minimaliseren van geheugencontentie

Lu Wang

Doctor of Engineering Sciences: Computer Science

Chair: Prof. Dr. Ir. K. De Bosschere


ISBN 978-94-6355-408-4 NUR 980, 987

Legal deposit: D/2020/10.500/85


Examination Committee

Prof. Filip De Turck, chair
Department of Information Technology, Faculty of Engineering and Architecture, Ghent University

Prof. Koen De Bosschere, secretary
Department of Electronics and Information Systems, Faculty of Engineering and Architecture, Ghent University

Prof. Lieven Eeckhout, supervisor
Department of Electronics and Information Systems, Faculty of Engineering and Architecture, Ghent University

Prof. Jan Fostier
Department of Information Technology, Faculty of Engineering and Architecture, Ghent University

Prof. Magnus Jahre
Department of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology

Dr. Cecilia González Álvarez
Nokia Bell Labs Belgium

Prof. David R. Kaeli
Department of Electrical and Computer Engineering, Northeastern University


Acknowledgements

Time flies; I am still surprised that I have already finished my PhD. The past four years in Gent have been a truly memorable time in my life.

I express my deep gratitude to my advisor, Professor Lieven Eeckhout, for his valuable guidance both in work and in life during my PhD. Without him, I could not have finished my PhD successfully. He taught me how to do research in a systematic way and how to write an academic paper. He also taught me how to present my work in a lecture. These lessons will always be at the foundation of my professional life. Lieven is a generous and patient professor who always enjoys discussing problems with students. I still remember that he spent a large amount of time on my first paper because of my poor English writing. He said he really enjoyed working with his students. Those words encouraged me to work harder in the following years. Meanwhile, his great passion for research inspired me to be a better researcher. It has been an honor to work with him and learn from him in the past four years.

Many thanks to the members of my examination committee for their valuable feedback and insightful questions, which helped me improve my thesis. I especially want to thank David and Magnus since they helped me a lot during my PhD. Their collaboration and guidance contributed to several publications during my PhD. David is a true expert in the field of GPU architecture. He gave me a lot of suggestions in my first project and encouragement for my future research. I have been working with Magnus on the MDM project for almost two and a half years. I often doubted myself when I started working on this project since performance modeling was really new to me. Fortunately, Magnus was willing to help me with any problem in this research, and he gave me the confidence to investigate this topic further.

I also want to thank Koen for organizing the HiPEAC summer school. It is a good place to learn about the latest research in computer architecture and to have detailed discussions with experts in our area. Special thanks to Cecilia for serving as my committee member and for guiding me when I joined the research group.

I want to thank all my lab colleagues: Almutaz Adileh, Ajeya Naithani, Cecilia González Álvarez, Shoaib Akram, Sander De Pestel, Sam Van den Steen, Kartik Lakshminarasimhan, Xia Zhao, Wenjie Liu, Shiqing Zhang, Yuxi Liu and Jennifer Sartor, for the discussions in group meetings and for their valuable suggestions during my PhD. I especially want to thank Almutaz for his kindness. He was always patient and ready to help me with any problem. I also want to mention Xia and Yuxi since they gave me a lot of help both in research and in life, especially during the first year of my PhD. I really cherish my friendship with Wenjie and Shiqing. Our lunch meetings were an unforgettable experience in Gent. Moreover, I would like to thank the department staff for their help with technical issues, especially Marnix, Vicky and Ronny. Marnix provided generous help in cases such as arranging flights for conferences.

I would like to thank all my friends in Gent: Xin Cheng, Luyuan Li, Lei Luo, Boxuan Gao, Sheng Yan, Yun Zhou, MeiZhu Li, Zhongjia Yu and others. Without them, I would not have had such a splendid and happy life in the past four years. I will never forget our wonderful friendship. I also want to thank my previous advisor in China, Prof. Zhiying Wang. He gave me endless support in applying for the CSC scholarship to pursue my PhD abroad.

Finally, I want to thank my parents. They brought me into the world and gave me endless love. I am so lucky to live in this happy family. They have been open-minded and have supported every decision I made.

Thanks again to my advisor, my collaborators, my lab members, my friends and my family. This PhD is not only for me, but also for you. I will cherish all the enjoyable moments with you forever. The unforgettable experience in this beautiful city will be a treasure for the rest of my life.

Lu Wang
Gent, September 6, 2020


Samenvatting (Dutch Summary)

Throughput processors such as Graphics Processing Units (GPUs) are important components of modern computer systems because of their ability to accelerate data-parallel applications. GPUs are therefore widely used to run a broad range of compute-intensive applications, including scientific computing, machine learning, artificial intelligence, data analytics and medical imaging. To simplify GPU programming, new programming languages such as CUDA and OpenCL were developed. A GPU application typically consists of a number of kernels, where each kernel consists of hundreds of thousands of threads that are in turn grouped into so-called Cooperative Thread Arrays (CTAs). A host of libraries and frameworks drastically simplifies GPU programming. As a result, GPUs today play an increasingly important role for cloud providers such as Amazon, Google, IBM and Facebook.

At the same time, hardware designers keep improving and refining GPU designs to offer ever higher compute capability and memory bandwidth. The latest Nvidia Volta GPU, for example, integrates 84 Streaming Multiprocessors (SMs) and provides 900 GB/s of memory bandwidth. This trend will continue with new technologies such as GPUs composed of multiple modules or chiplets. Developing and optimizing new generations of GPUs for an ever broader range of applications is very challenging. Far-reaching innovations in infrastructure and methodology are therefore required to simulate and model new generations of GPUs. In addition, the GPU architecture needs to be optimized to run the newest applications as fast and as efficiently as possible.

In this dissertation, we first focus on optimizing the interconnection network in contemporary clustered GPUs. Clustering groups several SMs into a cluster to reduce the pressure on the interconnection network. This makes the interconnection network easier to scale to GPUs with many tens of SMs. However, clustering leads to congestion at the access ports of the interconnection network. In this work, we exploit the locality that exists among CTAs mapped to the same cluster. We propose Intra-Cluster Coalescing (ICC) and the Coalesced Cache (CC) to eliminate redundant requests within a cluster and thereby reduce the amount of traffic over the interconnection network. In addition, we propose Distributed-Block Scheduling (DBS) to exploit locality at the level of an individual SM as well as at the level of a cluster. Second, we observe in this dissertation that memory divergence is widespread; this means that a single instruction leads to memory accesses to different locations in memory. Memory divergence is a performance bottleneck for many new and important GPU applications, for example in machine learning and data analytics. Existing analytical performance models, however, fail to produce accurate performance estimates for memory-divergent GPU applications. In this dissertation, we propose the Memory Divergence Model (MDM), which considerably improves both accuracy and modeling speed.

Intra-cluster coalescing (ICC). The increasing performance of contemporary GPUs is enabled by integrating an ever larger number of SMs. To enable a scalable interconnection network that connects the SMs to memory, a cluster architecture was introduced in contemporary GPUs in which several SMs are grouped into a cluster with a single network port per cluster. Sharing a network port, however, leads to congestion when multiple SMs want to send packets over the network at the same time, which hurts performance. In this dissertation, we observe that on average 19% (and up to 48%) of the packets sent from a cluster are redundant; in other words, different SMs in a cluster access the same data within a short time frame. We propose Intra-Cluster Coalescing (ICC) and the Coalesced Cache (CC) to reduce the pressure on the interconnection network. ICC groups packets to the same data element or cache line into a single request to reduce the number of requests. To extend the time window within which requests are grouped, we add the CC, which lets ICC keep track of recently coalesced data elements (cache lines) and thereby further reduces the number of requests over the interconnection network. ICC and CC improve performance by 15% on average (and up to 69%). Moreover, ICC/CC reduces energy consumption by 5.3% on average (and up to 16.7%), which improves the Energy-Delay Product (EDP) by 12% on average (and up to 30%).

Distributed-block scheduling (DBS). We also show in this dissertation that there is an important interaction between ICC and CTA scheduling (the scheduling or mapping of CTAs onto SMs and clusters of SMs). We observe that ICC benefits from mapping neighboring CTAs to the same cluster so that the existing locality among CTAs can be exploited. Based on this observation, we propose Distributed-Block Scheduling (DBS). In contrast to prior work, DBS exploits locality within an SM and within a cluster. This two-level approach maximizes the opportunity to exploit locality: groups of consecutive CTAs are first mapped to clusters, after which two consecutive CTAs are mapped together per SM. Locality is thereby exploited at the level of the L1 cache (i.e., the reuse distance between two consecutive accesses to the same cache line is shortened, which leads to a higher cache hit rate) and at the level of the Miss Status Handling Registers (MSHRs) (i.e., on an L1 miss, multiple accesses to the same cache line are coalesced). We show that DBS improves performance by 4% on average (and up to 16%), and reduces energy consumption and EDP by 1.2% and 5%, respectively. Moreover, we show that DBS is complementary to ICC/CC, which leads to an average performance improvement of 16% (and up to 67%) and an energy reduction of 6% (and up to 18%).

Memory divergence model (MDM). Analytical performance models enable computer architects to explore huge design spaces efficiently, many orders of magnitude faster than detailed simulation. High accuracy is of course necessary to reach correct conclusions. In this dissertation, we focus on memory-divergent GPU applications, which are increasingly widespread in, among others, machine learning and data analytics. The lack of spatial locality leads to frequent blocking at the cache because an SM initiates substantially more memory accesses than the cache can support. As a result, the GPU is no longer able to hide the main memory access latency through thread-level parallelism. In this dissertation, we propose the Memory Divergence Model (MDM), which models the key execution characteristics of memory-divergent GPU applications, including the serialization of memory accesses and the increasing queueing delays in the interconnection network and main memory. We validate MDM against detailed simulation as well as real hardware, and we report considerable improvements in (1) applicability: MDM models both memory-divergent and non-memory-divergent GPU applications; (2) speed: MDM uses dynamic binary instrumentation and is therefore 6.1× faster than modeling based on functional simulation; and (3) accuracy: the average error of MDM is limited to 13.9%, a considerable improvement over the state-of-the-art GPUMech model with an average error of 162%. We furthermore show that MDM is useful both for design space exploration and for evaluating dynamic voltage and frequency scaling.

We conclude that GPUs are widespread as accelerators for compute-intensive applications. In this dissertation, we investigate the challenges and opportunities of new GPU architectures and their applications. More specifically, we use the locality among CTAs to relieve the pressure on the interconnection network in clustered GPU architectures. To this end, we propose three complementary innovations: Intra-Cluster Coalescing, the Coalesced Cache and Distributed-Block Scheduling. These innovations eliminate redundant memory accesses, which considerably improves performance and energy efficiency. In addition, we show that memory-divergent applications behave fundamentally differently from more common non-memory-divergent applications. We propose the Memory Divergence Model to accurately model the performance of such applications analytically.


Summary

Throughput processors such as Graphics Processing Units (GPUs) are widely used to accelerate a broad range of emerging throughput-oriented applications, e.g., scientific computing, machine learning, artificial intelligence, data analytics and medical imaging. To ease programming, bulk-synchronous parallel (BSP) programming models such as OpenCL and CUDA have been developed, in which a GPU application is typically divided into several kernels and each kernel consists of hundreds of thousands of threads grouped in Cooperative Thread Arrays (CTAs). A variety of libraries and frameworks have been built on top of CUDA/OpenCL to further ease the burden on GPU programmers. As a result, this programmability makes GPU platforms a top choice for cloud infrastructures at major companies such as Amazon, Google, IBM and Facebook.

At the same time, hardware designers strive to scale GPU performance by improving both compute capability and memory bandwidth. For instance, the latest Nvidia Volta GPU integrates 84 streaming multiprocessors (SMs) and delivers 900 GB/s peak memory bandwidth. This trend will continue thanks to new technologies such as multi-module GPUs. However, adapting GPUs to new applications and architectures remains challenging and touches different aspects of architecture design. For instance, infrastructure such as analytical models and simulators must be developed for new applications and architectures to enable in-depth performance analysis.

In this thesis, we first focus on reducing Network-on-Chip (NoC) pressure in modern-day clustered GPU architectures. In particular, a cluster structure groups several SMs to address the NoC scalability problem. However, it exacerbates NoC congestion because the SMs in a cluster share a network port. Fortunately, inter-CTA locality in many GPU-compute applications provides an opportunity to cope with this challenge. In response, we propose intra-cluster coalescing (ICC) and the coalesced cache (CC) unit to eliminate redundant requests among SMs in the same cluster and significantly reduce NoC traffic. In addition, we propose distributed-block CTA scheduling for clustered GPU architectures, which exploits locality at both the SM level and the cluster level. Second, we observe that memory divergence is prevalent and commonly becomes the performance bottleneck in emerging GPU applications such as machine learning and data analytics. Unfortunately, state-of-the-art GPU analytical models focus mainly on traditional non-memory-divergent applications and fail to capture the performance behavior of memory-divergent applications. In this thesis, we propose the memory divergence model (MDM), which significantly improves prediction accuracy while also modeling faster.

Intra-cluster coalescing to reduce NoC pressure. GPUs continue to boost the number of streaming multiprocessors (SMs) to provide increasingly higher compute capabilities. To construct a scalable crossbar network-on-chip (NoC) that connects the SMs to the memory controllers, a cluster structure is introduced in modern GPUs in which several SMs are grouped together to share a network port. Because of network port sharing, clustered GPUs face severe NoC congestion, which creates a critical performance bottleneck. In this thesis, we observe that in many GPU-compute applications, an average of 19% (and up to 48%) of the requests within a cluster are redundant, which wastes limited NoC bandwidth. In response, we propose intra-cluster coalescing (ICC) and the coalesced cache (CC) to reduce NoC pressure in clustered GPUs. In particular, ICC coalesces outstanding misses from SMs in the same cluster when they access the same cache line. To extend the time window for coalescing, the CC complements ICC by keeping track of recently coalesced cache lines. Hits in the CC further reduce NoC traffic. Our experiments show that ICC along with CC improves performance by 15% on average (and up to 69%) while at the same time reducing system energy by 5.3% on average (and up to 16.7%) and the energy-delay product (EDP) by 12% on average (and up to 30%).
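As a rough illustration of the coalescing idea in this paragraph, the following Python sketch counts how many read requests a single cluster sends over the NoC for a small, made-up miss trace. It is a toy functional model under stated assumptions, not the hardware design evaluated in this thesis: the function name `count_noc_requests`, the fixed round-trip `latency`, the LRU replacement in the CC, and the choice to insert a line into the CC only if at least one miss was merged into it are all assumptions made for illustration.

```python
from collections import OrderedDict


def count_noc_requests(misses, use_icc=True, cc_entries=0, latency=4):
    """Count NoC read requests issued by one cluster (toy model).

    misses:     list of (cycle, sm_id, cache_line) L1 misses, sorted by cycle.
    use_icc:    merge misses to a line that is already in flight.
    cc_entries: capacity of the coalesced cache (0 disables it); assumed LRU.
    latency:    assumed fixed round-trip time of a NoC request, in cycles.
    """
    outstanding = {}     # cache line -> [return cycle, was it coalesced?]
    cc = OrderedDict()   # recently coalesced lines, least recent first
    sent = 0
    for cycle, _sm, line in misses:
        # Retire in-flight requests whose data has come back by now.
        for done in [l for l, (t, _) in outstanding.items() if t <= cycle]:
            _, coalesced = outstanding.pop(done)
            if coalesced and cc_entries > 0:
                cc[done] = True
                cc.move_to_end(done)
                if len(cc) > cc_entries:
                    cc.popitem(last=False)   # evict the least recent line
        if use_icc and line in outstanding:
            outstanding[line][1] = True      # merge with the in-flight miss
            continue
        if use_icc and line in cc:
            cc.move_to_end(line)             # CC hit: served inside the cluster
            continue
        outstanding[line] = [cycle + latency, False]
        sent += 1
    return sent


# Hypothetical trace: SMs 0 and 1 miss on line "A" close together in time,
# and SM 3 asks for "A" again after the first response has returned.
trace = [(0, 0, "A"), (1, 1, "A"), (2, 2, "B"), (10, 3, "A"), (11, 0, "B")]
baseline = count_noc_requests(trace, use_icc=False)
icc_only = count_noc_requests(trace, use_icc=True)
icc_cc = count_noc_requests(trace, use_icc=True, cc_entries=2)
```

On this trace, ICC alone removes the second request for "A" because it arrives while the first is still in flight, and the CC additionally removes the late request for "A" because that line was recently coalesced. The request for "B" at cycle 11 still goes out, since "B" was never shared while in flight.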

Distributed-block scheduling policy. By investigating inter-CTA locality, we demonstrate a significant interaction between ICC and the CTA scheduling policy. We find that ICC benefits more when the CTA scheduling policy maps neighboring CTAs to the same cluster to better exploit inter-CTA locality. Motivated by this observation, we propose distributed-block scheduling. In contrast to prior work, distributed-block scheduling exploits cache locality at both the cluster level and the SM level. This two-level approach maximizes the opportunities to exploit inter-CTA locality at the L1 cache within an SM as well as between SMs in the same cluster by first mapping a group of consecutive CTAs at the cluster level, and by subsequently mapping pairs of consecutive CTAs at the SM level. Inter-CTA locality is exploited to improve L1 cache performance (decreasing the reuse distance between accesses to the same memory location, thereby increasing the L1 cache hit rate) and to increase the coalescing opportunities in the L1 miss status handling registers (MSHRs). Through execution-driven GPU simulation, we find that distributed-block scheduling improves GPU performance by 4% (and up to 16%) while at the same time reducing system energy and EDP by 1.2% and 5%, respectively, compared to the state-of-the-art distributed scheduling policy. In addition, distributed-block scheduling works synergistically with ICC and CC to improve performance by 16% (and up to 67%) and reduce system energy by 6% (and up to 18%).
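The two-level mapping described above (consecutive groups of CTAs per cluster, then consecutive pairs per SM) can be sketched in Python as follows. This is a minimal sketch of the first wave of CTA assignments only, under assumptions that are not taken from the thesis: the function name is hypothetical, CTAs are numbered from 0, each cluster's group is an equal-sized contiguous chunk of the grid, and what happens when a CTA finishes is not modeled.

```python
def distributed_block_initial_mapping(num_ctas, num_clusters, sms_per_cluster,
                                      ctas_per_sm=2):
    """Return {(cluster, sm): [cta ids]} for the first wave of CTAs.

    Assumption: cluster c owns the contiguous CTA range
    [c * group, (c + 1) * group), and each SM takes the next `ctas_per_sm`
    consecutive CTAs from its cluster's range.
    """
    group = -(-num_ctas // num_clusters)   # ceiling division
    mapping = {}
    for cluster in range(num_clusters):
        first = cluster * group
        last = min(first + group, num_ctas)   # exclusive end of this group
        for sm in range(sms_per_cluster):
            start = first + sm * ctas_per_sm
            end = min(start + ctas_per_sm, last)
            mapping[(cluster, sm)] = list(range(start, end))
    return mapping


# Same shape as the example configuration in Figure 4.1: 10 CTAs,
# 2 clusters with 2 SMs each, at most 2 CTAs per SM.
example = distributed_block_initial_mapping(num_ctas=10, num_clusters=2,
                                            sms_per_cluster=2)
```

In this example, neighboring CTAs 0 and 1 share an L1 cache on one SM, while CTAs 0 to 3 share a cluster and hence a network port. CTAs 4 and 9 remain in their clusters' groups and wait for a free slot.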

Memory Divergence Model. Analytical models enable architects to carry out early-stage design space exploration several orders of magnitude faster than cycle-accurate simulation by capturing performance-related behavior through a set of mathematical equations. However, this speed advantage is void if the conclusions obtained through the model are misleading due to model inaccuracies. In this work, we focus on analytically modeling the performance of emerging memory-divergent GPU-compute applications, which are common in domains such as machine learning and data analytics. The poor spatial locality of these applications leads to frequent L1 cache blocking because the application issues significantly more concurrent cache misses than the cache can support, which cripples the GPU's ability to use Thread-Level Parallelism (TLP) to hide memory latencies. Motivated by this observation, we propose the GPU Memory Divergence Model (MDM), which faithfully captures the key performance characteristics of memory-divergent applications, including memory request batching and excessive NoC/DRAM queueing delays. We validate MDM against detailed simulation and real hardware, and report substantial improvements in (1) scope: the ability to model emerging memory-divergent applications in addition to traditional non-memory-divergent applications; (2) practicality: 6.1× faster by computing model inputs using binary instrumentation as opposed to functional simulation; and (3) accuracy: 13.9% average prediction error versus 162% for the state-of-the-art GPUMech model. In addition, MDM is useful for design space exploration as well as for evaluating dynamic voltage and frequency scaling (DVFS).
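To make the notion of memory divergence concrete, the Python sketch below counts how many distinct cache lines a single warp-wide load touches for a given access stride, and how many rounds of misses result when the number of in-flight misses is capped. It is only an illustration and not part of MDM's equations: the function names, the 128-byte cache line, the 4-byte element size, the 32 MSHR entries and the simple ceiling-based batching rule are all assumptions chosen for the example.

```python
import math


def lines_touched_by_warp(stride_elems, warp_size=32, elem_bytes=4,
                          line_bytes=128, base=0):
    """Number of distinct cache lines one warp-wide load touches.

    Thread t is assumed to access address base + t * stride_elems * elem_bytes.
    """
    addresses = [base + t * stride_elems * elem_bytes for t in range(warp_size)]
    return len({addr // line_bytes for addr in addresses})


def mshr_batches(num_warps, lines_per_warp, mshr_entries):
    """Rounds needed if at most `mshr_entries` misses can be in flight at once.

    Assumes every touched line misses in the L1 and no two warps share a line.
    """
    return math.ceil(num_warps * lines_per_warp / mshr_entries)


coalesced = lines_touched_by_warp(stride_elems=1)    # unit stride
divergent = lines_touched_by_warp(stride_elems=32)   # one line per thread
rounds_coalesced = mshr_batches(num_warps=4, lines_per_warp=coalesced,
                                mshr_entries=32)
rounds_divergent = mshr_batches(num_warps=4, lines_per_warp=divergent,
                                mshr_entries=32)
```

With unit stride the whole warp falls into one cache line, so four warps need only four MSHR entries. With a stride of 32 elements every thread touches its own line, so four warps generate 128 misses, which under these assumptions have to be served in four successive batches of 32.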

In summary, GPUs have emerged as high-performance computing accelerators in modern computer systems in recent years. In this thesis, we investigate both the challenges and the opportunities of new-generation GPU architectures and emerging GPU applications. More specifically, we make use of the inter-CTA locality in GPU-compute applications and propose intra-cluster coalescing (ICC), the coalesced cache (CC) and the distributed-block CTA scheduling policy to eliminate redundant memory requests and reduce NoC pressure in clustered GPUs. In addition, we observe the distinct performance behavior of emerging memory-divergent GPU applications and propose the MDM analytical GPU performance model to achieve fast design space exploration and accurate performance prediction.


List of Figures

2.1 GPU thread hierarchy: a GPU kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other.
2.2 SM architecture: each SM includes a large register file, several caches, and 32 CUDA cores executing threads in a SIMT manner.
2.3 Clustered GPU architecture: SMs within a cluster go through the NoC to access the L2 cache and main memory to serve L1 cache misses.
2.4 CTA scheduling policies: A 2-level round-robin scheduling policy allocates neighboring CTAs to different clusters while a distributed scheduling policy maps neighboring CTAs to the same cluster.
2.5 Interval analysis: An interval is defined as a sequence of instructions at the maximum issue rate followed by stall cycles.
2.6 Generating interval profiles through instruction dependence analysis.
3.1 Quantifying the NoC bottleneck: Normalized IPC when varying the NoC and LLC frequency from 0.25× to 2×. NoC (and LLC) bandwidth is a severe performance bottleneck.
3.2 Intra-cluster locality (fraction of redundant requests versus the total number of requests in a cluster) as a function of a past window of requests under the distributed scheduling policy [24]. A distinction is made between cache line sharing and data sharing. A substantial fraction of NoC requests are redundant because of intra-cluster locality due to cache line sharing or data sharing.
3.3 Data sharing in LUD. L11 is reused for calculating submatrices U12 and U13 (reuse along rows), while U11 is reused for calculating submatrices L21 and L31 (reuse along columns).
3.4 The intra-cluster coalescing (ICC) unit merges L1 cache misses across SMs within a cluster. The coalesced cache (CC) keeps track of recently coalesced cache lines.
3.5 IPC improvement for intra-cluster coalescing (ICC) and the coalesced cache (CC) under distributed scheduling. ICC significantly improves performance by 9.7% on average; ICC along with CC yields an average 15% performance improvement.
3.6 Energy consumption breakdown normalized to distributed scheduling (D). ICC along with CC reduces system energy by 5.3% on average.
3.7 EDP reduction for ICC and CC. ICC along with CC reduces system EDP by 16.7% on average.
3.8 NoC traffic (number of NoC read requests) reduction for ICC and CC. ICC along with CC reduces NoC traffic by 19.5% on average.
3.9 IPC improvement for ICC and CC as a function of the number of clusters while keeping total SM count constant at 60 SMs. ICC along with CC consistently improves performance across different cluster sizes and effective NoC port count per SM.
4.1 Illustrating the five CTA scheduling algorithms for a 10-CTA workload. We assume a GPU architecture with 2 clusters with 2 SMs each; we can allocate at most 2 CTAs per SM. The top row shows the initial mapping of CTAs to clusters and SMs; the bottom row shows the mapping of the next CTA to schedule after CTA 1 finishes its execution.
4.2 Normalized IPC for two-level round-robin, greedy-clustering, global round-robin and distributed CTA scheduling. Distributed scheduling outperforms the other three policies on average.
4.3 Intra-cluster locality for the different CTA scheduling policies. CTA scheduling policies have a substantial impact on the exploitable intra-cluster locality, and distributed scheduling yields the highest opportunity.
4.4 IPC improvement for distributed-block scheduling versus distributed scheduling. Distributed-block scheduling improves performance by 4% on average (up to 16%).
4.5 IPC improvement for distributed-block scheduling with ICC and CC versus distributed scheduling. Distributed-block scheduling with ICC and CC yields an average of 16% (up to 67%) performance improvement.
4.6 Energy consumption breakdown normalized to distributed scheduling (D). Distributed-block scheduling with ICC and CC reduces system energy by 6% on average.
4.7 EDP reduction for distributed-block scheduling without and with ICC plus CC versus distributed scheduling. Distributed-block scheduling by itself reduces system EDP by 5%; along with ICC and CC, it reduces system EDP by 19% on average.
4.8 NoC traffic (number of NoC read requests) reduction for distributed-block scheduling, ICC and CC versus distributed scheduling. Distributed-block scheduling by itself reduces NoC traffic by 6% on average; along with ICC and CC, it reduces NoC traffic by 20% on average.
4.9 IPC improvement for distributed-block scheduling with ICC and CC versus distributed scheduling as a function of the number of clusters while keeping total SM count constant at 60 SMs. Distributed-block scheduling with ICC and CC consistently improves performance across different cluster sizes and effective NoC port count per SM.
5.1 The key components of MDM-based performance prediction.
5.2 Cache behavior for the example kernel in Listing 5.1 with two different grid strides.
5.3 L1 miss latency breakdown for selected GPU compute applications. Delays due to insufficient MSHRs as well as queuing delays in the NoC and DRAM subsystem significantly affect the overall memory access latency of MD-applications, while NMD-applications are hardly affected.
5.4 Example explaining why MSHR utilization results in significantly different performance-related behavior for NMD and MD-applications. MD-applications put immense pressure on the L1 cache MSHRs and thereby severely limit the GPU's ability to use TLP to hide memory latencies.
5.5 IPC prediction error for our NMD and MD-benchmarks under different performance models. MDM significantly reduces the prediction error for the MD-applications.
5.6 Prediction error as a function of NoC bandwidth for the MD-applications.
5.7 Prediction error as a function of DRAM bandwidth for the MD-applications.
5.8 Prediction error as a function of the number of MSHR entries for the MD-applications.
5.9 Prediction error as a function of SM count for the MD-applications.
5.10 Hardware validation: relative IPC prediction error for GPUMech and MDM compared to real hardware. MDM achieves high prediction accuracy compared to real hardware with an average prediction error of 40% compared to 164% for GPUMech (using binary instrumentation).
5.11 Normalized performance for two NMD-applications (BT and HS) as a function of SM count and DRAM bandwidth. Results are normalized to the simulation results at 28 SMs and 480 GB/s DRAM bandwidth. Both GPUMech and MDM capture the performance trend.
5.12 Normalized performance for three MD-applications (CFD, BFS and PVC) as a function of SM count and DRAM bandwidth. All results are normalized to the simulation results at 28 SMs and 480 GB/s DRAM bandwidth. GPUMech not only leads to high prediction errors, it also over-predicts the performance speedup with more SMs and memory bandwidth, in contrast to MDM.
5.13 Prediction error for predicting the relative performance difference at 1.4 GHz versus 2 GHz for the MD-applications for GPUMech, CRISP and MDM. The general-purpose MDM model achieves similar accuracy as the special-purpose CRISP.
5.14 Normalized CPI as a function of NoC/DRAM bandwidth for two memory-divergent benchmarks (BFS and CFD). MDM accurately captures MSHR batching and NoC/DRAM queueing delays, in contrast to GPUMech.
5.15 Normalized CPI as a function of NoC/DRAM bandwidth for the non-memory-divergent BP benchmark. Both GPUMech and MDM capture the performance trend since the number of MSHRs is sufficient.
5.16 Relative IPC prediction error with a streaming L1 cache. MDM improves accuracy compared to GPUMech because it models batching behavior caused by NoC saturation.


List of Abbreviations

ALU Arithmetic Logic Unit
API Application Programming Interface
BSP Bulk-Synchronous Parallel
CC Coalesced Cache
CPU Central Processing Unit
CTA Cooperative Thread Arrays
DPKI Divergent loads Per Kilo Instructions
DRAM Dynamic Random Access Memory
DVFS Dynamic Voltage and Frequency Scaling
EDP Energy-Delay Product
GB Gigabytes
GDDR Graphics Double Data Rate
GPU Graphics Processing Unit
GS Grid Stride
GTO Greedy Then Oldest
ICC Intra-Cluster Coalescing
ICL Intra-Cluster Locality
IPC Instructions Per Cycle
ISA Instruction-Set Architecture
KB Kilobytes
LD Load
L1 Level-1 Cache
L2 Level-2 Cache
LLC Last-Level Cache
LRR Loosely Round Robin
LRU Least-Recently Used
MC Memory Controller
MD Memory Divergent
MDM Memory Divergence Model
ML Machine Learning
MSHR Miss Status Handling Registers
NMD Non-Memory Divergent
NoC Network on Chip
PTX Parallel Thread Execution
RAW Read After Write
SASS Shader Assembler
SIMT Single Instruction Multiple Thread
SM Streaming Multiprocessor
ST Store
TB Thread Block
TLP Thread-Level Parallelism
VC Virtual Channel


Chapter 1

Introduction

This chapter introduces the dissertation’s key contributions.

1.1 GPU Architecture Trends

Throughput processors such as Graphics Processing Units (GPUs) are widely used to accelerate a broad range of emerging throughput-oriented applications, e.g., scientific computing, machine learning, artificial intelligence, data analytics and medical imaging [34, 58, 93]. To ease programming, bulk-synchronous parallel (BSP) programming models such as OpenCL and CUDA have been developed, in which a GPU application is typically divided into several kernels and each kernel consists of hundreds of thousands of threads. Using Nvidia’s terminology¹, GPUs feature Streaming Multiprocessors (SMs) that run a massive number of parallel threads, grouped in cooperative thread arrays (CTAs). Threads on each SM are executed at the granularity of a warp, which is essentially a collection of (usually 32) threads running in lockstep.

GPUs continue to boost the number of SMs with each generation to support demanding applications. Whereas the Nvidia Fermi GPU implemented 16 SMs, the latest Nvidia Pascal [13] and Volta GPUs [15] feature 60 and 84 SMs, respectively. Although transistor scaling has slowed dramatically and limits this trend within a single-GPU system [16], the potential deployment of Multi-Module GPUs [24] and multi-socket NUMA-Aware GPUs [86] will further increase the overall SM count. In contrast, bandwidth in the memory subsystem is less scalable and becomes the performance bottleneck, especially for memory-intensive applications. To bridge this gap, prior work has proposed various solutions to improve DRAM bandwidth [30, 31, 94].

For instance, recent GPUs adopt on-package stacked DRAM [13] to provide high bandwidth. These stacked memories, such as High Bandwidth Memory (HBM) [55], allow the processor and memory to communicate via short links within a package to improve performance and save energy. Other solutions include fine-grained DRAM (FGDRAM) [94], which partitions the DRAM die into many independent units, or subchannel DRAM architectures [31].

¹ We use Nvidia’s terminology without loss of generality. GPUs designed by other companies such as AMD and Intel use a similar organization.

There are also architectural optimizations on the SM side with respect to register utilization [18, 56, 65, 69, 71] and warp scheduling [43, 57, 59, 90, 99, 100]. State-of-the-art GPU register files are generally over-provisioned and statically allocated to meet peak performance targets, while utilization remains low [18, 71, 88, 113]. Based on this observation, some techniques propose to manage registers dynamically based on lifetime analysis to save energy [56, 65, 69]. Because the warp is the basic scheduling unit, warp scheduling policies have been studied in depth. These studies aim either to avoid cache thrashing [57, 59, 99, 100, 102] or to mitigate control flow divergence [43, 90].

Unfortunately, few proposals focus on memory contention in the network-on-chip (NoC) [28, 66, 128] and last-level cache (LLC) [126, 127], which warrants further exploration.

1.2 GPU-Application Diversity

Thanks to their high computation power, GPUs have become a popular choice for high-performance computing (HPC) systems [7], machine learning [17, 58, 72, 104] and data analytics applications in large-scale cloud installations and personal computing devices [20, 44]. According to recent reports [3, 7], many of the world’s leading supercomputer systems use Nvidia Tesla accelerators, including Titan, Lomonosov, Piz Daint, etc. Application diversity leads to different execution behavior across GPU workloads, which makes it challenging to design a GPU that suits all of these application characteristics well. In this thesis, we observe two characteristics of GPU-applications:

Inter-CTA data locality: A CTA is a group of threads that cooperate with each other by synchronizing their execution. Traditionally, intra-CTA, especially intra-warp locality, is the most common and obvious form of data locality present in GPU-compute applications. To exploit this characteristic, a memory coalescing unit merges multiple memory accesses to the same cache line within the same warp before sending the request to the L1 cache [49]. In contrast, we observe a high degree of inter-CTA locality in many GPU-compute applications, as different CTAs access the same cache line or access the same read-only data.

Memory divergence: Several contemporary GPU applications differ from traditional GPU-compute workloads as they put a much larger strain on the memory system. More specifically, they are memory-intensive and memory-divergent. These applications typically have strided or data-dependent access patterns, which cause the accesses of the concurrently executing threads to be divergent as loads from different threads access different cache lines. In this thesis, we observe that memory-divergent (MD) applications are prevalent among a couple of benchmark suites. The poor spatial locality of these applications leads to frequent L1 cache blocking because they issue more concurrent cache misses than the cache can support, which cripples the GPU’s ability to use Thread-Level Parallelism (TLP) to hide memory access latencies.

1.3 GPU Performance Modeling

Analyzing and optimizing a GPU architecture for the broad diversity of modern-day GPU-compute applications is challenging. Simulation is arguably the most commonly used evaluation tool as it enables detailed, even cycle-accurate, analysis. However, simulation is excruciatingly slow and parameter sweeps commonly require thousands of CPU hours. In addition, simulators need to be modified to support new generations of GPU architectures and emerging GPU-applications [63, 76, 98]. An alternative approach is modeling, which captures the key performance-related behavior of the architecture in a set of mathematical equations that are much faster to evaluate than simulation.

Modeling can be broadly classified into machine-learning (ML) based modeling and analytical modeling. ML-based modeling [40, 81, 118] requires offline training to infer or learn a performance model. A major limitation of ML-based modeling is that a large number (typically thousands) of training examples are needed to infer a performance model. These training examples are obtained through detailed simulation, which leads to a substantial one-time cost. Moreover, extracting insight from an ML-based performance model is not straightforward, i.e., an ML-based model is a black box. Analytical modeling [26, 50, 51, 103, 115, 124], on the other hand, derives a performance model from a fundamental understanding of the underlying architecture, driven by first principles. Analytical models provide deep insight, i.e., the model is a white box, and the one-time cost is small once the model has been developed. The latter is extremely important when exploring large design spaces, making analytical models ideally suited for fast, early-stage architecture exploration.

There are also some special-purpose models for runtime optimization [36, 91] based on hardware performance counters. Unfortunately, they are not suited for design space exploration.

1.4 Motivation

In this thesis, we focus on targeting two dominant challenges related to modern GPU architectures and their emerging applications:

Challenge #1: Reducing NoC pressure in clustered GPUs. Graphics Processing Units (GPUs) are popular in modern computing systems because they provide high performance for a wide class of general-purpose applications. In particular, the SMs feature private L1 caches and are connected to the L2 cache and memory controllers (MCs) through a Network-on-Chip (NoC). The trend of increasing SM counts poses a scalability challenge for the NoC. Typically, a crossbar is deployed due to its low latency and high bandwidth [13]. However, a crossbar NoC faces scalability issues as hardware costs increase quadratically with increasing port count. To address the GPU NoC scalability challenge, a cluster structure is implemented in modern-day GPUs to group several SMs into a cluster. For example, Pascal supports 6 clusters, with each cluster consisting of 10 SMs [13]; Volta features 14 SMs per cluster for the same number of clusters [15]. By sharing NoC ports among SMs in a cluster, the total number of ports to the network is reduced and so is the overall hardware cost of the crossbar NoC. Previous research has shown that NoC congestion is a severe GPU performance bottleneck for many memory-intensive applications [28, 66, 128]. Unfortunately, clustered GPUs further exacerbate this performance issue. By sharing ports among SMs in a cluster, congestion significantly increases as SMs need to compete with each other in a cluster for network bandwidth. This creates a new and critical performance challenge for the NoC in clustered GPU organizations. In this thesis, we focus on how to efficiently reduce the NoC pressure in a clustered GPU. Fortunately, inter-CTA locality in GPU-applications provides a potential solution. However, several problems still need to be solved. First, how can inter-CTA locality be mapped to locality in a cluster? Second, how can the redundant NoC traffic be eliminated efficiently?

Challenge #2: Modeling memory divergence. Quantitative evaluation is an essential part of the computer architect’s tool box. Analytical models are roughly two orders of magnitude faster than simulation, which makes them ideally suited for early-stage architectural exploration [51] and for helping programmers understand application performance characteristics [50, 124]. However, emerging GPU-applications and their distinct characteristics have made GPU analytical modeling more complicated. In particular, memory divergence is prevalent among emerging GPU applications. Unfortunately, prior GPU analytical models incur large inaccuracies: the state-of-the-art GPUMech [51] model incurs an average performance error of 298% for a broad set of memory-divergent (MD) applications. The key problem comes from the distinct performance behavior of MD-applications compared to well-understood Non-Memory-Divergent (NMD) applications. Modeling memory divergence is challenging. First, we need to identify the key performance-related behavior of MD-applications. Second, modeling the behavior through a set of equations is complex, which imposes a trade-off between simplicity and accuracy. Third, the modeling overhead must be kept low to ensure the model’s practicality.

1.5 Thesis Contributions

This thesis provides three major contributions.

Contribution #1: Intra-Cluster Coalescing to Reduce NoC Pressure

We make the observation that many GPU-compute applications exhibit inter-CTA locality, as different CTAs access the same cache line or access the same read-only data. For clustered GPUs, this implies that memory requests from CTAs executing on the same cluster will access the same cache lines. According to our experimental results, we find that on average 19% (and up to 48%) of all L1 misses originating from a cluster indeed access the same cache lines. These memory requests are redundant and can be eliminated.

In response, we propose intra-cluster coalescing (ICC) to exploit coalescing opportunities across SMs within a cluster. ICC reduces GPU NoC pressure by coalescing memory requests from different SMs in a cluster to the same L2 cache line. In particular, ICC records the memory requests sent to the NoC by all SMs in a cluster, and when subsequent memory requests from other SMs in the cluster access the same cache lines as an outstanding request, ICC coalesces them. By doing so, ICC significantly reduces NoC traffic. To extend the opportunity for coalescing beyond the time window during which a memory request is outstanding, we complement ICC with a coalesced cache (CC) to keep track of recently coalesced cache lines. L1 cache misses trigger an access to the CC, which in case of a hit, further reduces NoC traffic.

Our experiments show that ICC along with CC leads to an average 15% (and up to 69%) performance improvement while at the same time reducing system energy by 5.3% on average (and up to 16.7%), and EDP by 12% on average (and up to 30%).
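The coalescing idea described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the hardware design evaluated in this thesis: the class name `ClusterCoalescer`, the 128-byte line size, the four-entry LRU coalesced cache, and the rule that only lines requested by more than one SM enter the CC are all illustrative choices.

```python
from collections import OrderedDict

LINE_SIZE = 128  # bytes per cache line (assumed for illustration)


class ClusterCoalescer:
    """Toy model of one cluster's ICC table plus a small LRU coalesced cache."""

    def __init__(self, cc_entries=4):
        self.outstanding = {}      # line index -> set of SM ids waiting on it
        self.cc = OrderedDict()    # recently coalesced lines, in LRU order
        self.cc_entries = cc_entries
        self.noc_requests = 0      # requests actually injected into the NoC

    def l1_miss(self, sm_id, address):
        """Handle an L1 miss; return 'cc_hit', 'coalesced' or 'sent'."""
        line = address // LINE_SIZE
        if line in self.cc:                # served by the coalesced cache
            self.cc.move_to_end(line)
            return "cc_hit"
        if line in self.outstanding:       # same line already in flight
            self.outstanding[line].add(sm_id)
            return "coalesced"
        self.outstanding[line] = {sm_id}   # first request: goes to the NoC
        self.noc_requests += 1
        return "sent"

    def response(self, line):
        """A reply arrived from the NoC; return the SMs that receive it."""
        waiters = self.outstanding.pop(line)
        if len(waiters) > 1:               # assumption: only coalesced lines enter the CC
            self.cc[line] = True
            if len(self.cc) > self.cc_entries:
                self.cc.popitem(last=False)  # evict the least recently used line
        return sorted(waiters)


icc = ClusterCoalescer()
print(icc.l1_miss(0, 0x1000))              # sent
print(icc.l1_miss(1, 0x1040))              # coalesced (same 128-byte line)
print(icc.response(0x1000 // LINE_SIZE))   # [0, 1]
print(icc.l1_miss(2, 0x1000))              # cc_hit
```

In this example three L1 misses from three SMs result in a single NoC request, which is the traffic reduction that ICC and CC target.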

Contribution #2: Distributed-Block CTA Scheduling

Because intra-cluster locality stems from inter-CTA locality, we demonstrate the significant interaction between ICC and the CTA scheduling policy. We find that ICC benefits more when the CTA scheduling policy maps neighboring CTAs to the same cluster to better exploit inter-CTA locality. Motivated by this observation, we propose distributed-block scheduling. Distributed-block scheduling is a two-level locality-aware CTA scheduling policy that first evenly distributes consecutive CTAs across clusters, and subsequently schedules pairs of consecutive CTAs per SM to maximize L1 cache locality and L1 MSHR coalescing opportunity. In contrast to prior work in CTA scheduling, distributed-block scheduling exploits cache locality at both the cluster level and the SM level. Inter-CTA locality is exploited to improve L1 cache performance (decreasing the reuse distance between accesses to the same memory location, thereby increasing L1 cache hit rate) and to increase the coalescing opportunities in the L1 miss status handling registers (MSHRs).

Using execution-driven GPU simulation, we find that distributed-block scheduling improves GPU performance by 4% (and up to 16%) while at the same time reducing system energy and EDP by 1.2% and 5%, respectively, compared to the state-of-the-art distributed scheduling policy. In addition, distributed-block scheduling works synergistically with ICC and CC to improve performance by 16% (and up to 67%) and reduce system energy by 6% (and up to 18%).

These two contributions are published in:

L. Wang, X. Zhao, D. Kaeli, Z. Wang and L. Eeckhout. Intra-Cluster Coalescing to Reduce GPU NoC Pressure. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS). pp. 990-999, May 2018

An extended version is published in:

L. Wang, X. Zhao, D. Kaeli, Z. Wang and L. Eeckhout. Intra-Cluster Coalescing and Distributed-Block Scheduling to Reduce GPU NoC Pressure. In IEEE Transactions on Computers, Vol. 68, No. 7, pp. 1064-1076, July 2019

Contribution #3: Memory Divergence Model (MDM)

In this work, we focus on analytically modeling the performance of emerging memory-divergent GPU-compute applications which are common in domains such as machine learning and data analytics. The poor spatial locality of these applications leads to frequent L1 cache blocking due to the application issuing significantly more concurrent cache misses than the cache can support, which cripples the GPU’s ability to use Thread-Level Parallelism (TLP) to hide memory latencies. Based on this observation, we propose the Memory Divergence Model (MDM), which faithfully models the batching behavior and NoC/DRAM queueing delays observed in MD-applications.

MDM significantly improves the performance prediction accuracy (by 16.5× on average) compared to the state-of-the-art GPUMech [51] for MD-applications. At the same time, MDM is as accurate as GPUMech for NMD-applications. Across a set of MD and NMD-applications, we report an average prediction error of 13.9% for MDM compared to detailed simulation (versus 162% for GPUMech). Moreover, we demonstrate high accuracy across a broad design space in which we vary the number of MSHRs, NoC and DRAM bandwidth, as well as SM count. Furthermore, we validate MDM against real hardware, for which we rely on binary instrumentation to collect the model inputs as opposed to functional simulation as done in prior work. By doing so, we improve both model evaluation speed (by 6.1×) and accuracy (average prediction error of 40% for MDM versus 164% for GPUMech).

In addition, we perform three case studies to demonstrate the utility of the MDM model. First, we show that MDM is useful and accurate compared to detailed simulation for early design space exploration of GPU architectures when varying the number of SMs, MSHRs, NoC and DRAM bandwidth. Second, we find that the MDM model is as accurate as the special-purpose CRISP [91] model in predicting the performance impact of DVFS. Finally, we validate the observations of the MDM model by reporting CPI components.

Overall, the MDM model advances the state-of-the-art in GPU analytical performance modeling by expanding its scope, by improving its practicality, and by enhancing its accuracy.

This work is published in:

L. Wang, M. Jahre, A. Adileh, Z. Wang, and L. Eeckhout. Modeling Emerging Highly Memory-Divergent GPU Applications. In IEEE Computer Architecture Letters, Vol. 18, No. 2, pp. 95-98, June 2019.

An extended version of this work is published in:

L. Wang, M. Jahre, A. Adileh, and L. Eeckhout. MDM: The GPU Memory Divergence Model. In the 53rd International Symposium on Microarchitecture (MICRO), October 2020

1.6 Thesis Organization

The remainder of the thesis is organized as follows.

Chapter 2 introduces the necessary background regarding GPU architecture and the state-of-the-art GPU performance analysis tools to better understand the following chapters.

Chapter 3 observes the intra-cluster locality (ICL) in a broad range of GPU applications and exploits this characteristic to reduce NoC pressure in clustered GPUs by proposing intra-cluster coalescing (ICC) and the coalesced cache (CC) unit.

Chapter 4 analyzes the interaction between ICC and the CTA scheduling policy, and proposes distributed-block CTA scheduling which captures the data locality at both the SM and cluster levels.

Chapter 5 focuses on memory divergence in emerging GPU applications and proposes the MDM model which advances the state-of-the-art GPUMech [51] model by expanding its scope, improving its accuracy and enhancing its practicality.

Finally, in Chapter 6, we conclude the thesis and discuss potential avenues for future work.


Chapter 2

Background

This chapter first provides an overview of the state-of-the-art in GPU computer architecture fundamentals, including clustered GPU architectures and GPU hardware schedulers. These are relevant to Chapters 3 and 4. We then provide background for the MDM model presented in Chapter 5, including the concept of memory divergence, state-of-the-art GPU performance modeling, and analytical interval-based performance modeling on which the MDM model builds.

2.1 Basics of GPU Architecture

2.1.1 GPU Thread Hierarchy

Using Nvidia’s terminology, a GPU-compute application consists of kernels, grids, thread blocks (TBs) or cooperative thread arrays (CTAs), warps and threads, organized in a hierarchy (see Figure 2.1). A kernel is a parallel code region that runs on a GPU; it is executed as a grid, which in turn consists of multiple CTAs. Each CTA is a batch of threads that can coordinate with each other through synchronization using a barrier instruction [62]. Threads in a CTA share a fast, on-chip scratchpad memory called shared memory. Since all the synchronization primitives are encapsulated within a CTA, different CTAs can be executed in any order. This is an important feature that we will explore in this work to understand how the mapping of CTAs to clusters affects intra-cluster locality. CTAs are typically organized in a 1D, 2D or 3D structure¹. In particular, the 3D variable (blockIdx.x, blockIdx.y, blockIdx.z) distinguishes different CTAs.

¹ The most common case is a 2D structure which uses the row and the column to index one CTA.

Figure 2.1: GPU thread hierarchy: a GPU kernel is executed as a grid of thread blocks. A thread block is a batch of threads that can cooperate with each other.

Figure 2.2: SM architecture: each SM includes a large register file, several caches, and 32 CUDA cores executing threads in a SIMT manner.

[Figure 2.3 diagram: clusters #1 through #12, each grouping SMs (register file, ALUs, warp scheduler, and a memory subsystem with shared memory, L1 cache, texture cache and constant cache) behind an injection buffer and a response buffer, connect through a crossbar network to L2 cache banks with memory controllers MC #1 through MC #8.]

Figure 2.3: Clustered GPU architecture: SMs within a cluster go through the NoC to access the L2 cache and main memory to serve L1 cache misses.

2.1.2 Streaming Multiprocessor

The Streaming Multiprocessor (SM) is the basic computation unit in a GPU. An SM executes up to thousands of threads, organized as warps, in a single-instruction multiple-thread (SIMT) manner. Each warp consists of a couple dozen threads, e.g., 32 threads. Different warps within a CTA can synchronize through a barrier and communicate through shared memory. Figure 2.2 shows a diagram of an SM [1]. Each SM includes thousands of registers that can be partitioned among threads of execution, several caches such as scratchpad memory and L1 cache, warp schedulers and execution cores. Warp schedulers can quickly switch contexts between threads and issue instructions from ready warps. There are different warp scheduling policies such as loose round-robin (LRR) [99] and greedy-then-oldest (GTO) [99]. The LRR scheduling policy schedules warps in a round-robin way, while the GTO scheduling policy schedules the same warp until it stalls and then picks the oldest one. In particular, a warp mainly stalls due to a RAW hazard (e.g., waiting on a long-latency memory access) or an LD/ST unit stall (e.g., no free MSHR entries or memory congestion). However, high thread-level parallelism (TLP) can usually hide these stalls efficiently. A CUDA core is the execution unit for floating-point and integer operations while the LD/ST unit is used to process memory instructions.
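The difference between the two warp scheduling policies can be made concrete with a small Python sketch. It is a simplification under stated assumptions: the function names, the representation of ready warps as a set of warp ids, and the `age` dictionary (smaller value means older warp) are hypothetical and not taken from any simulator.

```python
def lrr_pick(ready, last, num_warps):
    """Loose round-robin: next ready warp after the one issued last."""
    for step in range(1, num_warps + 1):
        warp = (last + step) % num_warps
        if warp in ready:
            return warp
    return None  # no warp is ready this cycle


def gto_pick(ready, current, age):
    """Greedy-then-oldest: stay on the current warp, else take the oldest ready one."""
    if current in ready:
        return current
    if not ready:
        return None
    return min(ready, key=lambda warp: age[warp])  # smaller age value = older warp


print(lrr_pick({0, 2}, last=0, num_warps=4))            # 2
print(gto_pick({0, 2}, current=2, age={0: 0, 2: 5}))    # 2 (greedy: keep issuing warp 2)
print(gto_pick({0, 3}, current=2, age={0: 7, 3: 1}))    # 3 (warp 2 stalled, warp 3 is oldest)
```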

Figure 2.4: CTA scheduling policies: A 2-level round-robin scheduling policy allocates neighboring CTAs to different clusters while a distributed scheduling policy maps neighboring CTAs to the same cluster.

2.1.3 Clustered GPU Architecture

To address the GPU NoC scalability challenge, a cluster structure is implemented in modern-day GPUs to group several SMs into a cluster. For example, Nvidia Pascal [13] and Volta [15] GPUs support 6 clusters, with each cluster consisting of 10 SMs and 14 SMs, respectively. The baseline clustered GPU architecture in this thesis is shown in Figure 2.3: 12 clusters are connected via a crossbar NoC to 8 memory controllers (MCs). Each MC has an associated L2 cache bank for the memory partition that the MC serves, and has one network port. Each cluster consists of 5 SMs, so there are 60 SMs in total. Each SM has a private L1 data cache, a read-only texture cache, a constant cache and shared memory. An L1 cache miss triggers a request to be sent over the NoC to reach one of the L2 cache banks; in case of an L2 cache miss, the request proceeds to main memory. In our baseline architecture, we assume one NoC injection port buffer that is shared by all SMs in a cluster. (We will study the sensitivity of our proposed design to the number of clusters and the effective network ports per SM in the evaluation section of Chapter 3.) Each cluster has a response FIFO queue to hold incoming packets from the NoC; responses are directed to one of the SMs in the cluster according to the control information in the packet.


2.1.4 CTA Scheduling

Scheduling on a GPU is done in three steps. First, a kernel is launched on the GPU. In this thesis, we assume that only one kernel is active at a given time. Second, the CTA scheduler maps CTAs to the available SMs. The scheduling usually follows a round-robin (RR) policy to balance the load among different SMs in traditional GPUs (without cluster structure). As shown in Figure 2.4(a), CTA 1 is allocated to SM #1, CTA 2 is allocated to SM #2, and so on. The maximum number of CTAs that can be scheduled per SM is determined by the SM’s resources. Finally, the warp scheduler in each SM schedules warps (from one or more CTAs) to execute, which we model to follow the Greedy-Then-Oldest (GTO) policy [99].

CTA Scheduling for Clustered GPUs

CTA scheduling policies for clustered GPUs can be different. The default CTA scheduler follows a 2-level round-robin (RR) policy [80], which first schedules CTAs across clusters and then across SMs within a cluster. In particular, CTA 1 is allocated to SM #1 in cluster #1, CTA 2 is allocated to SM #3 in cluster #2, and so on (see Figure 2.4(b)). Once all clusters are assigned one CTA, the next iteration allocates a CTA to the second SM in each cluster, etc., until all SMs are assigned one CTA. For example, CTA 3 is allocated to SM #2 in cluster #1 and CTA 4 is allocated to SM #4 in cluster #2. If an SM has enough resources to execute more than one CTA, additional CTAs are assigned in a round-robin manner similar to the procedure just described. For example, CTA 5 is allocated to SM #1 in cluster #1, etc. By doing so, a two-level RR policy balances the load among clusters and SMs, so that all clusters and SMs have a similar number of CTAs to execute.

As proposed in MCM-GPU [24], a state-of-the-art distributed CTA scheduling policy, or distributed scheduling for short, evenly maps a block of neighboring CTAs to the same cluster to exploit the locality benefit. As shown in Figure 2.4(c), assuming there are 6 CTAs in total, CTAs 1 through 3 are assigned to cluster #1, and CTAs 4 through 6 are assigned to cluster #2. Then CTAs mapped to the same cluster are allocated in a round-robin way. For instance, CTAs 1 and 2 are allocated to SM #1 and SM #2 in cluster #1, respectively. In particular, we assume distributed scheduling as the baseline policy in Chapters 3 and 4.
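The two policies for clustered GPUs can be compared with a short Python sketch that reproduces the examples given in the text (2 clusters with 2 SMs each). It is a static simplification: a real CTA scheduler assigns CTAs dynamically as SM resources become free, and the function names and the global 1-based SM numbering are assumptions made for illustration.

```python
def two_level_rr(num_ctas, num_clusters, sms_per_cluster):
    """2-level round-robin: rotate over clusters first, then over SMs in a cluster."""
    mapping = {}
    for i in range(num_ctas):
        cluster = i % num_clusters
        local_sm = (i // num_clusters) % sms_per_cluster
        # 1-based ids; SMs are numbered globally across clusters
        mapping[i + 1] = (cluster + 1, cluster * sms_per_cluster + local_sm + 1)
    return mapping


def distributed(num_ctas, num_clusters, sms_per_cluster):
    """Distributed scheduling: a block of neighboring CTAs goes to the same cluster."""
    block = -(-num_ctas // num_clusters)  # ceiling division: CTAs per cluster
    mapping = {}
    for i in range(num_ctas):
        cluster = i // block
        local_sm = (i % block) % sms_per_cluster  # round-robin inside the cluster
        mapping[i + 1] = (cluster + 1, cluster * sms_per_cluster + local_sm + 1)
    return mapping


# CTA id -> (cluster id, SM id)
print(two_level_rr(5, 2, 2))
# {1: (1, 1), 2: (2, 3), 3: (1, 2), 4: (2, 4), 5: (1, 1)}
print(distributed(6, 2, 2))
# {1: (1, 1), 2: (1, 2), 3: (1, 1), 4: (2, 3), 5: (2, 4), 6: (2, 3)}
```

Under 2-level round-robin, neighboring CTAs 1 and 2 land in different clusters, whereas distributed scheduling keeps CTAs 1 through 3 in cluster #1, which is what allows their shared cache lines to be exploited within a cluster.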

2.2 Memory Divergence and Coalescing

Memory Access Coalescing Unit: In GPUs, threads within a warp execute instructions in lockstep. For a global memory instruction, the 32 threads of a warp together issue 32 memory accesses. The Memory Access Coalescing Unit (MACU) coalesces these accesses into several cache line-sized requests before accessing the L1D cache to reduce the number of requests and thus increase the effective bandwidth utilization. For memory requests with perfect spatial locality, threads within a warp would access 32 consecutive words and thus only one memory access to L1D will be generated. A memory-divergent access, however, will generate several memory requests to L1D after coalescing due to poor spatial locality. If memory requests in a warp are divergent, the warp cannot proceed until all memory transactions are handled, which takes significantly longer than waiting for only one memory request. In this thesis, we define an application as memory-divergent if it features more than 10 Divergent loads Per Kilo Instructions (DPKI).
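The effect of coalescing can be illustrated with a minimal Python sketch. The 128-byte line size and 4-byte word size are assumptions chosen so that 32 consecutive words fill exactly one line, and counting a load as divergent when it produces more than one request is likewise an assumption of this sketch rather than a definition from the text.

```python
LINE_SIZE = 128   # bytes per L1D cache line (assumed)
WORD_SIZE = 4     # bytes accessed per thread (assumed)
WARP_SIZE = 32


def coalesce(addresses, line_size=LINE_SIZE):
    """Return the sorted cache-line base addresses touched by one warp-level load."""
    return sorted({addr // line_size * line_size for addr in addresses})


def dpki(divergent_loads, instructions):
    """Divergent loads Per Kilo Instructions."""
    return 1000.0 * divergent_loads / instructions


unit_stride = [WORD_SIZE * t for t in range(WARP_SIZE)]   # 32 consecutive words
line_stride = [LINE_SIZE * t for t in range(WARP_SIZE)]   # one line per thread

print(len(coalesce(unit_stride)))   # 1  (perfectly coalesced)
print(len(coalesce(line_stride)))   # 32 (fully divergent)
print(dpki(25, 2000) > 10)          # True: 12.5 DPKI exceeds the threshold of 10
```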

Miss Status Handling Register: For misses in L1D, the corresponding requests are sent to the next lower cache level in the memory hierarchy. In particular, the Miss Status Handling Registers (MSHRs) are used to track in-flight memory requests and merge duplicate requests to the same cache line. After MSHR allocation, a memory request is buffered into the NoC port for transfer. An MSHR entry is released once its corresponding memory request has returned and all accesses to that block are serviced. Memory-divergent requests tend to access several cache lines and may thus lead to the GPU core running out of MSHR entries quickly. In Chapter 5, we will further investigate the relationship between memory divergence and MSHR blocking.
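A minimal Python sketch of this allocate, merge and release behavior is given below. It is illustrative only: the class name `MSHRFile` and its methods are hypothetical, and the sketch ignores details of real hardware such as a limit on how many requests can merge into one entry.

```python
class MSHRFile:
    """Toy MSHR file: one entry per in-flight cache line."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}  # cache line -> list of warp ids waiting on it

    def access(self, line, warp_id):
        """Register an L1D miss; return 'merged', 'allocated' or 'blocked'."""
        if line in self.entries:                  # duplicate request: merge
            self.entries[line].append(warp_id)
            return "merged"
        if len(self.entries) == self.num_entries:  # no free entry: cache blocks
            return "blocked"
        self.entries[line] = [warp_id]
        return "allocated"

    def fill(self, line):
        """The line returned from memory; release the entry and wake the waiters."""
        return self.entries.pop(line)


mshrs = MSHRFile(num_entries=4)
# One divergent load from warp 0 that touches 8 different cache lines.
print([mshrs.access(line, 0) for line in range(8)])
# ['allocated', 'allocated', 'allocated', 'allocated',
#  'blocked', 'blocked', 'blocked', 'blocked']
print(mshrs.access(0, 1))   # merged
print(mshrs.fill(0))        # [0, 1]
```

A single divergent load exhausts the four entries, so the remaining requests must wait until earlier ones return. This is the MSHR blocking referred to above.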

2.3 GPU Performance Modeling

2.3.1 GPU Performance Analysis Tools

Over the years, Graphics Processing Units (GPUs) have been used as accelerators to perform general-purpose computations. Continuously evolving GPU architectures and emerging GPU applications [6, 29] have made architecture exploration and performance analysis more complicated. This drives the need to develop fast and accurate performance evaluation tools for a broad class of GPU applications and modern-day GPU architectures.

With the advent of GPU computing, GPU manufacturers have developed profiling and debugging tools such as Nvidia’s Visual Profiler [12]. These tools use performance counters to profile various aspects of the program execution. They are easy to use and run at native hardware speed. Unfortunately, the production-quality profilers are not flexible enough for computer architects to conduct studies involving micro-architecture innovations and design space explorations.

As one solution, architects, system designers and application developers turn towards cycle-accurate simulators such as GPGPU-Sim [27] for accurate performance analysis. Simulators are flexible, and allow architects to measure fine-grained details of execution. However, simulation is quite time-consuming, especially for the full execution of real-world applications. This forces researchers and architects to use trimmed-down input data sets or to intelligently choose sections of the full program [95] so that their experiments finish in a reasonable amount of time. Unfortunately, there is no guarantee that simulation-sized inputs are representative of real workloads [106]. In addition, state-of-the-art simulators need to be modified to model and support new-generation GPU architectures [13] and emerging GPU applications [6, 29].

Figure 2.5: Interval analysis: An interval is defined as a sequence of instructions at the maximum issue rate followed by stall cycles.

An alternative approach is analytical modeling, which captures the key performance-related behavior of an architecture in a set of mathematical equations. Analytical models are much faster than simulation, which makes them ideally suited for early-stage architectural exploration [51] and for helping programmers understand application performance [50, 51]. Hong and Kim [50] propose a model that estimates performance based on concurrent computation as well as memory requests. Baghsorkhi et al. [26] propose a work-flow graph (WFG) based analytical model to predict performance. However, these models are built on static code analysis. GPUMech [51] addresses this problem by integrating a trace-driven functional simulator with an analytical model based on interval analysis and therefore provides higher accuracy than traditional analytical models. Machine-learning based models have also been proposed in several works [40, 118]. Unfortunately, they are black-box approaches, which makes it difficult to extract deep insight.

2.3.2 Interval-Based Analytical Modeling

Interval Analysis

The foundation of interval analysis was proposed by Karkhanis et al. [60] and Eyerman et al. [41]. The basic idea is that a processor sustains its maximum issue rate unless disruptive miss events, such as cache misses, occur. Performance is then estimated by accounting for the stall cycles that these miss events add on top of execution at the maximum issue rate. Figure 2.5 illustrates interval analysis. An interval is defined as a sequence of instructions at the maximum issue rate followed by stall cycles. Functional simulators are used to detect stall events. Interval analysis was originally proposed for single-thread workloads, and applying it to GPUs is not straightforward due to their highly parallel execution mode [51].

Generating Interval Profiles at the Warp Level

The starting point of applying interval analysis to model GPUs is to generate interval profiles for one warp. Equation 2.1 shows an interval profile of a warp. Particularly, each interval includes the number of instructions and the number of stall cycles.

$$\text{interval} = [\#\text{interval\_inst}_i,\ \text{stall\_cycles}_i], \quad i \in \text{intervals} \tag{2.1}$$

Latencies of compute instructions are fixed and based on the system’s configuration. On the other hand, the latency of each memory instruction is calculated based on the predicted cache miss rate obtained from a cache simulator as well as the LLC/DRAM access latency. For instance, if one (static) memory instruction hits the L1 cache for 10% of the time, and hits the LLC for 30% of the time (among all LLC accesses), the latency equals $(1 - 0.1) \times 0.3 \times 120 + (1 - 0.1) \times (1 - 0.3) \times 220 = 171$ cycles, assuming the access latencies for the LLC and DRAM equal 120 cycles and 220 cycles, respectively. In particular, since all warps execute the same memory instruction, the cache miss rate of one memory instruction (i.e., one PC or one static instruction) is calculated by counting the miss events of all executions across all warps.
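The latency calculation above can be reproduced with a short sketch. The function and parameter names are hypothetical, and the per-PC aggregation assumes a simple list of (PC, hit) records gathered over all warps, which is an illustrative format rather than the output of any particular cache simulator. As in the worked example, L1 hits contribute no latency term.

```python
from collections import defaultdict


def expected_memory_latency(l1_hit_rate, llc_hit_rate,
                            llc_latency, dram_latency):
    """Expected latency of one static memory instruction.
    llc_hit_rate is measured among LLC accesses, i.e. among L1 misses."""
    l1_miss_rate = 1.0 - l1_hit_rate
    return (l1_miss_rate * llc_hit_rate * llc_latency
            + l1_miss_rate * (1.0 - llc_hit_rate) * dram_latency)


def hit_rate_per_pc(accesses):
    """Aggregate hit rates per static instruction (PC) over the
    executions of all warps. accesses: iterable of (pc, hit) pairs."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for pc, hit in accesses:
        total[pc] += 1
        hits[pc] += int(hit)
    return {pc: hits[pc] / total[pc] for pc in total}


# Values from the worked example in the text.
latency = expected_memory_latency(0.1, 0.3, 120, 220)
print(round(latency, 6))  # 171.0

# Hypothetical records: the same PC executed by several warps.
records = [(0x40, True), (0x40, False), (0x40, False), (0x40, False)]
print(hit_rate_per_pc(records))  # {64: 0.25}
```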

Figure 2.6 illustrates how to generate intervals for a warp. The done cycle is equal to the issue cycle plus the instruction latency. Instruction 3 (i3) leads to stall cycles because instruction 5 (i5) depends on it. Other instructions can be issued every cycle because there are no dependences. We apply Equation 2.2 to determine the issue cycle of each instruction. An interval is formed if the issue cycle of the current instruction is not equal to the issue cycle of the previous instruction plus one, since this indicates that stall cycles have been incurred between the two instructions. As shown in Figure 2.6, i5 can only issue at cycle 123, which equals the done cycle of i3 (122) plus 1.
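The issue-cycle rule (Equation 2.2) and the interval-forming test described above can be sketched as follows. The instruction latencies are hypothetical, chosen only so that i3 completes at cycle 122 and i5 issues at cycle 123 as in the text; starting the first instruction at cycle 0 and allowing several source instructions per instruction are also assumptions of this sketch.

```python
def compute_issue_cycles(insts):
    """insts: list of (latency, deps), where deps holds the indices of
    earlier instructions that produce this instruction's sources."""
    issue, done = [], []
    for k, (latency, deps) in enumerate(insts):
        # In-order issue: one cycle after the previous instruction.
        earliest = issue[k - 1] + 1 if k > 0 else 0
        # A dependent instruction waits for its sources to complete.
        for d in deps:
            earliest = max(earliest, done[d] + 1)
        issue.append(earliest)
        done.append(earliest + latency)
    return issue, done


def build_intervals(issue):
    """Split the instruction stream into (num_insts, stall_cycles) pairs.
    A gap between consecutive issue cycles closes the current interval."""
    intervals = []
    count = 1
    for k in range(1, len(issue)):
        gap = issue[k] - (issue[k - 1] + 1)
        if gap > 0:
            intervals.append((count, gap))
            count = 0
        count += 1
    intervals.append((count, 0))  # trailing instructions, no stall
    return intervals


# i1..i5; i3 is a 120-cycle memory access and i5 depends on i3 (index 2).
warp = [(4, []), (4, []), (120, []), (4, []), (4, [2])]
issue, done = compute_issue_cycles(warp)
print(issue)                   # [0, 1, 2, 3, 123]
print(done[2])                 # 122
print(build_intervals(issue))  # [(4, 119), (1, 0)]
```

The first interval holds the four instructions issued back to back and the 119 stall cycles spent waiting for i3; i5 opens the next interval.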

$$\text{issue\_cycle}(\text{inst}_{k+1}) = \max\{\text{issue\_cycle}(\text{inst}_k) + 1,\ \text{done\_cycle}(\text{source\_inst}_{k+1}) + 1\} \tag{2.2}$$

Trace Collection

A functional simulator is often used to collect the trace for interval-based analytical performance modeling. Applications are functionally executed to capture the dynamic basic block sequence for every warp as well as the addresses accessed by each thread. GPGPU-sim [27] is widely used for functional execution. However, it operates at the intermediate PTX representation [4] and is time-consuming, especially for long-running GPU applications.

Figure 2.6: Generating interval profiles through instruction dependence analysis.

Traces can also be collected through instrumentation on real GPUs. In contrast to a functional simulator, instrumentation tools can profile kernels at native execution speed. SASSI [106] is a compile-time instrumentation tool that operates directly at the native instruction level [5], leveraging Nvidia’s production back-end compiler. SASSI enables software-based, selective instrumentation of GPU applications. However, compile-time instrumentation tools have several limitations: (1) they cannot operate on GPU driver code, and (2) they cannot target pre-compiled libraries because the source code is not available.

NVBit [114] is a fast, dynamic and portable binary instrumentation framework targeting Nvidia GPUs. By working directly at the native instruction level, NVBit can faithfully instrument applications that have been produced by users with NVCC, via JIT compilation of PTX [4], or through the inclusion of pre-compiled shared libraries such as cuDNN [17] and cuBLAS [8]. These optimized libraries are widely used in machine learning workloads [34, 58]. This makes NVBit significantly more broadly applicable than the compile-time tool SASSI.

NVBit provides a rich set of high-level APIs which enable instruction inspection, callbacks to CUDA driver APIs, and injection of arbitrary CUDA functions into any application before kernel launch. NVBit enables basic-block instrumentation, multi-function injection to the same location, inspection of ISA-visible state, dynamic selection of instrumented or un-instrumented code, permanent modification of register state, correlation with source code, and instruction removal. It supports the Nvidia GPU architecture families of Kepler [10], Maxwell [11], Pascal [13], and Volta [15]. In this thesis, we use NVBit to collect per-warp instruction and memory address traces (see Chapter 5).
