HAL Id: cea-02279573
https://hal-cea.archives-ouvertes.fr/cea-02279573
Submitted on 5 Sep 2019
Unifying distributed memories in a heterogeneous system with reconfigurable accelerators
Erwan Lenormand, Loïc Cudennec, Henri-Pierre Charles
To cite this version:
Erwan Lenormand, Loïc Cudennec, Henri-Pierre Charles. Unifying distributed memories in a heterogeneous system with reconfigurable accelerators. Conférence d'informatique en Parallélisme, Architecture et Système (Compas'2019), Jun 2019, Anglet, France. ⟨cea-02279573⟩
UNIFYING DISTRIBUTED MEMORIES IN A HETEROGENEOUS SYSTEM WITH
RECONFIGURABLE ACCELERATORS

Erwan Lenormand∗†, Loïc Cudennec†, Henri-Pierre Charles‡

∗Univ Paris-Saclay
†CEA, List, F-91191, PC 172, Gif-sur-Yvette, France
‡Univ Grenoble Alpes, CEA, List, F-38000 Grenoble, France
The increasing heterogeneity of computer systems calls for a homogenization of programming models.
Context, Motivation & Objectives
Performance and energy efficiency:
Hardware accelerators integrated in distributed systems, at the node scale, make it possible to increase computing power and to reduce energy consumption.

Programming complexity:
These platforms are more complex to use. The compute kernels of the application are offloaded to accelerator functional units (AFUs) developed with accelerator programming languages. Data must be sliced and transferred to the accelerators.

Objectives:
Provide the same task model for CPUs and accelerators. Allow accelerators to access irregular data structures whose memory locations are not known at compile time.
State of the art
Unifying two separate memory spaces
[Figure: each node (node 0 … node n) pairs a processing unit (CPU or accelerator) with its physical memory; chunks A and B, scattered across node memories, are read and written through the distributed shared memory, which maps them into a single logical memory space 0..N.]

State of the art: unifying CPU memory accesses through a DSM. Our objective: allow the FPGA to access the DSM.
Software-distributed shared memory over a heterogeneous system
Targeted distributed system overview · System software stack · S-DSM semi-structured super-peer topology

[Figure: distributed system nodes connected through an MPI interconnect; on each node, the CPU runs user tasks, an S-DSM server and an S-DSM proxy, while the FPGA logic cells host AFUs, FIFOs and an S-DSM microserver, all accessing the shared memory.]

A CPU task accesses a chunk with:

    #define cid 0
    chunk = MALLOC(cid, size);
    WRITE(chunk, cid);
    foo(chunk);
    RELEASE(cid);

An AFU accesses the same chunk with the same primitives:

    #define cid 0
    READ(chunk, cid);
    foo(chunk);
    RELEASE(cid);

Heterogeneous S-DSM overview
Programming model:
Data are chunked into indivisible units of memory, each identified by a unique number. The user expresses accesses to chunks with mutual exclusion primitives. These primitives encapsulate API code making service requests to the S-DSM. Tasks and AFUs use the same primitives.
API:
The logical organization of the S-DSM follows a client-server model. User tasks correspond to clients. Each client is attached to at least one server. The servers are in charge of data localization and access management. Access requests from the FPGA are managed by a microserver. The microserver communicates with the servers through a proxy process running on the host CPU.
Conclusion & Outlook

Easy to use:
The proposed system makes it possible to hide the topology of distributed memories by expressing data accesses with identifiers common to the whole platform. It provides mutual exclusion primitives, as used in the multi-threaded programming model, to schedule data accesses at run time.
Improvement:
Increase software portability by providing a common programming model for all resources of the system. Avoid copying data between host memory and accelerator memories thanks to the shared memory space, and reduce the memory footprint.
Outlook:
Evaluate this programming model for distributed applications with irregular and non-predictable memory accesses. Specify and implement the system.