HAL Id: cea-02279573
https://hal-cea.archives-ouvertes.fr/cea-02279573
Submitted on 5 Sep 2019
Unifying distributed memories in a heterogeneous system with reconfigurable accelerators
Erwan Lenormand, Loïc Cudennec, Henri-Pierre Charles
To cite this version:
Erwan Lenormand, Loïc Cudennec, Henri-Pierre Charles. Unifying distributed memories in a heterogeneous system with reconfigurable accelerators. Conférence d'informatique en Parallélisme, Architecture et Système (Compas'2019), Jun 2019, Anglet, France. ⟨cea-02279573⟩
UNIFYING DISTRIBUTED MEMORIES IN A HETEROGENEOUS SYSTEM WITH
RECONFIGURABLE ACCELERATORS

Erwan Lenormand∗†, Loïc Cudennec†, Henri-Pierre Charles‡

∗Univ Paris-Saclay
†CEA, List, F-91191, PC 172, Gif-sur-Yvette, France
‡Univ Grenoble Alpes, CEA, List, F-38000 Grenoble, France
The increasing heterogeneity of computer systems calls for a homogenization of programming models.
Context, Motivation & Objectives
Performance and energy efficiency:
Hardware accelerators integrated in distributed systems, at the node scale, make it possible to increase computing power and to reduce energy consumption.

Programming complexity:
These platforms are more complex to use. The compute kernels of the application are offloaded to accelerator functional units (AFUs) developed with accelerator programming languages. Data must be sliced and transferred to the accelerators.

Objectives:
Provide the same task model for CPUs and accelerators. Allow accelerators to access irregular data structures whose memory locations are not known at compile time.
State of the art
Unifying two separate memory spaces
[Figure: each node (node 0 … node n) pairs a processing unit (CPU or accelerator) with its physical memory; chunks A and B, scattered across node memories, are read and written through the distributed shared memory, which maps them into a single logical memory space 0..N.]

State of the art: unifying CPU memory accesses through a DSM. Our objective: allow the FPGA to access the DSM.
Software-distributed shared memory over a heterogeneous system
Targeted distributed system overview · System software stack · S-DSM semi-structured super-peer topology

[Figure: distributed system nodes connected through an MPI interconnect; on each node, the CPU runs user tasks, an S-DSM server and an S-DSM proxy, while the FPGA logic cells host AFUs, FIFOs and an S-DSM microserver, all accessing the shared memory.]

A CPU task accesses a chunk with:

    #define cid 0
    chunk = MALLOC(cid, size);
    WRITE(chunk, cid);
    foo(chunk);
    RELEASE(cid);

An AFU accesses the same chunk with the same primitives:

    #define cid 0
    READ(chunk, cid);
    foo(chunk);
    RELEASE(cid);

Heterogeneous S-DSM overview
Programming model:
Data are chunked into indivisible units of memory, each identified by a unique number. The user expresses accesses to chunks with mutual exclusion primitives. These primitives encapsulate API code making service requests to the S-DSM. Tasks and AFUs use the same primitives.
API:
The logical organization of the S-DSM follows a client-server model. User tasks correspond to clients. Each client is attached to at least one server. The servers are in charge of data localization and access management. Access requests from the FPGA are managed by a microserver. The microserver communicates with the servers through a proxy process running on the host CPU.
Conclusion & Outlook

Easy to use:
The proposed system makes it possible to hide the topology of distributed memories by expressing data accesses with identifiers common to the whole platform. It provides mutual exclusion primitives, as used in the multi-threaded programming model, to schedule data accesses at run time.
Improvement:
Increase software portability by providing a common programming model for all resources of the system. Avoid copying data between host memory and accelerator memories thanks to the shared memory space, and reduce the memory footprint.
Outlook:
Evaluate this programming model for distributed applications with irregular and non-predictable memory accesses. Specify and implement the system.