
HAL Id: inria-00000136

https://hal.inria.fr/inria-00000136

Submitted on 24 Jun 2005


Network Communications in Grid Computing: At a Crossroads Between Parallel and Distributed Worlds

Alexandre Denis, Christian Pérez, Thierry Priol

To cite this version:

Alexandre Denis, Christian Pérez, Thierry Priol. Network Communications in Grid Computing: At a Crossroads Between Parallel and Distributed Worlds. 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), Apr 2004, Santa Fe, United States. pp.95a. inria-00000136.


Network Communications in Grid Computing: At a Crossroads Between Parallel and Distributed Worlds

Alexandre Denis¹, Christian Pérez², Thierry Priol²

¹ IRISA/IFSIC, ² IRISA/INRIA
Campus de Beaulieu — F-35042 Rennes Cedex — France

Alexandre.Denis@irisa.fr, Christian.Perez@irisa.fr, Thierry.Priol@irisa.fr

Abstract

This paper studies a communication model that aims at extending the scope of computational grids by allowing the execution of parallel and/or distributed applications without imposing any programming constraints or the use of a particular communication layer. Such a model leads to the design of a communication framework for grids which allows the use of the middleware appropriate for the application rather than the one dictated by the available resources.

Such a framework is able to handle any communication middleware —even several at the same time— on any kind of networking technology. Our proposed dual-abstraction (parallel and distributed) model is organized into three layers: arbitration, abstraction and personalities, which are highlighted in the paper. The performance obtained with PadicoTM, our available open-source implementation of the proposed framework, shows that such functionality can be provided while still delivering very high performance.

1. Introduction

The emergence of computational grids as new high-performance computing infrastructures gives users access to computing resources at an unprecedented scale in the history of computing. However, computational grids differ from previous computing infrastructures since they exhibit both parallel and distributed aspects: a computational grid is a set of various and widely distributed computing resources, which are often parallel, ranging from high-performance supercomputers to clusters of PCs. As a consequence, a grid usually contains various networking technologies, from SANs within a machine room to WANs at a continental scale.

This work was supported by the Incentive Concerted Action “GRID” (ACI GRID) of the French Ministry of Research.

Ideally, when applications are deployed on grid resources, they should adapt themselves to their environment, and to the networks in particular. Current programming practices associated with computational grids were strongly influenced by such an adaptation capability. A common programming approach is to see the grid as a virtual parallel computer, so that programmers can follow the usual techniques of parallel programming, for example with MPI. Since MPI is available on a large number of networking technologies, applications based on this communication middleware are able to adapt to the networking environment. Such an adaptation is performed at the application programming interface level. However, adaptation is also required at runtime. For example, an application linked with an MPI library configured to use the GM driver of a Myrinet network restricts the application deployment to systems that provide such a network.

However, providing a single (message-based) communication model will not be enough for most applications, because it does not take into account other kinds of communication such as visualization, steering, coupling of simulation codes, or interactive control. Therefore, in addition to a parallel middleware system such as MPI, at least one other middleware system is required to handle these new kinds of interaction. Such a middleware system should be distributed-oriented to handle dynamic connection/disconnection.

The first contribution of this paper is to propose a communication framework that decouples application middleware systems from the actual networking environment. Hence, applications become able to transparently and efficiently utilize any kind of communication middleware (either parallel or distributed) on any network they are deployed on, thus removing the aforementioned deployment constraints. As a second contribution of this paper, the proposed model is able to concurrently support several communication middleware systems with few or no changes. Such a capability is very important when using


modern programming practices such as distributed component programming for the design of HPC applications. Indeed, distributed component models, such as CCA [2] or GridCCM [20], require a communication middleware for communication between components; if the code inside the components is parallel, then a communication middleware is also used inside the components. As a consequence, these modern programming practices need two middleware systems, one for intra-component communications and another for inter-component communications. We have shown in [10] that even for standard networking technologies such as Ethernet with TCP/IP, sharing such a network interface between two middleware systems raises some serious technical concerns.

The remainder of this paper is organized as follows. Section 2 presents an analysis of grid communication based on some examples of typical grid usage. In Section 3, we propose a communication framework model that supports both parallelism and distributed computing. Section 4 describes the implementation of this model in the PadicoTM platform and Section 5 evaluates its performance. Section 6 presents related work. Finally, we conclude in Section 7.

2. Grid Communication Model Analysis

This section introduces some important features that we think current and forthcoming grid-enabled applications will require. Then, it defines the communication paradigms and analyzes communication abstraction so as to draw the main directions for a communication framework for grids.

2.1. Grid Network Use Analysis

A grid application can be deployed on different resource configurations. For instance, one deployment configuration may be a set of nodes within a single PC cluster equipped with a high-performance network, while another deployment configuration may be a set of nodes in two separate PC clusters interconnected through a high-bandwidth WAN. Another example of grid use is given by parallel component-based applications [2, 20] where a component embeds a parallel code. The component framework uses its own paradigm to interconnect components. This paradigm should be independent from the communication paradigms used internally by parallel components. Hence, an MPI-based component could be connected to a PVM-based component.

A final example is a grid application which supports connection and disconnection of the user to visualize and/or monitor the ongoing computation. Hence, the grid application is likely to use at least two middleware systems: one or more for the computation and another for visualization/monitoring.

These scenarios introduce some important features which should be supported by grid-enabled middleware systems:

Transparency — The middleware systems used by an application should be able to transparently and efficiently use the available resources. For example, an MPI, PVM, Java or CORBA communication should be able to utilize high-speed networks (SAN) as well as local area networks (LAN) and wide area networks (WAN). Moreover, they should adapt their security requirements to the characteristics of the underlying network, e.g. if the network is secure, it is useless to encrypt data.

Flexibility — There is a diversity of middleware systems, and we can assume there always will be. It seems important not to tie grid applications to a specific grid framework but instead to ease the “gridification” of middleware systems.

Interoperability — Grids are not a closed world. Grid applications will need to be accessible using standard protocols. So, there is a strong need to keep protocol interoperability.

Support for Multiple Communication Paradigms — Some programming models like parallel components (CCA [2], GridCCM [20]), or situations like a SOAP-based monitoring system of an MPI application, require several middleware systems. Thus, it is important to allow different middleware systems to be used simultaneously.

2.2. Communication Paradigm Analysis

If we define a communication paradigm as a family of middleware systems which are built on the same model, we can distinguish two important kinds of communication paradigms: the parallel paradigm and the distributed paradigm.

Parallel paradigm — The main aspect of parallelism is, without any doubt, high performance. Communications take place inside a definite and usually static set of nodes known by each other (mostly SPMD-oriented), messages have well-defined boundaries, the API is optimized for zero-copy implementations, and there are collective operations which involve several nodes of the set. A typical example is MPI. We can distinguish distributed-memory parallelism and shared-memory parallelism; in this network-centric paper, we focus on distributed-memory parallelism.

Distributed paradigm — The main constraint is interoperability. Connections are dynamic, managed on a per-link basis in a client/server way; interoperability is brought across architectures, operating systems and software vendors; communication primitives may use streaming. Some typical examples are TCP/IP, CORBA or SOAP.

These are our definitions and will be used in the remainder of this paper. They should be understood as a classification with soft boundaries, not as absolute rules; for example MPI-2 allows dynamic connections, and a DSM (Distributed Shared Memory) system is not message-based, but we still consider them as parallel. In this paper, we consider TCP/IP and UDP, CORBA-IIOP [18], SOAP [4], HLA-RTI [15] and Java RMI as distributed-oriented; MPI, PVM, DSM, FastMessage, Madeleine [3] or Panda [22] as parallel-oriented.

2.3. Abstraction Level Analysis

The last step of our analysis deals with the different levels of communication abstraction found in a grid application.

The resource abstraction principle consists in the definition of an abstract interface which is not bound to any particular implementation. There may exist several incarnations which implement the same abstract interface. Abstraction is a widely used mechanism to cope with the differences between various kinds of networks; in this case, it is called a portability mechanism. When an abstract interface for portability is designed to be used by several middleware systems and/or applications (and not only for the portability of one middleware system), it is called a genericity mechanism. This results in a stack of software layers whose abstraction level increases from bottom to top:

System-level — implemented by a network driver such as GM, BIP [21], VIA, Sisci [14] or another vendor-supplied communication library, or by the operating system, such as TCP/IP.

Generic-level — implemented by a communication framework, such as Madeleine [3], Nexus [12] or Panda [22]. The API, independent from the network, is likely to be used by a middleware system.

Application-level — implemented by a middleware system, such as CORBA, MPI, PVM or HLA-RTI. It implements a programming model. The API is designed to be used by applications.

3. A Model for Grid Communication Frameworks

This section presents our proposed model of a communication framework for grids that takes into account both parallel and distributed paradigms.

3.1. Abstraction Model Study

The commonly used abstraction model brings portability: the ability for a middleware system to utilize several kinds of networks, according to what is available. It also brings genericity: the ability to reuse the portability software infrastructure for several middleware systems. However, genericity is usually brought by the definition of a unique abstract interface. This choice of a unique abstract interface is especially relevant for portability, but is questionable regarding genericity: this approach is generic only inside a particular paradigm —the paradigm chosen for the abstract interface.

Which abstract interface? Since we want our communication framework to be able to support middleware systems based on both paradigms, we want to find an abstract interface able to be used by both kinds of middleware. We may think of using a unique distributed-oriented abstract interface. Indeed, a lot of parallel middleware can utilize TCP/IP sockets, which are a distributed abstract interface. This approach is well adapted for making a parallel system look like a distributed network infrastructure (e.g. Ethernet). However, it seems inappropriate for a parallel-oriented network, such as the internal network of a supercomputer or a cluster. As depicted in Figure 1 (a), the use of a single abstract interface imposes unnecessary compromises, in particular when running a parallel application on a parallel machine! In this case, for example, an MPI implementation built atop TCP/IP is able to run on most networking resources, including supercomputer networks, but is unable to utilize “parallel-specific” properties of these networks, such as optimized collective operations. This is due to the lack of expressiveness of the distributed-oriented TCP/IP API.

Symmetrically, it is quite common to use a unique parallel interface on grids, for example MPICH-G2 [11]. It is possible to use it to implement distributed-oriented communication mechanisms, such as distributed objects. However, the parallel-oriented MPI interface cannot express properties which are essential for distributed computing, such as IP addressing, dynamic connections in a client/server fashion (not spawn as in MPI-2), or interoperability with other standard implementations. For instance, it seems impossible to build a standard-conforming CORBA implementation on top of MPICH-G2 (or more precisely, MPICH's abstract interface called “ADI-2”) alone.

In both cases, a unique abstract interface biased towards only parallelism or distributed computing penalizes the middleware systems from the other paradigm, since some properties available at system-level cannot be expressed by the abstract interface, so they are lost.

Figure 1. Several abstraction models may be envisaged. (a) Everything is expressed through a single (distributed) abstraction: two cross-paradigm translations are needed for a parallel middleware atop a parallel network. (b) A unified abstraction makes compromises in all cases: it gives up most possible optimizations and imposes compromises on everything. (c) Dual-abstraction model: different abstract interfaces are used for different paradigms, so only the required compromises are made.

Avoid the “bottleneck of features”. One may try to find a better abstract interface which would combine properties from both parallelism and distributed computing, as depicted in Figure 1 (b); this abstraction would “keep the best of both worlds”. In order to take into account the interoperability constraint from distributed computing, a unified abstract interface cannot be far from a distributed-oriented interface. More generally, it seems unrealistic to weaken the strong constraints of distributed computing to make them look more like the weaker hypotheses of parallelism which allow some optimizations: giving up the streaming capability of distributed computing in order to optimize a message-based communication system (à la MPI or Madeleine) breaks the required interoperability with TCP/IP; using topology and hardware configuration information to optimize collective operations seems incompatible with the per-link connection management and the interoperability with standard plain IP required on the distributed side. A unified abstract interface cannot give up the strong constraints required by the distributed side, thus it uselessly imposes these strong constraints even on the parallel side. A single abstract interface, be it distributed, parallel, or unified, does not seem satisfactory.

Rather than trying to unify contrary things, we propose a dual-abstraction interface, with both a parallel- and a distributed-oriented interface. Each middleware system is either parallel or distributed, not both at the same time. For example CORBA, HLA and SOAP are distributed, while MPI and PVM need a parallel abstract interface. There is no need to find an interface which would be both; it is sufficient to provide each middleware system with the appropriate abstract interface, and to supply each abstract interface on both kinds of networks. This dual-abstraction approach is depicted in Figure 1 (c). Each middleware system utilizes the required abstract interface. Each abstract interface is instantiated on each network through an adapter: an adapter may be either straight or cross-paradigm. Consequently, compromises for cross-paradigm translation are performed only when they are required. With such a dual-abstraction model, there always exists an abstract-level interface able to express the properties of each kind of hardware. Bending all system-level interfaces towards a unique abstraction does not seem appropriate because it loses some key features: a communication framework for grids can be neither parallel-only nor distributed-only. We chose to build our grid communication framework on this dual-abstraction model.
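To make the dual-abstraction idea concrete, the sketch below shows how an adapter could be described as a small C structure recording the paradigm of the abstract interface it exposes and the paradigm of the network it runs on; a selector can then prefer straight adapters over cross-paradigm ones. All type and function names here are hypothetical illustrations of the principle, not the actual PadicoTM API.

```c
/* Hypothetical sketch of the dual-abstraction adapter idea (names are illustrative). */
#include <stddef.h>

typedef enum { PARADIGM_PARALLEL, PARADIGM_DISTRIBUTED } paradigm_t;

typedef struct adapter {
    const char *name;           /* e.g. "circuit/madio" or "vlink/sysio" */
    paradigm_t  abstract_side;  /* paradigm of the abstract interface exposed */
    paradigm_t  network_side;   /* paradigm of the underlying network */
} adapter_t;

/* An adapter is "straight" when no cross-paradigm translation is needed. */
static int adapter_is_straight(const adapter_t *a)
{
    return a->abstract_side == a->network_side;
}

/* Pick an adapter for a given abstract interface: prefer straight ones. */
static const adapter_t *select_adapter(const adapter_t *table, size_t n,
                                       paradigm_t wanted_abstract)
{
    const adapter_t *fallback = NULL;
    for (size_t i = 0; i < n; i++) {
        if (table[i].abstract_side != wanted_abstract)
            continue;
        if (adapter_is_straight(&table[i]))
            return &table[i];       /* straight adapter: no compromise */
        if (!fallback)
            fallback = &table[i];   /* cross-paradigm adapter kept as fallback */
    }
    return fallback;
}
```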

3.2. Resource Virtualization for Seamless Swapping of Communication Methods

The middleware systems likely to be used by grid-enabled applications are various: MPI, CORBA, SOAP, HLA, JVM, PVM, etc. Moreover, for each kind of middleware, there are several implementations which have their own specific properties. Developing a middleware system is a heavy task —for example, MPICH contains 200,000 lines of C— and requires very specific skills. Moreover, the standards —and thus, the middleware systems themselves— are ever-changing. It does not seem reasonable to re-develop an implementation of each one of these middleware systems specifically for a given communication framework. Instead, we chose to re-use existing implementations. Thus it is easy to follow the new versions and to use specific features of a given implementation.

To seamlessly re-use existing implementations of middleware systems, we choose to virtualize networking resources. This consists in giving the middleware system the illusion that it is using the usual resource it knows, even if the real underlying resource is completely different. For example, we show a “socket” API to a CORBA implementation so as to make it believe it is using TCP/IP, even if it is actually using another protocol/network behind the scenes. This is performed through the use of thin wrappers on top of the appropriate abstract interface to make it look like the required API. We call these small wrappers personalities. It is possible to give several personalities to an abstract interface.

Virtualization and abstraction mechanisms with cross-paradigm adapters allow any middleware system to seamlessly utilize any network. However, even if a straight adapter is available, it is not always the best method, especially on distributed-oriented networks. Other methods include, for example:

Parallel streams on WAN — Over a high-bandwidth, high-latency WAN with TCP/IP, each single packet loss can dramatically lower the bandwidth. A solution consists in utilizing multiple sockets in parallel for a single logical link, so as to reduce the influence of each isolated loss. This principle of parallel streams is already used, for example, in GridFTP [1] (see the sketch after this list).

Online compression — On slow networks, it may be worth compressing data to speed up the transfers. AdOC [16] implements an adaptive online compression mechanism.

Encryption and authentication — When a connection lies between two different sites, it is likely that the user wants authentication and/or encryption. This may be achieved through the use of a protocol plug-in. It raises a whole set of new problems, such as certificate management and credential delegation. We investigate the use of the Grid Security Infrastructure (GSI) [13] or IPsec.

Loss-tolerant protocol — On slow WANs which suffer from a high loss rate, applications may prefer to trade some reliability for better bandwidth, while not accepting totally uncontrollable losses. Such a tunable tradeoff is implemented in VRP [6], a protocol with a tunable loss tolerance.
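As a rough illustration of the parallel-streams idea mentioned above, the following sketch stripes a buffer round-robin over several already-connected TCP sockets so that a single logical link is carried by multiple flows. It is a minimal sketch under the assumption that the receiver reassembles the chunks in the same round-robin order; error handling and the actual PadicoTM adapter interface are omitted.

```c
/* Minimal sketch: stripe one logical send over several TCP sockets (round-robin). */
#include <sys/types.h>
#include <sys/socket.h>
#include <stddef.h>

/* 'socks' are already-connected stream sockets forming one logical link. */
int striped_send(const int *socks, int nsocks, const void *buf, size_t len,
                 size_t chunk_size)
{
    const char *p = buf;
    int next = 0;
    while (len > 0) {
        size_t chunk = len < chunk_size ? len : chunk_size;
        size_t done = 0;
        while (done < chunk) {                     /* send one full chunk */
            ssize_t n = send(socks[next], p + done, chunk - done, 0);
            if (n < 0)
                return -1;                         /* real code would retry/report */
            done += (size_t)n;
        }
        p   += chunk;
        len -= chunk;
        next = (next + 1) % nsocks;                /* round-robin over the streams */
    }
    return 0;
}
```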

These various communication methods may be supplied as alternate adapters besides straight and cross-paradigm adapters. They must exhibit the right abstract interface according to their respective paradigm. Their use is thus seamless from the point of view of the middleware systems. Thanks to these virtualization mechanisms, the hardware resources do not constrain the programming model to be used in applications. The possible deployment schemes are more advanced than just parallel applications on a parallel machine or distributed applications on a distributed system. Each middleware system is able to use every available resource —parallel and distributed— with the most appropriate method; e.g. CORBA as well as MPI are able to efficiently use Myrinet if available, or use WAN-specific methods if necessary. The virtualization enables the use of a communication paradigm that is not dictated by the hardware.

3.3. A Hybrid Parallel + Distributed Model

In this section, we propose a model of a communication framework for grids, based on a three-layer approach, with both parallel- and distributed-oriented abstract interfaces. An implementation of this model is depicted in Figure 2. Our proposed dual-abstraction model is organized in three layers: arbitration, abstraction, and personalities. Parallel and distributed paradigms are present at each level. Therefore, cross-paradigm translation is performed only when required (i.e. distributed middleware atop parallel hardware or parallel middleware atop distributed networks), with no bottleneck of features.

Arbitration layer. Concurrent access to network hardware by multiple middleware systems at the same time is not straightforward. There is a high risk of access conflicts. We propose that arbitration should be dealt with at the lowest possible level, so as to build more advanced abstractions atop a fully reentrant system. Arbitration is performed by a layer which provides consistent, reentrant and multiplexed access to every networking resource; each resource is utilized with the most appropriate driver and method. The arbitrated interfaces are designed for efficiency and reentrance. Thus, we propose these APIs to be callback-based (à la Active Message). For true arbitration, this layer is the only client of the system-level resources: all accesses to the network should be performed through the arbitration layer. It also provides arbitration between different networks (e.g. Myrinet versus Ethernet) so that they do not interfere with each other, and between different adapters (as defined in Section 3.1) on the same network (e.g. both CORBA and MPI on Myrinet), even if the communication library does not provide multiplexing. More details about cooperative rather than competitive access are given in [9].
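The callback-based ("à la Active Message") style of the arbitrated interfaces can be pictured as follows: a middleware system registers a handler for a multiplexing channel, and the arbitration layer invokes that handler from its polling loop whenever a message for the channel arrives. The names below are hypothetical; they only illustrate the shape of such an API, not the real NetAccess interface.

```c
/* Hypothetical callback-based arbitrated API, in the spirit of Active Messages. */
#include <stddef.h>

#define MAX_CHANNELS 64

typedef void (*recv_callback_t)(int channel, const void *msg, size_t len,
                                void *user_data);

static struct {
    recv_callback_t cb;
    void *user_data;
} channel_table[MAX_CHANNELS];

/* A middleware system registers one handler per multiplexed channel. */
int arb_register_channel(int channel, recv_callback_t cb, void *user_data)
{
    if (channel < 0 || channel >= MAX_CHANNELS || channel_table[channel].cb)
        return -1;                        /* invalid channel or already taken */
    channel_table[channel].cb = cb;
    channel_table[channel].user_data = user_data;
    return 0;
}

/* Called by the arbitration layer's polling loop when a packet arrives:
 * the channel id (carried in the multiplexing header) selects the handler. */
void arb_dispatch(int channel, const void *msg, size_t len)
{
    if (channel >= 0 && channel < MAX_CHANNELS && channel_table[channel].cb)
        channel_table[channel].cb(channel, msg, len,
                                  channel_table[channel].user_data);
}
```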


Figure 2. Implementation of the model in PadicoTM: standard interfaces (personalities such as BSD sockets, FastMessage, Madeleine, VIO and AIO) are built on the abstract interfaces Circuit and VLink; adapters map these abstract interfaces onto the arbitrated interfaces NetAccess MadIO (over Madeleine) and NetAccess SysIO (over Posix system sockets).

Abstraction Layer. On top of the arbitration layer, we propose an abstraction layer which provides higher-level services, independent from the hardware. Its goal is to provide abstract interfaces well suited for use by various middleware systems. The abstraction layer should be fully transparent: the interfaces are the same whatever the underlying network is. The abstraction layer supplies both parallel- and distributed-oriented abstract interfaces on top of every method from the arbitration layer, through modules called adapters. This layer is responsible for automatically and dynamically choosing the best available interface from the arbitration layer according to the available hardware; it then maps it onto the right abstract interface through the right adapter. As shown in Figure 2, adapters may be straight (same paradigm at system- and abstract-level, e.g. parallel abstract interface on parallel hardware) or cross-paradigm (e.g. distributed abstract interface on parallel hardware).

Personalities. In order to provide virtualized communication APIs, we propose a personality layer able to supply various standard APIs on top of the abstract interfaces. Personalities are thin wrappers which adapt a generic API to make it look like another API. They do no protocol adaptation nor paradigm translation; they only adapt the syntax.

4. Implementation of the Communication Model

Padico [7] is our software infrastructure for Grid Computing. The communication model described in the previous section has been implemented in the high-performance runtime system of Padico called PadicoTM [9, 10], as depicted in Figure 2. The PadicoTM framework is used for parallel CORBA objects [8] and components [20]. This paper focuses only on the novel communication model proposed in PadicoTM. However, PadicoTM addresses other issues for integrating middleware systems, such as dynamic code loading and configuration, arbitration for multi-threading, memory management and Unix signals. These other issues are purposely not discussed in this paper.

4.1. Network Access Arbitration: NetAccess

The arbitration layer in PadicoTM is called NetAccess, which contains two subsystems: SysIO for access to system I/O (sockets, files), and MadIO for multiplexed access to high-performance networks. A core handles a consistent interleaving among the concurrent polling loops. NetAccess is open enough to allow the integration of other subsystems beside MadIO and SysIO for other paradigms, such as Shmem on SMP machines for example.

NetAccess MadIO: API for Accessing Parallel-oriented Hardware. For good I/O reactivity and portability over high-performance networks, we have chosen the high-performance network library Madeleine [3] as a foundation. Madeleine is used for high-performance networks such as Myrinet, SCI and VIA. Madeleine provides no more multiplexing channels than what is allowed by the hardware (e.g. 2 over Myrinet, 1 over SCI). MadIO adds a logical multiplexing/demultiplexing facility which allows an arbitrary number of communication channels. Multiplexing on top of Madeleine adds a header to all messages. This can significantly increase the latency if not done properly. We implement header combining to aggregate headers from several layers into a single packet. Thus, multiplexing on top of Madeleine adds virtually no overhead to middleware systems which send headers anyway. We measured that the overhead of MadIO over plain Madeleine is less than 0.1 µs, which is imperceptible on most current networks.
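Header combining can be pictured as a gather operation: the MadIO multiplexing header and the middleware's own header travel in the same physical packet instead of two. The sketch below uses a hypothetical gather-list primitive; the real MadIO/Madeleine packing interface may differ, but the principle is the same: no extra packet, hence almost no extra latency.

```c
/* Hypothetical sketch of header combining: the multiplexing header is sent
 * in the same physical packet as the middleware header and the payload. */
#include <stdint.h>
#include <stddef.h>

struct madio_mux_header {      /* added by the multiplexing layer */
    uint32_t channel_id;       /* which logical channel this message belongs to */
    uint32_t payload_len;
};

struct iovec_like {            /* a (pointer, length) pair for gathered sends */
    const void *base;
    size_t      len;
};

/* Assumed low-level primitive: sends all fragments as ONE physical packet. */
extern int net_send_gathered(int link, const struct iovec_like *iov, int cnt);

int madio_send(int link, uint32_t channel,
               const void *mw_header, size_t mw_header_len,
               const void *payload, size_t payload_len)
{
    struct madio_mux_header mux = { channel, (uint32_t)payload_len };
    struct iovec_like iov[3] = {
        { &mux,      sizeof mux    },   /* multiplexing header      */
        { mw_header, mw_header_len },   /* middleware's own header  */
        { payload,   payload_len   },   /* user data                */
    };
    return net_send_gathered(link, iov, 3);  /* one packet, combined headers */
}
```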


NetAccess SysIO: API for Accessing Distributed-oriented Hardware. Contrary to a widespread belief, directly using the socket API from the OS does not bring full reentrance, multiplexing and cooperation. Several middleware systems not designed to work together may get into trouble when used simultaneously, even with only plain TCP/IP. There are reentrance issues with signal-driven I/O (used by middleware systems designed to deal with heavy load), which result in incorrect behavior or, worse, in a crash. If one middleware system uses blocking I/O and another uses active polling, the one which does active polling holds nearly 100 % of the CPU time; this results in unfairness or even deadlock. To solve these conflicts, SysIO manages a single receive loop that scans the open sockets and calls user-registered callback functions when a socket is ready. The callback-based design guarantees that there are no reentrance issues and no signals to deal with.
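A minimal sketch of such a single receive loop is shown below: sockets are registered together with a callback, and one select()-based loop invokes the callback of whichever socket becomes ready. The function and structure names are illustrative, not the actual SysIO API.

```c
/* Illustrative single receive loop in the style of NetAccess SysIO:
 * one loop polls all registered sockets and calls user callbacks. */
#include <sys/select.h>
#include <stddef.h>

#define MAX_SOCKS 256

typedef void (*sock_callback_t)(int fd, void *user_data);

static struct { int fd; sock_callback_t cb; void *user_data; } reg[MAX_SOCKS];
static int nreg = 0;

int sysio_register(int fd, sock_callback_t cb, void *user_data)
{
    if (nreg >= MAX_SOCKS)
        return -1;
    reg[nreg].fd = fd; reg[nreg].cb = cb; reg[nreg].user_data = user_data;
    nreg++;
    return 0;
}

/* The unique receive loop: no signal-driven I/O, no competing pollers. */
void sysio_loop(void)
{
    for (;;) {
        fd_set rd;
        int maxfd = -1;
        FD_ZERO(&rd);
        for (int i = 0; i < nreg; i++) {
            FD_SET(reg[i].fd, &rd);
            if (reg[i].fd > maxfd)
                maxfd = reg[i].fd;
        }
        if (select(maxfd + 1, &rd, NULL, NULL, NULL) <= 0)
            continue;                            /* interrupted or error: retry */
        for (int i = 0; i < nreg; i++)
            if (FD_ISSET(reg[i].fd, &rd))
                reg[i].cb(reg[i].fd, reg[i].user_data);  /* socket is ready */
    }
}
```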

NetAccess core. The core of NetAccess manages the threads running the polling loops. It enforces fairness between SysIO and MadIO. The interleaving policy between SysIO and MadIO is dynamically user-tunable through a configuration API, to give more priority to system sockets or to the high-performance network depending on the application.

4.2. Abstractions: VLink and Circuit

The abstract interfaces in PadicoTM are called VLink for distributed computing, and Circuit for parallelism.

Distributed abstract interface: VLink. The VLink interface is designed for distributed computing. It is client/server-oriented and supports dynamic connections and streaming. In order to easily allow several personalities —both synchronous and asynchronous ones—, VLink is based on a flexible asynchronous API. This API consists of five primitive operations —read, write, connect, accept, close. These functions are asynchronous: when they are invoked, they initiate (post) the operation and may return before completion. Their completion may be tested by polling the VLink descriptor; a handler may be set which will be called upon operation completion. Such a set of functions is called a VLink driver. VLink drivers have been implemented on top of MadIO, SysIO, Parallel Streams for WAN, AdOC [16], and loopback.
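The shape of such an asynchronous five-primitive interface could look like the following C sketch: every operation only posts a request, and completion is observed either by polling the descriptor or through an optional completion handler. The exact types and names are assumptions for illustration, not the real VLink declarations.

```c
/* Illustrative sketch of an asynchronous, five-primitive VLink-style API. */
#include <stddef.h>

typedef struct vlink vlink_t;                    /* opaque link descriptor     */
typedef void (*vlink_handler_t)(vlink_t *v, int op, int status, void *user);

/* The five primitives: each one only POSTS the operation and may return
 * before completion. */
int vlink_connect(vlink_t *v, const char *addr);
int vlink_accept (vlink_t *v, vlink_t *server);
int vlink_read   (vlink_t *v, void *buf, size_t len);
int vlink_write  (vlink_t *v, const void *buf, size_t len);
int vlink_close  (vlink_t *v);

/* Completion is observed either by polling the descriptor... */
int vlink_test(vlink_t *v);                      /* non-blocking: done yet?    */
int vlink_wait(vlink_t *v);                      /* block until completion     */

/* ...or through a handler called when the posted operation completes. */
void vlink_set_handler(vlink_t *v, vlink_handler_t h, void *user);

/* Example: a synchronous-looking write built on the asynchronous primitives. */
static int blocking_write(vlink_t *v, const void *buf, size_t len)
{
    if (vlink_write(v, buf, len) != 0)           /* post the operation */
        return -1;
    return vlink_wait(v);                        /* wait for completion */
}
```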

Abstract interface for parallelism: Circuit. The Circuit interface is designed for parallelism. It manages communications on a definite set of nodes called a group. A group may be an arbitrary set of nodes, e.g. a cluster or a subset of a cluster, and may span multiple clusters or even multiple sites. Circuit allows communications from every node to every other node through an interface optimized for parallel runtimes: it uses incremental packing with explicit semantics to allow on-the-fly packet reordering, as in Madeleine [3]. Collective operations in Circuit still need to be investigated. Circuit adapters have been implemented on top of MadIO, SysIO, loopback and VLink (to use the alternate VLink adapters); a given instance of Circuit can use different adapters for different links.
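Incremental packing with explicit semantics can be sketched as follows: each piece of a message is packed with a flag saying whether the receiver needs it immediately (for example a header used to decide where the rest goes) or whether its delivery may be delayed and reordered for efficiency, in the spirit of Madeleine. The names and flags below are hypothetical, not the actual Circuit API.

```c
/* Hypothetical sketch of incremental packing with explicit semantics,
 * in the spirit of Madeleine/Circuit. */
#include <stddef.h>

typedef struct circuit_msg circuit_msg_t;        /* opaque outgoing message */

typedef enum {
    PACK_EXPRESS,   /* must be available on the receiver right after unpack  */
    PACK_CHEAPER    /* may be delayed/reordered; lets the runtime optimize   */
} pack_mode_t;

circuit_msg_t *circuit_begin(int group, int dest_node);
int  circuit_pack(circuit_msg_t *m, const void *data, size_t len, pack_mode_t mode);
int  circuit_end (circuit_msg_t *m);             /* flush the whole message  */

/* Typical use: pack the size eagerly, then the bulk data lazily. */
static int send_array(int group, int dest, const double *a, size_t n)
{
    circuit_msg_t *m = circuit_begin(group, dest);
    if (!m)
        return -1;
    circuit_pack(m, &n, sizeof n, PACK_EXPRESS);     /* size needed to allocate */
    circuit_pack(m, a, n * sizeof *a, PACK_CHEAPER); /* bulk payload            */
    return circuit_end(m);
}
```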

Selector. VLink and Circuit automatically choose which protocol to use according to a knowledge base of the network topology managed by PadicoTM and user-defined preferences. All protocols are available for both the VLink and Circuit interfaces.

4.3. Personalities and Middleware Systems

PadicoTM provides several well-known APIs through simple “cosmetic” adapters over the VLink and Circuit abstract interfaces. These thin API wrappers are called personalities. The personalities for VLink are: Vio, for explicit use through a socket-like API; and SysWrap, which supplies a 100 % socket-compliant API through wrapping at link stage, for direct use within C, C++ or Fortran legacy codes without even recompiling. Thus, legacy applications are able to transparently use all PadicoTM communication methods without losing interoperability with PadicoTM-unaware applications on plain sockets. We implement an Aio personality on top of VLink which provides a plain Posix.2 Asynchronous I/O (Aio) API. Thin adapters on top of Circuit provide a FastMessage 2.0 API and a (virtual) Madeleine API.
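Link-stage wrapping as used by SysWrap can be illustrated with the GNU linker's --wrap option: calls to connect() in unmodified legacy code are redirected to a wrapper that decides whether to hand the call to a virtual-link adapter or to the real system socket. This is only a sketch of the mechanism under the assumption of GNU ld; the decision function and the virtual-link calls are hypothetical, and the actual PadicoTM wrapping may differ.

```c
/* Sketch of link-stage wrapping (GNU ld: gcc ... -Wl,--wrap=connect).
 * Calls to connect() in legacy code are redirected to __wrap_connect(),
 * which may route them onto PadicoTM-style virtual links instead. */
#include <sys/socket.h>

/* Provided by the linker: the original libc implementation. */
int __real_connect(int fd, const struct sockaddr *addr, socklen_t len);

/* Hypothetical helpers from the virtualization layer. */
extern int vlink_handles_address(const struct sockaddr *addr, socklen_t len);
extern int vlink_connect_fd(int fd, const struct sockaddr *addr, socklen_t len);

int __wrap_connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    if (vlink_handles_address(addr, len))
        return vlink_connect_fd(fd, addr, len);   /* e.g. route over Myrinet       */
    return __real_connect(fd, addr, len);         /* plain TCP/IP: stay compatible */
}
```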

Thanks to SysWrap, various middleware systems have been seamlessly ported to PadicoTM with no change in their code: CORBA implementations (omniORB 3, omniORB 4, ORBacus 4.0, all Mico 2.3.x versions including CCM-enabled ones), an HLA implementation (Certi from Onera), and a SOAP implementation (gSOAP 2.2). A Java virtual machine (Kaffe 1.0.7) has been slightly modified for use within PadicoTM, with some changes in its multi-threading management code. Thanks to the Madeleine personality, the existing MPICH/Madeleine implementation can run in PadicoTM. The middleware systems are dynamically loadable into PadicoTM. Arbitration guarantees that any combination of them may be used at the same time.

5. Performance Evaluation

Our test platform consists of dual-Pentium III 1 GHz nodes with 512 MB RAM, switched Ethernet-100, Myrinet-2000 and Linux 2.2. The raw bandwidth of various middleware systems in PadicoTM over Myrinet-2000 is depicted in Figure 3. The maximum bandwidth and latency for some middleware systems in PadicoTM are given in Table 1.

Figure 3. Bandwidth of various middleware systems in PadicoTM over Myrinet-2000 as a function of message size: omniORB-3.0.2, omniORB-4.0.0, Mico-2.3.7, ORBacus-4.0.5, MPICH-1.1.2 and Java sockets over Myrinet-2000, with TCP/Ethernet-100 as a reference.

API or middleware        | Circuit | VLink | MPICH-1.2.5 | omniORB 3 | omniORB 4 | Java sockets
One-way latency (µs)     | 8.4     | 10.2  | 12.06       | 20.3      | 18.4      | 40
Maximum bandwidth (MB/s) | 240     | 239   | 238.7       | 238.4     | 235.8     | 237.9

Table 1. Performance of various middleware systems with PadicoTM over Myrinet-2000.

For MPI, omniORB and Java sockets, the peak bandwidth is excellent: roughly 240 MB/s, which is 96 % of the maximum Myrinet-2000 hardware bandwidth. The latency is 12 µs for MPI and 18 µs for omniORB. We notice the excellent performance of omniORB; as far as we know, omniORB in PadicoTM is the fastest existing CORBA implementation. Mico and ORBacus get lower performance because, unlike omniORB, they always copy data for marshalling and unmarshalling. Mico peaks at 55 MB/s with a latency of 63 µs, and ORBacus gets 63 MB/s with a latency of 54 µs. However, these poor performance results are due to the internal design of the middleware systems themselves and are consistent with theory [9].

The PadicoTM overhead is negligible: MPICH in PadicoTM over Myrinet-2000 gets roughly the same performance as a standalone implementation of MPICH over Myrinet-2000. The performance of the other middleware systems cannot be compared: without PadicoTM, no CORBA implementation is able to utilize a Myrinet-2000 network.

We have run tests on VTHD, a French experimental high-bandwidth WAN. All middleware systems get roughly the same performance, namely a bandwidth of 9 MB/s and an 8 ms latency, which is the typical performance on this kind of network. When activating Parallel Streams, the bandwidth goes up to 12 MB/s, which is the maximum possible given that each node is connected to VTHD through Ethernet-100. On the WAN, every middleware system gets roughly the same performance, since the software overhead is negligible compared to the network speed.

We have tested VRP on a slow trans-continental Internet link. The link exhibits a typical loss rate of 5-10 %. With TCP/IP and plain sockets, we get 150 KB/s; if we give up some reliability and allow up to 10 % loss with VRP, we get an average of 500 KB/s on the same link, i.e. three times more.

6. Related Works

Several middleware environments for managing network communications have emerged. However, very few take both parallel and distributed paradigms into account, and thus they are not tailored for general grid applications. For example, Panda [22] is a framework designed for parallel runtimes, namely PVM and MPI. ADAPTIVE (ACE) [23] is a distributed-oriented generic communication environment. Harness [17] and Quarterware [24] allow the use of multiple middleware systems at the same time; for the moment, they are limited respectively to MPI + PVM and MPI + RMI, and published performance figures mention only plain TCP —no Myrinet nor WAN-optimized protocols. VMI [19] deals with both paradigms; it is close to VIA, and targets large clusters with SANs rather than WANs. Proteus [5] is a system for integrating multiple message protocols such as SOAP and JMS within one system. It aims at decoupling applications from protocols, which is an approach quite similar to ours, but at a much higher level in the protocol stack. Nexus [12] used to be the communication subsystem of Globus. Nowadays, MPICH-G2 [11], built on Globus-IO, is a popular communication mechanism for grids, but it supports only one API, namely MPI.

7. Conclusion

Grid applications can benefit greatly from adequate support for middleware systems. This paper has introduced a novel communication model for grids based on a crossroads of the parallel and distributed worlds: both paradigms are present in the supported infrastructures and middleware systems. Hence, middleware systems are decoupled from the actual network so that they can transparently and efficiently utilize any network they are deployed on.

The second advantage of the proposed communication model is its ability to support several middleware systems from different paradigms at the same time. This feature is very important for parallel object/component programming models, though traditional MPI applications can also benefit from it.

The paper has also described the network-related aspects of the PadicoTM framework, which implements this model and supports various methods to utilize the networks: BIP, GM, Sisci, VIA, TCP/IP, Parallel Streams, AdOC, VRP; and it seamlessly supports various middleware systems: MPI, various CORBA implementations, HLA, SOAP, Java and a DSM.

Security issues need further investigation as they bring new problems. Other future work aims at providing other communication methods for more deployment flexibility: tunnels for full connectivity through firewalls, and global addressing (without being tied to the IP system).

PadicoTM is Open Source software and is available for download at http://www.irisa.fr/paris/Padicotm/.

References

[1] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal, and S. Tuecke. Data management and transfer in high performance computational grid environments. Parallel Computing Journal, 28(5):749–771, May 2002.

[2] R. Armstrong, D. Gannon, A. Geist, K. Keahey, S. Kohn, L. McInnes, S. Parker, and B. Smolinski. Toward a common component architecture for high-performance scientific computing. In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing, Aug. 1999.

[3] O. Aumage, L. Bougé, A. Denis, J.-F. Méhaut, G. Mercier, R. Namyst, and L. Prylli. A portable and efficient communication library for high-performance cluster computing. In IEEE Intl. Conf. on Cluster Computing (CLUSTER 2000), pages 78–87, Technische Universität Chemnitz, Saxony, Germany, Nov. 2000.

[4] E. Cerami. Web Services Essentials, chapter Simple Object Access Protocol (SOAP), pages 49–112. O'Reilly & Associates, 1st edition, Feb. 2002.

[5] K. Chiu, M. Govindaraju, and D. Gannon. The Proteus multiprotocol library. In Proceedings of the 2002 Conference on Supercomputing (SC'02), Baltimore, USA, Nov. 2002.

[6] A. Denis. Variable Reliability Protocol: A protocol with a tunable loss tolerance for high performance over a WAN. Research Report RR2000-11, LIP, ENS Lyon, Lyon, France, Feb. 2000.

[7] A. Denis, C. Pérez, T. Priol, and A. Ribes. Padico: A component-based software infrastructure for grid computing. In International Parallel and Distributed Processing Symposium (IPDPS), 2003.

[8] A. Denis, C. Pérez, and T. Priol. Portable parallel CORBA objects: an approach to combine parallel and distributed programming for grid computing. In Proc. of the 7th Intl. Euro-Par'01 Conf., pages 835–844, Manchester, UK, Aug. 2001. Springer.

[9] A. Denis, C. Pérez, and T. Priol. Towards high performance CORBA and MPI middlewares for grid computing. In C. A. Lee, editor, Proc. of the 2nd International Workshop on Grid Computing, number 2242 in LNCS, pages 14–25, Denver, Colorado, USA, Nov. 2001. Springer-Verlag. In conjunction with SuperComputing 2001 (SC'01).

[10] A. Denis, C. Pérez, and T. Priol. PadicoTM: An open integration framework for communication middleware and runtimes. In IEEE International Symposium on Cluster Computing and the Grid (CCGRID2002), 2002.

[11] I. Foster, J. Geisler, W. Gropp, N. Karonis, E. Lusk, G. Thiruvathukal, and S. Tuecke. Wide-area implementation of the message passing interface. Parallel Computing, 24(12):1735–1749, 1998.

[12] I. Foster, J. Geisler, C. Kesselman, and S. Tuecke. Managing multiple communication methods in high-performance networked computing systems. Journal of Parallel and Distributed Computing, 40(1):35–48, 1997.

[13] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke. A security architecture for computational grids. In 5th ACM Conference on Computer and Communications Security, pages 83–92, 1998.

[14] IEEE. Standard for Scalable Coherent Interface (SCI). Standard no. 1596, Aug. 1993.

[15] IEEE. IEEE standard for modeling and simulation (M&S) high level architecture (HLA)—federate interface specification. IEEE Standard 1516, Sept. 2000.

[16] E. Jeannot, B. Knutsson, and M. Bjorkmann. Adaptive online data compression. Edinburgh, Scotland, July 2002.

[17] D. Kurzyniec, V. Sunderam, and M. Migliardi. On the viability of component frameworks for high performance distributed computing: A case study. In IEEE International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, July 2002.

[18] OMG. The Common Object Request Broker: Architecture and Specification V3.0. OMG Document formal/02-06-33, June 2002.

[19] S. Pakin and A. Pant. VMI 2.0: A dynamically reconfigurable messaging layer for availability, usability, and management. In Workshop on Novel Uses of System Area Networks (SAN-1), Cambridge, Massachusetts, Feb. 2002.

[20] C. Pérez, T. Priol, and A. Ribes. A parallel CORBA component model for numerical code coupling. In C. A. Lee, editor, Proc. of the 3rd International Workshop on Grid Computing, LNCS, Baltimore, Maryland, USA, Nov. 2002. Springer-Verlag. To appear.

[21] L. Prylli and B. Tourancheau. BIP: a new protocol designed for high performance networking on Myrinet. In 1st Workshop on Personal Computer based Networks Of Workstations (PC-NOW '98), Lect. Notes in Comp. Science, pages 472–485. Springer-Verlag, Apr. 1998. In conjunction with IPPS/SPDP 1998.

[22] T. Rühl, H. Bal, R. Bhoedjang, K. Langendoen, and G. Benson. Experience with a portability layer for implementing parallel programming systems. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pages 1477–1488, Sunnyvale, CA, USA, Aug. 1996.

[23] D. C. Schmidt. An architectural overview of the ACE framework: A case-study of successful cross-platform systems software reuse. USENIX login magazine, Tools special issue, Nov. 1998.

[24] A. Singhai. Quarterware: a Middleware Toolkit of Software RISC Components. PhD thesis, University of Illinois at Urbana-Champaign, 1999.
