
Marco Fillo and Richard B. Gillett


In computing, a cluster is loosely defined as a parallel system comprising a collection of stand-alone computers (each called a node) connected by a network. Each node runs its own copy of the operating system, and cluster software coordinating the entire parallel system attempts to provide users with a unified system view.

Since each node in the cluster is an off-the-shelf computer system, clusters offer several advantages over traditional massively parallel processors (MPPs) and large-scale symmetric multiprocessors (SMPs). Specifically, clusters provide the following:1

• Much better price/performance ratios, opening a wide range of computing possibilities for users who could not otherwise afford a single large system.

• Much better availability. With appropriate software support, clusters can survive node failures, whereas SMP and MPP systems generally do not.

• Impressive scaling (hundreds of processors), when the individual nodes are medium-scale SMP systems.

• Easy and economical upgrading and technology migration. Users can simply attach the latest-generation node to the existing cluster network.

Despite their advantages and their impressive peak computational power, clusters have been unable to displace traditional parallel systems in the marketplace because their effective performance on many real-world parallel applications has often been disappointing. Clusters' lack of computational efficiency can be attributed to their traditionally poor communication, which is a result of the use of standard networking technology as a cluster interconnect. The development of the MEMORY CHANNEL network as a cluster interconnect was motivated by the realization that the gap in effective performance between clusters and SMPs can be bridged by designing a communication network to deliver low latency and high bandwidth all the way to the user applications.

Over the years, many researchers have recognized that the performance of the majority of real-world parallel applications is affected by the latency and bandwidth available for communication.2-5 In particular, it has been shown2,6,7 that the efficiency of parallel scientific applications is strongly influenced by the system's architectural balance as quantified by its communication-to-computation ratio, which is sometimes called the q-ratio.2 The q-ratio is defined as the ratio between the time it takes to send an 8-byte floating-point result from one process to another (communication) and the time it takes to perform a floating-point operation (computation). In a system with a q-ratio equal to 1, it takes the same time for a node to compute a result as it does for the node to communicate the result to another node in the system.

Thus, the higher the q-ratio, the more difficult it is to program a parallel system to achieve a given level of performance. Q-ratios close to unity have been obtained only in experimental machines, such as iWarp8 and the M-Machine,9 by employing direct register-based communication.

Table 1 shows actual q-ratios for several commercial systems.10,11 These q-ratios vary from about 100 for a DIGITAL AlphaServer 4100 SMP system using shared memory to 30,000 for a cluster of these SMP systems interconnected over a fiber distributed data interface (FDDI) network using the transmission control protocol/internet protocol (TCP/IP). An MPP system, such as the IBM SP2, using the Message Passing Interface (MPI) has a q-ratio of 5,714. The MEMORY CHANNEL network developed by Digital Equipment Corporation reduces the q-ratio of an AlphaServer-based cluster by a factor of 28 to 82, to within the range of 367 to 1,067. Q-ratios in this range permit clusters to efficiently tackle a large class of parallel technical and commercial problems.
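To make the balance argument concrete, the arithmetic behind two of the Table 1 entries can be written out explicitly from the q-ratio definition above; the t_comm and t_flop notation is ours:

```latex
q = \frac{t_{\mathrm{comm}}}{t_{\mathrm{flop}}},\qquad
q_{\mathrm{FDDI/TCP}} = \frac{180.0\ \mu\mathrm{s}}{0.006\ \mu\mathrm{s/FLOP}} = 30{,}000,\qquad
q_{\mathrm{MC\ native}} = \frac{2.2\ \mu\mathrm{s}}{0.006\ \mu\mathrm{s/FLOP}} \approx 367
```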

The benefits of low-latency, high-bandwidth networks are well understood.12,13 As shown by many studies,14,15 high communication latency over traditional networks is the result of the operating system overhead involved in transmitting and receiving messages. The MEMORY CHANNEL network eliminates this latency by supporting direct process-to-process communication that bypasses the operating system.

Table 1
Comparison of Communication and Computation Performance (q-ratio) for Various Parallel Systems

                                             Communication    Computation Performance       Communication-to-
                                             Latency          Based on LINPACK 100 x 100    computation Ratio
System                                       (Microseconds)   (Microseconds/FLOP)           (q-ratio)
AlphaServer 4100 Model 300 configurations
  SMP using shared memory messaging             0.6              0.006                           100
  SMP using MPI                                 3.4              0.006                           567
  FDDI cluster using TCP/IP                   180.0              0.006                        30,000
  MEMORY CHANNEL cluster using
    native messaging                            2.2              0.006                           367
  MEMORY CHANNEL cluster using MPI              6.4              0.006                         1,067
IBM SP2 using MPI                              40.0              0.007                         5,714

The MEMORY CHANNEL network supports this type of communication by implementing a natural extension of the virtual memory space, which provides direct, but protected, access to the memory residing in other nodes.

Based on this approach, DIGITAL developed its first-generation MEMORY CHANNEL network (MEMORY CHANNEL 1),16 which has been shipping in production since April 1996. The network does not require any functionality beyond the peripheral component interconnect (PCI) bus and therefore can be used on any system with a PCI I/O slot. DIGITAL currently supports production MEMORY CHANNEL clusters as large as 8 nodes by 12 processors per node (a total of 96 processors). One of these clusters was presented at Supercomputing '95 and ran clusterwide applications using High Performance Fortran (HPF), Parallel Virtual Machine (PVM),17 and MPI18 in DIGITAL's Parallel Software Environment (PSE). This 96-processor system has a q-ratio of 800 to 1,000, depending on the communication interface.

A 4-node MEMORY CHANNEL cluster running DIGITAL TruCluster software19 and the Oracle Parallel Server has held the cluster performance world record on the TPC-C benchmark20 (the industry standard in on-line transaction processing) since April 1996.

We next present an overview of the generic MEMORY CHANNEL network to justify the design goals of the second-generation MEMORY CHANNEL network (MEMORY CHANNEL 2). Following this overview, we describe in detail the architecture of the two components that make up the MEMORY CHANNEL 2 network: the hub and the adapter. Last, we present hardware-measured performance data.

MEMORY CHANNEL Overview

The MEMORY CHANNEL network is a dedicated cluster interconnection network, based on Encore's MEMORY CHANNEL technology, that supports a virtual shared memory space by means of internodal memory address space mapping, similar to that used in the SHRIMP system.21 The MEMORY CHANNEL substrate is a flat, fully interconnected network that provides push-only, message-based communication.16,22 Unlike traditional networks, the MEMORY CHANNEL network provides low-latency communication by supporting direct user access to the network.

As in Scalable Coherent Interface (SCI)23 and Myrinet24 networks, connections between nodes are established by mapping part of the nodes' virtual address space to the MEMORY CHANNEL interface.

A MEMORY CHANNEL connection can be opened as either an outgoing connection (in which case an address-to-destination node mapping must be provided) or an incoming connection. Before a pair of nodes can communicate by means of the MEMORY CHANNEL network, they must consent to share part of their address space, one side as outgoing and the other as incoming. The MEMORY CHANNEL network has no storage of its own. The granularity of the mapping is the same as the operating system page size.

MEMORY CHANNEL Address Space Mapping

Mapping is accomplished through manipulation of page tables. Each node that maps a page as incoming allocates a single page of physical memory and makes it available to be shared by the cluster. The page is always resident and is shared by all processes in the node that map the page. The first map of the page causes the memory allocation, and subsequent reads/maps point to the same page. No memory is allocated for pages mapped as outgoing. The mapper simply assigns the page table entry to a portion of the MEMORY CHANNEL hardware transmit window and defines the destination node for that transmit subspace. Thus, the amount of physical memory consumed for the clusterwide network is the product of the operating system page size and the total number of pages mapped as incoming on each node.

After mapping, MEMORY CHANNEL accesses are accomplished by simple load and store instructions, as for any other portion of virtual memory, without any operating system or run-time library calls. A store instruction to a MEMORY CHANNEL outgoing address results in data being transferred across the MEMORY CHANNEL network to the memory allocated on the destination node. A load instruction from a MEMORY CHANNEL incoming channel address space results in a read from the local physical memory initialized as a MEMORY CHANNEL incoming channel. The overhead (in CPU cycles) of establishing a MEMORY CHANNEL connection is much higher than that of using the connection. Because of the memory-mapped nature of the interface, the transmit or receive overhead is similar to an access to local main memory. This mechanism is the fundamental reason for the low MEMORY CHANNEL latency. Figure 1 illustrates an example of MEMORY CHANNEL address mapping.

The figure shows two sets of independent connections. Node 1 has established an outgoing channel to nodes 3 and 4 and also an incoming channel to itself. Node 4 has an outgoing channel to node 2.

Figure 1: MEMORY CHANNEL Mapping of a Portion of the Clusterwide Address Space

All connections are unidirectional, either outgoing or incoming. To map a channel as both outgoing and incoming to the same shared address space, node 1 maps the channel two times into a single process's virtual address space. The mapping example in Figure 1 requires a total of four pages of physical memory, one for each of the four arrows pointed toward the nodes' virtual address spaces.
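As a concrete illustration of this load/store communication style, the following C sketch shows one plausible shape of the transmit and receive paths. It is a minimal sketch under assumptions of ours: the device name /dev/mchan0, the mmap offsets that select channels, and the one-word flag protocol are all invented for illustration and are not DIGITAL's documented interface.

```c
/* Sketch of MEMORY CHANNEL-style user-level messaging (illustrative only). */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MC_PAGE 8192  /* mapping granularity: the operating system page size */

/* Sending node: map one page of the adapter's transmit window as an
 * outgoing channel. Stores to it become packets to the destination node. */
static void send_result(int fd, uint64_t result)
{
    volatile uint64_t *tx = mmap(NULL, MC_PAGE, PROT_WRITE, MAP_SHARED,
                                 fd, 0 /* hypothetical channel offset */);
    tx[1] = result;  /* payload: a plain store, no system call         */
    tx[0] = 1;       /* ready flag: the single-sender rule keeps order */
    munmap((void *)tx, MC_PAGE);
}

/* Receiving node: the incoming channel is ordinary local memory, so the
 * receiver polls a flag word. The spin reads the locally cached copy;
 * the cache line changes only when an arriving packet writes the page. */
static uint64_t recv_result(int fd)
{
    volatile uint64_t *rx = mmap(NULL, MC_PAGE, PROT_READ, MAP_SHARED,
                                 fd, 0 /* same clusterwide page, incoming */);
    while (rx[0] == 0)
        ;  /* spin in cache without saturating the memory system */
    uint64_t v = rx[1];
    munmap((void *)rx, MC_PAGE);
    return v;
}

int main(void)
{
    int fd = open("/dev/mchan0", O_RDWR);  /* hypothetical device node */
    if (fd < 0)
        return 1;
    /* In a real cluster the two halves run on different nodes; both are
     * shown here only to exhibit the call shapes. */
    send_result(fd, 42);
    (void)recv_result;  /* receiver side runs on the destination node */
    close(fd);
    return 0;
}
```

The point of the sketch is that, once the mappings exist, sending is a pair of stores and receiving is a cached spin, with no operating system involvement on either side.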

MEMORY CHANNEL mappings reside in two page control tables (PCTs) located on the MEMORY CHANNEL interface, one on the sender side and one on the receiver side. As shown in Figure 2, each page entry in the PCT has a set of attributes that specify the MEMORY CHANNEL behavior for that page.

The page attributes on the sender side are

• Transmit enabled, which must be set to allow transmission from store instructions to a specific page

• Local copy on transmit, which directs an ordered copy of the transmitted packet to the local memory

• Acknowledge request, which is used to request an acknowledgment of packet reception from the destination node

The page attributes on the receiver side are

• Receive enabled, which must be set to allow reception of messages addressed to a specific virtual page

• Interrupt on receive, which generates an interrupt on reception of a packet

• Receive enabled under error, which is asserted for error recovery communication pages

• Remote read, which identifies all packets that arrive at a page as requests for a remote read operation

• Conditional write, which identifies all packets that arrive at a page as conditional write packets

Figure 2: MEMORY CHANNEL Page Control Attributes
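To make the two attribute sets easier to picture, the sketch below renders them as per-page bit masks such as might live in a PCT entry. The bit positions and names are our assumptions; the paper does not specify an encoding.

```c
/* Illustrative PCT page attribute bits (positions and names assumed). */

/* Sender-side PCT entry */
enum mc_tx_attr {
    MC_TX_ENABLE      = 1u << 0,  /* allow stores to this page to transmit  */
    MC_TX_LOCAL_COPY  = 1u << 1,  /* ordered copy of packet to local memory */
    MC_TX_ACK_REQUEST = 1u << 2,  /* request acknowledgment of reception    */
};

/* Receiver-side PCT entry */
enum mc_rx_attr {
    MC_RX_ENABLE       = 1u << 0,  /* allow reception on this virtual page     */
    MC_RX_INTERRUPT    = 1u << 1,  /* interrupt the node on packet arrival     */
    MC_RX_ENABLE_ERROR = 1u << 2,  /* page stays usable for error recovery     */
    MC_RX_REMOTE_READ  = 1u << 3,  /* arriving packets are remote-read requests */
    MC_RX_COND_WRITE   = 1u << 4,  /* arriving packets are conditional writes  */
};
```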

The MEMORY CHANNEL communication paradigm is based on three fundamental ordering rules:

1. Single-sender rule: All destination nodes will receive packets in the order in which they were generated by the sender.

2. Multisender rule: Packets from multiple sender nodes will be received in the same order at all destination nodes.

3. Ordering-under-errors rule: Rules 1 and 2 must apply even when an error occurs in the network.

Let Pj(w,x) denote the jth point-to-point packet from sender node w to destination node x. When multiple senders transmit to destination nodes X and Y, there is a finite set of valid reception orders at X and Y, depending on the actual arrival times; however, messages destined to both receivers must be received by both in the same order. For example, if X receives B2 before B1, then Y must also receive B2 before B1, so that a single arrival order is congruent with the observations of both receivers.

These rules are independent of a particular interconnection topology or implementation and must be obeyed in all generations of the MEMORY CHANNEL network.

On the MEMORY CHANNEL network, error handling is a shared responsibility of the hardware and the cluster management software. The hardware provides real-time precise error handling and strict packet ordering by discarding all packets in a particular path that follow an erroneous one. The software is responsible for recovering the network from the faulty state back to its normal state and for retransmitting the lost packets.

Additional MEMORY CHANNEL Network Features

Three additional features of the MEMORY CHANNEL network make it ideal for cluster interconnection:

1. A hardware-based barrier acknowledge that sweeps the network and all its buffers

2. A fast, hardware-supported lock primitive

3. Node failure detection and isolation

Because of the three ordering rules, the MEMORY CHANNEL network acknowledge packets are implemented with little variation over ordinary packets. To request acknowledgment of packet reception, a node sends an ordinary packet marked with the request-acknowledge attribute. The packet is used to sweep clean the network queues in the sender-to-destination path and to ensure that all previously transmitted packets have reached the destination. In response to the reception of a MEMORY CHANNEL acknowledge request, the destination node transmits a MEMORY CHANNEL acknowledgment back to the originator. The arrival of the acknowledgment at the originating node signals that all preceding packets on that path have been successfully received.

MEMORY CHANNEL locks are implemented using a lock-acquire software data structure mapped as both incoming and outgoing by all nodes in the cluster. A node bids for a lock by writing to this structure; because packet ordering is guaranteed, the resulting view of the bids is the same for all nodes. The node can then determine if it was the only bidder for the lock, in which case the node has won the lock. If the node sees multiple bidders for the same lock, it resorts to an operating system-specific back-off-and-retry algorithm. Thanks to the MEMORY CHANNEL guaranteed packet ordering, even under error the above mechanism ensures that at most one node in the cluster perceives that it was the first to write the lock data structure. To guarantee that data structures are never locked indefinitely by a node that is removed from a cluster, the cluster manager software also monitors lock acquisition and release.

The MEMORY CHANNEL network supports a strong-consistency shared-memory model due to its strict packet ordering. In addition, the I/O operations used to access the MEMORY CHANNEL are fully integrated within the node's cache coherency scheme. Besides greatly simplifying the programming model, such consistency allows for an implementation of spinlocks that does not saturate the memory system. For instance, while a receiver is polling for a flag that signals the arrival of data from the MEMORY CHANNEL network, the node processor accesses only the locally cached copy of the flag, which will be updated whenever the corresponding main memory location is written by a MEMORY CHANNEL packet.

Unlike other networks, the MEMORY CHANNEL hardware maintains information on which nodes are currently part of the cluster. Through a collection of timeouts, the MEMORY CHANNEL hardware continuously monitors all nodes in the cluster for illegal behavior. When a failure is detected, the node is isolated from the cluster and recovery software is invoked.

A MEMORY CHANNEL cluster is equipped with software capable of reconfiguration when a node is added or removed from the cluster. The node is simply brought on-line or off-line, the event is broadcast to all other nodes, and operation continues. To survive a failure of the MEMORY CHANNEL network itself, the software switches over to a standby network, in a manner transparent to the application.

The First-generation MEMORY CHANNEL Network

The first generation of the MEMORY CHANNEL network consists of a node interface card and a concentrator or hub. The interface card, called an adapter, plugs into the PCI I/O bus. To send a packet, the CPU


