4.5. DISTRIBUTED-MEMORY OR GRAPH MODELS

Given the internal processor and memory structures in each node, a distributed-memory architecture is characterized primarily by the network used to interconnect the nodes (Fig. 4.5).

This network is usually represented as a graph, with vertices corresponding to processor–memory nodes and edges corresponding to communication links. If communication links are unidirectional, then directed edges are used. Undirected edges imply bidirectional communication, although not necessarily in both directions at once. Important parameters of an interconnection network include

1. Network diameter: the longest of the shortest paths between various pairs of nodes, which should be relatively small if network latency is to be minimized. The network diameter is more important with store-and-forward routing (when a message is stored in its entirety and retransmitted by intermediate nodes) than with wormhole routing (when a message is quickly relayed through a node in small pieces).

2. Bisection (band)width: the smallest number (total capacity) of links that need to be cut in order to divide the network into two subnetworks of half the size. This is important when nodes communicate with each other in a random fashion. A small bisection (band)width limits the rate of data transfer between the two halves of the network, thus affecting the performance of communication-intensive algorithms.

3. Vertex or node degree: the number of communication ports required of each node, which should be a constant independent of network size if the architecture is to be readily scalable to larger sizes. The node degree has a direct effect on the cost of each node, with the effect being more significant for parallel ports containing several wires or when the node is required to communicate over all of its ports at once.

Table 4.2 lists these three parameters for some of the commonly used interconnection networks. Do not worry if you know little about the networks listed in Table 4.2. They are there to give you an idea of the variability of these parameters across different networks (examples for some of these networks appear in Fig. 4.8).
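
To make these three parameters concrete, here is a minimal Python sketch (not from the book) that computes the diameter, bisection width, and node degree of a small network represented as an undirected adjacency-set graph. The 8-node ring used as the example and the brute-force bisection search (which is exponential and only suitable for tiny networks) are illustrative choices.

from collections import deque
from itertools import combinations

def ring(p):
    # Adjacency sets of a p-node ring (1D torus); each node has two neighbors.
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}

def bfs_distances(adj, src):
    # Shortest hop counts from src to every reachable node.
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    # Longest of the shortest paths over all node pairs.
    return max(max(bfs_distances(adj, s).values()) for s in adj)

def node_degree(adj):
    # Maximum number of communication ports required by any node.
    return max(len(neighbors) for neighbors in adj.values())

def bisection_width(adj):
    # Brute force over all half/half node splits (assumes an even node count);
    # exponential in network size, so for tiny illustrative networks only.
    nodes = sorted(adj)
    best = None
    for part in combinations(nodes, len(nodes) // 2):
        half = set(part)
        cut = sum(1 for u in half for v in adj[u] if v not in half)
        best = cut if best is None else min(best, cut)
    return best

net = ring(8)
print(diameter(net), bisection_width(net), node_degree(net))   # 4 2 2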

The list in Table 4.2 is by no means exhaustive. In fact, the multitude of interconnection networks, and claims with regard to their advantages over competing ones, have become quite confusing. The situation can be likened to a sea (Fig. 4.8). Once in a while (almost monthly over the past few years), a new network is dropped into the sea. Most of these make small waves and sink. Some produce bigger waves that tend to make people seasick! Hence, there have been suggestions that we should stop introducing new networks and instead focus on analyzing and better understanding the existing ones. A few have remained afloat and have been studied/analyzed to death (e.g., the hypercube).

Even though the distributed-memory architecture was introduced as a subclass of the MIMD class, machines based on networks of the type shown in Fig. 4.8 can be SIMD- or MIMD-type. In the SIMD variant, all processors obey the same instruction in each machine cycle, executing the operation that is broadcast to them on local data. For example, all processors in a 2D SIMD mesh might be directed to send data to their right neighbors and receive data from the left. In fact, the distributed-memory algorithms that we will study in Chapters 9–14 are primarily of the SIMD variety, as such algorithms are conceptually much simpler to develop, describe, and analyze.
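
As a small illustration of that example (not code from the book), the sketch below simulates one lock-step "send right, receive from the left" step on a 2D SIMD mesh, holding one data value per processor in a NumPy array; the grid contents and the fill value used at the left boundary are arbitrary choices.

import numpy as np

def simd_shift_right(local_data, fill=0):
    # One SIMD step on a 2D mesh: every processor sends its value to its right
    # neighbor and receives from its left neighbor; the leftmost column has no
    # left neighbor on a plain (non-torus) mesh, so it receives a fill value.
    received = np.empty_like(local_data)
    received[:, 0] = fill
    received[:, 1:] = local_data[:, :-1]
    return received

grid = np.arange(12).reshape(3, 4)   # a 3 x 4 mesh, one value per processor
print(simd_shift_right(grid))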


Table 4.2. Topological Parameters of Selected Interconnection Networks

Network name(s)          No. of nodes   Network diameter   Bisection width   Node degree   Local links?
1D mesh (linear array)   p              p − 1              1                 2             Yes
(remaining rows omitted)

Figure 4.8. The sea of interconnection networks.

The development of efficient parallel algorithms suffers from the proliferation of available interconnection networks, for algorithm design must be done virtually from scratch for each new architecture. It would be nice if we could abstract away the effects of the interconnection topology (just as we did with PRAM for global-memory machines) in order to free the algorithm designer from a lot of machine-specific details. Even though this is not completely possible, models that replace the topological information reflected in the interconnection graph with a small number of parameters do exist and have been shown to capture the effect of interconnection topology fairly accurately.

As an example of such abstract models, we briefly review the LogP model [Cull96]. In LogP, the communication architecture of a parallel computer is captured in four parameters:

L Latency upper bound when a small message (of a few words) is sent from an arbitrary source node to an arbitrary destination node

o The overhead, defined as the length of time when a processor is dedicated to the transmission or reception of a message, thus not being able to do any other computation

g The gap, defined as the minimum time that must elapse between consecutive message transmissions or receptions by a single processor (1/g is the available per-processor communication bandwidth)

P Processor multiplicity (p in our notation)

If LogP is in fact an accurate model for capturing the effects of communication in parallel processors, then the details of the interconnection network do not matter. All that is required, when each new network is developed or proposed, is to determine its four LogP parameters.

Software simulation can then be used to predict the performance of an actual machine that is based on the new architecture for a given application. On most early, and some currently used, parallel machines, the system software overhead (o) for message initiation or reception is so large that it dwarfs the hop-to-hop and transmission latencies by comparison. For such machines, not only the topology, but also the parameters L and g of the LogP model may be irrelevant to accurate performance prediction.
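
As an illustration of this kind of prediction, the sketch below (not from [Cull96]) combines the four parameters to estimate point-to-point message costs, under the common simplifying assumptions of small messages, no contention, and g ≥ o; the numeric parameter values are hypothetical.

def logp_one_message(L, o):
    # End-to-end time for one small message: send overhead, network latency,
    # receive overhead -- the usual 2o + L estimate.
    return 2 * o + L

def logp_n_messages(n, L, o, g):
    # Time until the last of n back-to-back small messages from one source is
    # absorbed by the destination, assuming g >= o and no contention: the sender
    # injects a message every g time units, and the last one still needs L + o.
    return o + (n - 1) * g + L + o

L, o, g, P = 6.0, 2.0, 4.0, 64       # hypothetical machine parameters
print(logp_one_message(L, o))        # 10.0
print(logp_n_messages(10, L, o, g))  # 46.0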

Even simpler is the bulk-synchronous parallel (BSP) model, which attempts to hide the communication latency altogether through a specific parallel programming style, thus making the network topology irrelevant [Vali90]. Synchronization of processors occurs once every L time steps, where L is a periodicity parameter. A parallel computation consists of a sequence of supersteps. In a given superstep, each processor performs a task consisting of local computation steps, message transmissions, and message receptions from other processors. Data received in messages will not be used in the current superstep but rather beginning with the next superstep. After each period of L time units, a global check is made to see if the current superstep has been completed. If so, then the processors move on to executing the next superstep. Otherwise, the next period of L time units is allocated to the unfinished superstep.
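
The following is a minimal sketch of this superstep structure (an illustration of the programming style, not the interface of any BSP library): messages produced in a superstep are buffered and delivered only at the barrier, so they become visible to receivers beginning with the next superstep. The four-processor ring relay driving it is an arbitrary example.

def bsp_run(num_procs, num_supersteps, step_fn):
    # Simulate BSP supersteps. step_fn(pid, superstep, inbox) returns a list of
    # (destination, message) pairs; all messages are delivered together at the
    # end-of-superstep barrier, so they appear in inboxes one superstep later.
    inboxes = [[] for _ in range(num_procs)]
    for s in range(num_supersteps):
        outgoing = [[] for _ in range(num_procs)]
        for pid in range(num_procs):
            for dest, msg in step_fn(pid, s, inboxes[pid]):
                outgoing[dest].append(msg)
        inboxes = outgoing           # barrier: deliver everything at once

def ring_relay(pid, s, inbox):
    # Each of the 4 processors forwards what it received to its right neighbor,
    # seeding the relay with its own id in the first superstep.
    print(f"superstep {s}, processor {pid}, received {inbox}")
    payload = inbox if inbox else [pid]
    return [((pid + 1) % 4, m) for m in payload]

bsp_run(num_procs=4, num_supersteps=3, step_fn=ring_relay)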

A final observation: Whereas direct interconnection networks of the types shown in Table 4.2 or Fig. 4.8 have led to many important classes of parallel processors, bus-based architectures still dominate the small-scale parallel machines. Because a single bus can quickly become a performance bottleneck as the number of processors increases, a variety of multiple-bus architectures and hierarchical schemes (Fig. 4.9) are available for reducing bus traffic by taking advantage of the locality of communication within small clusters of processors.
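
As a rough back-of-the-envelope illustration (not from the book) of why a single bus saturates and why locality helps, the sketch below models bus utilization as the product of processor count, per-processor transaction rate, and bus occupancy per transaction, with a hierarchical variant in which a fraction of the traffic stays on a local cluster bus; all numbers are hypothetical.

def bus_utilization(p, rate, occupancy):
    # Fraction of time a single shared bus is busy when p processors each issue
    # `rate` transactions per second, each holding the bus for `occupancy` seconds.
    # Values approaching 1 mean the bus has become the bottleneck.
    return p * rate * occupancy

def global_bus_utilization(p, rate, occupancy, locality):
    # Two-level hierarchy: a fraction `locality` of each processor's traffic is
    # satisfied on its cluster bus, so only the rest reaches the shared global bus.
    return p * rate * occupancy * (1 - locality)

p, rate, occupancy = 64, 1e5, 1e-7   # 64 processors, 100,000 transactions/s, 100 ns each
print(bus_utilization(p, rate, occupancy))              # about 0.64: nearly saturated
print(global_bus_utilization(p, rate, occupancy, 0.9))  # about 0.064: 90% locality relieves the global bus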


Figure 4.9. Example of a hierarchical interconnection architecture.
