
4.3. GLOBAL VERSUS DISTRIBUTED MEMORY

Within the MIMD class of parallel processors, memory can be global or distributed.


Figure 4.3. A parallel processor with global memory.

Global memory may be visualized as being in a central location where all processors can access it with equal ease (or with equal difficulty, if you are a half-empty-glass type of person). Figure 4.3 shows a possible hardware organization for a global-memory parallel processor. Processors can access memory through a special processor-to-memory network.

As access to memory is quite frequent, the interconnection network must have very low latency (quite a difficult design challenge for more than a few processors) or else memory-latency-hiding techniques must be employed. An example of such methods is the use of multithreading in the processors so that they continue with useful processing functions while they wait for pending memory access requests to be serviced. In either case, very high network bandwidth is a must. An optional processor-to-processor network may be used for coordination and synchronization purposes.
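To make the latency-hiding idea concrete, here is a minimal Python sketch (not from the book): a processor issues several memory requests, then performs useful computation while the requests are in flight. The thread-per-request model and the MEM_LATENCY constant are illustrative assumptions.

```python
import threading
import time

MEM_LATENCY = 0.05   # simulated memory latency in seconds (assumed value)

def fetch(addr, out):
    """Simulate a memory module servicing a request after some latency."""
    time.sleep(MEM_LATENCY)
    out[addr] = addr * 2                # pretend this is the data at 'addr'

def processor(requests):
    """Issue all requests first, then do useful work while they are pending."""
    out = {}
    workers = [threading.Thread(target=fetch, args=(a, out)) for a in requests]
    for w in workers:
        w.start()                       # requests are now in flight
    busy = sum(i * i for i in range(10**5))   # useful work overlapping the wait
    for w in workers:
        w.join()                        # pay only whatever latency remains
    return out, busy

if __name__ == "__main__":
    t0 = time.perf_counter()
    data, _ = processor(range(8))
    print(f"8 fetches overlapped with compute: {time.perf_counter() - t0:.3f}s")
```

The total time approaches a single memory latency rather than eight, which is the whole point of multithreaded latency hiding.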

A global-memory multiprocessor is characterized by the type and number p of processors, the capacity and number m of memory modules, and the network architecture. Even though p and m are independent parameters, achieving high performance typically requires that they be comparable in magnitude (e.g., too few memory modules will cause contention among the processors and too many would complicate the network design).
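The contention effect of making m too small can be illustrated with a toy simulation (a sketch under simplifying assumptions, not a result from the book): each of p processors addresses a uniformly random one of m modules every cycle, and each module is assumed to service at most one request per cycle.

```python
import random

def contention_rate(p, m, cycles=10_000, seed=1):
    """Fraction of requests stalled when p processors each address a random
    one of m memory modules per cycle, and a module can service only one
    request per cycle (a simplifying assumption)."""
    rng = random.Random(seed)
    stalled = 0
    for _ in range(cycles):
        hits = [0] * m
        for _ in range(p):
            hits[rng.randrange(m)] += 1
        stalled += sum(h - 1 for h in hits if h > 1)   # losers of each conflict
    return stalled / (p * cycles)

if __name__ == "__main__":
    for m in (4, 16, 64, 256):
        print(f"p=16, m={m:3d}: {contention_rate(16, m):5.1%} of requests stalled")
```

With m well below p, a large fraction of requests stall every cycle; with m comparable to or above p, contention quickly becomes negligible.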

Examples for both the processor-to-memory and processor-to-processor networks include

1. Crossbar switch; O(pm) complexity, and thus quite costly for highly parallel systems

2. Single or multiple buses (the latter with complete or partial connectivity)

3. Multistage interconnection network (MIN); cheaper than Example 1, more bandwidth than Example 2 (a rough cost comparison appears after this list)
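To give a feel for the cost gap between items 1 and 3, the sketch below compares crosspoint counts, assuming p = m is a power of 2 and an omega-style MIN built from (p/2) log2 p two-by-two switches, counted at 4 crosspoints each (a common accounting convention; the exact constants vary by MIN design).

```python
from math import log2

def crossbar_crosspoints(p, m):
    """A p-by-m crossbar needs one crosspoint per (processor, module) pair."""
    return p * m

def omega_crosspoints(p):
    """Omega-style MIN on p = m = 2^q ports: (p/2) * log2(p) switches of
    size 2x2, counted here at 4 crosspoints each (assumed convention)."""
    assert p > 1 and p & (p - 1) == 0, "p must be a power of 2"
    return 4 * (p // 2) * int(log2(p))

if __name__ == "__main__":
    print(f"{'p=m':>6} {'crossbar':>10} {'omega MIN':>10}")
    for p in (16, 64, 256, 1024):
        print(f"{p:6d} {crossbar_crosspoints(p, p):10d} {omega_crosspoints(p):10d}")
```

The crossbar's cost grows as p squared while the MIN's grows as p log p, which is why crossbars are confined to systems with few processors.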

The type of interconnection network used affects the way in which efficient algorithms are developed. In order to free the programmers from such tedious considerations, an abstract model of global-memory computers, known as PRAM, has been defined (see Section 4.4).

One approach to reducing the amount of data that must pass through the processor-to-memory interconnection network is to use a private cache memory of reasonable size within each processor (Fig. 4.4). The reason that using cache memories reduces the traffic through the network is the same here as for conventional processors: locality of data access, repeated access to the same data, and the greater efficiency of block, as opposed to word-at-a-time, data transfers. However, the use of multiple caches gives rise to the cache coherence problem: multiple copies of data in the main memory and in various caches may become inconsistent.

Figure 4.4. A parallel processor with global memory and processor caches.

With a single cache, the write-through policy can keep the two data copies consistent. Here, we need a more sophisticated approach, examples of which include

1. Do not cache shared data at all or allow only a single cache copy. If the volume of shared data is small and access to it infrequent, these policies work quite well.

2. Do not cache “writeable” shared data or allow only a single cache copy. Read-only shared data can be placed in multiple caches with no complication.

3. Use a cache coherence protocol. This approach may introduce a nontrivial consistency enforcement overhead, depending on the coherence protocol used, but removes the above restrictions. Examples include snoopy caches for bus-based systems (each cache monitors all data transfers on the bus to see if the validity of the data it is holding will be affected) and directory-based schemes (where writeable shared data are “owned” by a single processor or cache at any given time, with a directory used to determine physical data locations); a toy snoopy-protocol sketch follows this list. See Sections 18.1 and 18.2 for more detail.
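As a rough illustration of the snoopy idea in item 3, here is a toy write-invalidate sketch in Python (far simpler than real protocols such as MESI, and all class names are hypothetical): every cache snoops a shared bus and invalidates its own copy when another cache writes the same address.

```python
class Bus:
    """Toy broadcast medium that every cache snoops (write-invalidate)."""
    def __init__(self):
        self.caches = []

    def broadcast_write(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.snoop_invalidate(addr)

class Cache:
    def __init__(self, bus, memory):
        self.lines, self.bus, self.mem = {}, bus, memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:          # miss: fill from main memory
            self.lines[addr] = self.mem[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.mem[addr] = value              # write-through, for simplicity
        self.bus.broadcast_write(self, addr)  # others must invalidate

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)

if __name__ == "__main__":
    mem, bus = {0: 1}, Bus()
    a, b = Cache(bus, mem), Cache(bus, mem)
    a.read(0); b.read(0)    # both caches now hold a copy of location 0
    a.write(0, 42)          # b's copy is invalidated via the bus
    print(b.read(0))        # b misses and refetches: prints 42, not stale 1
```

Without the snoop step, cache b would keep returning the stale value 1, which is precisely the inconsistency the coherence protocol exists to prevent.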

Distributed-memory architectures can be conceptually viewed as in Fig. 4.5. A collection of p processors, each with its own private memory, communicates through an interconnection network. Here, the latency of the interconnection network may be less critical, as each processor is likely to access its own local memory most of the time. However, the communication bandwidth of the network may or may not be critical, depending on the type of parallel applications and the extent of task interdependencies. Note that each processor is usually connected to the network through multiple links or channels (this is the norm here, although it can also be the case for shared-memory parallel processors).
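A minimal sketch of the distributed-memory programming style, using Python's multiprocessing module as a stand-in for the interconnection network (an illustrative choice, not the book's): each process owns private data and shares results only through explicit messages.

```python
from multiprocessing import Process, Pipe

def worker(rank, conn):
    """Each process owns private data; sharing happens only by messages."""
    local = [rank * 10 + i for i in range(3)]   # private 'local memory'
    conn.send(sum(local))                       # explicit communication
    conn.close()

if __name__ == "__main__":
    parents, procs = [], []
    for rank in range(4):
        parent, child = Pipe()                  # one channel per processor
        p = Process(target=worker, args=(rank, child))
        p.start()
        parents.append(parent)
        procs.append(p)
    print("partial sums:", [c.recv() for c in parents])
    for p in procs:
        p.join()
```

No process can read another's `local` list directly; anything shared must cross the (simulated) network, which is the defining property of this architecture class.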

Figure 4.5. A parallel processor with distributed memory.

In addition to the types of interconnection networks enumerated for shared-memory parallel processors, distributed-memory MIMD architectures can also be interconnected by a variety of direct networks, so called because the processor channels are directly connected to their counterparts in other processors according to some interconnection pattern or topology. Examples of direct networks will be introduced in Section 4.5.
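As a small preview of one classic direct-network topology, the binary hypercube, the sketch below enumerates a node's channels: in a q-dimensional hypercube, each processor connects directly to the q nodes whose addresses differ from its own in exactly one bit.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of 'node' in a dim-dimensional binary hypercube: flipping
    one address bit per link gives each processor exactly dim channels."""
    return [node ^ (1 << k) for k in range(dim)]

if __name__ == "__main__":
    for n in range(8):          # 3-cube: 8 nodes, 3 links per node
        nbrs = [f"{v:03b}" for v in hypercube_neighbors(n, 3)]
        print(f"node {n:03b} -> {nbrs}")
```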

Because access to data stored in remote memory modules (those associated with other processors) involves considerably more latency than access to the processor’s local memory, distributed-memory MIMD machines are sometimes described as nonuniform memory access (NUMA) architectures. Contrast this with the uniform memory access (UMA) property of global-memory machines. In a UMA architecture, distribution of data in memory is relevant only to the extent that it affects the ability to access the required data in parallel, whereas in NUMA architectures, inattention to data and task partitioning among the processors may have dire consequences. When coarse-grained tasks are allocated to the various processors, load-balancing (in the initial assignment or dynamically as the computations unfold) is also of some importance.
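The NUMA penalty can be quantified with a simple two-level cost model (the 1:10 local-to-remote latency ratio is an assumed, illustrative value, not data from the book): if a fraction f of accesses go to remote memory, the mean access time is (1 − f) · t_local + f · t_remote.

```python
def avg_access_time(remote_fraction, t_local=1.0, t_remote=10.0):
    """Mean memory access time under a two-level NUMA cost model;
    the 1:10 local-to-remote ratio is an assumed, illustrative value."""
    return (1 - remote_fraction) * t_local + remote_fraction * t_remote

if __name__ == "__main__":
    for f in (0.01, 0.05, 0.20, 0.50):
        print(f"{f:4.0%} remote accesses -> mean latency "
              f"{avg_access_time(f):5.2f}x local")
```

Even 20% remote accesses nearly triples the mean latency in this model, which is why careful data and task partitioning matters so much on NUMA machines.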

It is possible to view Fig. 4.5 as a special case of Fig. 4.4 in which the global-memory modules have been removed altogether; the fact that processors and (cache) memories appear in different orders is immaterial. This has led to the name all-cache or cache-only memory architecture (COMA) for such machines.
