



1.5 Survey of Parallel Systems Donna Quammen

1.5.1 Introduction

Computers have long been considered ‘‘a solution looking for a problem,’’ but because of limits established by complexity theory and limits on computing power, some of the problems presented could not be solved. Multimedia problems, image processing and recognition, AI applications, and weather prediction may not be accomplished unless processing power is increased. There are many varieties of parallel machines, each with the same goal: to complete a task quickly and inexpensively. Modern physics has continually increased the speed and capacity of the media on which modern computer chips are housed, usually VLSI, and at the same time decreased the price. The challenge for computer engineers is to use this media effectively.

Different components may be addressed to accomplish this, such as, but not limited to:

. Functionality of the processors: floating point, integer, or high-level functions, etc.

. Topology of the network which interconnects the processors

. Instruction scheduling

. Position and capability of any master control units that direct the processors

. Memory address space

. Input/output features

. Compilers and operating systems support to make the parallel system accessible

. Application’s suitability to a particular parallel system

. Algorithms to implement the applications

As can be imagined, there is an assortment of choices for each of these components, which allows for a large variety of parallel systems. More choices and variations are continually being developed to utilize the increased capacity of the underlying media.

Mike Flynn, in 1972 [Flynn72], developed a classification for various parallel systems, which has remained authoritative. It is based on the number of instruction streams and the number of data streams active in one cycle. A sequential machine is considered to have a single instruction stream executing on a single data stream; this is called SISD. An SIMD machine has a single instruction stream executing on multiple data streams in the same cycle. MIMD has multiple instruction streams executing on multiple data streams simultaneously. All are shown in Fig. 1.17. MISD is not shown but is considered to be a systolic array.

Four categories of MIMD systems, dataflow, multithreaded, out-of-order execution, and very long instruction word (VLIW), are of particular interest and seem to be the tendency for the future. These categories can be applied to a single CPU, providing parallelism by having multiple functional units. All four attempt to use fine-grain parallelism to maximize the number of instructions that may be executing in the same cycle. They also use fine-grain parallelism to reclaim cycles that would otherwise be lost to large latency in the execution of an instruction. Latency increases when the execution of one instruction is temporarily stalled while waiting for some resource currently not available, such as the result of a cache miss, or even a cache fetch, the result of a floating-point instruction (which takes longer than a simpler instruction), or the availability of a needed functional unit. This can delay the execution of other instructions. With very fine-grain parallelism, other instructions can use available resources while the stalled instruction waits. This is one area where much computing power has been reclaimed.

Two other compelling issues exist in parallel systems: portability, meaning that once a program has been developed it should not need to be recoded to run efficiently on a parallel system, and scalability, meaning that the performance of a system should increase in proportion to the size of the system. The latter is problematic, since unexpected bottlenecks occur as more processors are added to many parallel systems.

1.5.2 Single Instruction Multiple Data (SIMD)

Perhaps the simplest parallel system to describe is an SIMD machine. As the name implies, all processors execute the same instruction at the same time. There is one master control unit, which issues instructions, and typically each processor also has its own local memory unit. SIMD processors are fun to envision as a school of fish that travel closely together, always in the same direction; once one turns, they all turn. In most systems, processors communicate only with a set of nearest neighbors; grids, hypercubes, or toruses are popular. In the most generic system, shown in Fig. 1.17b, no set communication pattern is dictated. Because different algorithms do better on different physical topologies (algorithms for sorting do well on tree structures, but array arithmetic does well on grids), reconfigurable networks are ideal but hard to actually implement. A variety of structures can be built on top of grids if the programmer is resourceful, but some processing power will be lost, and it is difficult for a compiler to decide which substructure would be optimal to use. A mixture of close connections supplemented with longer connections seems to be most advantageous. MasPar [MasPar91] has SIMD capability and uses a grid structure supplemented with longer connections. The Thinking Machines CM-5 has SIMD capability using a fat-tree network, also a mix of closer and longer connections. The PEC network also has this quality [Kirkman91, Quammen96].

Of course, SIMD machines have limits. Logical arrays may not match the physical topology of the processors, requiring folding or skewing. If statements may cause different paths to be followed on different processors, and since it is necessary to always maintain synchronization, some processing power will be lost. Masks are used to inhibit issued instructions on processors on which they should not be executed. A single control unit becomes a bottleneck as an SIMD system expands. If an SIMD system were very large, it would be desirable to use it as a multiprogrammed machine where different programs would be allocated different ‘‘farms’’ of processors for their use as a dedicated array. A large SIMD system should be sub-dividable for disjoint multiuser applications; the operating system would have to handle this allocation.
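
To make masking concrete, below is a minimal sketch in NumPy (the language choice and the array values are this example's assumptions, not any SIMD machine's notation). Both arms of a data-parallel if are issued to every element, and the mask inhibits the stores that should not take effect:

    import numpy as np

    b = np.array([4.0, -2.0, 9.0, -7.0])
    mask = b > 0                      # one comparison issued to all "processors"
    result = np.empty_like(b)
    result[mask] = np.sqrt(b[mask])   # then-path, only where the mask is true
    result[~mask] = 0.0               # else-path, only where the mask is false
    print(result)                     # [2. 0. 3. 0.]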

On an SIMD computer, a loop such as the one below can be translated to one SIMD instruction, as shown to its right. The form A(1:N) means array A, indexes 1 to N:

FIGURE 1.17 (a) SISD uniprocessor architecture. (b) General SIMD with distributed memory. (c) Shared memory MIMD.

for I = 1 to N do
  A(I) = B(I) + C(I);          A(1:N) = B(1:N) + C(1:N);
endfor;

The code below would take four steps:

F(0) = 0;                      F(0) = 0;
for I = 1 to N do
  A(I) = B(I) / C(I);          A(1:N) = B(1:N) / C(1:N);
  D(I) = A(I) * E(I);          D(1:N) = A(1:N) * E(1:N);
  F(I) = D(I) + F(I-1);        F(1:N) = D(1:N) + F(0:N-1);
endfor;

Compilers can identify loops such as the ones presented above [Wolfe91]. However, many loops are not capable of executing in an SIMD fashion because of reverse dependencies, such as the F(I) = D(I) + F(I-1) recurrence above. Languages have been developed for SIMD programming that allow the programmer to specify data locations and communication patterns.
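
The difference between the two kinds of loop can be seen in executable form. The sketch below (NumPy, with made-up data; an illustration rather than any compiler's output) shows that the independent loop collapses to one array statement, while the F recurrence must remain sequential, because each iteration reads a value produced by the previous one:

    import numpy as np

    N = 8
    B = np.linspace(1.0, 8.0, N)
    C = np.full(N, 2.0)
    E = np.arange(1.0, N + 1.0)

    A = B / C                 # independent iterations: one SIMD-style step
    D = A * E                 # likewise one step

    F = np.zeros(N + 1)       # the recurrence resists a single parallel step
    for i in range(1, N + 1):
        F[i] = D[i - 1] + F[i - 1]   # must wait for the previous iteration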

1.5.3 Multiple Instruction Multiple Data (MIMD)

Perhaps the easiest MIMD processor to envision is a shared memory multiprocessor such as shown in Fig. 1.17c. With this machine, all processors access the same memory bank, with the addition of local caches. This allows the processors to communicate by placing data in the shared memory. However, sharing data causes problems: data and cache coherence are of major concern. If one processor is altering data that another processor wishes to use, and the first processor is also holding the current updated value for this data in its cache, access to the stale value held in the shared memory must be guarded. This creates a need for locks and protocols to protect communal data. Inefficient algorithms to handle cache coherence can cause delays, or invalidate results. In addition, if more than one processor wishes to access the same locked memory location, a fairness issue arises as to which processor should be allowed first access after the location becomes unlocked [Hwang93]. Further delays in accessing the shared memory occur due to the use of a single bus. This arrangement is described as a uniform memory access (UMA) time approach, and it avoids worst-case communication scenarios possible in other memory arrangements.
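
The need for locks can be shown with a small sketch (Python threads standing in for processors; the counter and the counts are invented for the example). The lock plays the role of the protocol guarding communal data; without it, two read-modify-write sequences can interleave and lose an update:

    import threading

    counter = 0
    lock = threading.Lock()

    def deposit(times):
        global counter
        for _ in range(times):
            with lock:        # one "processor" at a time updates the shared word
                counter += 1

    threads = [threading.Thread(target=deposit, args=(100000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(counter)            # 400000; without the lock, updates can be lost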

To reduce contention on the bus as an MIMD memory system scales, a distributed memory organization can be used. Here, clusters of processors share common memories, and the clusters are connected to allow communication between clusters. This is called a NUMA (nonuniform memory access) organization [Gupta91]. If an MIMD machine is to be scalable, this approach must be used. Machines within the same cluster will be able to share data with less latency than machines housed on different memory banks, but it remains possible to access all data. This raises the question of which sets of data should be placed on which processor cluster. Compilers can help by co-locating code that uses common data. If data is poorly placed, the worst-case execution time can be devastating.

Message passing systems, such as the Transputer [May88], have no shared memory but handle communications using message passing. This can cause high latency while waiting for requested data; however, each processor can hold multiple threads and may be able to occupy itself while waiting for remote data. Deadlocks are a problem.
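
A minimal message-passing sketch (Python threads with a queue as the channel; the Transputer's actual primitives differ) shows the style: no shared variables, only sends and blocking receives, and the blocking receive is where the latency appears:

    import threading, queue

    channel = queue.Queue()

    def producer():
        for i in range(3):
            channel.put(i * i)     # send a message
        channel.put(None)          # end-of-stream marker

    def consumer():
        while True:
            msg = channel.get()    # receive; blocks, which is the latency
            if msg is None:
                break
            print("received", msg)

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start(); t1.join(); t2.join()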

Another variation of memory management is cache only memory access (COMA). Memory is distributed, but held only in cache [Saulsbury95]. The Kendall Square machine [KSR91] has this organization: distributed memory is held in the cache of each processor, and the processors are connected by a ring. The caches of remote processors are accessed using this ring.

1.5.4 Vector Machines

A vector machine creates a series of functional units and pumps a stream of data through the series. Each stage of the pipe stores its resulting data in a vector register, which will be read by the next stage. In this way the parallelism is equal to the number of stages in the pipeline. This is very efficient if the same functions are to be performed on a long stream of data. The Cray series of computers [Cray92] is famous for this technique. It is becoming popular to make an individual processor of an MIMD system a vector processor.
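
The staging can be sketched with Python generators (an analogy only; real vector hardware overlaps the stages in time on dedicated functional units). Each stage reads the previous stage's ‘‘vector register’’ one element at a time, so while element i is in the last stage, element i+1 is in the stage before it:

    def multiply(xs):              # stage 1: multiply unit
        for x in xs:
            yield x * 2.0          # write into this stage's vector register

    def add(xs):                   # stage 2: add unit
        for x in xs:
            yield x + 1.0          # read stage 1's register, write stage 2's

    for result in add(multiply([1.0, 2.0, 3.0, 4.0])):
        print(result)              # 3.0, 5.0, 7.0, 9.0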

1.5.5 Dataflow Machine

The von Neumann approach to computing has one control state in existence at any one time. A program counter is used to point to the single next instruction. This approach is used in traditional machines, and is also used in most of the single processors of the multiple processor systems described earlier. A completely different approach was developed at the Massachusetts Institute of Technology [Dennis91, Arvind90, Polychronopoulos89]. They realized that the maximum amount of parallelism can be realized if, at any one point, all instructions that are ready to execute are executed. An instruction is ready to execute if the data required for its complete execution is available. Therefore, execution of an instruction is not governed by the sequential order, but by its readiness to execute, that is, when both operands are available. A table is kept of the instructions that are about ready to execute, that is, one of the two operands needed for the assembly-language-level instruction is available. When the second operand is found, the instruction is executed. The result of the execution is passed to a control unit, which selects a set of new instructions to be about ready to execute, or marks an instruction as ready (because the second operand it needed has arrived).

This approach yields the maximum amount of parallelism. However, it runs into problems with ‘‘run-away execution’’: too many instructions may be about ready, and clog the system. It is a fascinating approach, and machines have been developed. It has the advantage that no, or very few, changes need to be made to old ‘‘dusty decks’’ to extract parallelism. Steps can be taken to avoid run-away execution.
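
The firing rule can be simulated in a few lines (a toy scheduler in Python; the instruction set and token values are invented). An instruction fires as soon as all of its operand tokens have arrived, regardless of the order in which it was written:

    instructions = {
        # name: (operation, names of the operands it waits for)
        "t1": (lambda a, b: a + b, ("x", "y")),
        "t2": (lambda a, b: a * b, ("t1", "z")),
    }
    available = {"x": 2, "y": 3, "z": 10}    # initial data tokens

    fired = True
    while fired:                             # keep firing until nothing is ready
        fired = False
        for name, (op, deps) in instructions.items():
            if name not in available and all(d in available for d in deps):
                available[name] = op(*(available[d] for d in deps))
                fired = True

    print(available["t2"])                   # 50: fired once t1 and z arrived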

1.5.6 Out of Order Execution Concept

An approach similar to the dataflow concept is called out-of-order execution [Cintra00]. Here again, program elements that are ready to execute may be executed. It has a big advantage when multiple functional units are available on the same CPU, but the functional units have different latency values.

The technique is not completely new; it is similar to issuing a load instruction, which has high latency, well before the result of the load is required, so that by the time the load completes, the code has reached the location where the result is used. Likewise, a floating-point instruction, again a class of instruction with high latency, is frequently started before integer instructions coded to execute first are executed; by the time the floating point completes, its results are ready to be used. The compiler can make this decision statically.

In out-of-order execution, the hardware has more of a role in the decision of what to execute. This may include both the then and the else parts of an if statement: both can be executed, but neither committed until the correct path is determined. This technique is also called speculative execution.

Any changes that have been made by a wrong path must be capable of being rolled back. Although this may seem to be extra computation, it decreases execution time if done well. Other areas of the program may also be executed, if it is determined that their execution will not affect the final result or can be rolled back. The temporary results may be kept in registers. The Alpha computer [Kessler99] as well as the Intel Pentium Pro [Intel97] use this technique. The method is becoming popular as a way to fully utilize increasingly powerful processors.

Compiler techniques can be used to try to determine which streams should be chosen for advance execution. If the wrong choice is made, there is a risk of extremely poor performance due to continual rollbacks. Branch prediction, either statically by the compiler or dynamically by means of architecture prediction flags, is a useful technique for increasing the number of instructions that may be beneficial to execute prematurely.

Since assembler instructions contain precise register addresses, set by the compiler, and it is unknown which assembler instructions will be caught in partial execution at the same time, a method called register renaming is used. A logical register address is mapped to a physical register chosen from a free list. The mapping is then used throughout the execution of the instruction, and the physical register is eventually released back to the free list.
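
A sketch of the bookkeeping (Python; the register names and free-list size are invented for illustration): each write to a logical register claims a fresh physical register, so an in-flight instruction still reading the old value is undisturbed, and the old register returns to the free list once nothing needs it:

    free_list = ["p0", "p1", "p2", "p3"]
    rename_map = {}                   # logical register -> current physical register

    def rename_dest(logical):
        phys = free_list.pop(0)       # claim a physical register from the free list
        old = rename_map.get(logical) # previous mapping, freed after commit
        rename_map[logical] = phys
        return phys, old

    print(rename_dest("r1"))          # ('p0', None)
    print(rename_dest("r1"))          # ('p1', 'p0'): a second write gets its own register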

1.5.7 Multithreading

Multithreading [Tullsen95] is another method to hide the latency endured by various instructions. More than one chain, or thread, of execution is active at any one point. The states of the different chains are saved simultaneously [Lo97, Miller90], each in its own state space. Modern programs are being written as collections of modules, either threads or objects. Because one of the main advantages of this form of programming is data modularization, many of these modules may be ready to execute concurrently.

While the processor is waiting for one thread's data (for example, on a cache miss or even a cache access), other threads, which have a full state in their dedicated space, can be executed. The compiler cannot determine which modules will be active at the same time; that has to be determined dynamically. The method is somewhat similar to the multiprogramming technique of changing context while waiting for I/O; however, it operates at a finer grain. Multiple access lines to memory are beneficial, since many threads may be waiting for I/O. The Tera machine [Smith90] is the prime example of this technique. This approach should help lead to teraflops performance.
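
The latency hiding is easy to demonstrate with threads (a Python sketch; the sleep stands in for a cache miss or I/O wait). Three threads that would stall for 0.3 s in total finish in roughly the longest single wait, because another thread runs while one is stalled:

    import threading, time

    def worker(name, wait):
        time.sleep(wait)                       # this thread stalls; others run
        print(name, "resumed after", wait, "s")

    threads = [threading.Thread(target=worker, args=("thread-%d" % i, 0.1 * i))
               for i in range(3)]
    start = time.time()
    for t in threads: t.start()
    for t in threads: t.join()
    print("elapsed about", round(time.time() - start, 1), "s")   # ~0.2, not 0.3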

1.5.8 Very Long Instruction Word (VLIW)

A VLIW machine issues an instruction to multiple functional units in parallel. If the compiler can find one operation for each of the functional units internal to a processor (these operations are usually RISC-like) that are able to execute at the same time (that is, the data for their execution are statically determined to be available in registers), and none of the operations depends on another operation issued in the same cycle, then they can execute in parallel, all issued as one long instruction. The name VLIW comes from the need for the instruction to be long enough to hold multiple operation codes, one for each functional unit to be used, along with the register identifiers they need. Unlike the three methods described previously, the compiler is responsible for finding operations that do not interfere with each other and for assigning the registers for these operations [Rau93, Park97]. The compiler packs these operations statically into the VLIW. The Intel iWarp can control a floating point multiplier, floating point adder, integer adder, memory loader, increment unit, and condition tester on each cycle [Cohn89]. The instruction is 96 bits long and can, with compiler support, execute nine instructions at once. The code for the following loop can be turned into a loop of just one VLIW instruction, as opposed to a loop of at least nine RISC-size instructions.

for I := 0 to N-1 do A(2*I) := C + B(I)*D;

It is difficult for a compiler to find instructions that can fill all the fields statically, so frequently some of the functional units go unoccupied. There are many techniques to find qualified instructions, and frequently the long instruction can be filled. One technique is to mine separate threads [Lam88, Bakewell91]; another successful technique ties together several basic blocks into a hyperblock. The hyperblock has one entrance but may have several exits. This creates a long stream of code, which would normally be executed sequentially, and allows the compiler to choose instructions over this larger range.

Roll-back code must be added at the hyperblock's exits to undo the effects of superfluous code that was executed but would not have executed sequentially. Loops are frequently unrolled, with several iterations considered as straight-line code, to form hyperblocks. Branch prediction can also help create beneficial hyperblocks.

A newer approach is to pack the VLIW dynamically, using a preprocessor that accesses multiple queues, one for each functional unit. Notice that using queues is similar to out-of-order execution.
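
A greedy static packer can be sketched in a few lines (Python; the operation list, slot names, and dependency rule are simplified assumptions, not any real compiler's algorithm). An operation joins the current long word unless its functional-unit slot is taken or it reads a result written earlier in the same word:

    ops = [  # (destination, sources, functional-unit slot)
        ("r1", ("a", "b"), "int_add"),
        ("r2", ("c", "d"), "fp_mul"),
        ("r3", ("r1", "e"), "int_add"),   # needs r1 and the int_add slot: next word
    ]

    words, current, written = [], {}, set()
    for dest, srcs, slot in ops:
        if slot in current or any(s in written for s in srcs):
            words.append(current)          # close the long word, start another
            current, written = {}, set()
        current[slot] = (dest, srcs)
        written.add(dest)
    words.append(current)
    print(len(words), "long words:", words)  # 2 words: two ops packed, then one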

1.5.9 Interconnection Network

An interconnection network is a necessary component of all parallel processing systems. Several features govern the choice of a network. A scalable interconnection network for a parallel processor would be ideal if it met the following requirements over a large range of system sizes; for instance, it may need to scale by reasonably small increments from 2^4 to perhaps 2^20 processors.

. Have a low average distance and a low maximum diameter (the distance between the two furthest nodes) to avoid communication latency.

. Minimize routing constraints (have many routes from A to B).

. Have a constant number of I/O ports (channels) per node to allow for expansion without retrofit.

. Have a simple wire layout to allow for expansion and to avoid wasting VLSI space.

. Be inherently fault tolerant.

. Be sub-dividable for disjoint multiuser applications.

. Be able to handle a large range of algorithms without undue overhead.

The most popular parallel networks (hypercube, quad tree, fat-tree, binary tree, mesh, and torus) fail in one or more of these respects.

Meshes have a major disadvantage: they lack support for long-distance connections. Hypercubes have excellent connectivity, guaranteeing a maximum distance between any two nodes of log N, where N is the number of nodes. Also, many paths exist between any two nodes, making the hypercube fault tolerant and amenable to low contention. However, the number of I/O ports per node is log N; as a system scales, each node would need to be retrofit with additional ports. In addition, the wire layout is complex.

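
The trade-off can be made concrete with the formulas above (a Python sketch; the mesh figure assumes a square two-dimensional mesh without wraparound, which is this example's assumption rather than a statement from the text):

    import math

    for n in [2**4, 2**10, 2**20]:
        hyper = int(math.log2(n))          # hypercube: diameter and ports are log N
        side = int(round(n ** 0.5))
        mesh_diam = 2 * (side - 1)         # square mesh: nearest-neighbor links only
        print("N=%8d  hypercube diam/ports=%2d   mesh diam=%4d (4 ports)"
              % (n, hyper, mesh_diam))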
