

In the document Introduction to Parallel Processing (Page 59-64)

A Taste of Parallel Algorithms

2.4. ALGORITHMS FOR A BINARY TREE

In algorithms for a binary tree of processors, we will assume that the data elements are initially held by the leaf processors only. The nonleaf (inner) processors participate in the computation, but do not hold data elements of their own. This simplifying assumption, which can be easily relaxed, leads to simpler algorithms. As roughly half of the tree nodes are leaf nodes, the inefficiency resulting from this assumption is not very great.

Semigroup Computation. A binary-tree architecture is ideally suited for this computation (for this reason, semigroup computation is sometimes referred to as tree computation).

Each inner node receives two values from its children (if each of them has already computed a value or is a leaf node), applies the operator to them, and passes the result upward to its parent. After log₂ p steps, the root processor will have the computation result. All processors can then be notified of the result through a broadcasting operation from the root. Total time: 2 log₂ p steps.

Parallel Prefix Computation. Again, this is quite simple and can be done optimally in 2 log₂ p steps (recall that the diameter of a binary tree is 2 log₂ p or 2 log₂ p – 1). The algorithm consists of an upward propagation phase followed by a downward data movement phase.

As shown in Fig. 2.11, the upward propagation phase is identical to the upward movement of data in semigroup computation. At the end of this phase, each node will have the semigroup computation result for its subtree. The downward phase is as follows. Each processor remembers the value it received from its left child. On receiving a value from the parent, a node passes the value received from above to its left child and the combination of this value and the one that came from the left child to its right child. The root is viewed as receiving the identity element from above and thus initiates the downward phase by sending the identity element to the left and the value received from its left child to the right. At the end of the downward phase, the leaf processors compute their respective results.
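The two phases can be sketched as a sequential Python simulation (an illustration of the data flow, not parallel code; op and identity are parameters so any semigroup operation with an identity element may be used):

```python
def tree_prefix(leaves, op=lambda a, b: a + b, identity=0):
    # Upward phase: each subtree computes its semigroup result.
    def subtree(lo, hi):
        if hi - lo == 1:
            return leaves[lo]
        mid = (lo + hi) // 2
        return op(subtree(lo, mid), subtree(mid, hi))

    out = [None] * len(leaves)
    # Downward phase: a node passes the value received from above to its
    # left child, and that value combined with the remembered
    # left-subtree result to its right child.
    def down(lo, hi, from_above):
        if hi - lo == 1:
            out[lo] = op(from_above, leaves[lo])   # prefix result at the leaf
            return
        mid = (lo + hi) // 2
        down(lo, mid, from_above)
        down(mid, hi, op(from_above, subtree(lo, mid)))

    down(0, len(leaves), identity)         # root receives the identity
    return out
```

For example, tree_prefix([3, 1, 4, 1, 5, 9, 2, 6]) yields the running sums [3, 4, 8, 9, 14, 23, 25, 31].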

It is instructive to look at some applications of the parallel prefix computation at this point. Given a list of 0s and 1s, the rank of each 1 in the list (its relative position among the 1s) can be determined by a prefix sum computation:

Figure 2.11. Parallel prefix computation on a binary tree of processors.


Data:          0 0 1 0 1 0 0 1 1 1 0
Prefix sums:   0 0 1 1 2 2 2 3 4 5 5
Ranks of 1s:       1   2     3 4 5
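This rank computation is a direct application of prefix sums; a small Python illustration:

```python
from itertools import accumulate

def ranks_of_ones(bits):
    # Inclusive prefix sums give each 1 its rank among the 1s.
    sums = list(accumulate(bits))
    return sums, [s for b, s in zip(bits, sums) if b == 1]

data = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]
sums, ranks = ranks_of_ones(data)
# sums  == [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 5]
# ranks == [1, 2, 3, 4, 5]
```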

A priority circuit has a list of 0s and 1s as its inputs and picks the first (highest-priority) 1 in the list. The function of a priority circuit can be defined as

Data:                            0 0 1 0 1 0 0 1 1 1 0
Diminished prefix logical ORs:   0 0 0 1 1 1 1 1 1 1 1
Complement:                      1 1 1 0 0 0 0 0 0 0 0
AND with data:                   0 0 1 0 0 0 0 0 0 0 0
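The priority function can likewise be expressed with a diminished (exclusive) prefix OR; a Python sketch of the three steps above:

```python
from itertools import accumulate

def priority(bits):
    # Diminished (exclusive) prefix OR: OR of all strictly earlier bits.
    dim_or = [0] + list(accumulate(bits[:-1], lambda a, b: a | b))
    # Complement it and AND with the data: only the first 1 survives.
    return [b & (1 - d) for b, d in zip(bits, dim_or)]

data = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0]
# priority(data) == [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```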

As a final example, the carry computation in the design of adders can be formulated as a parallel prefix computation in the following way. Let "g," "p," and "a" denote the event that a particular digit position in the adder generates, propagates, or annihilates a carry. For a decimal adder, e.g., these correspond to the digit sums being greater than 9, equal to 9, and less than 9, respectively. Therefore, the input data for the carry circuit consists of a vector of elements, each taking one of the three values g, p, or a.

Final carries into the various positions can be determined by a parallel prefix computation using the carry operator "¢" defined as follows (view x ∈ {g, p, a} as the incoming carry into a position):

p ¢ x = x    {x propagates over p}
a ¢ x = a    {x is annihilated or absorbed by a}
g ¢ x = g    {x is immaterial because a carry is generated}
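Applying the carry operator as a left-to-right scan over the digit-position signals yields the carry into each position. The following Python sketch uses a hypothetical g/p/a signal vector (the book's own example vector is not reproduced here); the carry into position 0 is taken to be a (no incoming carry):

```python
from itertools import accumulate

def carry_op(s, x):
    # s ¢ x: a propagating position passes the incoming carry x
    # through; generate (g) and annihilate (a) override it.
    return x if s == 'p' else s

signals = ['g', 'p', 'p', 'a', 'p']        # hypothetical example vector
carries = list(accumulate(signals, lambda x, s: carry_op(s, x), initial='a'))
# carries[i] is the carry into position i ('g' = carry 1, 'a' = no
# carry); the final entry is the carry out of the whole adder.
```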

In fact, if each node in the two trees of Fig. 2.11 is replaced by a logic circuit corresponding to the carry operator, a five-digit carry-lookahead circuit would result.

Packet Routing. The algorithm for routing a packet of information from Processor i to Processor j on a binary tree of processors depends on the processor numbering scheme used. The processor numbering scheme shown in Fig. 2.3 is not the best one for this purpose, but it will be used here to develop a routing algorithm. The indexing scheme of Fig. 2.3 is known as "preorder" indexing and has the following recursive definition: nodes in a subtree are numbered by first numbering the root node, then its left subtree, and finally its right subtree. Thus the index of each node is less than the indices of all of its descendants. We assume that each node, in addition to being aware of its own index (self) in the tree, which is the smallest in its subtree, knows the largest node index in its left (maxl) and right (maxr) subtrees. A packet on its way from node i to node dest, and currently residing in node self, is routed according to the following algorithm.

if dest = self
then remove the packet {done}
else if dest < self or dest > maxr
  then route upward
  else if dest ≤ maxl
    then route leftward
    else route rightward
  endif
endif
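The routing decision at a single node can be sketched as follows; the dict-based node representation is an illustration only, and the left/right cases follow from the preorder property that a node's index is the smallest in its subtree:

```python
def route(node, dest):
    # One routing decision at a node of a preorder-indexed binary tree.
    # `node` carries its own index (self), the largest index in its
    # left subtree (maxl), and in its right subtree (maxr).
    if dest == node['self']:
        return 'remove'                    # packet has arrived {done}
    if dest < node['self'] or dest > node['maxr']:
        return 'up'                        # destination is outside this subtree
    if dest <= node['maxl']:
        return 'left'
    return 'right'

# A 7-node complete tree in preorder: root 0, left subtree {1, 2, 3},
# right subtree {4, 5, 6}. A packet for node 5 goes right at the root;
# at node 1 (maxl 2, maxr 3) the same packet would be routed upward.
root = {'self': 0, 'maxl': 3, 'maxr': 6}
```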

This algorithm does not make any assumption about the tree except that it is a binary tree; in particular, the tree need not be complete or even balanced.

Broadcasting. Processor i sends the desired data upwards to the root processor, which then broadcasts the data downwards to all processors.

Sorting. We can use an algorithm similar to bubblesort that allows the smaller elements in the leaves to "bubble up" to the root processor first, thus allowing the root to "see" all of the data elements in nondescending order. The root then sends the elements to leaf nodes in the proper order. Before describing the part of the algorithm dealing with the upward bubbling of data, let us deal with the simpler downward movement. This downward movement is easily coordinated if each node knows the number of leaf nodes in its left subtree. If the rank order of the element received from above (kept in a local counter) does not exceed the number of leaf nodes to the left, then the data item is sent to the left. Otherwise, it is sent to the right. Note that the above discussion implicitly assumes that data are to be sorted from left to right in the leaves.

The upward movement of data in the above sorting algorithm can be accomplished as follows, where the processor action is described from its own viewpoint. Initially, each leaf has a single data item and all other nodes are empty. Each inner node has storage space for two values, migrating upward from its left and right subtrees.

if you have 2 items
then do nothing
else if you have 1 item that came from the left (right)
  then get the smaller item from the right (left) child
  else get the smaller item from each child
  endif
endif
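The net effect of this upward phase, namely that the root sees the items in nondescending order, can be sketched as follows (leaves are one-item lists and inner nodes are pairs; this captures the outcome of the bubbling, not its step-by-step synchronous behavior):

```python
def smallest(node):
    # Peek: the item this subtree would next offer its parent.
    if isinstance(node, list):             # a leaf holds at most one item
        return node[0] if node else None
    vals = [v for v in (smallest(node[0]), smallest(node[1])) if v is not None]
    return min(vals) if vals else None

def extract_min(node):
    # Remove and return the item that bubbles up to this node next: an
    # inner node always forwards the smaller of its children's offers.
    if isinstance(node, list):
        return node.pop()
    a, b = smallest(node[0]), smallest(node[1])
    if b is None or (a is not None and a <= b):
        return extract_min(node[0])
    return extract_min(node[1])

# Leaves start with one item each; the root extracts them in order.
tree = (([5], [2]), ([7], [3]))
out = [extract_min(tree) for _ in range(4)]
# out == [2, 3, 5, 7]
```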

Figure 2.12 shows the first few steps of the upward data movement (up to the point when the smallest element is in the root node, ready to begin its downward movement). The above sorting algorithm takes linear time in the number of elements to be sorted. We might be interested to know whether a more efficient sorting algorithm can be developed, given that the diameter of the tree architecture is logarithmic (i.e., in the worst case, a data item has to move 2 log₂ p steps to get to its position in sorted order). The answer, unfortunately, is that we cannot do fundamentally better than the above.


Figure 2.12. The first few steps of the sorting algorithm on a binary tree.

The reasoning is based on a lower bound argument that is quite useful in many contexts.

All we need to do to partition a tree architecture into two equal or almost equal halves (composed of ⌈p/2⌉ and ⌊p/2⌋ processors) is to cut a single link next to the root processor (Fig. 2.13). We say that the bisection width of the binary tree architecture is 1. Now, in the worst case, the initial data arrangement may be such that all values in the left (right) half of the tree must move to the right (left) half to assume their sorted positions. Hence, all data elements must pass through the single link. No matter how we organize the data movements,

Figure 2.13. The bisection width of a binary tree architecture.

it takes linear time for all of the data elements to pass through this bottleneck. This is an example of a bisection-based lower bound.
