A Parall el Algorithm of Constructing A Voronoi Diagram on Hypercube
Connected Computer N etworks
By WenmaoChai
AthesissubmittedtotheSchool of GraduateStudie5 inpartia]fulfiUmenl oftherequirements {orthe degreeof
Mu ter of Science
Departmentof ComputerScience Ml"moriRlUniversityofNewfoundland St.John's Newfoundland CaDada
Contents
Acknowledgment
Ab stract
1 Int r odu ctio n
1.1 Parallel Computation Models 1.1.1 ParallelRandom AccessMachines.
viii
1.1.2 ProcessorNetworks ...
1.2 Literat ureReviewcl Parallcl Algorith msfor Construct ingVoronoi Diagr a m
2 Hypercube ConnectedComputerNetwo rkandItsPundn-
ment alOpe rations 18
2.1 SIMD Hyp ercube Connec tedComputerNetwork. 18 2.2 Fund a.mental OperationsonHyp e rcube ConnectedComp uter
Network, 22
2.2.1 Maximu m. 2'1
2.2.2 Ranking 25
2.2.3 Concentration . 29
2.2.4 Distr ibuti on. 32
2.2.5 Generali zation. 33
2.2.6 Merging and Unmerging 38
2.2.7 Sorting. 41
2.2.8 Random AccessRead. . 45
2.2.9 RandomAccesswrite 50
2.2.10 Precede 55
2.2.11 Summary
"
:\ APa rall elAlgorithmforCons truct ingVoronoi Diagr amson nHyperc ube Conn ect edCom p u t er Ne t wo rk
3.1 Definitionand Feature s of Vorcnc i Diagram 3.2 ConvexHulland InversionTransform.
3.2.1 Convex Hull.
60 61 63 63
3.2.2 InversionTransform 64
3.3 AParallelalgorithm to cons tructa Voronoi Diagra m 64 3.3.1 Parall elalgorith m forconstructing the 3-d convexhull 73 3.3.2 Aparallelalgorithmto test externnl faces andinte rnal
faces
3.3.3 A parall el algorit hm to construct asphericalVoronoi diagram
77
82
3.3.4 Aparallel algorithmtolocatepointson a spherical
Voronoi diagram 87
3.3.5 A parallelalgorithmto addnewfacesto the convex
polyhedron 92
3.4 Summary 4 Conclusio nand Discussion
4.1 Parallel Algorithm toConstructa VoronoiDiagram in1.1(I"....) on a hypercube connectedcomp uternetwork..
99
100 4.2 OptimalParallelAlgorithms to Constru ctVoronoiDiugruma.lOa
iii
List of Figur e s
1.1 Par allelRand omAccess Machine 1.2 Mesh wit h"=16 processo rs.
1.3 Cube-co n necte dcyclesnetwork withd=3 andn=24 10 1.4 A4-st ar
1.5 A-l-p an cake.•. .
2.1 Block dia gr amofanSIM Dcom p ute r 2.2 16 processo r hy per cube.
11 12
20 21
3.1 Voronoidiagram 62
3.2 Inversion.. 66
3.3 Relat ion between3-dconvex hullandz-dVoranoidiagram 70 3.4 The algorithmofconstr uct ingaVoranoiDiagram 72 3.5 The algorit hmof a threedimensionconvexhull merge. . 76
3.6 The two dimensional analogy 7'
3.7 Thealgori thm
or
externaland internal face test 83 3.8 The algorithmforconstru cti ng spherical Voronoi Diagra m 86;'.9 Representation ofchain. all
3.10Thesearchalgorithmforpoints location 9:1
3.11Thealgorith mforaddingnew(aces ofconvexpolyhedron . 96
4.1 Voronoidiagram in1,,\met ric . 102
List of Tables
2.1 Exampl eto computemaxim um inallSIMDhypercube 26
2.2 Program for SIMD Maximum .. 27
2.3 Exampleto compute ranksinan SI MD hypercub e . 28 2.4 Progra m IcrSI MDranking procedu re ... 30 2.5 Exampleto co ncent rateinSIM D hypercu be 31 2.6 Program foe proce dureto concentraterecords 32 2.7 Exampleto distri bute inan SIM Dhyperc ube 33 2.8 Programfor procedu reto dist ributerecords 34 2.9 Exampleto gener a liseinanSI MDhypercube 35 2.10 Programfor pro cedureto generalizerecords 37
2.11Program for SIMD Reversing 39
2.12Bitonic sort into nondecreasi ngorder.. 40 2.13Iterativebit oni c sort forII,apower of 2 41 2.14Powerof2 bitonicsort [ncndec reaslngorder). 42 2.15 Powerof2bitonicsort (nonincreas ingorder). . . 43
2.16 ExampleofII.randomaccess read .. 46
2.17Algorithm (ora.random access read.. 48 2.18Example o(anarbihary randomacceJIwrite... 52 2.19Algorithm(oran arbitra ry rand omacee..write •. . .•. 54 3.1 TheAlgorit hmofConlt ructing3·DimeJllionalSpace Convex
Hull
vii
Ack nowledgme nt
Iwouldliketotakethis opportunity to expressmy sincerethanksto mysuperv isor,Dr.CaoanWang .He stimulated myint erest in thefield of computationalg~metry.Hi.constant encouragement Andguidanceduring thecourse of mystud yandresearch ledto the completionof this thesis.
Las t butnot theleast,1wouldliketo give mythenketo my husband, Zhcnpc n You ng,for hislove andhelp tbrou ghou tthecomposit ionofthis
viii
Abstract
Com putatio na lgeome tryis abra nch ofcomp uter scien ceCO UC C fl ll 'l!
with thedesignandanalysis ofalgorit hms to solvegeomet ric prubloms.TIll' Voronoi diagramor aset.5ofIIpoints (calledsites)is awellknownst mr-ture in comput ati onal geometry .Inthe Voronoi diagram ,eachpoint is surrounde d bya convex polygonenclosing tha t territorywhich isclosertothe surrounded point thanto anyot her pointintheset. Voronoi diag ra ms arcusefulill solvinggeomet ricproblem s suchas proximit y problem sand theEuclidean minimum spanning treeproblem. Voronoi dingmmsalsohavenpplicatjons in diverseareas likebiology,visua lperception,physics,andarcheology,
Thereexist manymethodsto const r uct Varonoidiagramsonn sill- gle computer.Twoof themareprop osedby Shamosand Brown .In 1075, Shamesapplied two-dime nsion al Vcronoi diag rams toobtain elegant110111- tlensin compu tationalgeometry , such asfinding theneare stncighlmrlUlll construct ion ofminimumspanni ngtrees. Shamesdescribedan0(,,101;11) timesequentialalgorithmto con structthe planarVoron oidiagramforasd ofplanarpoint s.The str at egyheusedin theserial algorithmisdi vide-and- conquer. In1979,Browndemons trated an interestinglinkagebetween till:
Voronoidiagramand the convexhull.He presentc dan0(11log" )algorithm for constructingtheVor onoidiagram,by transformingtheproblemofCOII- structinga planarVoron oidiagramforan »<pcintssettotheconstruction
ix
ofIIconvexhullof"pointsina-dimensionalspaceviaa geometrictransfer- million knownasinversio n.
Duetothe natureofsomeapplications in whichgeometricproblems arise,fasLandevenreal-time algorithmsareoft en required.Here,as in many ot her areas,parallelism seemstoholdthe greatest promiseformajor reductionincomputatio ntime.Theidea isto useseveralprocessor swhich cooperate to solveagiven problemsimult aneously inafraction of the time takenbyII.single processor.Therefore,itis not surprisingthattheinterest inparallel algorithm sforgeometri cproblems hasgrowninrecent yea rs.
BasedonShamos's and Brown'smeth ods,severalparallel algorithm s to construct Voronoi diagramshavebeenpresented.Someofthemareim- plement ed on parallelrandom accessmachines.Someof themarerunon processornetworkssuch asmesh,cube-connectedcycles, starsand pancakes.
Wewilldiscuss themin theliteraturereview.With thedevelopmentofcom- munlcntiontechnique an dconcurrent programming,compute r network shave become popular. Themostpopularprocessorinter connecti on topologytoday isundoubtedlythehyprl'Cllbc.Hypercube!have several advantages. First, thenumber of nodesina hypercubegrows exponenti allywith the number ofconnect ionspernode, sothat asmall increasein thehardwar e ateach nodeallowsa large increaseinthesizeof thecomputer.Second,the number of alternat ivepaths betweennodesincreases wit hthesizeof thehyp ercube,
whichhelpsrelievecongestion.Third,efficientalgorit hmsarc knownforrout- ing messagesbetweenprocessors in a hypercube.Finally,andtoday mod importantly,alar ge corp usofsoftwareand programm ingtechniquesexists forhypercube .
Thehypercubeis one of themostversati leand efficient networksyei discoveredfor parallelcomp utation.Inthisthesis,a single instru ctionIIm!- tiple datast reamhypercubeconnectedcomputernetworkis chosenas our par allelcompu tatio n model.In ahypercube connectedcom put er ncl work, local computationsas wellas messageexchanges arctakenintoconsiders- tio nwhenanalyzingthetime taken by theprocessornetworks tosolvefI.
problem.Based on NassimiandSah ni'spaper fund amental operationson hypercubeconnectcomputernetwor ksare descriptively discussed.TheCOt- respondingprograms are alsogiven. Based on Brown'sapproach,II.parallel algorit hmto constructVoronoi diagramsaredeveloped.Our algorithmrU/IS in O(log3n)time onan O(ll)-processor hypercube connectedcompute r net- work.Our algori thmhas severaladvantages.First, our algorithmisbased onBrown's methodwhichtransformstheprobl-em ofconstr ucti onof a planer aVoronoidia gra m forann-pointsettoconst ructio nofa convexhull orII
points in threedimensionalspace,Comparedwiththepar allel IIlgoriUllns whichare based on thedivide-and-conquer approachused by Shamos,our algorit hmcan be usedtosolvetwocomputationalgeomctryproblems:con- struct ing 2- dimen sional Voronoi diagrams and3- dimensiona lconvexhulls.
xi
Seco nd, compa ring with Chow's meth odswhichruns on aO(n)processors CCC (Cube-ConnectedCycles)model has O(log·n] timecomplexity,our nlgorithm has better time complexity.Third,comparingwithChang-Su ng Jeon g's algorit hm which runs inO(
J'Ti)
onan.jilx..;n
mesh,our parallelcom puta tion model ismoregeneral becausemost other popular networks can be easilymappedontoa.hypercub e.
xii
Chapter 1 Introduction
Programmin g computerstoprocess pictorialdat a efficientlyhas been an activityof growingimportanceoverthelast40 year s.Thesepictoeia.l datR maycome frommanysources.We dist inguishtwogen eralclasses:
1.Most often, thedat aare inherentlypictorial, suchasthe images aris- ingin medical, scientific, andindustrialapplications.WeAthermaps receivedfromsatellitesin out erspace are agoodexample.
2. Alternatively,the data areobtainedwhena mathematica lmodelisIISl'(1 to solve a problemandthemod elrelieson pict orial da ta.Exarnpks hereinclude computingtheaverageof a setofdata(representedas pointsinspace) in the presence ofoutliers,computingthevalueof a function thatsatisfies a set ofconstrai nts,andso on.
Regardle ss oftheir sources,thesecomputationsmay includethe ope r- ationsof iden t ifyingcontours ofobjects,"noise" removal,featureenhance- ment, patternrecogniti on, dete ctionof hiddenlines, andobtaini n ginterse c- tiona among various comp onents.At the foundat ion ofall theseeornp utation a areproblemsof a geomet ricnature,tha t is,problemsinvol vingpoints,line s, polygons, andcircles. Cmnplttalionalgeometryisthebranchof computer scienceconcern ed with designingefficientalgorithm sfor solvinggeomet r ic proble msof in clusion,intersecti on,and proximity , to namebut afew.
Untilrecen tly,theseproblemswere solved using conventiona lscqIIC/lti(l{
comp u ters,computers whosedesignmore orlessfollowsthe model proposed by JohnvonNeumannandhis tea m in thelate19405.Themodel consistsof nsin gle processor capa bleof executing exactly oneinstru ction of aprogra m durin geach timeunit . Comp utersbuiltaccordi ngtothis para digm have beenableto perform attreme n dous speeds.However,itseemstodaythat this approachhas been pushedas far asitwillgo,and thatthesimple laws of physicswill standin thewayof furtherprogress. For example, the spee d of light imposes alimit thatcannot be surpassedby any electronicdevice.
Onthe otherhand, our appetitegrows con tinually fot ever morepow- erful compute rseapeb leofprocessinglarge amountsof data atgreat speeds.
Onesolution to this predicamentthathasgained credibilityan d popularity isIlflmUr:!pIYJrr...illg.Hereacomputa ti onal problemtobe solvedis broke n
into smal lerpartsthat are solvedsimulta n eously bytheseveralproc essorsof apamllefcompltlcr.The ideaisa.natura lone,an dthe decr easing cost and sizeof electronic compo ne nts havemadeit feasible.Latel y, compu t e rscien- tistshav e been busy build ingpa rallelcomputers anddevel opingalgorithms andsoftware to solveproblemsonthem.One Il.rea that hasreceived its rair share of interest is thedevelopment of Ilnn/llr1f//!lnl'il/lIl1.~for ccmp utatioual geomet r y.
TheVoronoidiagr am isa mathematical concepl attributedto math- ematician Voron oi[I). Voronoidiagram shavebeen well studiedinCOIll -
putation al geom etrysin cethework of Shernce(2Jpartlybecauseortheir applicati onsinsolvinggeometri c problem s such asproximityproblems and the Euc lidean Minimum spannin gtreepro blem, aswell astheir ap p lications insuchdiverse areasasbiology,visual perception,physic s , and archeology [3J.
Thereexis t manymethod sto con st ruct Vor inoidiagram on asingle c':....puter . Twoof themareproposedbyShamosand Brown. III 1975, Shamosapplied two-dimen sional Voronoidiagramsto ob t ainele g antsolu- tionsincompu t ational geometry, suchas findingthenearestneighborand constr u ct ion of minimumepen rringtrees,Shemo edescribed all. 0(11 logu) time seq uential algorithm toconstr ucttheplana rVorono i diagramfornset of planar points.The strategy he used in theseri a lalgori t hm isdivide-end -
conquer. In 1979, Browndemonstr atedaninterestinglinkagebetwe enthe Vorona jdiagram and theconvexhull.Hepresented an O(nJogll)algo rithm foroonst ruc!ingtheVoronoidiagram,bytransformingtheproblemofcon- structing aplan arVoronoi diagra mforana-pointssettothe construction ofa convexhull of1/points in3-dimen!ional8paceviaa.geometrictran!fo··...
malionknown as inversion .
Inthis thesis,thehypercube connectedcom puter net work is chosen astheparallelcomputa t io nmodel. BasedonBrown'sap proach[41which transfoems theproblem ofconstr uct ionof aplanaraVoron oidiagram foran u-pointsotto construct ionofaconvexhullofIIpoints inth reedimensional space,aparallel algorithmto const ruct Voronci dia gramswillbedeveloped.
Inthis chapte r,after we introduceparallelcomp u tatio nmo dels,wewillre- viewpar allel algorithms to const ructVoron oi diagra ms. Finally,the outline ofthethesiswillbegiven.
1.1 Parallel Computation Models
Inorder toreview theparallelalgorith msto constructVoronoi diagram , we firstreview some exis t ing models of par allelco mputationinthissection.
1.1.1 ParallelRandom Acc e s sMachines
In parallelrandomaccessma chine (PRA M),a common mem ory isused as abulletinboa rdand alldataexchangesareexecuted through it.Any pair ofprocessorscan communicate thro u ghthisShH ~dmemoryinconstant time.As showninFig.1.1, an interconnection unit(lU) allowseach process or toestabli ~hapa t h to eachme m o ry locationfor thepur poseof rend ingOf writin g ,
The processors operatesynchronouslyand eachstepor a compu ta tio n consistsof threephllSes :
1.The1"Cndphase,inwhich the processors readdatafrom me m ory;
2.Thecom p utephas e,inwhich arithmetic andlogir; opeeafiorrsareper- fo r med;
3.Thewrit ephase, inwhichtheprocessors write dat a tomem o r y.
Dependi ng onwhethertwo ormote processor s ate allowedto rea dIrom and/orwriteto thesamememory10cII.tionsimultaneous ly,three submod- els of the PRA M areident ified ;
1.Theexclusive-rea dexclus i ve-write(EREW) PRAM,where bothreud and writeaccesses bymore thanone pro cessortothe samememo ry lo cation arenotallowed.
Processo rs
Inter connection Unit (IU)
Shared memory Locat ions
Figure 1.1: Par allelRando m Access Machine
2. Theconcurrent-readexclusive-write (CREW)PRAM, wheresimultn- neous readin g from thesamememor y locat ionisallo wed, but not si- multa neouswriting.
3.The concu rr ent-readconcurrent-write(CRO W) PRAM,whe reboth forms of sim ultaneo usaccess are allo wed.
1.1.2 Proces sorNetworks
Ina processornetwork,an int erconnecte d setofprocessors cooperate to solveaproblemby performing localcom p ut ations andexchang ingmessages . Severalofthe most widely usednet work sareout line dbelow,
1.Mesh orTwo-Dim ensiona l Ar r a y
Atwo-dim ensionalnetworkis obtainedby arrangingtheII peoces-
aorsintoanTIl XTna.rray,whereT1I
=..;n,
The processor inrowiandcolu m nkis denoted by(j,k),whe re0:5j :51/1- I an d
o
:5" k:5"m -1. Atwo-waycom m unicat ion linelinks(j,k)toitsneighbors(j
+
1,k),U-I,k),(j,~:+
1),andU,k- 1).Processorson the bound a ry rowsandcolumns have fewerthanfour neighborsand hen ce fewerconnectio ns.Suchanet work isalsoknownasthe,111,.../t orthemesh-connected commuer(MeC)mode l.Fig.1.2showsamesh withn=16processors.no.
Number
o
Column Number
Figure1.2:Mesh withn
=
16proeesson2. Cube-ConnectedCycles
Toob t ai n a cube-connected cycles(CeO)network,we begin witha d.
dimensional hy p ercube, then replaceeach ofits 2" corners withacycle ofdprocesso rs.Each processorina cycleis connected to a processorin aneighboringcycle in thc eeme dimension.See Fig.1.3forancxample of acecnetwork withd:;;;3andII= 2'1.//=24processors.Inthe figure,each processor hastwo indicesi,i ,whereiis theprocessororder in cyclei.
3.StarsandPancakes
Star andpancake are two interconnectio nnetworkswith the property that fora given integer 11. eachprocessor correspo ndsto adistinct per- mutation of '1symbols,say{I , 2,...,l}}.In otherwords,both networks connectII='1!processors, and each processorislabeled withtheper- mutationtowhi ch it corresponds.Thus,for '1
=
4,a.processor may have thelabel2134. Inthe.~tarnell/Jo r k.denotedby S",a processor visconnected to a processor uifandonly if the label ofItcanbe obtainedfrom that ofvby exchangingthefirst symbolwith theiU' symbol,where2$i$1/.Thus forI}=4.if v=2134 and11=
3121, 11and v areconnectedby a two-way lin k inS,,,since312~and 21M can be obtained from one another by exchangi ngthe first and third symbols .Fig .1.4shows Sf.14 16
10 12
30 20
15
32 22 17
31 21 23 33
Figure1.3:Cube-connectedcyclesnetwor k withd=3andn=24
10
3214
2314
4312
1342
b/
1234
3142
Figure 1.4:A4·st8.r
11
2143 243(
3421
1423
4123
1231 4321
32]4
23]4
4132
1432
2341
3241
1423
4123
3412
Figure1.5:A 4-p!l.lIcake 2143
III theIItIll"'lhIIt/wor/':.denotedby1-'",a processorvisconnecte d toa.processor1/ifandonlyifthelabelof11canbeobtainedfromthat oft'byflippi ng the nrstisymbols,where2$iS 'I. Thus for'I
=
4, ift'=
2134andu=4312,n andvareconnect ed by a two-waylinkin lIt.since4312 canbeobtained from2134byflippin g the foursymbols, and viceversa.Fig. 1.5 showsPl'12
1.2 Lit er ature R eview of P a rallel Algori thms for Co n str ucti ng Voronoi Di agr-a m
In1980,Chow inherPh.Dthe~i8151propo sedthreeparallelalgorithms (orconst ructing Voronoidiagrams.Heralgorithmswere basedallDrown's method \41.Byusingtheinversiontechnique , she transformedthe constr uc- tionof the 2-IiVoranoi diagram to theconst ructionofthe3-,Jconvex hull.The computationmodel e she usedwereCREW PRAM and
cee
per- allclcomputernetwork s.On the CREW PRAM model,her algorithmrunll in O(log3 n)time with0(,,)processors. On thecce
model,one of heralgo- rithmsruns in O(log"'71)timeand uses0(11)processors,and theot her runs in0(k10g3,, )tim e and usesO(n1+1 / k)processors,where,1::;k$logII.In1986,Mi Lu161presented a parallelalgorithmto constr ucttile Voronoi diagramofaset ofplana r points.The algorithmisbased on Brown's approach{4Jand has O(.jii'log ll)timeeomplexltyonO( .jIix
v'ii)
MCC with constantstorageperprocessors.In 1988,Preilowskiand Mumbeck{VIsuggesteda time-optimal pamllcl algorit hmto computeVoronoidiagrams.Thealgorithmrunsin O(log,') timeusing0(11-3 )processorsonaCREW PRAM.Thisresultis time-optimal becausethesorting problemcan be reducedto the Voronoi diagramproblem.
The authors showedbow their algorithm can compute all edgesof the Vcronoi
13
diagramin0(1)timewith0(114)processors.
In 1988,Leveepeul oe,Kat ll.jainenand Lingaa[8} gavea.par allelalgo- rit hmforcomputi ngthe Voronoidiagramofaplanar pointsetwithin8.square windowW.Thealgorithmusesmultilevel bucketing and runsinO(logII) averagetimeonCReWPRAMwithO(I,/logII)processorswhen thepoints aredrawn independentlyfroma uniform distribution.
In1989, Evans andStojmenvoic [9Jpresented an 0(10g3,,)algorithm for constructingtheVoronoi diagram of aset ofnpoints onCREWPRAM.
OntheCReWPRAM modelthealgorithmrunsin0(log2n )time.The algorit hm uses Shamos's divide-and-conqu ermethod[21.
In1990, Chang.Sung Jeong[10/ gave a parallel timeoptimal algorithm whichrunsinO( y'ii) onan
..fii
x..fii
mesh. Thealgorithmisbased on thedivide-and-conquer approachused byShamos(21.The setofpointsis sort edbyor-coordinate and divided inhalfintotwo setsLandR bya vertical separatingline1 suchthat points inI~areto theleft ofIandpointsinR arc totherightofI.Recur sively,theVoronoidiagra.mVo/·(I..)andVor(R ) arc compute dforthesetsI..and R,respectively. Thetwodiagrams are thenmerged ,resultinginV01'(I..UIl).II01'(I..)istheVoronoidiagram ofI..,1"111'(/1)istheVoronoidiagramofRandVor(LUR) is theVoronoidiagram
ofl..u/I.
14
In1992, S.GAkl[11\presentedparallelalgorith msthat computethe VoronoiDiagram ofI'points onthestarandpancakenetworkcomputers.
For anu-starorn-pancake withII
=
,,1 processors,given/Iplanar points sto redin the processorssuch that eac hprocessorhl)ldsonepoint.andlias a memoryofconstantsize, the Voronoidiagramof these point s canbeIound inO( n~log2 ,,)time.In thisChapte r,we review parallelcom putat ionmodelsandparallel algorithmsof constructingVcronoi diagrams.Themost popularprocessor inte rconnectio ntopology today isundoub tedlythehypercub e.Thehyper- cube is oneof the most versatileandefficientnetworks thusfnrdiscovered for parallelcomputation112Jbecause:
•In a hypercub e,using(!connecti ons perprocessor,2"processormay beint erconnect ed such that themaximumdist an cebetweenanytwo processorsis d.While lineararray,tree,Mesh and Mesh of treeuse a smaller number ofconnectionsperprocessor,themaximum distance betweenprocessors islarger.
• Most otherpopular networksare easily mapped intoa hyp ercube.In part icular,thea-node hypercu becansimulateany O(II)-node array, tree,ormeshoftreeswith onlya smallconstantfacto rslowdown\121.
•Hypercube has the advantage of beinga wellstudied networ k.Efficient algorit hmsareknownforrout ing messagesbetween processorsinII.
15
hypercube.A large corpus ofsoftwareand programming techniques exist for hypercubes [13].
•Ahypercubeis completely symmetric. Everyprocessor'sinterconnec- tionpattern is like thatofevery otherprocessor.Further more,ahyp er- cube is completely decomposableinto sub-hypercubes[i.e., hypercubes of smallerdimension).This property makes itrelatively easy toimple- mentrecur sivedivide-and-conqueralgorithmsonthehypercube (14].
In ahypercubeconnectedcomputer networks, localcomput ationsas well asmessageexchangesaretaken into considerationwhen analyzing thetime taken bytheprocessor networksto solveaproblem.Whendesigningan algorithmfor aprocessor network, therouti ng of messages fromonepro- cesser toanotheris the responsibilityofthealgorithmdesigner.In Chap- ter2, we will introducethe parallelcomputationmodel usedinthe thesis.
Based onNa.ssimiand Sahni'spaper[15), severalfundamentaloperat ions onhy percube connectcomputer networksaredescribed.The corresponding progra ms arealso given. All thoseoperation swillbeused in chapter3.In Chapte r3, Based onBrown's metho d,wewill developa parallelalgorit hm to constructVoronoi diagrams. Ouralgorit hm runs in0(log3»)timeon an O(II).processorhypercubeconnectedcomputernetwork.Ouralgorithmis buedonBrown'smethodwhich transformsthe problemof const ruct ionof a planar aVoronoi diagram for ann-pci ntsetto constructionof a convexhull ofIIpointsinthreedimensionalspace.Comparingwiththeparallel elgc-
16
rithmswhicharebasedonthe divide-end-conquerapproachused byShames, ouralgorithmcanbe usedto solve two computationalgeometryproblems:
constr ucting2-dimensionalVoronoi diagram and3- dimensio nalconvex hull. Compari ngwit hChow's metho ds(5)which runs on(\.0(11 )processors CCC (Cube-Con nectedCycles) model has 0(log·11I)time complexity,our algorit hmhasless timecomplexity.ComparingwithChang- Su ng Jeong's algori thm[l O]whichruns inO(
vn)
on anvn
x.fii
mesh,ourparallelCOIl1-put at ion model ismoregeneral.Most otherpopularnetworkscanbeeasily mappedontoa hypercube{12].
17
Chapter 2
Hypercube Connected Computer Network and Its Fundamental Operations
Inthis chapte r,theSIMD hyper cubeconnectedcomputernetwor k will be defined.Somefundamental operationsonit will bedescribedandccrre-
spo nding programs.ilIIll">begiven.
2.1 SI MD Hyp erc ub e Co nnected Computer Network
AccordingtoMichaelJ.FInn'. (161taxonomy ofcompu te rarc hitect ure, parallelcomputersaredividedintotwo cetegoriea, single instructionmul- tipl,. datastreams (SIMD)and multipleinstruct ionmultiple datastr eams
18
(MIMD).Ablock diagram IorII.SIMD compute r is givenill Figure 2.1.
AsCl'IlIbeseen,an SIMDcomputerconsistsofu procesaingelemenh(PE's}.
ThePE's areindexed0 through"-1and maybereferenced as1>/o:(i).Eaeh PEhas i'.slocalmemory.The PE'sare synchronizedandoperateunder the: control program.PE'smay be enabled or disabledsothat thecommon instr uctionfor any giventime-unitis executedonlyonenabled PE's.This enabling anddisablingof PE's can be donewit hout theuse ofseparatecon- trollines foreachPE as long aseachPEknows its own index
1 171.
The PE's areconnectedtogetherviaan int erconnection network.Differentin- terccnnectionnetworks lead todifferent SIMD architectures.In this thesis, hypercubeconnect SIMD compute rsareconsidered.Assume that n
=
2"andlet;J_I, ••io bethebinaryrepresentation of iforiE[0, 11-1].Leti(~lbe the numberwhosebinary representat ion is i"_1...iH1r;;ib_1...io, where;;; is the complementofiband 0:5 "<,I.That is,i(~Jis obta ined by complementingtheb'th bit ofi's binary representat ion.Inthe hypercube model processoriis connectedto processors;lbJ,0:5" "< ,I.
Fig. 2.2showsan exampleofIf=16processorhypercube.
The hypercube is an excellent(andpopular)choiceforthearchitect ure ofamultipurpose par allelmachine. In this thesis,we choose theSIMD hypercube connectedcomputer networks as our parallel comput at ion model.
19
I/O
I/O
I
Inter- connect ion Network
Figure2.1:Block diagram ofan SIMDcompute r
20
1100
1(100
.> /
1110,:"
-, ,.y o .:---
0111010 011
000 _0011
o~ ~
--
001..
~~~- --
V .:
1010Figure2.2: 16processorhypercube
21
1111
ion
2.2 Fundament al O p e r a tions on Hypercube Co n ne cted Computer Network
Inthis section ,based on Nassimi and Sahni'.pap er ll S), several Iunda.
mente!operationsonhype rcubeconnectedcomputernetworksarcdescri p- tivelydiscu ssed. Thederivationsandthe correspondingprogramsarcalso givenafter thedescriptionofeachopera tion. Allthoseoperations willbe usedin chapter3.Therefore,thissectioncanbeconsideredas theprepa ra- tion(orehapter3.Only operations whichareusedwillhediscussed. Based on this,somevery buicoper ations, suchasdat a accumulation and eonseeu-
livesumonhyp ercu be connectedcomputernet work.willnotbementioned in thissection.Formoredetails, [18] wouldbe a goodreference.
Beforethefund&mental operat ionsonhypercu be connectedcomput er networks are introduced, some program mingnot ationusedi.givena.Icllcwsr
I.Thenotati on;1.'i.torepresentthe numberthat differ.fromiin enctly bitfl.Thesquarebracket.(1))areused toindex anarray and the parentheses('()')are usedto index PEs.ThusA[llrefersto thei'th elementofarrayAandA(i)referstotheA registerofPEi,A(j](i) referstothe j'thelement ofarray A in FEi,Thelocal memoryineach PEholds dataonly[f.c.ItOexecuta ble instructions).PEsneedto be
22
able to perform onlythe basic arithmeticoperatio ns.
2.Thereareaseparate program memory andII;cont rol unit.TheCOil -
trolunit performs instruction sequencing, fetching,anddecoding.III addition,instructionsand masksare broadcastedby thecontrolunit tothe PEsCorexecution. AninstructionIII(I...~·isIIbooleanfunction usedto selectcertain PEs to executeaninstruction.Forexample,in theinstruction
A(i): =A(i) +1,(i o =1)
(io
=
1)isa maskthat selectsonly those PEs whoseindexhasIllt 0 equal to 1.3.Interp rocessor assignments are denotedusingthesymbol" _ ", while intre processcrassignments are dcaoted using thesymbol":=".Thus theassignmentstatement:
on a hypercubeisexecutedonlyby the processorswithbit2 equal to O.
TheseprocessorstransmittheirIiregister datatothecorresponding processorswith bit 2equal to1.
4.A d-dimensional hypercube can be partitionedinto windowsofsi?.c 2kprocessorseach.Assume thatthisis donein sucha waythatthe processorindices ineach windowdifferonlyintheir least significantk
23
bits. As a result,the processorson each windowforma subhypercube of dimensionk.
2.2. 1 Maximum
Assumethata dimensiondhypercubeis partitionedinto subhypercubes (orwindows) of dimensionk,whereP
=
2dandW=
2k•Iff==iW+q (0:5(/::;W)isa processorindex ,thenprocessorfis theq'thprocessorin windowi,This processoris to compute themaxim um elementofallelements, where, O::;i <P/W , O$ q <W.Themaximumrelative tothe wholesize~vwindow areobtainedas below:
1.Ifa proc essoris in theleft2 k-1snbwindow,then its maximum isun- changed.
2.The maximumofa processorin the rightsubwindowis itsmaximum whencon sidered as a memberofis.2k-1window compared to themax- imum oftheAvaluesin the left subwindow.
Table2.1 givesanexample-maximumcomp utation.The numberof processorsandthe windowsize IV
=
2kareboth 8. Line0 givesthe initial Itvalues.Themaximumin thecurrent windowsare storedin theSregisters and the maximumof theJI valuesof theprocessorsinthe currentwindows24
arestored intheT registers. Webegin withwindowsoCsize1.The initial SandTvalu es are given in lines1and2,respectively. Next,theSandT valuesfor windowsofsize:2 are obt ained. These aregiven in lines 3 and.1.
Line 5 and6givetheSandT valueswhen the window size is>1andlines 9 and 10givethese valuesforthe casewhenthe wind ow sizei~8. The program in table2.2is theresultingprocedure. Its timecom plexityisO{k}.
2.2 .2 Ranking
Associatedwith processor,i,ineachsize2kwindowof nhypercube is
aflagsel~r;led(i}which istrue iffthisis a selected processor.The objective
of rankingistoassigntoeach selected processoraI'illlksuch thatnwk(i)is the number ofselectedprocessors ofthe window withindexless thani.Line oof Table2.3shows theselectedprcceesora in awindowofsize eightwillI an
*.
The ranksto be computedareehcwnin line1.Theranks ofthe selectedprocessorsinawindow ofsize2kcnnhe computed easilyifwe knowthefollowing informationCor theprocessors in each ofthesize2k-1subwindcwsthatcomprisethesize2kwindow:
1.Rankofeachselecte dprocessorinthe2k-Lsubwindows 2.Tot alnum berof selectedprocessors in each2.1:-1subwin dow
Ifaprocessorisin the left2.1:-1subwindow then its rank inthe2l 25
PE I 2 3 4 5 6 7 line 0
0 2 4 3 1 5 2 8 I A
2 4 3 5 2 8 I S
2 4 3 5 2 8 1 T
2 I 3 3 5 5 8 8 S
4 4 4 3 3 5 5 8 8 T
5 2 4 4 4 5 5 8 8 S
8 4 44 4 8 8 8 8 T
7 2 4 4 4 5 5 8 8 S
8 8 8 8 8 8 8 8 8 T
Ta ble 2.1:Exampletocompute maximuminan SIMDhypercube
26
pr oc edu r eS/MIJMI/J'illlll/ll(;\,~·,SO);
{ComputetheMaximum ofrIin windowsofsize2k}
begin
{Initia lize forsize 1windows}
S(i ) :=A(i);'I'(i) :=A(il;
{compute {or size2~+Iwindows } forb:=0to~'-1 do begi n
l3(i (~))__'r(i);
S(i ):=H(i),(S( i)<l1(i)and i,.
=
1);,.(;), ~HUl,('I'U)<H(i)) ; end;
en d; {ofS/ !II/J AlII.rimum }
Table2.2:ProgramforSlMD Maximum
windowis thesameasitsrank insubwindow. Ifitis in the rightsubwindow, itsrank is its rank inthesubwindowplus thenumber ofselecte dprocessors in theleft subwindow. Lin e 2 of Table 2.3shows therank ofeachselected processorrelative tosubwindows ofsize4.Line3showsthetot alnumber or selected processorsin eachsubwindow.
LetH(i) andSU),re spectively, denote therankof processori(if it is a.selected processor) andthenumber ofselected processorsin the current windowthatcon tai ns processori,The strategyto count rank s inwindows or size2kistobeginwithU.and SOfor windowsofsizeoneand then repeat edly double thewindowsizeuntil reachinga.windowsizeof2k•Forwindowsor sizeone,itis given:
27
,,!,E 0 I 2 3 4 5 6 7
line
0 1 2 00 3 4 H
2 00 0 1 0 00 1 2 R
2 2 2 2 3 3 3 3 S
0 0 0 0 0 0 0 0 R
0 I 1 0 1 0 I 1 S
0 0 0 0 0 0 0 1 R
1 I 1 1 I 2 2 S
0 0 1 1 0 0 1 2 R
2 2 2 2 3 3 3 3 S
10 0 0 1 2
,
3 4 R1] 5
s
5 5 5 5 5 5 STable2.3:Exampleto computeranks in an SIMD hy percube
28
lI(i) S(i)
_ 0
101 ifiisselected
1
otherwiseLines4and 5of Table2.3give theinitialIiandSvalues. Lines6 and 1givethe valuesforwindowsofsize2. Lines8and 9 givethesefor windowsofsize 4,and lines10 and 11givethem for awidowsizeof 8.The ranksfor theprocessorsthat arcnotselectedmaynowbe setto00to gelthe configurationof line1.The pro cedureto compute ranks isgiven inTahle2.'1.
Thisproced ure is dueto Nass imiand Sahni[lsIandits complexi t yisreadily seentobe0(.':).
2.2.3 Conce ntratio n
Ina data conc entra tionoperation,itbegins with onerecord,
r:,
ineachoftheprocessorsselectedforthisoperation.The selectedprocessorshave been rankedandthe rank infor mation isin a fieldIioftherecord.ASfoumc thewindow sizeis2k•Theobjective i$to move therankedrecords inench window to theprocessorwhosepositi on in thewindowequa lstherecord rank.Line0of Table2.5gives aninitial configurationfor an SIMD eil;ht processorwindow. Therecordsare shownaspairs withthesecond entry ineach pair being therank. Itis assumedthatthepro cessorstllalarenot
29
pro c e du reIlwJ.{k)j
{Com putethe rank ofselected processorsin windowso( size2k} {SIM D hypercube}
begin
{Initializefor .i7.e 1windo ws) lI(il
,=
0,iflIrlrclcd(i) thenS(i):=1 else5(i):=0;
{Comput e(orsizezklwindows) fo r":=0to~- 1 do begin
1'(i(·I)-S(i);
lI(i),=lI(i)+7'(i) ,(i. =1);
S(i)
,=
5(i )+7'(i);end:
Il(i):=00,(not...tlrrlcd(i)) ; cud; {of ml,k}
Table2.4:Progr a.m forSIMDrankingprocedure
30
~
line0 (-,00) (B,O) (-,00) (D,l) (E, ' ) (-,00) (G,3) (11,1) (B, O)(D, l ) (E, 2) (G,3) (H,4) (-,00) (-,00) (- , ~I (B,O) (-, «) (-,00) (D,I) (E,2) (-,00) (H,4 ) (G,3) 3 (B, O) (D,l) (-,«) (-,=) (H,4) (-, 00) (E, 2) (G,3)
Table 2.5:Example toconcentrate in SIMD hypercube
selectedfortheconcentration operationhave a tank of00.The resultof the concentrationisshownin line1. Let D,D,E,11 berepresentedindividal records.
Data concentra tioncanbedonein0(1.: )timeby obtainingthe agree- mentbetween thebitsof thedestinationofII.recordanditspresent locution intheorder 0,1,2, "', k - 1[151.Forexa mple, let us seck agreement on bito.Examiningthe initialconfiguration(line0),itcanbe seen thatthe destinationand presentlocationofrecordsB,G andHdisagr ee onbito.
To obtainagreement ,theserecordswiththe recordsin neighborprocessors alonghit 0 areexchanged. Thisgives theconfigurati onof line 2.Examining thebit1of destinationand presentlocatio n in line 2,it can alsobeseen thatrecordsD, E andHhavea disagreement. Exchangingthese records
31
proced u rer.rHlet:'draleCO,k);
{Concentra.te recordsG'in-..leet edprocessors.2·itthewindOW'size ) {H isthe renkfidd ofareerd]
begin
forb:::::0tok -ldo begin
F{,1i')_C:(i) ;
G(i)~'"(i),((G(i).R#00an d(G(i).R),I
i.»
or(I-'(i ).11#-00an d(F(i).Hh#-i&)) );
en d; cud: {ofNJIIt1':/I/""/r.}
Table2.6:Programfor procedur eto concentraterecords withtheirneighb orsalong bit 1 yieldsline3.Finally,letusexaminebit 2 of thedcsfi nation and presentIeeationoCre cordsin line3an ddeterminethat recordsEandG need tobeexcha nged with theirneighborsalongbit2.This rel tlltsinthe desired finalconfigu rationofline 1.
2.2.4 Distribu t ion
Da ta distri butionis theinve rseoCdataconcentration,Itbegi ns with records inprocessors0,••., rofiI.hypercub e wind owofsize2t.Eachrecord hasa destinatio nO(i)associatedwith it.Thedestinationsin ea ch window
UClUththat0(0)<D(I)..•<D{r ).The recordthatisinitiallyin proces- soriofthe windowiatoberout ed to theO(i)'thprocessoroC th e window.
Note thatr may vary fromwind owto window. Line 0 of Table2.7gives
J2
PE line
0 (A, ') (B, 4) (C,7) (-,<X» (-,00) (-,00) (-, 00) (-,<X»
I (-,0<) (-,0<) (-, 00) (A , ') (B, 4) (-,00) (-, 0<) (C , 7) 2 (A , 3) (-,0<) (-,00) (-,<X» (-,0<) (D,<) (C,7) (-,<X»
3 (-,00) (-, 00) (A,') (-,<X» (-,00) (B,<) (C,7) (-,<X»
Table 2.7:Example todistributein an SIMD hypercube aninitialconfigura tionfo rda tadist ributioninaneightI)TOre SSOTwindow of anSIMD hypercube. Each. recordisrepresen ted as a tuplewiththesecond entry being thedestination.Line1givestheresultofthedistribution .
Since dat adistribut ion is the inverseofdata concentration,itcan be carriedoutby ru nning theconcent ra tionprocedure inreverse [151. Tile resultis theprograminTable2.8. Lines2, 3,and 1 ofTable2.7give the configurat ions(orthe examplefollowingthe iterationsb ;;;;2,1,and 0, res pectivel y, shown in Table 2.8.
2.2.5 Generalization
The initial configuration forageneralizationissimilartothat (or a da ta.dist ri bution . Itbegi nswith records,(J,in processors0,..,rofII.
33
proceduredjslri~utc(G,k);
{Dist ribute records G.2kisthe window size}
begin
for" ;=k- ldownto 0 do begin
1" (i(~) )- (,'(i l;
G(;)_/,'(;) , ((G(;).IJ"coand(G(;) .D)o,,;.)) odl'( ; ).IJ"cc and(l'U).IJ).,H.)));
end;
end; {ofdi."" 'iblllr}
Table 2.8:Programforproceduretodistribu te records hyper cubewindow ofsize2k. Eachrecord,GU),has ahigh destination (:(i). 11associatedwithit, 0:::;jS;r,Thehigh destinatio nsin eachwindow are such tha tG(O).l1<G(I).1I<.·.0(1·).11. LetG(-I ).1I:::O.The record which isinitiallyin processorjofth e window is to berouted to processorsG(i-1)./ / ,O(i-1).11
+
1,. ",G(i)./fof thewindow,0S;i $ r.Not e that,.may varyCromwindow to window.Line 0aCTable 2.9give s an initialconfiguration for datageneraliza tion inan eight processorwindow of an SIMD hypercube.Each record isrepresen t ed asa.tuplewiththesecond entrybeing thehighdestinat ion.Line 1gives theresult of the generalizat ion.
Data generalization is done by repeatedl yreducingthe windowsize by half[15].Whenthewindow size ishalved,it shouldbe ensuredthat all record sneeded inthe reducedwindowarepresentinthat window.Beginning with a window.~izeofeight and line0 ofTable2.9, eachprocessor sends its
34
~E line
(A,') (B,') (C, 7) (-,00) (-,00) (-,00 ) (-,00) (-, oo) G (A,') (A,3) (A,3) (A,3) (B,.) (C,7) (C,7) (C,7) G (-,00) (-,00) (-,00 ) (-,00 ) (A,3 ) (D,.) (C,7) (-,00) F (-,00) (-, 00) (-,00) (-,00) (-,00) (B,') (C,7) (-,00) F (A, ') (B,') (C,7) (-,00) (-,00) (B, ' ) (C,7) (-, 00) G (C, 7) (-, 00) (A,3) (B,') (C,7 ) (-,00) (-, 00.) (Il, ') F (C,7) (-,00) (A,3) (B,') (C,7) (-,00) (-, 00) 1-, 00) F (A,3) (B,' ) (A,3) (B,' ) (C,7) (Il,') (C,7) (-, 00) G (B,') (A,3) (B,.) (A,3) (B, .) (C,7) (-, 00) IC, 7) F
Table 2.9:Exampleto genera li zeinanSIMO hypercube
35
recordtoittneighborpro cessoralongbit2. Theneighborprocessorreceives therecordillF,Line 21howl theFnluesfollowingthetransler.Next,some
""IandG'sare eliminated.This isdonebycomparingthehighdestinalklll o(
arecordwiththelowestprocessorindexin thesizefourwind owthatcontains therecor d.IfthecomparilOnsucceeds,then,tberecordisnot neededinthe size four window,Applyingthiseliminatio ncriterio ntoline 0result sinthe eliminat io nofno G,However,when thecri terionill applied totheF's afline 2,/0'(4)iselimin atedand theconfiguratio nof line3 will begiven. Atthis pointeach window ofsize four hasall therecordsneededinthat window.
Thereco rdsare,however,in bothFan dG.Toeoasclidetethe required recordsintotheC'Ialone, thefollowing consolidation crite rionisused:
replacea(i)byF(i) in caseF(i).11<G(i),l1 i.t .,of thetworec ord: ina PE,the onewit hsmaller high dettinationsurvives.
Applyingthe consolidat ioncriteriontolines 0 and 3resultsin line 4, Next, record.aretransferredalong bit1.TheFvaluesfollowing this tratllferere givenin line 5.Following the applicat ior.oftheeliminationcri- terion,theFvaJueoflin e6 isobtained.TheCval uel arcunchanged.When thecone clideti oncriterio n isapplied, theG valuesareasinline7.Line8 showsth eFvalue.following atransleralongbit 0,Theeliminationcriterion resultsin line1. Procedur egClIcmli:eshow nin Table 2.10 implementsthe generaliza tion strategyjUlt outlined,
36
proceduregrll emfi: c(G,k);
{General ize record sG.2~is thewindowsize}
begin
forb:=k-l downt o 0do begin
{Transfer to neighboring window ofsiee2~}
F(i1bl }_ C(i);
{Elimina tioncriterio n}
G(i).1f:=oo,(G(i).1f<i-ib-l..o)i F(i).1f;=oo,(F (i).1I<i-ib_1..u):
{Consolidationcriterion}
C(il'~F(i), (F( i).1f<C(i).II );
end;
end; {ofgcnemli:c}
Table 2.10:Programforproceduretogene ralize records
37
ForII.natu ralnumbe rI,jn..,~isthenumberobtained{romiby flipping allthebitsinthebin a ryreprese ntat io nofifro mthent h tothemthpla ce.
2.2.6 Merging and Unmerging
Giventwosorte dsequenceseachstoredinahyperc ubeof sizen/2,their merg ingcan be doneinO(log n]time.Mergin g two sortedsequ encescanbe doneby bitonicsort.
Abi/tHlicsequ o u:eisa.nonincrcas ingsequenceof number sfoUowedby ancn decrea ain gsequence, Eit her(orboth ) ofthese ma.ybe empty. The seq uenceha sthefor mXI ;:::X2?:".;::: Xk:::; Xk+l::;;'' ':5Xn,{orsom ek, 1S~,:5u.Thesequences10 , 9,9,4, 5,7,9;2, 3, 4, 5,8;7,6,4,3, 1;and 11, 2, 5, 6, 8 , 9 areex amplesofbit onicsequences.
Abit onlc sortis aprocesswhich sorts abitoni c sequence into either noni ncreasingorno n decreasin gorder. Suppose wearegiventwosorte dse- que neea"1:5112S...:5VIand10,:SIt'.!.:5..::5w"" Firs treverseone ofsequences. Thiscanbedone inO(logn] timebydoingthereversin,9 procedure.
Suppos ewehav e arevi sing seque nce{VI,lJ-l,",VI),we get seq uence (1". l'l_h' "•1'2.1' l }' Byconcatenatin g twosequencesto obtainthehitonic seque nce1'1?:111_1;:::, ..?:1'1?Ull:::; Wl:5 •• •:5tum=Xl?:X2?:••?:Xk
38
procedureSfMDRcvisil1!A.tI,k)j
{These quence is stored in window ofsize2kI{Re visinga.1Ithe elements in window s
or
size 2"jbegin
forb:=0 toL· - 2do A(jli) )...A(j) end;{afSIMDRcvisi ll g}
'Iable2.11:ProgramforSIMD Reversing
~Xk+15...=:;:r~wher e11
=
I+
III.The resultin g bilon icsequen ceisthen sorted usinga.hitonicsort toobtainthedesiredmerged sequence.So,Ioe examp le,ifitisdesiredtomerg etheac q uc nccs(2, 8,20,24) l\ncl (1,9,to, 11,12 ,13, 30),the bit onic sequ ence(24, 20,8,2, 1,9, 10,II,12,13,30) should be firstcreated,Bajcher's Monicsort[191isideallysuitablefor implementationon ahypercubecomputer. Batcher's algo r ithmto sort the hitonicsequen ce
Xl,' " ,xnint o ncndeere esingorderis give n in Table2.12.
Example: Considerthebitanic seque nce(24, 20,8,2, L 9,JO,11, 12, 13, 30). Supp osewewishtosort this intonondecreasing order. Theod d sequen ceis (24 ,8,1, 10,12,30)andtheevensequen ceis(20,2, 9,11, 13).Sortingthese,thesequence s(I,8,10, 12, 24,30)and(2, 9, 11, 13, 20) are obtained.Puttingthesortedoddand evenpartstoge ther,thesequence (I, 2,8,9, 10, 11,12,13,24,20,3 0)is given.Afterperfor mingthelltJ2J comparejexchen gee of step3,the sortedsequen ce(I, 2, 8,9, 10, 11,12,13,
39
Shp I: [Sortoddsubseque nce]IfII>2thenrecursively sortthe odd bitcnic subsequenceXI,X:hX5, •••into nondecreesi ng order
SI'-IIfl: [Sortevensubsequence]Ifn>~then recur sively sortthe evenbito nicsubsequence:f2,.1.'4 ,.rc"...intonondecrees- ingorder
Hhptt: [Compare/ exchange] Comparethepairsofelements.rj
and.£;+1(oriodd andexchangethem incaseXi>Xi+l
Table 2.12: Bitonic sort intonondecreasing orde r
20, 24,30)isobt ained .D
When11isapower of 2, the recursion in Table2.12can beunfoldedto obtain the comp aring/exchangingalgorithmof theprogramin Table2.13.
Ineachiterati on ofthe whileloop,each sequenceelement ispaired with exactlyoneothersequence elementthatisadist ance(ffromit . The pairs neeform edfromleft to tight. To obtai n a nondecreasing sequence each com- ['aring/ exchangingcausesthe smallerelementof the pair tomovetothe left positi on.Ifa nonincreasingsequence is desiredthesmallerelementismoved to theright.Table 2.14showsaneightelement bitonicmer gethat result sin a nondecrcesingsequence and Table 2.15 givesanexa mple thatresult sina nonincreesing sequence. Theexamplesassumetheelements to be sortedarc storedin processorsofahypercube withoneelement per processor.As can be seen, theelements that form eachofthepairs for the compa ring/exchanging opera tionareinprocessors thatarehypercube neighbors. Hence each iter-
40
procedureIJiloll;CSOr/(II);
{Sort the bitonic sequence;'1, ··..I'Il}
{IIisa power of2 } begin
d=JI/2j whiled»0do begin
compare/ exchan ge elementsdapart d=fl/2;
end;
end; {oflJilullirSorl}
Table2.13:Iterati ve bitonic sort forII, Itpower of2 arion ofthe whileloopof theprogramin Table2.13lakes0(1)timc011a hyper cube .Thetotaltimetosortan"element bito nic sequenceistherefore O(logn}.
Given asorted sequenceofclement s sothatpart of theelements belong to a setA(thu sthe remainin g belong toIf)andeach elementknowsthe corresp onding rank inAorA, permut e the sequencetoreturneachAand A.Thiscan be doneby running the merging algorithmin reverseord"'f Therefore,Unmerging can bedoneinO(logII)time.
2.2.7 Sor t in g
To sortnelementsusingbitonic sort , we begin with sorted sequencesof sizeone.Adjac entpairsof theseform bitonic sequences that arc sorted (in parallel)to obt ain sortedsequencesof size two. Thesortingisdone such
41
PE
line 0 1 2 3 4 5 6 7 d
o 7 6 4 0 1 23 5 4 '-=LI:=~-_I
)J
1 2 3 0 7 6 4 5 2
~.I '---t:=-.J
1 0 3 2 4 5 7 6 1
I---J 1---1 l---J
3 0 1 2 3 4 5 6 7
Table2.14:Power of 2 bitonic sort[nondecreasingorder)
42
line 0 1 2 3
PE
5 6 7 d
o
7 6 4a
I 2 3 5 4'---<=F .cC[::'-".'.,, '.J .
7 6 3 5 1 2 30 2
~ '-c:::...l
21 6~ U ~_.~ 1
3 7 6 5 4 3 2 1 0
Table 2.15:Power of2bitoniesort(ncninereesingorder]
43
that thesize two sequences are alternatelynon-ncreesingand nondecreasing sequences[I,e, the lirst , third, fifth,... ,sequences arenonincreasing andthe remainderarenondecreasing).Consequentlyevery pair of adj acentsize two sequencesforms abitonic sequenceofsize fourwhichcanbe sorte d using bito nicsort. The sizefour sequences are alsosortedalternatelyinto nonin- creasing and nondecreasin gorder . Continuinginthis way,we canobta inII.
sortedsequenceofsize11after log"bitonicsort ingsteps.Note thatifthe sortedsequence istobe innondecreesing order,then thelast biteniesort stepshouldsort thefirstandonlyresulting sequence intothisorder.The total timeforthe sortingis 0(log2II).
Exam p les Sort ingfollowing sequence c nmLhepdgjl k beio
into nondecreasing orde rand that a<b< ...<0<p.Thepairs (c n), (m f),(h a),(pd),(8 j),(I k), (be)and(i0)are bitonic sequencesthat are sorted byusingbitonic~ortto obtainthe:sequence:
n cI m h a daj gk Lebie
Notethattheodd pairswer e sorted intononincreasingorderwhilethe even ones weresorted intonondecreasing order.Nextlet usconsider the adjacent sequencesof lengt hfour.These are(ncfm), (ha dp),(jgkI) and (ebi0).
Sinceeach is a bitonicsequence,it maybe sorte d usingbit onie sort . The resultis:
44
nmfc adhplkj gb ei o
Once again theoddsequences aresorted into nonlncrensingorde r whilethe evenones aresorted int onond ecreasing order. Two bitonic sequencesof length eight,(nmfcad h p) and(lk j g be i0),nrcgiven.Sortin g them willgive thesequence:
pnmhfdc a b eg ij kl o
Sortingit intononde crcasln g order resultsin the sequence:
ab cd ef g hijklmn op
o
2.2.8 Random AccessRead
Ina randomaccessread (RAR),someofthe processor s of thehyper cube wish toreaddata from other processorsof the hypercube .LetA(i )bethe PEfromwhich proceescriwishestogetdata.Thedata tobe obtainedis D( A(i)).In caseFEidoes not wishto read dat afromanyothe r PE,then A(i)""00. Line 0 of Table2.16givestheAvaluesfor anexample RAR inaneight processorhypercube .Not ethatinanRAR,several processors may read (rom thesame PE.An RAR can bedonein0(log11l)tlme inlIn1/
processorhyper cubeusingthealgorithmof the program shown inTa ble 2.17 [15J.
-, - ,
N~
46
Sl epI: Eachprocessorf/crea tes a triple(..t{II),II,jlll.'{)where flagisa Boolean entity that.isinitiallyhue.
Slep2: {Sortl Sor t the triplesintonondecrensin g order of tileread addr esslI(fl) .Tripleswiththesameread address arc in nonde creasingorder ofthePE indexfl. Furthermore, during thesort, thelingent ry ofatripleis settofalse incasethere isa tripleto itsrightwiththesa me rend address.
Step~ : [Rank]Processorwith triples whosefirst compone ntof00 and whose thirdcompo nent(i.r.oJIIlY)istrue areranke d.
Stc114: Eaehprocessorb thathasa triple (A(I/),Il,true) with A{a)f.00crea tes atripleofthe form(fi(l,) ,A(II), h) whereR{b)is therank computed in thepreceding step.
Step5: {Conce nt ra te]The triplesjustcreated are concentra ted.
Step6: Each processor r:that has a concent rate d triple (r(b),A(a),b)create s a tupleoft heform ((:,lI{tt)}.Note thlLtsincec
=
U(IJ)thistupleisjustthefirsttwocom- ponentsof thetriple.Slcp7: [Dist ribute] The tuples aredistr ibutedusingtheseeond com po nent asthe destinationaddress.
Slep8: EachprocessorA((I)that receivesa tuple(I:,A(II))ere- atesthetuple(c,D( A(n))) .
SICI!9: [Concentrat e]Thetuples create d in theprecedi ng step areconcent rat edusing the first componentastilerank.
47
Slcp/0: Each processorc that received atuple [c,D(A(a»)in the last step aha has a tripleofthe form(R(b),A(a),b)that it received inStep5(noticethatc=R(b)).Usingthis tripleandthetuple received.in Step 9it createsthetriple (1.,D(A(n))).
StepfI: [Generalize]The tuples(b,D( A(a)))are generalizedusing thefirst componentas thehigh destination.
Stcpl!l: Each processorthatrecei veda tuple(6,D(A(a») in Step11alsohasatriple(A(a),«,jlag)thatit at ob- tained as aresultofthesort of Step 2. Using informa- tion from the tuple andthetripleat createsa newtuple (a,D(A(a))). Processe-sthatitdid notreceive atuple U5e thetriple theyreceivedin Step 2 and form the tuple (n,- ).
SfepIfl: {Sort]The newly createdtuples ofStep12are sort edby theirfirst component.
Table2.17:Algorithm{or a randomaccess read
48
Consider theexample of Table2.16.InStep1,eachprocessor creates a triplewith thefirstcomponent being theindex oftheprocessorIrom which itwants toreaddata; thesecondcomponentisits own index;end thethird componentisa flagthat is initiallytrue(t )for alltriples.Then, inStep 2 the triplesaresorted on the first component.Triplesthat havethesame first componentare inincreasingorderoCtheir secondcomponent.Withineach sequence oftriples that havethe same firstcomponent only thelastonehns a true flag.Theflag(ortheremaining triplesisfalse.Thefirst components ofthe tripleswith atrue flag giveallthe distinct processorsfrom whichdat a is toberead.
Processors 0, 2, 3,and 5 are ran ked in Step 3. Sincethe highestrank is three,datais toread from only four di.dinctprocessors.In Ste p 4,the rankedprocessors createtriples oftheform(ll(b),;1(11),11).Thetriplesnrc thenconcentrate d. Processors 0 through3 receive theconcentrate dtriples and form tuplesofthe form (c,A(a)). Becauseofthesort of Step 2,the secondcomponentsoCthese tuplesarein ascendingorder.Hence, theycan berouted totheprocessors givenby the secondcomponentusinga dat a dist ribut ion asinStep 7.Thedestination processors of these tuplesarc the distinctprocessors whoseda.taistoberead.Thesedestina\.ionprocessors create,inStep8, tuplesofthe form(e,IJ(A(n)}) wherec istheindexof the processor that originatedthetupleit received.Thesetuplesarcconcentrated inStep9 usingthe firstcomponentes the rank,
49
In Step10,thereceiving prccesscee{i.e.,0 through 3) usethe triples receivedin Step 5 &J1dthetuplesreceived inStep9to createtuplesof the form(ii,IJ{A{II))).The tint com ponent isthe induof the processortha t originated thetriplereceived in Step 5.Sincethe trip lesreceivedinStep5are theresult of aconcent ra tion,thefint comp onentofthe newly forme d tuples arcin ascendingorder. Thetuples are thereforeread yforgenera lizati on using thefirst component ...thehigh index.ThisitdoneinStep11. After thisgeneralizationwehavethe right numberof copies of eachdat a. For examp le, twoprocessors(4and 5) wantedtoread fromprocessor 7 and wenow havetwocopiesofD(7) .Comparing thetriplesof Step2 and the tuples of Step11,weseethatthesecon dcomponentof thetriples tellsus where the data in the tuple!i.toberoutedto.InStep12 we createtuples thal a:lntai n thedestinat ion processorandthe data .Sincethedest ina tion addresses arenot in ucending orderthe tuplescannotbe routedtotheir destinat ion processors u.inga distribute.Ratb ee,Ihey mullbesortedby delltinat ion.
2.2.9 Ran d o m AccessWrite
A random&CCC'ISwrite(RAW) islike arandom accessreadexcept thai processors wishtowriteto otherprocessors rat her thantoread fromthem. A rand om accesswriteueesmanyofthe basic stepsusedby arand om access read.Itis,however,quite&bitsimpler. Line0 ofTable2.18 givesthe
50
indexA(i)ofthe processorto whichprocessor;wants towriteitsdataV(i) . A(i)=00wh enprcceescr;i.5not towriteto anotherprocessor.Observe that itisj-cesible £OrI5everaJprocessorstohave thelame writeaddre"A.
Whenthishap pen., it ils&idthattheRAWhaa eellisiens.Itispcssjbleto formulateseveralstrategieatohandle collision•.Threeoftheseare:
1.Arbitrar yRAWofaUthe processorsthatattemptto write tothe sameprocessoreXll.ctlyonesucceeds.Anyof thesewritingprocessors maysucceed,
2.High est/lowestRA\Vofalltheprocessorsthatattempttowriteto thesameprocessorthe one with thehighest(lowest )indexsucceeds.
3.Com b in ingRA'Valltheprocessors succeedin gettingtheirdatato thetargetprocessors.
Considerthe e:xam pleof line0of Table2.18.InanarbitraryRAWany one of0(0),0(2)and0(7)will gettoproeessce3.Oneof/J(I)andV(5) willgettoprocessorO.And0(3)and/) (4) willgetto processors " and6, respectively.Inahighe5tRAWD(7),/)(5),D(3),and0(4),respectively, gettoprocessors3,0,4,and6,respectively.InalowestRAW0(0),/J(J), D(3)andD(4)get toproeesecrs3,O,4,and6, respectively.Ina combining RAW0(0), D(2),andD(7)allgetto processor 3,Both0(1)andD(5)get toprocessorO.And0(3)andD(4)get toprocessors4 and 6,respectively.
51
~-=: ~
~
-r~ ~~ _~ ~ i.e c.
~ ':::"~~ ~ ~~ 5'~
:i..c.
e.-,,;-::-~
M
M~~
!i.!i
- -
....:...:..o~~ o~~ o~~
-:;-~
o
M~~
!i~
"
.s
'0•
The steps involvedin anarhit rary'RAWarepveninTable2.19(15).
Letus go through the example inTable 2.18.Eachprocessor first creatt'S tripleswhose first componentisthe index,A(a),ofthe processorto whichit isto write.Itssecondcompon ent is thedata,l1(fl),to bewritt enand the third component istrue.Thetriplesarethen sortedonthefirst com ponent.
DuringthisSOT!thef1.ag entr yof a triple is changed tofalse incase there is atriple with the same writeaddresstoitsright .Onlythetriples with
atruef!>,~areinvolved intheremaind eroftheAlgorithm.Noticethntfor
eachdistinctwrite addren therewill beexactlyonetriplewith n truenag.
The processorsthat have a triple withatrue flagareranked (Ste p 3)anti theseprocessors create new tripleswhose firstandsecond componentsarc thelame as inthe old tripleshut whose thirdcomponent isthe rank.The triplesarcthen concentratedusingthisrank informa tion.Since the triploo arein ascending order ofthe write addreues(firstcomponent)they lila,.be routed to theseprocessorsusing a data distributeoperation.Notethat for Step 6 thethird component(i.e.,rank)ofeachtriplemay bedroppedbefore thedistribu tebegins.
Thecomplexity of arandomaccess writ eis determined bythesortJ\cp which takes 0(log2u}timewhere11isthenumber of proeesecte.
Ahighest (lowest )RAW canbe donebymod ifying theprogramill Table2.19 slightl y. Step1creates 4-tuplesinstead oftriples.Thefourth
53
Slip I: Each processor1/creates atriple(A(a),D(a),flag) where flfl.fJisaBooleanentitythatisinit ially true. ,"'!fIJt.!: (Sort]Sortthetriplesintonondeereasing order ofthe
write addressl1(a ).Tiesarebrokenarbit rarilyanddur- ingthesort theflagentryof a tripleissettofalseincase thereis atripletoitsrightwiththe same writeaddress.
Slfjl..t: [Rank]Processorswith tri ples whosefirst comp onent is
not0;:>and whosethird component(Lc.,J/'Ig)istrueare
-anked .
."'lIpf: Eachprocessor bthl\t hasa triple (A((I),D(a),true)with 11((1) '" 00 createsa.triple of theform(A((I),D(a),R(b)) whereU(II)is therankcomputed in thepreceding step.
S/flJ,r,; [Concent rat e] the triplesjustcreat edare concentra ted.
,"'!r'JI'i: [Distri bute] theconcentratedtriplesare distrib uted using thefirst componenta.sthed.,tination address.
Table2,19:AlgorithmCor an arbitrary rando m accesswrite
54
component isthe index oftheoriginating processor .In the sortstep (Step2) ties arebrokenby thefourthcompone ntin suchaway thattherightmo~l 4-tupleinanyseque nce withthe samewriteaddr ess isthe -t-tuplc tobe suc- ceeded(i.e.,highestof lowes t fourthcomponent in thesequence) .Following this the fourthcompo nentmay be droppe d fromeach-t-tupl c. The remAining stepsareunchanged .
The 6leps{or a combin ing RAW arealso similar tothoseintileprogram shownin Table2.19. When theranking of Ste p3isdone, nversion of procedur e rank(Ta ble 2.14) isused,which docsnot containthelastline (R(i):=oo,(n o t$r.lcclc.l(i ))).As aresult processor0{Table2.[8) has a rank of0 and processors 2 and3have a rankof1.During theconcent ration step (St ep 5)morethan onetriplewilltrytogetto thesa meprocessor.
Procedur econce nt rat e(Tabl e2.14)is modifiedto combine together triples that have thesame rank.Thesemodi ficationsdonot changetheneymptotic complexityof the RAWunless thecombiningope ra tionincreasestilelriple sise[asin a concatenate). Incuerldata valuesaretorench theSIUliC destina ti on ,thecomplexit yis O(logl"+rilogII).
2.2.10 Precede
Given two sortedlis tsA=(111)'11... ,II..)andIJ ""(I'I,lI~,...,1J,,),we definethe predecessor of(Iias the leastelemen tIljinIJpreviou s toIt;in tile sorted H3t ofAandH,i.e.hj:5(Ii::;/Ij+l'Ifthere is nosuch Ilj,weassumeits
ss
predecessoris theleastelementb",inIi.The Precedeope rationcomputes, (oreachelement(/iinA,its predecessorinH.ThePrecedeoperation can be performedin thefollowing way:Foreach element in listB,we assign a flag.
Merging two sort ed listsAand11,we can get asortedlistC.Performing rankoperatlcn, foreachelement(Ii,we can get its predecessorinB.The Precedeoperationcan bedonein O(logII) time.
2.2.11 Summary
In thissubse ction, weprovidea summary.In thefollowing,1"01and IV arepowersof2.Theyrepresent the size(i.e., the numbe rof processors)in a eubhypercube.Unless otherwise stated , the size of thefull hyp er cubeis denoted byP.
1.Maximum
Ta sk .Thiswork s oneach dimensionk,k=log IV, subhypcrcu beof an SIMD hypercub e.Thehypercube PEI=ill'
+
'1,0:s:q<Wis the r/th PE inthe j'thdimensionksubhype rcub e.ThisPE computes,in its5'register,the maximum of theAregister values oftheO/ththough r/thPEs in its sub hypercube,0:5q<1-1',0 :::::i<w,where"' isthe numberofsubhype rcubesof dimensionk.Com plexity: O(q.
2. Ranking
56
Task:Rank the selected processorsineach size2~windowoitheSIMI) hypercube.
Complexity : O(! ·).
3. Concentration
Task:LetG(i).H be therankofeachselectedprocessoriinthewindow ofsize2kthat it is cont ainedin.ForeachselectedPE,i,therecord G(i)is sent tothe G(i) . fl'th FEinthesize2kwindowthatcontains PEi.
Complexity: O(!')' 4.Distribution
Task: Thisistheinverseof aconcentration.
Complexity: OU')' 5.Gener aliza t ion
Task.:EachrecordU(i)hasahighdest ination(,'(i ).II. TIlehigl.
destinationsineachsize2kwindowarein ascendingorder.Assumethat G( -1).//
=
O.The recordinitiallyin processoriis route dtopro cessors G(i-1).11throughG(i ).11of thewindowprovidedthatf:{i)./1fQ("IfG(i)./1