The GPU-Oriented Tree Representation Based on the Method of Finding the Remainder

(1)

Method of Finding the Remainder

VladimirVoronkin

vl.voronkinraxperi.om

Andrey Malikov

malikovnstu.ru

Elmira Azarova

elmira.azarovamail.ru

AlekseyShhegolev

a.wegolevgmail.om

North-CauasusFederalUniversity,

Institute ofInformationTehnologiesandTeleommuniations,

RussialFederation

Abstrat

Thispaperdesribesanewmethodofthetree-likestruturerepresen-

tationindatabases.Thedevelopedmethodofndingrelationshipsbe-

tweennodesisbasedontheuniqueprimefatorizationtheoremwhih

statesthateverynaturalnumberanbewrittenasaprodutofprime

numbers. ThedevelopedmethodisimplementedusingNVIDIAgraph-

isproessingunitparallelomputingandNVIDIACUDAtehnology.

Thepaperanalyzestheproessoftreeonstrutionwiththedeveloped

methodapplied, representsthealgorithmsofGPU-basedproessingof

treesodedusingthedevelopedmethod,andalsotheresultsofperfor-

manesnapshot.

1 Introdution

Computersystemsaredevelopingrapidly,proessorarhiteturesarehanging,themainanddisk memorysizes

are inreasing. Nowadaysthe rapid inrease of powerand data volumes allows for the development of high-

performanedatamanagementandproessingsystems.

Alargeamountofdataintheworldin oneformoranotherisurrentlyrepresentedasahierarhyormutual

dependenywheresomepiees ofinformationdepend ontheothers.

What ismore,datamanagementsystemsarebeginningto migratefrom thedisk-orientedtehnologyto the

memory-oriented. The development of DBMS has led to integrating in-memory tehnology support by many

databasemanagementsystems. [GBE13℄

Data proessing in themain memory enablesaonsiderableinreaseof thedata proessing ratein favorof

safety. ModernDBMS, supportingin-memorytehnologyproessdatainthememorymirroringhangesto the

disk.

Theappliationofmemory-orienteddataproessingtehnologiestogetherwiththenewmethodsofrepresent-

ingtreesusingGPUomputinganinreasethedataproessingrate.

This researh aims to develop a new method of hierarhy-representation in databases that exhibits high

performaneintreemanipulation operations.

Copyright

^by^the^paper'sâuthors. ^Copying^permitted^for^privateândâademi^purposes.

In: A. Editor, B. Coeditor (eds.): Proeedings of the XYZ Workshop, Loation, Country, DD-MMM-YYYY, published at

http://eur-ws.org

Copyright c 2017 by the paper’s authors. Copying permitted for private and academic purposes.

In: S. H¨ olldobler, A. Malikov, C. Wernhard (eds.): YSIP2 – Proceedings of the Second Young Scientist’s International Workshop

on Trends in Information Processing, Dombai, Russian Federation, May 16–20, 2017, published at http://ceur-ws.org.

(2)

Thisworkisdevotedtothedesriptionofthedevelopedmethodofrepresentingtreesindatabases. Thedevelop-

mentofthemethodwasinitially-orientedtowardsthemaximumparallelizationoftreeproessingandmaximum

independene ofnodeodefromothernodeodes. Thepaperfousesonaradiallynewmethodofrepresenting

trees inrelationalandnonrelationaldatabases,andalso demonstratesthattheGPU omputingofhierarhial

relationshipsanbemoreeetivethantheCPU.Thedevelopedmethodofrepresentingtreesisorientedtowards

theparallelproessingmodewiththeuseofaGPU.

2.1 Methods of RepresentingTrees

All themethodsofrepresentingtreesin databasesanbeput intotwoategories:

methodsofrepresentingtreesin non-graphdatabases;

methodsofrepresentingtreesin graphandhierarhialdatabases.

Thefollowingmethodsbelongtotherstategory:

Theadjaenylistmethod ofrelationalhierarhialmodeling.

Theadjaenylistmethodisbasedonthestorageofdiretparent-hildrelationships. Eahhildinatreelike

hierarhyrelatestoitsparentwhihisatonelevelhigher. Usingthishierarhyproperty,atable-levelrelational

dependene anbereated.

Thematerializedpathmethod.

Amaterializedpathisastringeldthatonsistsofelementnamesseparatedbyaseparatorharater[Tro03℄.

Parents'namesofanodeforwhihamaterializedpathisbuiltareused asthematerializedpathelements.

TheNested Setsmethod.

The Nested Set method implies that there are twoadditionalelds for the node desription: atree oding

using thealgorithmforthenestedsetsodingintroduedbyJoeCelko[Cel12℄[Cel℄[Cel04℄, isdoneasfollows:

movingdowntheleftsideofatree,it'sneessarytotravelthroughallthesubtrees beginningwiththeleftmost

totherightmostandassigneahnodeanautoinrementvalue. Whenmovingdownthetree,anautoinrement

valueisassignedtothevariableresponsiblefortheintervalstartingpoint(½leftbound); whenmovingup-to

thevariableresponsiblefortheintervalend(½rightbound).

Thenestedintervalsmethod.

The nestedintervalsmethod, introdued in [Tro05℄, is basedon thematerialized path oding using anite

ontinued fration

[q 1

^,

q 2

^,

. . .

^,

q n ]

^, ^where

[q 1

^,

q 2

^,

. . .

^,

q n ]

^, ^where

q 1

^,

q 2

^,

. . .

^,

q n

âre ^the ^stepsôf â materialized path. Rationalnumbers

a/b

^,^where

a >= b >= 1

^and

GCD(a, b) = 1

âreûsed âs^treeêlementôdes.

Asanexample,onsiderhowtheodeforamaterializedpath½

3.2.2

îs^reatedûsingâôntinued^fration:

3 + 1

2 + ¹ ₂ = 17 5

Totransform a materializedpath to arational expression the onvergene prinipleanbe used, while for

the inversetransformation the algorithm forthe gradual trunation should beused, that in ase of arational

expressionisthesameasEulid'salgorithmforndingGCD[Gai99℄.

Modiationsandombinationsoftheafore-namedmethods.

Therearealsomodiationsofthegivenmethodsthatanxtheinherentproblems[Kol09℄[vor14℄[MT11℄.

For example, in [MT11℄ the improvement of the nested intervals method is introdued through the interval

odingintheresiduenumbersystem(RNS).UsingRNSenablesreduingthelengthofnumbersusedbystoring

anumberasashort lengthremainder. Representingnode odesasnumbersexpressedin RNS enablesparallel

proessing of not only dierent tree nodes but also one and the samenode remainder. Thus, the maximum

parallelization of data proessing is inreased. In [vor14℄ the method of storing nestedintervals expressed in

RNS isintroduedthatsolvessomeproblemsofsortingnumbersexpressedin RNSandtheindexonstrution.

Suh methods as the MaterializedPath, nestedintervals and modiationsof thebothinlude the data on

hildrenand/orparentsinthekeywhileodingandanbeproessedinparallelwithsuienteetiveness.

Themethodsofrepresentingtreesingraphandhierarhialdatabasesarediulttolassifyasmostofgraph

databasesusetheirownformatsandstruturesofdatarepresentations.

Insomeasespointers areusedforlinking neighbornodes. Thepointersarebound tothedatathat appear

in ars. Initsstruturethishierarhymodelissimilartotheadjaenylists.

Other [Kup86℄[KV93℄graphdatabasesuseatablemodelforstoringrelationshipars.

AnotherformatofstoringgraphdataisXMLandXML-basedformats[MS11℄[BELP℄.

(3)

W ORKS Name:Larry

Page

W ORKS Person1

Company1

Person2

Company2

Person3 Sine1998

Sine2001

Sine2010

Name:Joshua

Bloh

Name:Google

Name:Orale

Name:Brian

Goetz WORKS

Traverserelation Traverserelation

a)Graphrepresentation

IndexlookuponPerson.id

Person

Worksin

Company

Id Name

PersonId

CompanyId Sinse 1

2

3

N

JoshuaBloh

BrianGoetz

...

LarryPage

1

2

3 1

2 1

IndexlookuponCompanyId

1998

2001

2010

Id Name

1

2

...

N Orale

...

Google IndexlookuponCompany.Name

b)Relationalrepresentation

Figure1: Representationtypesofdependentdata

Fig. 1band1a illustratethe aforesaid. Fig. 1billustrates arelationalrepresentation ofhierarhialdata in

DB, whileFig. 1aillustratesdatarepresentationintheformofagraph.

Theillustrationdataexemplifythestorageofhierarhydataindatabases.

One of the distintive features of graph databases is the base hosen for database objets representation,

relationsbetweenthedata,omplexobjetsandtheirattributes.

All graphdatabasesarebasedonamathematialdesriptionofagraphthatdeterminesgraphmanipulation

operations. A graph model depends on the base, for example: a direted or undireted graph, marked or

unmarkedgraphnodes,weightedandunweightedars.

SomemodelssuhasGOOD,GMOD,G-Log,Gramrepresentboththeshemeandentityasanameddigraph.

LDM is an exeption whose shemes are digraphs with leaves representingdata and internal nodes represent

strutured data. LDM onsists of atable with two-olumns,eah ofwhih assoiatesentities with data types

(primitive,tuple orset). Amoredetailed desriptionofgraphmodelsisrepresentedin [AG08℄.

2.2 GPU Data Proessing

Graphisproessingunitseemstobeanadvanedhighperformaneplatformforomputing. Thebasidistin-

tivefeatureenablingGPU'shighperformaneisitsarhiteturehavingagreatnumberofALU. TheGPUand

CPUarhiteturesareomparedin[PT11℄. Anumberofworks[BS10℄[HKOO11℄pointtotheadvantageofGPU

proessingofdataoverCPU.On average,thequerieswereproessed530 timesfasterin relationaldatabases.

Someworks[PSHL10℄[BMCN09℄showtheinreasedrateofsortingdatausingGPU.Asfarashierarhialand

graph strutures are onerned, the works [HN07℄ [HKOO11℄ [WO10℄ are worth mentioning. in whih GPU

graphomputationsaremadeandwhihdesribethespeedupofdataproessingusingGPUomparedtoCPU

(whenproessinggraphstrutures).

Hereafter,mentioningGPUomputingmeanstheusageofNVIDIACUDAtehnology.

CUDA (Compute Unied Devie Arhiteture) is a NVIDIA tehnology designed for the development of

appliationsformassiveparallelomputingunits. Due toagreatnumberofALU.GPUimplementsthethread

modelofomputation: thereareinputandoutputdatathatonsistofthesameelementswhihanbeproessed

independently. Theelementsproessingisperformedbythekernel.

A kernel is a funtion exeuted on GPU and alled from CPU. A kernelis omposed of agrid of multiple

threads groupedintoahierarhy.

The highestlevel grid orrespondsto the kerneland groupsall thekernelthreads. A grid isaone-or

two-dimensionalarrayofbloks. Eahblokisaone/two/three-dimensionalarrayofthreads. Furthermore,eah

blokisanentirelyindependentset ofinteratingthreads.

3 Priniples of Tree Constrution

By ½tree wewill understand a onnetedayli graphomposed of aset of nodes

n = (k, d)

^, ^where

k

^is ^the

nodeode,

d

^is^data^assoiated^with^the^node:

T = { n } = { (k, d) }

Letusdeneforanode

n

^operations

key(n) = k

^aess^to^the^node^ode,

data(n) = d

^aess^to^data^assoiated

withthenode.

(4)

Denote by the node identier aprime numberunique within node Ids. Let us introdue aset ofprimes

I

^,

representingprimesin inreasingorderof theirvalues:

I = s 1 , s 2 , . . . , s m , . . .

LetusallaprodutofthenodeparentidentiersbythenodeIdthe node ode.

Theodeofatreerootisthe rootidentier.

Letusintroduetheoperationofgettingthefollowingprimenumberfromaset pfprimes

I

^:

next() = s i , i = i + 1

Primesareneededfortreeonstrution,beausetreedeompositionshould beunique.

Theusageofoneandthesameidentierformorethanonenodeisimpossible.

A treeisformed asfollows:

1.Aprimeis hosenasarootnode.

2. Eah node added to thetree has aodeequal to theprodutof the parent ode by theidentier of the

nodeadded.

Eahtreenodeintheodeenodestheentirepathstartingfromtheroot. Thismethodofnodeodereation

issimilarto theMaterializedPathmethod.

4 Tree Proessing Operations

Letusshowadmissibletreeoperations. Letusintroduethefuntion ofndingtheremainderondividing

a

^by

b

^:

mod(a, b) = a − b · ⌊ a b ⌋

Thefollowingtreeoperationsareadmissible:

1.

add(T, d, n parent )

^,^where

n parent

^is^parent^node,

d

îs^dataâdded^to ^the^treeôperationôf âddingâ^node

to thetree.

Preondition:

exists(T, n parent )

Realization:

T := T ∪ { n(key(n parent ) · next(), d) }

⁽¹⁾

Prediate

exists(T, n)

^determines^if^the^node

n

^existing ⁱⁿ^the^tree

T

^:

exists(T, n) := ∃ n ∈ T

2.Operationofsearhinganodewiththeode

k

ⁱⁿ ^the^tree

node(T, k)

^.

node(T, k) := { x ∈ T | key(x) = k }

⁽²⁾

3.Operationofsearhingnodeparentsin thetree

parents(T, k)

^.

parents(T, k) := { x ∈ T | mod(k, key(x)) = 0 ∧ k 6 = key(x) }

⁽³⁾

4.Operationofsearhingadiretparent

parent(T, k)

^.

parent(T, k) := M AX(parents(T, k))

⁽⁴⁾

5.Operationofsearhingasubtree

subtree(T, k)

^.

subtree(T, k) := { x ∈ T | mod(key(x), k) = 0 }

⁽⁵⁾

6.

children(T, k)

^operation^of^searhing^diret ^hildren.

children(T, k) := { x ∈ T | k = key(parent(T, key(x))) }

⁽⁶⁾

7.Operationofsubtreeremoval

remove(T, k)

^.

(5)

T := { x ∈ T | mod(key(x), k) 6 = 0 }

⁽⁷⁾

8.Operationoftransferringasubtreerooted

n old

^to^a^new^parent

n new move(T, n old , n new )

^.

Preondition:

exists(T, n old ) ∧ exists(T, n new ) ∧ mod(key(n new ), (n old )) 6 = 0

Realization:

T := T ∪ { x ∈ subtree(T, key(n old )) |

| n(recalcKey(T, key(x), key(n old ), key(n new )), data(x)) }

⁽⁸⁾

where

recalcKey(T, k, k old , k new )

îsânôperationôf^nodeôderealulation:

recalcKey(T, k, k old , k new ) := k

k old ∗ k new

Consideringthepossibilityofexeutingparalleloperationswithatreeasaset, thefollowingfeaturesshould

benoted:

1.ThepossibilityofparalleltreeproessingusingGPU.

Treeoperations(2,3,4,5,6)that donotmodifythetreehavenoneedforinterationwithothernodes. All

suh operationsanbeproessedin parallel orsimultaneously. Parallelization oftree manipulation operations

uptoonethreadpernodeispossible.

2.Theorderofreordsin treerepresentationinthememoryinuenesonlytheorderofreordsintheresult

set anddoesnotinuenethenumberofreordsin it(the resultset).

3.Theusageofprimes inatreeinthesequentialorder isoptional.

5 GPU Computation of Trees

Eah node representsapair: {node ode: pointer tonode properties}. Let usallthis pairareord. Sine the

nodepropertiesanrepresentasequeneofdynamially-sizedfreeelds,eahnodeodeisassoiatedwiththe

pointertopropertiesloationin theleratherthanthepropertiesthemselves.

Intheledataisstoredasasequeneofreords. Whendownloading,reordsaredividedasfollows: keysare

reordedinto GRAM,pointers totheproperties arereordedintoRAM. Downloadeddata(keysandpointers)

is stored asa sequene. For suh form of representation it is important that the order of reords in RAM is

stritlythesameasinGRAM.WhenswappingsomenodesinGRAM,itisneessarytoswapnodesinRAMin

asimilarway.

When exeuting aquery, say, alookup query, all the reordsin thetree are reviewed. Thequery resultis

omputedon-the-y,omputingthedesiredresultwhileexeutingthequery. Thusthesearh isnotnarrowed.

Tosavethetime oftheproessingresulttransfertoRAM,theGPUmakesabitmapof thequeryresult,in

whihthebitsetdenotesanodeorrespondingtothequery,theremoveddenotesanon-orrespondingnode,Fig.

2). Thebit mapis formedin kernels(a kernelisafuntion exeutedonthegraphisard)of queryexeution;

asaresult,onemapbit orrespondstoeahnode. Forexample,witha4-bytekeythesize ofdatatransmitted

throughthebus(betweenGPUandCPU)isdereasedby32times.

Toget ertain nodes in the resultingseletion, it is neessaryto nd set bits in the bit map. The ordinal

numberofasetbit isanumberof astorageellinRAM(andGRAM) in whih thedataaddressof thefound

node(thefoundnodeode)isstored.

Let us onsider theproess of exeutingsometree operations. Eah operationis exeutedfor eah node in

theset.

ThewholetreeproessingisrealizedonGPU.

Theresultofomplexqueryproessing,forexampleaommonparentsearh,isformedasaresultofuniting

bit mapsofeahqueryresult,thus avoidinglongoperationsofeahnodesearhing.

Forexample

commonP arents(A, k, r)

ôperationôf^searhingômmon^parents^for^two^nodes^with^theôdes

k

and

r

^looks^like^this:

commonP arents(A, k, r) := parents(A, k) \

parent(A, r).

Forndingommonparents,themostrational(withregardtothetimeofndingasolution)methodisexeuting

twoqueriesforparentsearhforeahnodeandunitingbitmapsofthequeriesresultsthroughthebitwiseAND

(6)

operator (Fig. 2). Uniting of bit maps as well asalulations take plae in GPU. The GPU exeution of the

bitwiseANDoperatorfora4-bytewholerequiresfourmultiproessoryleswithoutonsideringmemoryaess

delay(4 ylesforthesharedmemory).

6 The Experiment Results

Tohekthedevelopedmethod of representingtrees, theDBMS prototypewasdesigned supporting thebasi

operationsof data manipulation: insertion,deletion, transfer,searh. Searhand transferqueriesare exeuted

on GPU, insertion and deletion are exeuted onCPU. The developed DBMS prototype does not ahe data.

Reading isperformeddiretlyfrom thedisk. Queriesarenotahedeither. Theresultofsearhqueryisabit

map returnedtothelientintheform ofaursor. Atthenextursorreordquery,thesearhforthenextset

bit inthemap takesplae,thenreadingofthedoumentandkeyorrespondingto thebitloationinthemap.

The performane test of the developed method of representing trees was arried out in omparison with

DBMS MongoDB.MongoDBwashosenasthepopularandhighperformaneDBMS.Asanindex, MongoDB

usesB+tree. ForhierarhyrepresentationinMongoDBtheMaterializedPathwashosenasahierarhyreord

type.

6.1 Test Set

A tree-like databaseis used asaset test. As aDB base, theAll-Russian Classierof Addresses(KLADR) is

used. [kla℄

Datais representedasatreewith2.5 millionnodes. Eah nodehasadierentnumberofhildren. Thetree

isnotbalaned. Theheightofthetreeis5levels.

ForrunningexperimentstheKLADRreferenewasextended. Theoriginalrefereneguidehasunusedelds

andoupiesapproximately400MBofthediskspae.

EahtreenodehastheeldsrepresentedinTable1.

It is neessary to disriminate between a geographial division and a tree node to whih the geographial

division isattahed.

The elds represented in Table 1 are ommon elds for the both DBMSs in the test. MongoDB uses the

Materialized Path (eld ½matPath) to store thehierarhydata. In the eld ½matPath the DBMS prototype

storesthenodeodewhihistheprodutofparentsIdsbythenodeId.

ThedatasetsarethesameforthebothDBMSstested.

Tohektheperformaneofthemethodofhierarhyrepresentationin omparisontotheanalogues,aseries

of testswereonduted.

(7)

Nodename Datatype Desription

name har[32℄ nodename

matPath varhar[128℄ nodematerializedpath

objName har[16℄ nameofobjetassoiatedwithnode

objData varhar[10000℄ dataattahedtoobjet

gpsX oat objetoordinateXinGPS oordinatesystem

gpsY oat objetoordinateYinGPS oordinatesystem

level int objetnestinglevelinaordanewithKLADR

objType har[8℄ objettypeinaordanewithKLADR

objCode har[16℄ objetodeinaordanewithKLADR

A ColdStartup,withallthetestsbeingdone,wasperformedforeahDBMS½warm-up. Afterthat, allthe

tests werearriedout. BeforetestingofeahDBMStheomputerwasrebooted.

Eahtest wasrepeatedthree times. Thetestresultsrepresenttheaveragereeivedvalue.

Thepowersupplyplanwassetto½peakperformane andalsoallthepower-savingfuntionswereturnedo.

For eah DBMS the following tests were performed: reord addition, subtree seletion, seletion of all the

nodeparents,treetraversal,nodeodesearh.

Testing of node addition time wasperformed asfollows: the time of all the nodes addition wastaken into

onsideration. Afterthat,theresultwasdividedbythetotalnumberofnodes.

The seletionof all the node hildren assumesobtaininga ursor forthe node hildren, with reading them

from thedisk. 3500querieswereexeutedduringthistest.

Theseletionof alltheparentsalsoassumesobtainingaursor fortheresultset,with readingreordsfrom

thedisk. Similarly,3500querieswereexeutedduringthis test.

Treetraversalassumesreadingthewholetree.

The seletion of threads with reading assumes exeution of 10,000 queries. After obtaining the ursor for

hildren,allthehildrenarereadoutfromthedisk. ThistestallowsdeterminationoftherealDBMSperformane

asMongoDB,whenobtainingaursor,returnonlytherstresultingreords. Therestofthereordsarefound

in theindexonlywhenreadingallthepreviouslyfound.

Thetotalsizeofthedatabasetestedis24.4GBwithoutonsideringthenormativedouments.

Twodierent experimentswerearried outusingtheDB tested. Therstoneinludes thewhole database,

theseond oneinludes only100000 nodes(thenumberofqueriesand datasetsareleftunhanged). Theneed

for doingseveralexperimentswith onedataset isaused bytheneessityof determiningthedegreeof impat

of thereordsnumberin thetreeonitsproessingperformane.

Inadditiontothedatabasetested,anexperimentwasset upin whih 100000treenodeswereusedasatest

set amaterializedpath wasattahed to eah nodeas properties. Theperformane of this experimentenables

understanding thedegreeofimpatof thereordsizeonthetreerepresentationmethod performane.

6.2 Hardware and Software

A omputerwithIntel

R ^Coreⁱ⁵^TM^-4590^proessor^running^Windows^8.1ôperating^system^wasûsed âsâôm- puting platform. Theproessorisa3.30GHz64bit quad-ore,withmaximumthroughputof32GB/se. The

mahinehas8gigabytesofmemory. ThegraphisardusedisanNVIDIAGeForeGTX760. with1152CUDA

ores. 4GBofglobalmemory,andsupportsamaximumthroughputof192.2 GB/se.

CUDA 6.5 driver is used on the omputer. As for the software. MongoDB 2.6 is used. When testing,

exeutableDBMSleswereusedthatweredownloadedfromtheoialsite. ThedevelopedDBMSwasreated

usingMSVS Studio2013andoptimizationag O3.

6.3 Test Results

Table2representsthetestresults. Allthevaluesareaverageresultsforasetofqueries.

Theresultsin Table2onrmapossibilityofthejustiedappliation ofthedevelopedmethod ofhierarhy

representation in databases. Suh aresult is notied with largedata volumes and agreat number of reords.

If areordissmall, thismethod performane beomesinferiorto aommontree index(B+treeforMongoDB)

(8)

Query Representedmethod MongoDB

(averagetime,se/test)

Nodeaddition 768,44 14304,02

Subtreesearh 162,01 283,25

Parentsearh 126,08 163,00

Readingthewhole tree 490,74 8436,37

Obtainingarbitrarynodes 159,90 8880,27

Table3: Performaneomparisonoftestswithdierentdatasets

Database / test Node addition Subtree searh Parentsearh Node searh

2,5 millionreords,10KB/reord

queries/se

MongoDB 174,78 12,36 21,47 1,13

Developed method 3253,34 21,60 27,76 62,54

100000reords,10Kb/reord

queries/se

MongoDB 162,70 23,33 93,97 5,77

Developed method 3245,34 21,09 30,02 57,22

100000reords,20b/reord

queries/se

MongoDB 8673,03 6603,77 350000 2554,74

Developed method 62500 30,04 108,97 1988,64

(Table3). Thediereneisausedbytherestritiontothemaximumnumberofqueriesperseondonthepart

of hardware. Theminimum timeof ongurationand startupofthe emptykernel(fortheard tested) in the

synhronousmodeisapproximately0.10mse: onsequently. thegraphisardannotexeutemorethan1015

thousandkernelsperseond.

Figure 3showstherelationofthequeryproessingtime.

Disk

70%

GPU

9.9996%

CPU

20%

Datatransfer

0.0004%

a)Childrensearh

Disk

69%

GPU

15%

CPU

15%

Datatransfer

1%

b)Parentsearh

Figure 3: Timedistributionwhenexeutingqueries

Reading datafrom thedisk takesmostofthetime forthebothqueries. Theexeutionofkernelsrequiresa

littletime. Onaverage,thekernelexeutiontakes4.4msethatprovidesthemaximum227queriesperseond

without onsidering CPU operation time (for 10 KB/reord). When testing DB with a small reord size (20

byte/reord), the kernel exeution time is almost the sameas thetesting time for largereords. This results

in the method beingpredisposed to building hierarhies basedon nding the remainder, dealingwith a large

amountofdata,theCPUproessingofwhih isalong-runningoperation.

(9)

7.1 Threaded Exeution

So far the multithreaded software produt has not beendeveloped (we mean CPU multithreading, not GPU

multithreading). Multithreadedsoftwareisanimportantaspetof futuredevelopments. Ifaqueryisproessed

usingmultithreading,themaximumspeedupwill beahievedwhenexeutingdatamodiationqueries(reord

addition,transfer). TheappliationofNoSQL-stylenonlokingreordfuntionswillenableperformingdatabase

modiationoperationsintheasynhronousmode,thusimprovingtheperformane.

7.2 MultipleGPU ComputingSupport

Currentlyonegraphiardperformsallalulations. Additionofseveralgraphiardswillallowinreasingthe

terminal apaity of data bank orthe queryexeution speedthrough proessinghunks oneah graphi ard

whenstoringthesameditiondierentgraphiards.

Whenusingseveralgraphiards,thelow-speedCPU-GPUbustransferappearsto bearestrition. Despite

the fat that PCI 3.0 xl6bus hasa theoreti duel-sided exhange rate 128GB/se and 8 GT/s (GigaTrans-

ation/s), it is restrited by the ore memory performane. In atual pratie the bus reprodution speed is

approximately39GB/se(ontestedPC).

Whatismore,manymotherboardsespeiallyinthemoderateprierangehaveonlyonePCIslotper16data

transmissionlineswhereastheseond oneandthesueedinghavex8 andx4ingeneral.

Eventhoughusinglesslinesdereasesmulti-ardongurationperformaneby515%onaverage,andusing

multi-ard ongurationsin most asesannot reah its full potential due to the RAM restrited speed, the

benetfromusingsuhongurationsexeedstheexpenses.

7.3 Hardware Restritions

CurrentlyanimportantGPUrestritionislakofthread synhronizationresouresonthewhole graphiard.

Thread synhronization is possible only inside bloks. Blok synhronization on GPU is impossible. User

synhronizationmethodsarebasedonusingags. Hardwaresynhronizationofthebloksismoreimportantas

theperformaneinreasesomparedto thesoftwaresynhronization.

Another GPU nuane is SIMT arhiteture. SIMT arhiteture allows appliation of one and the same

ommand to dierent data. Thedrawbakof the arhiteture is that all the threads go througheah branh

of theode whenbranhing isusedevenifthethread doesnotexeuteanyinstrutions. Inotherwords,some

threads areinoperation,theothersareidle.

It is important to note a small numberof registersavailable for eah thread. This problem beomesaute

with alargenumberofthreads and aomplex ode(largedatatype). With 100,000threads,16 byte sizedata

type and 512threads perblok, eah blok uses60 registersin 63theoretially possible(the data taken from

theCUDAproler). Thisresultsinalessnumberofthreadsbeingexeutedinparallel. Forexample,whenthe

numberofregistersusedin athreaddroppedto13,theproessingspeedofthesameamountofdatainreased

byalmost10times.

Whenrunningtheexperiment,theredutioninthenumberofthreadsperblokdidnotresultintheexpeted

performaneimprovement.

7.4 Key Compression

The tree proessingmethod desribed in the artile assumesusing plural multiplying for dening key values.

Thekeysobtainedbyprimesmultipliationareexposedtoarapidinreaseresultinginlarge-sizenumbers. With

100thousandreordsin atreeitisneessarytousekeyswhosesize ismorethan128bit.

Themostfrequentarithmetioperationsperformedwithkeysinatreearedivision,multipliationandtaking

the divisionremainder. Hardwaresupport ofsuhoperationsisimpossibledue to CPU andGPU arhiteture

harateristis. Toperformarithmetioperationswithsuhkeys,itisimportanttousedierentalgorithmsthat

arelesseientthanhardwarerealizationofsuhoperations.

Usingnonpositionalnotationswillallowthesizeredutionoftheusednumbersbystoringanumberasasmall-

size remainder. When performing non-modular operations, the lak of need for onsidering the dependenies

betweentheremaindersopensupanopportunityfortherealization ofeientparallelism. Thistogetherwith

the advaned featuresof Keplerarhiteture(bit ontrolwithin warp),will allowtheeientrealization ofbit

mappinganddeningzeroremainderfornonpositionalnotations.

(10)

Thepaperrepresentsamethodofstoringtree-likestruturesindatabasesorientedtowardsmultithreadproessing

withregardtoCUDAparallelization.

ThisprojetdemonstratesusingGPUforhierarhialdataproessing.

Theresultof theartileasafollows: anewmethod ofrepresentingtreesin databasesis developed,and the

realizationof ahierarhialdatabaseprototypewithGPUdataproessing.

Onaverage,theimplementedsoftwareruns by19timesfasteromparedto thematerializedpathproessing

in DBMSMongoDBonCPU.Despitethequeryresultvariations,theminimumspeedupwas1.29times.

Referenes

[AG08℄ Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv.,

40(1):1:11:39,2008.

[BELP℄ U. Brandes,M.Eiglsperger,J.Lerner,andC.Pih. Graphmarkuplanguage(graphml).

[BMCN09℄ Ranieri Baraglia, Via G Moruzzi, Gabriele Capannini, and Frano Maria Nardini. Sorting using

BItoni netwoRkwIthCUDA. LSDS-IRWorkshop,(July),2009.

[BS10℄ P. Bakkumand K. Skadron. Aelerating sql database operationson a gpu with uda: Extended

results. Tehnialreport, Universityof VirginiaDepartmentofComputerSiene, 2010.

[Cel℄ JoeCelko. Treesin SQL.

[Cel04℄ JoeCelko. HierarhialSQL,2004.

[Cel12℄ J.Celko. JoeCelko's TreesandHierarhiesin SQLfor Smarties. JoeCelko'sTreesandHierarhies

in SQLforSmarties.Elsevier/MorganKaufmann,2012.

[Gai99℄ A. T. Gainov. Number Theory. Part I. Resoure book. Novosibirsk State University. Faulty of

MehanisandMathematis.,1999.

[GBE13℄ MarelGrandpierre,GeorgBuss,andRalfEsser. In-MemoryComputingtehnology-Theholygrail

ofanalytis? Deloitte &Touhe GmbHWirtshaftsprüfungsgesellshaf, (07),2013.

[HKOO11℄ SungpakHong,SangKyun Kim,TayoOguntebi, andKunleOlukotun. AeleratingCUDAgraph

algorithms atmaximumwarp. Proeedings of the 16thACM symposiumon Priniples andpratie

of parallel programming- PPoPP'11,page267,2011.

[HN07℄ PawanHarishand P.J.Narayanan. Aeleratinglargegraphalgorithms onthegpu usinguda. In

Proeedings of the 14th International Confereneon High PerformaneComputing, HiPC'07,pages

197208,Berlin,Heidelberg,2007. Springer-Verlag.

[kla℄ KLADR-theAll-RussianClassierofAddresses.

[Kol09℄ M.A. Kolosovskiy. Data struture for representing a graph?: ombination of linked list and hash

table. CoRR, 2009.

[Kup86℄ Gabriel Mark Kuper. The Logial Data Model: A New Approah to Database Logi. PhD thesis,

Stanford, CA,USA,1986. UMIorder no.GAX86-08173.

[KV93℄ G. M.Kuperand M.Y. Vardi. Thelogialdata model. ACM Transations on DatabaseSystems,

18(3):379413,1993.

[MS11℄ AndersM

ø

^ller^and^MihaelShwartzbah. Xmlgraphsinprogramanalysis. Si.Comput.Program., 76(6):492515,2011.

[MT11℄ A. Malikov and A. Turyev. Nested intervals tree enoding with system of residual lasses. IJCA

SpeialIssueonEletronis, InformationandCommuniation Engineering,ICEICE(2):1921,2011.

(11)

basedonbitonisort bitonisort. InParallel Proessing andAppliedMathematis, pages403410.

2010.

[PT11℄ Jonathan PalaiosandJosh Triska. A Comparison of ModernGPU and CPU Arhitetures: And

theCommonConvergeneofBoth. pages120, 2011.

[Tro03℄ Vadim Tropashko. TreesinSQL: NestedSetsandMaterializedPath. 2003.

[Tro05℄ Vadim Tropashko. Nested intervalstreeenodingin sql. SIGMODReord,34(2),June2005.

[vor14℄ DBMSIndexforHierarhialDataUsingNestedIntervalsandResidueClasses.YoungSi.Int.Work.

TrendsInf.Proess., 2014.

[WO10℄ YangzihaoWang and JohnOwens. Large-sale graphproessingalgorithms onthegpu. Tehnial

ReportJanuary2011,UCDavis,2010.

The GPU-Oriented Tree Representation Based on the Method of Finding the Remainder

Copyright c 2017 by the paper’s authors. Copying permitted for private and academic purposes.

In: S. H¨ olldobler, A. Malikov, C. Wernhard (eds.): YSIP2 – Proceedings of the Second Young Scientist’s International Workshop

on Trends in Information Processing, Dombai, Russian Federation, May 16–20, 2017, published at http://ceur-ws.org.

[q 1

q 2

. . .

q n ]

[q 1

q 2

. . .

q n ]

q 1

q 2

. . .

q n

a/b

a >= b >= 1

GCD(a, b) = 1

3.2.2

3 + 1

2 + 1 2 = 17 5

n = (k, d)

k

d

T = { n } = { (k, d) }

n

key(n) = k

data(n) = d

I

I = s 1 , s 2 , . . . , s m , . . .

I

next() = s i , i = i + 1

a

b

mod(a, b) = a − b · ⌊ a b ⌋

add(T, d, n parent )

n parent

d

exists(T, n parent )

T := T ∪ { n(key(n parent ) · next(), d) }

exists(T, n)

n

T

exists(T, n) := ∃ n ∈ T

k

node(T, k)

node(T, k) := { x ∈ T | key(x) = k }

parents(T, k)

parents(T, k) := { x ∈ T | mod(k, key(x)) = 0 ∧ k 6 = key(x) }

parent(T, k)

parent(T, k) := M AX(parents(T, k))

subtree(T, k)

subtree(T, k) := { x ∈ T | mod(key(x), k) = 0 }

children(T, k)

children(T, k) := { x ∈ T | k = key(parent(T, key(x))) }

remove(T, k)

T := { x ∈ T | mod(key(x), k) 6 = 0 }

n old

n new move(T, n old , n new )

exists(T, n old ) ∧ exists(T, n new ) ∧ mod(key(n new ), (n old )) 6 = 0

T := T ∪ { x ∈ subtree(T, key(n old )) |

| n(recalcKey(T, key(x), key(n old ), key(n new )), data(x)) }

recalcKey(T, k, k old , k new )

recalcKey(T, k, k old , k new ) := k

k old ∗ k new

commonP arents(A, k, r)

k

r

commonP arents(A, k, r) := parents(A, k) \

parent(A, r).

ø

2 + ¹ ₂ = 17 5