BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map/Reduce Applications

(1)

HAL Id: inria-00440312

https://hal.inria.fr/inria-00440312

Submitted on 10 Dec 2009

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

Concurrency to Hadoop Map/Reduce Applications

Bogdan Nicolae, Diana Moise, Gabriel Antoniu, Luc Bougé, Matthieu Dorier

To cite this version:

Bogdan Nicolae, Diana Moise, Gabriel Antoniu, Luc Bougé, Matthieu Dorier. BlobSeer: Bringing

High Throughput under Heavy Concurrency to Hadoop Map/Reduce Applications. [Research Report]

RR-7140, INRIA. 2009, pp.20. �inria-00440312�

(2)

a p p o r t

d e r e c h e r c h e

0

2

4

9 -6

3

9

9 IS

R

N

IN

R

IA

/R

R

--7

1

4

0 --F

R

+

E

N

G

BlobSeer: Bringing High Throughput

under Heavy Concurrency

to Hadoop Map/Reduce Applications

Bogdan Nicolae — Diana Moise — Gabriel Antoniu — Luc Bougé — Matthieu Dorier

N° 7140

(3)

(4)

Centre de recherche INRIA Rennes – Bretagne Atlantique

to Hadoop Map/Redu e Appli ations

Bogdan Ni olae

∗

, DianaMoise

†

, GabrielAntoniu

†

, Lu Bougé

‡

, Matthieu Dorier

‡

Thème: Cal uldistribué etappli ationsàtrès hauteperforman e Équipe-ProjetKerData

Rapportdere her he n° 7140De ember200920pages

Abstra t: HadoopisasoftwareframeworksupportingtheMap/Redu e pro-grammingmodel. ItreliesontheHadoopDistributedFileSystem(HDFS)asits primarystoragesystem. Thee ien y ofHDFSis ru ialfortheperforman e ofMap/Redu eappli ations. WesubstitutetheoriginalHDFSlayerofHadoop with anew, on urren y-optimized data storage layer based on the BlobSeer datamanagementservi e. Thereby,thee ien yofHadoopissigni antly im-proved fordata-intensiveMap/Redu eappli ations, whi h naturallyexhibit a highdegreeofdataa ess on urren y. Moreover,BlobSeer'sfeatures(built-in versioning,its support for on urrentappend operations)open the possibility forHadooptofurtherextenditsfun tionalities. Wereportonextensive exper-iments ondu ted ontheGrid'5000 testbed. Theresultsillustratethebenets ofourapproa hovertheoriginalHDFS-basedimplementationofHadoop. Key-words: Distributed le systems; High-performan e systems; High throughput; Large-s ale;Heavya ess on urren y;Map/Redu eappli ations; Hadoop;BlobSeer.

Conta tauthor: Gabriel.Antoniuinria.fr .

Aslightlyrevisedversion ofthisworkwillbepublishedinthePro eedingsofthe24th IEEEInternational ParallelandDistributedPro essingSymposium(IPDPS2010),Atlanta, April19-23,2010

∗

UniversityofRennes1/IRISA,Rennes,Fran e

†

INRIA/IRISA,Rennes,Fran e

‡

(5)

pour les appli ations Map/Redu e sur Hadoop Résumé : Hadoop est un environnement logi iel pour la mise en ÷uvre du modèlede programmationMap/Redu e.Ils'appuie prin ipalement surle sys-tème de gestion de hiers distribué HDFS. L'e a ité de HDFS est un pa-ramètre ru ial de la performan e des appli ations Map/Redu e. Nous pro-posons de rempla er la ou he HDFS de Hadoop par une nouvelle ou he de sto kagedesdonnéesquisoitoptimiséepouruneutilisation on urrente.Cette nouvelle ou he s'appuie sur le servi e de gestion de données BlobSeer. Nous montronsquel'e a itédeHadoopest ainsiamélioréedemanièresigni ative pour desappli ations Map/Redu equi manipulent intensivementles données: eneet,ellesorentnaturellementunhautdegré de on urren e.Deplus, les fon tionnalités spé iquesde BlobSeer (gestion intégrée des versions,support pour les opérations append on urrentes) permettent d'envisager d'étendre lesfon tionnalités deHadoop.Nousrendons ompte d'une ampagneintensive d'expérien esmenéesur l'instrumentGrid'5000.Lesrésultatsillustrentles bé-né esde notreappro heparrapportàl'implémentationprimitivede Hadoop fondéesurHDFS.

Mots- lés : Système de gestionde hiersdistribué; système haute perfor-man e; grande bande passante; grande é helle; a ès hautement on urrents; appli ationsMap/Redu e;Hadoop;BlobSeer.

(6)

Contents

1 Introdu tion 3

2 Spe ialized le systems for data-intensive Map/Redu e

appli- ations 4

2.1 Requirementsforthestoragelayer . . . 4

2.2 Dedi atedlesystemsforMap/Redu e . . . 4

3 BlobSeeras a on urren y-optimizedle systemfor Hadoop 6 3.1 DesignoverviewofBlobSeer . . . 6

3.2 IntegratingBlobSeerwithHadoop . . . 9

3.3 BlobSeer: detailedar hite ture . . . 10

3.4 Zoomingonreads. . . 11

3.5 Zoomingonwrites . . . 11

4 Experimentalevaluation 12 4.1 Mi roben hmarks. . . 13

4.2 S enario1: singlewriter,singlele . . . 14

4.3 S enario2: on urrentreads,sharedle . . . 15

4.4 S enario3: Con urrentappends,sharedle . . . 16

4.5 Higher-levelexperimentswithMap/Redu eappli ations . . . 17

5 Con lusion 18

1 Introdu tion

Map/Redu e[5℄isaparallelprogrammingparadigmsu essfullyused bylarge Internetservi eproviderstoperform omputationsonmassiveamountsofdata. After being strongly promoted by Google, it has also been implemented by the open sour e ommunity through the Hadoop [7℄ proje t, maintained by the Apa he Foundation and supported by Yahoo! and even by Google itself. This modelis urrentlygettingmoreandmorepopularasasolutionfor rapid implementationofdistributeddata-intensiveappli ations.

At the ore of the Map/Redu e frameworks stays a key omponent: the storage layer. To enable massively parallel data pro essing to a high degree overalarge numberof nodes, thestorage layermust meeta series of spe i requirements(dis ussedinSe tion2),that arenotpartofdesignspe i ations oftraditionaldistributed lesystemsemployedintheHPC ommunities: these le systems typi ally aim at onformingto well-established standardssu h as POSIX and MPI-IO. To address these requirements, spe ialized le systems have been designed, su h as HDFS [8℄, the default storage layer of Hadoop. HDFS has howeversome di ulties to sustain ahigh throughput in the ase of on urrenta essesto thesamele. Moreover,manydesirablefeatures are missingaltogether,su hasthesupportforversioningandfor on urrentupdates tothesamele.

We substitute the original data storage layer of Hadoop with a new, on urren y-optimized storage layer based on BlobSeer, a data management servi e wedeveloped withthe goalof supporting e ient, ne-grain a essto massive,distributeddataa essedunderheavy on urren y. ByusingBlobSeer

(7)

insteadofitsdefault storagelayer,Hadoopsigni antlyimprovesitssustained throughputins enariosthat exhibithighly on urrenta essestosharedles. Wereportonextensiveexperimentationbothwithsyntheti mi roben hmarks and real Map/Redu e appli ations. The results illustrate the benets of our approa h overthe originalHDFS-based implementationof Hadoop. Moreover wesupportadditionalfeaturessu hase ient on urrentappends, on urrent writesat randomosetsand versioning. These features ould be leveragedto extendorimprovefun tionalitiesinfutureversionsofHadooporother Map/Re-du eframeworks.

2 Spe ialized le systems for data-intensive Map/Redu e appli ations

2.1 Requirements for the storage layer

Map/Redu eappli ations typi ally run h evergrowingdatasetsofbillionsof smallre ords. StoringbillionsofKB-sizedre ordsinseparatetinylesisboth unfeasible andhard to handle, evenif thestoragelayerwould support it. For thisreason,datasetsareusuallypa kedtogetherinhugeleswhosesizerea hes theorderof severalhundredsofGB.

ThekeystrengthoftheMap/Redu emodelisitsinherentlyhigh paralleliza-tionof the omputation, that enablespro essing of PB of datain a oupleof hourson large lusters onsisting of several thousand nodes. This hasseveral onsequen esforthestorageba kend. Firstly,sin edataisstoredinhugeles, the omputationwillhavetopro esssmallpartsofthesehugeles on urrently. Thus, thestoragelayerisexpe ted toprovidee ientne-graina ess tothe les. Se ondly, the storage layer must be able to sustain a high throughput in spite of heavy a ess on urren y to the same le, as thousands of lients simultaneouslya essdata.

Dealingwithofhugeamountsofdataisdi ultintermsofmanageability. Simplemistakesthatmayleadtolossofdata anhavedisastrous onsequen es sin e gatheringsu h amounts of data requires onsiderable eortinvestment. Versioning inthis ontextbe omesanimportantfeature thatisexpe tedfrom thestoragelayer. Not onlyit enablesrollingba k undesired hanges,but also bran hing a dataset into two independent datasets that an evolve indepen-dently. Obviously, versioning should have a minimal impa t both on perfor-man eandonstoragespa eoverhead.

Finally,anotherimportantrequirementforthestoragelayerisitsabilityto exposeaninterfa ethatenablestheappli ationtobedata-lo ationaware. This allowsthe s heduler to use this information to pla e omputation tasks lose tothe data. Thisredu es network tra , ontributingto abetterglobaldata throughput.

2.2 Dedi ated le systems for Map/Redu e

These riti al needs of data-intensive distributed appli ations have not been addressed by lassi al, POSIX- ompliant distributed le systems. Therefore, Google introdu ed GoogleFS [6℄ asa storage ba kend that provides the right

(8)

abstra tionfortheirMap/Redu edatapro essingframework.Then,other spe- ialized lesystemsemerged: ompanies su h asYahoo! and Kosmixfollowed thistrendbyemulatingtheGoogleFSar hite turewiththeHadoopDistributed FileSystem(HDFS,[8℄)andCloudStore[4℄.

Essentially, GoogleFS splits les into xed-sized 64 MB hunks that are distributed among hunkservers. Both metadatathat des ribes the dire tory stru ture ofthele system,andmetadatathat des ribesthe hunk layoutare stored on a entralized master server. Clients that need to a ess ale rst onta tthis serverto obtainthelo ationofthe hunks that orrespondto the range of the le they are interestedin. Then, they dire tly intera t with the orresponding hunkservers. GoogleFSis optimizedtosustainahigh through-putfor on urrentreads/appendsfrom/toasinglele,byrelaxingthesemanti onsisten y requirements. It alsoimplements support for heap snapshooting andbran hing.

Hadoop Map/Redu e is a framework designed for easily writing and e- ientlypro essingMap/Redu eappli ations. Theframework onsistsofasingle masterjobtra ker,andmultipleslavetasktra kers,onepernode. Thejobtra ker isresponsiblefors hedulingthejobs' omponenttasksontheslaves,monitoring them and re-exe utingthe failed tasks. Thetasktra kers exe utethe tasksas dire tedbythemaster. HDFSisthedefaultstorageba kendthatshipswiththe Hadoopframework. Itwasinspiredbythe ar hite ture ofGoogleFS.Filesare alsosplitin64MBblo ksthataredistributedamongdatanodes. A entralized namenode isresponsibletomaintainboth hunklayoutanddire torystru ture metadata. Read and write requests are performed by dire t intera tion with the orrespondingdatanodes anddonotgothroughthenamenode.

InHadoop,readsessentiallyworkthesamewayaswithGoogleFS.However, HDFShasadierentsemanti sfor on urrentwrite a ess: itallowsonlyone writeratatime,and,on ewritten,data annotbealtered,neitherby overwrit-ingnorbyappending. Severaloptimizationte hniquesareusedtosigni antly improvedata throughput. First,HDFSemploysa lientside buering me ha-nismfor smallread/write a esses. It prefet hesdataon reading. Onwriting, itpostpones ommittingdataafter thebuerhasrea hedatleastafull hunk size. A tually, su h ne-grain a esses are dominantin Map/Redu e appli a-tions,whi husuallymanipulatesmallre ords. Se ond,Hadoop'sjobs heduler (the jobtra ker)pla es omputationsas loseaspossibleto thedata. Forthis purpose,HDFSexpli itelyexposesthemappingof hunksoverdatanodestothe Hadoopframework.

With loud omputingbe omingmoreandmorepopular,providerssu has AmazonstartedoeringMap/Redu eplatformsasaservi e. Amazon's initia-tive,Elasti MapRedu e[2℄,employsHadoopontheirElasti ComputeCloud infrastru ture (EC2, [1℄). The storage ba kend used by Hadoop is Amazon's Simple Storage Servi e (S3, [3℄). The S3 framework was designed with sim-pli ityin mind,to handleobje tsthat mayrea hsizesin theorderofGB: the user anwrite,read,anddeleteobje tssimplyidentiedbyanuniquekey. The a ess interfa e is basedon well-establishedstandardssu h asSOAP. Careful onsideration was invested into using de entralized te hniques and designing operationsinsu hwayastominimizetheneedfor on urren y ontrol. Afault tolerantlayerenablesoperationsto ontinue with minimal interruption. This allowsS3tobehighlys alable. On thedownsidehowever,simpli ity omesat a ost: S3provideslimitedsupportfor on urrenta essestoasingleobje t.

(9)

Other eortsaimatadapting general-purposedistributed lesystemsfrom theHPC ommunitytotheneedsoftheMap/Redu eappli ations. Forinstan e, PVFS(ParallelVirtualFileSystem)andGPFS(GeneralParallelFileSystem, fromIBM)havebeenadaptedtoserveasastoragelayerforHadoop. GPFS[13℄ ispartoftheshared-disklesystems lass,thatuseapoolofblo k-levelstorage, sharedanddistributed a rossallthe nodesin the luster. The sharedstorage an be dire tly a essed by lients, with no intera tion with an intermediate server. Integrating GPFS with the Hadoop framework, involves over oming some limitations: GPFS supports a maximal blo k size of 16 MB, whereas Hadoopoftenmakesuseofdatain64MB hunks;Hadoop'sjobtra ker mustbe awareof theblo k lo ation,while GPFS(likeall parallellesystems) exposes aPOSIXinterfa e. PVFS [12℄belongstoase ond lassofparallellesystems, obje t-basedlesystems whi hseparatethenodesthatstorethedatafromthe onesthatstorethemedatata(leinformation,andleblo klo ation). Whena lientwantsto a essale,itmustrst onta tthemetadataserverandthen dire tlya ess thedata onthedata serversindi atedbythe metadataserver. In[14℄,itisdes ribedthewayPVFS wasintegratedwithHadoop,byaddinga layerontopofPVFS.Thislayerenhan edPVFSwithsomefeaturesthatHDFS already provides to the Hadoop framework: performing read-aheadbuering, exposingthedatalayoutandemulatingrepli ation.

Theaboveworkhasbeenasour eofinspirationforourapproa h. Thanksto thespe i featuresofBlobSeer,we ouldaddressseverallimitationsofHDFS highlightedin it.

3 BlobSeer as a on urren y-optimized le sys-tem for Hadoop

Inthisse tionweintrodu eBlobSeer,asystemformanagingmassivedataina large-s aledistributed ontext[10℄. Itse ientversion-orienteddesignenables lo k-freea esstodata,andtherebyfavorss alablityunderheavy on urren y. Thanksto its de entralized data and metadatamanagement, it provideshigh datathroughput [11℄. The goalofthis paperis to showhowBlobSeer anbe extendedinto an lesystem for Hadoop, and thus used asan e ient storage ba kendforMap/Redu eappli ations.

3.1 Design overview of BlobSeer

The goal of BlobSeer is to provide support for data-intensive distributed ap-pli ations. Nohypothesis whatsoeverismade aboutthestru ture of thedata atstake: theyareviewedashuge,atsequen esof bytes,often alledBLOBs (BinaryLargeOBje ts). Weespe iallytargetappli ationsthatpro essBLOBs in a ne-grain manner. This is the typi al ase of Map/Redu eappli ations, indeed: workers usually a ess pie es of up to 64 MB from huge input les, whosesizemayrea hhundredsofGB.

A lient of BlobSeer manipulates BLOBs by using asimple interfa e that allowsto: reateanewemptyBLOB;appenddatatoanexistingBLOB; read-/writeasubsequen eofbytesspe iedbyanosetandasizefrom/toanexisting BLOB.Ea hBLOBisidentiedbyauniqueid inthesystem.

(10)

Figure1: Metadatatreeafterwritingtherst4blo ksofaBLOB

Versioningis builtin BlobSeer atthe earlieststageof design. Ea h timea writeorappendisperformedonaBLOB,anewsnapshotree tingthe hanges isgenerated insteadofoverwritinganyexisting data. Thisnewsnapshotis la-beledwithanin rementalversionnumber,sothatallpastversionsoftheBLOB anpotentiallybea essed,atleastaslongastheyhavenotbeengarbagedfor thesakeofstoragespa e.

The version numbers are assigned and managed by the system. In order to read apartof theBLOB, the lientmust spe ifyboththe uniqueid ofthe BLOB and the snapshotversionitdesires to read from. A spe ial allallows the lienttond outthelatest versionof aparti ularBLOB,but the lientis allowedto readanypastversionoftheBLOB.

Althoughea hwriteorappendgeneratesanewversion,onlythedierential pat h isa tually stored, sothat storagespa e is savedat far aspossible. The newsnapshotshares allunmodied dataandmostof theasso iatedmetadata withthepreviousversions,aswewillseefurtherinthisse tion. Su han imple-mentation further fa ilitates the implementation of advan ed features su h as rollba kandbran hing,sin edataandmetadata orrespondingtopastversions remainavailablein thesystemand aneasilybea essed.

ThegoalofBlobSeeris tosustainhigh throughput underheavy a ess on- urren y in reading, writing and appending. This is a hieved thanks to the ombination of various te hniques,in luding: data striping, distributed meta-data,version-baseddesign,lo k-freedata a ess.

Data striping. BlobSeerreliesonstriping: ea hBLOBismadeupofblo ks ofaxedsize. TooptimizeBlobSeerforMap/Redu eappli ations,weset this size to thesizeof thedatapie e aMap/Redu eworkerissupposed to pro ess (i.e., 64 MB in the experiments below with Hadoop, equal to the hunk size in HDFS). These blo ks are distributed among the storage nodes. We use a loadbalan ingstrategythataimsatevenlydistributingtheblo ksamongthese nodes. Asdes ribedinSe tion4.3,thishasamajorpositiveimpa tinsustaining ahigh throughputwhen many on urrentreadersa ess dierentparts of the samele.

Distributedmetadata. ABLOBisa essedbyspe ifyingaversionnumber andarangeofbytesdelimitedbyanosetandasize. BlobSeermanages addi-tionalmetadatatomapagivenrangeandaversiontothephysi alnodeswhere the orresponding blo ks are lo ated. We organize metadata asa distributed

(11)

segmenttree [15℄: onesu htreeisasso iatedtoea hversionofagivenblobid. Asegmenttreeis abinarytreein whi h ea h nodeis asso iatedto arangeof theblob, delimited byoset andsize. Wesaythat thenode overs the range (oset,size). Theroot oversthewholeBLOB.Forea hnodethatisnotaleaf, theleft hild oversthersthalfoftherange,andtheright hild oversthe se -ondhalf. Ea hleaf oversasingleblo koftheBLOB.Figure1illustratessu h ametadatatreefora4-blo k. Tofavore ient on urrenta esstometadata, treenodes aredistributed: theyare stored onthemetadataprovidersusing a DHT (Distributed Hash Table). Ea h tree node is identied in the DHT by itsversionandbytherangespe iedthroughtheosetand thesizeit overs. Su h ametadatatreeis reatedwhen therst blo ksof theblob are written, fortherange overedbythoseblo ks. Then,toavoidtheoverhead(intimeand spa e!) ofrebuildingsu hatreeforthesubsequentupdates,we reatenewtree nodesonlyfortherangesthatdo interse twith therangeoftheupdate.

Note thatmetadatade entralizationhasasigni antimpa tontheglobal throughput, as demonstrated in [11℄: it avoidsthe bottlene k reatedby on- urrenta essesinthe aseofa entralizedmetadataserverinmostdistributed lesystems, in luding HDFS.A detailed des ription ofthe algorithms weuse tomanagemetadata anbefoundin[10℄: duetospa e onstraints,wewillnot developthemfurtherinthis paper.

Version-based, lo k-free, on urren y-optimized data a ess. Blob-Seerreliesonaversioning-based on urren y ontrolalgorithmthatmaximizes the number of operations performed in parallel in the system. It is done by avoiding syn hronization asmu h aspossible, both at thedata and metadata levels. The keyideaisamazingly simple: noexisting dataormetadatais ever modied! First,anywriter orappenderwrites itsnewdata blo ks,bystoring thedierentialpat h. Then,inase ondphase,theversionnumberisallo ated andthenewmetadatareferringto these blo ksare generated. Therst phase onsistsina tuallywritingthenewdataonthedistributed storagenodes. The on urrentwriters anpro eedwithfull parallelism,withoutany syn hroniza-tion. Inthe se ondphase, thenewmetadataare thenweaved togetherwith themetadataof theversions with alowernumber. The ru ial observationis thatthisse ondphase analsobemostly on urrent. Theonlyglobal syn hro-nizationrequirementisthattheorderinwhi hthe ompletionofthe on urrent writes o urs must respe t theorder in whi h the versionnumbers havebeen assigned. This istransparently ensured bythe system, without requiringany expli itsyn hronizationbytheuser. Thereby,thealgorithm reatestheillusion ofafullyindependentsnapshotgeneration. Thisallowswrite/write on urren y atdatalevel,whilestillpreservingserializationand atomi ity.

Sin eea hwriterorappendergeneratesnewdata/metadataandnever modi-esexistingdata/metadata,readersare ompletelyde oupledfromthem. Read-ers anthuspro eedwithfull on urren ywithrespe ttowritersandappenders (andvi e-versa), both for data andmetadata a ess. We an thus laim that ourapproa h supports read/read, read/write and write/write on urren y by design. This signi antlyoverpassesthe apabilities of HDFS, whi h only al-lowsasinglewritertopro eedatatime. Theexperimentalresultspresentedin Se tion4 learlysupport our laim.

(12)

Figure2: BlobSeer'sar hite ture. TheBSFSlayerenablesHadooptouse Blob-Seerasastorageba kend through alesysteminterfa e.

3.2 Integrating BlobSeer with Hadoop

The Hadoop Map/Redu e framework a esses its default storage ba kend (HDFS) through a lean, spe i Java API. This API exposes the basi op-erations of a le system: read, write, append, et . To make Hadoop benet from BlobSeer's properties, weimplementedthis APIon topofBlobSeer. We allthis higherlayertheBlobSeer File System (BSFS): itenablesBlobSeerto a tasastorageba kendlesystemforHadoop. Toenableafair omparisonof BSFSwithHDFS,weaddressedseveralperforman e-orientedissueshighlighted in [14℄. Theyarebrieydis ussedbelow.

File systemnamespa e. TheHadoopframeworkexpe tsa lassi al hierar- hi aldire torystru ture,whereasBlobSeerprovidesaatstru tureforBLOBs. For this purpose, we had to design and implement a spe ialized namespa e manager,whi hisresponsibleformaintainingalesystemnamespa e,andfor mappingles to BLOBs. Forthesake of simpli ity, this entity is entralized. Careful onsiderationwasgiventominimizetheintera tionwiththisnamespa e manager,inordertofullybenetfromthede entralizedmetadatamanagement s heme of BlobSeer. Our implementation of Hadoop's le system API only intera tswithitforoperationslikeleopeningandle/dire tory reation/dele-tion/renaming. A essto thea tual datais performedby adire t intera tion withBlobSeerthroughread/write/appendoperationsontheasso iatedBLOB, whi hfullybenetfrom BlobSeer'se ientsupportfor on urren y.

Data prefet hing. Hadoopmanipulatesdatasequentiallyinsmall hunksof afewKB(usually,4KB)atatime. Tooptimizethroughput,HDFSimplements a a hing me hanism that prefet hes data for reads, and delays ommitting data for writes. Thereby, physi al reads and writes are performed with data sizeslargeenoughto ompensatefornetworktra overhead.Weimplemented a similar a hing me hanism in BSFS. It prefet hes a whole blo k when the

(13)

requested data is not already a hed, and delays ommitting writes until a wholeblo khasbeenlledinthe a he.

Anity s heduling: exposing data distribution. In a typi al Hadoop deployment,thesamephysi alnodesa tbothasstorageelementsandas om-putationworkers. Therefore,theHadoops heduler strivesatpla ingthe om-putationas loseaspossibletothedata: thishasamajorimpa tontheglobal datathroughput,giventhehugevolumeofdatabeingpro essed. Toenablethis s hedulingpoli y,Hadoop'slesystemAPIexposesa allthat allowsHadoop to learn how the requested data is split into blo ks, and where those blo ks arestored. WeaddressthispointbyextendingBlobSeerwithanewprimitive. Givenaspe iedBLOBid,version,osetandsize,itreturnsthelistofblo ks thatmakeuptherequestedrange,andtheaddressesofthephysi alnodesthat storethose blo ks. Then, we simply map Hadoop's orresponding le system alltothis primitiveprovidedbyBlobSeer.

3.3 BlobSeer: detailed ar hite ture

BlobSeer onsists ofaseriesof distributed ommuni atingpro esses. Figure 2 illustratesthepro essesandtheirintera tionsbetweenthem.

Clients reate, read, write and append data from/to BLOBs. Clients an a ess theBLOBswithfull on urren y,eveniftheyalla ess thesame BLOB.

Data providers physi allystoretheblo ksgeneratedbyappendsandwrites. New data providers may dynami ally join and leave thesystem. Inthe ontext of Hadoop Map/Redu e, thenodes hosting data providers typi- allyalsoa tas omputingelementsaswell. Thisenablesthemtobenet from thes hedulingstrategyof Hadoop,whi h aimsat pla ingthe om-putationas loseaspossibleto thedata.

The provider manager keepsinformationabouttheavailablestoragespa e and s hedules the pla ement of newly generated blo ks. For ea h su h blo kto bestored, it sele tsthe dataprovidersa ordingto aload bal-an ing strategy that aims at evenly distributing the blo ks a ross data providers.

Metadata providers physi allystorethemetadatathatallowsidentifyingthe blo ksthat makeupasnapshotversion. We useadistributed metadata managements hemetoenhan e on urrenta esstometadata. Thenodes hostingmetadataprovidersmaya tas omputingelementsaswell. The version manager is in hargeof assigningsnapshotversionnumbersin

su hawaythatserializationandatomi ityofwritesandappendsis guar-anteed. Itistypi allyhostedonadedi ated node.

The namespa e manager is not part of the BlobSeer. It is an additional entityintrodu edforBSFS,thehigher-levellesystemlayer. Itmaintains ale systemnamespa e, andmaps les in thenamespa eto BLOBs. It

(14)

3.4 Zooming on reads

Toread data,the lientrstneedsto ndouttheBLOB orrespondingtothe requestedle. This informationistypi alyavailablelo ally (asithastypi ally beenrequestedfromthenamespa emanagerwhenthele wasopened). Then the lientmust spe ify the versionnumberit desires to read from,as wellas the osetand size of therange to beread. The lientmay also allaspe ial primitiverst,tondoutthelatestversionavailableinthesystematthetime thisprimitivewasinvoked. Inpra ti e,sin eHadoop'slesystemAPIdoesnot supportversioningyet,this allisalwaysissuedinthe urrentimplementation. Next,the read operationin BSFS follows BlobSeer's sequen eof steps for reading arange withinaBLOB. The orrespondingdistributed algorithm, de-s ribingtheintera tionsbetweenthe lient,theversionmanager,thedistributed dataandmetadataprovidersarepresentedanddis ussedindetailin[10℄. The main global steps anbe summarized as follows. The lient queries the ver-sion manageraboutthe requestedversionoftheBLOB. The versionmanager forwardsthequerytothemetadataproviders,whi hsendtothe lientthe meta-data that orrespondsto theblo ksthat makeupthe requestedrange. When the lo ation of all these blo ks was determined, the lient fet hes the blo ks fromthedataproviders. Theserequestsaresentasyn hronouslyandpro essed inparallelbythedataproviders. Notethattherstandthelastblo kinthe se-quen eofblo ksfortherequestedrangemaynotneedtobefet hed ompletely, astherequestedrangemaybeunalignedtofull blo ks. Inthis ase,the lient fet hesonlytherequiredparts oftheextremalblo ks.

3.5 Zooming on writes

Towritedata, the lientrstsplits thedatato bewritteninto alistofblo ks that orrespondtotherequestedrange. Then,it onta tstheprovidermanager, requesting a list of providers apable of storing the blo ks: one provider for ea hblo k. Blo ksarethenwritteninparalleltotheprovidersallo atedbythe providermanager. If, forsomereason, writingofablo kfails, thenthewhole write fails. Otherwise the lient pro eeds by onta ting the version manager to announ e its intent to update the BLOB. As highlighted in Se tion 3.1, on urrentwritersofdierentblo ksofthesamele anperformthisrststep with full parallelism. Subsequently, theversionmanager assignsto ea h write request anewsnapshotversionnumber. This numberis used bythe lientto generatenewmetadata, weaveittogether withexisting metadata,and storeit onthe distributed metadataproviders,in orderto reate theillusionofanew standalonesnapshot.

Note that the term existing metadata overs two ases. First, it refers to metadata orresponding to previous, ompleted writes. But it also refers to metadata generated by still a tive on urrent writers that were assigned a lower version number (i.e., they have written the data, but they have not nished writing the metadata)! In parti ular, su h on urrent writers might beinthepro essofgeneratingandwritingmetadata,onwhi hthe lientshall dependwhenweavingitsownmetadata. Todealwiththissituation,theversion managerhintsthe lientonsu hdependen ies. Insomesense,the lientisable topredi t thevalues orrespondingtothemetadatathatisbeingwrittenbythe on urrentwriters that are still in progress. It an thus pro eed on urrently

(15)

with the other writers, rather than waiting for them to nish writing their metadata. The reader an refer to [10℄ for further details on how we handle metadatafor on urrentwriters.

On emetadatawassu essfullywrittentothemetadataproviders,the lient notiestheversionmanager ofsu ess, andreturns tothe user. Observethat theversionmanagerneedsto keeptra kofallwriters on urrentlya tive,and delay ompleting anewsnapshotversionuntilallwriters thatwereassigned a lower versionnumber reported su ess. The detailed algorithm for writingis providedin [10℄.

Theappendoperationisidenti altothewriteoperation,ex eptforasingle dieren e: theosetof therange to be appended is unknownat the timethe append isissued. Itis eventuallyxed bytheversionmanagerat thetimethe versionnumberisassigned. Itissettothesizeofthesnapshot orrespondingto thepre edingversionnumber. Again,observethat thewritingofthissnapshot maystillbeinprogress.

4 Experimental evaluation

Platformdes ription. ToevaluatethebenetsofusingBlobSeerasthe stor-ageba kend forMap/Redu eappli ationsweused Yahoo!'sreleaseof Hadoop v.0.20.0 (whi h is essentially the main release of Hadoop with some minor pat hesdesignedto enableHadooptorun ontheYahoo! produ tion lusters). We hose this release be ause it is freely available and enables us to experi-mentwithaframeworkthat isbothstableandusedinprodu tiononYahoo!'s lusters.

WeperformedourexperimentsontheGrid'5000[9℄testbed,are ongurable, ontrollableand monitorable experimental Gridplatform gathering9sites ge-ographi ally distributed in Fran e. We used the lusters lo ated in Sophia-Antipolis, Orsay and Lille. Ea h experiment was arried out within a single su h luster. Thenodes areouttted with x86_64CPUs and 4GB of RAM for the Rennes and Sophia lusters (2 GB for the luster lo ated in Orsay). Intra luster bandwidth is 1 Gbit/s (measured: 117.5 MB/s for TCP so kets with MTU= 1500 B),intra lusterlaten y is 0.1 ms. A signi anteort was investedinpreparingtheexperimentalsetup,bydeninganautomated deploy-mentpro essfortheHadoopframeworkbothwhenusingBlobSeerand HDFS asthestorageba kend. Wehadtoover omenontrivialnodemanagementand ongurationissuestorea hthispoint.

Overview oftheexperiments. Inarstphase,wehaveimplementedaset ofmi roben hmarksthatwrite/readandappenddatatolesthroughHadoop's lesystemAPIandhavemeasuredthea hievedthroughputasmoreandmore on urrent lientsa essthelesystem. Thissyntheti setuphasenabledusto ontrolthea esspatterntothelesystemandfo usondierents enariosthat exhibitparti ulara esspatterns. We anthusdire tly omparetherespe tive behaviorofBSFSandHDFSin theseparti ularsyntheti s enarios.

In ase ond phase,our goalwasto get afeeling of theimpa t ofBlobSeer at theappli ation level. We haveruntwostandardMap/Redu eappli ations fromtheHadooprelease,bothwithBSFSandwithHDFS.Wehaveevaluated

(16)

20

30

40

50

60

70

80

90 0 2 4 6 8 10 12 14 16

Throughput

File size (GB)

HDFS

BSFS

(a)Performan eofHDFSandBSFSwhen asingle lientwritestoasinglele

0

100

200

300

400

500 0 2 4 6 8 10 12 14 16

Degree of unbalance

File size (GB)

HDFS

BSFS

(b)Load-balan ingevaluation

Figure3: Singlewriterresults

thenumberofavailableMap/Redu eworkersprogressivelyin reases. Notethat HadoopMap/Redu eappli ationsrun out-of-the-boxinanenvironmentwhere HadoopusesBlobSeerasastorageba kend,justlikeintheoriginal,unmodied environmentofHadoop. ThiswasmadepossiblethankstotheJavalesystem interfa eweprovidedwithBSFS,ontopofBlobSeer.

4.1 Mi roben hmarks

We have rst dened several s enarios aiming at evaluating the throughput a hievedby BSFS and HDFS when the distributed le system is a essed by a single lient orby multiple, on urrent lients, a ordingto several spe i a ess patterns. In this paper we have fo used the following patterns, often exhibitedbyMap/Redu eappli ations:

asinglepro esswritingahugedistributedle;

on urrentreadersreadingdierentpartsofthesamehugele; on urrentwritersappending datatothesamehugele.

Theaimoftheseexperimentsisof ourseto evaluatewhi hbenets anbe expe tedwhen usinga on urren y-optimizedstorageservi esu hasBlobSeer for highly-parallel Map-Redu e appli ations generating su h a ess patterns. The relevan e of these patterns is dis ussed in the following subse tions, for ea h s enario. Additional s enarios with other dierent a ess patterns are urrentlyunder investigation.

In ea h s enario, we rst measure the throughput a hieved when a single lientperforms aset of operationson the le system. Then, wegradually in- rease the number of lientsperforming the sameoperation on urrently and measure the average throughput per lient. Forany xed number

N

of on- urrent lients, the experiment onsists in two phases: we deploy of HDFS (respe tivelyBSFS)onagivensetup,thenwerunthetests enario.

Inthedeploymentphase,HDFS(respe tivelyBSFS)isdeployedon270 ma- hinesfromthesame lusterofGrid'5000. ForHDFS,wedeployonenamenode on adedi ated ma hine; theremaining nodesare used forthe datanodes (one datanode per ma hine). On the same number of nodes, we deploy BSFS as

(17)

follows: oneversionmanager,oneprovidermanager, one node forthe names-pa e manager, 20 metadata providers; the remaining nodes are used as data providers. Ea h entityisdeployedonaaseparate,dedi atedma hine.

For the measurement phase, a subset of

N

ma hines is hosen from the set of ma hines where datanodes/providersare running. The lients are then laun hed simultaneouslyon this subsetof ma hines, individual throughput is olle ted and is then averaged. These steps are repeated 5 times for better a ura y(whi h is enough,asthe orrespondingstandarddeviation provedto below).

4.2 S enario 1: single writer, single le

Werstmeasuretheperforman eofHDFS/BSFSwhenasingle lientwritesa lewhose size gradually in reases. This test onsistsin sequentiallywritinga uniqueleof

N ×64

MB,inblo ksof64MB(

N

goesfrom1to246).Thesizeof HDFS's hunksis64MB,andsoistheblo ksize onguredwithBlobSeerinthis ase. Thegoalofthisexperimentis to omparetheblo kallo ationstrategies thatHDFSandBSFSuseindistributingthedataa rossdatanodes(respe tively dataproviders). Thepoli yusedbyHDFS onsistsinwritinglo allywhenevera writeisinitiatedonadatanode. Toenableafair omparison,we hosetoalways deploy lientsonnodeswhere nodatanode haspreviouslybeendeployed. This way, wemakesure that HDFS will distribute the data among the datanodes, instead of lo ally storing the whole le. BlobSeer's default strategy onsists in allo ating the orresponding blo ks on remote providers in a round-robin fashion.

We measure the write throughput for both HDFS and BSFS: the results anbe seenon Figure 3(a). BSFS a hieves asigni antly higherthroughput than HDFS,whi h is a resultof the balan ed, round-robinblo k distribution strategyusedbyBlobSeer. AhighthroughputissustainedbyBSFSevenwhen thelesizein reases(upto16 GB).Toevaluateoftheloadbalan ingin both HDFSand BSFS, we hose to omputethe Manhattan distan e to anideally balan edsystemwherealldata providers/datanodes storethesamenumberof blo ks/ hunks. To al ulatethisdistan e,werepresentthedatalayoutin ea h asebyave torwhosesizeisequaltothenumberofdataproviders/datanodes; the elements of the ve tor represent the number of blo ks/ hunks stored by ea h provider/datanode. We ompute 3su h ve tors: one for HDFS, onefor BSFS and one for a perfe tly balan ed system (where all elements have the samevalue: thetotalnumberof blo ks/ hunksdividedbythetotalnumberof storagenodes. Wethen omputethedistan ebetweentheidealve torandthe HDFS(respe tivelyBSFS).AsshownonFigure3(b), asthelesize(andthus, the number of blo ks) in reases, both BSFS and HDFS be ome unbalan ed. However, BSFS remains mu h loser to a perfe tly balan ed system, and it manages to distribute the blo ks almost evenly to the providers, even in the aseofalargele. Asfaraswe antell,this anbeexplainedbythefa t that the blo k allo ation poli y in HDFS mainly takes into a ount data lo ality and doesnotaim at perfe tly balan ing thedata distribution. A global load-balan ingofthesystemisdoneforMap/Redu eappli ationswhenthetasksare assignedtonodes. Duringthisexperiment,we ouldnoti ethatinHDFSthere aredatanodes thatdonotstoreanyblo k,whi h explainsthein reasing urve

(18)

20

30

40

50

60

70

80

0

50 100 150 200 250

Average throughput (MB/s)

Number of clients

HDFS

BSFS

Figure 4: Performan eof HDFSand BSFSwhen on urrent lients readfrom asinglele

shownin gure3(b). Aswewill seein the nextexperiments,abalan ed data distributionhasasigni antimpa tontheoveralldataa essperforman e. 4.3 S enario 2: on urrent reads, shared le

Inthis s enario,forea hgivennumber

N

of lientsvaryingfrom 1to 250,we exe uted the experiment in two steps. First, we performed a boot-up phase, where a single lient writes a le of

N × 64

MB, right after the deployment of HDFS/BSFS. Se ond,

N

lientsread parts from the le on urrently; ea h lient reads a dierent 64 MB hunk sequentially, using ner-grain blo ks of 4 KB. This patternwhere multiple readersrequest data in hunks of4 KB is very ommoninthemap phaseofaHadoopMap/Redu eappli ation, where themappersread theinputleinorder toparsethe(key,value)pairs.

Forthiss enario,werantwoexperimentsinwhi hwevariedthedatalayout forHDFS.The rstexperiment orrespondstothe asewhere theleread by all lients isentirelystored byasingle datanode This orresponds to the ase where thelehaspreviouslybeenentirelywrittenbya lient olo atedwitha datanode (asexplainedinthepreviouss enario). Thus,all lientssubsequently read thedata storedby onenode,whi h willlead toaverypoorperforman e of HDFS. Wedo not representthese results here. In order to a hieveamore fair omparison where the le isdistributed onmultiple nodes both in HDFS and in BSFS, we hose to exe ute a se ond experiment. Here, the boot-up phaseisperformedonadedi atednode(nodatanode isdeployedonthatnode). By doing so, HDFS will spread the le in a more balan ed way on multiple remotedatanodesandthereadswillbeperformedremotelyforbothBSFSand HDFS. This s enarioalso oers ana urate simulation of therst phaseof a Map/Redu eappli ation,whenthemappersareassignedtonodes. TheHDFS jobs hedulertriestoassignea hmaptasktothenodethatstoresthe hunkthe task willpro ess; thesetasks are alled lo al maps. Thes heduleralso triesto a hieveaglobalload-balan ingofthesystem,thereforenotalltheassignments will be lo al. The tasks running on a dierent node than the one storing its inputdata,are alledremotemaps: theywillread thedataremotely.

The resultsobtained in these ond experiment are presented on Figure 4. BSFS performs signi antly better than HDFS, and moreover, it is able to deliverthe samethroughput even when thenumber of lients in reases. This is adire t onsequen eof how balan ed isthe blo kdistribution forthat le.

(19)

0 2000

4000

6000

8000

10000

12000

0 50 100 150 200 250

Aggregated throughput (MB/s)

Number of clients

BSFS

Figure5: Performan eofBSFSwhen on urrent lientsappendtothesamele

Thesuperiorloadbalan ingstrategyusedbyBlobSeerwhenwritingthelehas apositiveimpa t ontheperforman e of on urrent reads,whereasthe HDFS suersfromthepoordistributionofthele hunks.

4.4 S enario 3: Con urrent appends, shared le

Wenowfo usonanothers enario,where on urrent lientsappenddatatothe samele. Thiss enarioisalsousefulinthe ontextofMap/Redu eappli ations, asitisforawiderangeof data-intensiveappli ationsingeneral. Forinstan e, the possibility of running on urrent appends an improve the performan e of a simple operation su h as opying a large distributed le. This an be donein parallel by multiple lientswhi h read dierent parts ofthe le, then on urrentlyappend the data to the destination le. Moreover,if on urrent append operationsare enabled, Map/Redu eworkers anwrite theoutput of theredu e phasetothesamele,insteadof reatingmanyoutputles,asitis urrentlydonein Hadoop.

Despiteitsobvioususefulness,thisfeatureisnotavailablewithHadoop'sle system: Hadoophasnotbeenoptimizedforsu h as enario. As BlobSeer pro-videssupportfore ient, on urrentappendsbydesign,wehaveimplemented the append operation in BSFS and evaluated the aggregated throughput as the number of lients varies from 1to 250. We ould not perform the same experimentforHDFS,sin eitdoesnotimplementtheappendoperation.

Figure 5 illustrates the aggregated throughput obtained when multiple lients on urrentlyappenddatatothesameBSFSle. Thesegoodresults an beobtainedthankstoBlobSeer,whi hisoptimizedfor on urrentappends.

Notethattheseresultsalsogiveanideaabouttheperforman eof on urrent writes to thesame le. In BlobSeer, theappend operation is implemented as aspe ial aseof the write operation where the write osetis impli itly equal to the urrent le size: theunderlying algorithms are a tually identi al. The sameexperimentperformedwithwritesinsteadofappends,leadstoverysimilar results.

(20)

20

40

60

80

100

120

140

160

180

200

220

240 0 1 2 3 4 5 6 7

Job completion time (s)

Size of data generated by one mapper (GB)

HDFS

BSFS

(a) RandomTextWrite: Job ompletion timefora totalof 6.4GBof outputdata whenin reasingthedatasizegeneratedby ea hmapper

0

20

40

60

80

100 6 7 8 9 10 11 12 13

Job completion time (s)

Total text size to be searched (GB)

HDFS

BSFS

(b)Distributedgrep: Job ompletiontime whenin reasingthe sizeofthe inputtext tobesear hed

Figure6:BenetsofusingBSFSinsteadofHDFSasastoragelayerinHadoop: impa tontheperforman eofMap/Redu eappli ations

4.5 Higher-level experiments with Map/Redu e appli a-tions

In orderto evaluate how well BSFSand HDFS perform in the roleof storage layersforrealMap/Redu eappli ations,wesele tedtwostandardMap/Redu e appli ationsthat arepartofYahoo!'sHadooprelease.

Therstappli ation,RandomTextWriter, isrepresentativeofadistributed job onsisting in alarge number oftasks ea h ofwhi h needsto write alarge amountofoutputdata(withnointera tionamongthetasks). Theappli ation laun hesaxednumberofmappers,ea hofwhi hgeneratesahugesequen eof randomsenten esformed fromalistof predenedwords. Theredu e phaseis missingaltogether: theoutputofea hofthemappersisstoredasaseparatele inthelesystem. Thea esspatterngeneratedbythisappli ation orresponds to on urrent,massivelyparallelwrites,ea hofthemwritingtoadierentle. To omparetheperforman eofBSFSvs. HDFSin su h as enario,we o-deployaHadooptasktra ker withadatanode inthe aseofHDFS(withadata providerin the ase ofBSFS) onthe samephysi al ma hine, for atotalof 50 ma hines. TheotherentitiesforHadoop,HDFS(namenode,jobtra ker)andfor BSFS (versionmanager, providermanager, namespa emanager)are deployed onseparatededi atednodes. ForBlobSeer,10metadataprovidersaredeployed ondedi atedma hinesaswell.

Wexthetotaloutputsizeofthejobtoamountto6.4GBworthofgenerated textandvarythesizegeneratedbyea hmapperfrom128MB( orrespondingto 50parallelmappers)to6.4GB( orrespondingtoasinglemapper),andmeasure thejob ompletion timeinea h ase.

ResultsobtainedaredisplayedonFigure6(a). Observetherelativegainof BSFSoverHDFSrangesfrom7% for50parallelmappersto11% forasingle mapper. The aseofasinglemapper learlyfavoursBSFSandis onsistentwith ourndingsfor thesyntheti ben hmark inwhi h weexplainedtherespe tive behavior of BSFS and HDFS when a single pro ess writes a huge le. The relativedieren eissmallerthaninthe aseofthesyntheti ben hmarkbe ause herethetotal job exe utiontime in ludes some omputationtime (generation

(21)

ofrandomtext). This omputationtimeisthesameforbothHDFSandBSFS andtakesasigni antpartofthetotalexe utiontime.

These ondappli ationwe onsiderisdistributedgrep. Itisrepresentativeof adistributedjobwherehugeinputdataneedstobepro essedinordertoobtain somestatisti s. Theappli ations ans ahugetext inputlefor o urren esof a parti ular expression and ounts the number of lines where the expression o urs. Mapperssimplyoutput thevalueof these ounters, then theredu ers sumupthealltheoutputsofthemapperstoobtainthenalresult. Thea ess patterngeneratedbythisappli ation orrespondsto on urrentreadsfromthe samesharedle.

Inthis s enariowe o-deploy atasktra ker with aHDFSdatanode (with a BlobSeer data provider, respe tively), ona totalof 150 nodes. We deployall entralized entities (version manager, providermanager, namespa e manager, namenode,et )ondedi ated nodes. Also,20Metadataprovidersaredeployed ondedi atednodesforBlobSeer.

WerstwriteahugeinputletoHDFSandBSFSrespe tively. Inthe ase ofHDFS,theleiswrittenfromanodethatisnot olo atedwithadatanode,in ordertoavoidthes enariowhereHDFSwritesalldatablo kslo ally. Thisgives HDFSthe han etoperformsomeload-balan ingofdatablo ks. Thenwerun thedistributed grepMap/Redu eappli ation andmeasure thejob ompletion time. Wevary thesizeof theinputlefrom 6.4GBto 12.8GBin in rements of1.6GB.Sin eaHadoopdatablo kis64MBlargeandsin eusuallyHadoop assignsasinglemappertopro esssu h adatablo k,this roughly orresponds tovaryingthenumberof on urrentmappersfrom100to200.

ResultsobtainedarerepresentedinFigure 6(b). As anbe observedBSFS outperformsHDFSby35 %for6.4GB andthegapsteadily in reasesto 38% for12.8GB. This behavioris onsistentwith theresultsobtainedforthe syn-theti ben hmarkwhere on urrent pro essesread from the samele. Again, therelativedieren e issmaller thanin thesyntheti ben hmarkbe ausethe job ompletiontimea ountsforboththe omputationtimeandtheI/O trans-fer time. Note howeverthehigh impa t ofI/O in su h appli ations that s an throughthedataforspe i patterns: thebenetsof supportinge ient on- urrent reads from thesame le at thelevelof the underlying distributed le systemaredenitelysigni ant.

5 Con lusion

Thee ien yof theHadoopframeworkis adire t fun tion ofthat ofitsdata storagelayer. Thisworkdemonstratesthatitispossibletoenhan eitby repla -ingthedefaultHadoopDistributedFileSystem(HDFS)layerbyanotherlayer, builtalongdierentdesignprin iples. Weintrodu eourBlobSeersystem,whi h isspe i allyoptimizedtowarde ient,ne-graina esstomassive,distributed dataa essedunderheavy on urren y. ThanktothisnewBlobSeer-basedFile System(BSFS) layer, thesustained throughputof Hadoopis signi antly im-provedins enariosthatexhibithighly on urrenta essestosharedles. More-over, BSFS supports additionalfeatures su h as e ient on urrent appends, on urrent writes at random osets and versioning. These features ould be leveragedto extendorimprovefun tionalities in future versions of Hadoopor

(22)

Leveraging versioning. Although in most real Map/Redu e appli ations, dataismostlyappendedratherthanoverwritten,Hadoop'slesystemAPIdoes not implement append. Sin e BlobSeer supports arbitrarily on urrent writes aswellasappends,thisopensahighpotentialforverypromisingimprovements ofMap/Redu eframeworkimplementations,in ludingHadoop. Versioning an be leveragedto optimize more omplex Map/Redu eworkows, in whi h the output of one Map/Redu eis the input of another. In many su h s enarios, datasetsareonlylo allyalteredfromoneMap/Redu epasstoanother: writing partsofthedatasetwhilestillbeingabletoa esstheoriginaldataset(thanks toversioning) ouldsavealot oftemporarystoragespa e.

Fault toleran e. An important aspe t we did not dis uss in this paper is fault toleran e. For this, we urrently relyon lassi al me hanisms. At data level,weemployasimplerepli ationme hanismthatallowstheusertospe ifya repli ationlevelforea hBLOB.Awriteoperationa tuallywritesitsrespe tive blo kstoanumberofprovidersequaltothatrepli ationlevel. Themetadatais storedinaDHT(formedbythemetadataproviders),whi hisresilienttofaults by onstru tion. The entralizedmanagersrepresentsinglepointsoffailureas is the ase withthe namenode of HDFS.Overall,fault-toleran e s hemes ur-rentlyusedinBlobSeerarehoweverratherminimal. Weare urrentlyexploring ways to repla e them with distributed, fault-tolerant me hanisms, while still preservingahigh-throughputfordataa ess.

Se urity. We did not address se urity issues in this paper, as most of the time Hadoopdeploymentsareexploited within private,trusted lusters owned by big ompanies,su h asGoogle and Yahoo!: for now,wepla e ourselvesin the same ontext, therefore the se urity assumptions are basi ally the same as for Hadoop's built-in le system. Inthe ase where Hadoop would run as a Map/Redu e loud servi e, possibly relying on externalized, virtualized re-sour es from other loud omputing servi e providers (su h as Amazon), the se urity onstraints would be dierent. It then be omes ru ial to guarantee datapriva yanddataa ess ontrolformultipleusers,a ordingtoa ontra t. Weplantoexplorethese issuesin thenearfuture.

Referen es

[1℄ AmazonElasti ComputeCloud(EC2). http://aws.amazon. om/e 2/. [2℄ AmazonElasti MapRedu e. http://aws.amazon. om/elasti mapredu e/. [3℄ AmazonSimpleStorageServi e (S3). http://aws.amazon. om/s3/.

[4℄ Kosmix CloudStore DistributedFile System. http://kosmosfs.sour eforge. net/index.html.

[5℄ JereyDeanandSanjayGhemawat. MapRedu e: simplieddata pro ess-ingonlarge lusters. Communi ations ofthe ACM,51(1):107113,2008. [6℄ SanjayGhemawat, HowardGobio,andShun-TakLeung. Thegooglele

(23)

[7℄ TheApa heHadoopProje t. http://www.hadoop.org .

[8℄ HDFS. The Hadoop Distributed File System. http://hadoop.apa he.org/ ommon/do s/r0.20.1/hdfs_design.html.

[9℄ YvonJégou, StephaneLantéri,Julien Ledu , MelabNoredine, Guillaume Mornet, Raymond Namyst, Pas ale Primet, Benjamin Quetier, Olivier Ri hard, El-Ghazali Talbi, and Tou he Iréa. Grid'5000: a large s ale andhighlyre ongurableexperimentalgridtestbed.International Journal of High Performan e Computing Appli ations, 20(4):481494, November 2006.

[10℄ Bogdan Ni olae,Gabriel Antoniu, andLu Bougé. BlobSeer: Howto en-able e ient versioningfor large obje t storage under heavy a ess on- urren y. In Pro . 2nd Workshop on Data Management in Peer-to-Peer Systems (DAMAP'2009), SaintPetersburg, Russia, Mar h 2009. Held in onjun tionwithEDBT'2009.

[11℄ Bogdan Ni olae, Gabriel Antoniu, and Lu Bougé. Enabling high data throughputindesktopgridsthroughde entralizeddataandmetadata man-agement: The blobseer approa h. In Pro . 15th International Euro-Par Conferen e on Parallel Pro essing (Euro-Par '09), volume 5704 of Le t. Notes in Comp. S ien e, pages 404416, Delft, The Netherlands, 2009. Springer-Verlag.

[12℄ PVFS. Parallelvirtuallesystem,version2. http://pvfs2.org/.

[13℄ FrankB.S hmu kandRogerL.Haskin. GPFS: Ashared-disklesystem for large omputing lusters. InFAST'02: Pro eedings of the Conferen e on File and Storage Te hnologies, pages 231244. USENIX Asso iation, 2002.

[14℄ GarthGibsonWittawatTantisiriroj,SwapnilPatil. Data-intensivele sys-tems forinternet servi es: A rosebyanyother name... Te hni al Report UCB/EECS-2008-99,ParallelData Laboratory,O tober2008.

[15℄ C. Zheng, G. Shen, S. Li, and S. Shenker. Distributed Segment Tree: Supportrangequeryand overqueryoverDHT. InPro eedingsoftheFifth InternationalWorkshoponPeer-to-PeerSystems(IPTPS),SantaBarbara, California,2006.

(24)