Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Differentiation

(1)

HAL Id: inria-00079223

https://hal.inria.fr/inria-00079223v2

Submitted on 12 Jun 2006

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

Reverse-mode Automatic Differentiation

Laurent Hascoet, Mauricio Araya-Polo

To cite this version:

Laurent Hascoet, Mauricio Araya-Polo. Enabling user-driven Checkpointing strategies in

Reverse-mode Automatic Differentiation. [Research Report] RR-5930, INRIA. 2006, pp.21. �inria-00079223v2�

(2)

inria-00079223, version 2 - 12 Jun 2006

a p p o r t

d e r e c h e r c h e

9 -6

3

9

9 IS

R

N

IN

R

IA

/R

R

--5

9

3

0 --F

R

+

E

N

G

Thème NUM

Enabling user-driven Checkpointing strategies in

Reverse-mode Automatic Differentiation

Laurent Hascoët — Mauricio Araya-Polo

N° 5930

(3)

(4)

Laurent Has oët ,Mauri io Araya-Polo

ThèmeNUMSystèmesnumériques ProjetTropi s

Rapportdere her he n°5930Mai200621pages

Abstra t: This paperpresentsanewfun tionalityoftheAutomati Dierentiation(AD) Tool tapenade. tapenade generates adjoint odes whi h are widely used for optimiza-tion orinverse problems. Unfortunately, for large appli ations the adjoint ode demands agreat dealof memory, be ause it needs to store alarge set of intermediatesvalues. To opewiththatproblem, tapenadeimplementsasub-optimalversionofate hnique alled he kpointing,whi hisatrade-obetweenstorageandre omputation. Ourlong-termgoal istoprovideanoptimal he kpointingstrategyforevery ode,notyeta hievedbyanyAD tool. Towardsthat goal,werstintrodu emodi ationsin tapenadein ordertogivethe user the hoi e to sele t the he kpointing strategy most suitablefor their ode. Se ond, we ondu t experimentsin real-sizes ienti odesinorder togatherhintsthathelp usto dedu eanoptimal he kpointingstrategy. Someof theexperimentalresultsshowmemory savingsupto35%andexe utiontimeupto90%.

(5)

Résumé: Nousprésentonsunenouvellefon tionnalitédel'outildeDiérentiationAutomatique (DA)tapenade. LemodeinversedelaDA onstruitdes odesadjoints,quisontlargement utilisésen al uls ientique,pourl'optimisationoulesproblèmesinverses. Bienqu'apriori remarquablemente a e,lemodeinversesouredelatrèsgrande onsommationmémoire requisepour onserverdesvaleursintermédiairesdu programmeinitial. LeChe kpointing estun ompromissto kage-re al ulquiréduit ette onsommation. Notrebutestderendre l'utilisationduChe kpointingdanstapenadeplusexible,enparti ulierpardesdire tives utilisateur. Notrebutàtermeestdedévelopperdesstratégiessemi-automatiquesoptimales d'appli ationduChe kpointing. Dans erapport,nousprésentonslesmodi ationsapportées àtapenadepourrendrele Che kpointingexible, puisnous étudionset nous omparons ertainesstratégiesdeChe kpointingsurplusieursappli ationsréellesprovenantd'utilisations industrielles de tapenade, dans le but de dégager des heuristiques génerales. Certaines expérien esmontrentdesaméliorationsenmémoiredel'ordrede35%,etentempsd'exé ution del'ordrede90%.

(6)

1 INTRODUCTION

The ontextofthisworkisAutomati Dierentiation(AD)[2,7℄. ThereversemodeofAD isapromisingwaytobuildadjoint odesto omputegradients. Thefundamentaladvantage ofadjoint odesisthatthey omputegradientsata ostwhi hisindependentofthe dimen-sion of theinput spa e, and they are thus a key ingredient to solve inverse problems and optimizationproblems[14, 4℄. ADadjoint odesarefundamentallymadeof twosu essive sweeps,aforwardsweeprunning theoriginal odeand storingasigni antpartofthe in-termediatevalues,andaba kwardsweepusingthesevaluesto omputethederivatives. For largeappli ations,su hasCFDprograms,reversedierentiated odesmayendupusingfar toomu hmemory.

Che kpointing is a standardtime/memory trade-o ta ti to redu e the peak of this memoryuse. Whenasegmentoftheprogramis he kpointed,itisexe utedwithoutstorage of theintermediatevalues. Later on, when the ba kwardsweeprea hesthe he kpointed segment,thissegmentmustbeexe utedase ondtimewithstorage,andnallytheba kward sweepmayresume. Che kpointinghasabenet: therearetwopla eswherethememorysize rea hesapeak,namelyattheend oftheforwardsweepandat theendofthe he kpointed segment, and both peaks are generally smaller than thepeak without he kpointing. On theotherhand, he kpointinghasa ost: (1)inexe utiontimebe ausesegmentisexe uted twi e and (2) in memory be ause intermediate values must be store to run the segment twi e. Hopefullythislast memory ostislessthanthememorybenetabove.

In ADtools, he kpointingis applied systemati ally, for instan e at pro edure alls or aroundloopsbodies. Experien eshowsthat he kpointingeverypro edure allisingeneral sub-optimal. Optimalstrategieshavebeenfoundonlyforthe aseofaxed-lengthloop[5℄, andnotforthenestedpro edurestru tureofreal-life odes.

TowardstheultimategoalofanADtoolembedding anoptimal he kpointingstrategy forallprograms, we propose in arst stepto a tivate he kpointingfor onlyanumberof user-sele tedpro edure alls. Therefore, in addition to the default systemati he kpoint mode ( alledjoint modein [7℄), ea h pro eduremaynowbe dierentiatedin theso- alled splitmode, i.e. without he kpointing. Insplit mode,thepro edure givesriseto two sepa-ratedierentiatedpro edures,onefortheforwardsweepandonefortheba kwardsweep.

This paperpresentstheimplementationof thisnewsplit mode fun tionality inside our ADtooltapenade[10℄,whi huptonowonlyfeaturedthejointmode. Wealsodis ussthe ne essaryadaption of theexisting preliminary data-owanalyses namely, adjoint-liveness analysis[11℄and TBRanalysis [9,11℄. Inase ondstep,weusethisuser ontrolon he k-pointing to make experimental measurements of various he kpointing hoi es on several large s ienti odes. We present the results of these experiments, some of whi h show savingsofmemoryupto 35%andexe utiontimeupto 90%. Also,these resultsgivehints toageneralautomati strategyofwhereto use he kpointing. Atpresent,noADtoolhas

(7)

su hageneral he kpointingstrategy,andourlongtermgoalistoprovideoneintapenade.

The remainder of this paperis stru tured asfollows: Se tion 2 introdu esthe reverse mode of AD.In Se tion3 wepresentthe he kpointing te hnique andshowhowdierent he kpointingpla ementstrategiesae tthebehaviorofthereversedierentiated ode. In Se tion 4we dis uss the implementation issues. In Se tion 5 we present and dis uss the experimental measurements. Finally, we dis uss the future work and the on lusions in Se tion6.

2 REVERSE AUTOMATIC DIFFERENTIATION

In our ontext, AD is a program transformation te hnique to obtain derivatives, and in parti ular gradients. We aregivenaprogram

P

that evaluates afun tion

F

. Program

P

anbeseenasasequentiallistofinstru tions

I

j

P = I

1 ; I

2 ; . . . ; I

j

; . . . ; I

p−1

; I

p

,

wheretheinstru tionsrepresentelementaryfun tions

f

i

. Thenthefun tion

F

isindeed

F = f

p

◦

f

p−1

◦

. . . ◦ f

j

◦

. . . ◦ f

2 ◦

f

1 .

ADtakesadvantageofthistoapplythe hainruleof al ulustobuildanewprogramthat evaluatesthederivativesof

F.

ThereversemodeofAD omputesgradients. Roughlyspeaking,foragivens alaroutput,it returnsthedire tionin theinputspa ethatmaximizesthein reaseofthisoutput. Stri tly speakinggradientisdenedonlyfors alaroutputfun tions. Therefore,webuildave tor

Y

thatdenes theweightsofea h omponentoftheoriginaloutput

Y = F (X)

. Thisdenes as alaroutput

Y

t

×

_{Y = Y}

t

×

_{Y = F}

t

_{(X) × Y}

. Itsgradienthasthusthefollowingform:

X = F

′t

_{(X) × Y = f}

′t

1 (x

0 ) × . . . × f

j+1

′t

(x

j

) × . . . × f

p

′t

(x

p−1

) × Y

(1)

where

x

i−1

is the set of all variables values just before exe ution of the instru tion that implements

f

′t

i

,and

F

′t

_(X)

isthetransposed Ja obian.

Formula1isimplementedfromrighttoleft,be ausematrix

×

ve torprodu tsare heaper to omputethan matrix

×

matrix produ ts. This resultin probably themost e ient way to omputeagradient. Unfortunately,thismodeofADhasadi ulty: the

f

′t

i

instru tions requiretheintermediate values

x

i−1

in thereverseof their reationorder. The troubleis thatprogramsoftenoverwritevariables,andthereforethesevaluesmaybelostwhenneeded bythe

f

′t

i

.

There are two main strategies to ope with this problem: Re ompute-All [3℄ or Store-All[7℄. Re ompute-Allstrategyisverydemandinginexe utiontime,quadrati withrespe t

(8)

tothenumberofrun-timeinstru tions,be auseitre omputestheintermediatesvaluesevery timetheyarerequired,fromasavedinitialpoint. Ontheotherhand,theStore-Allstrategy islinearwithrespe ttothenumberofrun-timeinstru tions,bothformemory onsumption and exe ution time, be ause it onsists in storing on a sta k all values required later by derivatives,and then restore them when theyareneeded. This results in thestru ture of reversedierentiatedprogramsshownonFigure1.

0

1

0

1

0

1

000

111

000

111 000000

111111

0

1

0

1

0

1

0

1

0

1

1 F orward

Sweep

Backward

Sweep

m

peak

¯

x

= f

′t

1 (x

0 ) × ¯

x

1 ;

x

j

= f

j

(x

j−1

);

x

p−1

= f

p−1

(x

p−2

);

¯

x

j

= f

j+1

′t

(x

j

) × ¯

x

j+1

;

x

0 ;

T IM E

...

..

.

...

¯

x

p−1

= f

p

′t

(x

p−1

) × ¯

y;

restore values

store values

(stack)

M EM ORY

Figure1: Thehorizontalaxisrepresentstheamountofvalues urrentlyonthesta k.

Be ause we will need to reason formally about adjoint programs in the sequel of this paper,weneedtodenotethemin amorealgebrai way. Thereversedierentiatedprogram

P

hastwo parts. The rst is alled the forward sweep

−

→

P

, and is basi ally the ne essary sli e"oftheoriginalprogram

P

plussomeinstru tionstostorerequiredvalues. These ond partis alled theba kwardsweep

←

−

P

, and onsistsof the instru tionsthat implementthe fun tions

f

′t

i

(x)

fromFormula1,plussomeinstru tionsto re overtheneededintermediate values.

Formalizingthestru tureoftheprograminFigure1,thestru tureofthereverse dier-entiatedprogram

P

ofaprogram

P

isroughlydes ribedbyequation(2)

P =

−

→

P ;

←

P = I

−

1 ; . . . ; I

_p−1

;

←

−

I

p

; . . . ;

←

−

I

1

(2)

Figure2showsthereversedierentiatedversionofasmallexamplepro edure,featuring theforwardandba kwardsweeps. ThePUSH()andPOP() alls storeandrestorevaluesof

(9)

requiredintermediatesvariables. We annowreneformula(2)byinsertingthese alls. For anyinstru tion

I

andany programtail

D

after

I

, theprogram

P

is dened re ursivelyby thefollowingequation:

P = I ; D =

−

→

I ; D ;

←

−

I = PUSH(

out

(I)) ; I ; D ; POP(

out

(I)) ; I

′

(3)

where out

(I)

is aset of values overwritten by instru tion

I

. In reality, westore onlythe intermediatesvalueswhi harerequiredto omputethederivativesof

I

andofitspre eding instru tions. Thedata-owequationsofthestati analysis thatevaluatesthesevalues"To BeRe orded",knownasthe"TBR"analysis,wasgivenin [11℄.

Original pro edure Reverse dierentiated pro edure

subroutine sub1(x,y,z)

I

1

tmp1 = SIN(y)

I

2

y = y * y

I

3

tmp1 = tmp1 * x

I

4

z = y / tmp1 end subroutine sub1_b(x,xb,y,yb,z,zb)

I

1

tmp1 = SIN(y) PUSH(y)

I

2

y = y * y PUSH(tmp1)

I

3

tmp1 = tmp1 * x

I

′

4 yb

= zb/tmp1

tmp1b

= −(y ∗ zb/tmp1 ∗ ∗2)

POP(tmp1)

I

′

3 xb

= tmp1 ∗ tmp1b

tmp1b

= x ∗ tmp1b

POP(y)

I

′

2

yb = 2 * y * yb

I

′

1

yb = COS(y) * tmp1b end

(10)

3 CHECKPOINTING

To ontrolthememoryproblem ausedbythestorageofintermediatesvalues,theStore-All strategy anbeimprovedintwomaindire tions: (1)renethedata-owanalysesin order toredu ethenumberofvaluestostore,and(2)dea tivatetheStore-Allstrategyfor hosen segmentsof the ode,therefore savingmemoryspa e. Theformer isdes ribedin [11, 12℄, thelatteristhefo usofthiswork.

The me hanism whi h dea tivates the Store-Allstrategy for ertain hosensegmentis alled he kpointing. Ithas two onsequen es on thebehaviorof thereverse dierentiated program:

1. whentheba kwardsweeprea hesthe hosensegment,itmustbeexe utedagain,this timewithStore-Allstrategyturnedon.

2. in order to exe ute the segment twi e, a su ient set of values ( alled a snapshot) mustbestoredbeforetherstexe utionofthesegment.

OnFigure 3,weassumethat

snapshot(C) < tape(C)

. This isareasonableassumption in most ases, and parti ularly when C is large. As a onsequen ewe see that

m

peak

c

is smallerthan

m

peak

,be ausein the he kpointed asetherstexe utionofsegmentCdoes not store anything. Conversely, we see that the time

t

c

is longer than

t

, be ause in the no- he kpointing aseeverypie e ofthe odeisexe utedonlyon e,whereasweobservein the he kpointing asethat segment

C

isexe utedtwi e (

C

and

−

→

C

).

Che kpointedsegments anbe hosenindierentways,and anbenested. One lassi al strategyisto he kpointea handeverypro edure all. However,experien eindi ates that thisstrategy isnotoptimal, thoughtheoptimalsituation isnoteasyto foresee. Sin ethe optimal he kpointingstrategy is stillout rea h, it seemsnaturalto lettheuser inuen e the hoi e. A ompletelyuser-driven he kpointingwillallowtheusertotryea handevery ombination, looking for an optimal pla ement of he kpoints. This paper des ribes the developmentstoa hievethis userintera tion. Inase ond step,this willletusexperiment aboutrulesandta ti s,towardsthelong-termgoalof omputeraidedoptimal he kpoint-ing. This paperpresentsourrstexperimentsinthisdire tion.

The assumption behind he kpointing is that

snapshot(C) < tape(C)

. To keep the snapshotssmall, weneedtodevelopthealgebrai notationofequation(3). Whensegment

C

is he kpointed (denoted with surrounding parentheses), reverse dierentiation of the program

P = U ; C; D

isdenedbythere ursiveequation

P = U ; (C); D =

−

→

U ; PUSH(

snp

(C)); C; D; POP(

snp

(C)); C;

←

−

U

(4)

where

U

/

D

arethe odesegments

U

pstream/

D

ownstreamof

C

andsnp

(C)

isthe snap-shotstoredto re-exe ute

C

. Intuitively, ifavariableis notmodied by

C

norby

D

, then

(11)

0

1

0

1

000

111

000

111

000

111

00

11

00

11

00

11

00

11

00

11

11 store snapshot(C)

−

→

C f orward sweep C

←

−

C backward sweep C

C original code

storing tape(C)

restoring tape(C)

restore snapshot(C)

m

peak

−

→

C

←

−

C

←

−

C

−

→

C

T I M E

Sweep

t

Backward

Sweep

t

c

Backward

F orward

C

m

peak

c

Sweep

F orward

(stack)

M EM ORY

Figure 3: Che kpointinginReverseModeAD.

its value will be unmodied when

C

is runagain and it is notne essary to store it. We shalldenotebyout

(X)

thesetofvariablesoverwrittenbythe odesegment

X

. Also,only the variables that are going to be used by

C

need to be in the snapshot. Indeed, only

(12)

thevariables that are used by

C

need to be stored, and this set is oftensmaller thanthe variablesusedby

C

. Weshall allitlive

(C)

, anditisdetermined bytheso- alledadjoint livenessanalysis. Thereforeagoodenough onservativedenitionofthesnapshotis:

snp

(C) =

live

(C) ∩ (

out

(C) ∪

out

(D))

(5)

Thedata-owequationsofadjointlivenessanalysisweredenedformallyin[11℄. Snapshots an be rened further, taking into a ount the intera tions between su essive or nested he kpointedsegments. A studyonminimalsnapshots anbefoundin[12℄.

Let'snowfo usonthe he kpointpla ementproblem. Intapenadelikeinmanyother ADtools,thenatural he kpointedsegmentis thepro edure all. Thereforein the sequel weshallexperimentwithvariouspla ementsof he kpoints,allaroundpro edure alls,and thereforeshownon alltrees. Thishypothesisisbynomeansrestri tiveandour on lusions anbeextendedtoarbitrary leanlynested odesegments. Figure4shows(ontheleft)the

take snapshot

use snapshot

original subroutine x

←

−

x

backward sweep for x

−

→

x

forward sweep for x

D

B

C

−

→

C

←

C

−

→

B

←

B

−

←

−

D

−

→

D

−

→

A

←

A

−

C

B

D

A

Figure4: Joint-All mode: Che kpointingall allsin ReverseModeAD

allgraphof anoriginal program, and the orrespondingreverse-dierentiated all graph, usingtheJoint-Allmode,whereallpro edure allsare he kpointed. ThisJoint-Allmodeis naturallythebasi mode,beingtheextremetrade-othat onsumestimeandsavesmemory. Memoryresour esarenite,whereasexe utiontimeresour esarenot. Thereforethis hoi e issafest,espe iallyifweassumethatsnapshotsaregenerallysmallerthanthe orresponding tape.

Figure 5showstheotherextreme alternative,whi h he kpointsnopro edure all. We allthisalternativeSplit-Allmode. Insplitmodetheforwardsweepandtheba kwardsweep areimplementedseparately. Thereisnodupli ateexe ution,sonosnapshotisrequiredand in theory theexe ution time is smallest. On the other hand the peak size of the tape is highest. Moreover,sin etheforwardsweepandtheba kwardsweepdonotfollowea hother during exe ution, even the valuesof the lo al variables need to be stored, whi h requires evenmoreintermediatevaluesinthetape.

Split-AllandJoint-Allmodesaretwoextremestrategies. Itisworthtryinghybrid ases, wepresenta oupleof asesin Figure6. Therststrategy(hybrid1)implementsthejoint

(13)

−

→

A

−

→

D

−

→

B

−

→

C

←

−

A

←

−

C

←

−

D

←

B

−

Figure5: Split-Allmode: noChe kpointinginReverseModeAD

modeforallpro eduresex eptfor

D

. Conversely,these ondstrategy(hybrid2)implements thesplitmodeforallpro eduresex eptforpro edure

D

,whi his he kpointed.

B

C

−

→

A

←

A

−

C

−

→

A

D

−

→

B

−

→

D

←

D

−

←

B

−

→

C

−

→

D

←

D

−

→

B

←

B

−

→

C

←

C

−

←

−

A

←

−

C

hybrid1

hybrid2

Figure6: Twohybridapproa hes(split-joint)

In order to havea morepre ise idea of theaforementioned trade-o we shall simulate the performan es of these four he kpointing strategies from gures 4, 5, and 6, for two motivating s enarios, namely when "tape

>

snapshot" and when "tape

<

snapshot". We assumethatallpro edures requirethesamesnapshotandtapesize. Also, weassumethat ea h pro edurehasthesameexe utiontime.

Forthersts enario,wesetthememorysizeofthesnapshotto6andthememorysize ofthe tapeto 10. This setup orrespondsto theusual assumptionthat thetape isbigger than thesnapshot for pro edures. Figure 7 shows the behavior of the four he kpointing strategies. As we expe ted, the urve that represents the joint onguration shows the smallestmemoryusebutthelargestexe utiontime. Conversely,the urvethat represents thesplitmodehasthehighestpeakofmemoryusebuttheshortestexe utiontime. Hybrid

(14)

0

5

10

15

20

25

30

35

40

0

10

20

30

40

50

60

70 memory

time

Joint v/s Split

Joint-All

Split-All

hybrid1

hybrid2

Figure 7: Numeri alSimulationresults,tape=10,snapshot=6

strategiesrangebetweenthesetwoextremes.

Thiss enarioassumedthatthetapeisbiggerthanthesnapshot. However,this assump-tion is not always valid. Therefore we make a se ond simulation where we assume that thetape osts6in memory,andea hsnapshot osts10. Figure8showsthat Joint-Alland Split-Allmodesarenottheextremeofthetrade-oanymore. Infa t,theextremeboundsin memory onsumption orrespondstothehybridmodes. Wealsonoti ethat themaximum peak ofmemoryuse issmallerthanin therst simulation,whi h is notsurprising sin eit depends mostlyonthetapesize,whi h isassumedsmaller. Inthiss enario,theadvantage of he kpointing is less obvious be ause of the osts of snapshots, therefore the Split-All modeisnearlythebestin everyrespe t.

Thereal dierentiated odeswillhaveforeverypro edure dierenttape,snapshotand exe ution time hara teristi s, making this motivating simulation look a bit unreal. This givesusafeelingofthebehaviorofreal odes,butexperimentswithreal odearemandatory. Beforewegetto that,weshallbrieydis ussthene essaryimplementationstep.

(15)

0

5

10

15

20

25

0

10

20

30

40

50

60

70 memory

time

Joint v/s Split

Joint-All

Split-All

hybrid1

hybrid2

Figure 8: Numeri alSimulationresults,tape=6,snapshot=10

4 IMPLEMENTATION

Weimplemented the algorithms and data-ow analysis mentioned in the previousse tion inside tapenade tool [10℄, whi h is a sour e-to-sour e AD engine. tapenade is written in JAVA and some modules are written in C. tapenade supports programs written in Fortran77andFortran90/95.

4.1 Modi ations of the Data-Flow Analyses

TheADmodelthat tapenadeimplementsrelies onseveraldata-owanalyses,allofthem formallydened in [9,11℄. However,these analysesimpli itlymadetheassumption ofthe Joint-All strategy. The he kpointingstrategyhassstrongimpa ton adjointliveness and TBRanalyses,whi hareinterpro edural.Morepre isely,itimpa tsthewaydata-ow infor-mationispropagatedonthe allgraphduringthebottom-upandtop-downanalysessweeps.

Forexample,sin efora he kpointedsegmenttheforwardsweepisfollowedimmediately bythe reversesweep,we anusethe fa t that alloriginal variables areuseless atthe end of the forward sweep. This is the foundation of the adjointliveness analysis [11℄. In the initialstateoftheADtoolwhere every allis he kpointed,thisallowedthe"adjoint-live" set at the tail of ea h pro edure to be the empty set. The adjoint-liveness analysis an thenpro eed,ba kwardsinsidetheow-graphofthepro edure,progressivelya umulating

(16)

variablesinto the set oflivevariables. Inthenewsituation where apro edure anbeleft insplitmode,theinitial "adjoint-live"setatthetailofthispro edure must hange,andit dependsof thelivevariablesin ea hofits allingsites. Morepre isely,weshallsetthelive variablesatthetailofanon- he kpointedpro edure(i.e. splitmode)totheunionofallthe livevariablesjust afterea hofthe allsitesforthispro edure.

Inordertoimplementthementionedadaptationwehavetoruntheadjointliveness anal-ysistwi e. Arstsweeprunsbottom-upontheoriginalprogram allgraph. Inthissweep webuildtheee tofea hpro edureonthesetoflivevariables,tobeusedinea hofits all sites. These ondrunis top-downand a umulates thesets oflivevariable after ea h all site,beforeitisusedastheinitialsetfortheadjointlivenessanalysisofeverysplitpro edure.

SimilarlytheTBRanalysishadtobetransformed. TheTBRanalysisrunsforward,from theheadtothetailofea hpro edure. Attheouterlevelofthe allgraph,theanalysis ould runinonlyonebottom-upsweep. Be auseTBRanalysisnowrequiresa ontextinformation in the aseof anon- he kpointed pro edure,that will arrythe unionof the TBR status justbeforethe allsites,wehadtoaddatop-downsweepintotheTBR analysis.

4.2 General Implementation Notes

Alongwith the modi ationof the analyses, the generation of the dierentiated program must alsobeadapted. The ADmodel dened by equation (4) showsthat thejoint mode runs the ba kward sweepof

C

,

←

−

C

, immediately after itsforwardsweep

−

→

C

. When

C

is a pro edure,

−

→

C

and

←

−

C

anbeeasily merged into asingle pro edure

C

. As a onsequen e, lo al variables of

C

(and therefore of

−

→

C

) are still in s ope when

←

−

C

starts, and naturally preservetheir values. Thisis no longer possible in split mode, sin epro edure

−

→

C

and

←

−

C

are separated. Consequently, lo al variables of

−

→

C

must be stored before they vanish and restoredwhen

←

−

C

starts. This wasaddressedin theimplementationbyaddinganextension totheTBRanalysis. Thisextensionlooksforthelo alsvariablesthatarene essaryforthe ba kwardsweep,whentheendoftheforwardsweepisrea hed. ThesevariablesarePUSH'ed justat theend oftheforwardsweepandPOP'edatthebeginningoftheba kwardsweep.

Wemakethe hoi eofgeneralizationversusspe ialization,byallowingforonlyonesplit mode perpro edure. Eventhen, this requires are in namingthepro edures. Weneed to reate up to four names (original, forward sweep, ba kward sweep and reverse dierenti-ated) when split and joint strategiesare ombined. This problem is te hni al, but it has impli ationswithinthewholewaytapenadehandlesthenamesofdierentiatedelements.

The split strategy isdriven bythe userby means ofa dire tive(C$AD NOCHECKPOINT) whi h is pla edjust before thepro edure all,orthrougha ommand lineoption(-split "[list of pro edure names℄"). Theintrodu tionofdire tivesisanovelfeaturefor tape-nade.

(17)

5 EXPERIMENTAL MEASUREMENTS

Weappliedthesplitmodeto ertainpro edure alls,lookingforexperimental onrmation ofthe intuitionsfrom Se tion 5. Inparti ular, wewantto showthe interestof lettingthe userdrivethe he kpointingstrategy.

The pro edures hosen to be split were the ones that best illustrate the memory and run-timetrade-o. The riteriato hoosepro edures relyontwovalues,whi h anbe ob-tainedby studyingthe reversegenerated ode. These valuesare: the size ofthe snapshot andthesize ofthetape. Theimplementationof bothsnapshotandtapeis basedonPUSH alls,thusthemeasurementsand omparisonsbetweenthese valuesarestraightforward.

In gures 9 and 11, loops are denoted by square bra kets. For instan e, on Figure 9 wehavetwoloops,onewhi hinvolvesfromsubroutinepasdtlto subroutinequaind,and a se ond one whi h in ludes all inbigfun 's pro edures. In general, these loops are the segmentsoftheprogramsthat onsumemostofmemoryandtime.

5.1 Experiment I: UNS2D

uns2disaCFDsolver. Ithas2.055linesof ode(lo ). Thereversedierentiatedversion has2.200 lo .

QUAIND

CALGRA

ENTHALD

INBIGFUNC

PASDTL

CALGRA

DIFFAR

FLW2D

SYMMT

CALGRA

BIGFUNCTION

CALCL

(18)

Experiment Time Memory Id Des ription Total[s℄ % gain Peak[Mb℄ % gain

01 Joint-Allstrategy 41.69 184.69

02 splitmode al l(all allsites) 37.66 9.7 167.53 9.3

03 splitmodequaind 37.54 9.9 162.13 12.2

04 splitmode algra(all allsites) 36.63 12.1 163.92 11.2 05 splitmodeenthald 34.33 17.6 162.17 12.2 06 splitmodeinbigfun 31.83 23.6 468.13 -153.5

07 02and 05 33.95 18.6 163.20 11.6 08 03and 06 31.75 23.8 446.82 -141.9 09 02,04and05 35.81 14.1 174.45 5.5 10 02,05and06 35.49 14.8 533.23 -188.7 11 02,03,04and05 38.50 7.6 184.45 0.13 12 02,04,05and06 30.92 25.8 408.88 -121.4 13 splitmodealltheabovepro edures 32.67 21.6 443.56 -140.2

Table5.1: Memoryandtimeperforman eforuns2d.

The rst four experiments02 - 05 of Table5.1 report gainboth in time and memory, remindingusofthe asewhere

tape < snp

(Figure8). Thisisindeedwhatweobservewhen wemeasurethea tualsizesoftapeandsnapshotforthepro eduresinquestion. Therefore, whenea hof algra, al l,quaindorenthaldaresplittheprogramsavesmemoryfor thesnapshotwithoutusing as mu h for thetape. Atthe sametimeit savestime be ause thepro edureisnotexe utedtwi e.

Experiment06 exhibits againin time at the ost of alargermemory use. As we sus-pe tedfromthesimulationsonFigure7,this orrespondstothe asewhere

snp < tape

. This onrmstheintuitionthat he kpointingisreallyworthwhileonlargese tionsofprogram. Inthissituation he kpointingisreally atime/memorytrade-o. Therefore he kpointing inbigfun (in otherwordsthejointmode)isawise hoi e whenmemorysize islimited.

Experiments07- 13 anbeseparatedin twosets: whether inbigfun is he kpointed (08,10,12and 13)ornot(07,09and11). Theseparation riterionunderlinestherelative weightofthesubroutineinbigfun .

Experiments07, 09and 11showsaremarkablebehavioron theexe utiontime perfor-man e. Wewouldexpe t theexe utiontime savingsof ombinedsplitmodepro eduresto a umulate,asweobservedin Figures 7and 8. Surprisingly, theexe ution timefor these experiments do not behave like that. In parti ular, the experiment 11's exe ution time saving(3.18s)is smallerthan theexe utiontimesavings(4.03s,4.15s,5.03sand 7.36s)for anyofthepro edures splitindividually. Wehaveat presentno learunderstanding ofthis

(19)

behavior.Itislikelythatthepresentmodelwehaveabouttheperforman esof he kpointed reverseprograms,isstillinsu ientto apturethisbehavior,andmustberenedfurther.

Asfor on retere ommendationsforthisexample,weadvise toapply splitmode spar-ingly,onlyononeortwoofsubroutines algra, al l,orquaindinthe asewherethere are stri t memory onstraints. This allowsfor memory savingsup to 12%. On the other hand,ifmemoryisnotanissueandspeedis,were ommendthe ongurationofexperiment 12.

5.2 Experiment II: SONICBOOM

soni boomisapartof aCFDsolverwhi h omputesthe residualof astate equation. It has14.263lo , but only818lo to bedierentiated, generating 2.987 lo of derivative pro edures.

GRADNOD

FLUROE

VCURVM

TRANSPIRATION

CONDDIRFLUX

PSIROE

Figure10: soni boom allgraph.

The rst groupof experiments 02 - 04 from Table 5.2, showsgains in exe ution time, be ause the pro edures are exe uted only on e. There is no gain in memory be ausethe sizeofthesnapshotandthetapearevery lose.

Theexperimentswheregradnodisamongthesplitsubroutinesexhibitthelargestgain in exe utiontime. This is relatedto thefa t thatgradnod a ountsforthe largestpart ofthe omputation, andsin ethetapesizegrowslikethenumberofexe utedinstru tions,

tape(

gradnod

)

is mu h largerthan

snp(

gradnod

)

. Forthe other pro edures in this

ex-perimentwealsohave

tape < snp

, butto asmallerextent. Therefore,everythingbehaves likeinthe lassi al aseofFigure7. Inparti ular,thereisnopro edureforwhi hthesplit modewould giveagainin amemory onsumption.

Itisworthnoti ingthattheee tofthesplitmodeisreallyanin reaseinmemory traf- ratherthanin memorypeak size. Forexamplesplitting onddirflux ertainlyresults

(20)

Experiment Time Memory Id Des ription Total[s℄ % gain Peak[Mb℄ % gain

02 splitmodev urnvm 0.2725 6.0 10.84 0.0

03 splitmode onddirflux 0.2699 6.9 10.84 0.0

04 splitmodefluroe 0.2500 13.8 11.06 -2.0

05 splitmodegradnod 0.2374 18.1 18.77 -73.1

06 02and 03 0.2624 9.5 10.84 0.0

07 04and 05 0.2374 18.1 19.00 -75.2

08 02,03and04 0.2475 14.7 11.08 -2.2

09 02,03and05 0.2360 18.6 18.77 -73.1

10 splitmodealltheabovepro edures 0.2374 18.1 19.00 -75.2

Table5.2: Memoryandtimeperforman eforsoni boom.

in a higher memory tra , but the lo al in rease of the lo al memory peak is hidden by themain memory peak whi ho urs just after

−

−−−−−−

→

gradnod. We are urrently arryingnew experimentsanddevelopingrenedmodelsthatin ludethis memorytra .

Pra ti allyforthisexperiment,ouradvi ewouldbetorunsubroutinesfluroe,v urvm and onddirflux(experiment08)insplitmodeinany ase,andthisalreadygivesa14.7% improvementintimeat virtuallyno ostin memory. Inthe asewherememorysizeis not limited strongly, then it is advisable to run gradnod in split mode too, whi h givesan additionalgainintimeatthe ostofalargein reaseinmemorypeak.

5.3 Experiment III: STICS

sti sisanagronomymodelingprogram. Ithas21.010lo ,andthereversedierentiated ode generated has 46.921lo . Inthe ode of sti s, weintrodu e three levels of nested loops around subroutine onebigloop be ause this ode simulates and unsteady pro ess over400timesteps. Thesenestedloopsareamanualmodi ationthatallowustoperform he kpointingonvariousgroupsoftimesteps. Wea knowledgethatthissimplisti method isfarfromtheknownoptimalstrategyrstdes ribedin[5℄.

Forthisexperiment,thedefault(Split-All)strategyappliedbytapenadegaveverybad resultsin time, with aslowdown fa torof about100from theoriginal ode to thereverse dierentiated ode. Wemadesomemeasurementsofthetapesizes omparedto the snap-shot sizes, and we found out that tape was mu h smaller than snapshot for subroutines densira , roira and onebigloop. This is a spe ial ase of the situation of Figure 8 andis ree tedon theexperimental guresof Table5.3. Wesee that split modeon these threepro eduresgainexe utiontimeatnomemory ost. Combinedsplitmodeonthethree

(21)

ONEBIGLOOP

BIGFUNCTION

BIOMAER

CROIRA

TRANSPI

MINERAL

LIXIV

INICLIM

LECSTAT

RECUP

BILAN

PROFIL

INITIAL

DENSIRAC

Figure11: sti s allgraph.

Experiment Time Memory

Id Des ription Total[s℄ % gain Peak[Mb℄ % gain

02 splitmodebiomaer 36.15 6.3 229.23 0.0 03 splitmodemineral 35.78 7.2 229.28 0.0 04 splitmodedensira 30.02 22.1 229.23 0.0 05 splitmode roira 24.45 36.6 229.23 0.0 06 splitmodeonebigloop 23.75 38.4 229.75 -0.2

07 04and05 16.79 56.5 229.23 0.0

08 04and06 15.64 59.4 229.75 -0.2

08 05and06 11.71 69.6 206.81 9.8

09 04,05and06 3.93 89.8 149.11 34.9

09 03,04,05and06 3.92 89.8 149.11 34.9 09 splitalltheabovepro edures 3.90 89.9 149.11 34.9

Table5.3: Memoryandtimeperforman eforsti s.

(22)

Theenormousgaininexe utiontimemakesthedierentiated/originalratiogodownto about7, whi h is what AD toolsgenerally laim. In thesti s experiment, theexe ution timeoftheSplit-Allversiondidnot omefromthedupli ateexe utionsdueto he kpointing butratherfromthetimeneededtoPUSHandPOPtheseverylargesnapshots. Thissuggests thata ompletemodeltostudyoptimal he kpointingstrategiesshould denitelytakeinto a ountthetimespentfortapeandsnapshotsoperations.

Pra ti ally,inthesti sexamplethereisnodoubtdensira , roiraandonebigloop should bedierentiated in split mode. Inaddition, one andierentiate additional pro e-duresinsplit mode,(e.g. mineral),buttheadditionalexe utiontimegainismarginal.

6 CONCLUSION,RELATED WORKS,FUTUREWORK

Thispaperisa ontribution towardstheultimategoalof optimallypla ing he kpointsin adjoint odes built by reverse mode Automati Dierentiation. We started from the ob-servation that the strategy onsisting in he kpointingea h and everypro edure allis in general,althoughsafefrom thememorypointof view,farfrom optimal. Both simulations onverysmall examples,andreal experimentsonreal-lifeprogramsshowthat some pro e-duresshouldneverbe he kpointed,andthatothersmaybe he kpointeddependingonthe availablememory. Thegreatvarietyofpossiblesituationsmakestheobje tiveofautomati sele tionof he kpointing sites very distant. It seemstherefore reasonableto let theuser drivethis hoi ethroughanadapteduserinterfa e. Wedis ussedthedevelopmentsthatwe madeintotheAD tooltapenadeto addthisfun tionality. This newfun tionallyallowed usto ondu textensiveexperimentsonreal odes,thatjustiedaposterioriourhypotheses onthisoptimal he kpointingproblemandsuggesttherelevant riteriaforafuturehelping tool namely, for ea h pro edure, its exe ution time, its tape and snapshot sizes, and the timerequiredbytapePUSHandPOPtra .

Relatedworksonoptimal he kpointinghavebeen ondu ted mostlyonthemodel ase ofloopsofxed-sizeiterations. Onlyin theparti ularsub- asewhere thenumberof itera-tionsin knowninadvan ewasanoptimals hemefoundmathemati ally[5℄. Thisgaverise to the treeverse/revolve [6℄ tool for an automati appli ation of this s heme. In the asewherethenumberofiterationsisnotknowninadvan e,averyinterestingsub-optimal s heme wasproposed in [13℄. We arenot aware of optimal he kpointing s hemes forthe aseof anarbitrary all-treeor all graph. Noti e that he kpointingis nottheonly way toimprovethe performan eof thereversemode ofAD.Lo al optimization anredu ethe omputation ostofthederivativesbyre-orderingthesub-expressionsinsidederivatives[8℄. Other optimizations implement a ne-grain time/memory trade-o by storing expensive sub-expressionsthato urseveraltimesinthederivatives. Inany asethesearelo al opti-mizationsthatonlygiveaxedsmallbenet. Forlargeprograms,onlynested he kpointing anmakereversedierentiated odesa tuallyrunwithoutex eedingthememory apa ity of the ma hine, and therefore the study of optimal he kpointing s hemes is an absolute

(23)

ne essity.

User-drivenpla ementof he kpointingisanimportantstepinthisdire tion,butfurther work is needed to help this pla ement or to propose a good enough automati strategy. This ould be based on exe ution time proling of the original program or even of the dierentiated ode itself. In any ase, we need to study the experimental gures found and to rene themodel we havebuiltfor the performan eof reversedierentiated odes. In parti ular this model must better take into a ount some of the surprising ee ts we have found, su h astime gains that do not add up. This suggests a pro ess of iterative improvementsofthereversedierentiated odes,basedonpreviousruns,mu hlikewhatis donein iterative ompilation[15℄.

Referen es

[1℄ AhoA.,SethiR.,andUllmanJ.Compilers: Prin iples,Te hniquesandTools. Addison-Wesley,1986.

[2℄ Corliss,G., Faure,Ch., Griewank,A., Has oët,L., andNaumann,U. Automati Dif-ferentiationofAlgorithms,fromSimulationtoOptimization.Springer,Sele tedpapers from AD2000,2001.

[3℄ Giering,R.: Tangentlinearandadjointmodel ompiler,usermanual.Te hni alreport, [wwwhttp://www.autodi. om/tam ℄,1997.

[4℄ F. Courty,A. Dervieux, B.Koobus, L.Has oët.Reverseautomati dierentiationfor optimum design: from adjoint state assembly to gradient omputation. Optimization MethodsandSoftware,Vol.18(5),615627,2003.

[5℄ A. Griewank. A hievinglogarithmi growthof temporal andspatial omplexity in re-verseautomati dierentiation.OptimizationMethods andSoftware.Vol.1(1), 3554, 1992.

[6℄ A. Griewankand A.Walther. Algorithm799: Revolve: An Implementationof Che k-pointfortheReverseorAdjointModeofComputationalDierentiation. ACMTrans. Math. Software.26(1).1999.

[7℄ A.Griewank.EvaluatingDerivatives: Prin iplesandTe hniquesof Algorithmi Dier-entiation. FrontiersinAppl.Math. SIAM,2000.

[8℄ A. Griewank and U. Naumann. A umulating Ja obians as Chained Sparse Matrix Produ ts.Math. Prog.,3(95),555571,Springer,2003.

[9℄ Has oët,L.,Naumann,U.,Pas ual,V.TBRAnalysisinReverse-ModeAutomati Dif-ferentiation. Preprint ANL-MCS/P936-0202, Argonne National Laboratory, also Re-sear hReport#RR-4856,INRIA,2002.

(24)

[10℄ L.Has oët,V. Pas ual.tapenade2.1 User'sguide.Te hni al Report.#RT-0300. IN-RIA,2004.

[11℄ L. Has oët, M. Araya-Polo. The Adjoint Data-Flow Analyses: Formalization, Prop-erties, and Appli ations.Automati Dierentiation: Appli ations, Theory, and Tools. Le ture NotesinComputationalS ien eandEngineering,135146,Springer,2005.

[12℄ L.Has oët,B.Dauvergne. TheData-FlowEquationsofChe kpointingin reverse Au-tomati Dierentiation,a eptedpaperatICCS2006,UniversityofReading,UK,May 28-31, 2006.

[13℄ M. Hinze, J. Sternberg. A-Revolve: an adaptive memory and run-time-redu ed pro- edure for al ulating adjoints; with anappli ation to theinstationary Navier-Stokes system.Optim. MethodsSoftw.,20,645663,2005.

[14℄ A.Jameson.Aerodynami designvia ontroltheory.SIAMJournalonS ienti Com-puting,Vol3(3), 233261,1998.

[15℄ T. Kisuki, P.M.W. Knijnenburg, M.F.P.O'Boyle,H.A.G Wijsho. Iterative Compila-tionin ProgramOptimization,InPro .CPC2000,pages35-44,2000.

(25)