HAL Id: inria-00079223
https://hal.inria.fr/inria-00079223v2
Submitted on 12 Jun 2006
Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Differentiation
Laurent Hascoet, Mauricio Araya-Polo
To cite this version:
Laurent Hascoet, Mauricio Araya-Polo. Enabling user-driven Checkpointing strategies in
Reverse-mode Automatic Differentiation. [Research Report] RR-5930, INRIA. 2006, pp. 21. inria-00079223v2
Rapport de recherche (research report). ISSN 0249-6399, ISRN INRIA/RR--5930--FR+ENG
Thème NUM
Enabling user-driven Checkpointing strategies in
Reverse-mode Automatic Differentiation
Laurent Hascoët — Mauricio Araya-Polo
N° 5930
Thème NUM: Systèmes numériques (numerical systems). Projet Tropics.
Research report no. 5930, May 2006, 21 pages.
Abstract: This paper presents a new functionality of the Automatic Differentiation (AD) tool tapenade. tapenade generates adjoint codes, which are widely used for optimization or inverse problems. Unfortunately, for large applications the adjoint code demands a great deal of memory, because it needs to store a large set of intermediate values. To cope with that problem, tapenade implements a sub-optimal version of a technique called checkpointing, which is a trade-off between storage and recomputation. Our long-term goal is to provide an optimal checkpointing strategy for every code, which no AD tool has achieved yet. Towards that goal, we first introduce modifications in tapenade in order to give users the choice of the checkpointing strategy most suitable for their code. Second, we conduct experiments on real-size scientific codes in order to gather hints that help us deduce an optimal checkpointing strategy. Some of the experimental results show memory savings of up to 35% and execution time savings of up to 90%.
Résumé (translated from the French): We present a new functionality of the Automatic Differentiation (AD) tool tapenade. The reverse mode of AD builds adjoint codes, which are widely used in scientific computing for optimization or inverse problems. Although remarkably efficient in principle, the reverse mode suffers from the very large memory consumption required to keep intermediate values of the original program. Checkpointing is a storage/recomputation trade-off that reduces this consumption. Our goal is to make the use of checkpointing in tapenade more flexible, in particular through user directives. Our long-term goal is to develop optimal semi-automatic strategies for applying checkpointing. In this report, we present the modifications made to tapenade to make checkpointing flexible, then we study and compare several checkpointing strategies on several real applications coming from industrial uses of tapenade, with the aim of identifying general heuristics. Some experiments show improvements of about 35% in memory and about 90% in execution time.
1 INTRODUCTION
The context of this work is Automatic Differentiation (AD) [2, 7]. The reverse mode of AD is a promising way to build adjoint codes to compute gradients. The fundamental advantage of adjoint codes is that they compute gradients at a cost which is independent of the dimension of the input space, and they are thus a key ingredient to solve inverse problems and optimization problems [14, 4]. AD adjoint codes are fundamentally made of two successive sweeps, a forward sweep running the original code and storing a significant part of the intermediate values, and a backward sweep using these values to compute the derivatives. For large applications, such as CFD programs, reverse differentiated codes may end up using far too much memory.
Checkpointing is a standard time/memory trade-off tactic to reduce the peak of this memory use. When a segment of the program is checkpointed, it is executed without storage of the intermediate values. Later on, when the backward sweep reaches the checkpointed segment, this segment must be executed a second time with storage, and finally the backward sweep may resume. Checkpointing has a benefit: there are two places where the memory size reaches a peak, namely at the end of the forward sweep and at the end of the checkpointed segment, and both peaks are generally smaller than the peak without checkpointing. On the other hand, checkpointing has a cost: (1) in execution time, because the segment is executed twice, and (2) in memory, because the values needed to run the segment twice must be stored. Hopefully this last memory cost is less than the memory benefit above.
In AD tools, checkpointing is applied systematically, for instance at procedure calls or around loop bodies. Experience shows that checkpointing every procedure call is in general sub-optimal. Optimal strategies have been found only for the case of a fixed-length loop [5], and not for the nested procedure structure of real-life codes.
Towards the ultimate goal of an AD tool embedding an optimal checkpointing strategy for all programs, we propose as a first step to activate checkpointing for only a number of user-selected procedure calls. Therefore, in addition to the default systematic checkpoint mode (called joint mode in [7]), each procedure may now be differentiated in the so-called split mode, i.e. without checkpointing. In split mode, the procedure gives rise to two separate differentiated procedures, one for the forward sweep and one for the backward sweep.
This paper presents the implementation of this new split mode functionality inside our AD tool tapenade [10], which up to now only featured the joint mode. We also discuss the necessary adaptation of the existing preliminary data-flow analyses, namely adjoint-liveness analysis [11] and TBR analysis [9, 11]. In a second step, we use this user control on checkpointing to make experimental measurements of various checkpointing choices on several large scientific codes. We present the results of these experiments, some of which show memory savings of up to 35% and execution time savings of up to 90%. Also, these results give hints towards a general automatic strategy of where to use checkpointing. At present, no AD tool has such a general checkpointing strategy, and our long-term goal is to provide one in tapenade.
The remainder of this paper is structured as follows: Section 2 introduces the reverse mode of AD. In Section 3 we present the checkpointing technique and show how different checkpoint placement strategies affect the behavior of the reverse differentiated code. In Section 4 we discuss the implementation issues. In Section 5 we present and discuss the experimental measurements. Finally, we discuss future work and conclusions in Section 6.
2 REVERSE AUTOMATIC DIFFERENTIATION
In our context, AD is a program transformation technique to obtain derivatives, and in particular gradients. We are given a program $P$ that evaluates a function $F$. Program $P$ can be seen as a sequential list of instructions $I_j$:

$$P = I_1; I_2; \ldots; I_j; \ldots; I_{p-1}; I_p,$$

where the instructions represent elementary functions $f_i$. Then the function $F$ is indeed

$$F = f_p \circ f_{p-1} \circ \ldots \circ f_j \circ \ldots \circ f_2 \circ f_1.$$

AD takes advantage of this to apply the chain rule of calculus to build a new program that evaluates the derivatives of $F$.
The reverse mode of AD computes gradients. Roughly speaking, for a given scalar output, it returns the direction in the input space that maximizes the increase of this output. Strictly speaking, the gradient is defined only for scalar output functions. Therefore, we build a vector $\overline{Y}$ that defines the weights of each component of the original output $Y = F(X)$. This defines a scalar output $Y^t \times \overline{Y} = F^t(X) \times \overline{Y}$. Its gradient has thus the following form:

$$\overline{X} = F'^t(X) \times \overline{Y} = f_1'^t(x_0) \times \ldots \times f_{j+1}'^t(x_j) \times \ldots \times f_p'^t(x_{p-1}) \times \overline{Y} \qquad (1)$$

where $x_{i-1}$ is the set of all variable values just before execution of the instruction that implements $f_i'^t$, and $F'^t(X)$ is the transposed Jacobian.
Formula (1) is implemented from right to left, because matrix $\times$ vector products are cheaper to compute than matrix $\times$ matrix products. This probably results in the most efficient way to compute a gradient. Unfortunately, this mode of AD has a difficulty: the $f_i'^t$ instructions require the intermediate values $x_{i-1}$ in the reverse of their creation order. The trouble is that programs often overwrite variables, and therefore these values may be lost when needed by the $f_i'^t$.

There are two main strategies to cope with this problem: Recompute-All [3] or Store-All [7]. The Recompute-All strategy is very demanding in execution time, quadratic with respect to the number of run-time instructions, because it recomputes the intermediate values every time they are required, from a saved initial point. On the other hand, the Store-All strategy is linear with respect to the number of run-time instructions, both for memory consumption and execution time, because it consists in storing on a stack all values required later by derivatives, and then restoring them when they are needed. This results in the structure of reverse differentiated programs shown on Figure 1.
[Figure 1 (diagram): the forward sweep starts from $x_0$, executes $x_j = f_j(x_{j-1})$ up to $x_{p-1} = f_{p-1}(x_{p-2})$ and stores values on the stack; the backward sweep restores these values and executes $\overline{x}_{p-1} = f_p'^t(x_{p-1}) \times \overline{y}$, then $\overline{x}_j = f_{j+1}'^t(x_j) \times \overline{x}_{j+1}$, down to $\overline{x}_0 = f_1'^t(x_0) \times \overline{x}_1$. Axes: TIME and MEMORY (stack); the memory peak is labelled $m_{peak}$.]
Figure 1: The horizontal axis represents the amount of values currently on the stack.
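The Store-All structure of Figure 1 can be illustrated on a toy case. The following is a minimal Python sketch, not tapenade output: it assumes a chain of scalar elementary functions, each given with its derivative, and the names `reverse_store_all`, `steps` and `peak` are hypothetical. The forward sweep pushes every value before it is overwritten, and the backward sweep pops them in reverse order while applying the derivatives from right to left, as in formula (1).

```python
import math

def reverse_store_all(steps, x0, ybar=1.0):
    """Gradient of a chain of scalar elementary functions, Store-All style.

    steps is a list of (f, df) pairs, df being the derivative of f.
    Returns (y, xbar, peak), where peak is the largest stack size reached.
    """
    stack = []
    x = x0
    # Forward sweep: run the original code, PUSHing each value
    # just before the next instruction overwrites it.
    for f, _ in steps:
        stack.append(x)
        x = f(x)
    peak = len(stack)  # the stack is fullest at the end of the forward sweep
    # Backward sweep: POP the values in reverse order and apply
    # the (here scalar) transposed derivatives from right to left.
    xbar = ybar
    for _, df in reversed(steps):
        xbar = df(stack.pop()) * xbar
    return x, xbar, peak

# F(x) = sin(x)**2, whose derivative is 2*sin(x)*cos(x) = sin(2*x).
steps = [(math.sin, math.cos), (lambda v: v * v, lambda v: 2.0 * v)]
y, xbar, peak = reverse_store_all(steps, 0.5)
print(y, xbar, peak)
```

In this sketch the stack grows by one value per executed instruction, which is the linear memory cost mentioned above.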
Because we will need to reason formally about adjoint programs in the sequel of this paper, we need to denote them in a more algebraic way. The reverse differentiated program $\overline{P}$ has two parts. The first is called the forward sweep $\overrightarrow{P}$, and is basically the necessary "slice" of the original program $P$ plus some instructions to store required values. The second part is called the backward sweep $\overleftarrow{P}$, and consists of the instructions that implement the functions $f_i'^t(x)$ from Formula (1), plus some instructions to recover the needed intermediate values.

Formalizing the structure of the program in Figure 1, the structure of the reverse differentiated program $\overline{P}$ of a program $P$ is roughly described by equation (2):

$$\overline{P} = \overrightarrow{P}; \overleftarrow{P} = I_1; \ldots; I_{p-1}; \overleftarrow{I_p}; \ldots; \overleftarrow{I_1} \qquad (2)$$

Figure 2 shows the reverse differentiated version of a small example procedure, featuring the forward and backward sweeps. The PUSH() and POP() calls store and restore values of required intermediate variables. We can now refine formula (2) by inserting these calls. For any instruction $I$ and any program tail $D$ after $I$, the program $\overline{P}$ is defined recursively by the following equation:

$$\overline{P} = \overline{I; D} = \overrightarrow{I}; \overline{D}; \overleftarrow{I} = \mathrm{PUSH}(\mathrm{out}(I)); I; \overline{D}; \mathrm{POP}(\mathrm{out}(I)); I' \qquad (3)$$

where $\mathrm{out}(I)$ is the set of values overwritten by instruction $I$. In reality, we store only the intermediate values which are required to compute the derivatives of $I$ and of its preceding instructions. The data-flow equations of the static analysis that evaluates these values "To Be Recorded", known as the "TBR" analysis, were given in [11].

Original procedure and reverse differentiated procedure:
    subroutine sub1(x,y,z)
      tmp1 = SIN(y)                 ! I1
      y = y * y                     ! I2
      tmp1 = tmp1 * x               ! I3
      z = y / tmp1                  ! I4
    end

    subroutine sub1_b(x,xb,y,yb,z,zb)
      tmp1 = SIN(y)                 ! I1
      PUSH(y)                       ! save y before I2 overwrites it
      y = y * y                     ! I2
      PUSH(tmp1)                    ! save tmp1 before I3 overwrites it
      tmp1 = tmp1 * x               ! I3
      ! forward sweep ends, backward sweep begins
      yb = zb/tmp1                  ! I'4
      tmp1b = -(y*zb/tmp1**2)
      POP(tmp1)                     ! restore tmp1 as it was before I3
      xb = tmp1 * tmp1b             ! I'3
      tmp1b = x * tmp1b
      POP(y)                        ! restore y as it was before I2
      yb = 2 * y * yb               ! I'2
      yb = yb + COS(y) * tmp1b      ! I'1: accumulate, so the I'2 term is kept
    end

3 CHECKPOINTING
To control the memory problem caused by the storage of intermediate values, the Store-All strategy can be improved in two main directions: (1) refine the data-flow analyses in order to reduce the number of values to store, and (2) deactivate the Store-All strategy for chosen segments of the code, therefore saving memory space. The former is described in [11, 12]; the latter is the focus of this work.
The mechanism which deactivates the Store-All strategy for certain chosen segments is called checkpointing. It has two consequences on the behavior of the reverse differentiated program:
1. when the backward sweep reaches the chosen segment, the segment must be executed again, this time with the Store-All strategy turned on.
2. in order to execute the segment twice, a sufficient set of values (called a snapshot) must be stored before the first execution of the segment.
On Figure 3, we assume that $\mathrm{snapshot}(C) < \mathrm{tape}(C)$. This is a reasonable assumption in most cases, and particularly when $C$ is large. As a consequence we see that $m^c_{peak}$ is smaller than $m_{peak}$, because in the checkpointed case the first execution of segment $C$ does not store anything. Conversely, we see that the time $t_c$ is longer than $t$, because in the no-checkpointing case every piece of the code is executed only once, whereas in the checkpointing case segment $C$ is executed twice ($C$ and $\overrightarrow{C}$).

Checkpointed segments can be chosen in different ways, and can be nested. One classical strategy is to checkpoint each and every procedure call. However, experience indicates that this strategy is not optimal, though the optimal situation is not easy to foresee. Since the optimal checkpointing strategy is still out of reach, it seems natural to let the user influence the choice. A completely user-driven checkpointing allows the user to try each and every combination, looking for an optimal placement of checkpoints. This paper describes the developments made to achieve this user interaction. In a second step, this will let us experiment with rules and tactics, towards the long-term goal of computer-aided optimal checkpointing. This paper presents our first experiments in this direction.
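The two consequences of checkpointing listed in this section can be made concrete on a toy chain of scalar functions. This is a minimal Python sketch under stated assumptions, not the tapenade implementation: the program is `steps[:start]` (upstream), `steps[start:stop]` (the checkpointed segment) and `steps[stop:]` (downstream), the snapshot is the single value entering the segment, and the names `reverse_with_checkpoint` and `peaks` are hypothetical.

```python
import math

def reverse_with_checkpoint(steps, x0, start, stop, ybar=1.0):
    """Reverse mode on a chain of (f, df) pairs, with steps[start:stop] checkpointed.

    Returns (y, xbar, peaks), where peaks holds the stack size at the end of
    the forward sweep and at the end of the re-executed checkpointed segment.
    """
    stack, peaks = [], []
    x = x0
    for f, _ in steps[:start]:          # forward sweep of the upstream part
        stack.append(x)
        x = f(x)
    stack.append(x)                     # PUSH the snapshot of the segment
    for f, _ in steps[start:stop]:      # first run of the segment: no storage
        x = f(x)
    for f, _ in steps[stop:]:           # forward sweep of the downstream part
        stack.append(x)
        x = f(x)
    y = x
    peaks.append(len(stack))            # first peak: end of the forward sweep
    xbar = ybar
    for _, df in reversed(steps[stop:]):        # backward sweep, downstream
        xbar = df(stack.pop()) * xbar
    x = stack.pop()                     # POP the snapshot
    for f, _ in steps[start:stop]:      # second run of the segment, with storage
        stack.append(x)
        x = f(x)
    peaks.append(len(stack))            # second peak: end of the segment
    for _, df in reversed(steps[start:stop]):   # backward sweep of the segment
        xbar = df(stack.pop()) * xbar
    for _, df in reversed(steps[:start]):       # backward sweep, upstream
        xbar = df(stack.pop()) * xbar
    return y, xbar, peaks

steps = [(math.sin, math.cos)] * 10
y, xbar, peaks = reverse_with_checkpoint(steps, 0.3, start=2, stop=8)
print(y, xbar, peaks)
```

With ten steps and six of them checkpointed, both recorded peaks stay below the ten values that a plain Store-All sweep would hold, at the price of running the segment twice.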
The assumption behind checkpointing is that $\mathrm{snapshot}(C) < \mathrm{tape}(C)$. To keep the snapshots small, we need to develop the algebraic notation of equation (3). When segment $C$ is checkpointed (denoted with surrounding parentheses), reverse differentiation of the program $P = U; C; D$ is defined by the recursive equation

$$\overline{P} = \overline{U; (C); D} = \overrightarrow{U}; \mathrm{PUSH}(\mathrm{snp}(C)); C; \overline{D}; \mathrm{POP}(\mathrm{snp}(C)); \overline{C}; \overleftarrow{U} \qquad (4)$$

where $U$/$D$ are the code segments Upstream/Downstream of $C$ and $\mathrm{snp}(C)$ is the snapshot stored to re-execute $C$.

[Figure 3 (diagram): stack memory against time, without checkpointing (peak $m_{peak}$, total time $t$) and with segment $C$ checkpointed (peak $m^c_{peak}$, total time $t_c$). Legend: $C$, original code; $\overrightarrow{C}$, forward sweep of $C$, storing tape($C$); $\overleftarrow{C}$, backward sweep of $C$, restoring tape($C$); store snapshot($C$); restore snapshot($C$).]
Figure 3: Checkpointing in Reverse Mode AD.

Intuitively, if a variable is not modified by $C$ nor by $\overline{D}$, then its value will be unmodified when $C$ is run again and it is not necessary to store it. We shall denote by $\mathrm{out}(X)$ the set of variables overwritten by the code segment $X$. Also, only the variables that are going to be used by $C$ need to be in the snapshot. More precisely, only the variables that are used by $\overline{C}$ need to be stored, and this set is often smaller than the set of variables used by $C$. We shall call it $\mathrm{live}(\overline{C})$, and it is determined by the so-called adjoint liveness analysis. Therefore a good enough conservative definition of the snapshot is:

$$\mathrm{snp}(C) = \mathrm{live}(\overline{C}) \cap (\mathrm{out}(C) \cup \mathrm{out}(\overline{D})) \qquad (5)$$

The data-flow equations of adjoint liveness analysis were defined formally in [11]. Snapshots can be refined further, taking into account the interactions between successive or nested checkpointed segments. A study on minimal snapshots can be found in [12].
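Equation (5) is a plain set computation once the analyses have produced their results. The sketch below, in Python, assumes these results are already available as sets of variable names; the function `snapshot` and the variable names are hypothetical and only serve the illustration.

```python
def snapshot(live_c, out_c, out_d):
    """Conservative snapshot of equation (5).

    live_c: variables given by adjoint liveness analysis for the segment
    out_c:  variables overwritten by the checkpointed segment
    out_d:  variables overwritten by what runs between the two executions
    """
    return set(live_c) & (set(out_c) | set(out_d))

# Hypothetical analysis results for a small segment.
live_c = {"x", "y", "n"}     # needed to re-execute the segment
out_c = {"y", "tmp"}         # overwritten by the segment itself
out_d = {"x", "z"}           # overwritten downstream

snp = snapshot(live_c, out_c, out_d)
print(sorted(snp))           # → ['x', 'y']
```

Here `n` is needed but never overwritten, so its value survives without storage, while `tmp` and `z` are overwritten but not needed to run the segment again.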
Let us now focus on the checkpoint placement problem. In tapenade, as in many other AD tools, the natural checkpointed segment is the procedure call. Therefore in the sequel we shall experiment with various placements of checkpoints, all around procedure calls, and therefore shown on call trees. This hypothesis is by no means restrictive and our conclusions can be extended to arbitrary cleanly nested code segments. Figure 4 shows (on the left) the call graph of an original program, and the corresponding reverse-differentiated call graph, using the Joint-All mode, where all procedure calls are checkpointed. This Joint-All mode is naturally the basic mode, being the extreme trade-off that consumes time and saves memory. Memory resources are finite, whereas execution time resources are not. Therefore this choice is the safest, especially if we assume that snapshots are generally smaller than the corresponding tape.

[Figure 4 (diagram): call graph of the procedures A, B, C and D, and the corresponding reverse-differentiated call graph. Legend: $x$, original subroutine; $\overrightarrow{x}$, forward sweep for $x$; $\overleftarrow{x}$, backward sweep for $x$; take snapshot; use snapshot.]
Figure 4: Joint-All mode: Checkpointing all calls in Reverse Mode AD
Figure 5 shows the other extreme alternative, which checkpoints no procedure call. We call this alternative Split-All mode. In split mode the forward sweep and the backward sweep are implemented separately. There is no duplicate execution, so no snapshot is required and in theory the execution time is smallest. On the other hand the peak size of the tape is highest. Moreover, since the forward sweep and the backward sweep do not follow each other during execution, even the values of the local variables need to be stored, which requires even more intermediate values in the tape.
[Figure 5 (diagram): reverse-differentiated call graph made of the forward sweeps $\overrightarrow{A}$, $\overrightarrow{B}$, $\overrightarrow{C}$, $\overrightarrow{D}$ and the backward sweeps $\overleftarrow{A}$, $\overleftarrow{B}$, $\overleftarrow{C}$, $\overleftarrow{D}$.]
Figure 5: Split-All mode: no Checkpointing in Reverse Mode AD

Split-All and Joint-All modes are two extreme strategies. It is worth trying hybrid cases; we present a couple of them in Figure 6. The first strategy (hybrid1) implements the joint mode for all procedures except for $D$. Conversely, the second strategy (hybrid2) implements the split mode for all procedures except for procedure $D$, which is checkpointed.

[Figure 6 (diagram): reverse-differentiated call graphs of the two hybrid strategies, hybrid1 and hybrid2.]
Figure 6: Two hybrid approaches (split-joint)
In order to have a more precise idea of the aforementioned trade-off we shall simulate the performance of the four checkpointing strategies from Figures 4, 5, and 6, for two motivating scenarios, namely when "tape > snapshot" and when "tape < snapshot". We assume that all procedures require the same snapshot and tape size. Also, we assume that each procedure has the same execution time.

For the first scenario, we set the memory size of the snapshot to 6 and the memory size of the tape to 10. This setup corresponds to the usual assumption that the tape is bigger than the snapshot for procedures. Figure 7 shows the behavior of the four checkpointing strategies. As we expected, the curve that represents the joint configuration shows the smallest memory use but the largest execution time. Conversely, the curve that represents the split mode has the highest peak of memory use but the shortest execution time. Hybrid strategies range between these two extremes.

[Figure 7 (plot, "Joint v/s Split"): memory (ticks 0 to 40) and time (ticks 0 to 70) for the Joint-All, Split-All, hybrid1 and hybrid2 strategies.]
Figure 7: Numerical Simulation results, tape=10, snapshot=6
This scenario assumed that the tape is bigger than the snapshot. However, this assumption is not always valid. Therefore we make a second simulation where we assume that each tape costs 6 in memory, and each snapshot costs 10. Figure 8 shows that Joint-All and Split-All modes are not the extremes of the trade-off anymore. In fact, the extreme bounds in memory consumption correspond to the hybrid modes. We also notice that the maximum peak of memory use is smaller than in the first simulation, which is not surprising since it depends mostly on the tape size, which is assumed smaller. In this scenario, the advantage of checkpointing is less obvious because of the cost of snapshots, therefore the Split-All mode is nearly the best in every respect.
Real differentiated codes will have different tape, snapshot and execution time characteristics for every procedure, which makes this motivating simulation look a bit unrealistic. It gives us a feeling for the behavior of real codes, but experiments with real codes are mandatory. Before we get to that, we shall briefly discuss the necessary implementation step.
[Figure 8 (plot, "Joint v/s Split"): memory (ticks 0 to 25) and time (ticks 0 to 70) for the Joint-All, Split-All, hybrid1 and hybrid2 strategies.]
Figure 8: Numerical Simulation results, tape=6, snapshot=10
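A simulation of this kind can be sketched in a few lines of Python. This is not the simulator used for Figures 7 and 8, only a simplified cost model under stated assumptions: the call tree (A calls B and D, B calls C) is read off Figure 4, every run of a procedure body costs one time unit, the whole tape of a body is released at the end of its backward sweep, and the names `CALLS`, `Meter` and `simulate` are hypothetical.

```python
from dataclasses import dataclass

# Assumed call tree: A calls B and D, B calls C.
CALLS = {"A": ["B", "D"], "B": ["C"], "C": [], "D": []}

@dataclass
class Meter:
    tape: int        # tape size of one procedure body
    snapshot: int    # snapshot size of one procedure call
    mem: int = 0
    peak: int = 0
    time: int = 0

    def push(self, size):
        self.mem += size
        self.peak = max(self.peak, self.mem)

    def pop(self, size):
        self.mem -= size

def run_original(proc, m):
    """Plain execution of proc and its callees, without storage."""
    m.time += 1
    for callee in CALLS[proc]:
        run_original(callee, m)

def forward(proc, split, m):
    """Forward sweep of proc: tape its body, checkpoint the non-split calls."""
    m.push(m.tape)
    m.time += 1
    for callee in CALLS[proc]:
        if callee in split:
            forward(callee, split, m)
        else:
            m.push(m.snapshot)          # take snapshot, then run without tape
            run_original(callee, m)

def backward(proc, split, m):
    """Backward sweep of proc, re-running each checkpointed callee first."""
    for callee in reversed(CALLS[proc]):
        if callee in split:
            backward(callee, split, m)
        else:
            m.pop(m.snapshot)           # use snapshot
            forward(callee, split, m)
            backward(callee, split, m)
    m.pop(m.tape)
    m.time += 1

def simulate(split, tape, snapshot, root="A"):
    m = Meter(tape, snapshot)
    forward(root, split, m)
    backward(root, split, m)
    return m.peak, m.time

STRATEGIES = {
    "Joint-All": set(),
    "Split-All": {"B", "C", "D"},
    "hybrid1": {"D"},            # joint everywhere except D
    "hybrid2": {"B", "C"},       # split everywhere except D
}
for tape, snap in [(10, 6), (6, 10)]:
    for name, split in STRATEGIES.items():
        print(tape, snap, name, simulate(split, tape, snap))
```

Under these assumptions the model gives, for tape=6 and snapshot=10, its smallest memory peak with hybrid1 and its largest with hybrid2, while Split-All remains the fastest in both scenarios. The time scale is arbitrary and is not meant to match the axes of the figures.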
4 IMPLEMENTATION
We implemented the algorithms and data-flow analyses mentioned in the previous section inside the tapenade tool [10], which is a source-to-source AD engine. tapenade is written in Java, with some modules written in C. tapenade supports programs written in Fortran 77 and Fortran 90/95.
4.1 Modifications of the Data-Flow Analyses
The AD model that tapenade implements relies on several data-flow analyses, all of them formally defined in [9, 11]. However, these analyses implicitly made the assumption of the Joint-All strategy. The checkpointing strategy has a strong impact on the adjoint liveness and TBR analyses, which are interprocedural. More precisely, it impacts the way data-flow information is propagated on the call graph during the bottom-up and top-down analysis sweeps.
For example, since for a checkpointed segment the forward sweep is followed immediately by the backward sweep, we can use the fact that all original variables are useless at the end of the forward sweep. This is the foundation of the adjoint liveness analysis [11]. In the initial state of the AD tool, where every call is checkpointed, this allowed the "adjoint-live" set at the tail of each procedure to be the empty set. The adjoint-liveness analysis can then proceed, backwards inside the flow graph of the procedure, progressively accumulating variables into the set of live variables. In the new situation where a procedure can be left in split mode, the initial "adjoint-live" set at the tail of this procedure must change, and it depends on the live variables in each of its calling sites. More precisely, we shall set the live variables at the tail of a non-checkpointed procedure (i.e. in split mode) to the union of all the live variables just after each of the call sites for this procedure.
In order to implement this adaptation we have to run the adjoint liveness analysis twice. A first sweep runs bottom-up on the original program call graph. In this sweep we build the effect of each procedure on the set of live variables, to be used in each of its call sites. The second sweep is top-down and accumulates the sets of live variables after each call site, before the result is used as the initial set for the adjoint liveness analysis of every split procedure.
Similarly the TBR analysis had to be transformed. The TBR analysis runs forward, from the head to the tail of each procedure. At the outer level of the call graph, the analysis could run in only one bottom-up sweep. Because the TBR analysis now requires context information in the case of a non-checkpointed procedure, which will carry the union of the TBR status just before the call sites, we had to add a top-down sweep to the TBR analysis.
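The rule for the initial "adjoint-live" set of a procedure can be sketched as follows. This Python fragment is only an illustration of the union over call sites, assuming the live sets after each call site have already been computed by the top-down sweep; the function `tail_live_sets` and the procedure and variable names are hypothetical and do not reflect tapenade's internal data structures.

```python
def tail_live_sets(call_sites, split):
    """Initial adjoint-live set at the tail of each called procedure.

    call_sites: list of (callee, live_after) pairs, one per call site, where
                live_after is the set of variables live just after that call
    split:      names of the procedures differentiated in split mode
    """
    tail = {callee: set() for callee, _ in call_sites}
    for callee, live_after in call_sites:
        if callee in split:
            # Split mode: union of the live sets after every call site.
            tail[callee] |= set(live_after)
        # Joint mode: the backward sweep follows the forward sweep
        # immediately, so the set at the tail stays empty.
    return tail

call_sites = [
    ("P", {"u", "v"}),   # first call site of P
    ("P", {"v", "w"}),   # second call site of P
    ("Q", {"u"}),        # Q is left in joint mode
]
tail = tail_live_sets(call_sites, split={"P"})
print(sorted(tail["P"]), sorted(tail["Q"]))   # → ['u', 'v', 'w'] []
```

The same union-over-call-sites pattern applies to the context information that the TBR analysis needs just before the call sites of a split procedure.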
4.2 General Implementation Notes
Along with the modification of the analyses, the generation of the differentiated program must also be adapted. The AD model defined by equation (4) shows that the joint mode runs the backward sweep of $C$, $\overleftarrow{C}$, immediately after its forward sweep $\overrightarrow{C}$. When $C$ is a procedure, $\overrightarrow{C}$ and $\overleftarrow{C}$ can be easily merged into a single procedure $\overline{C}$. As a consequence, local variables of $C$ (and therefore of $\overrightarrow{C}$) are still in scope when $\overleftarrow{C}$ starts, and naturally preserve their values. This is no longer possible in split mode, since procedures $\overrightarrow{C}$ and $\overleftarrow{C}$ are separated. Consequently, local variables of $\overrightarrow{C}$ must be stored before they vanish and restored when $\overleftarrow{C}$ starts. This was addressed in the implementation by adding an extension to the TBR analysis. This extension looks for the local variables that are necessary for the backward sweep when the end of the forward sweep is reached. These variables are PUSH'ed just at the end of the forward sweep and POP'ed at the beginning of the backward sweep.

We make the choice of generalization versus specialization, by allowing for only one split mode per procedure. Even then, this requires care in naming the procedures. We need to create up to four names (original, forward sweep, backward sweep and reverse differentiated) when split and joint strategies are combined. This problem is technical, but it has implications for the whole way tapenade handles the names of differentiated elements.

The split strategy is driven by the user by means of a directive (C$AD NOCHECKPOINT) which is placed just before the procedure call, or through a command line option (-split "[list of procedure names]"). The introduction of directives is a novel feature for tapenade.
5 EXPERIMENTAL MEASUREMENTS
We applied the split mode to certain procedure calls, looking for experimental confirmation of the intuitions from Section 3. In particular, we want to show the interest of letting the user drive the checkpointing strategy.
The procedures chosen to be split were the ones that best illustrate the memory and run-time trade-off. The criteria to choose procedures rely on two values, which can be obtained by studying the reverse generated code. These values are the size of the snapshot and the size of the tape. The implementation of both snapshot and tape is based on PUSH calls, thus the measurements and comparisons between these values are straightforward.
In Figures 9 and 11, loops are denoted by square brackets. For instance, on Figure 9 we have two loops, one which spans from subroutine pasdtl to subroutine quaind, and a second one which includes all of inbigfunc's procedures. In general, these loops are the segments of the programs that consume most of the memory and time.
5.1 Experiment I: UNS2D
uns2d is a CFD solver. It has 2,055 lines of code (loc). The reverse differentiated version has 2,200 loc.
[Figure 9 (diagram): uns2d call graph, with procedures BIGFUNCTION, PASDTL, CALGRA, DIFFAR, FLW2D, SYMMT, QUAIND, ENTHALD, INBIGFUNC and CALCL; CALGRA and CALCL appear at several call sites.]
    Experiment                                 Time                 Memory
    Id  Description                            Total [s]  % gain    Peak [Mb]  % gain
    01  Joint-All strategy                     41.69                184.69
    02  split mode calcl (all call sites)      37.66       9.7      167.53        9.3
    03  split mode quaind                      37.54       9.9      162.13       12.2
    04  split mode calgra (all call sites)     36.63      12.1      163.92       11.2
    05  split mode enthald                     34.33      17.6      162.17       12.2
    06  split mode inbigfunc                   31.83      23.6      468.13     -153.5
    07  02 and 05                              33.95      18.6      163.20       11.6
    08  03 and 06                              31.75      23.8      446.82     -141.9
    09  02, 04 and 05                          35.81      14.1      174.45        5.5
    10  02, 05 and 06                          35.49      14.8      533.23     -188.7
    11  02, 03, 04 and 05                      38.50       7.6      184.45        0.13
    12  02, 04, 05 and 06                      30.92      25.8      408.88     -121.4
    13  split mode all the above procedures    32.67      21.6      443.56     -140.2
Table 5.1: Memory and time performance for uns2d.
The first four experiments 02 - 05 of Table 5.1 report gains both in time and memory, reminding us of the case where tape < snp (Figure 8). This is indeed what we observe when we measure the actual sizes of tape and snapshot for the procedures in question. Therefore, when each of calgra, calcl, quaind or enthald is split, the program saves the memory of the snapshot without using as much for the tape. At the same time it saves time because the procedure is not executed twice.

Experiment 06 exhibits a gain in time at the cost of a larger memory use. As we suspected from the simulations on Figure 7, this corresponds to the case where snp < tape. This confirms the intuition that checkpointing is really worthwhile on large sections of program. In this situation checkpointing is really a time/memory trade-off. Therefore checkpointing inbigfunc (in other words the joint mode) is a wise choice when memory size is limited.

Experiments 07 - 13 can be separated into two sets, according to whether inbigfunc is split (08, 10, 12 and 13) or checkpointed (07, 09 and 11). This separation criterion underlines the relative weight of the subroutine inbigfunc.

Experiments 07, 09 and 11 show a remarkable behavior of the execution time. We would expect the execution time savings of combined split mode procedures to accumulate, as we observed in Figures 7 and 8. Surprisingly, the execution times of these experiments do not behave like that. In particular, the execution time saving of experiment 11 (3.18s) is smaller than the execution time savings (4.03s, 4.15s, 5.03s and 7.36s) of any of the procedures split individually. We have at present no clear understanding of this behavior. It is likely that our present model of the performance of checkpointed reverse programs is still insufficient to capture this behavior, and must be refined further.

As for concrete recommendations for this example, we advise applying split mode sparingly, only on one or two of the subroutines calgra, calcl, or quaind, in the case where there are strict memory constraints. This allows for memory savings of up to 12%. On the other hand, if memory is not an issue and speed is, we recommend the configuration of experiment 12.
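The "% gain" columns of Table 5.1 and the non-additive time savings discussed above can be rechecked with a short computation. The Python sketch below copies a few rows of the table; the formula `gain` (relative difference to the Joint-All run, in percent) is an assumption about how the columns were derived, although it reproduces the published values for the rows shown.

```python
BASE_TIME, BASE_MEM = 41.69, 184.69      # experiment 01, Joint-All strategy

# (total time [s], peak memory [Mb]) copied from Table 5.1
RUNS = {
    "02": (37.66, 167.53),
    "03": (37.54, 162.13),
    "04": (36.63, 163.92),
    "05": (34.33, 162.17),
    "06": (31.83, 468.13),
    "11": (38.50, 184.45),   # 02, 03, 04 and 05 combined
}

def gain(base, value):
    """Relative gain in percent with respect to the Joint-All run."""
    return 100.0 * (base - value) / base

for exp_id, (seconds, megabytes) in RUNS.items():
    print(exp_id, round(gain(BASE_TIME, seconds), 1),
          round(gain(BASE_MEM, megabytes), 1))

# Time saved by each procedure split alone, and by the four combined.
single = [BASE_TIME - RUNS[i][0] for i in ("02", "03", "04", "05")]
combined = BASE_TIME - RUNS["11"][0]
print(combined < min(single))            # → True
```

The last line restates the observation of this section: splitting the four procedures together saves less time than splitting any one of them alone.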
5.2 Experiment II: SONICBOOM
sonicboom is a part of a CFD solver which computes the residual of a state equation. It has 14,263 loc, but only 818 loc to be differentiated, generating 2,987 loc of derivative procedures.
[Figure 10 (diagram): call graph with procedures PSIROE, GRADNOD, FLUROE, VCURVM, TRANSPIRATION and CONDDIRFLUX.]
Figure 10: sonicboom call graph.
The first group of experiments 02 - 04 from Table 5.2 shows gains in execution time, because the procedures are executed only once. There is no gain in memory because the sizes of the snapshot and the tape are very close.

The experiments where gradnod is among the split subroutines exhibit the largest gain in execution time. This is related to the fact that gradnod accounts for the largest part of the computation, and since the tape size grows like the number of executed instructions, tape(gradnod) is much larger than snp(gradnod). For the other procedures in this experiment we also have snp < tape, but to a smaller extent. Therefore, everything behaves like in the classical case of Figure 7. In particular, there is no procedure for which the split mode would give a gain in memory consumption.

    Experiment                                 Time                 Memory
    Id  Description                            Total [s]  % gain    Peak [Mb]  % gain
    01  Joint-All strategy                     0.2900               10.84
    02  split mode vcurvm                      0.2725      6.0      10.84        0.0
    03  split mode conddirflux                 0.2699      6.9      10.84        0.0
    04  split mode fluroe                      0.2500     13.8      11.06       -2.0
    05  split mode gradnod                     0.2374     18.1      18.77      -73.1
    06  02 and 03                              0.2624      9.5      10.84        0.0
    07  04 and 05                              0.2374     18.1      19.00      -75.2
    08  02, 03 and 04                          0.2475     14.7      11.08       -2.2
    09  02, 03 and 05                          0.2360     18.6      18.77      -73.1
    10  split mode all the above procedures    0.2374     18.1      19.00      -75.2
Table 5.2: Memory and time performance for sonicboom.

It is worth noticing that the effect of the split mode is really an increase in memory traffic rather than in memory peak size. For example, splitting conddirflux certainly results in a higher memory traffic, but the local increase of the local memory peak is hidden by the main memory peak, which occurs just after $\overrightarrow{\mathrm{gradnod}}$. We are currently carrying out new experiments and developing refined models that include this memory traffic.

Practically, for this experiment, our advice would be to run subroutines fluroe, vcurvm and conddirflux (experiment 08) in split mode in any case, which already gives a 14.7% improvement in time at virtually no cost in memory. In the case where memory size is not strongly limited, it is advisable to run gradnod in split mode too, which gives an additional gain in time at the cost of a large increase in memory peak.
5.3 Experiment III: STICS
stics is an agronomy modeling program. It has 21,010 loc, and the generated reverse differentiated code has 46,921 loc. In the code of stics, we introduce three levels of nested loops around subroutine onebigloop because this code simulates an unsteady process over 400 time steps. These nested loops are a manual modification that allows us to perform checkpointing on various groups of time steps. We acknowledge that this simplistic method is far from the known optimal strategy first described in [5].
For this experiment, the default (Joint-All) strategy applied by tapenade gave very bad results in time, with a slowdown factor of about 100 from the original code to the reverse differentiated code. We made some measurements of the tape sizes compared to the snapshot sizes, and we found out that the tape was much smaller than the snapshot for subroutines densirac, croira and onebigloop. This is a special case of the situation of Figure 8 and is reflected in the experimental figures of Table 5.3. We see that split mode on these three procedures gains execution time at no memory cost. Combined split mode on the three procedures gains up to 89.8% in execution time and 34.9% in peak memory.
[Figure 11 (diagram): call graph with procedures BIGFUNCTION, ONEBIGLOOP, INITIAL, INICLIM, LECSTAT, RECUP, BIOMAER, DENSIRAC, CROIRA, TRANSPI, MINERAL, LIXIV, BILAN and PROFIL.]
Figure 11: stics call graph.
    Experiment                                 Time                 Memory
    Id  Description                            Total [s]  % gain    Peak [Mb]  % gain
    01  Joint-All strategy                     38.56                229.23
    02  split mode biomaer                     36.15       6.3      229.23        0.0
    03  split mode mineral                     35.78       7.2      229.28        0.0
    04  split mode densirac                    30.02      22.1      229.23        0.0
    05  split mode croira                      24.45      36.6      229.23        0.0
    06  split mode onebigloop                  23.75      38.4      229.75       -0.2
    07  04 and 05                              16.79      56.5      229.23        0.0
    08  04 and 06                              15.64      59.4      229.75       -0.2
    09  05 and 06                              11.71      69.6      206.81        9.8
    10  04, 05 and 06                           3.93      89.8      149.11       34.9
    11  03, 04, 05 and 06                       3.92      89.8      149.11       34.9
    12  split all the above procedures          3.90      89.9      149.11       34.9
Table 5.3: Memory and time performance for stics.
The enormous gain in execution time makes the differentiated/original ratio go down to about 7, which is what AD tools generally claim. In the stics experiment, the execution time of the Joint-All version did not come from the duplicate executions due to checkpointing but rather from the time needed to PUSH and POP these very large snapshots. This suggests that a complete model to study optimal checkpointing strategies should definitely take into account the time spent on tape and snapshot operations.
Practically, in the stics example there is no doubt that densirac, croira and onebigloop should be differentiated in split mode. In addition, one can differentiate additional procedures in split mode (e.g. mineral), but the additional execution time gain is marginal.
6 CONCLUSION, RELATED WORKS, FUTURE WORK
This paper is a contribution towards the ultimate goal of optimally placing checkpoints in adjoint codes built by reverse mode Automatic Differentiation. We started from the observation that the strategy consisting in checkpointing each and every procedure call, although safe from the memory point of view, is in general far from optimal. Both simulations on very small examples and real experiments on real-life programs show that some procedures should never be checkpointed, and that others may be checkpointed depending on the available memory. The great variety of possible situations makes the objective of automatic selection of checkpointing sites very distant. It therefore seems reasonable to let the user drive this choice through an adapted user interface. We discussed the developments that we made in the AD tool tapenade to add this functionality. This new functionality allowed us to conduct extensive experiments on real codes, which justified a posteriori our hypotheses on this optimal checkpointing problem and suggest the relevant criteria for a future helping tool, namely, for each procedure, its execution time, its tape and snapshot sizes, and the time required by tape PUSH and POP traffic.
Related works on optimal checkpointing have been conducted mostly on the model case of loops of fixed-size iterations. Only in the particular sub-case where the number of iterations is known in advance was an optimal scheme found mathematically [5]. This gave rise to the treeverse/revolve [6] tool for an automatic application of this scheme. In the case where the number of iterations is not known in advance, a very interesting sub-optimal scheme was proposed in [13]. We are not aware of optimal checkpointing schemes for the case of an arbitrary call tree or call graph. Notice that checkpointing is not the only way to improve the performance of the reverse mode of AD. Local optimization can reduce the computation cost of the derivatives by re-ordering the sub-expressions inside derivatives [8]. Other optimizations implement a fine-grain time/memory trade-off by storing expensive sub-expressions that occur several times in the derivatives. In any case these are local optimizations that only give a fixed small benefit. For large programs, only nested checkpointing can make reverse differentiated codes actually run without exceeding the memory capacity of the machine, and therefore the study of optimal checkpointing schemes is an absolute necessity.
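For the fixed-length loop case, the optimal scheme of [5] is usually summarized by a binomial bound: with s snapshots and at most r repeated forward executions of any step, up to C(s+r, s) steps can be reversed. The sketch below only evaluates this bound; it does not implement the schedule itself. The function names are hypothetical, and the exact counting conventions (for instance whether the initial state counts as a snapshot) are those of [5] and [6], which should be consulted before relying on the numbers.

```python
from math import comb


def max_steps(snapshots: int, repeats: int) -> int:
    """Binomial bound C(snapshots + repeats, snapshots) on reversible steps."""
    return comb(snapshots + repeats, snapshots)


def min_repeats(steps: int, snapshots: int) -> int:
    """Smallest repeat count whose bound covers `steps` iterations."""
    if snapshots < 1:
        raise ValueError("at least one snapshot is required")
    repeats = 0
    while max_steps(snapshots, repeats) < steps:
        repeats += 1
    return repeats
```

For example, `min_repeats(1000, 10)` indicates how many repeated forward executions the bound requires for a loop of 1000 iterations when 10 snapshots fit in memory. The slow growth of this number with the loop length is the logarithmic behaviour that the title of [5] refers to.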
User-driven placement of checkpointing is an important step in this direction, but further work is needed to help this placement or to propose a good enough automatic strategy. This could be based on execution time profiling of the original program or even of the differentiated code itself. In any case, we need to study the experimental figures found and to refine the model we have built for the performance of reverse differentiated codes. In particular this model must better take into account some of the surprising effects we have found, such as time gains that do not add up. This suggests a process of iterative improvements of the reverse differentiated codes, based on previous runs, much like what is done in iterative compilation [15].
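One possible shape for such an iterative process is sketched below. It is an illustration under stated assumptions, not a strategy evaluated in this report. The callback `measure` is hypothetical: it stands for regenerating the adjoint code with a given set of procedures in split mode, running it, and returning the observed time and peak memory. Because time gains do not add up, the loop re-measures every candidate configuration instead of summing individual gains.

```python
def choose_split_procedures(candidates, measure, memory_limit):
    """Greedy, measurement-driven choice of procedures to put in split mode.

    `measure(split_set)` is a hypothetical callback returning a pair
    (execution_time, peak_memory) for the adjoint code in which exactly
    the procedures of `split_set` are differentiated in split mode.
    The starting point (nothing in split mode) is assumed to fit in memory.
    """
    split = frozenset()
    best_time, _ = measure(split)
    improved = True
    while improved:
        improved = False
        # Try adding each remaining procedure and keep the best of this round.
        for proc in sorted(set(candidates) - split):
            time, memory = measure(split | {proc})
            if memory <= memory_limit and time < best_time:
                best_time = time
                best_choice = split | {proc}
                improved = True
        if improved:
            split = best_choice
    return split, best_time
```

The loop stops as soon as no single additional procedure improves the measured time within the memory limit. Each round costs one run per remaining candidate, so in practice the candidate list would be restricted to the procedures that profiling shows to be expensive.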
References
[1] Aho A., Sethi R., and Ullman J. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.
[2] Corliss, G., Faure, Ch., Griewank, A., Hascoët, L., and Naumann, U. Automatic Differentiation of Algorithms, from Simulation to Optimization. Springer, Selected papers from AD2000, 2001.
[3] Giering, R.: Tangent linear and adjoint model compiler, user manual. Technical report, [www http://www.autodiff.com/tamc], 1997.
[4] F. Courty, A. Dervieux, B. Koobus, L. Hascoët. Reverse automatic differentiation for optimum design: from adjoint state assembly to gradient computation. Optimization Methods and Software, Vol. 18(5), 615-627, 2003.
[5] A. Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optimization Methods and Software, Vol. 1(1), 35-54, 1992.
[6] A. Griewank and A. Walther. Algorithm 799: Revolve: An Implementation of Checkpoint for the Reverse or Adjoint Mode of Computational Differentiation. ACM Trans. Math. Software, 26(1), 1999.
[7] A. Griewank. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Frontiers in Appl. Math. SIAM, 2000.
[8] A. Griewank and U. Naumann. Accumulating Jacobians as Chained Sparse Matrix Products. Math. Prog., 3(95), 555-571, Springer, 2003.
[9] Hascoët, L., Naumann, U., Pascual, V. TBR Analysis in Reverse-Mode Automatic Differentiation. Preprint ANL-MCS/P936-0202, Argonne National Laboratory, also Research Report #RR-4856, INRIA, 2002.
[10] L. Hascoët, V. Pascual. tapenade 2.1 User's guide. Technical Report #RT-0300, INRIA, 2004.
[11] L. Hascoët, M. Araya-Polo. The Adjoint Data-Flow Analyses: Formalization, Properties, and Applications. Automatic Differentiation: Applications, Theory, and Tools. Lecture Notes in Computational Science and Engineering, 135-146, Springer, 2005.
[12] L. Hascoët, B. Dauvergne. The Data-Flow Equations of Checkpointing in reverse Automatic Differentiation, accepted paper at ICCS 2006, University of Reading, UK, May 28-31, 2006.
[13] M. Hinze, J. Sternberg. A-Revolve: an adaptive memory and run-time-reduced procedure for calculating adjoints; with an application to the instationary Navier-Stokes system. Optim. Methods Softw., 20, 645-663, 2005.
[14] A. Jameson. Aerodynamic design via control theory. SIAM Journal on Scientific Computing, Vol 3(3), 233-261, 1998.
[15] T. Kisuki, P.M.W. Knijnenburg, M.F.P. O'Boyle, H.A.G. Wijshoff. Iterative Compilation in Program Optimization, In Proc. CPC2000, pages 35-44, 2000.