Optimizing Java Bytecode using the Soot Framework: Is it Feasible?

Raja Vallée-Rai, Etienne Gagnon, Laurie Hendren, Patrick Lam, Patrice Pominville and Vijay Sundaresan
{rvalleerai,gagnon,hendren,plam,patrice}@sable.mcgill.ca
vijaysun@ca.ibm.com
Sable Research Group, School of Computer Science, McGill University
Abstract. This paper presents Soot, a framework for optimizing Java™ bytecode. The framework is implemented in Java and supports three intermediate representations for representing Java bytecode: Baf, a streamlined representation of Java's stack-based bytecode; Jimple, a typed three-address intermediate representation suitable for optimization; and Grimp, an aggregated version of Jimple.
Our approach to class file optimization is to first convert the stack-based bytecode into Jimple, a three-address form more amenable to traditional program optimization, and then convert the optimized Jimple back to bytecode.
In order to demonstrate that our approach is feasible, we present experimental results showing the effects of processing class files through our framework. In particular, we study the techniques necessary to effectively translate Jimple back to bytecode, without losing performance. Finally, we demonstrate that class file optimization can be quite effective by showing the results of some basic optimizations using our framework. Our experiments were done on ten benchmarks, including seven SPECjvm98 benchmarks, and were executed on five different Java virtual machine implementations.
1 Introduction
Java provides many attractive features such as platform independence, execution safety, garbage collection and object orientation. These features facilitate application development but are expensive to support; applications written in Java are often much slower than their counterparts written in C or C++. To use these features without having to pay a great performance penalty, sophisticated optimizations and runtime systems are required. For example, Just-In-Time compilers[1], adaptive compilers such as Hotspot™, and Way-Ahead-Of-Time Java compilers[18,17] are three approaches used to improve performance.
Our approach is to statically optimize Java bytecode. There are several reasons for optimizing at the bytecode level. Firstly, the optimized bytecode can be executed by any Java virtual machine, so the overall improvement is due to both our static bytecode optimization and the optimizations and sophisticated runtime systems used in the virtual machines executing the optimized bytecode. Secondly, many different compilers for a variety of languages (Ada, Scheme, Fortran, Eiffel, etc.) now produce Java bytecode as their target code. Thus, our optimization techniques can be applied as a backend to all of these compilers.
The goal of our work is to develop tools that simplify the task of optimizing Java bytecode, and to demonstrate that significant optimization can be achieved using these tools. Thus, we have developed the Soot[20] framework, which provides a set of intermediate representations and a set of Java APIs for optimizing Java bytecode directly. Since our framework is written in Java, and provides a set of clean APIs, it should be portable and easy to build upon.
Early in our work we found that optimizing stack-based bytecode directly was, in general, too difficult. Thus, our framework consists of three intermediate representations, two of which are stackless representations. Baf is a streamlined representation of the stack-based bytecode, whereas Jimple and Grimp are more standard, stackless intermediate representations. Jimple is a three-address representation, where each instruction is simple, and each variable has a type. It is ideal for implementing standard compiler analyses and transformations. Grimp is similar to Jimple, but has aggregated expressions. Grimp is useful for decompilation, and as a means to generate efficient bytecode from Jimple.
In order to optimize bytecode we first convert bytecode to Jimple (the three-address representation), analyze and optimize the Jimple code, and then convert the optimized Jimple back to bytecode. In this paper we focus on the techniques used to translate from Jimple to efficient bytecode, and we give two alternative approaches to this translation.
Our framework is designed so that many different analyses could be implemented, above and beyond those carried out by javac or a JIT. A typical transformation might be removing redundant field accesses, or inlining. For these sorts of optimizations, improvements at the Jimple level correspond directly to improvements in the final bytecode produced.
We have performed substantial experimental studies to validate our approach. Our first results show that we can indeed go through the cycle, bytecode to Jimple and back to bytecode, without losing any performance. This means that any optimizations made in Jimple will likely also result in optimizations in the final bytecode. Our second set of results shows the effect of optimizing class files using method inlining and a set of simple intraprocedural optimizations. These results show that we can achieve performance gains over a variety of Java Virtual Machines.
In summary, our contributions in this paper are: (1) three intermediate representations which provide a general-purpose framework for bytecode optimization; (2) techniques and experimental results for translating Jimple back to efficient bytecode; and (3) experimental results showing that class file optimization can be effective. The rest of the paper is organized as follows. Section 2 gives an overview of the framework and the intermediate representations. Section 3 describes two alternative approaches to translating Jimple back to bytecode. Section 4 presents and discusses our experimental results. Section 5 discusses related work and Section 6 covers the conclusions and future work.
2 Framework Overview
The Soot framework has been designed to simplify the process of developing new optimizations for Java bytecode. Figure 1 shows the overall structure of the framework. As indicated at the top of the figure, many different compilers can be used to generate the class files. The large box labeled SOOT demonstrates that the framework takes the original class files as input, and produces optimized class files as output. Finally, as shown at the bottom of the figure, these optimized class files can then be used as input to Java interpreters, Just-In-Time (JIT) compilers, adaptive execution engines like Hotspot™, and Ahead-of-Time compilers such as the High Performance Compiler for Java (HPCJ)™ or TowerJ™.
The internal structure of Soot is indicated inside the large box in Figure 1. Shaded components correspond to modules that we have designed/implemented, and each component is discussed in the following subsections.
2.1 Jimplify
The first phase is to jimplify the input class files, that is, to convert class files to the Jimple three-address representation. We convert to a three-address representation because optimizing stack code directly is awkward for several reasons. First, the stack implicitly participates in every computation; there are effectively two types of variables, the implicit stack variables and explicit local variables. Second, the expressions are not explicit, and must be located on the stack[25]. For example, a simple instruction such as add can have its operands separated by an arbitrary number of stack instructions, and even by basic block boundaries. Another difficulty is the untyped nature of the stack and of the local variables in the bytecode, as this does not allow analyses to make use of the declared type of variables. A fourth problem is the jsr bytecode. The jsr bytecode is difficult to handle because it is essentially an interprocedural feature which is inserted into a traditionally intraprocedural context.
We produce Jimple from bytecode by first processing the bytecode from the input class files and representing it in an internal format. During this translation, jsr instructions are eliminated by inline expansion of code. After producing the internal form of bytecode, the translation to Jimple code proceeds as follows:
1. Produce naive 3-address code: Map every stack variable to a local variable, by determining the stack height at every instruction. Then map each instruction which acts implicitly on the stack variables to a 3-address code statement
[Figure 1 shows the overall data flow: Java, SML, Scheme, and Eiffel source is compiled by javac, MLJ, KAWA, and SmallEiffel into class files. Inside SOOT, the class files are jimplified to Jimple and optimized; Option I generates naive Baf stack code and then optimizes the stack code, while Option II aggregates Jimple into Grimp and generates stack code from it. The resulting Baf is written out as Jasmin files and assembled by JASMIN into optimized class files, which can then be executed by an interpreter, a JIT, an adaptive engine, or an ahead-of-time compiler.]

Fig. 1. Soot Framework Structure
which refers to the local variables explicitly. For example, if the current stack height is 3, then the instruction iadd would be translated to the Jimple statement $i2 = $i2 + $i3. This is a standard technique and is also used in other systems[18,17].
2. Type the local variables: The resulting Jimple code may be untypable because a local variable may be used with two different types in different contexts. The local variables are split so that each variable can be given a primitive, class, or interface type. To do this, we invoke an algorithm described in [9]. The complete solution to this typing problem is NP-complete, but in practice simple polynomial algorithms suffice. Although splitting the local variables in this step produces many local variables, the resulting Jimple code tends to be easier to analyze because it inherits some of the disambiguation benefits of SSA form[5].
3. Clean up the code: Jimple code must now be compacted because step 1 produced extremely verbose code[18,17]. We have found simple aggregation (collapsing single def/use pairs) followed by copy propagation/elimination of stack variables to be sufficient to eliminate almost all redundant stack variables.
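As an illustration of step 1, the naive mapping can be sketched by tracking the stack height while walking the instruction stream. This is a hypothetical toy version handling only a small push/iadd/store subset, not Soot's actual API; the instruction names and the $i numbering convention simply follow the iadd example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of step 1: the stack slot at height k becomes local
// $i<k>, and each stack instruction is rewritten as a 3-address statement.
public class NaiveJimplifier {
    public static List<String> jimplify(List<String> bytecode) {
        List<String> out = new ArrayList<>();
        int height = 0; // stack height before the current instruction
        for (String ins : bytecode) {
            if (ins.startsWith("push ")) {
                height++;   // pushes one value, landing in slot $i<height>
                out.add("$i" + height + " = " + ins.substring(5) + ";");
            } else if (ins.equals("iadd")) {
                // pops two values and pushes their sum: at height h this
                // becomes $i<h-1> = $i<h-1> + $i<h>
                // (e.g. $i2 = $i2 + $i3 at height 3, as in the text)
                out.add("$i" + (height - 1) + " = $i" + (height - 1)
                        + " + $i" + height + ";");
                height--;
            } else if (ins.startsWith("store ")) {
                out.add(ins.substring(6) + " = $i" + height + ";");
                height--;
            }
        }
        return out;
    }
}
```

For the sequence push 1; push 2; iadd; store x, this produces the verbose code $i1 = 1; $i2 = 2; $i1 = $i1 + $i2; x = $i1; which step 3 later compacts.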
Figure 2(a) shows an input Java program and Figure 2(b) shows the bytecode generated by javac as we would represent it in our Baf representation. Figure 2(c) shows the Jimple code that would result from jimplifying the bytecode. Note that all local variables have been given types. Variables beginning with $ correspond to variables that were inserted to stand for stack locations, whereas variables that do not begin with $ correspond to local variables from the bytecode.
2.2 Optimize Jimple
Most of the analyses and transformations developed by our group, as well as other research groups, are implemented using the Jimple representation. We have currently implemented many intraprocedural and interprocedural optimizations. In this paper we focus on studying the effect of several simple intraprocedural techniques in conjunction with method inlining.
2.3 Convert Jimple back to Stack Code

Producing bytecode naively from Jimple code produces highly inefficient code. Even the best JIT that we tested cannot make up for this inefficiency. It is therefore very important that we do not introduce inefficiencies at this point that would negate the optimizations performed on Jimple.
We have investigated two alternatives for producing stack code, labeled Option I and Option II in Figure 1. These two options are discussed in more detail in Section 3, and they are experimentally validated in Section 4.
2.4 Generate Jasmin Assembler

After converting Jimple to Baf stack code, the final phase is to generate Jasmin[11] assembler files from the internal Baf representation. Since Baf is a relatively direct encoding of the Java bytecode, this phase is quite simple.
int x;
public void f(int i, int c)
{ g(x *= 10);
  while(i * 2 < 10)
  { a[i++] = new Object();
  }
}

(a) Original Java Source

.method public f(II)V
    aload_0
    aload_0
    dup
    getfield A/x I
    bipush 10
    imul
    dup_x1
    putfield A/x I
    invokevirtual A/g(I)V
    goto Label1
Label0:
    aload_0
    getfield A/a [Ljava/lang/Object;
    iload_1
    iinc 1 1
    new java/lang/Object
    dup
    invokenonvirtual java/lang/Object/<init>()V
    aastore
Label1:
    iload_1
    iconst_2
    imul
    bipush 10
    if_icmplt Label0
    return

(b) Bytecode

public void f(int, int)
{
    Example this;
    int i, c, $i0, $i1, $i2, $i3;
    java.lang.Object[] $r1;
    java.lang.Object $r2;
    this := @this;
    i := @parameter0;
    c := @parameter1;
X:  $i0 = this.x;
X:  $i1 = $i0 * 10;
    this.x = $i1;
    this.g($i1);
    goto label1;
label0:
Y:  $r1 = this.a;
    $i2 = i;
    i = i + 1;
    $r2 = new java.lang.Object;
    specialinvoke $r2.<init>();
Y:  $r1[$i2] = $r2;
label1:
Z:  $i3 = i * 2;
Z:  if $i3 < 10 goto label0;
    return;
}

(c) Jimple

Fig. 2. Translating bytecode to Jimple
3 Transformations

After optimizing Jimple code, it is necessary to translate back to efficient Java bytecode. Figure 3(a) illustrates the bytecode produced by a naive translation from Jimple.³ Note that this naive code has many redundant store and load instructions, reflecting the fact that Jimple computation results are stored to local variables, while stack-based bytecode can use the stack to store many intermediate computations.

³ Note that our Baf representation closely mirrors Java bytecode.
3.1 Option I: Optimize Stack Code

The main idea behind the Baf optimizations is to identify and eliminate redundant store/load computations. Figure 3(b) shows the results of applying the Baf optimizations on the naive code given in Figure 3(a).
The optimizations currently in use are all performed on discrete basic blocks within a method. Inter-block optimizations that cross block boundaries have also been implemented, but to date these have not yielded any appreciable speedups and are not described here. In the following discussion it is assumed that all instructions belong to the same basic block.
In practice, the majority of the redundant store and load instructions present in naive Baf code belong to one of the following code patterns:

store/load (sl pair): a store instruction followed by a load instruction referring to the same local variable with no other uses. Both the store and load instructions can be eliminated, and the value will simply remain on the stack.

store/load/load (sll triple): a store instruction followed by 2 load instructions, all referring to the same local variable with no other uses. The 3 instructions can be eliminated and a dup instruction introduced. The dup instruction replaces the second load by duplicating the value left on the stack after eliminating the store and the first load.
We can observe these patterns occurring in Figure 3(a), where labels A, B, C and E each identify distinct sl pairs, and where labels D and G identify sll triples.
To optimize a Baf method our algorithm performs a fixed point iteration over its basic blocks, identifying and reducing these patterns whenever possible. By reducing, we mean eliminating an sl pair or replacing an sll triple with a dup instruction.
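A minimal sketch of this fixed point iteration over one basic block, assuming instructions are plain strings and omitting the "no other uses" check that the real algorithm performs (the class and method names here are hypothetical, not Soot's API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the trivial reductions: an adjacent store/load pair
// on the same local is deleted (the value stays on the stack), and a store
// followed immediately by two loads of the same local becomes a dup.
public class BafReducer {
    public static List<String> reduce(List<String> code) {
        List<String> out = new ArrayList<>(code);
        boolean changed = true;
        while (changed) {           // fixed point iteration over the block
            changed = false;
            for (int i = 0; i + 1 < out.size(); i++) {
                String s = out.get(i);
                if (!s.startsWith("store ")) continue;
                String local = s.substring(6);
                boolean l1 = out.get(i + 1).equals("load " + local);
                boolean l2 = i + 2 < out.size()
                        && out.get(i + 2).equals("load " + local);
                if (l1 && l2) {                 // sll triple -> dup
                    out.set(i, "dup");
                    out.remove(i + 2);
                    out.remove(i + 1);
                    changed = true;
                    break;
                } else if (l1) {                // sl pair -> eliminated
                    out.remove(i + 1);
                    out.remove(i);
                    changed = true;
                    break;
                }
            }
        }
        return out;
    }
}
```

Reducing push 1; store a; load a; store b; load b; load b; iadd yields push 1; dup; iadd: the sl pair on a disappears first, which makes the sll triple on b adjacent and reducible to a dup.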
Reducing a pattern is trivial when all its instructions directly follow each other in the instruction stream. For example, the sl pairs at labels A and E in Figure 3(a) are trivial ones. However, often the store and load instructions are far apart (as at labels B and C). To identify pairs of statements to reduce, we must determine the effect on the stack of the intervening bytecode. These bytecodes are called the interleaving sequences.
We compute two pieces of information as a means to analyse interleaving sequences and their effects/dependencies on the stack. The net effect on the stack height after executing a sequence of bytecode is referred to as the net stack height variation, or nshv. The minimum stack height variation attained while executing a sequence of bytecode is referred to as the minimum stack height variation, or mshv. A sequence of instructions having both nshv = 0 and mshv = 0 is referred to as a level sequence.
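Both metrics can be computed in a single pass over the per-instruction stack effects. A small sketch, where the deltas array is an assumed encoding of each bytecode's net push/pop behaviour (the class name is hypothetical):

```java
// Hypothetical sketch of the two stack metrics: nshv is the net change in
// stack height over a sequence, and mshv is the lowest height (relative to
// the start) reached at any point. A sequence is "level" when both are 0.
public class StackMetrics {
    public final int nshv, mshv;

    public StackMetrics(int[] deltas) { // deltas: stack effect per bytecode
        int h = 0, min = 0;
        for (int d : deltas) {
            h += d;
            if (h < min) min = h;   // track the deepest point reached
        }
        this.nshv = h;
        this.mshv = min;
    }

    public boolean isLevel() {
        return nshv == 0 && mshv == 0;
    }
}
```

For example, the deltas {1, 1, -2} describe a level sequence, while {-1, 2, -1} has nshv = 0 but mshv = -1, so it dips below the starting height and is not level.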
If interleaving sequences are level sequences, then the target patterns can be reduced directly. This is because one can just leave the value on the stack, execute the interleaving instructions, and then use the value that was left on the stack. This is the case for the sl pair labeled B, and later the sl pair labeled C, once B has been reduced. Finally, the sll triple D can be reduced, once both B and C have been reduced. In general, however, many such interleaving sequences are not level.
public void f(int, int)
{   word this, i, c, $i2, $r2;
    this := @this: Example;
    i := @parameter0: int;
    c := @parameter1: int;
    load.r this;
    fieldget <Example: int x>;
A:  store.i c;
A:  load.i c;
    push 10;
    mul.i;
G:  store.i c;
F:  load.r this;
G:  load.i c;
    fieldput <Example: int x>;
F:  load.r this;
G:  load.i c;
    virtualinvoke <Example: void g(int)>;
    goto label1;
label0:
    load.r this;
    fieldget <Example: java.lang.Object[] a>;
B:  store.r c;
    load.i i;
C:  store.i $i2;
    inc.i i 1;
    new java.lang.Object;
D:  store.r $r2;
D:  load.r $r2;
    specialinvoke <java.lang.Object: void <init>()>;
B:  load.r c;
C:  load.i $i2;
D:  load.r $r2;
    arraywrite.r;
label1:
    load.i i;
    push 2;
    mul.i;
E:  store.i $r2;
E:  load.i $r2;
    push 10;
    ifcmplt.i label0;
    return;
}
public void f(int, int)
{   word this, i, c;
    this := @this: Example;
    i := @parameter0: int;
    c := @parameter1: int;
    load.r this;
F:  load.r this;
F:  load.r this;
    fieldget <Example: int x>;
    push 10;
    mul.i;
G:  store.i c;
G:  load.i c;
    fieldput <Example: int x>;
G:  load.i c;
    virtualinvoke <Example: void g(int)>;
    goto label1;
label0:
    load.r this;
    fieldget <Example: java.lang.Object[] a>;
    load.i i;
    inc.i i 1;
    new java.lang.Object;
D:  dup1.r;
    specialinvoke <java.lang.Object: void <init>()>;
    arraywrite.r;
label1:
    load.i i;
    push 2;
    mul.i;
    push 10;
    ifcmplt.i label0;
    return;
}

(a) naive Baf generated from Jimple    (b) optimized Baf
When nshv > 0 for an sl pair, our algorithm will try to lower the nshv value by relocating a bytecode having a positive nshv to an earlier location in the block. This is illustrated by the movement of the instructions labeled F in Figures 3(a) and 3(b), in an attempt to reduce the pattern identified by G. Another strategy, used when nshv < 0, is to move the level subsequence ending with the pattern's store instruction past the interleaving sequence. Of course, this can only be done if no data dependencies are violated.
Applying these heuristics produces optimized Baf code which becomes Java bytecode that is extremely similar to the original bytecode. We observe that, except for two minor differences, the optimized Baf code in Figure 3(b) is the same as the original bytecode found in Figure 2(b). The differences are: (1) the second load labeled F is not converted to a dup; and (2) the pattern identified by G is not reduced to a dup_x1 instruction. We have actually implemented these patterns, but our experimental results did not justify enabling these extra transformations. In fact, introducing bytecodes such as dup_x1 often yields non-standard, albeit legal, bytecode sequences that increase execution time and cause many JIT compilers to fail.
3.2 Option II: Build Grimp and Traverse

In this section, we describe the second route for translating Jimple into bytecode. The compiler javac is able to produce efficient bytecode because it has the structured tree representation of the original program, and the stack based nature of the bytecode is particularly well suited for code generation from trees[2]. Essentially, this phase attempts to recover the original structured tree representation, by building Grimp, an aggregated form of Jimple, and then producing stack code by standard tree traversal techniques.
Grimp is essentially Jimple, but the expressions are trees of arbitrary depth. Figure 4(a) shows the Grimp version of the Jimple program in Figure 2(c).
Aggregation of bytecode The basic algorithm for aggregation is as follows. We consider pairs (def, use) in extended basic blocks, where def is an assignment statement whose only use is at use, and use has the unique definition def. We inspect the path between def and use to determine if def can be safely moved into use. This means checking for dependencies and not moving across exception boundaries. We perform this algorithm iteratively, and the pairs are considered in reverse pseudo-topological order to cascade the optimizations as efficiently as possible. Examples of these aggregation opportunities are shown in Figure 2(c) at the labels X, Y and Z. X and Z are trivial cases because their aggregation pairs are adjacent, but the pair at Y is a few statements apart. Thus, before producing the aggregated code in Figure 4(a), the interleaving statements must be checked for writes to this.a.
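The safety check can be sketched as follows, modeling each intervening statement simply by the set of names it writes. This is a coarse stand-in for Soot's real dependency and exception-boundary tests, and the class and method names are hypothetical:

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a definition "local = expr" may be folded into its
// sole use only if no statement on the path between them writes a name
// that expr reads (e.g. the check for writes to this.a at label Y).
public class Aggregator {
    public static boolean canMove(Set<String> exprReads,
                                  List<Set<String>> betweenWrites) {
        for (Set<String> writes : betweenWrites) {
            for (String w : writes) {
                if (exprReads.contains(w)) return false; // dependency violated
            }
        }
        return true;
    }
}
```

For the Y pair above, exprReads would contain this.a; the intervening statements write only i and $i2, so the move is safe, whereas any intervening write to this.a would block it.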
Peephole optimizations In some cases, Grimp cannot concisely express Java constructs.

public void f(int, int)
{
    Example this;
    int i, c, $i1, $i2;
    this := @this;
    i := @parameter0;
    c := @parameter1;
X:  $i1 = this.x * 10;
    this.x = $i1;
    this.g($i1);
    goto label1;
label0:
    $i2 = i;
    i = i + 1;
Y:  this.a[$i2] = new java.lang.Object();
label1:
Z:  if i * 2 < 10 goto label0;
    return;
}
{   word this, i, c, $i2;
    this := @this: Example;
    i := @parameter0: int;
    c := @parameter1: int;
    load.r this;
    fieldget <Example: int x>;
    push 10;
    mul.i;
    store.i c;
    load.r this;
    load.i c;
    fieldput <Example: int x>;
    load.r this;
    load.i c;
    virtualinvoke <Example: void g(int)>;
    goto label1;
label0:
    load.r this;
    fieldget <Example: java.lang.Object[] a>;
    load.i i;
    inc.i i 1;
    new java.lang.Object;
    dup1.r;
    specialinvoke <java.lang.Object: void <init>()>;
    arraywrite.r;
label1:
    load.i i;
    push 2;
    mul.i;
    push 10;
    ifcmplt.i label0;
    return;
}

(a) Grimp    (b) Baf generated from Grimp

Fig. 4. Generating Baf from Grimp
For instance, the embedded increment in a[i++] must be expressed with the incremented value on the left-hand side of an assignment statement, not as a side effect. To remedy this problem, we use some peephole optimizations in the code generator for Grimp.
For example, for the increment case, we search for Grimp patterns of the form:

s1: local = <lvalue>;
s2: <lvalue> = local/<lvalue> + 1;
s3: use(local)

and we ensure that the local defined in s1 has exactly two uses, and that the uses in s2 and s3 have exactly one definition. Given this situation, we emit code for only s3. However, during the generation of code for s3, when local is to be emitted, the code for the increment is emitted in its place. An example of this pattern occurs just after label0 in Figure 4(a).
This approach produces reasonably efficient bytecode. In some situations the peephole patterns fail and the complete original structure is not recovered. In these cases, the Baf approach usually performs better. See Section 4 for more details.
4 Experimental Results

Here we present the results of two experiments. The first experiment, discussed in Section 4.3, validates that we can pass class files through the framework, without optimizing the Jimple code, and produce class files that have the same performance as the original ones. In particular, this shows that our methods of converting from Jimple to stack-based bytecode are acceptable. The second experiment, discussed in Section 4.4, shows the effect of applying method inlining on Jimple code and demonstrates that optimizing Java bytecode is feasible and desirable.
4.1 Methodology

All experiments were performed on dual 400MHz Pentium II™ machines. Two operating systems were used, Debian GNU/Linux (kernel 2.2.8) and Windows NT 4.0 (service pack 5). Under GNU/Linux we ran experiments using three different configurations of the Blackdown Linux JDK 1.2, pre-release version 2.⁵ The configurations were: interpreter, Sun JIT, and a public beta version of Borland's JIT.⁶ Under Windows NT, two different configurations of Sun's JDK 1.2.2 were used: the JIT, and HotSpot (version 1.0.1).
Execution times were measured by running the benchmarks ten times, discarding the best and worst runs, and averaging the remaining eight. All executions were verified for correctness by comparing the output to the expected output.
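The timing methodology above amounts to a trimmed mean; a small sketch (the class and method names are ours, not part of the benchmark harness):

```java
import java.util.Arrays;

// Run ten times, drop the best and worst runs, average the remaining eight.
public class Timing {
    public static double trimmedMean(double[] runs) {
        if (runs.length != 10)
            throw new IllegalArgumentException("expected ten runs");
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < 9; i++)  // skip sorted[0] and sorted[9]
            sum += sorted[i];
        return sum / 8;
    }
}
```

Discarding the extremes makes the reported time robust against one-off outliers such as a cold start or a background process stealing the CPU.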
4.2 Benchmarks and Baseline Times

The benchmarks used consist of seven of the eight standard benchmarks from the SPECjvm98 suite, plus three additional applications from our collection. See Figure 5. We discarded the mtrt benchmark from our set because it is essentially the same benchmark as raytrace. The soot-c benchmark is based on an older version of Soot, and is interesting because it is heavily object oriented; schroeder-s is an audio editing program which manipulates sound files; and jpat-p is a protein analysis tool.
Figure 5 also gives basic characteristics such as size, and running times on the five platforms. All of these benchmarks are real world applications of reasonable size. Running times are given in seconds for the Linux interpreter as the base time, and all the fractional execution times are with respect to this base.

⁵ http://www.blackdown.org
⁶ http://www.borland.com
A dash given for a running time indicates that the benchmark failed validity checks. In all these cases, the virtual machine is to blame, as the programs run correctly with the interpreter with the verifier explicitly turned on. Arithmetic averages and standard deviations are also given, and these automatically exclude those running times which are not valid.
For this set of benchmarks, we can draw the following observations. The Linux JIT is about twice as fast as the interpreter, but this varies widely depending on the benchmark. For example, with compress it is more than six times faster, but for a benchmark like schroeder-s it is only 56% faster. The NT virtual machines also tend to be twice as fast as the Linux JIT. Furthermore, the performance of the HotSpot performance engine seems to be, on average, not that different from the standard Sun JIT. Perhaps this is because the benchmarks are not long running server side applications (i.e. not the kinds of applications for which HotSpot was designed).
4.3 Straight through Soot

Figure 6 compares the effect of processing applications with Soot with Baf and Grimp, without performing any optimizations. Fractional execution times are given, and these are with respect to the original execution time of the benchmark for a given platform. The ideal result is 1.00, which means that the same performance is obtained as with the original application. For javac the ratio is .98, which indicates that javac's execution time has been reduced by 2%. raytrace has a ratio of 1.02, which indicates that it was made slightly slower; its execution time has been increased by 2%. The ideal arithmetic average for these tables is 1.00, because we are trying simply to reproduce the program as is. The ideal standard deviation is 0, which would indicate that the transformation has a consistent effect, and the results do not deviate from 1.00.
On average, using Baf tends to reproduce the original execution time. Its average is lower than Grimp's, and the standard deviation is lower as well. For the faster virtual machines (the ones on NT), this difference disappears. The main disadvantage of Grimp is that it can produce a noticeable slowdown for benchmarks like compress which have tight loops on Java statements containing side effects, which it does not always catch.
Both techniques have similar running times, but implementing Grimp and its aggregation is conceptually simpler. In terms of code generation for Java virtual machines, we believe that if one is interested in generating code for slow VMs, then the Baf-like approach is best. For fast VMs, or if one desires a simpler compiler implementation, then Grimp is more suitable.
We have also measured the size of the bytecode before and after processing with Soot. The sizes are very similar, with the code after processing sometimes slightly larger and sometimes slightly smaller.

4.4 Inlining with Class Hierarchy Analysis

We have investigated the feasibility of optimizing Java bytecode with Soot by implementing method inlining. Although inlining is a whole program optimization, Soot is also suitable for optimizations applicable to partial programs. Our approach to inlining is simple. We build an invoke graph using class hierarchy analysis[8] and inline method calls whenever they resolve to one method. Our inliner is a bottom-up inliner, and attempts to inline all call sites subject to the following restrictions: (1) the method to be inlined must contain fewer than 20 Jimple statements, (2) no method may contain more than 5000 Jimple statements, and (3) no method may have its size increased by more than a factor of 3.
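The three restrictions can be sketched as a single predicate. The thresholds are the ones stated above, while the class name, parameters, and the bookkeeping of the caller's original size are our assumptions about how such an inliner might be structured:

```java
// Hypothetical sketch of the inlining restrictions, sizes counted in
// Jimple statements.
public class InlineHeuristic {
    static final int MAX_INLINEE_SIZE = 20;     // restriction (1)
    static final int MAX_METHOD_SIZE = 5000;    // restriction (2)
    static final int MAX_EXPANSION_FACTOR = 3;  // restriction (3)

    // calleeSize: size of the method to inline; callerSize: the caller's
    // current size; originalCallerSize: its size before any inlining
    // (assumed to be tracked by the inliner across repeated inlinings).
    public static boolean mayInline(int calleeSize, int callerSize,
                                    int originalCallerSize) {
        return calleeSize < MAX_INLINEE_SIZE
                && callerSize + calleeSize <= MAX_METHOD_SIZE
                && callerSize + calleeSize
                        <= originalCallerSize * MAX_EXPANSION_FACTOR;
    }
}
```

For example, a 10-statement callee may be inlined into a 100-statement caller, but a 25-statement callee is rejected by restriction (1), and repeated inlining stops once the caller reaches three times its original size.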
After inlining, the following traditional intraprocedural optimizations are performed to maximize the benefit from inlining: copy propagation, constant propagation and folding, conditional and unconditional branch folding, dead assignment elimination and unreachable code elimination. These are described in [2].
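As a sketch of one of these clean-up passes, constant propagation and folding over straight-line code can be modeled with an environment of known constants. This is a hypothetical toy handling only integer constants and additions, not Soot's implementation:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: statements are either {target, constant} or
// {target, lhs, rhs} (meaning target = lhs + rhs). Known constants are
// propagated and additions over two known constants are folded.
public class ConstProp {
    public static Map<String, Integer> run(List<String[]> stmts) {
        Map<String, Integer> env = new HashMap<>();
        for (String[] s : stmts) {
            if (s.length == 2) {
                env.put(s[0], Integer.parseInt(s[1])); // target = constant
            } else {
                Integer a = env.get(s[1]), b = env.get(s[2]);
                if (a != null && b != null) env.put(s[0], a + b); // fold
                else env.remove(s[0]);   // target no longer a known constant
            }
        }
        return env;
    }
}
```

After inlining exposes constants across former call boundaries, a pass like this (plus branch folding and dead code elimination) removes the residue the inliner leaves behind.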
Figure 7 gives the result of performing this optimization. The numbers presented are fractional execution times with respect to the original execution time of the benchmark for a given platform. For the Linux virtual machines, we obtain a significant improvement in speed. In particular, for the Linux Sun JIT, the average ratio is .92, which indicates that the average running time is reduced by 8%. For raytrace, the results are quite significant, as we obtain a ratio of .62, a reduction of 38%.
For the virtual machines under NT, the average is 1.00 or 1.01, but a number of benchmarks experience a significant improvement. One benchmark, javac under the Sun JIT, experiences significant degradation. However, under the same JIT, raytrace yields a ratio of .89, and under HotSpot, javac, jack and mpegaudio yield some improvements. Given that HotSpot itself performs dynamic inlining, this indicates that our static inlining heuristics sometimes capture opportunities that HotSpot does not. Our heuristics for inlining were tuned for the Linux VMs, and future experimentation could produce values which are better suited for the NT virtual machines.
These results are highly encouraging, as they strongly suggest that a significant amount of improvement can be achieved by performing aggressive optimizations which are not performed by the virtual machines.
5 Related Work

Related work falls into five different categories:

Java bytecode optimizers: There are only two Java tools of which we are aware that perform significant optimizations on bytecode and produce new class files: Cream[3] and Jax[23]. Cream performs optimizations such as loop invariant removal and common sub-expression elimination using a simple side effect analysis. Only extremely small speed-ups (1% to 3%) are reported.
             Stmts  Sun Int.   Sun   Bor.   Sun   Sun
                     (secs)    JIT   JIT    JIT   Hot.
compress      7322   440.30    .15   .14    .06   .07
db            7293   259.09    .56   .58    .26   .14
jack         16792   151.39    .43   .32    .15   .16
javac        31054   137.78    .52   .42    .24   .33
jess         17488   109.75    .45   .32    .21   .12
jpat-p        1622    47.94   1.01   .96    .90   .80
mpegaudio    19585   368.10    .15     -    .07   .10
raytrace     10037   121.99    .45   .23    .16   .12
schroeder-s   9713    48.51    .64   .62    .19   .12
soot-c       42107    85.69    .58   .45    .29   .53
average                        .49   .45    .25   .25
std.dev.                       .23   .23    .23   .23

Fig. 5. Benchmarks and their characteristics.
                          Baf                         Grimp
                  Linux          NT           Linux          NT
             Sun  Sun  Bor.  Sun  Sun    Sun  Sun  Bor.  Sun  Sun
             Int. JIT  JIT   JIT  Hot.   Int. JIT  JIT   JIT  Hot.
compress    1.01 1.00  .99   .99 1.00   1.07 1.02 1.04  1.00 1.01
db           .99 1.01 1.00  1.00 1.00   1.01 1.05 1.01  1.01 1.02
jack        1.00 1.00 1.00     - 1.00   1.01  .99 1.00     - 1.00
javac       1.00  .98 1.00  1.00  .97    .99 1.03 1.00  1.00  .95
jess        1.02 1.01 1.04   .99 1.01   1.01 1.02 1.04   .97 1.00
jpat-p      1.00  .99 1.00  1.00 1.00    .99 1.01 1.01  1.00 1.00
mpegaudio   1.05 1.00    -     - 1.00   1.03 1.00    -     - 1.01
raytrace    1.00 1.02 1.00   .99 1.00   1.01 1.00  .99   .99 1.00
schroeder-s  .97 1.01    -  1.03 1.01    .98  .99    -  1.03 1.00
soot-c       .99 1.00 1.02   .99 1.03   1.00 1.01 1.00  1.01 1.01
average     1.00 1.00 1.01  1.00 1.00   1.01 1.01 1.01  1.00 1.00
std.dev.     .02  .01  .01   .01  .01    .02  .02  .02   .02  .02

Fig. 6. The effect of processing class files with Soot using Baf or Grimp, without optimization.
The main goal of Jax is application extraction: for example, unused methods and fields are removed and the class hierarchy is compressed. They are also interested in speed optimizations, but at this time their published speedup results are extremely limited.
Bytecode manipulation tools: There are a number of Java tools which provide frameworks for manipulating bytecode: JTrek[13], Joie[4], Bit[14] and JavaClass[12]. These tools are constrained to manipulating Java bytecode in its original form, however. They do not provide convenient intermediate representations such as Baf, Jimple or Grimp for performing analyses or transformations.
             Sun   Sun  Bor.   Sun   Sun
             Int.  JIT  JIT    JIT   Hot.
compress    1.01   .78 1.00   1.01   .99
db           .99  1.01 1.00   1.00  1.00
jack        1.00   .98  .99      -   .97
javac        .97   .96  .97   1.11   .93
jess         .93   .93 1.01    .99  1.00
jpat-p       .99   .99 1.00   1.00  1.00
mpegaudio   1.04   .96    -      -   .97
raytrace     .76   .62  .74    .89  1.01
schroeder-s  .97  1.00  .97   1.02  1.06
soot-c       .94   .94  .96   1.03  1.05
average      .96   .92  .96   1.01  1.00
std.dev.     .07   .12  .08    .06   .04

Fig. 7. The effect of inlining with class hierarchy analysis.
Java application packagers: Tools such as DashOPro[6] and SourceGuard[21] re-package Java applications; typically the packaging consists of code compression and/or code obfuscation. Although we have not yet applied Soot to this application area, we have plans to implement this functionality as well.
Java native compilers: The tools in this category take Java applications and compile them to native executables. These are related because they are all forced to build 3-address code intermediate representations, and some perform significant optimizations. The simplest of these is Toba[18], which produces unoptimized C code and relies on GCC to produce the native code. Slightly more sophisticated, Harissa[17] also produces C code but performs some method devirtualization and inlining first. The most sophisticated systems are Vortex[7] and Marmot[10]. Vortex is a native compiler for Cecil, C++ and Java, and contains a complete set of optimizations. Marmot is also a complete Java optimizing compiler and is SSA based. There are also numerous commercial Java native compilers, such as the IBM® High Performance Compiler for Java, Tower Technology's TowerJ[24], and SuperCede[22], but they have very little published information.
Stack code optimization: Some research has previously been done in the field of optimizing stack code. The work presented in [15] is related, but optimizes the stack code based on the assumption that stack operations are cheaper than manipulating locals. This is clearly not the case for the Java Virtual Machines we are interested in. On the other hand, some closely related work has been done on optimizing naive Java bytecode at the University of Maryland[19]. Their technique is similar, but they present results for only toy benchmarks.
6 Conclusions and Future Work

We have presented Soot, a framework for optimizing Java bytecode. Soot consists of three intermediate representations (Baf, Jimple & Grimp) and transformations between these representations. In this paper we focused on the mechanisms for translating Java stack-based bytecode to our typed three-address representation Jimple, and on the translation of Jimple back to bytecode. Jimple was designed to make the implementation of compiler analyses and transformations simple. Our experimental results show that we can perform the conversions without losing performance, and so we can effectively optimize at the Jimple level. We demonstrated the effectiveness of a set of intraprocedural transformations and inlining on five different Java Virtual Machine implementations.
We are encouraged by our results so far, and we have found that the Soot APIs have been effective for a variety of tasks including the optimizations presented in this paper.
We, and other research groups, are actively engaged in further work on Soot on many fronts. Our group is currently focusing on new techniques for virtual method resolution, pointer analyses, side-effect analyses, and various transformations that can take advantage of accurate side-effect analysis.
Acknowledgements

This research was supported by IBM's Centre for Advanced Studies (CAS), NSERC and FCAR.
References

1. Ali-Reza Adl-Tabatabai, Michal Cierniak, Guei-Yuan Lueh, Vishesh M. Parikh, and James M. Stichnoth. Fast and effective code generation in a just-in-time Java compiler. ACM SIGPLAN Notices, 33(5):280-290, May 1998.
2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.
3. Lars R. Clausen. A Java bytecode optimizer using side-effect analysis. Concurrency: Practice & Experience, 9(11):1031-1045, November 1997.
4. Geoff A. Cohen, Jeffrey S. Chase, and David L. Kaminsky. Automatic program transformation with JOIE. In Proceedings of the USENIX 1998 Annual Technical Conference, pages 167-178, Berkeley, USA, June 15-19 1998. USENIX Association.
5. Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark K. Wegman, and F. Kenneth Zadeck. An efficient method of computing static single assignment form. In 16th Annual ACM Symposium on Principles of Programming Languages, pages 25-35, 1989.
6. DashOPro. http://www.preemptive.com/products.html.
7. Jeffrey Dean, Greg DeFouw, David Grove, Vassily Litvinov, and Craig Chambers. VORTEX: An optimizing compiler for object-oriented languages. In Proceedings OOPSLA '96, Conference on Object-Oriented Programming Systems, Languages, and Applications, volume 31 of ACM SIGPLAN Notices, pages 83-100. ACM, October 1996.
8. Jeffrey Dean, David Grove, and Craig Chambers. Optimization of object-oriented programs using static class hierarchy analysis. In Walter G. Olthoff, editor, ECOOP'95 - Object-Oriented Programming, 9th European Conference, volume 952 of Lecture Notes in Computer Science. Springer, 1995.
9. Etienne Gagnon and Laurie Hendren. Intra-procedural Inference of Static Types for Java Bytecode. Sable Technical Report 1999-1, Sable Research Group, McGill University, March 1999.
10. Robert Fitzgerald, Todd B. Knoblock, Erik Ruf, Bjarne Steensgaard, and David Tarditi. Marmot: an Optimizing Compiler for Java. Microsoft technical report, Microsoft Research, October 1998.
11. Jasmin: A Java Assembler Interface. http://www.cat.nyu.edu/meyer/jasmin/.
12. JavaClass. http://www.inf.fu-berlin.de/~dahm/JavaClass/.
13. Compaq JTrek. http://www.digital.com/java/download/jtrek.
14. Han Bok Lee and Benjamin G. Zorn. A Tool for Instrumenting Java Bytecodes. In The USENIX Symposium on Internet Technologies and Systems, pages 73-82, 1997.
15. Martin Maierhofer and M. Anton Ertl. Local stack allocation. In Kai Koskimies, editor, Compiler Construction (CC'98), pages 189-203, Lisbon, 1998. Springer LNCS 1383.
16. Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
17. Gilles Muller, Bárbara Moura, Fabrice Bellard, and Charles Consel. Harissa: A flexible and efficient Java environment mixing bytecode and compiled code. In Proceedings of the 3rd Conference on Object-Oriented Technologies and Systems, pages 1-20, Berkeley, June 16-20 1997. Usenix Association.
18. Todd A. Proebsting, Gregg Townsend, Patrick Bridges, John H. Hartman, Tim Newsham, and Scott A. Watterson. Toba: Java for applications: A way ahead of time (WAT) compiler. In Proceedings of the 3rd Conference on Object-Oriented Technologies and Systems, pages 41-54, Berkeley, June 16-20 1997. Usenix Association.
19. Tatiana Shpeisman and Mustafa Tikir. Generating Efficient Stack Code for Java. Technical report, University of Maryland, 1999.
20. Soot - a Java Optimization Framework. http://www.sable.mcgill.ca/soot/.
21. 4thpass SourceGuard. http://www.4thpass.com/sourceguard/.
22. SuperCede, Inc. SuperCede for Java. http://www.supercede.com/.
23. Frank Tip, Chris Laffra, Peter F. Sweeney, and David Streeter. Practical Experience with an Application Extractor for Java. IBM Research Report RC21451, IBM Research, 1999.
24. Tower Technology. TowerJ. http://www.twr.com/.
25. Raja Vallée-Rai and Laurie J. Hendren. Jimple: Simplifying Java Bytecode for Analyses and Transformations. Sable Technical Report 1998-4, Sable Research Group, McGill University, 1998.