From Zen to Aum


Gérard Huet

LIMSI, Orsay, 16 March 2004

The Zen toolkit - Generic technology

A few specific applicative techniques:

• Local processing of focused data
• Sharing
• Lexical trees
• Finite transducers as lexicon morphisms
• Search by resumption coroutines
• Multiset ordering convergence

Basics: lists vs stacks

value l5 = [1; 2; 3; 4; 5];
value s5 = [5; 4; 3; 2; 1];
value rec stack_on s = fun [ [] -> s | [h :: t] -> stack_on [h :: s] t ];
(* stack_on s l = unstack l s = (rev l) @ s *)
value rev l = stack_on [] l;
value state3 = ([3; 2; 1], [4; 5]);

Focus

In state3, we focus on the fourth element of l5 by looking at the sublist [4; 5] in a context given by the stack [3; 2; 1]:

... 1; 2; 3] * [4; 5 ...

This is the way a Turing machine navigates, and computes. In two dimensions, we get an editing context of lines and characters in a TEXT file represented in an Emacs buffer focused at the current mark.
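The focused list can be sketched in standard OCaml syntax (the slides use the Camlp4 revised syntax). The names focus, right, left, set and unfocus are illustrative assumptions, not part of the Zen toolkit; stack_on is the function from the previous slide.

```ocaml
(* A focused list: the reversed left context paired with the remaining sublist. *)
type 'a focus = 'a list * 'a list

(* stack_on s l = List.rev l @ s *)
let rec stack_on s = function
  | [] -> s
  | h :: t -> stack_on (h :: s) t

(* Move the focus one element to the right. *)
let right ((ctx, rest) : 'a focus) : 'a focus =
  match rest with
  | [] -> failwith "right"
  | h :: t -> (h :: ctx, t)

(* Move the focus one element to the left. *)
let left ((ctx, rest) : 'a focus) : 'a focus =
  match ctx with
  | [] -> failwith "left"
  | h :: t -> (t, h :: rest)

(* Replace the element under the focus; the context is untouched. *)
let set x ((ctx, rest) : 'a focus) : 'a focus =
  match rest with
  | [] -> failwith "set"
  | _ :: t -> (ctx, x :: t)

(* Forget the focus and rebuild the plain list. *)
let unfocus ((ctx, rest) : 'a focus) : 'a list = stack_on rest ctx

let l5 = [1; 2; 3; 4; 5]

(* Three moves to the right from the start of l5 give state3. *)
let state3 = right (right (right ([], l5)))
```

Every operation touches only the head of one of the two lists, so moving and updating at the focus take constant time, and the previous focused list remains valid after an update.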

Zippers

Zippers give a generic view to the representation of a focused structure by a pair (context, substructure).

First presentation at FLoC'96. Published as: G. Huet. The Zipper. J. Functional Programming 7,5 (1997), 549-554.

Large scale implementations in syntax editors within computational linguistics platforms:

• G. Huet. Lexical morphisms with the Zen toolkit.
• A. Ranta. Grammatical Frameworks.

Example: binary trees

type tree2 = [ Leaf2 | Node2 of (tree2 * tree2) ];
type sum2 = [ Proj21 of tree2 | Proj22 of tree2 ]
and context2 = list sum2
and focus2 = (context2 * tree2);

value left (c, t) = match c with
  [ [(Proj22 s) :: z] -> ([(Proj21 t) :: z], s)
  | _ -> raise (Failure "left of top") ]
and up (c, t) = match c with
  [ [] -> raise (Failure "up of top")
  | [(Proj21 s) :: z] -> (z, Node2 (t, s))
  | [(Proj22 s) :: z] -> (z, Node2 (s, t)) ]

Zippers, LL and Differentiation

This technology is generic, and the focus navigation operations may be programmed uniformly, as shown by Hinze, Jeuring and Löh as a polytypic function. See "Type-indexed data types", Mathematics of Program Construction, LNCS 2386 (2002).

Zippers are related to linear functions on structures, in the sense of linear logic. The focus operations are interaction operators of Y. Lafont.

Conor McBride, in "The Derivative of a Regular Type is its Type of One-Hole Contexts", showed that the zipper data type may be derived from the structure data type by formal partial differentiation. In other words, data structures are integrals of their creating contexts.
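As a worked instance, here is a sketch of the differentiation for the binary trees of the earlier example. The notation is an assumption made for illustration: T stands for tree2, F for its defining functor, and 2 for the two-element type.

```latex
\begin{aligned}
T &= \mu X.\,F(X), \qquad F(X) = 1 + X \times X \\
\partial_X F(X) &= 0 + (1 \times X + X \times 1) \;\cong\; 2 \times X \\
\mathrm{Context}(T) &\cong \mathrm{List}\bigl((\partial_X F)(T)\bigr) \;\cong\; \mathrm{List}(2 \times T)
\end{aligned}
```

The factor 2 records on which side the hole lies and the factor T is the sibling subtree, which is the shape of sum2 = [Proj21 of tree2 | Proj22 of tree2] with context2 = list sum2.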

Ordered trees

type tree = [ Tree of forest ]
and forest = list tree;

type tree_zipper =
  [ Top
  | Zip of (forest * tree_zipper * forest)
  ];

type focused_tree = (tree_zipper * tree);

A focused tree is a tree with a focus point of interest, i.e. a subtree and its stacked context.

Operations on focused trees

value down (z, t) = match t with
  [ Tree (forest) -> match forest with
      [ [] -> raise (Failure "down")
      | [hd :: tl] -> (Zip ([], z, tl), hd)
      ]
  ];

value up (z, t) = match z with
  [ Top -> raise (Failure "up")
  | Zip (l, u, r) -> (u, Tree (stack_on [t :: r] l)) ];

More operations on focused trees

value left (z, t) = match z with
  [ Top -> raise (Failure "left")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "left")
      | [elder :: elders] -> (Zip (elders, u, [t :: r]), elder) ] ];

value right (z, t) = match z with
  [ Top -> raise (Failure "right")
  | Zip (l, u, r) -> match r with
      [ [] -> raise (Failure "right")
      | [young :: rest] -> (Zip ([t :: l], u, rest), young) ] ];
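The navigation operations on focused trees can be exercised end to end. The following is a sketch in standard OCaml syntax rather than the revised syntax of the slides; the values leaf, sample and edited are assumptions introduced for the example.

```ocaml
type tree = Tree of tree list

type tree_zipper =
  | Top
  | Zip of tree list * tree_zipper * tree list

(* stack_on s l = List.rev l @ s *)
let rec stack_on s = function
  | [] -> s
  | h :: t -> stack_on (h :: s) t

(* Focus on the first child. *)
let down (z, t) =
  match t with
  | Tree [] -> failwith "down"
  | Tree (hd :: tl) -> (Zip ([], z, tl), hd)

(* Rebuild the parent: elder siblings are stored in reverse order. *)
let up (z, t) =
  match z with
  | Top -> failwith "up"
  | Zip (l, u, r) -> (u, Tree (stack_on (t :: r) l))

(* Focus on the nearest elder sibling. *)
let left (z, t) =
  match z with
  | Zip (elder :: elders, u, r) -> (Zip (elders, u, t :: r), elder)
  | _ -> failwith "left"

(* Focus on the nearest younger sibling. *)
let right (z, t) =
  match z with
  | Zip (l, u, young :: rest) -> (Zip (t :: l, u, rest), young)
  | _ -> failwith "right"

(* Replace the subtree in focus. *)
let replace (z, _) t = (z, t)

(* A root with three children: a leaf, a unary node, a leaf. *)
let leaf = Tree []
let sample = Tree [leaf; Tree [leaf]; leaf]

(* Walk to the second child, replace it by a leaf, and zip back up. *)
let edited =
  let focused = right (down (Top, sample)) in
  let (_, t) = up (replace focused leaf) in
  t
```

Going down, right, then left and up again rebuilds a tree equal to sample, which is a quick sanity check that the sibling lists are reversed and restored consistently.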

Applicative updating

value del_l (z, _) = match z with
  [ Top -> raise (Failure "del_l")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "del_l")
      | [elder :: elders] -> (Zip (elders, u, r), elder)
      ] ];

value replace (z, _) t = (z, t);

Points of view about focused structures

• Manipulation of focused data is local
• Redundant representation - efficiency
• The Interaction Combinators Paradigm

Remark. Zippers are linear contexts. They are superior to Ω-terms, notably because the approximation ordering is substructural.

Computational linguistics

We want to process (parse and generate) natural language sentences, dialogues, and corpora of various kinds (oral, written, news, books, web sites, etc). We assume that the data is already digitized and discretized as a stream of letters (phonemes for oral data, letters for written data).

A fundamental entity in this processing is the word. One traditionally distinguishes processing between streams of letters and words (morphology, lexical analysis) and processing between words and sentences (syntax, parsing).

Words

Words are represented as lists of positive integers.

type letter = int (* letters or phonemes *)
and word = list letter;

We provide coercions encode : string -> word and decode : word -> string. Here is lexicographic ordering.

value rec lexico l1 l2 = match l1 with
  [ [] -> True
  | [c1 :: r1] -> match l2 with
      [ [] -> False
      | [c2 :: r2] -> if c2 < c1 then False
                      else if c2 = c1 then lexico r1 r2
                      else True ] ];

Differential words

type delta = (int * word);

A differential word is a notation that lets us retrieve a word w from another word w′ sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.

We compute the difference between w and w′ as a differential word diff w w′ = (|w1|, w2) where w = p.w1 and w′ = p.w2, with maximal common prefix p.

The converse of diff : word -> word -> delta is patch : delta -> word -> word: w′ may be retrieved from w and d = diff w w′ as w′ = patch d w.
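A minimal sketch of diff and patch in standard OCaml, following the definitions above. The helpers take and encode are assumptions made for the example; in particular this encode simply uses character codes and is not the toolkit's Word.encode.

```ocaml
type word = int list
type delta = int * word

(* diff w w' = (|w1|, w2) where w = p @ w1 and w' = p @ w2, p maximal. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* The first n elements of a list (all of it if it is shorter). *)
let rec take n = function
  | x :: r when n > 0 -> x :: take (n - 1) r
  | _ -> []

(* patch (n, u) w: go up n letters from the end of w, then down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let keep = List.length w - n in
  if keep < 0 then failwith "patch" else take keep w @ u

(* Illustrative coercion: one letter per character code. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

let d = diff (encode "mice") (encode "mouse")
```

Here the maximal common prefix is "m", so d goes up three letters ("ice") and down along "ouse"; patch d (encode "mice") then gives back encode "mouse".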

Tries

Tries, or lexical trees, store sparse sets of words sharing initial prefixes. They are due to René de la Briandais (1959). We use a very simple representation with lists of siblings.

type trie = [ Trie of (bool * forest) ]
and forest = list (Word.letter * trie);

Tries are managed (search, insertion, etc) using the zipper technology.

Aside: ternary trees.
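To make the representation concrete, here is a sketch of membership and insertion on this trie type in standard OCaml. It uses plain structural recursion instead of the zipper-based management mentioned above, and the names empty, mem, enter, encode, make_lex and lex are assumptions for the example (encode uses character codes).

```ocaml
type letter = int
type word = letter list

type trie = Trie of bool * forest
and forest = (letter * trie) list

let empty = Trie (false, [])

(* mem w t: is the word w stored in the trie t? *)
let rec mem (w : word) (Trie (b, forest)) =
  match w with
  | [] -> b
  | c :: rest ->
    (match List.assoc_opt c forest with
     | Some t -> mem rest t
     | None -> false)

(* enter t w: the trie t extended with w; siblings are kept sorted. *)
let rec enter (Trie (b, forest)) (w : word) =
  match w with
  | [] -> Trie (true, forest)
  | c :: rest ->
    let rec ins = function
      | (c', t) :: others when c' < c -> (c', t) :: ins others
      | (c', t) :: others when c' = c -> (c', enter t rest) :: others
      | others -> (c, enter empty rest) :: others
    in
    Trie (b, ins forest)

(* Illustrative coercion: one letter per character code. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

let make_lex = List.fold_left (fun lex s -> enter lex (encode s)) empty

let lex = make_lex ["to"; "together"; "get"]
```

The Boolean at each node marks whether the path from the root spells a stored word, so "to" and "together" share their first two nodes while "tog" is a path but not a member.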

Important remarks

Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.

Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.

Lexicon

Here is a simplistic lexicon compiler make_lex : list string -> trie:

value make_lex = let enter1 lex c = Trie.enter lex (Word.encode c)
  in List.fold_left enter1 Trie.empty;

For instance, with english.lst storing a list of 173528 words, as a text file of size 2Mb, the command make_lex < english.lst > english.rem produces a trie representation as a file of 4.5Mb.

Tries share the words by their prefixes, but common suffixes account for a lot of redundancy in the structure. We shall eliminate this redundancy by sharing and get a minimal structure.

The Share Functor

module Share : functor (Algebra : sig type domain = 'a; value size : int; end) ->
  sig value share : Algebra.domain -> int -> Algebra.domain; end;

That is, Share takes as argument a module Algebra providing a type domain and an integer value size, and it defines a value share of the stated type. We assume that the elements from the domain are presented with an integer key bounded by Algebra.size. That is, share x k will assume as precondition that 0 ≤ k < Max with Max = Algebra.size.

We shall construct the sharing map with the help of a hash table, made up of buckets (k, [e1; e2; ... en]) where each element ei has key k.

Memoizing

type bucket = list Algebra.domain;

value memo = Array.create Algebra.size ([] : bucket);

We shall use a service function search, such that search e l returns the first y in l such that y = e, or else raises the exception Not_found.

value search e = List.find (fun x -> x = e);

The share function

value share element key =
  let bucket = memo.(key) in
  try search element bucket with
  [ Not_found -> do { memo.(key) := [element :: bucket]; element } ];

Sharing is just recalling!
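A sketch of the Share functor in standard OCaml syntax, using Array.make in place of the older Array.create. The client module S, its string domain and the key function are assumptions chosen only to show that a second, structurally equal value is mapped to the stored copy.

```ocaml
module Share (Algebra : sig type domain val size : int end) : sig
  val share : Algebra.domain -> int -> Algebra.domain
end = struct
  (* One bucket per key; a bucket lists the elements seen with that key. *)
  let memo : Algebra.domain list array = Array.make Algebra.size []

  (* Precondition: 0 <= key < Algebra.size. *)
  let share element key =
    let bucket = memo.(key) in
    match List.find_opt (fun x -> x = element) bucket with
    | Some shared -> shared   (* recall the stored copy *)
    | None ->
      memo.(key) <- element :: bucket;
      element
end

(* Hypothetical client: strings keyed by their length modulo the table size. *)
module S = Share (struct type domain = string let size = 16 end)

let key s = String.length s mod 16

(* Two separately allocated but equal strings. *)
let s1 = String.make 3 'x'
let s2 = String.make 3 'x'

let a = S.share s1 (key s1)
let b = S.share s2 (key s2)
```

After the two calls, a and b are the same physical value (a == b) even though s1 and s2 were allocated separately; the key is entirely the caller's responsibility, as the slides go on to stress.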

Compressing trees as dags

We may for instance instantiate Share on the algebra of trees, with a size hash_max depending on the application:

module Dag = Share (struct type domain = tree;
                           value size = hash_max; end);

And now we compress a trie into a minimal dag using share by a simple bottom-up traversal, where the key is computed along by hashing. For this we define a general bottom-up traversal function, which applies a parametric lookup function to every node and its associated key.

Dynamic programming

Bottom-up traversing with inductive hash-code computation.

value h1 key index sum = sum + index * key
and h0 = 1
and h forest = forest mod hash_max;

value traverse lookup = travel
  where rec travel = fun
  [ Tree forest ->
      let f (trees, index, span) t =
        let (t', k) = travel t in
        ([t' :: trees], index + 1, h1 k index span) in
      let (forest', _, span) = List.fold_left f ([], 1, h0) forest in
      let key = h span in (lookup (Tree (List.rev forest')) key, key) ];

Compressing a tree as a dag

Now, compressing a tree optimally as a minimal dag is simply effected by a sharing traversal:

value compress = traverse Dag.share;

value minimize tree = let (dag, _) = compress tree in dag;
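The whole compression pipeline fits in a short standard OCaml sketch. To stay self-contained it inlines the associative memory as a plain share function instead of instantiating a functor, and the table size 9689 is an arbitrary assumption; the hash functions h0, h1 and h are the ones from the traversal slide.

```ocaml
type tree = Tree of tree list

(* Assumed table size; the slides leave it to the application. *)
let hash_max = 9689

(* Associative memory specialised to trees. *)
let memo : tree list array = Array.make hash_max []

let share element key =
  let bucket = memo.(key) in
  match List.find_opt (fun x -> x = element) bucket with
  | Some shared -> shared
  | None ->
    memo.(key) <- element :: bucket;
    element

(* Inductive hash code: h0 seeds a node, h1 folds in the key of the
   index-th child, h reduces the sum to a table index. *)
let h0 = 1
let h1 key index sum = sum + index * key
let h span = span mod hash_max

(* Bottom-up traversal applying lookup to every rebuilt node and its key. *)
let traverse lookup =
  let rec travel (Tree forest) =
    let f (trees, index, span) t =
      let (t', k) = travel t in
      (t' :: trees, index + 1, h1 k index span)
    in
    let (forest', _, span) = List.fold_left f ([], 1, h0) forest in
    let key = h span in
    (lookup (Tree (List.rev forest')) key, key)
  in
  travel

let minimize tree = fst (traverse share tree)

(* Two structurally equal subtrees under one root. *)
let mk () = Tree [Tree []; Tree [Tree []]]
let big = Tree [mk (); mk ()]
let dag = minimize big
```

The result dag is structurally equal to big, but its two children are one and the same physical node, which is what turns the tree into a dag. Because children are shared before their parent is looked up, each node is hashed once from keys already computed.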

Advantages and extensions

Hashing keys and size is on the client side: we do not delegate hashing to Share, which is just an associative memory. This has two advantages:

• The computation is fully linear
• It is adapted to the statistics of the data

Extension: Auto-sharing types (controlled hash-consing). Suggests a monad of shared hashed structures accommodating entropy of the data.

Dagified lexicons

We may dagify a lexicon a posteriori in one pass:

value rec dagify () =
  let lexicon = (input_value stdin : Trie.trie)
  in let dag = Mini.minimize lexicon in output_value stdout dag;

And now if we apply this technique to our English lexicon, with command dagify < english.rem > small.rem, we now get an optimal representation which only needs 1Mb of storage, half of the original ASCII string representation.

Advertisement

The recursive algorithms given so far are fairly straightforward. They are easy to debug, maintain and modify due to the strong typing safeguard of ML, and even easy to formally certify. They are nonetheless efficient enough for production use, thanks to the optimizing native-code compiler of Objective Caml.

In our Sanskrit application, the trie of 11500 entries is shrunk from 219Kb to 103Kb in 0.1s, whereas the trie of 120000 flexed forms is shrunk from 1.63Mb to 140Kb in 0.5s on a 864MHz PC. Our trie of 173528 English words is shrunk from 4.5Mb to 1Mb in 2.7s. Measurements showed that the time complexity is linear in the size of the lexicon (within comparable sets of words).

Variations

Many variations on tries exist. Optimisations of lexical analysers for programming languages are described in the Dragon book. But the dragon book of computational linguistics has not been written yet.

Variation with ternary trees. Ternary trees are inspired by Bentley and Sedgewick. Ternary trees are more complex than tries, but use slightly less storage. Access is potentially faster in balanced trees than in tries. A good methodology seems to be to use tries for editing, and to translate them to balanced ternary trees for production use with a fixed lexicon.

The ternary version of our English lexicon takes 3.6Mb, a saving of 20% over its trie version using 4.5Mb. After dag minimization, it takes 1Mb, a saving of 10% over the trie dag version using 1.1Mb. For our Sanskrit lexicon index, the trie takes 221Kb and the tertree 180Kb. Shared as dags the trie takes 103Kb and the tertree 96Kb.

Decos, Lexmaps, Autos

We understand the Trie structure of a set of Words as a special case of a finitely based mapping Deco = Word → Annotation in the case of Boolean annotations shared by prefix arguments (and by common subexpressions when shared).

We store morphology constructions as being of this type, and we investigate the reverse mapping by generalising them to relations, typically inductively defined through finite state machines.

The more sharing we get, the better we optimise this data layout. It is thus of paramount importance that the annotations be local quasi-morphism decorations.

Decos

type deco 'a = [ Deco of (list 'a * dforest 'a) ]
and dforest 'a = list (Word.letter * deco 'a);

We think of the decoration of type list 'a as information associated with the word stored at that node.

We can easily generalize sharing to decorated tries. However, substantial savings will result only if the information at a given node is a function of the subtrie at that node, i.e. if such information is defined as a trie morphism.

Definition. A deco is a tree morphism if the information at every node is a function of the corresponding sub-tree. Such decos preserve the sharing of the trees they decorate.

Encoding morphological parameters as decorations

We thus profit from the regularity of morphological transformations to have terse representations of the lexicon decorated by grammatical information. Thus if all plurals are obtained by adding 's' to the singular stem except for a few exceptions, we do not pay any cost in encoding this plural information as an explicit instruction [pl: suffix s] decorating the stems, since it will not create any new node except for the few exceptions. This is as opposed to listing explicitly the plural form, which would undo all sharing.

In our Sanskrit implementation, the various genders associated with a noun stem are defined in a deco used for producing the flexed forms. The flexed forms are then generated using an ad-hoc internal sandhi algorithm, difficult to encode as a finite-state process, and thus difficult to invert.

Explicit morphology vs implicit morphology

By explicit morphology I mean listing explicitly the forms generated by morphology operations from root stems, prefixes and suffixes.

By implicit morphology I mean just having programs which will generate these flexed forms on demand.

Implicit morphology is not enough to recognize the segments of sentences identical with a flexed form: the morphological functions must be invertible.

Compromise

On the other hand, the delimitation between implicit and explicit is blurred since e.g. a finite-state machine state graph may be considered both a program and a piece of data; for instance, a trie stores words, but actually the words are "recognized as being in the lexicon" by "running the lexicon over them as input data".

Thus we shall represent "explicitly" flexed forms and the information on how they are derived from root stems as a trie bearing as decorations instructions on how to "undo morphology" locally. For this purpose, we shall use the notion of differential word above. We may now store inverse maps of lexical relations (such as morphology derivations) using the Lexmap structure.

This way we bypass the (hard) problem of axiomatising internal sandhi as a finite-state machine.

Lexmaps

type inverse 'a = (Word.delta * 'a)
and inverse_map 'a = list (inverse 'a);

type lexmap 'a = [ Map of (inverse_map 'a * mforest 'a) ]
and mforest 'a = list (Word.letter * lexmap 'a);

Typically, if word w is stored at a node Map ([...; (d, r); ...], ...), this represents the fact that w is the image by relation r of w′ = patch d w. Such a lexmap is thus a representation of the image by r of a source lexicon. This representation is invertible, while preserving maximally the sharing of final substrings, and thus being amenable to sharing.

Example: cats and dogs sharing their 's' node while implicitly referring to their respective singular stem.
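The cats and dogs example can be checked with a small standard OCaml sketch that computes the decoration a lexmap would store at the node of each flexed form. The function inverse_of, the string tag "pl" and the character-code encode are assumptions made for illustration.

```ocaml
type word = int list
type delta = int * word
type 'a inverse = delta * 'a

(* Illustrative coercion: one letter per character code. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

(* diff w w' = (|w1|, w2) where w = p @ w1 and w' = p @ w2, p maximal. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* Decoration stored at the node of a flexed form: how to get back to
   its stem, and by which relation (here a plain string tag). *)
let inverse_of ~form ~stem (tag : string) : string inverse =
  (diff (encode form) (encode stem), tag)

let cats = inverse_of ~form:"cats" ~stem:"cat" "pl"
let dogs = inverse_of ~form:"dogs" ~stem:"dog" "pl"
let mice = inverse_of ~form:"mice" ~stem:"mouse" "pl"
```

Both regular plurals get the same decoration ((1, []), "pl"), meaning "go up one letter", so their final 's' nodes are equal and can be shared. The irregular mice gets ((3, encode "ouse"), "pl") and creates a node of its own.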

Uniformity

We remark that our differential words may be seen as bytecode for zipper operations: the integer part iterates going up, while the word part tells how to go down, the whole thing being the code for navigating in the structure along the shortest path from one node to the other, through their closest common ancestor. This shows in a nutshell that the various techniques we are exhibiting are very complementary.

Lexicon repositories using tries and decos

In a typical computational linguistics application, grammatical information (part of speech role, gender/number for substantives, valency and other subcategorization information for verbs, etc) may be stored as decoration of the lexicon of roots/stems. From such a decorated trie a morphological processor may compute the lexmap of all flexed forms, decorated with their derivation information encoded as an inverse map. This structure may itself be used by a tagging processor to construct the linear representation of a sentence decorated by feature structures. Such a representation will support further processing, such as computing syntactic and functional structures, typically as solutions of constraint satisfaction problems.

Example: Sanskrit

The main component in our tools is a structured lexical database. From this database, various hypertext documents may be produced mechanically. The index CGI engine searches for words by navigating in a persistent trie index of stem entries. The current database comprises 12000 items, and its index has a size of 103KB.

When computing this index, another persistent structure is created. It records in a deco all the genders associated with a noun entry. At present, this deco records genders for 5700 nouns, and it has a size of 268KB.

We iterate on this genders structure a grammatical engine, which generates declined forms. This lexmap records about 120000 such flexed forms with associated grammatical information, and it has a size of 341KB. A companion trie, without the information, keeps the index of flexed words as a minimized structure of 140KB.

Finite State Lore

Computational phonology and morphology use finite state technology extensively: rational languages and relations, transducers, bimachines, etc.

• Schützenberger
• Koskenniemi
• Kaplan and Kay

Finite state toolsets have been developed, where word transformations are systematically compiled in a low-level algebra of finite-state machine operators. Such toolsets have been developed at Xerox, Paris VII, Bell Labs, Mitsubishi Labs, etc. Compiling complex rewrite rules into rational transducers may be subtle. We depart from this fine-grained methodology and propose more direct translations preserving the structure of the lexicon.

Finite State Machines as Lexicon Morphisms

We start with the remark that a lexicon represented as a trie is directly the state space representation of the (deterministic) finite state machine that recognizes its words, and that its minimization consists exactly in sharing the lexical tree as a dag. We are in a case where the state graph of such finite language recognizers is an acyclic structure. Such a pure data structure may be easily built without mutable references, which has computational and robustness advantages.

In the same spirit, we define automata which implement non-trivial rational relations (and their inversion) and whose state structure is nonetheless a more or less direct decoration of the lexicon trie. The crucial notion is that the state structure is a lexicon morphism.

Unglueing

We start with a toy problem which is the simplest case of juncture analysis, namely when there are no non-trivial juncture rules, and segmentation consists just in retrieving the words of a sentence glued together in one long string of characters (or phonemes). Consider for instance written English. You have a text file consisting of a sequence of words separated with blanks, and you have a lexicon complete for this text (for instance, 'spell' has been successfully applied). Now, suppose you make some editing mistake, which removes all spaces, and the task is to undo this operation to restore the original.

The transducer is defined as a functor, taking the lexicon trie structure as parameter.

Unglue

module Unglue (Lexicon : sig value lexicon : Trie.trie; end) = struct

type input = Word.word (* input sentence as a word *)
and output = list Word.word; (* output is sequence of words *)

type backtrack = (input * output)
and resumption = list backtrack; (* coroutine resumptions *)

exception Finished;

We define our unglueing reactive engine as a recursive process which navigates directly on the (flexed) lexicon trie (typically the compressed trie resulting from the Dag module considered above).

The reactive engine

The reactive engine takes as arguments the (remaining) input, the (partially constructed) list of words returned as output, a backtrack stack whose items are (input, output) pairs, the path occ in the state graph stacking (the reverse of) the current common prefix of the candidate words, and finally the current trie node as its current state. When the state is accepting, we push it on the backtrack stack, because we want to favor possible longer words, and so we continue reading the input until either we exhaust the input, or the next input character is inconsistent with the lexicon data.

The reactive engine code

value rec react input output back occ = fun
  [ Trie (b, forest) ->
      if b then
        let pushout = [occ :: output] in
        if input = [] then (pushout, back) (* solution found *)
        else let pushback = [(input, pushout) :: back] in
             continue pushback
      else continue back
      where continue cont = match input with
        [ [] -> backtrack cont
        | [letter :: rest] ->
            try let next_state = List.assoc letter forest in
                react rest output cont [letter :: occ] next_state
            with [ Not_found -> backtrack cont ] ] ]

Backtrack

and backtrack = fun
  [ [] -> raise Finished
  | [(input, output) :: back] ->
      react input output back [] Lexicon.lexicon
  ];

Now, unglueing a sentence is just calling the reactive engine from the appropriate initial backtrack situation.

value unglue sentence = backtrack [(sentence, [])];
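The engine and its resumptions can be run as a self-contained standard OCaml sketch. Several choices here are assumptions made to keep it short: the lexicon is passed as an ordinary argument instead of a functor parameter, encode and decode use character codes, the trie is built by a plain recursive enter, and unglue_all returns the list of all segmentations instead of printing them.

```ocaml
type word = int list
type trie = Trie of bool * (int * trie) list

exception Finished

let encode s = List.init (String.length s) (fun i -> Char.code s.[i])
let decode w = String.concat "" (List.map (fun c -> String.make 1 (Char.chr c)) w)

let empty = Trie (false, [])

(* Insert a word in the trie. *)
let rec enter (Trie (b, forest)) = function
  | [] -> Trie (true, forest)
  | c :: rest ->
    let sub = try List.assoc c forest with Not_found -> empty in
    Trie (b, (c, enter sub rest) :: List.remove_assoc c forest)

let make_lex = List.fold_left (fun lex s -> enter lex (encode s)) empty

(* occ stacks the letters read since the last word boundary, reversed. *)
let rec react lexicon input output back occ (Trie (b, forest)) =
  let continue cont =
    match input with
    | [] -> backtrack lexicon cont
    | letter :: rest ->
      (match List.assoc_opt letter forest with
       | Some next_state ->
         react lexicon rest output cont (letter :: occ) next_state
       | None -> backtrack lexicon cont)
  in
  if b then
    let pushout = occ :: output in
    if input = [] then (pushout, back) (* solution found *)
    else continue ((input, pushout) :: back) (* try longer words first *)
  else continue back

and backtrack lexicon = function
  | [] -> raise Finished
  | (input, output) :: back -> react lexicon input output back [] lexicon

(* All segmentations, in the order the engine finds them. *)
let unglue_all lexicon sentence =
  let rec collect cont acc =
    match backtrack lexicon cont with
    | (output, resumption) ->
      let words = List.rev_map (fun occ -> decode (List.rev occ)) output in
      collect resumption (words :: acc)
    | exception Finished -> List.rev acc
  in
  collect [(encode sentence, [])] []

let short = make_lex ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]
let solutions = unglue_all short "amiabletogether"
```

Each solution is returned together with the remaining backtrack stack, so asking for the next one is just calling backtrack again on that stack. No state is mutated between solutions, and the longest-word preference shows in the order of the results.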

Non-deterministic programming

Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG black box?

The three golden rules of non-deterministic programming:

• Identify your search state space well
• Represent states as non-mutable data
• Prove termination

The last point is essential for understanding the granularity and enforcing completeness.

Remark. Multiset ordering is an elegant method for proving termination of non-deterministic programs, independently of the sequential strategy of the generation of the solutions.

More on state space considerations

This non-deterministic process (recognizing L∗) uses the same state space as the lexicon/trie (recognizing L).

This corresponds to the fact that an automaton for L∗ may be obtained from the automaton for L by inserting ε-moves from accepting nodes to the initial node. But such transitions may be kept completely implicit. All you have to do is to manage the necessary non-determinism (continuing in L, which is not in general a prefix language, i.e. it may happen that both w and w·s are in L, versus iterating) in the backtrack stack, but you do not have to modify the state space data structure at all. It is just a shift in point of view concerning this data.

Still more on state space considerations

Remember that dagified tries define the minimal automaton of a finite language L.

But it is not the case that this automaton, completed with ε-transitions, is minimal for L∗. Consider for instance L = {a, aa}.

However, note that we are using it as a transducer computing justifications for a word in L∗ to be a concatenation of precise words of L, and the minimal automaton does not keep enough information for that: distinct segmentations of a sentence must be separated.

Childtalk

module Childtalk = struct
  value lexicon = Lexicon.make_lex ["boudin"; "caca"; "pipi"];
end;

module Childish = Unglue (Childtalk);

let (sol, _) = Childish.unglue (Word.encode "pipicacaboudin")
in Childish.print_out sol;

We recover as expected: pipi caca boudin.

Generating several solutions

We resume a resumption with resume : (resumption -> int -> resumption).

value resume cont n =
  let (output, resumption) = backtrack cont in
  do { print_string "\n Solution "; print_int n; print_string " :\n";
       print_out output; resumption };

value unglue_all sentence = restore [(sentence, [])] 1
  where rec restore cont n =
    try let resumption = resume cont n in
        restore resumption (n + 1)
    with [ Finished -> if n = 1 then print_string " No solution found\n" else () ];

Solving a charade

module Short = struct
  value lexicon = Lexicon.make_lex
    ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;

module Charade = Unglue (Short);

Charade.unglue_all (Word.encode "amiabletogether");

Solution 1 : amiable together
Solution 2 : amiable to get her
Solution 3 : am i able together
Solution 4 : am i able to get her

Juncture euphony and its discretization

When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:

[x] u | v → w

This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.
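Read in the generating direction, a rule u | v → w rewrites the end of one word and the start of the next. The following standard OCaml sketch applies such rules to transliterated strings. It is a simplification under stated assumptions: rules are plain string records, the optional left context [x] is ignored, and the first matching rule is applied. The two rules are the ones that appear in the segmentation examples of this talk.

```ocaml
(* A juncture rule u | v -> w on transliterated strings. *)
type rule = { u : string; v : string; w : string }

let has_suffix s suf =
  let n = String.length s and k = String.length suf in
  n >= k && String.sub s (n - k) k = suf

let has_prefix s pre =
  let n = String.length s and k = String.length pre in
  n >= k && String.sub s 0 k = pre

(* Glue two words, rewriting u|v into w when a rule applies;
   plain concatenation otherwise (no sandhi). *)
let join rules left right =
  let applies r = has_suffix left r.u && has_prefix right r.v in
  match List.find_opt applies rules with
  | Some r ->
    let nl = String.length left - String.length r.u in
    let nv = String.length r.v in
    String.sub left 0 nl ^ r.w ^ String.sub right nv (String.length right - nv)
  | None -> left ^ right

let rules =
  [ { u = "m"; v = "p"; w = ".mp" };
    { u = "d"; v = "\"s"; w = "cch" } ]

let chunk1 = join rules "tad" "\"srutvaa"
let chunk2 = join rules (join rules "sugandhim" "pu.s.ti") "vardhanam"
```

Segmentation is the inverse problem: given chunk1 or chunk2, find the words and the rules that could have produced it, which is why the rules must be indexed for prediction in the transducer.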

[Two figures omitted; surviving labels: u, v, w, x, z.]

Auto

type lexicon = trie
and rule = (word * word * word);

The rule triple (rev u, v, w) represents the string rewrite u | v → w. Now for the transducer state space:

type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;

module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);

Compiling the lexicon to a minimal transducer

(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
      let local_stack = if b then get_sandhi occ else [] in
      let f (deter, stack, span) (n, t) =
        let current = [n :: occ] (* current occurrence *) in
        let (auto, st, k) = build_auto current t in
        ([(n, auto) :: deter], merge st stack, hash1 n k span) in
      let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
      let (h, l) = match stack with
        [ [] -> ([], []) | [h :: l] -> (h, l) ] in
      let key = hash b span h in
      let s = Auto.share (State (b, deter, h)) key in
      (s, merge local_stack l, key) ];

Segmenting Transducer Data Structures

type transition =
  [ Euphony of rule (* (rev u, v, w) s.t. u | v -> w *)
  | Id (* identity or no sandhi *)
  ]
and output = list (word * transition);

type backtrack =
  [ Next of (input * output * word * choices)
  | Init of (input * output)
  ]
and resumption = list backtrack; (* coroutine resumptions *)

exception Finished;

Running the Segmenting Transducer

value rec react input output back occ = fun
  [ State (b, det, choices) ->
      (* we try the deterministic space first *)
      let deter cont = match input with
        [ [] -> backtrack cont
        | [letter :: rest] ->
            try let next_state = List.assoc letter det in
                react rest output cont [letter :: occ] next_state
            with [ Not_found -> backtrack cont ] ] in
      let nondets = if choices = [] then back
                    else [Next (input, output, occ, choices) :: back] in
      if b then
        let out = [(occ, Id) :: output] (* opt final sandhi *)
        in if input = [] then (out, nondets) (* solution *)
           else let alterns = [Init (input, out) :: nondets]
                (* we first try the longest matching word *)
                in deter alterns
      else deter nondets ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [((u, v, w) as rule) :: others] ->
      let alterns = [Next (input, output, occ, others) :: back] in
      if prefix w input then
        let tape = advance (length w) input
        and out = [(u @ occ, Euphony (rule)) :: output] in
        if v = [] (* final sandhi *) then
          if tape = [] then (out, alterns) else backtrack alterns
        else let next_state = access v in
             react tape out alterns v next_state
      else backtrack alterns
  ]
and backtrack = fun
  [ [] -> raise Finished
  | [resume :: back] -> match resume with
      [ Next (input, output, occ, choices) ->
          choose input output back occ choices
      | Init (input, output) ->
          react input output back [] automaton ]
  ];

Example of Sanskrit Segmentation

process "tacchrutvaa";

Chunk: tacchrutvaa
may be segmented as:
Solution 1 :
[ tad with sandhi d|"s -> cch]
[ "srutvaa with no sandhi]

More examples

process "o.mnama.h\"sivaaya";

Solution 1 :
[ om with sandhi m|n -> .mn]
[ namas with sandhi s|"s -> .h"s]
[ "sivaaya with no sandhi]

process "sugandhi.mpu.s.tivardhanam";

Solution 1 :
[ sugandhim with sandhi m|p -> .mp]
[ pu.s.ti with no sandhi]
[ vardhanam with no sandhi]

Sanskrit Tagging

process "sugandhi.mpu.s.tivardhanam";

Solution 1 :
[ sugandhim
  < { acc. sg. m. }[sugandhi] > with sandhi m|p -> .mp]
[ pu.s.ti < { iic. }[pu.s.ti] > with no sandhi]
[ vardhanam
  < { acc. sg. m. | acc. sg. n. | nom. sg. n.
    | voc. sg. n. }[vardhana] > with no sandhi]

Statistics

The complete automaton construction from the flexed forms lexicon takes only 9s on a 864MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6MB!

The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.

Soundness and Completeness of the Algorithms

Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment_all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.

Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.

Cf. http://pauillac.inria.fr/~huet/PUBLIC/tagger.ps

A note on termination

Termination is proved by multiset ordering on resumptions.

This allows us to state the algorithm as a non-deterministic algorithm, allowing any strategy for priority of lexicon search versus euphony prediction, as well as arbitrary selection of resumptions when backtracking.

This is important, since it leaves all freedom for implementing arbitrary priority policies learned by corpus training.

Enjoy!

• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/
• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/FREE/tagger.ps
• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps
• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps
• Tutorial slides: http://pauillac.inria.fr/~huet/ZEN/Hyderabad.ps
• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar
• Objective Caml: http://caml.inria.fr/ocaml/


Automata mista - AuM


The automaton structure

type input = word;

type delta = (int * word)
and address = [ Global of delta | Local of delta ];

type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list (input * address);

type automaton = (array auto * delta);

type backtrack = (input * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)

Double completeness

Every non-deterministic automaton (possibly with ε-moves) can be represented by a flat mixed automaton (one whose deterministic structures are empty).

Every deterministic automaton can be represented by a mixed automaton whose only choice parts, State (b, [], [([], address)]), give rise to no backtracking.

Every mixed automaton has a minimal representation, obtained by sharing as a dag. Sharing local virtual addresses does not necessarily correspond to automata that are equivalent by bisimulation.

The transducer structure

type input = word and output = word;

type delta = (int * word)
and address = [ Global of delta | Local of delta ];

type trans = [ State of (bool * deter * choices) ]
and deter = list (letter * trans)
and choices = list (input * output * address);

type transducer = (array trans * delta);

type backtrack = (input * output * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)

[Figure omitted: AuM diagram; surviving labels: forest, stack, a1 ... an, current dag, word, dag.]

Memoizing the current access

The access stack [sn; sn-1; ... s0] to the current state is needed to interpret local virtual addresses. It may be advantageous to also keep the current access word mot = [an; ... a1], popped and pushed as local accesses proceed. One can thus distinguish two output constructors: Absolu of word and Relatif of delta. In the latter case, the output is computed by patch from mot.

Applications:

• Dictionary of flexed words used as a lemmatizer (regular plural: δ = (1, ['s']))
• Unglue (δ = (0, []))
• Segment (δ = (0, u))

Conclusion

Mixed automata offer an efficient solution to many morpho-syntactic problems. The deterministic structure covering the state space is the lexicon, which is thus naturally placed at the heart of the computational processing of language.
