From Zen to Aum
Gérard Huet
LIMSI, Orsay, 16 March 2004

The Zen toolkit - Generic technology
A few specific applicative techniques:
• Local processing of focused data
• Sharing
• Lexical trees
• Finite transducers as lexicon morphisms
• Search by resumption coroutines
• Multiset ordering convergence

Basics: lists vs stacks
value l5 = [1; 2; 3; 4; 5];
value s5 = [5; 4; 3; 2; 1];
value rec stack_on s = fun [ [] -> s | [h :: t] -> stack_on [h :: s] t ];
(* stack_on s l = unstack l s = (rev l) @ s *)
value rev l = stack_on [] l;
value state3 = ([3; 2; 1], [4; 5]);

Focus
In state3, we focus on the fourth element of l5 by looking at the sublist [4; 5] in a context given by the stack [3; 2; 1]:
... 1; 2; 3] * [4; 5 ...
This is the way a Turing machine navigates, and computes. In two dimensions, we get an editing context of lines and characters in a TEXT file represented in an Emacs buffer focused at the current mark.
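
The focus moves described above can be sketched as follows. This is a minimal illustration in standard OCaml syntax (the slides use the revised syntax), not code from the Zen library: the names move_right, move_left and unfocus are assumptions introduced here.

```ocaml
(* stack_on s l pushes the elements of l one by one onto the stack s. *)
let rec stack_on s = function
  | [] -> s
  | h :: t -> stack_on (h :: s) t

(* A focused list is a pair (reversed left context, remaining sublist). *)

(* Move the focus one element to the right (hypothetical helper). *)
let move_right (ctx, rest) = match rest with
  | [] -> failwith "move_right"
  | h :: t -> (h :: ctx, t)

(* Move the focus one element to the left (hypothetical helper). *)
let move_left (ctx, rest) = match ctx with
  | [] -> failwith "move_left"
  | h :: t -> (t, h :: rest)

(* Forget the focus and rebuild the plain list (hypothetical helper). *)
let unfocus (ctx, rest) = stack_on rest ctx

let state0 = ([], [1; 2; 3; 4; 5])
let state3 = move_right (move_right (move_right state0))
```

Each move only touches the heads of the two lists, which is what makes the processing of focused data local.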

Zippers
Zippers give a generic view to the representation of a focused structure by a pair (context, substructure).
First presentation at FLoC'96. Published as: G. Huet. The Zipper. J. Functional Programming 7,5 (1997), 549-554.
Large scale implementations in syntax editors within computational linguistics platforms:
• G. Huet. Lexical morphisms with the Zen toolkit.
• A. Ranta. Grammatical Frameworks.

Example: binary trees
type tree2 = [ Leaf2 | Node2 of (tree2 * tree2) ];
type sum2 = [ Proj21 of tree2 | Proj22 of tree2 ]
and context2 = list sum2
and focus2 = (context2 * tree2);
value left (c, t) = match c with
  [ [ (Proj22 s) :: z ] -> ([ (Proj21 t) :: z ], s)
  | _ -> raise (Failure "left of top") ]
and up (c, t) = match c with
  [ [] -> raise (Failure "up of top")
  | [ (Proj21 s) :: z ] -> (z, Node2 (t, s))
  | [ (Proj22 s) :: z ] -> (z, Node2 (s, t)) ];

Zippers, LL and Differentiation
This technology is generic, and the focus navigation operations may be programmed uniformly, as shown by Hinze, Jeuring and Löh as a polytypic function. See "Type-indexed data types", Mathematics of Program Construction, LNCS 2386 (2002).
Zippers are related to linear functions on structures, in the sense of linear logic. The focus operations are interaction operators of Y. Lafont.
Conor McBride, in "The Derivative of a Regular Type is its Type of One-Hole Contexts", showed that the zipper data type may be derived from the structure data type by formal partial differentiation. In other words, data structures are integrals of their creating contexts.
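
As a worked instance of the differentiation view, take the binary trees of the earlier example. The notation F, T and List is introduced here for illustration only: T stands for tree2, seen as a fixpoint of the functor F.

```latex
\begin{align*}
T &\cong F(T), & F(X) &= 1 + X \times X \\
\partial_X F(X) &= X + X \;\cong\; 2 \times X \\
\mathit{context2} &\cong \mathrm{List}\,\bigl(\partial_X F(T)\bigr) = \mathrm{List}\,(T + T)
\end{align*}
```

The two summands of T + T correspond to the constructors Proj21 and Proj22 of sum2, each carrying the sibling subtree that is not in focus.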

Ordered trees
type tree = [ Tree of forest ]
and forest = list tree;
type tree_zipper =
  [ Top
  | Zip of (forest * tree_zipper * forest)
  ];
type focused_tree = (tree_zipper * tree);
A focused tree is a tree with a focus point of interest, i.e. a subtree and its stacked context.

Operations on focused trees
value down (z, t) = match t with
  [ Tree (forest) -> match forest with
      [ [] -> raise (Failure "down")
      | [ hd :: tl ] -> (Zip ([], z, tl), hd)
      ]
  ];
value up (z, t) = match z with
  [ Top -> raise (Failure "up")
  | Zip (l, u, r) -> (u, Tree (stack_on [ t :: r ] l)) ];

More operations on focused trees
value left (z, t) = match z with
  [ Top -> raise (Failure "left")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "left")
      | [ elder :: elders ] -> (Zip (elders, u, [ t :: r ]), elder)
      ]
  ];
value right (z, t) = match z with
  [ Top -> raise (Failure "right")
  | Zip (l, u, r) -> match r with
      [ [] -> raise (Failure "right")
      | [ young :: rest ] -> (Zip ([ t :: l ], u, rest), young)
      ]
  ];

Applicative updating
value del_l (z, _) = match z with
  [ Top -> raise (Failure "del_l")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "del_l")
      | [ elder :: elders ] -> (Zip (elders, u, r), elder)
      ]
  ];
value replace (z, _) t = (z, t);
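
To see the navigation and update operations work together, here is a small sketch transcribed into standard OCaml syntax. The helper top (zip back up to the root) and the sample trees t0 and t1 are assumptions added for the illustration.

```ocaml
type tree = Tree of tree list

type tree_zipper =
  | Top
  (* elder siblings (reversed), context of the parent, younger siblings *)
  | Zip of tree list * tree_zipper * tree list

let down (z, Tree forest) = match forest with
  | [] -> failwith "down"
  | hd :: tl -> (Zip ([], z, tl), hd)

let up (z, t) = match z with
  | Top -> failwith "up"
  | Zip (l, u, r) -> (u, Tree (List.rev_append l (t :: r)))

let right (z, t) = match z with
  | Zip (l, u, young :: rest) -> (Zip (t :: l, u, rest), young)
  | _ -> failwith "right"

let replace (z, _) t = (z, t)

(* Hypothetical helper: climb back to the root and return the whole tree. *)
let rec top (z, t) = match z with
  | Top -> t
  | _ -> top (up (z, t))

let leaf = Tree []
let t0 = Tree [leaf; Tree [leaf]; leaf]

(* Replace the second child of the root by a leaf. *)
let t1 = top (replace (right (down (Top, t0))) leaf)
```

The update is applicative: t1 is a new tree and t0 is left unchanged.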

Points of view about focused structures
• Manipulation of focused data is local
• Redundant representation - efficiency
• The Interaction Combinators Paradigm
Remark. Zippers are linear contexts. They are superior to Ω-terms, notably because the approximation ordering is substructural.

Computational linguistics
We want to process (parse and generate) natural language sentences, dialogues, and corpora of various kinds (oral, written, news, books, web sites, etc). We assume that the data is already digitized and discretized as a stream of letters (phonemes for oral data, letters for written data).
A fundamental entity in this processing is the word. One traditionally distinguishes processing between streams of letters and words (morphology, lexical analysis) and processing between words and sentences (syntax, parsing).

Words
Words are represented as lists of positive integers.
type letter = int (* letters or phonemes *)
and word = list letter;
We provide coercions encode : string -> word and decode : word -> string. Here is lexicographic ordering.
value rec lexico l1 l2 = match l1 with
  [ [] -> True
  | [ c1 :: r1 ] -> match l2 with
      [ [] -> False
      | [ c2 :: r2 ] -> if c2 < c1 then False
                        else if c2 = c1 then lexico r1 r2
                        else True
      ]
  ];

Differential words
type delta = (int * word);
A differential word is a notation permitting to retrieve a word w from another word w' sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.
We compute the difference between w and w' as a differential word diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, with maximal common prefix p.
The converse of diff : word -> word -> delta is patch : delta -> word -> word: w' may be retrieved from w and d = diff w w' as w' = patch d w.
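
The two functions can be sketched directly from these definitions. The code below is an illustration in standard OCaml syntax, written from the specification above rather than copied from the Zen library; the local helper drop is an assumption.

```ocaml
type word = int list
type delta = int * word

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, p maximal. *)
let rec diff (w : word) (w' : word) : delta = match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'   (* skip the common prefix *)
  | _ -> (List.length w, w')

(* patch (n, u) w: go up n times (drop the last n letters of w),
   then go down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop n l =
    if n = 0 then l
    else match l with
      | [] -> failwith "patch"
      | _ :: t -> drop (n - 1) t in
  List.rev_append (drop n (List.rev w)) u
```

For instance, with w = [1; 2; 3] and w' = [1; 2; 4; 5], the common prefix is [1; 2], so diff w w' is (1, [4; 5]) and patching w with it gives back w'.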

Tries
Tries, or lexical trees, store sparse sets of words sharing initial prefixes. They are due to René de la Briandais (1959). We use a very simple representation with lists of siblings.
type trie = [ Trie of (bool * forest) ]
and forest = list (Word.letter * trie);
Tries are managed (search, insertion, etc) using the zipper technology.
Aside: ternary trees.
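
A minimal sketch of search and insertion on this representation, in standard OCaml syntax. It uses plain structural recursion instead of the zipper-based management mentioned above, so it illustrates the data structure only; the sorted-siblings convention and the helper names are assumptions.

```ocaml
type letter = int
type word = letter list
type trie = Trie of bool * (letter * trie) list

let empty = Trie (false, [])

(* Membership: follow the letters down the tree. *)
let rec mem (w : word) (Trie (b, forest)) = match w with
  | [] -> b
  | c :: r ->
    (match List.assoc_opt c forest with
     | Some t -> mem r t
     | None -> false)

(* Applicative insertion, keeping siblings sorted by letter. *)
let rec enter (Trie (b, forest)) (w : word) = match w with
  | [] -> Trie (true, forest)
  | c :: r ->
    let rec ins = function
      | (c', t) :: rest when c' < c -> (c', t) :: ins rest
      | (c', t) :: rest when c' = c -> (c', enter t r) :: rest
      | rest -> (c, enter empty r) :: rest in
    Trie (b, ins forest)

(* Hypothetical encoding of strings as words, via character codes. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

let make_lex words =
  List.fold_left (fun lex s -> enter lex (encode s)) empty words

let lex = make_lex ["cat"; "cats"; "dog"]
```

Because siblings are kept sorted, the resulting trie does not depend on the order in which the words are entered.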

Important remarks
Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.
Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.

Lexicon
Here is a simplistic lexicon compiler make_lex : list string -> trie:
value make_lex =
  let enter1 lex c = Trie.enter lex (Word.encode c)
  in List.fold_left enter1 Trie.empty;
For instance, with english.lst storing a list of 173528 words, as a text file of size 2 MB, the command make_lex < english.lst > english.rem produces a trie representation as a file of 4.5 MB.
Tries share the words by their prefixes, but common suffixes account for a lot of redundancy in the structure. We shall eliminate this redundancy by sharing and get a minimal structure.

The Share Functor
module Share : functor (Algebra : sig type domain = 'a; value size : int; end) ->
  sig value share : Algebra.domain -> int -> Algebra.domain; end;
That is, Share takes as argument a module Algebra providing a type domain and an integer value size, and it defines a value share of the stated type. We assume that the elements from the domain are presented with an integer key bounded by Algebra.size. That is, share x k will assume as precondition that 0 ≤ k < Max with Max = Algebra.size.
We shall construct the sharing map with the help of a hash table, made up of buckets (k, [e1; e2; ... en]) where each element ei has key k.

Memoizing
type bucket = list Algebra.domain;
value memo = Array.create Algebra.size ([] : bucket);
We shall use a service function search, such that search e l returns the first y in l such that y = e, or else raises the exception Not_found.
value search e = List.find (fun x -> x = e);

The share function
value share element key =
  let bucket = memo.(key) in
  try search element bucket with
    [ Not_found ->
        do { memo.(key) := [ element :: bucket ]; element }
    ];
Sharing is just recalling!
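
Put together, the functor, its memo table and the share function can be sketched in standard OCaml syntax as below. The instantiation S on strings, with size 16 and key 5, is an arbitrary choice made for the illustration.

```ocaml
module Share (Algebra : sig type domain val size : int end) : sig
  val share : Algebra.domain -> int -> Algebra.domain
end = struct
  (* memo.(k) is the bucket of the elements already seen with key k. *)
  let memo : Algebra.domain list array = Array.make Algebra.size []

  (* Return the stored representative of element, recording it if new.
     Precondition: 0 <= key < Algebra.size. *)
  let share element key =
    let bucket = memo.(key) in
    match List.find_opt (fun x -> x = element) bucket with
    | Some representative -> representative
    | None -> memo.(key) <- element :: bucket; element
end

(* Hypothetical instantiation on strings. *)
module S = Share (struct type domain = string let size = 16 end)

let a = S.share (String.make 3 'x') 5
let b = S.share (String.make 3 'x') 5
```

The two calls build two distinct but equal strings; the second call returns the representative stored by the first, so a and b end up being the same physical value.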

Compressing trees as dags
We may for instance instantiate Share on the algebra of trees, with a size hash_max depending on the application:
module Dag = Share (struct type domain = tree;
                           value size = hash_max; end);
And now we compress a trie into a minimal dag using share by a simple bottom-up traversal, where the key is computed along by hashing. For this we define a general bottom-up traversal function, which applies a parametric lookup function to every node and its associated key.

Dynamic programming
Bottom-up traversing with inductive hash-code computation.
value h1 key index sum = sum + index * key
and h0 = 1
and h forest = forest mod hash_max;
value traverse lookup = travel
  where rec travel = fun
    [ Tree forest ->
        let f (trees, index, span) t =
          let (t', k) = travel t in
          ([ t' :: trees ], index + 1, h1 k index span) in
        let (forest', _, span) = List.fold_left f ([], 1, h0) forest in
        let key = h span in
        (lookup (Tree (List.rev forest')) key, key)
    ];

Compressing a tree as a dag
Now, compressing a tree optimally as a minimal dag is simply effected by a sharing traversal:
value compress = traverse Dag.share;
value minimize tree = let (dag, _) = compress tree in dag;
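
The whole compression chain can be tried on a small ordered tree with the following self-contained sketch in standard OCaml syntax. The value of hash_max and the sample tree t are assumptions; the memo table is inlined here instead of going through the Share functor.

```ocaml
type tree = Tree of tree list

let hash_max = 9689  (* arbitrary table size, chosen for the example *)
let memo : tree list array = Array.make hash_max []

(* Associative memory: return the representative of t in bucket key. *)
let share t key =
  match List.find_opt (fun x -> x = t) memo.(key) with
  | Some representative -> representative
  | None -> memo.(key) <- t :: memo.(key); t

(* Bottom-up traversal, computing the hash key inductively. *)
let rec travel (Tree forest) =
  let f (trees, index, span) t =
    let (t', k) = travel t in
    (t' :: trees, index + 1, span + index * k) in
  let (forest', _, span) = List.fold_left f ([], 1, 1) forest in
  let key = span mod hash_max in
  (share (Tree (List.rev forest')) key, key)

let minimize t = fst (travel t)

(* A tree with two equal subtrees. *)
let t = Tree [Tree [Tree []]; Tree [Tree []]]
let d = minimize t
```

After minimization the two children of d are one and the same physical node, while d remains structurally equal to t.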

Advantages and extensions
Hashing keys and size is on the client side: we do not delegate hashing to Share, which is just an associative memory. This has two advantages:
• The computation is fully linear
• It is adapted to the statistics of the data
Extension: Auto-sharing types (controlled hash-consing). Suggests a monad of shared hashed structures accommodating entropy of the data.

Dagified lexicons
We may dagify a lexicon a posteriori in one pass:
value dagify () =
  let lexicon = (input_value stdin : Trie.trie)
  in let dag = Mini.minimize lexicon in output_value stdout dag;
And now if we apply this technique to our English lexicon, with command dagify < english.rem > small.rem, we get an optimal representation which only needs 1 MB of storage, half of the original ASCII string representation.

Advertisement
The recursive algorithms given so far are fairly straightforward. They are easy to debug, maintain and modify due to the strong typing safeguard of ML, and even easy to formally certify. They are nonetheless efficient enough for production use, thanks to the optimizing native-code compiler of Objective Caml.
In our Sanskrit application, the trie of 11500 entries is shrunk from 219 KB to 103 KB in 0.1s, whereas the trie of 120000 flexed forms is shrunk from 1.63 MB to 140 KB in 0.5s on a 864 MHz PC. Our trie of 173528 English words is shrunk from 4.5 MB to 1 MB in 2.7s. Measurements showed that the time complexity is linear with the size of the lexicon (within comparable sets of words).

Variations
Many variations on tries exist. Optimisations of lexical analysers for programming languages are described in the Dragon book. But the dragon book of computational linguistics has not been written yet.
Variation with ternary trees. Ternary trees are inspired by Bentley and Sedgewick. Ternary trees are more complex than tries, but use slightly less storage. Access is potentially faster in balanced trees than in tries. A good methodology seems to be to use tries for editing, and to translate them to balanced ternary trees for production use with a fixed lexicon.
The ternary version of our English lexicon takes 3.6 MB, a saving of 20% over its trie version using 4.5 MB. After dag minimization, it takes 1 MB, a saving of 10% over the trie dag version using 1.1 MB. For our Sanskrit lexicon index, the trie takes 221 KB and the tertree 180 KB. Shared as dags, the trie takes 103 KB and the tertree 96 KB.

Decos, Lexmaps, Autos
We understand the Trie structure of a set of Words as a special case of a finitely based mapping Deco = Word → Annotation in the case of Boolean annotations shared by prefix arguments (and by common subexpressions when shared).
We store morphology constructions as being of this type, and we investigate the reverse mapping by generalising them to relations, typically inductively defined through finite state machines.
The more sharing we get the better we optimise this data layout. It is thus of paramount importance that the annotations be local quasi-morphisms decorations.

Decos
type deco 'a = [ Deco of (list 'a * dforest 'a) ]
and dforest 'a = list (Word.letter * deco 'a);
We think of the decoration of type list 'a as an information associated with the word stored at that node.
We can easily generalize sharing to decorated tries. However, substantial savings will result only if the information at a given node is a function of the subtrie at that node, i.e. if such information is defined as a trie morphism.
Definition. A deco is a tree morphism if the information at every node is a function of the corresponding sub-tree. Such decos preserve the sharing of the trees they decorate.

Encoding morphological parameters as decorations
We thus profit from the regularity of morphological transformations to have terse representations of the lexicon decorated by grammatical information. Thus if all plurals are obtained by adding 's' to the singular stem except for a few exceptions, we do not pay any cost in encoding this plural information as an explicit instruction [pl: suffix s] decorating the stems, since it will not create any new node except for the few exceptions. Listing the plural forms explicitly, by contrast, would undo all sharing.
In our Sanskrit implementation, the various genders associated with a noun stem are defined in a deco used for producing the flexed forms. The flexed forms are then generated using an ad-hoc internal sandhi algorithm, difficult to encode as a finite-state process, and thus difficult to invert.

Explicit morphology vs implicit morphology
By explicit morphology I mean listing explicitly the forms generated by morphology operations from root stems, prefixes and suffixes.
By implicit morphology I mean just having programs which will generate these flexed forms on demand.
Implicit morphology is not enough to recognize the segments of sentences identical with a flexed form: the morphological functions must be invertible.

Compromise
On the other hand, the delimitation between implicit and explicit is blurred since e.g. a finite-state machine state graph may be both considered a program and a piece of data; for instance, a trie stores words, but actually the words are "recognized as being in the lexicon" by "running the lexicon over them as input data".
Thus we shall represent "explicitly" flexed forms and the information on how they are derived from root stems as a trie bearing as decorations instructions on how to "undo morphology" locally. For this purpose, we shall use the notion of differential word above. We may now store inverse maps of lexical relations (such as morphology derivations) using the Lexmap structure.
This way we bypass the (hard) problem of internal sandhi FSM axiomatisation.

Lexmaps
type inverse 'a = (Word.delta * 'a)
and inverse_map 'a = list (inverse 'a);
type lexmap 'a = [ Map of (inverse_map 'a * mforest 'a) ]
and mforest 'a = list (Word.letter * lexmap 'a);
Typically, if word w is stored at a node Map ([...; (d, r); ...], ...), this represents the fact that w is the image by relation r of w' = patch d w. Such a lexmap is thus a representation of the image by r of a source lexicon. This representation is invertible, while preserving maximally the sharing of final substrings, and thus being amenable to sharing.
Example: cats and dogs sharing their 's' node while implicitly referring to their respective singular stem.
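
The cats and dogs example can be made concrete with a short sketch in standard OCaml syntax. The relation tag "pl", the value plural and the function lemmas are hypothetical names chosen for the illustration; the delta (1, []) means "go up one letter, then add nothing".

```ocaml
type word = int list
type delta = int * word

(* Hypothetical coercions between strings and words. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])
let decode w =
  String.concat "" (List.map (fun c -> String.make 1 (Char.chr c)) w)

(* patch (n, u) w: drop the last n letters of w, then append u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop n l =
    if n = 0 then l
    else match l with
      | [] -> failwith "patch"
      | _ :: t -> drop (n - 1) t in
  List.rev_append (drop n (List.rev w)) u

(* The inverse map decorating the final 's' node; it is the same
   for "cats" and for "dogs", so the node can be shared. *)
let plural : (delta * string) list = [ ((1, []), "pl") ]

(* Recover the stems a form comes from, with the relation used. *)
let lemmas form =
  List.map (fun (d, tag) -> (decode (patch d (encode form)), tag)) plural
```

Since the decoration does not mention the stem itself, both plural forms carry exactly the same information at their last node, which is what keeps the suffix sharable.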

Uniformity
We remark that our differential words may be seen as zipper operations bytecode: the integer part iterates going up, while the word part tells how to go down, the whole thing being the code for navigating in the structure along the shortest path from one node to the other, through their closest common ancestor. This shows in a nutshell that the various techniques we are exhibiting are very complementary.

Lexicon repositories using tries and decos
In a typical computational linguistics application, grammatical information (part of speech role, gender/number for substantives, valency and other subcategorization information for verbs, etc) may be stored as decoration of the lexicon of roots/stems. From such a decorated trie a morphological processor may compute the lexmap of all flexed forms, decorated with their derivation information encoded as an inverse map. This structure may itself be used by a tagging processor to construct the linear representation of a sentence decorated by feature structures. Such a representation will support further processing, such as computing syntactic and functional structures, typically as solutions of constraint satisfaction problems.

Example: Sanskrit
The main component in our tools is a structured lexical database. From this database, various hypertext documents may be produced mechanically. The index CGI engine searches for words by navigating in a persistent trie index of stem entries. The current database comprises 12000 items, and its index has a size of 103 KB.
When computing this index, another persistent structure is created. It records in a deco all the genders associated with a noun entry. At present, this deco records genders for 5700 nouns, and it has a size of 268 KB.
We iterate a grammatical engine over this genders structure to generate the declined forms. The resulting lexmap records about 120000 such flexed forms with associated grammatical information, and it has a size of 341 KB. A companion trie, without the information, keeps the index of flexed words as a minimized structure of 140 KB.

Finite State Lore
Computational phonology and morphology make extensive use of finite state technology: rational languages and relations, transducers, bimachines, etc.
• Schützenberger
• Koskenniemi
• Kaplan and Kay
Finite state toolsets have been developed, where word transformations are systematically compiled in a low-level algebra of finite-state machines operators. Such toolsets have been developed at Xerox, Paris VII, Bell Labs, Mitsubishi Labs, etc. Compiling complex rewrite rules in rational transducers may be subtle. We depart from this fine-grained methodology and propose more direct translations preserving the structure of the lexicon.

Finite State Machines as Lexicon Morphisms
We start with the remark that a lexicon represented as a trie is directly the state space representation of the (deterministic) finite state machine that recognizes its words, and that its minimization consists exactly in sharing the lexical tree as a dag. We are in a case where the state graph of such finite languages recognizers is an acyclic structure. Such a pure data structure may be easily built without mutable references, which has computational and robustness advantages.
In the same spirit, we define automata which implement non-trivial rational relations (and their inversion) and whose state structure is nonetheless a more or less direct decoration of the lexicon trie. The crucial notion is that the state structure is a lexicon morphism.

Unglueing
We start with a toy problem which is the simplest case of juncture analysis, namely when there are no non-trivial juncture rules, and segmentation consists just in retrieving the words of a sentence glued together in one long string of characters (or phonemes). Consider for instance written English. You have a text file consisting of a sequence of words separated with blanks, and you have a lexicon complete for this text (for instance, 'spell' has been successfully applied). Now, suppose you make some editing mistake, which removes all spaces, and the task is to undo this operation to restore the original.
The transducer is defined as a functor, taking the lexicon trie structure as parameter.

Unglue
module Unglue (Lexicon : sig value lexicon : Trie.trie; end) = struct
type input = Word.word (* input sentence as a word *)
and output = list Word.word; (* output is sequence of words *)
type backtrack = (input * output)
and resumption = list backtrack; (* coroutine resumptions *)
exception Finished;
We define our unglueing reactive engine as a recursive process which navigates directly on the (flexed) lexicon trie (typically the compressed trie resulting from the Dag module considered above).

The reactive engine
The reactive engine takes as arguments the (remaining) input, the (partially constructed) list of words returned as output, a backtrack stack whose items are (input, output) pairs, the path occ in the state graph stacking (the reverse of) the current common prefix of the candidate words, and finally the current trie node as its current state. When the state is accepting, we push it on the backtrack stack, because we want to favor possible longer words, and so we continue reading the input until either we exhaust the input, or the next input character is inconsistent with the lexicon data.

The reactive engine code
value rec react input output back occ = fun
  [ Trie (b, forest) ->
      if b then
        let pushout = [ occ :: output ] in
        if input = [] then (pushout, back) (* solution found *)
        else let pushback = [ (input, pushout) :: back ]
             in continue pushback
      else continue back
      where continue cont = match input with
        [ [] -> backtrack cont
        | [ letter :: rest ] ->
            try let next_state = List.assoc letter forest
                in react rest output cont [ letter :: occ ] next_state
            with [ Not_found -> backtrack cont ]
        ]
  ]

Backtrack
and backtrack = fun
  [ [] -> raise Finished
  | [ (input, output) :: back ] ->
      react input output back [] Lexicon.lexicon
  ];
Now, unglueing a sentence is just calling the reactive engine from the appropriate initial backtrack situation.
value unglue sentence = backtrack [ (sentence, []) ];

Non-deterministic programming
Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG blackbox?
The three golden rules of non-deterministic programming:
• Identify well your search state space
• Represent states as non-mutable data
• Prove termination
The last point is essential for understanding the granularity and enforcing completeness.
Remark. Multiset ordering is an elegant method for proving termination of non-deterministic programs, independently of the sequential strategy of the generation of the solutions.

More on state space considerations
This non-deterministic process (recognizing L*) uses the same state space as the lexicon/trie (recognizing L).
This corresponds to the fact that an automaton for L* may be obtained from the automaton for L by inserting ε-moves from accepting nodes to the initial node. But such transitions may be kept completely implicit. All you have to do is to manage the necessary non-determinism in the backtrack stack: continuing in L, which is not in general a prefix language (i.e. it may happen that both w and w·s are in L), versus iterating. You do not have to modify the state space data structure at all. It is just a shift in point of view concerning this data.

Still more on state space considerations
Remember that dagified tries define the minimal automaton of a finite language L.
But it is not the case that this automaton, completed with ε-transitions, is minimal for L*. Consider for instance L = {a, aa}.
However, note that we are using it as a transducer computing justifications for a word in L* to be a concatenation of precise words of L, and the minimal automaton does not keep enough information for that: distinct segmentations of a sentence must be separated.
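
The example L = {a, aa} can be spelled out as follows; the counts of states are worked out here and are not given on the slide.

```latex
\begin{align*}
L &= \{a,\ aa\}, \qquad L^{*} = \{a^{n} \mid n \ge 0\} \\
aaa &= a \cdot a \cdot a \;=\; a \cdot aa \;=\; aa \cdot a
\end{align*}
```

A recognizer for L* needs a single accepting state with a loop on a, whereas the trie for L has three states (the root, the node for a and the node for aa). The three-state structure is nevertheless the one that lets the transducer tell the three segmentations of aaa apart.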

Childtalk
module Childtalk = struct
value lexicon = Lexicon.make_lex ["boudin"; "caca"; "pipi"];
end;
module Childish = Unglue (Childtalk);
let (sol, _) = Childish.unglue (Word.encode "pipicacaboudin")
in Childish.print_out sol;
We recover as expected: pipi caca boudin.

Generating several solutions
We resume a resumption with resume : (resumption -> int -> resumption).
value resume cont n =
  let (output, resumption) = backtrack cont in
  do { print_string "\nSolution "; print_int n; print_string " :\n";
       print_out output;
       resumption };
value unglue_all sentence = restore [ (sentence, []) ] 1
  where rec restore cont n =
    try let resumption = resume cont n
        in restore resumption (n + 1)
    with [ Finished -> if n = 1 then print_string "No solution found\n" else () ];

Solving a charade
module Short = struct
value lexicon = Lexicon.make_lex
  ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;
module Charade = Unglue (Short);
Charade.unglue_all (Word.encode "amiabletogether");
Solution 1 : amiable together
Solution 2 : amiable to get her
Solution 3 : am i able together
Solution 4 : am i able to get her
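
The charade can be replayed with the following self-contained sketch in standard OCaml syntax. It follows the react/backtrack scheme of the slides, but several things are assumptions made to keep it short: letters are characters rather than integers, the lexicon is passed as an explicit argument instead of a functor parameter, and unglue_all returns the list of solutions instead of printing them.

```ocaml
type trie = Trie of bool * (char * trie) list

let empty = Trie (false, [])

let rec enter (Trie (b, forest)) = function
  | [] -> Trie (true, forest)
  | c :: r ->
    let sub = try List.assoc c forest with Not_found -> empty in
    Trie (b, (c, enter sub r) :: List.remove_assoc c forest)

let explode s = List.init (String.length s) (String.get s)
let implode l = String.concat "" (List.map (String.make 1) l)
let make_lex = List.fold_left (fun lex s -> enter lex (explode s)) empty

exception Finished

(* occ is the reversed prefix read so far in the current word;
   back is the stack of (input, output) resumptions. *)
let rec react lexicon input output back occ (Trie (b, forest)) =
  let continue cont = match input with
    | [] -> backtrack lexicon cont
    | letter :: rest ->
      (match List.assoc_opt letter forest with
       | Some next -> react lexicon rest output cont (letter :: occ) next
       | None -> backtrack lexicon cont) in
  if b then
    let pushout = occ :: output in
    if input = [] then (pushout, back)        (* solution found *)
    else continue ((input, pushout) :: back)  (* try longer words first *)
  else continue back

and backtrack lexicon = function
  | [] -> raise Finished
  | (input, output) :: back -> react lexicon input output back [] lexicon

(* Resume until Finished is raised, collecting every segmentation. *)
let unglue_all lexicon sentence =
  let rec loop cont acc =
    match backtrack lexicon cont with
    | (output, resumption) ->
      let words = List.rev_map (fun occ -> implode (List.rev occ)) output in
      loop resumption (words :: acc)
    | exception Finished -> List.rev acc in
  loop [ (explode sentence, []) ] []

let lexicon =
  make_lex ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]
let solutions = unglue_all lexicon "amiabletogether"
```

The solutions come out longest word first, in the same order as the four solutions listed above.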

Juncture euphony and its discretization
When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:
[x] u | v → w
This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.

[Figures: two diagrams of the juncture rewrite, labelled with the rule components u, v, w, x and z]

Auto
type lexicon = trie
and rule = (word * word * word);
The rule triple (rev u, v, w) represents the string rewrite u|v → w. Now for the transducer state space:
type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;
module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);

Compiling the lexicon to a minimal transducer
(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
      let local_stack = if b then get_sandhi occ else [] in
      let f (deter, stack, span) (n, t) =
        let current = [ n :: occ ] (* current occurrence *) in
        let (auto, st, k) = build_auto current t in
        ([ (n, auto) :: deter ], merge st stack, hash1 n k span) in
      let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
      let (h, l) = match stack with
        [ [] -> ([], []) | [ h :: l ] -> (h, l) ] in
      let key = hash b span h in
      let s = Auto.share (State (b, deter, h)) key in
      (s, merge local_stack l, key)
  ];

Segmenting Transducer Data Structures
type transition =
  [ Euphony of rule (* (rev u, v, w) s.t. u|v -> w *)
  | Id (* identity or no sandhi *)
  ]
and output = list (word * transition);
type backtrack =
  [ Next of (input * output * word * choices)
  | Init of (input * output)
  ]
and resumption = list backtrack; (* coroutine resumptions *)
exception Finished;

Running the Segmenting Transducer
value rec react input output back occ = fun
  [ State (b, det, choices) ->
      (* we try the deterministic space first *)
      let deter cont = match input with
        [ [] -> backtrack cont
        | [ letter :: rest ] ->
            try let next_state = List.assoc letter det
                in react rest output cont [ letter :: occ ] next_state
            with [ Not_found -> backtrack cont ]
        ] in
      let nondets = if choices = [] then back
                    else [ Next (input, output, occ, choices) :: back ] in
      if b then
        let out = [ (occ, Id) :: output ] (* opt final sandhi *)
        in if input = [] then (out, nondets) (* solution *)
           else let alterns = [ Init (input, out) :: nondets ]
                (* we first try the longest matching word *)
                in deter alterns
      else deter nondets
  ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [ ((u, v, w) as rule) :: others ] ->
      let alterns = [ Next (input, output, occ, others) :: back ] in
      if prefix w input then
        let tape = advance (length w) input
        and out = [ (u @ occ, Euphony (rule)) :: output ] in
        if v = [] (* final sandhi *) then
          if tape = [] then (out, alterns) else backtrack alterns
        else let next_state = access v
             in react tape out alterns v next_state
      else backtrack alterns
  ]
and backtrack = fun
  [ [] -> raise Finished
  | [ resume :: back ] -> match resume with
      [ Next (input, output, occ, choices) ->
          choose input output back occ choices
      | Init (input, output) ->
          react input output back [] automaton
      ]
  ];

Example of Sanskrit Segmentation
process "tacchrutvaa";
Chunk: tacchrutvaa
may be segmented as:
Solution 1 :
[ tad with sandhi d|"s -> cch]
[ "srutvaa with no sandhi]
More examples
process"o.mnama.h\"sivaaya";
Solution1:
[omwithsandhim|n->.mn]
[namaswithsandhis|"s->.h"s]
["sivaayawithnosandhi]
process"sugandhi.mpu.s.tivardhanam";
Solution1:[sugandhimwithsandhim|p->.mp]
[pu.s.tiwithnosandhi]
[vardhanamwithnosandhi]
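
The segmentations shown in these examples can be checked in the forward direction by applying the quoted sandhi rules at the junctures. The sketch below, in standard OCaml syntax, works on plain strings and ignores the optional left context [x] of a rule; the function join and its helpers are assumptions introduced for the illustration.

```ocaml
(* A rule (u, v, w) stands for the rewrite u|v -> w at a word juncture. *)
type rule = string * string * string

let has_suffix u s =
  let lu = String.length u and ls = String.length s in
  ls >= lu && String.sub s (ls - lu) lu = u

let has_prefix v s =
  let lv = String.length v and ls = String.length s in
  ls >= lv && String.sub s 0 lv = v

(* Glue word1 and word2, rewriting u|v into w when the rule applies. *)
let join ((u, v, w) : rule) word1 word2 =
  if has_suffix u word1 && has_prefix v word2 then
    let stem = String.sub word1 0 (String.length word1 - String.length u)
    and tail =
      String.sub word2 (String.length v)
        (String.length word2 - String.length v) in
    Some (stem ^ w ^ tail)
  else None

(* tad + "srutvaa with sandhi d|"s -> cch *)
let chunk1 = join ("d", "\"s", "cch") "tad" "\"srutvaa"

(* om + namas with sandhi m|n -> .mn *)
let chunk2 = join ("m", "n", ".mn") "om" "namas"
```

Segmentation is the inverse problem: given the glued chunk, find the words and the rules that produce it.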

Sanskrit Tagging
process "sugandhi.mpu.s.tivardhanam";
Solution 1 :
[ sugandhim
  < { acc. sg. m. }[sugandhi] > with sandhi m|p -> .mp]
[ pu.s.ti < { iic. }[pu.s.ti] > with no sandhi]
[ vardhanam
  < { acc. sg. m. | acc. sg. n. | nom. sg. n.
    | voc. sg. n. }[vardhana] > with no sandhi]

Statistics
The complete automaton construction from the flexed forms lexicon takes only 9s on a 864 MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which accepting states, fitting in 746 KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6 MB!
The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.

Soundness and Completeness of the Algorithms
Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment_all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.
Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.
Cf. http://pauillac.inria.fr/~huet/PUBLIC/tagger.ps

A note on termination
Termination is proved by multiset ordering on resumptions.
This allows the algorithm to be stated as a non-deterministic algorithm, allowing any strategy for priority of lexicon search versus euphony prediction, as well as arbitrary selection of resumptions when backtracking.
This is important, since it leaves all freedom for implementing arbitrary priority policies learned by corpus training.

Enjoy!
• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/
• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/FREE/tagger.ps
• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps
• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps
• Tutorial slides: http://pauillac.inria.fr/~huet/ZEN/Hyderabad.ps
• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar
• Objective Caml: http://caml.inria.fr/ocaml/

Automata mista - AuM

The automaton structure
type input = word;
type delta = (int * word)
and address = [ Global of delta | Local of delta ];
type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list (input * address);
type automaton = (array auto * delta);
type backtrack = (input * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)

Double completeness
Every non-deterministic automaton (possibly with ε-moves) may be represented by a flat mixed automaton (one whose deterministic structures are empty).
Every deterministic automaton may be represented by a mixed automaton whose only choice parts, of the form State (b, [], [([], address)]), do not give rise to backtracking.
Every mixed automaton has a minimal representation, obtained by sharing as a dag. The sharing of local virtual addresses does not necessarily correspond to automata that are equivalent by bisimulation.

The transducer structure
type input = word and output = word;
type delta = (int * word)
and address = [ Global of delta | Local of delta ];
type trans = [ State of (bool * deter * choices) ]
and deter = list (letter * trans)
and choices = list (input * output * address);
type transducer = (array trans * delta);
type backtrack = (input * output * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)

[Figure: the AuM structure, with labels forest, stack, current dag, and word (a1 ... an)]

Memorizing the current access
The access stack [sn; sn−1; ... s0] leading to the current state is needed in order to interpret local virtual addresses. It may be advantageous to also keep the current access word mot = [an; ... a1], popped and pushed as local accesses proceed. One may thus distinguish two output constructors: Absolu of word and Relatif of delta. In the latter case, the output is computed by patch from mot.
Applications:
• Dictionary of inflected words used as a lemmatizer (regular plural: δ = (1, ['s']))
• Unglue (δ = (0, []))
• Segment (δ = (0, u))

Conclusion
Mixed automata offer an efficient solution to many morpho-syntactic problems. The deterministic structure spanning the state space is the lexicon, which is thus naturally placed at the heart of computational language processing.