Representation Structures for Computational Linguistics
Gérard Huet
ESSLLI 2002, Trento
-1-
What the course is about
• A computational platform for Sanskrit
• The ZEN computational morphology toolkit
• Pidgin ML
• The functional programming paradigm for CL
• Concrete programming issues in Objective Caml + Camlp4
• General architecture issues for a CL platform
• Cooperation on free CL resources
Two specific applicative technologies:
• Local processing of focused data
• Sharing
-2-
What shall not be discussed
• ML vs C++
• ML vs Java
• ML vs Prolog
-3-
What shall not be discussed at length
• Objective CAML vs SML
• ML vs Haskell
• ML vs C
• Pidgin ML vs Objective CAML
-4-
Basics: lists vs stacks
value l5 = [1; 2; 3; 4; 5];
value s5 = [5; 4; 3; 2; 1];
value rec unstack l s =
  match l with
  [ [] -> s
  | [h :: t] -> unstack t [h :: s]
  ];
value rev l = unstack l [];
value state3 = ([3; 2; 1], [4; 5]);
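The pair state3 can be read as a list with a focus: a stack of the elements already passed (in reverse order) and the list of elements still ahead. The following is a minimal sketch in standard OCaml syntax (not the Pidgin ML syntax of the slides). The functions forward and backward are illustrative names that do not appear in the slides.

```ocaml
(* Standard OCaml transcription of the slide's definitions. *)

(* unstack l s pops the elements of l one by one and pushes them on s. *)
let rec unstack l s =
  match l with
  | [] -> s
  | h :: t -> unstack t (h :: s)

let rev l = unstack l []

(* A focused list: (reversed left part, right part).
   forward and backward are hypothetical helpers for this sketch. *)
let forward (left, right) =
  match right with
  | [] -> failwith "forward"
  | x :: rest -> (x :: left, rest)

let backward (left, right) =
  match left with
  | [] -> failwith "backward"
  | x :: rest -> (rest, x :: right)

let () =
  let state3 = ([3; 2; 1], [4; 5]) in
  assert (rev [1; 2; 3; 4; 5] = [5; 4; 3; 2; 1]);
  assert (forward state3 = ([4; 3; 2; 1], [5]));
  assert (backward state3 = ([2; 1], [3; 4; 5]));
  (* Closing the focus rebuilds the whole list. *)
  assert (unstack (fst state3) (snd state3) = [1; 2; 3; 4; 5])
```

Moving the focus costs one pop and one push, whatever the length of the list.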
-5-
Turing machines, Emacs, and Zippers
Zippers. First presentation at FLoC'96. Published as: G. Huet. The Zipper. J. Functional Programming 7,5 (1997), 549-554.
Large scale implementations in syntax editors within computational linguistics platforms:
• G. Huet. Lexical morphisms with the Zen platform.
• A. Ranta. Grammatical frameworks.
-6-
Contexts as zippers
type tree = [ Tree of forest ] and forest = list tree;
type tree_zipper = [ Top | Zip of (forest * tree_zipper * forest) ];
type focused_tree = (tree_zipper * tree);
A focused tree is a tree with a focus point of interest, i.e. a tree and a stacked context.
-7-
Operations on focused trees
value down (z, t) = match t with
  [ Tree (forest) -> match forest with
      [ [] -> raise (Failure "down")
      | [hd :: tl] -> (Zip ([], z, tl), hd)
      ]
  ];
value up (z, t) = match z with
  [ Top -> raise (Failure "up")
  | Zip (l, u, r) -> (u, Tree (unstack l [t :: r])) ];
-8-
More operations on focused trees
value left (z, t) = match z with
  [ Top -> raise (Failure "left")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "left")
      | [elder :: elders] -> (Zip (elders, u, [t :: r]), elder)
      ]
  ];
value right (z, t) = match z with
  [ Top -> raise (Failure "right")
  | Zip (l, u, r) -> match r with
      [ [] -> raise (Failure "right")
      | [young :: rest] -> (Zip ([t :: l], u, rest), young)
      ]
  ];
-9-
Applicative updating
value del_l (z, _) = match z with
  [ Top -> raise (Failure "del_l")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "del_l")
      | [elder :: elders] -> (Zip (elders, u, r), elder)
      ]
  ];
value replace (z, _) t = (z, t);
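To make the navigation and updating operations concrete, here is a small sketch in standard OCaml syntax (a transcription, not the Pidgin ML of the slides). It walks down to the second child of a tree, replaces it, and walks back up. The values leaf, example and edited are illustrative names introduced for this sketch.

```ocaml
type tree = Tree of tree list
type tree_zipper = Top | Zip of tree list * tree_zipper * tree list

let rec unstack l s = match l with [] -> s | h :: t -> unstack t (h :: s)

(* Move the focus to the first child. *)
let down (z, t) =
  match t with
  | Tree [] -> failwith "down"
  | Tree (hd :: tl) -> (Zip ([], z, tl), hd)

(* Move the focus to the parent, rebuilding it from the context. *)
let up (z, t) =
  match z with
  | Top -> failwith "up"
  | Zip (l, u, r) -> (u, Tree (unstack l (t :: r)))

(* Move the focus to the next sibling on the right. *)
let right (z, t) =
  match z with
  | Zip (l, u, young :: rest) -> (Zip (t :: l, u, rest), young)
  | _ -> failwith "right"

(* Replace the subtree in focus; the context is untouched. *)
let replace (z, _) t = (z, t)

let leaf = Tree []
let example = Tree [leaf; Tree [leaf]; leaf]

(* Focus on the second child, replace it by a leaf, come back to the top. *)
let edited =
  let focused = right (down (Top, example)) in
  let (_, t) = up (replace focused leaf) in
  t

let () =
  assert (edited = Tree [leaf; leaf; leaf]);
  (* The update is applicative: the original tree is unchanged. *)
  assert (example = Tree [leaf; Tree [leaf]; leaf])
```

Only the path from the root to the focus is rebuilt; the untouched siblings are reused as they are.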
-10-
Points of view about focused structures
• Manipulation of focused data is local
• Redundant representation - efficiency
• The Interaction Combinators Paradigm
Remark. Zippers are linear contexts. They are superior to Ω-terms, notably because the approximation ordering is substructural.
The Natural Transformation from tree functors to zipper functors is Differentiation; Zippers may also be seen as the linear Functions over Trees.
-11-
Back to linguistics
We want to process (parse and generate) natural language sentences, dialogues, and corpora of various kinds (oral, written, news, books, web sites, etc). We assume that the data is already digitized and discretized as a stream of letters (phonemes for oral data, letters for written data).
A fundamental entity in this processing is the word. One traditionally distinguishes processing between streams of letters and words (morphology, lexical analysis) and processing between words and sentences (syntax, parsing). However, the nature of the word is elusive.
-12-
What Tesnière has to say
The linguist Tesnière, in his Éléments de Syntaxe Structurale, says:
“Simple as it may seem, the notion of word is one of those whose definition is most delicate for the linguist. This is perhaps because too often one starts from the notion of word to arrive at the notion of sentence, instead of starting from the notion of sentence to arrive at the notion of word. Yet one cannot define the sentence from the word, but only the word from the sentence. For the notion of sentence is logically prior to that of word.”
-13-
Ontological Problem
What Tesnière really says is self-evident: it is the ontological priority of the Corpus over the Lexicon. The words are found in the Corpus, then copied to the Lexicon; the Language is defined by its Corpus.
The preeminence of the Corpus over the Lexicon is undeniable. Nevertheless, the words are recognized in the corpus relatively to the generative devices of morphology; the inversion of these generative relations extends the strict covering of the corpus by the generative capabilities of the grammar; and thus there is a tension between the co-inductive structure of the lexicon as a repository of utterances and the inductive structure of words as generated by morphological devices of stems in the lexicon.
-14-
Philosophical considerations
Anecdote. The Thamadas in Georgia.
Puzzles. The ‘oui’ problem. The ‘oiu’ problem.
Research topic. Define the functor the fixpoint of which is constructed.
Technology. Chase out hapaxes. Or rather, index properly the diachronic dimension of the language under consideration.
-15-
Back to the Lexicon
Words. Words are represented as lists of positive integers.
type letter = int and word = list letter;
We provide coercions encode : string -> word and decode : word -> string. Here is lexicographic ordering.
value rec lexico l1 l2 = match l1 with
  [ [] -> True
  | [c1 :: r1] -> match l2 with
      [ [] -> False
      | [c2 :: r2] -> if c2 < c1 then False
                      else if c2 = c1 then lexico r1 r2
                      else True
      ]
  ];
-16-
Differential words
type delta = (int * word);
A differential word is a notation permitting to retrieve a word w from another word w' sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.
We compute the difference between w and w' as a differential word diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, with maximal common prefix p.
The converse of diff : word -> word -> delta is
patch : delta -> word -> word : w' may be retrieved from w and d = diff w w' as w' = patch d w.
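The slides give only the specification of diff and patch. The following is one possible implementation in standard OCaml syntax, offered as a sketch and not as the ZEN code. The helper take and the ASCII-based encode are assumptions made to keep the example self-contained.

```ocaml
type letter = int
type word = letter list
type delta = int * word

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2,
   p being the maximal common prefix of w and w'. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* take n l returns the first n elements of l (hypothetical helper). *)
let rec take n l =
  if n <= 0 then []
  else match l with [] -> [] | x :: r -> x :: take (n - 1) r

(* patch (n, u) w goes up n letters from the end of w, then down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let keep = List.length w - n in
  if keep < 0 then failwith "patch" else take keep w @ u

(* Stand-in for the slides' encode: letters are ASCII codes here. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

let () =
  let w = encode "cats" and w' = encode "cat" in
  assert (diff w w' = (1, []));
  assert (patch (diff w w') w = w');
  let a = encode "went" and b = encode "walk" in
  assert (diff a b = (3, encode "alk"));
  assert (patch (diff a b) a = b)
```

For "cats" and "cat" the differential word is (1, []): go up one letter and stop.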
-17-
Tries
Tries store sparse sets of words sharing initial prefixes. They are due to René de la Briandais (1959). We use a very simple representation with lists of siblings.
type trie = [ Trie of (bool * forest) ]
and forest = list (Word.letter * trie);
Tries are managed (search, insertion, etc) using the zipper technology.
-18-
Important remarks
Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.
Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.
-19-
Lexicon
Here is a simplistic lexicon compiler
make_lex : list string -> trie:
value make_lex = List.fold_left (fun lex c -> Trie.enter lex (Word.encode c)) Trie.empty;
For instance, with english.lst storing a list of 173528 words, as a text file of size 2Mb, the command make_lex < english.lst > english.rem produces a trie representation as a file of 4.5Mb.
Tries share the words by their prefixes, but common suffixes account for a lot of redundancy in the structure. We shall eliminate this redundancy by sharing.
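The slides do not show Trie.enter or Trie.empty. The following sketch, in standard OCaml syntax, gives one plausible reading of them so that make_lex can be run. It uses plain recursion where the slides say ZEN uses zippers, and the functions enter, mem and encode below are assumptions written for this illustration.

```ocaml
type letter = int
type word = letter list
type trie = Trie of bool * forest
and forest = (letter * trie) list

let empty = Trie (false, [])

(* enter t w returns a trie accepting w in addition to the words of t.
   Siblings are kept sorted by letter so that equal sets of words
   give structurally equal tries. *)
let rec enter (Trie (b, forest)) (w : word) : trie =
  match w with
  | [] -> Trie (true, forest)
  | c :: rest ->
      let sub = try List.assoc c forest with Not_found -> empty in
      let others = List.remove_assoc c forest in
      let siblings =
        List.sort (fun (x, _) (y, _) -> compare x y)
          ((c, enter sub rest) :: others) in
      Trie (b, siblings)

(* mem t w tests whether w is one of the words stored in t. *)
let rec mem (Trie (b, forest)) (w : word) : bool =
  match w with
  | [] -> b
  | c :: rest ->
      (match List.assoc_opt c forest with
       | Some t -> mem t rest
       | None -> false)

(* Stand-in for Word.encode: letters are ASCII codes here. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

let make_lex = List.fold_left (fun lex s -> enter lex (encode s)) empty

let () =
  let lex = make_lex ["cat"; "cats"; "dog"] in
  assert (mem lex (encode "cat"));
  assert (mem lex (encode "cats"));
  assert (not (mem lex (encode "ca")));
  assert (not (mem lex (encode "do")))
```

The words "cat" and "cats" share the path c-a-t; only the Boolean at each node tells which prefixes are words.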
-20-
The Share Functor
module Share : functor (Algebra : sig type domain = 'a; value size : int; end) ->
  sig value share : Algebra.domain -> int -> Algebra.domain; end;
That is, Share takes as argument a module Algebra providing a type domain and an integer value size, and it defines a value share of the stated type. We assume that the elements from the domain are presented with an integer key bounded by Algebra.size. That is, share x k will assume as precondition that 0 ≤ k < Max with Max = Algebra.size.
We shall construct the sharing map with the help of a hash table, made up of buckets (k, [e1; e2; ... en]) where each element ei has key k.
-21-
Memoizing
type bucket = list Algebra.domain;
value memo = Array.create Algebra.size ([] : bucket);
We shall use a service function search, such that search e l returns the first y in l such that y = e, or else raises the exception Not_found.
value search e = List.find (fun x -> x = e);
-22-
The share function
value share element key =
  let bucket = memo.(key) in
  try search element bucket with
  [ Not_found ->
      do { memo.(key) := [element :: bucket]; element }
  ];
Sharing is just recalling!
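The Share functor can be transcribed to standard OCaml syntax as follows. This is a sketch under two assumptions: Array.make is used for the memo table, and the instance StrShare on strings, with table size 16 and key 3, is chosen only for illustration.

```ocaml
module Share (Algebra : sig type domain val size : int end) : sig
  val share : Algebra.domain -> int -> Algebra.domain
end = struct
  type bucket = Algebra.domain list

  (* One bucket per key; all buckets start empty. *)
  let memo : bucket array = Array.make Algebra.size []

  (* share element key returns the representative already stored under
     key that is structurally equal to element, or stores element. *)
  let share element key =
    let bucket = memo.(key) in
    match List.find_opt (fun x -> x = element) bucket with
    | Some shared -> shared
    | None -> memo.(key) <- element :: bucket; element
end

(* Illustrative instance: sharing strings, with an arbitrary size. *)
module StrShare = Share (struct type domain = string let size = 16 end)

let () =
  (* Two equal strings built separately are distinct heap objects. *)
  let a = String.concat "" ["ab"; "c"] in
  let b = String.concat "" ["a"; "bc"] in
  assert (a = b && a != b);
  (* After sharing under the same key, both map to the first one. *)
  let a' = StrShare.share a 3 in
  let b' = StrShare.share b 3 in
  assert (a' == a);
  assert (b' == a)
```

The caller supplies the key; the functor only remembers and compares, which is the point made on the slide.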
-23-
Compressing trees as dags
We may for instance instantiate Share on the algebra of trees, with a size hash_max depending on the application:
module Dag = Share (struct type domain = tree;
                           value size = hash_max; end);
And now we compress a trie into a minimal dag using share by a simple bottom-up traversal, where the key is computed along by hashing. For this we define a general bottom-up traversal function, which applies a parametric lookup function to every node and its associated key.
-24-
Dynamic programming
Bottom-up traversing with inductive hash-code computation.
value hash1 key index sum = sum + index * key
and hash forest = forest mod hash_max;
value traverse lookup = travel
  where rec travel = fun
  [ Tree (forest) ->
      let f (tries, index, span) t =
        let (t0, k) = travel t
        in ([t0 :: tries], index + 1, hash1 k index span)
      in let (forest0, _, span) = List.fold_left f ([], 1, 0) forest
      in let key = hash span in (lookup (Tree (rev forest0)) key, key) ];
-25-
Compressing a tree as a dag
Now, compressing a tree optimally as a minimal dag is simply effected by a sharing traversal:
value compress = traverse Dag.share;
value minimize tree = let (dag, _) = compress tree in dag;
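Putting the pieces together, the sketch below transcribes Share, traverse and minimize to standard OCaml syntax and checks that equal subtrees end up physically shared. The value 9689 for hash_max is an arbitrary assumption, and the test tree is made up for the example.

```ocaml
type tree = Tree of tree list

let hash_max = 9689 (* arbitrary table size, an assumption *)

module Share (Algebra : sig type domain val size : int end) = struct
  let memo : Algebra.domain list array = Array.make Algebra.size []
  let share element key =
    let bucket = memo.(key) in
    match List.find_opt (fun x -> x = element) bucket with
    | Some shared -> shared
    | None -> memo.(key) <- element :: bucket; element
end

module Dag = Share (struct type domain = tree let size = hash_max end)

let hash1 key index sum = sum + index * key
let hash span = span mod hash_max

(* Bottom-up traversal: children are processed first, then lookup is
   applied to the rebuilt node together with its hash key. *)
let traverse lookup =
  let rec travel (Tree forest) =
    let f (trees, index, span) t =
      let (t0, k) = travel t in
      (t0 :: trees, index + 1, hash1 k index span) in
    let (forest0, _, span) = List.fold_left f ([], 1, 0) forest in
    let key = hash span in
    (lookup (Tree (List.rev forest0)) key, key)
  in
  travel

let compress = traverse Dag.share
let minimize tree = fst (compress tree)

let () =
  let leaf () = Tree [] in
  let sub () = Tree [leaf (); leaf ()] in
  let t = Tree [sub (); sub ()] in
  let dag = minimize t in
  assert (dag = t);                   (* same tree, structurally *)
  match dag with
  | Tree [a; b] -> assert (a == b)    (* equal subtrees are now one object *)
  | _ -> assert false
```

With this unlabelled tree type the slide's hash gives key 0 at every node, so all nodes land in one bucket. That does not affect correctness here, only the length of the bucket; a version for tries would presumably mix the letters and the Boolean into the key.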
-26-
Advantages and extensions
Hashing keys and size is on the client side: we do not delegate hashing to Share, which is just an associative memory. This has two advantages:
• The computation is fully linear
• It is adapted to the statistics of the data
Extension: Auto-sharing types (controlled hash-consing). Suggests a monad of shared hashed structures accommodating entropy of the data.
-27-
Dagified lexicons
We may dagify a lexicon a posteriori in one pass:
value rec dagify () =
  let lexicon = (input_value stdin : Trie.trie)
  in let dag = Mini.minimize lexicon in output_value stdout dag;
Or we may maintain a dagified structure by sharing dynamically when inserting words by appropriate modification of the zipper operations.
And now if we apply this technique to our english lexicon, with command dagify < english.rem > small.rem, we now get an optimal representation which only needs 1Mb of storage, half of the original ASCII string representation.
-28-
Pub
The recursive algorithms given so far are fairly straightforward. They are easy to debug, maintain and modify due to the strong typing safeguard of ML, and even easy to formally certify. They are nonetheless efficient enough for production use, thanks to the optimizing native-code compiler of Objective Caml.
In our Sanskrit application, the trie of 11500 entries is shrunk from 219Kb to 103Kb in 0.1s, whereas the trie of 120000 flexed forms is shrunk from 1.63Mb to 140Kb in 0.5s on a 864MHz PC. Our trie of 173528 English words is shrunk from 4.5Mb to 1Mb in 2.7s. Measurements showed that the time complexity is linear with the size of the lexicon (within comparable sets of words).
-29-
Variations
Many variations on tries exist. Optimisations of lexical analysers for programming languages are described in the Dragon book. But the dragon book of computational linguistics has not been written yet.
Variation with ternary trees. Ternary trees are inspired from Bentley and Sedgewick. Ternary trees are more complex than tries, but use slightly less storage. Access is potentially faster in balanced trees than tries. A good methodology seems to be to use tries for editing, and to translate them to balanced ternary trees for production use with a fixed lexicon.
The ternary version of our english lexicon takes 3.6Mb, a saving of 20% over its trie version using 4.5Mb. After dag minimization, it takes 1Mb, a saving of 10% over the trie dag version using 1.1Mb. For our sanskrit lexicon index, the trie takes 221Kb and the tertree 180Kb. Shared as dags the trie takes 103Kb and the tertree 96Kb.
-30-
Decos, Lexmaps, Autos
We understand the Trie structure of a set of Words as a special case of a finitely based mapping Deco = Word → Annotation in the case of Boolean annotations shared by prefix arguments (and by common subexpressions when shared).
We store morphology constructions as being of this type, and we investigate the reverse mapping by generalising them to relations, typically inductively defined through finite state machines.
The more sharing we get the better we optimise this data layout. It is thus of paramount importance that the annotations be local quasi-morphisms decorations.
-31-
Decos
type deco 'a = [ Deco of (list 'a * dforest 'a) ]
and dforest 'a = list (Word.letter * deco 'a);
We think of the decoration of type list 'a as an information associated with the word stored at that node.
We can easily generalize sharing to decorated tries. However, substantial savings will result only if the information at a given node is a function of the subtrie at that node, i.e. if such information is defined as a trie morphism.
Definition. A deco is a tree morphism if the information at every node is a function of the corresponding sub-tree. Such decos preserve the sharing of the trees they decorate.
-32-
Encoding morphological parameters as decorations
We thus profit from the regularity of morphological transformations to have terse representations of the lexicon decorated by grammatical information. Thus if all plurals are obtained by adding ‘s’ to the singular stem except for a few exceptions, we do not pay any cost in encoding this plural information as an explicit instruction [pl: suffix s] decorating the stems, since it will not create any new node except for the few exceptions. Listing the plural form explicitly, by contrast, would undo all sharing.
In our sanskrit implementation, the various genders associated with a noun stem are defined in a deco used for producing the flexed forms. The flexed forms are then generated using an ad-hoc internal sandhi algorithm, difficult to encode as a finite-state process, and thus difficult to invert.
-33-
(Aside) The scoping structure of the lexicon
How to find the stem associated with a gender in the lexicon in one click so that morphology may be displayed - with no need of script or applet.
Simple distributed architecture - all the computation is done on the server side.
Maintaining computational invariants in the lexicon augments its robustness.
-34-
Explicit morphology vs implicit morphology
By explicit morphology I mean listing explicitly the forms generated by morphology operations from root stems, prefixes and suffixes.
By implicit morphology I mean just having programs which will generate these flexed forms on demand.
Implicit morphology is not enough to recognize the segments of sentences identical with a flexed form: the morphological functions must be invertible.
-35-
Compromise
On the other hand, the delimitation between implicit and explicit is blurred since e.g. a finite-state machine state graph may be both considered a program and a piece of data; for instance, a trie stores words, but actually the words are “recognized as being in the lexicon” by “running the lexicon over them as input data”.
Thus we shall represent “explicitly” flexed forms and the information on how they are derived from root stems as a trie bearing as decorations instructions on how to “undo morphology” locally. For this purpose, we shall use the notion of differential word above. We may now store inverse maps of lexical relations (such as morphology derivations) using the Lexmap structure.
This way we bypass the (hard) problem of internal sandhi fsm axiomatisation.
-36-
Lexmaps
type inverse 'a = (Word.delta * 'a)
and inverse_map 'a = list (inverse 'a);
type lexmap 'a = [ Map of (inverse_map 'a * mforest 'a) ]
and mforest 'a = list (Word.letter * lexmap 'a);
Typically, if word w is stored at a node Map ([...; (d, r); ...], ...), this represents the fact that w is the image by relation r of w' = patch d w. Such a lexmap is thus a representation of the image by r of a source lexicon. This representation is invertible, while preserving maximally the sharing of prefixes, and thus being amenable to sharing.
Example: cats and dogs sharing their ‘s’ node while implicitly referring to their respective singular stem.
-37-
Lexicon repositories using tries and decos
In a typical computational linguistics application, grammatical information (part of speech role, gender/number for substantives, valency and other subcategorization information for verbs, etc) may be stored as decoration of the lexicon of roots/stems. From such a decorated trie a morphological processor may compute the lexmap of all flexed forms, decorated with their derivation information encoded as an inverse map. This structure may itself be used by a tagging processor to construct the linear representation of a sentence decorated by feature structures. Such a representation will support further processing, such as computing syntactic and functional structures, typically as solutions of constraint satisfaction problems.
-38-
Example: Sanskrit
The main component in our tools is a structured lexical database. From this database, various hypertext documents may be produced mechanically. The index CGI engine searches for words by navigating in a persistent trie index of stem entries. The current database comprises 12000 items, and its index has a size of 103KB.
When computing this index, another persistent structure is created. It records in a deco all the genders associated with a noun entry. At present, this deco records genders for 5700 nouns, and it has a size of 268KB.
We iterate on this genders structure a grammatical engine, which generates declined forms. This lexmap records about 120000 such flexed forms with associated grammatical information, and it has a size of 341KB. A companion trie, without the information, keeps the index of flexed words as a minimized structure of 140KB.
-39-
Finite State Lore
Computational phonology and morphology use extensively finite state technology: rational languages and relations, transducers, bimachines, etc.
• Schützenberger
• Koskenniemi
• Kaplan and Kay
Finite state toolsets have been developed, where word transformations are systematically compiled in a low-level algebra of finite-state machines operators. Such toolsets have been developed at Xerox, Paris VII, Bell Labs, Mitsubishi Labs, etc. Compiling complex rewrite rules in rational transducers may be subtle. We depart from this fine-grained methodology and propose more direct translations preserving the structure of the lexicon.
-40-
Finite State Machines as Lexicon Morphisms
We start with the remark that a lexicon represented as a trie is directly the state space representation of the (deterministic) finite state machine that recognizes its words, and that its minimization consists exactly in sharing the lexical tree as a dag. We are in a case where the state graph of such finite languages recognizers is an acyclic structure. Such a pure data structure may be easily built without mutable references, which has computational and robustness advantages.
In the same spirit, we define automata which implement non-trivial rational relations (and their inversion) and whose state structure is nonetheless a more or less direct decoration of the lexicon trie. The crucial notion is that the state structure is a lexicon morphism.
-41-
Unglueing
We start with a toy problem which is the simplest case of juncture analysis, namely when there are no non-trivial juncture rules, and segmentation consists just in retrieving the words of a sentence glued together in one long string of characters (or phonemes). Consider for instance written English. You have a text file consisting of a sequence of words separated with blanks, and you have a lexicon complete for this text (for instance, ‘spell’ has been successfully applied). Now, suppose you make some editing mistake, which removes all spaces, and the task is to undo this operation to restore the original.
The transducer is defined as a functor, taking the lexicon trie structure as parameter.
-42-
Unglue
module Unglue (Lexicon : sig value lexicon : Trie.trie; end) = struct
type input = Word.word (* input sentence as a word *)
and output = list Word.word; (* output is sequence of words *)
type backtrack = (input * output)
and resumption = list backtrack; (* coroutine resumptions *)
exception Finished;
We define our unglueing reactive engine as a recursive process which navigates directly on the (flexed) lexicon trie (typically the compressed trie resulting from the Dag module considered above).
-43-
The reactive engine
The reactive engine takes as arguments the (remaining) input, the (partially constructed) list of words returned as output, a backtrack stack whose items are (input, output) pairs, the path occ in the state graph stacking (the reverse of) the current common prefix of the candidate words, and finally the current trie node as its current state. When the state is accepting, we push it on the backtrack stack, because we want to favor possible longer words, and so we continue reading the input until either we exhaust the input, or the next input character is inconsistent with the lexicon data.
-44-
The reactive engine code
value rec react input output back occ = fun
  [ Trie (b, forest) ->
      if b then
        let pushout = [occ :: output] in
        if input = [] then (pushout, back) (* solution found *)
        else let pushback = [(input, pushout) :: back]
             in continue pushback
      else continue back
      where continue cont = match input with
        [ [] -> backtrack cont
        | [letter :: rest] ->
            try let next_state = List.assoc letter forest
                in react rest output cont [letter :: occ] next_state
            with [ Not_found -> backtrack cont ]
        ]
  ]
-45-
Backtrack
and backtrack = fun
  [ [] -> raise Finished
  | [(input, output) :: back] ->
      react input output back [] Lexicon.lexicon
  ];
Now, unglueing a sentence is just calling the reactive engine from the appropriate initial backtrack situation.
value unglue sentence = backtrack [(sentence, [])];
-46-
Remark
Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG black box?
The three golden rules of non-deterministic programming:
• Identify well your search state space
• Represent states as non-mutable data
• Prove termination
The last point is essential for understanding the granularity and enforcing completeness.
-47-
More on state space considerations
This non-deterministic process (recognizing L*) uses the same state space as the lexicon/trie (recognizing L).
This corresponds to the fact that an automaton for L* may be obtained from the automaton for L by inserting ε-moves from accepting nodes to the initial node. But such transitions may be kept completely implicit. All you have to do is to manage the necessary non-determinism (continuing in L, which is not in general a prefix language (i.e. it may happen that both w and w·s are in L), versus iterating) in the backtrack stack, but you do not have to modify at all the state space data structure. It is just a shift in point of view concerning this data.
-48-
Still more on state space considerations
Remember that dagified tries define the minimal automaton of a finite language L.
But it is not the case that this automaton, completed with ε-transitions, is minimal for L*. Consider for instance L = {a, aa}.
However, note that we are using it as a transducer computing justifications for a word in L* to be a concatenation of precise words of L, and the minimal automaton does not keep enough information for that: distinct segmentations of a sentence must be separated.
-49-
Childtalk
module Childtalk = struct
value lexicon = Lexicon.make_lex ["boudin"; "caca"; "pipi"];
end;
module Childish = Unglue (Childtalk);
let (sol, _) = Childish.unglue (Word.encode "pipicacaboudin")
in Childish.print_out sol;
We recover as expected: pipi caca boudin.
-50-
Generating several solutions
We resume a resumption with resume : (resumption -> int -> resumption).
value resume cont n =
  let (output, resumption) = backtrack cont in
  do { print_string "\nSolution "; print_int n; print_string ":\n";
       print_out output; resumption };
value unglue_all sentence = restore [(sentence, [])] 1
  where rec restore cont n =
    try let resumption = resume cont n
        in restore resumption (n + 1)
    with [ Finished -> if n = 1 then print_string "No solution found\n" else () ];
-51-
Solving a charade
module Short = struct
value lexicon = Lexicon.make_lex
  ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;
module Charade = Unglue (Short);
Charade.unglue_all (Word.encode "amiabletogether");
Solution 1: amiable together
Solution 2: amiable to get her
Solution 3: am i able together
Solution 4: am i able to get her
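The unglueing engine can be tried out with the following self-contained sketch in standard OCaml syntax. It departs from the slides in several ways, all of them assumptions made for brevity: letters are characters and not integers, the lexicon is passed as an explicit argument in place of the functor, enter is a plain recursive insertion, and unglue_all returns the list of segmentations where the slides print them.

```ocaml
type word = char list
type trie = Trie of bool * (char * trie) list

let explode s = List.init (String.length s) (String.get s)
let implode w = String.concat "" (List.map (String.make 1) w)

(* Plain recursive insertion (a stand-in for Trie.enter). *)
let rec enter (Trie (b, forest)) = function
  | [] -> Trie (true, forest)
  | c :: rest ->
      let sub = try List.assoc c forest with Not_found -> Trie (false, []) in
      Trie (b, (c, enter sub rest) :: List.remove_assoc c forest)

let make_lex =
  List.fold_left (fun lex s -> enter lex (explode s)) (Trie (false, []))

exception Finished

(* react follows the slide: on an accepting state, push a resumption
   and keep reading so that longer words are tried first. *)
let rec react lexicon input output back occ (Trie (b, forest)) =
  let continue cont =
    match input with
    | [] -> backtrack lexicon cont
    | letter :: rest ->
        (match List.assoc_opt letter forest with
         | Some next -> react lexicon rest output cont (letter :: occ) next
         | None -> backtrack lexicon cont)
  in
  if b then
    let pushout = occ :: output in
    if input = [] then (pushout, back) (* solution found *)
    else continue ((input, pushout) :: back)
  else continue back

and backtrack lexicon = function
  | [] -> raise Finished
  | (input, output) :: back -> react lexicon input output back [] lexicon

(* Collect all segmentations, in the order the engine finds them. *)
let unglue_all lexicon sentence =
  let rec collect cont acc =
    match backtrack lexicon cont with
    | (output, resumption) ->
        let words = List.rev_map (fun occ -> implode (List.rev occ)) output in
        collect resumption (words :: acc)
    | exception Finished -> List.rev acc
  in
  collect [(explode sentence, [])] []

let () =
  let lex =
    make_lex ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"] in
  List.iter
    (fun words -> print_endline (String.concat " " words))
    (unglue_all lex "amiabletogether")
```

Since states are immutable data, a resumption is just a value: asking for the next solution means calling backtrack on the stack returned with the previous one.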
-52-
Juncture euphony and its discretization
When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:
[x] u | v → w
This juncture euphony, or external sandhi, is actually recorded in sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.
-53-
[Figure: diagram with labels u, v, w, x]
-54-
[Figure: diagram with labels z, u, v, w, x]
-55-
Auto
type lexicon = trie
and rule = (word * word * word);
The rule triple (rev u, v, w) represents the string rewrite u|v → w. Now for the transducer state space:
type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;
module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);
-56-
Compiling the lexicon to a minimal transducer
(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
      let local_stack = if b then get_sandhi occ else [] in
      let f (deter, stack, span) (n, t) =
        let current = [n :: occ] (* current occurrence *) in
        let (auto, st, k) = build_auto current t in
        ([(n, auto) :: deter], merge st stack, hash1 n k span) in
      let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
      let (h, l) = match stack with
        [ [] -> ([], [])
        | [h :: l] -> (h, l)
        ] in
      let key = hash b span h in
      let s = Auto.share (State (b, deter, h)) key in
      (s, merge local_stack l, key)
  ];
-57-
Segmenting Transducer Data Structures
type transition =
  [ Euphony of rule (* (rev u, v, w) s.t. u|v -> w *)
  | Id (* identity or no sandhi *)
  ]
and output = list (word * transition);
type backtrack =
  [ Next of (input * output * word * choices)
  | Init of (input * output)
  ]
and resumption = list backtrack; (* coroutine resumptions *)
exception Finished;
-58-
Running the Segmenting Transducer
value rec react input output back occ = fun
  [ State (b, det, choices) ->
      (* we try the deterministic space first *)
      let deter cont = match input with
        [ [] -> backtrack cont
        | [letter :: rest] ->
            try let next_state = List.assoc letter det
                in react rest output cont [letter :: occ] next_state
            with [ Not_found -> backtrack cont ]
        ] in
      let nondets = if choices = [] then back
                    else [Next (input, output, occ, choices) :: back] in
      if b then
        let out = [(occ, Id) :: output] (* opt final sandhi *)
-59-
        in if input = [] then (out, nondets) (* solution *)
           else let alterns = [Init (input, out) :: nondets]
                (* we first try the longest matching word *)
                in deter alterns
      else deter nondets
  ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [((u, v, w) as rule) :: others] ->
      let alterns = [Next (input, output, occ, others) :: back] in
      if prefix w input then
        let tape = advance (length w) input
        and out = [(u @ occ, Euphony (rule)) :: output] in
        if v = [] (* final sandhi *) then
          if tape = [] then (out, alterns) else backtrack alterns
-60-
        else let next_state = access v
             in react tape out alterns v next_state
      else backtrack alterns
  ]
and backtrack = fun
  [ [] -> raise Finished
  | [resume :: back] -> match resume with
      [ Next (input, output, occ, choices) ->
          choose input output back occ choices
      | Init (input, output) ->
          react input output back [] automaton
      ]
  ];
-61-
Example of Sanskrit Segmentation
process "tacchrutvaa";
Chunk: tacchrutvaa
may be segmented as:
Solution 1:
[tad with sandhi d|"s -> cch]
["srutvaa with no sandhi]
-62-
More examples
process "o.mnama.h\"sivaaya";
Solution 1:
[om with sandhi m|n -> .mn]
[namas with sandhi s|"s -> .h"s]
["sivaaya with no sandhi]
process "sugandhi.mpu.s.tivardhanam";
Solution 1:
[sugandhim with sandhi m|p -> .mp]
[pu.s.ti with no sandhi]
[vardhanam with no sandhi]
-63-
Sanskrit Tagging
process "sugandhi.mpu.s.tivardhanam";
Solution 1:
[sugandhim
  <{acc. sg. m.}[sugandhi]> with sandhi m|p -> .mp]
[pu.s.ti <{iic.}[pu.s.ti]> with no sandhi]
[vardhanam
  <{acc. sg. m. | acc. sg. n. | nom. sg. n.
   | voc. sg. n.}[vardhana]> with no sandhi]
-64-
Statistics
The complete automaton construction from the flexed forms lexicon takes only 9s on a 864MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which accepting states, fitting in 746KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6MB!
The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.
-65-
Overgeneration Problems
Very short particles have to be treated differently, or otherwise there would be intolerable overgeneration. Probably prosody will have to come to the rescue. The case of vedic “u”.
Compounds. The bahuvrīhi problem.
Intrinsic overgeneration. a+a = a+ā = ā+a = ā+ā = ā. Most s.m. end with a, many s.f. end with ā, the preverb ā (towards) is frequent, the prefix a is common (negation). So there is often room for interpretation!
E.g. na asato vidyate bhāvo na abhāvo vidyate sataḥ vs na asato vidyate abhāvo na abhāvo vidyate sataḥ
Double entendre poetry.
-66-
Soundness and Completeness of the Algorithms
Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.
Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.
Cf. http://pauillac.inria.fr/~huet/FREE/tagger.ps
-67-
Where is the information?
Mel’cuk says “Everything is in the lexicon”.
The key concept is lexicon directed. So most of the information is indeed in the lexicon. But a lot of phonological information (sandhi rules) and grammatical knowledge is in the code.
If time permits. A tour of the dictionary structures.
-68-
Enjoy!
• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/
• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/FREE/tagger.ps
• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps
• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps
• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar
• Objective Caml: http://caml.inria.fr/ocaml/
-69-
What next (on the Sanskrit front)
• Sanskrit 1. Verb morphology, Corpus testing, Lexicon acquisition mode, Segmentation training, Philology assistant (Scharf, Smith)
• Sanskrit 2. Sentinels, Prosody, Valency checking, Dependency synthesis
• Sanskrit 3. Discourse analysis: Reference, Scope, Theme, Focus, Anaphora resolution, Extra-linguistic information
• Sanskrit ∞. Distributed development of multilingual tools, Saving the Pune dictionary project
-70-
What next (on the Zen front)
• Zen maintenance. Distribution, Hotline, Users’ club, Coordination of extensions
• Zen immediate extensions. Grafting of regular relations, Rules compiler
• Towards a more comprehensive generic platform for computational linguistics, accommodating the levels of Syntax, Semantics, and Discourse Information Dynamics
-71-