Lexicon-directed Segmentation and Tagging of Sanskrit
Gérard Huet
XIIth World Sanskrit Conference
Helsinki, July 2003
Abstract
We propose an algorithm for segmenting a continuous Sanskrit text by reverse analysis of sandhi. It consists in constructing a finite-state transducer whose state graph is obtained from the lexicon trie of flexed forms of words by decoration with choice points labeled with junction rewrite rules of the form [x] u|v → w. Such a rule means that in the (left) context x, a suffix u of a word merges with a prefix v of the succeeding word to form the phoneme stream w. These rules are compiled from external sandhi tables.
It is shown that the method is sound and complete, in that it produces all correct sandhi analyses as a finite set of segmentation solutions. Since the method is lexicon directed, and the morphological structure is invertible, this gives automatically for each segmentation a sequence of root words tagged with their grammatical features. Such taggings are thus a first approximation of the shallow syntax of the sentence. It is expected that a further analysis of the subcategorization patterns of finite verbal forms, as well as concord constraints, will trim this set of candidate parses to a manageably small forest of acceptable interpretations. Further training with manually tagged corpuses is expected to yield a useful tool for assisting scholars in establishing critical editions, computing concordance indexes, and compiling statistical profiles. A robust mode will facilitate lexicon acquisition from the corpus in order to bootstrap from an initial small lexicon (12000 stems yielding 200000 flexed forms) to a more complete lexicographic coverage.
The talk will describe how the method deals with compounds and how preverbs are precompiled in the flexed forms in order to avoid overgeneration, while preserving the left-to-right application of external sandhi.
Solving an English charade
module Short = struct
  value lexicon = Lexicon.make_lex
    ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;
module Charade = Unglue(Short);
Charade.unglue_all (Word.encode "amiabletogether");
Solution 1 : amiable together
Solution 2 : amiable to get her
Solution 3 : am i able together
Solution 4 : am i able to get her
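
The charade can be reproduced with a few lines of plain recursion. The following is a minimal sketch in standard OCaml syntax (the slides use the revised syntax); it does not use the Lexicon, Word or Unglue modules, and the names `starts_at`, `by_length_desc` and `unglue` are hypothetical helpers introduced only for illustration.

```ocaml
(* Sketch only: a list-based unglueing, not the trie-based Unglue functor. *)
let lexicon = ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]

(* [starts_at w s i] holds when word [w] occurs in [s] at offset [i]. *)
let starts_at w s i =
  let n = String.length w in
  i + n <= String.length s && String.sub s i n = w

(* Longest words first, so that the longest match is tried first. *)
let by_length_desc =
  List.sort (fun a b -> compare (String.length b) (String.length a)) lexicon

(* All ways of reading [s] from offset [i] as a sequence of lexicon words. *)
let rec unglue s i =
  if i = String.length s then [ [] ]
  else
    List.concat
      (List.map
         (fun w ->
           if starts_at w s i then
             List.map (fun rest -> w :: rest) (unglue s (i + String.length w))
           else [])
         by_length_desc)

let solutions = unglue "amiabletogether" 0

let () =
  List.iter (fun ws -> print_endline (String.concat " " ws)) solutions
```

The sketch rescans the whole lexicon at every position; sharing prefixes in a trie, as the slides do, avoids that cost.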
Juncture euphony and its discretization
When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:
[x] u|v → w
This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.
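
Read in the forward direction, a rule [x] u|v → w glues two words together. The following is a minimal sketch in standard OCaml, assuming words are plain strings in the transliteration used in the examples of this talk; the record type `rule` and the helpers `ends_with`, `starts_with` and `apply` are hypothetical names, not part of the author's library.

```ocaml
(* Sketch only: one contextual rule [ctx] u|v -> w applied to two strings. *)
type rule = { ctx : string; u : string; v : string; w : string }

let ends_with s suf =
  let n = String.length s and k = String.length suf in
  k <= n && String.sub s (n - k) k = suf

let starts_with s pre =
  let k = String.length pre in
  k <= String.length s && String.sub s 0 k = pre

(* [apply r left right] returns the merged stream when the rule fires:
   [left] must end with ctx followed by u, and [right] must start with v. *)
let apply r left right =
  if ends_with left (r.ctx ^ r.u) && starts_with right r.v then
    let keep = String.sub left 0 (String.length left - String.length r.u) in
    let lv = String.length r.v in
    let rest = String.sub right lv (String.length right - lv) in
    Some (keep ^ r.w ^ rest)
  else None

(* d|"s -> cch, with an empty context *)
let d_s = { ctx = ""; u = "d"; v = "\"s"; w = "cch" }
let joined = apply d_s "tad" "\"srutvaa"
```

Segmentation runs such rules backwards: given the merged stream, it must guess where a w occurs and which u and v it came from.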
[Figure: diagrams of the juncture rewrite, with segments labeled x, u, v, w and z.]
Auto
type lexicon = trie
and rule = (word * word * word);
The rule triple (rev u, v, w) represents the string rewrite u|v → w. Now for the transducer state space:
type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;
module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);
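
One plausible reason for storing u reversed is that the path from the trie root to the current state is naturally accumulated in reverse, so testing whether the word read so far ends with u becomes a prefix test. The following is a minimal sketch of that reading in standard OCaml, assuming words are lists of characters; `explode`, `is_prefix` and `applicable` are hypothetical helpers and may differ from what `get_sandhi` does in the actual implementation.

```ocaml
(* Sketch only: selecting the rules (rev u, v, w) that fit a reversed path. *)
type word = char list
type rule = word * word * word (* (rev u, v, w) *)

let explode s = List.init (String.length s) (String.get s)

let rec is_prefix p l =
  match (p, l) with
  | [], _ -> true
  | x :: p', y :: l' -> x = y && is_prefix p' l'
  | _ :: _, [] -> false

(* [occ] is the reversed access path; a rule fits when rev u is a prefix. *)
let applicable (rules : rule list) (occ : word) =
  List.filter (fun (rev_u, _, _) -> is_prefix rev_u occ) rules

let rules =
  [ (List.rev (explode "as"), explode "d", explode "od"); (* as|d -> od *)
    (explode "m", explode "p", explode ".mp") ]           (* m|p -> .mp *)

let occ = List.rev (explode "maarjaaras")
let fitting = applicable rules occ
```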
Compiling the lexicon to a minimal transducer
(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
    let local_stack = if b then get_sandhi occ else [] in
    let f (deter, stack, span) (n, t) =
      let current = [n :: occ] (* current occurrence *) in
      let (auto, st, k) = build_auto current t in
      ([(n, auto) :: deter], merge st stack, hash1 n k span) in
    let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
    let (h, l) = match stack with
      [ [] -> ([], []) | [h :: l] -> (h, l) ] in
    let key = hash b span h in
    let s = Auto.share (State (b, deter, h)) key in
    (s, merge local_stack l, key) ];
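
The effect of sharing can be seen on a toy lexicon. The following is a minimal sketch in standard OCaml that builds a plain trie and then gives structurally equal subtrees the same integer identifier through a hash table. It ignores sandhi choice points and does not use the Share functor; `insert`, `make_trie`, `size`, `table` and `share` are hypothetical names.

```ocaml
(* Sketch only: bottom-up sharing of equal subtrees of a lexicon trie. *)
type trie = Trie of bool * (char * trie) list

let rec insert word i (Trie (b, arcs)) =
  if i = String.length word then Trie (true, arcs)
  else
    let c = word.[i] in
    let sub = try List.assoc c arcs with Not_found -> Trie (false, []) in
    Trie (b, (c, insert word (i + 1) sub) :: List.remove_assoc c arcs)

let make_trie words =
  List.fold_left (fun t w -> insert w 0 t) (Trie (false, [])) words

(* Number of nodes of the unshared trie. *)
let rec size (Trie (_, arcs)) =
  List.fold_left (fun n (_, t) -> n + size t) 1 arcs

(* One entry per distinct state: accepting flag plus sorted outgoing arcs. *)
let table : (bool * (char * int) list, int) Hashtbl.t = Hashtbl.create 97

let rec share (Trie (b, arcs)) =
  let key =
    (b, List.sort compare (List.map (fun (c, t) -> (c, share t)) arcs)) in
  match Hashtbl.find_opt table key with
  | Some id -> id
  | None ->
      let id = Hashtbl.length table in
      Hashtbl.add table key id;
      id

let t = make_trie ["bat"; "cat"; "rat"]
let root_id = share t

let () =
  Printf.printf "%d trie nodes, %d shared states\n"
    (size t) (Hashtbl.length table)
```

The three words share the suffix "at", so the three branches collapse into one chain of states.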
Running the Segmenting Transducer
value rec react input output back occ = fun
  [ State (b, det, choices) ->
    (* we try the deterministic space first *)
    let deter cont = match input with
      [ [] -> backtrack cont
      | [letter :: rest] ->
        try let next_state = List.assoc letter det in
            react rest output cont [letter :: occ] next_state
        with [ Not_found -> backtrack cont ] ] in
    let nondets = if choices = [] then back
                  else [Next (input, output, occ, choices) :: back] in
    if b then
      let out = [(occ, Id) :: output] (* opt final sandhi *) in
      if input = [] then (out, nondets) (* solution *)
      else let alterns = [Init (input, out) :: nondets]
           (* we first try the longest matching word *)
           in deter alterns
    else deter nondets ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [((u, v, w) as rule) :: others] ->
    let alterns = [Next (input, output, occ, others) :: back] in
    if prefix w input then
      let tape = advance (length w) input
      and out = [(u @ occ, Euphony (rule)) :: output] in
      if v = [] (* final sandhi *) then
        if tape = [] then (out, alterns) else backtrack alterns
      else let next_state = access v in
           react tape out alterns v next_state
    else backtrack alterns ]
and backtrack = fun
  [ [] -> raise Finished
  | [resume :: back] -> match resume with
    [ Next (input, output, occ, choices) ->
        choose input output back occ choices
    | Init (input, output) ->
        react input output back [] automaton ] ];
Example of Sanskrit Segmentation
process "tacchrutvaa";
Chunk: tacchrutvaa
may be segmented as:
Solution 1 :
[ tad with sandhi d|"s -> cch]
[ "srutvaa with no sandhi]
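
The analysis of this chunk can be imitated without any automaton. The following is a minimal sketch in standard OCaml that enumerates segmentations by brute force over a two-word lexicon and a single context-free rule. It is not the transducer of the previous slides: contexts and final sandhi are ignored, and `segment`, `drop`, `starts_with` and `ends_with` are hypothetical helpers.

```ocaml
(* Sketch only: brute-force sandhi analysis over a list lexicon. *)
let lexicon = ["tad"; "\"srutvaa"]
let rules = [ ("d", "\"s", "cch") ] (* (u, v, w) stands for u|v -> w *)

let starts_with s pre =
  let k = String.length pre in
  k <= String.length s && String.sub s 0 k = pre

let ends_with s suf =
  let n = String.length s and k = String.length suf in
  k <= n && String.sub s (n - k) k = suf

let drop n s = String.sub s n (String.length s - n)

(* [pending] is the prefix v of the next word already absorbed by a rule. *)
let rec segment input pending =
  List.concat
    (List.map
       (fun word ->
         if not (starts_with word pending) then []
         else
           let body = drop (String.length pending) word in
           (* the word is followed by no sandhi *)
           let plain =
             if starts_with input body then
               let rest = drop (String.length body) input in
               if rest = "" then [ [ (word, None) ] ]
               else List.map (fun sol -> (word, None) :: sol) (segment rest "")
             else [] in
           (* the word's suffix u merged with the next word's prefix v *)
           let euphonic =
             List.concat
               (List.map
                  (fun ((u, v, w) as r) ->
                    if ends_with body u then
                      let stem =
                        String.sub body 0
                          (String.length body - String.length u) in
                      let surface = stem ^ w in
                      if starts_with input surface then
                        List.map
                          (fun sol -> (word, Some r) :: sol)
                          (segment (drop (String.length surface) input) v)
                      else []
                    else [])
                  rules) in
           plain @ euphonic)
       lexicon)

let solutions = segment "tacchrutvaa" ""
```

Each solution pairs a word with the rule applied at its right edge, mirroring the "with sandhi" and "with no sandhi" annotations of the output above.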
More examples
process "o.mnama.h\"sivaaya";
Solution 1 :
[ om with sandhi m|n -> .mn]
[ namas with sandhi s|"s -> .h"s]
[ "sivaaya with no sandhi]
process "sugandhi.mpu.s.tivardhanam";
Solution 1 :
[ sugandhim with sandhi m|p -> .mp]
[ pu.s.ti with no sandhi]
[ vardhanam with no sandhi]
Sanskrit Tagging
process "sugandhi.mpu.s.tivardhanam";
Solution 1 :
[ sugandhim
  <{ acc. sg. m. }[sugandhi]> with sandhi m|p -> .mp]
[ pu.s.ti <{ iic. }[pu.s.ti]> with no sandhi]
[ vardhanam
  <{ acc. sg. m. | acc. sg. n. | nom. sg. n.
   | voc. sg. n. }[vardhana]> with no sandhi]
The general case
process "me.saanajaa\"m\"sca";
Solution 1 :
[ me.saan
  <{ acc. pl. m. }[me.sa]> with no sandhi]
[ ajaan
  <{ acc. pl. m. }[aja#1] | { acc. pl. m. }[aja#2]>
  with sandhi n|c -> "m"sc]
[ ca
  <{ und. }[ca]> with no sandhi]
Solution 2 :
[ maa
  <{ und. }[maa#2] | { acc. sg. * }[aham]>
  with sandhi aa|i -> e]
[ i.saan
  <{ acc. pl. m. }[i.sa]> with no sandhi]
[ ajaan
  <{ acc. pl. m. }[aja#1] | { acc. pl. m. }[aja#2]>
  with sandhi n|c -> "m"sc]
[ ca
  <{ und. }[ca]> with no sandhi]
Statistics
The complete automaton construction from the flexed forms lexicon takes only 9 s on a 864 MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746 KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6 MB!
The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However, in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.
Soundness and Completeness of the Algorithms
Theorem. If the lexical system (L,R) is strict and weakly non-overlapping, s is an (L,R)-sentence iff the algorithm (segment_all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L,R)-sentence.
Fact. In classical Sanskrit, external sandhi is strongly non-overlapping in noun phrases.
Cf. http://pauillac.inria.fr/~huet/PUBLIC/tagger.pdf
Difficulties (noun phrases)
• Overgeneration with short particles āt, ām, upa
• Removal of meta-notations (liṅga)
• Clash of āya with genitives
• Overgeneration with -ga, -da, -pa, -ya, etc.
• Bahuvrīhi compounds
• sa, duals
Overgeneration is unavoidable
BG 24[2] 17
Chunk: naasatovidyatebhaava.h may be segmented as:
Solution Shankara:
[na] [asatas] [vidyate] [bhaavas]
Solution Madhva:
[na] [asatas] [vidyate] [abhaavas]
[Madhav Deshpande] Each commentator has his own logic to defend their own peculiar way segmenting the line, and it is clear that manuscripts alone do not help.
Difficulties (verb phrases)
How should preverb prefixing be modeled?
The natural idea would be to affix preverbs to conjugated verb forms, starting at roots, and to store the corresponding flexed forms along with the declined nouns. But this is not the right model for Sanskrit verbal morphology, because preverbs associate to root forms with external and not internal sandhi. And putting preverbs in parallel with root forms and noun forms will not work either, because the non-overlapping condition mentioned above fails for preverb ā. And this overlapping actually makes external sandhi non-associative. For instance, noting sandhi with the vertical bar, we get: (iha|ā)|ihi = ihā|ihi = ihehi (come here). Whereas: iha|(ā|ihi) = iha|ehi = *ihaihi, incorrect. This definitely dooms the idea of storing conjugated forms such as ehi.
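
The two bracketings can be checked mechanically. The following is a minimal sketch in standard OCaml, assuming one character per phoneme with `A` standing for ā (a convention of this sketch only), and a rule table limited to the four rules that the computation above uses; `sandhi` is a hypothetical helper that falls back to plain concatenation when no rule matches.

```ocaml
(* Sketch only: external sandhi on single-character phonemes, 'A' = long a. *)
let rules =
  [ ('a', 'A', "A");  (* a|A -> A  *)
    ('a', 'i', "e");  (* a|i -> e  *)
    ('A', 'i', "e");  (* A|i -> e  *)
    ('a', 'e', "ai") ](* a|e -> ai *)

let sandhi l r =
  if l = "" || r = "" then l ^ r
  else
    let u = l.[String.length l - 1] and v = r.[0] in
    match List.find_opt (fun (u', v', _) -> u' = u && v' = v) rules with
    | Some (_, _, w) ->
        String.sub l 0 (String.length l - 1)
        ^ w
        ^ String.sub r 1 (String.length r - 1)
    | None -> l ^ r

let left_first = sandhi (sandhi "iha" "A") "ihi"  (* (iha|A)|ihi *)
let right_first = sandhi "iha" (sandhi "A" "ihi") (* iha|(A|ihi) *)

let () = Printf.printf "%s versus %s\n" left_first right_first
```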
Phantom phonemes
The solution to this problem is to prepare ā-prefixed root forms in the case where the root form starts with i or ī or u or ū, the cases where a non-associative behaviour of external sandhi obtains. But instead of applying the standard sandhi rule ā|i = e (and similarly for ī) we use ā|i = *e where *e is a phantom phoneme which obeys special sandhi rules such as: a|*e = e and ā|*e = e. Through the use of this phantom phoneme, overlapping sandhis with ā are dealt with correctly. Similarly we introduce another phantom phoneme *o, obeying e.g. ā|u = *o (and similarly for ū) and a|*o = ā|*o = o.
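
The phantom rules can be added to a toy sandhi function to see the correct form come out of the right-nested bracketing. The following is a minimal sketch in standard OCaml, assuming one character per phoneme with `A`, `I`, `U` for ā, ī, ū and `E`, `O` for the phantoms *e and *o; these encodings and the helpers `sandhi` and `prefix_aa` are hypothetical and serve only as illustration.

```ocaml
(* Sketch only: phantom phonemes 'E' (*e) and 'O' (*o) in a toy rule table. *)
let rules =
  [ ('a', 'A', "A"); ('a', 'i', "e"); ('A', 'i', "e"); ('a', 'e', "ai");
    ('a', 'E', "e"); ('A', 'E', "e");   (* a|*e = e and A|*e = e *)
    ('a', 'O', "o"); ('A', 'O', "o") ]  (* a|*o = A|*o = o       *)

let sandhi l r =
  if l = "" || r = "" then l ^ r
  else
    let u = l.[String.length l - 1] and v = r.[0] in
    match List.find_opt (fun (u', v', _) -> u' = u && v' = v) rules with
    | Some (_, _, w) ->
        String.sub l 0 (String.length l - 1)
        ^ w
        ^ String.sub r 1 (String.length r - 1)
    | None -> l ^ r

(* Prepare the A-prefixed form of a non-empty root form, using a phantom
   phoneme when the root form starts with i, I, u or U. *)
let prefix_aa root =
  let tail = String.sub root 1 (String.length root - 1) in
  match root.[0] with
  | 'i' | 'I' -> "E" ^ tail
  | 'u' | 'U' -> "O" ^ tail
  | _ -> sandhi "A" root

let phony = prefix_aa "ihi"       (* "Ehi", the stored phony form *)
let come_here = sandhi "iha" phony
```

With the phony form stored in the lexicon, the right-nested reading yields the same surface form as the left-nested one.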
Preverb sequences
We propose to model the recognition of verbal phrases built from a sequence of noun phrases, a sequence of preverbs, and a conjugated root form by a cascade of segmenting automata, with an automaton for nouns (the one demonstrated above), an automaton for sequences of preverbs, and an automaton for conjugated root forms augmented with phony forms (i.e. ā prefixes using phantom phoneme sandhi). The sandhi prediction structure which controls the automaton is decomposed into three phases, Nouns, Preverbs and Roots. When we are in phase Nouns, we proceed either to more Nouns, or to Preverbs, or to Roots, except if the predicted prefix is phony, in which case we proceed to phase Roots. When we are in phase Preverbs, we proceed to Roots, except if the predicted prefix is phony, in which case we backtrack (since preverb ā is accounted for in Preverbs). Finally, if we are in phase Roots we backtrack.
Dispatch
This procedure is very explicitly stated in the ML function dispatch, which is the heart of the segmenting transducer control loop:
value dispatch phase input output back v =
  match phase with
  [ Nouns -> if phantom v then
               [Advance (Roots, input, output, v) :: back]
             else [Advance (Nouns, input, output, v) ::
                  [Advance (Preverbs, input, output, v) ::
                  [Advance (Roots, input, output, v) :: back]]]
  | Preverbs -> if phantom v then back
                else [Advance (Roots, input, output, v) :: back]
  | Roots -> back
  ];
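
The phase transitions can be exercised in isolation. The following is a minimal sketch of the same function in standard OCaml syntax, assuming that inputs and predicted prefixes are strings and that a phantom prefix is marked by a leading `*`; the concrete `resumption` type, this definition of `phantom` and the helper `phases` are assumptions of the sketch, not the author's definitions.

```ocaml
(* Sketch only: the dispatch function with simplified, concrete types. *)
type phase = Nouns | Preverbs | Roots

type resumption = Advance of phase * string * string list * string

(* Assumption: a predicted prefix is phony when it starts with '*'. *)
let phantom v = String.length v > 0 && v.[0] = '*'

let dispatch phase input output back v =
  match phase with
  | Nouns ->
      if phantom v then Advance (Roots, input, output, v) :: back
      else
        Advance (Nouns, input, output, v)
        :: Advance (Preverbs, input, output, v)
        :: Advance (Roots, input, output, v)
        :: back
  | Preverbs ->
      if phantom v then back else Advance (Roots, input, output, v) :: back
  | Roots -> back

(* The phases that will be tried, in order. *)
let phases rs = List.map (fun (Advance (p, _, _, _)) -> p) rs

let after_noun = phases (dispatch Nouns "hi" [] [] "i")
let after_noun_phony = phases (dispatch Nouns "hi" [] [] "*e")
```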
Preverbs
It remains to explain what forms to enter in the Preverbs automaton. We could of course just enter individual distinct preverbs, and allow looping in the Preverbs phase. But this would be grossly over-generating. At the other extreme, we could record in the lexicon the preverb sequences used with a given root. But then instead of one root forms automaton, we would have to use many different automata (at least one for every equivalence class of the relation “admits the same preverb sequences”). We propose a middle way, where we have one preverbs automaton storing all the preverb sequences used for at least one root. Namely: ati, adhi, adhyava, anu, anuparā, anupra, anuvi, antaḥ, apa, apā, api, abhi, abhini, abhipra, abhivi, abhisam, abhyā, abhyud, abhyupa, ava, ā, ud, udā, upa, upani, upasam, upā, upādhi, ni, nis, nirava, parā, pari, parini, parisam, paryupa, pi, pra, prati, pratini, prativi, pratisam, pratyā, pratyud, prani, pravi, pravyā, prā, vi, vini, viniḥ, viparā, vipari, vipra, vyati, vyapa, vyava, vyā, vyud, sa, saṃni, saṃpra, saṃprati, saṃpravi, saṃvi, sam, samava, samā, samud, samudā, samudvi, samupa.
We remark that preverb ā only occurs last in a sequence of preverbs, i.e. it can occur only next to the root. This justifies not having to augment the Preverbs sequences with phantom phonemes.
Demonstration: “come here”
Chunk: ihehi may be segmented as:
Solution 1 :
[ iha <{ und. }[iha]> with sandhi a|aa|i -> e]
[ aa|ihi <{ imp. sg. 2 }[aa-i#1]> with no sandhi]
Solution 2 :
[ iha <{ und. }[iha]> with sandhi a|i -> e]
[ ihi <{ imp. sg. 2 }[i#1]> with no sandhi]
Remarks
This exceptional treatment of the ā preverb corresponds to a special case in Pāṇini as well, which indicates that our approach is legitimate.
We remark that the ā preverb always occurs last in the preverbs sequence, an observation which to our knowledge is not made by Pāṇini.
Hint. Regard the * in phantom phonemes *e and *o as saying “jumping over ā”. We print them ā|i and ā|u respectively.
Phantom phonemes restore associativity of external sandhi.
State of the art of Sanskrit tagging
Chunk: maarjaarodugdha.mpibati
may be segmented as:
Solution 1 :
[ maarjaaras
  <{ nom. sg. m. }[maarjaara]> with sandhi as|d -> od]
[ dugdham
  <{ acc. sg. m. | acc. sg. n. | nom. sg. n. | voc. sg. n. }[dugdha]>
  with sandhi m|p -> .mp]
[ pibati
  <{ pr. sg. 3 }[paa#1]> with no sandhi]
What next
To know more
• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/
• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/PUBLIC/tagger.pdf
• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps
• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps
• Tutorial slides: http://pauillac.inria.fr/~huet/ZEN/Hyderabad.ps
• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar
• Objective Caml: http://caml.inria.fr/ocaml/