Lexicon-directed Segmentation and Tagging of Sanskrit

Gérard Huet

XIIth World Sanskrit Conference

Helsinki, July 2003


Abstract

We propose an algorithm for segmenting a continuous Sanskrit text by reverse analysis of sandhi. It consists in constructing a finite-state transducer whose state graph is obtained from the lexicon trie of flexed forms of words by decoration with choice points labeled with junction rewrite rules of the form [x] u|v → w. Such a rule means that in the (left) context x, a suffix u of a word merges with a prefix v of the succeeding word to form the phoneme stream w. These rules are compiled from external sandhi tables.

It is shown that the method is sound and complete, in that it produces all correct sandhi analyses as a finite set of segmentation solutions. Since the method is lexicon directed, and the morphological structure is invertible, this gives automatically for each segmentation a sequence of root words tagged with their grammatical features. Such taggings are thus a first approximation of the shallow syntax of the sentence. It is expected that a further analysis of the subcategorization patterns of finite verbal forms, as well as concord constraints, will trim this set of candidate parses to a manageably small forest of acceptable interpretations. Further training with manually tagged corpora is expected to yield a useful tool for assisting scholars in establishing critical editions, computing concordance indexes, and compiling statistical profiles. A robust mode will facilitate lexicon acquisition from the corpus in order to bootstrap from an initial small lexicon (12000 stems yielding 200000 flexed forms) to a more complete lexicographic coverage.

The talk will describe how the method deals with compounds and how preverbs are precompiled in the flexed forms in order to avoid overgeneration, while preserving the left-to-right application of external sandhi.


Solving an English charade

module Short = struct
  value lexicon = Lexicon.make_lex
    ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;

module Charade = Unglue(Short);

Charade.unglue_all (Word.encode "amiabletogether");

Solution 1: amiable together
Solution 2: amiable to get her
Solution 3: am i able together
Solution 4: am i able to get her
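
The same search can be sketched in standard OCaml syntax. This is only an illustration under simplifying assumptions: the lexicon is a plain string list rather than the ZEN trie, and `unglue_all` below is a hypothetical stand-in, not the library's `Charade.unglue_all`. It tries the longest matching word first, which appears to be the order of the solutions listed.

```ocaml
(* Hypothetical stand-in for the Unglue functor: the lexicon is a plain
   string list here, not the ZEN lexicon trie. *)
let lexicon = ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]

let starts_with w s =
  String.length w <= String.length s
  && String.sub s 0 (String.length w) = w

(* All ways of cutting [s] into lexicon words, longest first word first. *)
let rec unglue_all s =
  if s = "" then [[]]
  else
    lexicon
    |> List.filter (fun w -> starts_with w s)
    |> List.sort (fun a b -> compare (String.length b) (String.length a))
    |> List.concat_map (fun w ->
           let n = String.length w in
           let rest = String.sub s n (String.length s - n) in
           List.map (fun tail -> w :: tail) (unglue_all rest))

let () =
  List.iteri
    (fun i sol ->
      Printf.printf "Solution %d: %s\n" (i + 1) (String.concat " " sol))
    (unglue_all "amiabletogether")
```

A list scan per position is much slower than walking a trie, but it is enough to show that the charade has exactly four readings with this lexicon.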


Juncture euphony and its discretization

When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:

[x] u|v → w

This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.
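
To make the rule format concrete, here is a minimal sketch of the forward direction (gluing words with sandhi), written in standard OCaml. It is an assumption-laden illustration: the left context x is left out, words are Velthuis-encoded strings rather than ZEN words, and the names `rule`, `join` and `utter` are hypothetical. The three rules are the ones quoted in the examples of these slides.

```ocaml
(* A rule u|v -> w; the left context x of the slides is omitted here. *)
type rule = { u : string; v : string; w : string }

(* Three rules quoted in the slide examples (Velthuis encoding). *)
let rules =
  [ { u = "d"; v = "\"s"; w = "cch" };
    { u = "m"; v = "p"; w = ".mp" };
    { u = "as"; v = "d"; w = "od" } ]

let ends_with suffix s =
  let ls = String.length s and lx = String.length suffix in
  lx <= ls && String.sub s (ls - lx) lx = suffix

let starts_with prefix s =
  let lp = String.length prefix in
  lp <= String.length s && String.sub s 0 lp = prefix

(* Glue two words, rewriting the juncture when some rule applies. *)
let join left right =
  match
    List.find_opt (fun r -> ends_with r.u left && starts_with r.v right) rules
  with
  | Some r ->
    String.sub left 0 (String.length left - String.length r.u)
    ^ r.w
    ^ String.sub right (String.length r.v)
        (String.length right - String.length r.v)
  | None -> left ^ right

(* Apply sandhi left to right over a whole sentence. *)
let utter = function
  | [] -> ""
  | first :: rest -> List.fold_left join first rest
```

With these definitions, `utter ["tad"; "\"srutvaa"]` yields the chunk `tacchrutvaa`; segmentation is the inverse problem of recovering the word list from such a chunk.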


[Figures: two diagrams; only their labels survive (x, u, v, w, and z).]


Auto

type lexicon = trie
and rule = (word * word * word);

The rule triple (rev u, v, w) represents the string rewrite u|v → w. Now for the transducer state space:

type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;

module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);
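
For readers more used to standard OCaml than to the revised syntax, the state space can be transcribed roughly as follows. This is a sketch: the definitions of `letter` and `word`, the `encode` helper and the `accepts` function are assumptions added for illustration and are not taken from the slides.

```ocaml
(* Assumed representations; the slide does not define them. *)
type letter = int
type word = letter list

(* (rev u, v, w) stands for the rewrite u|v -> w. *)
type rule = word * word * word

type auto = State of bool * deter * choices
and deter = (letter * auto) list
and choices = rule list

(* Hypothetical encoding of lowercase ASCII: 'a' -> 1, ..., 'z' -> 26. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i] - 96)

(* A two-state automaton accepting only the word "a"; its accepting
   state carries one choice point, the rule a|i -> e. *)
let accepting =
  State (true, [], [ (List.rev (encode "a"), encode "i", encode "e") ])
let start = State (false, [ (1, accepting) ], [])

(* Follow deterministic arcs only; true when we stop on an accepting state. *)
let rec accepts (State (b, det, _)) = function
  | [] -> b
  | l :: rest ->
    (match List.assoc_opt l det with
     | Some next -> accepts next rest
     | None -> false)
```

The boolean marks accepting states, `deter` is the deterministic part inherited from the trie, and `choices` holds the sandhi rules that may fire at that state.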


Compiling the lexicon to a minimal transducer

(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
      let local_stack = if b then get_sandhi occ else [] in
      let f (deter, stack, span) (n, t) =
        let current = [ n :: occ ] (* current occurrence *) in
        let (auto, st, k) = build_auto current t in
        ([ (n, auto) :: deter ], merge st stack, hash1 n k span) in
      let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
      let (h, l) = match stack with
        [ [] -> ([], []) | [ h :: l ] -> (h, l) ] in
      let key = hash b span h in
      let s = Auto.share (State (b, deter, h)) key in
      (s, merge local_stack l, key) ];
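
The effect of sharing can be illustrated on a plain trie, without sandhi rules. The sketch below is not the ZEN `Share` functor: it uses a standard `Hashtbl` to number structurally equal subtries bottom-up, and the names `insert`, `make_lex`, `size` and `shared_size` are assumptions introduced for the example.

```ocaml
(* A trie node: acceptance flag plus arcs kept sorted by letter. *)
type trie = Trie of bool * (char * trie) list

(* Insert the suffix of [w] that starts at index [i]. *)
let rec insert w i (Trie (b, arcs)) =
  if i = String.length w then Trie (true, arcs)
  else
    let c = w.[i] in
    let child = try List.assoc c arcs with Not_found -> Trie (false, []) in
    let others = List.remove_assoc c arcs in
    Trie (b, List.sort compare ((c, insert w (i + 1) child) :: others))

let make_lex words =
  List.fold_left (fun t w -> insert w 0 t) (Trie (false, [])) words

(* Number of nodes of the unshared trie. *)
let rec size (Trie (_, arcs)) =
  List.fold_left (fun n (_, t) -> n + size t) 1 arcs

(* Number of distinct states once equal subtries are shared. *)
let shared_size t =
  let table = Hashtbl.create 97 in
  let rec share (Trie (b, arcs)) =
    let key = (b, List.map (fun (c, sub) -> (c, share sub)) arcs) in
    match Hashtbl.find_opt table key with
    | Some id -> id
    | None ->
      let id = Hashtbl.length table in
      Hashtbl.add table key id;
      id
  in
  ignore (share t);
  Hashtbl.length table

let lex =
  make_lex ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]
```

On the eight-word charade lexicon the trie has 26 nodes and the shared graph 17 states. In `build_auto` the sharing key also depends on the sandhi choices attached to a state, which this sketch leaves out.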


Running the Segmenting Transducer

value rec react input output back occ = fun
  [ State (b, det, choices) ->
      (* we try the deterministic space first *)
      let deter cont = match input with
        [ [] -> backtrack cont
        | [ letter :: rest ] ->
            try let next_state = List.assoc letter det in
                react rest output cont [ letter :: occ ] next_state
            with [ Not_found -> backtrack cont ] ] in
      let nondets = if choices = [] then back
                    else [ Next (input, output, occ, choices) :: back ] in
      if b then
        let out = [ (occ, Id) :: output ] (* opt final sandhi *) in
        if input = [] then (out, nondets) (* solution *)
        else let alterns = [ Init (input, out) :: nondets ]
             (* we first try the longest matching word *)
             in deter alterns
      else deter nondets ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [ ((u, v, w) as rule) :: others ] ->
      let alterns = [ Next (input, output, occ, others) :: back ] in
      if prefix w input then
        let tape = advance (length w) input
        and out = [ (u @ occ, Euphony (rule)) :: output ] in
        if v = [] (* final sandhi *) then
          if tape = [] then (out, alterns) else backtrack alterns
        else let next_state = access v in
             react tape out alterns v next_state
      else backtrack alterns ]
and backtrack = fun
  [ [] -> raise Finished
  | [ resume :: back ] -> match resume with
      [ Next (input, output, occ, choices) ->
          choose input output back occ choices
      | Init (input, output) ->
          react input output back [] automaton ] ];


Example of Sanskrit Segmentation

process "tacchrutvaa";

Chunk: tacchrutvaa
may be segmented as:

Solution 1:
[ tad with sandhi d|"s -> cch ]
[ "srutvaa with no sandhi ]
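
A toy version of this analysis can be written in a few lines of standard OCaml. It is a sketch under strong assumptions, not the transducer of the previous slides: the lexicon is a five-word list, only three rules quoted in these slides are known, the left context x is ignored, and the hypothetical function `segment` scans the lexicon instead of walking a shared trie.

```ocaml
(* Tiny sample lexicon and rules (u, v, w) meaning u|v -> w, Velthuis encoded. *)
let lexicon = ["tad"; "\"srutvaa"; "om"; "namas"; "\"sivaaya"]
let rules = [("d", "\"s", "cch"); ("m", "n", ".mn"); ("s", "\"s", ".h\"s")]

let starts_with p s =
  String.length p <= String.length s && String.sub s 0 (String.length p) = p
let ends_with p s =
  let lp = String.length p and ls = String.length s in
  lp <= ls && String.sub s (ls - lp) lp = p
let drop n s = String.sub s n (String.length s - n)

(* [segment pre input]: [pre] is the prefix v that the sandhi just undone
   imposes on the next word; the rest of that word is read from [input]. *)
let rec segment pre input =
  List.concat_map
    (fun word ->
      if not (starts_with pre word) then []
      else begin
        let body = drop (String.length pre) word in
        (* Case 1: no sandhi, the word body appears verbatim. *)
        let plain =
          if starts_with body input then
            let rest = drop (String.length body) input in
            if rest = "" then [[word]]
            else List.map (fun tl -> word :: tl) (segment "" rest)
          else [] in
        (* Case 2: the word ends in u and the input shows w in its place. *)
        let sandhi =
          List.concat_map
            (fun (u, v, w) ->
              if ends_with u body then
                let stem =
                  String.sub body 0 (String.length body - String.length u) in
                if starts_with (stem ^ w) input then
                  let rest =
                    drop (String.length stem + String.length w) input in
                  List.map (fun tl -> word :: tl) (segment v rest)
                else []
              else [])
            rules in
        plain @ sandhi
      end)
    lexicon
```

Calling `segment "" "tacchrutvaa"` returns the single analysis `tad` followed by `"srutvaa`. Unlike the real tool, the sketch does not record which rule was undone at each juncture.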


More examples

process "o.mnama.h\"sivaaya";

Solution 1:
[ om with sandhi m|n -> .mn ]
[ namas with sandhi s|"s -> .h"s ]
[ "sivaaya with no sandhi ]

process "sugandhi.mpu.s.tivardhanam";

Solution 1:
[ sugandhim with sandhi m|p -> .mp ]
[ pu.s.ti with no sandhi ]
[ vardhanam with no sandhi ]


Sanskrit Tagging

process "sugandhi.mpu.s.tivardhanam";

Solution 1:
[ sugandhim < { acc. sg. m. }[sugandhi] > with sandhi m|p -> .mp ]
[ pu.s.ti < { iic. }[pu.s.ti] > with no sandhi ]
[ vardhanam < { acc. sg. m. | acc. sg. n. | nom. sg. n. | voc. sg. n. }[vardhana] > with no sandhi ]


The general case

process "me.saanajaa\"m\"sca";

Solution 1:
[ me.saan < { acc. pl. m. }[me.sa] > with no sandhi ]
[ ajaan < { acc. pl. m. }[aja#1] | { acc. pl. m. }[aja#2] > with sandhi n|c -> "m"sc ]
[ ca < { und. }[ca] > with no sandhi ]

Solution 2:
[ maa < { und. }[maa#2] | { acc. sg. * }[aham] > with sandhi aa|i -> e ]
[ i.saan < { acc. pl. m. }[i.sa] > with no sandhi ]
[ ajaan < { acc. pl. m. }[aja#1] | { acc. pl. m. }[aja#2] > with sandhi n|c -> "m"sc ]
[ ca < { und. }[ca] > with no sandhi ]


Statistics

The complete automaton construction from the flexed forms lexicon takes only 9 s on an 864 MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746 KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6 MB!

The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However, in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.


Soundness and Completeness of the Algorithms

Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment_all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.

Fact. In classical Sanskrit, external sandhi is strongly non-overlapping in noun phrases.

Cf. http://pauillac.inria.fr/~huet/PUBLIC/tagger.pdf


Difficulties (noun phrases)

• Overgeneration with short particles āt, ām, upa

• Removal of meta-notations (liṅ-ga)

• Clash of āya with genitives

• Overgeneration with -ga, -da, -pa, -ya, etc.

• Bahuvrīhi compounds

• sa, duals


Overgeneration is unavoidable

BG24[2]17

Chunk: naasatovidyatebhaava.h may be segmented as:

Solution Shankara:
[na] [asatas] [vidyate] [bhaavas]

Solution Madhva:
[na] [asatas] [vidyate] [abhaavas]

[Madhav Deshpande] Each commentator has his own logic to defend their own peculiar way segmenting the line, and it is clear that manuscripts alone do not help.


Difficulties (verb phrases)

How should preverb prefixing be modeled?

The natural idea would be to affix preverbs to conjugated verb forms, starting at roots, and to store the corresponding flexed forms along with the declined nouns. But this is not the right model for Sanskrit verbal morphology, because preverbs associate to root forms with external and not internal sandhi. And putting preverbs in parallel with root forms and noun forms will not work either, because the non-overlapping condition mentioned above fails for preverb ā. And this overlapping actually makes external sandhi non-associative. For instance, noting sandhi with the vertical bar, we get: (iha | ā) | ihi = ihā | ihi = ihehi (come here). Whereas: iha | (ā | ihi) = iha | ehi = *ihaihi, incorrect. This definitely dooms the idea of storing conjugated forms such as ehi.
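
The failure of associativity can be replayed mechanically. The following sketch, in standard OCaml, assumes Velthuis-encoded strings (aa for ā) and a hand-picked list of four vowel sandhi rules; the function name `sandhi` and the rule ordering (longer u first) are choices made for this illustration only.

```ocaml
(* Rules (u, v, w) meaning u|v -> w; "aa" is listed before "a" so that the
   longer suffix wins when both match. *)
let rules =
  [ ("aa", "i", "e"); ("a", "aa", "aa"); ("a", "i", "e"); ("a", "e", "ai") ]

let ends_with u s =
  let lu = String.length u and ls = String.length s in
  lu <= ls && String.sub s (ls - lu) lu = u

let starts_with v s =
  String.length v <= String.length s && String.sub s 0 (String.length v) = v

(* Join two forms, rewriting the juncture with the first applicable rule. *)
let sandhi l r =
  match
    List.find_opt (fun (u, v, _) -> ends_with u l && starts_with v r) rules
  with
  | Some (u, v, w) ->
    String.sub l 0 (String.length l - String.length u)
    ^ w
    ^ String.sub r (String.length v) (String.length r - String.length v)
  | None -> l ^ r

let left = sandhi (sandhi "iha" "aa") "ihi"    (* (iha | aa) | ihi *)
let right = sandhi "iha" (sandhi "aa" "ihi")   (* iha | (aa | ihi) *)

let () = Printf.printf "%s vs %s\n" left right
```

The two bracketings give `ihehi` and `ihaihi` respectively, matching the correct and the incorrect form quoted above.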


Phantom phonemes

The solution to this problem is to prepare ā-prefixed root forms in the case where the root form starts with i or ī or u or ū, the cases where a non-associative behaviour of external sandhi obtains. But instead of applying the standard sandhi rule ā | i = e (and similarly for ī) we use ā | i = *e where *e is a phantom phoneme which obeys special sandhi rules such as: a | *e = e and ā | *e = e. Through the use of this phantom phoneme, overlapping sandhis with ā are dealt with correctly. Similarly we introduce another phantom phoneme *o, obeying e.g. ā | u = *o (and similarly for ū) and a | *o = ā | *o = o.
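
A small sketch in standard OCaml shows the phantom phoneme at work. It assumes Velthuis-encoded strings (aa for ā), writes the phantom phoneme as the two characters `*e`, and uses a hand-picked rule list; `sandhi` and `phony_ihi` are hypothetical names for this illustration.

```ocaml
(* Rules (u, v, w) meaning u|v -> w. The first two are the special rules
   for the phantom phoneme, written "*e" in this sketch. *)
let rules =
  [ ("aa", "*e", "e"); ("a", "*e", "e");
    ("aa", "i", "e"); ("a", "aa", "aa"); ("a", "e", "ai") ]

let ends_with u s =
  let lu = String.length u and ls = String.length s in
  lu <= ls && String.sub s (ls - lu) lu = u

let starts_with v s =
  String.length v <= String.length s && String.sub s 0 (String.length v) = v

(* Join two forms, rewriting the juncture with the first applicable rule. *)
let sandhi l r =
  match
    List.find_opt (fun (u, v, _) -> ends_with u l && starts_with v r) rules
  with
  | Some (u, v, w) ->
    String.sub l 0 (String.length l - String.length u)
    ^ w
    ^ String.sub r (String.length v) (String.length r - String.length v)
  | None -> l ^ r

(* Precompiled aa-prefixed form of ihi, using aa | i = *e. *)
let phony_ihi = "*ehi"

let with_phantom = sandhi "iha" phony_ihi        (* iha | *ehi *)
let stepwise = sandhi (sandhi "iha" "aa") "ihi"  (* (iha | aa) | ihi *)
let naive = sandhi "iha" (sandhi "aa" "ihi")     (* iha | ehi *)
```

Storing `*ehi` instead of `ehi` makes the precompiled form combine with `iha` into `ihehi`, the same result as applying sandhi word by word, whereas the naive stored form still produces `ihaihi`.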


Preverb sequences

We propose to model the recognition of verbal phrases built from a sequence of noun phrases, a sequence of preverbs, and a conjugated root form by a cascade of segmenting automata, with an automaton for nouns (the one demonstrated above), an automaton for sequences of preverbs, and an automaton for conjugated root forms augmented with phony forms (i.e. ā prefixes using phantom phoneme sandhi). The sandhi prediction structure which controls the automaton is decomposed into three phases, Nouns, Preverbs and Roots. When we are in phase Nouns, we proceed either to more Nouns, or to Preverbs, or to Roots, except if the predicted prefix is phony, in which case we proceed to phase Roots. When we are in phase Preverbs, we proceed to Roots, except if the predicted prefix is phony, in which case we backtrack (since preverb ā is accounted for in Preverbs). Finally, if we are in phase Roots we backtrack.


Dispatch

This procedure is very explicitly stated in the ML function dispatch which is the heart of the segmenting transducer control loop:

value dispatch phase input output back v =
  match phase with
  [ Nouns -> if phantom v then
               [ Advance (Roots, input, output, v) :: back ]
             else [ Advance (Nouns, input, output, v) ::
                    [ Advance (Preverbs, input, output, v) ::
                      [ Advance (Roots, input, output, v) :: back ] ] ]
  | Preverbs -> if phantom v then back
                else [ Advance (Roots, input, output, v) :: back ]
  | Roots -> back
  ];
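
The same control logic reads as follows in standard OCaml syntax. The types are assumptions made to keep the sketch self-contained: inputs and predicted prefixes are strings, outputs are string lists, and `phantom` is a hypothetical test that recognises a phony prefix by a leading `*`. The helper `next_phases` is added only to inspect the result.

```ocaml
type phase = Nouns | Preverbs | Roots

(* A resumption: the automaton to enter next, the pending input, the
   output so far, and the predicted prefix v. *)
type resumption = Advance of phase * string * string list * string

(* Assumption: a predicted prefix is phony when it starts with '*'. *)
let phantom v = String.length v > 0 && v.[0] = '*'

let dispatch phase input output back v =
  match phase with
  | Nouns ->
    if phantom v then Advance (Roots, input, output, v) :: back
    else
      Advance (Nouns, input, output, v)
      :: Advance (Preverbs, input, output, v)
      :: Advance (Roots, input, output, v)
      :: back
  | Preverbs ->
    if phantom v then back else Advance (Roots, input, output, v) :: back
  | Roots -> back

(* Phases that will be tried, in order, from an empty backtrack stack. *)
let next_phases phase v =
  List.map (fun (Advance (p, _, _, _)) -> p) (dispatch phase "" [] [] v)
```

From phase Nouns with an ordinary prefix the three automata are tried in the order Nouns, Preverbs, Roots; with a phony prefix only Roots is tried, and from Preverbs a phony prefix adds nothing to the backtrack stack.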


Preverbs

It remains to explain what forms to enter in the Preverbs automaton. We could of course just enter individual distinct preverbs, and allow looping in the Preverbs phase. But this would be grossly over-generating. At the other extreme, we could record in the lexicon the preverb sequences used with a given root. But then instead of one root forms automaton, we would have to use many different automata (at least one for every equivalence class of the relation "admits the same preverb sequences"). We propose a middle way, where we have one preverbs automaton storing all the preverb sequences used for at least one root. Namely: ati, adhi, adhyava, anu, anuparā, anupra, anuvi, antaḥ, apa, apā, api, abhi, abhini, abhipra, abhivi, abhisam, abhyā, abhyud, abhyupa, ava, ā, ud, udā, upa, upani, upasam, upā, upādhi, ni, nis, nirava, parā, pari, parini, parisam, paryupa, pi, pra, prati, pratini, prativi, pratisam, pratyā, pratyud, prani, pravi, pravyā, prā, vi, vini, viniḥ, viparā, vipari, vipra, vyati, vyapa, vyava, vyā, vyud, sa, saṃni, saṃpra, saṃprati, saṃpravi, saṃvi, sam, samava, samā, samud, samudā, samudvi, samupa.

We remark that preverb ā only occurs last in a sequence of preverbs, i.e. it can occur only next to the root. This justifies not having to augment the Preverbs sequences with phantom phonemes.


Demonstration: “come here”

Chunk: ihehi may be segmented as:

Solution 1:
[ iha < { und. }[iha] > with sandhi a|aa|i -> e ]
[ aa|ihi < { imp. sg. 2 }[aa-i#1] > with no sandhi ]

Solution 2:
[ iha < { und. }[iha] > with sandhi a|i -> e ]
[ ihi < { imp. sg. 2 }[i#1] > with no sandhi ]


Remarks

This exceptional treatment of the ā preverb corresponds to a special case in Pāṇini as well, which indicates that our approach is legitimate.

We remark that the ā preverb always occurs last in the preverbs sequence, an observation which to our knowledge is not made by Pāṇini.

Hint. Regard the * in phantom phonemes *e and *o as saying "jumping over ā". We print them ā|i and ā|u respectively.

Phantom phonemes restore associativity of external sandhi.


State of the art of Sanskrit tagging

Chunk: maarjaarodugdha.mpibati
may be segmented as:

Solution 1:
[ maarjaaras < { nom. sg. m. }[maarjaara] > with sandhi as|d -> od ]
[ dugdham < { acc. sg. m. | acc. sg. n. | nom. sg. n. | voc. sg. n. }[dugdha] > with sandhi m|p -> .mp ]
[ pibati < { pr. sg. 3 }[paa#1] > with no sandhi ]


What next


To know more

• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/

• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/PUBLIC/tagger.pdf

• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps

• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps

• Tutorial slides: http://pauillac.inria.fr/~huet/ZEN/Hyderabad.ps

• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar

• Objective Caml: http://caml.inria.fr/ocaml/
