(1)

Representation Structures for Computational Linguistics

Gérard Huet

ESSLLI 2002, Trento

-1-

(2)

What the course is about

• A computational platform for Sanskrit
• The ZEN computational morphology toolkit
• Pidgin ML
• The functional programming paradigm for CL
• Concrete programming issues in Objective Caml + Camlp4
• General architecture issues for a CL platform
• Cooperation on free CL resources

Two specific applicative technologies:

• Local processing of focused data
• Sharing

-2-

(3)

What shall not be discussed

• ML vs C++
• ML vs Java
• ML vs Prolog

-3-

(4)

What shall not be discussed at length

• Objective CAML vs SML
• ML vs Haskell
• ML vs C
• Pidgin ML vs Objective CAML

-4-

(5)

Basics: lists vs stacks

value l5 = [1; 2; 3; 4; 5];
value s5 = [5; 4; 3; 2; 1];

value rec unstack l s =
  match l with
  [ [] -> s
  | [h :: t] -> unstack t [h :: s]
  ];

value rev l = unstack l [];

value state3 = ([3; 2; 1], [4; 5]);

-5-

(6)

Turing machines, Emacs, and Zippers

Zippers. First presentation at FLoC'96. Published as: G. Huet. The Zipper. J. Functional Programming 7,5 (1997), 549-554.

Large scale implementations in syntax editors within computational linguistics platforms:

• G. Huet. Lexical morphisms with the Zen platform.
• A. Ranta. Grammatical frameworks.

-6-

(7)

Contexts as zippers

type tree = [ Tree of forest ] and forest = list tree;

type tree_zipper = [ Top | Zip of (forest * tree_zipper * forest) ];

type focused_tree = (tree_zipper * tree);

A focused tree is a tree with a focus point of interest, i.e. a tree and a stacked context.

-7-

(8)

Operations on focused trees

value down (z, t) = match t with
  [ Tree (forest) -> match forest with
      [ [] -> raise (Failure "down")
      | [hd :: tl] -> (Zip ([], z, tl), hd)
      ]
  ];

value up (z, t) = match z with
  [ Top -> raise (Failure "up")
  | Zip (l, u, r) -> (u, Tree (unstack l [t :: r]))
  ];

-8-

(9)

More operations on focused trees

value left (z, t) = match z with
  [ Top -> raise (Failure "left")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "left")
      | [elder :: elders] -> (Zip (elders, u, [t :: r]), elder)
      ]
  ];

value right (z, t) = match z with
  [ Top -> raise (Failure "right")
  | Zip (l, u, r) -> match r with
      [ [] -> raise (Failure "right")
      | [young :: rest] -> (Zip ([t :: l], u, rest), young)
      ]
  ];

-9-

(10)

Applicative updating

value del_l (z, _) = match z with
  [ Top -> raise (Failure "del_l")
  | Zip (l, u, r) -> match l with
      [ [] -> raise (Failure "del_l")
      | [elder :: elders] -> (Zip (elders, u, r), elder)
      ]
  ];

value replace (z, _) t = (z, t);
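
As a minimal sketch of how these moves compose, here is a rendering in standard OCaml syntax rather than Pidgin ML. The sample tree t0 and the use of List.rev_append in place of unstack are assumptions made for illustration only.

```ocaml
(* Focused trees in standard OCaml syntax (the slides use Pidgin ML). *)
type tree = Tree of tree list
type tree_zipper = Top | Zip of tree list * tree_zipper * tree list

(* Go down to the first child; the context remembers the right siblings. *)
let down (z, Tree forest) =
  match forest with
  | [] -> failwith "down"
  | hd :: tl -> (Zip ([], z, tl), hd)

(* Left siblings are stacked nearest-first, so going up reverses them back. *)
let up (z, t) =
  match z with
  | Top -> failwith "up"
  | Zip (l, u, r) -> (u, Tree (List.rev_append l (t :: r)))

let right (z, t) =
  match z with
  | Zip (l, u, y :: rest) -> (Zip (t :: l, u, rest), y)
  | _ -> failwith "right"

let replace (z, _) t = (z, t)

(* Hypothetical sample data: a root with three children. *)
let leaf = Tree []
let t0 = Tree [leaf; Tree [leaf]; leaf]

(* Visit the second child, replace it by a leaf, and rebuild the tree. *)
let t1 = snd (up (replace (right (down (Top, t0))) leaf))
```

The update is applicative: t0 is left untouched, and t1 shares with it every subtree that was not on the path to the focus.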

-10-

(11)

Points of view about focused structures

• Manipulation of focused data is local
• Redundant representation - efficiency
• The Interaction Combinators Paradigm

Remark. Zippers are linear contexts. They are superior to Ω-terms, notably because the approximation ordering is substructural.

The Natural Transformation from tree functors to zipper functors is Differentiation; Zippers may also be seen as the linear Functions over Trees.

-11-

(12)

Back to linguistics

We want to process (parse and generate) natural language sentences, dialogues, corpora of various kinds (oral, written, news, books, web sites, etc). We assume that the data is already digitized and discretized as a stream of letters (phonemes for oral data, letters for written data).

A fundamental entity in this processing is the word. One traditionally distinguishes processing between streams of letters and words (morphology, lexical analysis) and processing between words and sentences (syntax, parsing). However, the nature of the word is elusive.

-12-

(13)

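What Tesnière has to say

The linguist Tesnière, in his Éléments de Syntaxe Structurale, says:

“Pour simple qu'elle paraisse, la notion de mot est une de celles dont la définition est la plus délicate pour le linguiste. C'est peut-être que trop souvent on part de la notion de mot pour arriver à la notion de phrase, au lieu de partir de la notion de phrase pour arriver à la notion de mot. Or on ne saurait définir la phrase à partir du mot, mais seulement le mot à partir de la phrase. Car la notion de phrase est logiquement antérieure à celle de mot.”

(In English: “Simple as it may seem, the notion of word is among those whose definition is most delicate for the linguist. Perhaps this is because one too often starts from the notion of word to arrive at the notion of sentence, instead of starting from the notion of sentence to arrive at the notion of word. Yet one cannot define the sentence from the word, only the word from the sentence. For the notion of sentence is logically prior to that of word.”)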

-13-

(14)

Ontological Problem

What Tesnière really says is self-evident: it is the ontological priority of the Corpus over the Lexicon. The words are found in the Corpus, then copied to the Lexicon; the Language is defined by its Corpus.

The preeminence of the Corpus over the Lexicon is undeniable. Nevertheless, the words are recognized in the corpus relative to the generative devices of morphology; the inversion of these generative relations extends the strict covering of the corpus by the generative capabilities of the grammar; and thus there is a tension between the co-inductive structure of the lexicon as a repository of utterances and the inductive structure of words as generated by morphological devices of stems in the lexicon.

-14-

(15)

Philosophical considerations

Anecdote. The Thamadas in Georgia.

Puzzles. The ‘oui’ problem. The ‘oiu’ problem.

Research topic. Define the functor the fixpoint of which is constructed.

Technology. Chase out hapaxes. Or rather, properly index the diachronic dimension of the language under consideration.

-15-

(16)

Back to the Lexicon

Words. Words are represented as lists of positive integers.

type letter = int and word = list letter;

We provide coercions encode : string -> word and decode : word -> string. Here is the lexicographic ordering.

value rec lexico l1 l2 = match l1 with
  [ [] -> True
  | [c1 :: r1] -> match l2 with
      [ [] -> False
      | [c2 :: r2] -> if c2 < c1 then False
                      else if c2 = c1 then lexico r1 r2
                      else True
      ]
  ];
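
The coercions are only named on the slide. A minimal sketch in standard OCaml follows, under the loudly stated assumption that a letter is simply a character code; the toolkit's own coding of letters may well differ. The ordering is the one above, transcribed to standard syntax.

```ocaml
type letter = int
type word = letter list

(* Assumption: letters are character codes; the real coding may differ. *)
let encode (s : string) : word =
  List.init (String.length s) (fun i -> Char.code s.[i])

let decode (w : word) : string =
  String.concat "" (List.map (fun c -> String.make 1 (Char.chr c)) w)

(* lexico l1 l2 holds when l1 comes before or equals l2. *)
let rec lexico l1 l2 =
  match l1, l2 with
  | [], _ -> true
  | _ :: _, [] -> false
  | c1 :: r1, c2 :: r2 ->
    if c2 < c1 then false
    else if c2 = c1 then lexico r1 r2
    else true
```

Note that a word precedes all of its extensions, which is the order in which a trie traversal enumerates them.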

-16-

(17)

Differential words

type delta = (int * word);

A differential word is a notation permitting to retrieve a word w from another word w′ sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.

We compute the difference between w and w′ as a differential word diff w w′ = (|w1|, w2) where w = p.w1 and w′ = p.w2, with maximal common prefix p.

The converse of diff : word -> word -> delta is patch : delta -> word -> word: w′ may be retrieved from w and d = diff w w′ as w′ = patch d w.
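
The two functions are specified above but not spelled out. One possible transcription in standard OCaml, given as a sketch rather than as the toolkit's actual code:

```ocaml
type word = int list
type delta = int * word

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2,
   p being the maximal common prefix. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* patch (n, u) w: go up n letters from the end of w, then down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop k l =
    if k = 0 then l
    else match l with
      | [] -> failwith "patch"
      | _ :: t -> drop (k - 1) t in
  List.rev_append (drop n (List.rev w)) u
```

For instance, with w = [1; 2; 3; 4] and w′ = [1; 2; 5], the common prefix is [1; 2], so the differential word is (2, [5]): two steps up, then down along [5].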

-17-

(18)

Tries

Tries store sparse sets of words sharing initial prefixes. They are due to René de la Briandais (1959). We use a very simple representation with lists of siblings.

type trie = [ Trie of (bool * forest) ]
and forest = list (Word.letter * trie);

Tries are managed (search, insertion, etc) using the zipper technology.
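
To make the representation concrete, here is a sketch of insertion and membership in standard OCaml. It uses plain structural recursion instead of the zipper technology mentioned above, and it assumes that sibling lists are kept sorted by letter and that letters are character codes; the names enter, mem and encode are chosen for illustration.

```ocaml
type letter = int
type trie = Trie of bool * (letter * trie) list

let empty = Trie (false, [])

(* enter t w: a new trie that also accepts w; the old trie is unchanged. *)
let rec enter (Trie (b, forest)) = function
  | [] -> Trie (true, forest)
  | c :: rest ->
    let rec ins = function
      | (c', t) :: more when c' < c -> (c', t) :: ins more
      | (c', t) :: more when c' = c -> (c', enter t rest) :: more
      | siblings -> (c, enter empty rest) :: siblings
    in
    Trie (b, ins forest)

(* mem t w: follow w from the root; accept if the last node is marked. *)
let rec mem (Trie (b, forest)) = function
  | [] -> b
  | c :: rest ->
    (match List.assoc_opt c forest with
     | Some t -> mem t rest
     | None -> false)

(* Assumption: letters are character codes. *)
let encode s = List.init (String.length s) (fun i -> Char.code s.[i])

let lex =
  List.fold_left (fun t s -> enter t (encode s)) empty ["cat"; "cats"; "dog"]
```

The Boolean at each node distinguishes stored words from mere prefixes: "ca" reaches a node, but that node is not marked.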

-18-

(19)

Important remarks

Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.

Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.

-19-

(20)

Lexicon

Here is a simplistic lexicon compiler make_lex : list string -> trie:

value make_lex =
  List.fold_left (fun lex c -> Trie.enter lex (Word.encode c)) Trie.empty;

For instance, with english.lst storing a list of 173528 words, as a text file of size 2Mb, the command make_lex < english.lst > english.rem produces a trie representation as a file of 4.5Mb.

Tries share the words by their prefixes, but common suffixes account for a lot of redundancy in the structure. We shall eliminate this redundancy by sharing.

-20-

(21)

The Share Functor

module Share : functor (Algebra : sig type domain = 'a; value size : int; end) ->
  sig value share : Algebra.domain -> int -> Algebra.domain; end;

That is, Share takes as argument a module Algebra providing a type domain and an integer value size, and it defines a value share of the stated type. We assume that the elements from the domain are presented with an integer key bounded by Algebra.size. That is, share x k will assume as precondition that 0 ≤ k < Max with Max = Algebra.size.

We shall construct the sharing map with the help of a hash table, made up of buckets (k, [e1; e2; ... en]) where each element ei has key k.

-21-

(22)

Memoizing

type bucket = list Algebra.domain;

value memo = Array.create Algebra.size ([] : bucket);

We shall use a service function search, such that search e l returns the first y in l such that y = e, or else raises the exception Not_found.

value search e = List.find (fun x -> x = e);

-22-

(23)

The share function

value share element key =
  let bucket = memo.(key) in
  try search element bucket with
  [ Not_found -> do { memo.(key) := [element :: bucket]; element } ];

Sharing is just recalling!

-23-

(24)

Compressing trees as dags

We may for instance instantiate Share on the algebra of trees, with a size hash_max depending on the application:

module Dag = Share (struct type domain = tree;
                           value size = hash_max; end);

And now we compress a trie into a minimal dag using share by a simple bottom-up traversal, where the key is computed along the way by hashing. For this we define a general bottom-up traversal function, which applies a parametric lookup function to every node and its associated key.

-24-

(25)

Dynamic programming

Bottom-up traversing with inductive hash-code computation.

value hash1 key index sum = sum + index * key
and hash forest = forest mod hash_max;

value traverse lookup = travel
  where rec travel = fun
  [ Tree (forest) ->
      let f (tries, index, span) t =
        let (t0, k) = travel t
        in ([t0 :: tries], index + 1, hash1 k index span)
      in let (forest0, _, span) = List.fold_left f ([], 1, 0) forest
      in let key = hash span in (lookup (Tree (rev forest0)) key, key)
  ];

-25-

(26)

Compressing a tree as a dag

Now, compressing a tree optimally as a minimal dag is simply effected by a sharing traversal:

value compress = traverse Dag.share;

value minimize tree = let (dag, _) = compress tree in dag;
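
The whole pipeline (memo table, share, bottom-up traversal) fits in a few lines of standard OCaml. The following is a sketch under assumptions, not the toolkit's code: the table size 9689 is arbitrary, the functor is inlined for the single type tree, and the key is taken as (span + 1) mod hash_max so that unlabelled trees do not all collide on key 0.

```ocaml
type tree = Tree of tree list

let hash_max = 9689  (* arbitrary table size, assumed for this sketch *)

(* Associative memory: buckets indexed by a key supplied by the caller. *)
let memo : tree list array = Array.make hash_max []

(* Return the stored copy of element if there is one, else store it. *)
let share element key =
  match List.find_opt (fun x -> x = element) memo.(key) with
  | Some shared -> shared
  | None -> memo.(key) <- element :: memo.(key); element

(* Bottom-up traversal; each key is computed from the children's keys. *)
let rec travel (Tree forest) =
  let f (trees, index, span) t =
    let (t0, k) = travel t in
    (t0 :: trees, index + 1, span + index * k) in
  let (forest0, _, span) = List.fold_left f ([], 1, 0) forest in
  let key = (span + 1) mod hash_max in  (* +1 is an assumption, see above *)
  (share (Tree (List.rev forest0)) key, key)

let minimize tree = fst (travel tree)
```

After minimize, structurally equal subtrees are one and the same memory block, which can be observed with physical equality (==).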

-26-

(27)

Advantages and extensions

Hashing keys and size is on the client side: we do not delegate hashing to Share, which is just an associative memory. This has two advantages:

• The computation is fully linear
• It is adapted to the statistics of the data

Extension: Auto-sharing types (controlled hash-consing). Suggests a monad of shared hashed structures accommodating entropy of the data.

-27-

(28)

Dagified lexicons

We may dagify a lexicon a posteriori in one pass:

value rec dagify () =
  let lexicon = (input_value stdin : Trie.trie)
  in let dag = Mini.minimize lexicon in output_value stdout dag;

Or we may maintain a dagified structure by sharing dynamically when inserting words, by appropriate modification of the zipper operations.

And now if we apply this technique to our English lexicon, with command dagify < english.rem > small.rem, we now get an optimal representation which only needs 1Mb of storage, half of the original ASCII string representation.

-28-

(29)

Pub

The recursive algorithms given so far are fairly straightforward. They are easy to debug, maintain and modify due to the strong typing safeguard of ML, and even easy to formally certify. They are nonetheless efficient enough for production use, thanks to the optimizing native-code compiler of Objective Caml.

In our Sanskrit application, the trie of 11500 entries is shrunk from 219Kb to 103Kb in 0.1s, whereas the trie of 120000 flexed forms is shrunk from 1.63Mb to 140Kb in 0.5s on a 864MHz PC. Our trie of 173528 English words is shrunk from 4.5Mb to 1Mb in 2.7s. Measurements showed that the time complexity is linear with the size of the lexicon (within comparable sets of words).

-29-

(30)

Variations

Many variations on tries exist. Optimisations of lexical analysers for programming languages are described in the Dragon book. But the dragon book of computational linguistics has not been written yet.

Variation with ternary trees. Ternary trees are inspired from Bentley and Sedgewick. Ternary trees are more complex than tries, but use slightly less storage. Access is potentially faster in balanced trees than in tries. A good methodology seems to be to use tries for editing, and to translate them to balanced ternary trees for production use with a fixed lexicon.

The ternary version of our English lexicon takes 3.6Mb, a saving of 20% over its trie version using 4.5Mb. After dag minimization, it takes 1Mb, a saving of 10% over the trie dag version using 1.1Mb. For our Sanskrit lexicon index, the trie takes 221Kb and the tertree 180Kb. Shared as dags, the trie takes 103Kb and the tertree 96Kb.

-30-

(31)

Decos, Lexmaps, Autos

We understand the Trie structure of a set of Words as a special case of a finitely based mapping Deco = Word → Annotation in the case of Boolean annotations, shared by prefix arguments (and by common subexpressions when shared).

We store morphology constructions as being of this type, and we investigate the reverse mapping by generalising them to relations, typically inductively defined through finite state machines.

The more sharing we get, the better we optimise this data layout. It is thus of paramount importance that the annotations be local quasi-morphisms decorations.

-31-

(32)

Decos

type deco 'a = [ Deco of (list 'a * dforest 'a) ]
and dforest 'a = list (Word.letter * deco 'a);

We think of the decoration of type list 'a as information associated with the word stored at that node.

We can easily generalize sharing to decorated tries. However, substantial savings will result only if the information at a given node is a function of the subtrie at that node, i.e. if such information is defined as a trie morphism.

Definition. A deco is a tree morphism if the information at every node is a function of the corresponding sub-tree. Such decos preserve the sharing of the trees they decorate.

-32-

(33)

Encoding morphological parameters as decorations

We thus profit from the regularity of morphological transformations to have terse representations of the lexicon decorated by grammatical information. Thus if all plurals are obtained by adding ‘s’ to the singular stem except for a few exceptions, we do not pay any cost in encoding this plural information as an explicit instruction [pl: suffix s] decorating the stems, since it will not create any new node except for the few exceptions. This is as opposed to listing the plural forms explicitly, which would undo all sharing.

In our Sanskrit implementation, the various genders associated with a noun stem are defined in a deco used for producing the flexed forms. The flexed forms are then generated using an ad-hoc internal sandhi algorithm, difficult to encode as a finite-state process, and thus difficult to invert.

-33-

(34)

(Aside) The scoping structure of the lexicon

How to find the stem associated with a gender in the lexicon in one click so that morphology may be displayed - with no need of script or applet.

Simple distributed architecture - all the computation is done on the server side.

Maintaining computational invariants in the lexicon augments its robustness.

-34-

(35)

Explicit morphology vs implicit morphology

By explicit morphology I mean listing explicitly the forms generated by morphology operations from root stems, prefixes and suffixes.

By implicit morphology I mean just having programs which will generate these flexed forms on demand.

Implicit morphology is not enough to recognize the segments of sentences identical with a flexed form: the morphological functions must be invertible.

-35-

(36)

Compromise

On the other hand, the delimitation between implicit and explicit is blurred since e.g. a finite-state machine state graph may be both considered a program and a piece of data; for instance, a trie stores words, but actually the words are “recognized as being in the lexicon” by “running the lexicon over them as input data”.

Thus we shall represent “explicitly” flexed forms and the information on how they are derived from root stems as a trie bearing as decorations instructions on how to “undo morphology” locally. For this purpose, we shall use the notion of differential word above. We may now store inverse maps of lexical relations (such as morphology derivations) using the Lexmap structure.

This way we bypass the (hard) problem of internal sandhi fsm axiomatisation.

-36-

(37)

Lexmaps

type inverse 'a = (Word.delta * 'a)
and inverse_map 'a = list (inverse 'a);

type lexmap 'a = [ Map of (inverse_map 'a * mforest 'a) ]
and mforest 'a = list (Word.letter * lexmap 'a);

Typically, if word w is stored at a node Map ([...; (d, r); ...], ...), this represents the fact that w is the image by relation r of w′ = patch d w. Such a lexmap is thus a representation of the image by r of a source lexicon. This representation is invertible, while preserving maximally the sharing of prefixes, and thus being amenable to sharing.

Example: cats and dogs sharing their ‘s’ node while implicitly referring to their respective singular stem.
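
The cats and dogs example can be checked with a small sketch in standard OCaml. It assumes letters are characters and that the plural relation is tagged by the string "plural"; both choices, and the helper names explode, implode and stem_of, are illustrative.

```ocaml
type word = char list
type delta = int * word  (* go up n letters, then down along the word *)

let explode s = List.init (String.length s) (String.get s)
let implode l = String.concat "" (List.map (String.make 1) l)

(* patch (n, u) w: drop the last n letters of w, then append u. *)
let patch ((n, u) : delta) (w : word) : word =
  let rec drop k l =
    if k = 0 then l
    else match l with
      | [] -> invalid_arg "patch"
      | _ :: t -> drop (k - 1) t in
  List.rev_append (drop n (List.rev w)) u

(* One and the same decoration serves every regular plural. *)
let plural_inverse : delta * string = ((1, []), "plural")

let stem_of form = implode (patch (fst plural_inverse) (explode form))
```

Because the decoration (1, []) mentions neither "cat" nor "dog", the two final ‘s’ nodes are structurally equal and can be merged by the sharing traversal, whereas storing the stems explicitly would make them differ.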

-37-

(38)

Lexicon repositories using tries and decos

In a typical computational linguistics application, grammatical information (part of speech role, gender/number for substantives, valency and other subcategorization information for verbs, etc) may be stored as decoration of the lexicon of roots/stems. From such a decorated trie a morphological processor may compute the lexmap of all flexed forms, decorated with their derivation information encoded as an inverse map. This structure may itself be used by a tagging processor to construct the linear representation of a sentence decorated by feature structures. Such a representation will support further processing, such as computing syntactic and functional structures, typically as solutions of constraint satisfaction problems.

-38-

(39)

Example: Sanskrit

The main component in our tools is a structured lexical database. From this database, various hypertext documents may be produced mechanically. The index CGI engine searches for words by navigating in a persistent trie index of stem entries. The current database comprises 12000 items, and its index has a size of 103KB.

When computing this index, another persistent structure is created. It records in a deco all the genders associated with a noun entry. At present, this deco records genders for 5700 nouns, and it has a size of 268KB.

We iterate on this genders structure a grammatical engine, which generates declined forms. This lexmap records about 120000 such flexed forms with associated grammatical information, and it has a size of 341KB. A companion trie, without the information, keeps the index of flexed words as a minimized structure of 140KB.

-39-

(40)

Finite State Lore

Computational phonology and morphology make extensive use of finite state technology: rational languages and relations, transducers, bimachines, etc.

• Schützenberger
• Koskenniemi
• Kaplan and Kay

Finite state toolsets have been developed, where word transformations are systematically compiled in a low-level algebra of finite-state machine operators. Such toolsets have been developed at Xerox, Paris VII, Bell Labs, Mitsubishi Labs, etc. Compiling complex rewrite rules in rational transducers may be subtle. We depart from this fine-grained methodology and propose more direct translations preserving the structure of the lexicon.

-40-

(41)

Finite State Machines as Lexicon Morphisms

We start with the remark that a lexicon represented as a trie is directly the state space representation of the (deterministic) finite state machine that recognizes its words, and that its minimization consists exactly in sharing the lexical tree as a dag. We are in a case where the state graph of such finite language recognizers is an acyclic structure. Such a pure data structure may be easily built without mutable references, which has computational and robustness advantages.

In the same spirit, we define automata which implement non-trivial rational relations (and their inversion) and whose state structure is nonetheless a more or less direct decoration of the lexicon trie. The crucial notion is that the state structure is a lexicon morphism.

-41-

(42)

Unglueing

We start with a toy problem which is the simplest case of juncture analysis, namely when there are no non-trivial juncture rules, and segmentation consists just in retrieving the words of a sentence glued together in one long string of characters (or phonemes). Consider for instance written English. You have a text file consisting of a sequence of words separated with blanks, and you have a lexicon complete for this text (for instance, ‘spell’ has been successfully applied). Now, suppose you make some editing mistake, which removes all spaces, and the task is to undo this operation to restore the original.

The transducer is defined as a functor, taking the lexicon trie structure as parameter.

-42-

(43)

Unglue

module Unglue (Lexicon : sig value lexicon : Trie.trie; end) = struct

type input = Word.word (* input sentence as a word *)
and output = list Word.word; (* output is sequence of words *)

type backtrack = (input * output)
and resumption = list backtrack; (* coroutine resumptions *)

exception Finished;

We define our unglueing reactive engine as a recursive process which navigates directly on the (flexed) lexicon trie (typically the compressed trie resulting from the Dag module considered above).

-43-

(44)

The reactive engine

The reactive engine takes as arguments the (remaining) input, the (partially constructed) list of words returned as output, a backtrack stack whose items are (input, output) pairs, the path occ in the state graph stacking (the reverse of) the current common prefix of the candidate words, and finally the current trie node as its current state. When the state is accepting, we push it on the backtrack stack, because we want to favor possible longer words, and so we continue reading the input until either we exhaust the input, or the next input character is inconsistent with the lexicon data.

-44-

(45)

The reactive engine code

value rec react input output back occ = fun
  [ Trie (b, forest) ->
      if b then
        let pushout = [occ :: output] in
        if input = [] then (pushout, back) (* solution found *)
        else let pushback = [(input, pushout) :: back]
             in continue pushback
      else continue back
      where continue cont = match input with
        [ [] -> backtrack cont
        | [letter :: rest] ->
            try let next_state = List.assoc letter forest
                in react rest output cont [letter :: occ] next_state
            with [ Not_found -> backtrack cont ]
        ]
  ]

-45-

(46)

Backtrack

and backtrack = fun
  [ [] -> raise Finished
  | [(input, output) :: back] ->
      react input output back [] Lexicon.lexicon
  ];

Now, unglueing a sentence is just calling the reactive engine from the appropriate initial backtrack situation.

value unglue sentence = backtrack [(sentence, [])];

-46-

(47)

Remark

Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG black box?

The three golden rules of non-deterministic programming:

• Identify your search state space well
• Represent states as non-mutable data
• Prove termination

The last point is essential for understanding the granularity and enforcing completeness.

-47-

(48)

More on state space considerations

This non-deterministic process (recognizing L∗) uses the same state space as the lexicon/trie (recognizing L).

This corresponds to the fact that an automaton for L∗ may be obtained from the automaton for L by inserting ε-moves from accepting nodes to the initial node. But such transitions may be kept completely implicit. All you have to do is to manage the necessary non-determinism (continuing in L, which is not in general a prefix language, i.e. it may happen that both w and w·s are in L, versus iterating) in the backtrack stack, but you do not have to modify at all the state space data structure. It is just a shift in point of view concerning this data.

-48-

(49)

Still more on state space considerations

Remember that dagified tries define the minimal automaton of a finite language L.

But it is not the case that this automaton, completed with ε-transitions, is minimal for L∗. Consider for instance L = {a, aa}.

However, note that we are using it as a transducer computing justifications for a word in L∗ to be a concatenation of precise words of L, and the minimal automaton does not keep enough information for that: distinct segmentations of a sentence must be separated.

-49-

(50)

Childtalk

module Childtalk = struct
  value lexicon = Lexicon.make_lex ["boudin"; "caca"; "pipi"];
end;

module Childish = Unglue (Childtalk);

let (sol, _) = Childish.unglue (Word.encode "pipicacaboudin")
in Childish.print_out sol;

We recover as expected: pipi caca boudin.

-50-

(51)

Generating several solutions

We resume a resumption with resume : (resumption -> int -> resumption).

value resume cont n =
  let (output, resumption) = backtrack cont in
  do { print_string "\nSolution "; print_int n; print_string " :\n";
       print_out output; resumption };

value unglue_all sentence = restore [(sentence, [])] 1
  where rec restore cont n =
    try let resumption = resume cont n
        in restore resumption (n + 1)
    with [ Finished -> if n = 1 then print_string "No solution found\n"
                       else () ];

-51-

(52)

Solving a charade

module Short = struct
  value lexicon = Lexicon.make_lex
    ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;

module Charade = Unglue (Short);

Charade.unglue_all (Word.encode "amiabletogether");

Solution 1 : amiable together
Solution 2 : amiable to get her
Solution 3 : am i able together
Solution 4 : am i able to get her
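
The unglueing engine and the charade can be replayed in standard OCaml. The sketch below makes several assumptions for the sake of self-containment: letters are characters, the trie is rebuilt locally with a simple enter, the functor is replaced by an explicit lexicon argument, and solutions are collected in a list instead of being printed.

```ocaml
type trie = Trie of bool * (char * trie) list

let rec enter (Trie (b, forest)) = function
  | [] -> Trie (true, forest)
  | c :: rest ->
    let sub = try List.assoc c forest with Not_found -> Trie (false, []) in
    Trie (b, (c, enter sub rest) :: List.remove_assoc c forest)

let explode s = List.init (String.length s) (String.get s)
let implode l = String.concat "" (List.map (String.make 1) l)

let make_lex words =
  List.fold_left (fun t w -> enter t (explode w)) (Trie (false, [])) words

exception Finished

(* react: remaining input, words found so far, backtrack stack,
   reversed prefix of the current word, current trie node. *)
let rec react lexicon input output back occ (Trie (b, forest)) =
  let continue cont =
    match input with
    | [] -> backtrack lexicon cont
    | letter :: rest ->
      (match List.assoc_opt letter forest with
       | Some next -> react lexicon rest output cont (letter :: occ) next
       | None -> backtrack lexicon cont)
  in
  if b then
    let pushout = occ :: output in
    if input = [] then (pushout, back)        (* solution found *)
    else continue ((input, pushout) :: back)  (* favour longer words *)
  else continue back

and backtrack lexicon = function
  | [] -> raise Finished
  | (input, output) :: back -> react lexicon input output back [] lexicon

(* All segmentations, in the order in which the engine finds them. *)
let unglue_all lexicon sentence =
  let rec loop cont acc =
    match backtrack lexicon cont with
    | (output, resumption) ->
      let words = List.rev_map (fun w -> implode (List.rev w)) output in
      loop resumption (words :: acc)
    | exception Finished -> List.rev acc
  in
  loop [(explode sentence, [])] []

let short =
  make_lex ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]
let solutions = unglue_all short "amiabletogether"
```

All states are immutable values, so a resumption is just a list that can be resumed at any time, which is the second golden rule stated earlier.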

-52-

(53)

Juncture euphony and its discretization

When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:

[x] u | v → w

This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.

-53-

(54)

[Figure: diagram of the juncture rewrite; labels u, v, w, x]

-54-

(55)

[Figure: diagram of the juncture rewrite; labels z, u, v, w, x]

-55-

(56)

Auto

type lexicon = trie
and rule = (word * word * word);

The rule triple (rev u, v, w) represents the string rewrite u | v → w. Now for the transducer state space:

type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;

module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);

-56-

(57)

Compiling the lexicon to a minimal transducer

(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
      let local_stack = if b then get_sandhi occ else [] in
      let f (deter, stack, span) (n, t) =
        let current = [n :: occ] (* current occurrence *) in
        let (auto, st, k) = build_auto current t in
        ([(n, auto) :: deter], merge st stack, hash1 n k span) in
      let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
      let (h, l) = match stack with
        [ [] -> ([], []) | [h :: l] -> (h, l) ] in
      let key = hash b span h in
      let s = Auto.share (State (b, deter, h)) key in
      (s, merge local_stack l, key)
  ];

-57-

(58)

Segmenting Transducer Data Structures

type transition =
  [ Euphony of rule (* (rev u, v, w) st u | v -> w *)
  | Id (* identity or no sandhi *)
  ]
and output = list (word * transition);

type backtrack =
  [ Next of (input * output * word * choices)
  | Init of (input * output)
  ]
and resumption = list backtrack; (* coroutine resumptions *)

exception Finished;

-58-

(59)

Running the Segmenting Transducer

value rec react input output back occ = fun
  [ State (b, det, choices) ->
      (* we try the deterministic space first *)
      let deter cont = match input with
        [ [] -> backtrack cont
        | [letter :: rest] ->
            try let next_state = List.assoc letter det
                in react rest output cont [letter :: occ] next_state
            with [ Not_found -> backtrack cont ]
        ] in
      let nondets = if choices = [] then back
                    else [Next (input, output, occ, choices) :: back] in
      if b then
        let out = [(occ, Id) :: output] (* opt final sandhi *)

-59-

(60)

        in if input = [] then (out, nondets) (* solution *)
           else let alterns = [Init (input, out) :: nondets]
                (* we first try the longest matching word *)
                in deter alterns
      else deter nondets
  ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [((u, v, w) as rule) :: others] ->
      let alterns = [Next (input, output, occ, others) :: back] in
      if prefix w input then
        let tape = advance (length w) input
        and out = [(u @ occ, Euphony (rule)) :: output] in
        if v = [] (* final sandhi *) then
          if tape = [] then (out, alterns) else backtrack alterns

-60-

(61)

        else let next_state = access v
             in react tape out alterns v next_state
      else backtrack alterns
  ]
and backtrack = fun
  [ [] -> raise Finished
  | [resume :: back] -> match resume with
      [ Next (input, output, occ, choices) ->
          choose input output back occ choices
      | Init (input, output) ->
          react input output back [] automaton
      ]
  ];

-61-

(62)

Example of Sanskrit Segmentation

process "tacchrutvaa";

Chunk: tacchrutvaa
may be segmented as:
Solution 1 :
[ tad with sandhi d|"s -> cch ]
[ "srutvaa with no sandhi ]
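
Read in the generative direction, the rule reported in this solution glues the two words back into the original chunk. The sketch below, in standard OCaml, works on plain strings and ignores the left context [x]; the record type and the function glue are illustrative names, not part of the toolkit.

```ocaml
(* A rule u | v -> w on plain strings (the slides store rev u as a word). *)
type rule = { u : string; v : string; w : string }

let ends_with s suf =
  let n = String.length s and k = String.length suf in
  n >= k && String.sub s (n - k) k = suf

let starts_with s pre =
  let n = String.length s and k = String.length pre in
  n >= k && String.sub s 0 k = pre

(* glue r a b: Some (a' ^ w ^ b') when a = a'.u and b = v.b', else None. *)
let glue r a b =
  if ends_with a r.u && starts_with b r.v then
    let la = String.length a - String.length r.u
    and lv = String.length r.v in
    Some (String.sub a 0 la ^ r.w ^ String.sub b lv (String.length b - lv))
  else None

(* The rule d | "s -> cch from the solution above. *)
let d_s = { u = "d"; v = "\"s"; w = "cch" }
let chunk = glue d_s "tad" "\"srutvaa"
```

Segmentation is the inverse problem: given the chunk, find the words and the rules that could have produced it, which is what the choose function does when it matches w as a prefix of the input.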

-62-

(63)

More examples

process "o.mnama.h\"sivaaya";

Solution 1 :
[ om with sandhi m|n -> .mn ]
[ namas with sandhi s|"s -> .h"s ]
[ "sivaaya with no sandhi ]

process "sugandhi.mpu.s.tivardhanam";

Solution 1 :
[ sugandhim with sandhi m|p -> .mp ]
[ pu.s.ti with no sandhi ]
[ vardhanam with no sandhi ]

-63-

(64)

Sanskrit Tagging

process "sugandhi.mpu.s.tivardhanam";

Solution 1 :
[ sugandhim
  < { acc. sg. m. }[sugandhi] > with sandhi m|p -> .mp ]
[ pu.s.ti < { iic. }[pu.s.ti] > with no sandhi ]
[ vardhanam
  < { acc. sg. m. | acc. sg. n. | nom. sg. n.
    | voc. sg. n. }[vardhana] > with no sandhi ]

-64-

(65)

Statistics

The complete automaton construction from the flexed forms lexicon takes only 9s on a 864MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6MB!

The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.

-65-

(66)

Overgeneration Problems

Very short particles have to be treated differently, or otherwise there would be intolerable overgeneration. Probably prosody will have to come to the rescue. The case of vedic “u”.

Compounds. The bahuvrīhi problem.

Intrinsic overgeneration. a+a = a+ā = ā+a = ā+ā = ā. Most s.m. end with a, many s.f. end with ā, the preverb ā (towards) is frequent, the prefix a is common (negation). So there is often room for interpretation!

E.g. na asato vidyate bhāvo na abhāvo vidyate sataḥ vs na asato vidyate abhāvo na abhāvo vidyate sataḥ

Double entendre poetry.

-66-

(67)

Soundness and Completeness of the Algorithms

Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.

Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.

Cf. http://pauillac.inria.fr/~huet/FREE/tagger.ps

-67-

(68)

Where is the information?

Mel'čuk says “Everything is in the lexicon”.

The key concept is lexicon directed. So most of the information is indeed in the lexicon. But a lot of phonological information (sandhi rules) and grammatical knowledge is in the code.

If time permits. A tour of the dictionary structures.

-68-

(69)

Enjoy!

• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/
• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/FREE/tagger.ps
• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps
• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps
• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar
• Objective Caml: http://caml.inria.fr/ocaml/

-69-

(70)

What next (on the Sanskrit front)

• Sanskrit 1: Verb morphology, Corpus testing, Lexicon acquisition mode, Segmentation training, Philology assistant (Scharf, Smith)
• Sanskrit 2: Sentinels, Prosody, Valency checking, Dependency synthesis
• Sanskrit 3: Discourse analysis: Reference, Scope, Theme, Focus, Anaphora resolution, Extra-linguistic information
• Sanskrit ∞: Distributed development of multilingual tools, Saving the Pune dictionary project

-70-

(71)

What next (on the Zen front)

• Zen maintenance: Distribution, Hotline, Users' club, Coordination of extensions
• Zen immediate extensions: Grafting of regular relations, Rules compiler
• Towards a more comprehensive generic platform for computational linguistics, accommodating the levels of Syntax, Semantics, and Discourse Information Dynamics

-71-
