Automata Mista

Gérard Huet

Zohar Festschrift, Taormina, June 2003

Practical origin

Zen and the Art of Symbolic Computing:
Light and Fast Applicative Algorithms for
Computational Linguistics

Gérard Huet

INRIA

PADL, New Orleans, January 2003

Tries

Tries, or lexical trees, store sparse sets of words sharing initial prefixes. They are due to René de la Briantais (1959). We use a very simple representation with lists of siblings.

type trie = [ Trie of (bool * forest) ]
and forest = list (Word.letter * trie);

Tries are managed (search, insertion, etc.) using the zipper technology.
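As a rough illustration of the sibling-list representation, here is a minimal sketch of lookup and insertion. It is written in standard OCaml syntax rather than the revised syntax of the slides, it uses plain recursion instead of the zipper technique mentioned above, and it assumes letters are `char` and words are `string`; the names `empty`, `mem`, and `insert` are hypothetical and not taken from the ZEN library.

```ocaml
(* Sketch only: letters are chars and words are strings (an assumption). *)
type trie = Trie of bool * forest
and forest = (char * trie) list

let empty = Trie (false, [])

(* mem w t: is the word w stored in t? *)
let mem (w : string) (t : trie) : bool =
  let n = String.length w in
  let rec go i (Trie (b, arcs)) =
    if i = n then b
    else
      match List.assoc_opt w.[i] arcs with
      | Some sub -> go (i + 1) sub
      | None -> false
  in
  go 0 t

(* insert w t: a new trie that also contains w; t itself is unchanged. *)
let insert (w : string) (t : trie) : trie =
  let n = String.length w in
  let rec go i (Trie (b, arcs)) =
    if i = n then Trie (true, arcs)
    else
      let c = w.[i] in
      let sub = match List.assoc_opt c arcs with Some s -> s | None -> empty in
      Trie (b, (c, go (i + 1) sub) :: List.remove_assoc c arcs)
  in
  go 0 t

let lexicon =
  List.fold_left (fun t w -> insert w t) empty ["am"; "amiable"; "able"]
```

The boolean at each node marks whether the path from the root spells a complete word, so `"am"` is found while its extension `"ami"` is not.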

Important remarks

Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.

Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.

Solving a charade

module Short = struct
  value lexicon = Lexicon.make_lex
    ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;

module Charade = Unglue(Short);

Charade.unglue_all (Word.encode "amiabletogether");

Solution 1: amiable together

Solution 2: amiable to get her

Solution 3: am i able together

Solution 4: am i able to get her
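The charade can be reproduced with a few lines of plain recursion. The following is a minimal sketch in standard OCaml, assuming the lexicon is a list of strings scanned linearly; it is not the `Unglue` functor of the slides, which works on the trie. Candidates are tried longest first, which is the priority the segmenting transducer also uses.

```ocaml
let lexicon = ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]

(* starts_with w s i: does w occur in s at position i? *)
let starts_with (w : string) (s : string) (i : int) : bool =
  let lw = String.length w in
  i + lw <= String.length s && String.sub s i lw = w

(* All ways to cut s into lexicon words, longest candidate word first. *)
let unglue_all (s : string) : string list list =
  let by_length =
    List.sort (fun a b -> compare (String.length b) (String.length a)) lexicon
  in
  let n = String.length s in
  let rec from i =
    if i = n then [ [] ]
    else
      List.concat_map
        (fun w ->
          if starts_with w s i then
            List.map (fun rest -> w :: rest) (from (i + String.length w))
          else [])
        by_length
  in
  from 0

let solutions = List.map (String.concat " ") (unglue_all "amiabletogether")
```

With this lexicon, `solutions` holds the four segmentations of the charade, starting with `"amiable together"` and ending with `"am i able to get her"`.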

Juncture euphony and its discretization

When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:

[x] u | v → w

This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises ungluing into sandhi analysis.

[Two figures not recovered. Surviving labels: u, v, w, x (first figure); z, u, v, w, x (second figure).]

Auto

type lexicon = trie
and rule = (word * word * word);

The rule triple (rev u, v, w) represents the string rewrite u | v → w. Now for the transducer state space:

type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;

module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);
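To make the rewrite u | v → w concrete, here is a minimal sketch that applies one rule in the generative direction, gluing two words at their juncture. It is written in standard OCaml and assumes plain strings in the transliteration used by the slides; it ignores the optional left context [x] and stores u unreversed, unlike the rule triple above. The record type and the names `join` and `d_s` are hypothetical.

```ocaml
(* A rule u | v -> w over plain strings (an assumption of this sketch). *)
type rule = { u : string; v : string; w : string }

let ends_with (suffix : string) (s : string) : bool =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

let begins_with (prefix : string) (s : string) : bool =
  let ls = String.length s and lp = String.length prefix in
  ls >= lp && String.sub s 0 lp = prefix

(* join r left right: glue two words with rule r; None if r does not apply. *)
let join (r : rule) (left : string) (right : string) : string option =
  if ends_with r.u left && begins_with r.v right then
    let keep_l = String.sub left 0 (String.length left - String.length r.u) in
    let lv = String.length r.v in
    let keep_r = String.sub right lv (String.length right - lv) in
    Some (keep_l ^ r.w ^ keep_r)
  else None

(* The rule d | "s -> cch from the Sanskrit segmentation example. *)
let d_s = { u = "d"; v = "\"s"; w = "cch" }
let glued = join d_s "tad" "\"srutvaa"
```

Here `glued` is `Some "tacchrutvaa"`, the chunk that the segmenter later has to take apart again; segmentation runs the same rules backwards, recognising w in the input and predicting u and v.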

Compiling the lexicon to a minimal transducer

(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
    let local_stack = if b then get_sandhi occ else [] in
    let f (deter, stack, span) (n, t) =
      let current = [n :: occ] (* current occurrence *) in
      let (auto, st, k) = build_auto current t in
      ([(n, auto) :: deter], merge st stack, hash1 n k span) in
    let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
    let (h, l) = match stack with
      [ [] -> ([], [])
      | [h :: l] -> (h, l) ] in
    let key = hash b span h in
    let s = Auto.share (State (b, deter, h)) key in
    (s, merge local_stack l, key) ];
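The role of `Auto.share` can be illustrated on plain tries. The sketch below, in standard OCaml, rebuilds a trie bottom-up and replaces every subtree by a canonical representative, so that structurally equal subtrees become one physical node. It relies on the generic `Hashtbl` of the standard library, whereas the slide computes its own hash key while traversing; the names `table` and `share` are hypothetical and this is not the ZEN `Share` functor.

```ocaml
type trie = Trie of bool * (char * trie) list

(* Canonical representatives, keyed by structure. *)
let table : (trie, trie) Hashtbl.t = Hashtbl.create 97

(* share t: a trie structurally equal to t whose equal subtrees are
   physically shared. Children are shared before their parent. *)
let rec share (Trie (b, arcs)) : trie =
  let node = Trie (b, List.map (fun (c, t) -> (c, share t)) arcs) in
  match Hashtbl.find_opt table node with
  | Some canonical -> canonical
  | None -> Hashtbl.add table node node; node

(* The words "ab" and "cb": both branches end in the same suffix "b". *)
let t =
  Trie (false,
        [ ('a', Trie (false, [ ('b', Trie (true, [])) ]));
          ('c', Trie (false, [ ('b', Trie (true, [])) ])) ])

let shared = share t
let distinct_nodes = Hashtbl.length table
```

After sharing, the two branches under `a` and `c` are the same node and `distinct_nodes` is 3 instead of 5. This is the effect the statistics slide measures on the full lexicon.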

Running the Segmenting Transducer

value rec react input output back occ = fun
  [ State (b, det, choices) ->
    (* we try the deterministic space first *)
    let deter cont = match input with
      [ [] -> backtrack cont
      | [letter :: rest] ->
          try let next_state = List.assoc letter det in
              react rest output cont [letter :: occ] next_state
          with [ Not_found -> backtrack cont ] ] in
    let nondets = if choices = [] then back
                  else [Next (input, output, occ, choices) :: back] in
    if b then
      let out = [(occ, Id) :: output] (* opt final sandhi *) in
      if input = [] then (out, nondets) (* solution *)
      else let alterns = [Init (input, out) :: nondets]
           (* we first try the longest matching word *)
           in deter alterns
    else deter nondets ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [((u, v, w) as rule) :: others] ->
      let alterns = [Next (input, output, occ, others) :: back] in
      if prefix w input then
        let tape = advance (length w) input
        and out = [(u @ occ, Euphony (rule)) :: output] in
        if v = [] (* final sandhi *) then
          if tape = [] then (out, alterns)
          else backtrack alterns
        else let next_state = access v in
             react tape out alterns v next_state
      else backtrack alterns ]
and backtrack = fun
  [ [] -> raise Finished
  | [resume :: back] -> match resume with
      [ Next (input, output, occ, choices) ->
          choose input output back occ choices
      | Init (input, output) ->
          react input output back [] automaton ] ];
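The control structure above, where each call returns one solution together with the list of resumptions needed to find the next one, can be shown on a much smaller problem. The following sketch in standard OCaml segments a string over a word list without any sandhi; the record type, the `lexicon`, and the function `resume` are hypothetical simplifications, and only the exception name `Finished` is borrowed from the slide.

```ocaml
exception Finished

let lexicon = ["to"; "get"; "her"; "together"]

(* A resumption: an input position, the words emitted so far (reversed),
   and the lexicon candidates not yet tried at that position. *)
type resumption = { pos : int; out : string list; todo : string list }

let matches (w : string) (s : string) (i : int) : bool =
  let lw = String.length w in
  i + lw <= String.length s && String.sub s i lw = w

(* resume s stack: the next solution and the stack to continue from.
   Raises Finished when no resumption is left. *)
let rec resume (s : string) (stack : resumption list)
  : string list * resumption list =
  match stack with
  | [] -> raise Finished
  | { todo = []; _ } :: back -> resume s back
  | ({ pos; out; todo = w :: others } as r) :: back ->
      (* remember the untried candidates before exploring w *)
      let back = { r with todo = others } :: back in
      if matches w s pos then
        let pos' = pos + String.length w in
        if pos' = String.length s then (List.rev (w :: out), back)
        else resume s ({ pos = pos'; out = w :: out; todo = lexicon } :: back)
      else resume s back

let start = [ { pos = 0; out = []; todo = lexicon } ]
let (first, rest) = resume "together" start
let (second, _) = resume "together" rest
```

Because resumptions are immutable values, the caller decides when, and whether, to ask for the next solution: `first` is `["to"; "get"; "her"]`, `second` is `["together"]`, and a third request raises `Finished`.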

Example of Sanskrit Segmentation

process "tacchrutvaa";

Chunk: tacchrutvaa

may be segmented as:

Solution 1:

[tad with sandhi d|"s -> cch]

["srutvaa with no sandhi]

More examples

process "o.mnama.h\"sivaaya";

Solution 1:

[om with sandhi m|n -> .mn]

[namas with sandhi s|"s -> .h"s]

["sivaaya with no sandhi]

process "sugandhi.mpu.s.tivardhanam";

Solution 1:

[sugandhim with sandhi m|p -> .mp]

[pu.s.ti with no sandhi]

[vardhanam with no sandhi]

Sanskrit Tagging

process "sugandhi.mpu.s.tivardhanam";

Solution 1:

[sugandhim
<{acc. sg. m.}[sugandhi]> with sandhi m|p -> .mp]

[pu.s.ti <{iic.}[pu.s.ti]> with no sandhi]

[vardhanam
<{acc. sg. m. | acc. sg. n. | nom. sg. n.
| voc. sg. n.}[vardhana]> with no sandhi]

Statistics

The complete automaton construction from the inflected forms lexicon takes only 9 s on an 864 MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746 KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6 MB!

The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However, in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.

Soundness and Completeness of the Algorithms

Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment_all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.

Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.

Cf. http://pauillac.inria.fr/~huet/FREE/tagger.ps

A note on termination

Termination is proved by multiset ordering on resumptions.

This allows us to state the algorithm as a non-deterministic algorithm, allowing any strategy for priority of lexicon search versus euphony prediction, as well as arbitrary selection of resumptions when backtracking.

This is important, since it leaves all freedom for implementing arbitrary priority policies learned by corpus training.

Non-deterministic programming

Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG black box?

The three golden rules of non-deterministic programming:

• Identify your search state space well

• Represent states as non-mutable data

• Prove termination

The last point is essential for understanding the granularity and enforcing completeness.

Remark. Multiset ordering is an elegant method for proving termination of non-deterministic programs, independently of the sequential strategy of the generation of the solutions.

Enjoy!

• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/

• Sandhi Analysis paper: http://pauillac.inria.fr/~huet/FREE/tagger.ps

• Course notes: http://pauillac.inria.fr/~huet/ZEN/esslli.ps

• Course slides: http://pauillac.inria.fr/~huet/ZEN/Trento.ps

• Tutorial slides: http://pauillac.inria.fr/~huet/ZEN/Hyderabad.ps

• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar

• Objective Caml: http://caml.inria.fr/ocaml/


Automata mista

Differential words

type delta = (int * word);

A differential word is a notation that lets us retrieve a word w from another word w' sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.

We compute the difference between w and w' as a differential word diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, with maximal common prefix p.

The converse of diff : word -> word -> delta is patch : delta -> word -> word: w' may be retrieved from w and d = diff w w' as w' = patch d w.
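The two functions follow directly from the definitions above. Here is a minimal sketch in standard OCaml, assuming words are lists of `char` (the slides use `Word.letter`); `explode` is a hypothetical helper for writing examples, and the bodies are one possible reading of the definitions rather than the ZEN library code.

```ocaml
type word = char list
type delta = int * word

(* Hypothetical helper: turn a string into a word. *)
let explode (s : string) : word = List.init (String.length s) (String.get s)

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, p maximal. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* patch (n, u) w: go up n letters from the end of w, then down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let keep = List.length w - n in
  if keep < 0 then invalid_arg "patch";
  List.filteri (fun i _ -> i < keep) w @ u

let d = diff (explode "able") (explode "amiable")
let back = patch d (explode "able")
```

For `"able"` and `"amiable"` the common prefix is `"a"`, so `d` is `(3, explode "miable")`: three steps up from `"able"`, then down along `"miable"`. Patching `"able"` with `d` gives back `"amiable"`.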

The automaton structure

type input = word;

type delta = (int * word)
and address = [ Global of delta | Local of delta ];

type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list (input * address);

type automaton = (array auto * delta);

type backtrack = (input * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)

The transducer structure

type input = word
and output = word;

type delta = (int * word)
and address = [ Global of delta | Local of delta ];

type trans = [ State of (bool * deter * choices) ]
and deter = list (letter * trans)
and choices = list (input * output * address);

type transducer = (array trans * delta);

type backtrack = (input * output * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)

Next - hierarchical/modular automata - see Raajiv's talk?
