Automata Mista
Gérard Huet
Zohar Festschrift, Taormina, June 2003
-1-
Practical origin
Zen and the Art of Symbolic Computing:
Light and Fast Applicative Algorithms for
Computational Linguistics
Gérard Huet
INRIA
PADL, New Orleans, January 2003
-2-
Tries
Tries, or lexical trees, store sparse sets of words sharing initial prefixes. They are due to René de la Briantais (1959). We use a very simple representation with lists of siblings.
type trie = [ Trie of (bool * forest) ]
and forest = list (Word.letter * trie);
Tries are managed (search, insertion, etc.) using the zipper technology.
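The sibling-list representation above can be sketched as follows. This is only an illustration in standard OCaml syntax (the slides use the revised syntax), and it uses plain recursion rather than the zipper technique mentioned above. The choice of char for letters and the names empty, mem, enter, explode and make_lex are assumptions made for this sketch, not the ZEN library's API.

```ocaml
(* Hypothetical sketch: letters are assumed to be chars. *)
type letter = char
type word = letter list

type trie = Trie of bool * forest   (* bool: does a word end at this node? *)
and forest = (letter * trie) list   (* siblings, kept as an association list *)

let empty = Trie (false, [])

(* mem w t: is the word w stored in the trie t? *)
let rec mem (w : word) (Trie (b, arcs)) =
  match w with
  | [] -> b
  | c :: rest ->
      (match List.assoc_opt c arcs with
       | Some sub -> mem rest sub
       | None -> false)

(* enter w t: a new trie that also contains w; t itself is not modified. *)
let rec enter (w : word) (Trie (b, arcs)) =
  match w with
  | [] -> Trie (true, arcs)
  | c :: rest ->
      let sub = Option.value (List.assoc_opt c arcs) ~default:empty in
      Trie (b, (c, enter rest sub) :: List.remove_assoc c arcs)

(* explode "am" = ['a'; 'm'] *)
let explode s = List.init (String.length s) (String.get s)

(* Build a trie from a list of strings. *)
let make_lex words =
  List.fold_left (fun t s -> enter (explode s) t) empty words
```

Because "am" and "amiable" share the prefix "am", they share the first two arcs of the tree; the boolean flag distinguishes the stored word "am" from the mere prefix "ami".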
-3-
Important remarks
Tries may be considered as deterministic finite state automata graphs for accepting the (finite) language they represent. This remark is the basis for many lexicon processing libraries.
Such graphs are acyclic (trees). But more general finite state automata graphs may be represented as annotated trees. These annotations account for non-deterministic choice points, and for virtual pointers in the graph.
-4-
Solving a charade
module Short = struct
  value lexicon = Lexicon.make_lex
    ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"];
end;
module Charade = Unglue (Short);
Charade.unglue_all (Word.encode "amiabletogether");
Solution 1 : amiable together
Solution 2 : amiable to get her
Solution 3 : am i able together
Solution 4 : am i able to get her
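The charade can be reproduced with a naive recursive search. The sketch below is written in standard OCaml and is not the Unglue functor of the slides: it assumes the lexicon is a plain list of strings rather than a trie, and the function name unglue_all is reused only for readability.

```ocaml
(* Hypothetical sketch: the lexicon is a plain string list, not a trie. *)
let lexicon = ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]

(* All ways to cut s into lexicon words, trying longer first words first. *)
let rec unglue_all (s : string) : string list list =
  let n = String.length s in
  if n = 0 then [ [] ]                    (* one way to cut the empty string *)
  else
    (* candidate lengths for the first word: n, n-1, ..., 1 *)
    List.init n (fun i -> n - i)
    |> List.concat_map (fun len ->
           let head = String.sub s 0 len in
           if List.mem head lexicon then
             let tail = String.sub s len (n - len) in
             List.map (fun rest -> head :: rest) (unglue_all tail)
           else [])

let solutions = unglue_all "amiabletogether"
```

The search returns four segmentations, beginning with the one that starts with the longest word. It re-scans the lexicon at every position, which is exactly the cost that a trie-based lexicon avoids.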
-5-
Juncture euphony and its discretization
When successive words are uttered, the minimization of the energy necessary to reconfigure the vocal organs at the juncture of the words provokes a euphony transformation, discretized at the level of phonemes by a contextual rewrite rule of the form:
[x] u | v → w
This juncture euphony, or external sandhi, is actually recorded in Sanskrit in the written rendering of the sentence. The first linguistic processing is therefore segmentation, which generalises unglueing into sandhi analysis.
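To make the rewrite rule concrete, here is a small sketch in standard OCaml that applies a rule u | v → w at the juncture of two words. It ignores the optional left context [x], works on transliterated strings, and the record type and function names are assumptions made for this illustration. The two rules used in the usage lines are the ones reported by the segmenter later in these slides (d | "s → cch and m | n → .mn).

```ocaml
(* Hypothetical sketch: a context-free sandhi rule u | v -> w on strings. *)
type rule = { u : string; v : string; w : string }

let ends_with suffix s =
  let ls = String.length s and lx = String.length suffix in
  ls >= lx && String.sub s (ls - lx) lx = suffix

let starts_with prefix s =
  let ls = String.length s and lx = String.length prefix in
  ls >= lx && String.sub s 0 lx = prefix

(* sandhi r left right: glue two words, rewriting u|v into w at the juncture.
   Returns None when the rule does not apply to this pair of words. *)
let sandhi r left right =
  if ends_with r.u left && starts_with r.v right then
    let lu = String.length r.u and lv = String.length r.v in
    let keep_l = String.sub left 0 (String.length left - lu) in
    let keep_r = String.sub right lv (String.length right - lv) in
    Some (keep_l ^ r.w ^ keep_r)
  else None

(* In this transliteration, "s is a single phoneme written with a quote. *)
let ex1 = sandhi { u = "d"; v = "\"s"; w = "cch" } "tad" "\"srutvaa"
let ex2 = sandhi { u = "m"; v = "n"; w = ".mn" } "om" "namas"
```

Segmentation is the inverse problem: given the glued form, find the words and the rules that could have produced it.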
-6-
[Figure: diagram with labels u, v, w, x]
-7-
[Figure: diagram with labels z, u, v, w, x]
-8-
Auto
type lexicon = trie
and rule = (word * word * word);
The rule triple (rev u, v, w) represents the string rewrite u | v → w. Now for the transducer state space:
type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list rule;
module Auto = Share (struct type domain = auto;
                            value size = hash_max; end);
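The Share functor is only used, not defined, on this slide. One plausible reading, sketched here in standard OCaml, is a hash-consing table: an array of buckets indexed by a key supplied by the caller, where share returns an already stored element that is structurally equal to its argument, or stores the argument otherwise. The body below is an assumption made for illustration and may differ from the ZEN library's actual code; the tree type in the usage lines is also hypothetical.

```ocaml
(* Hypothetical sketch of a sharing (hash-consing) functor. *)
module Share (Algebra : sig type domain val size : int end) = struct
  (* One bucket of already shared elements per key in [0, size). *)
  let memo : Algebra.domain list array = Array.make Algebra.size []

  (* share x key: return the stored element equal to x in bucket key,
     or record x in that bucket and return it. *)
  let share (x : Algebra.domain) (key : int) : Algebra.domain =
    let bucket = memo.(key) in
    match List.find_opt (fun y -> y = x) bucket with
    | Some y -> y
    | None -> memo.(key) <- x :: bucket; x
end

(* Usage on a toy domain: equal trees become one physical value. *)
type tree = Leaf | Node of tree * tree
module T = Share (struct type domain = tree let size = 16 end)
let a = T.share (Node (Leaf, Node (Leaf, Leaf))) 3
let b = T.share (Node (Leaf, Node (Leaf, Leaf))) 3
let shared = (a == b)   (* physical equality *)
```

The caller is responsible for computing a key that is equal for equal elements, which is the role played by the hash values threaded through the construction of the automaton.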
-9-
Compiling the lexicon to a minimal transducer
(* build_auto : word -> lexicon -> (auto * stack * int) *)
value rec build_auto occ = fun
  [ Trie (b, arcs) ->
      let local_stack = if b then get_sandhi occ else [] in
      let f (deter, stack, span) (n, t) =
        let current = [ n :: occ ] (* current occurrence *) in
        let (auto, st, k) = build_auto current t in
        ([ (n, auto) :: deter ], merge st stack, hash1 n k span) in
      let (deter, stack, span) = fold_left f ([], [], hash0) arcs in
      let (h, l) = match stack with
        [ [] -> ([], [])
        | [ h :: l ] -> (h, l)
        ] in
      let key = hash b span h in
      let s = Auto.share (State (b, deter, h)) key in
      (s, merge local_stack l, key)
  ];
-10-
Running the Segmenting Transducer
value rec react input output back occ = fun
  [ State (b, det, choices) ->
      (* we try the deterministic space first *)
      let deter cont = match input with
        [ [] -> backtrack cont
        | [ letter :: rest ] ->
            try let next_state = List.assoc letter det in
                react rest output cont [ letter :: occ ] next_state
            with [ Not_found -> backtrack cont ]
        ] in
      let nondets = if choices = [] then back
                    else [ Next (input, output, occ, choices) :: back ] in
      if b then
        let out = [ (occ, Id) :: output ] (* opt final sandhi *)
-11-
        in if input = [] then (out, nondets) (* solution *)
           else let alterns = [ Init (input, out) :: nondets ]
                (* we first try the longest matching word *)
                in deter alterns
      else deter nondets
  ]
and choose input output back occ = fun
  [ [] -> backtrack back
  | [ ((u, v, w) as rule) :: others ] ->
      let alterns = [ Next (input, output, occ, others) :: back ] in
      if prefix w input then
        let tape = advance (length w) input
        and out = [ (u @ occ, Euphony (rule)) :: output ] in
        if v = [] (* final sandhi *) then
          if tape = [] then (out, alterns)
          else backtrack alterns
-12-
        else let next_state = access v in
             react tape out alterns v next_state
      else backtrack alterns
  ]
and backtrack = fun
  [ [] -> raise Finished
  | [ resume :: back ] -> match resume with
      [ Next (input, output, occ, choices) ->
          choose input output back occ choices
      | Init (input, output) ->
          react input output back [] automaton
      ]
  ];
-13-
Example of Sanskrit Segmentation
process "tacchrutvaa";
Chunk: tacchrutvaa
may be segmented as:
Solution 1 :
[ tad with sandhi d | "s -> cch ]
[ "srutvaa with no sandhi ]
-14-
More examples
process "o.mnama.h\"sivaaya";
Solution 1 :
[ om with sandhi m | n -> .mn ]
[ namas with sandhi s | "s -> .h"s ]
[ "sivaaya with no sandhi ]
process "sugandhi.mpu.s.tivardhanam";
Solution 1 :
[ sugandhim with sandhi m | p -> .mp ]
[ pu.s.ti with no sandhi ]
[ vardhanam with no sandhi ]
-15-
Sanskrit Tagging
process "sugandhi.mpu.s.tivardhanam";
Solution 1 :
[ sugandhim
  < { acc. sg. m. }[sugandhi] > with sandhi m | p -> .mp ]
[ pu.s.ti < { iic. }[pu.s.ti] > with no sandhi ]
[ vardhanam
  < { acc. sg. m. | acc. sg. n. | nom. sg. n.
    | voc. sg. n. }[vardhana] > with no sandhi ]
-16-
Statistics
The complete automaton construction from the flexed forms lexicon takes only 9 s on an 864 MHz PC. We get a very compact automaton, with only 7337 states, 1438 of which are accepting states, fitting in 746 KB of memory. Without the sharing, we would have generated about 200000 states for a size of 6 MB!
The total number of sandhi rules is 2802, of which 2411 are contextual. While 4150 states have no choice points, the remaining 3187 have a non-deterministic component, with a fan-out reaching 164 in the worst situation. However, in practice there are never more than 2 choices for a given input, and segmentation is extremely fast.
-17-
Soundness and Completeness of the Algorithms
Theorem. If the lexical system (L, R) is strict and weakly non-overlapping, s is an (L, R)-sentence iff the algorithm (segment all s) returns a solution; conversely, the (finite) set of all such solutions exhibits all the proofs for s to be an (L, R)-sentence.
Fact. In classical Sanskrit, external sandhi is strongly non-overlapping.
Cf. http://pauillac.inria.fr/~huet/FREE/tagger.ps
-18-
A note on termination
Termination is proved by multiset ordering on resumptions.
This allows us to state the algorithm as a non-deterministic algorithm, allowing any strategy for priority of lexicon search versus euphony prediction, as well as arbitrary selection of resumptions when backtracking.
This is important, since it leaves all freedom for implementing arbitrary priority policies learned by corpus training.
-19-
Non-deterministic programming
Non-deterministic programming is no big deal. Why should you surrender control to a PROLOG black box?
The three golden rules of non-deterministic programming:
• Identify your search state space well
• Represent states as non-mutable data
• Prove termination
The last point is essential for understanding the granularity and enforcing completeness.
Remark. Multiset ordering is an elegant method for proving termination of non-deterministic programs, independently of the sequential strategy of the generation of the solutions.
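The three rules can be illustrated on the unglueing problem with an explicit list of resumptions instead of a built-in backtracking engine. The sketch below, in standard OCaml, is an assumption-laden toy rather than the segmenter of the slides: the lexicon is a plain list of non-empty strings, and the names state, successors, next and all are invented for the example.

```ocaml
(* Hypothetical sketch: non-deterministic search with explicit resumptions. *)
let lexicon = ["able"; "am"; "amiable"; "get"; "her"; "i"; "to"; "together"]

(* Rule 1 and 2: the search state is identified and immutable.
   out: words recognized so far, most recent first; rest: remaining input. *)
type state = { out : string list; rest : string }
type resumption = state list          (* states still to be explored *)

(* One successor per lexicon word that is a prefix of the remaining input. *)
let successors st =
  List.filter_map
    (fun wd ->
      let lw = String.length wd and lr = String.length st.rest in
      if lw <= lr && String.sub st.rest 0 lw = wd then
        Some { out = wd :: st.out; rest = String.sub st.rest lw (lr - lw) }
      else None)
    lexicon

(* next res: the next solution, if any, with the resumption for the others.
   Rule 3: each step replaces a state by states with strictly shorter rest,
   so the multiset of remaining lengths decreases and the search terminates. *)
let rec next (res : resumption) : (string list * resumption) option =
  match res with
  | [] -> None
  | st :: back ->
      if st.rest = "" then Some (List.rev st.out, back)
      else next (successors st @ back)

(* Collect every solution by resuming until the resumption list is empty. *)
let all s =
  let rec loop res acc =
    match next res with
    | None -> List.rev acc
    | Some (sol, res') -> loop res' (sol :: acc)
  in
  loop [ { out = []; rest = s } ] []
```

Changing successors st @ back into back @ successors st, or sorting the successors, changes the order in which solutions appear but not the set of solutions, since the termination argument does not depend on the strategy.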
-20-
Enjoy!
• Sanskrit site: http://pauillac.inria.fr/~huet/SKT/
• Sandhi Analysis paper:
http://pauillac.inria.fr/~huet/FREE/tagger.ps
• Course notes:
http://pauillac.inria.fr/~huet/ZEN/esslli.ps
• Course slides:
http://pauillac.inria.fr/~huet/ZEN/Trento.ps
• Tutorial slides:
http://pauillac.inria.fr/~huet/ZEN/Hyderabad.ps
• ZEN library: http://pauillac.inria.fr/~huet/ZEN/zen.tar
• Objective Caml: http://caml.inria.fr/ocaml/
-21-
Automata mista
-22-
Differential words
type delta = (int * word);
A differential word is a notation permitting to retrieve a word w from another word w' sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.
We compute the difference between w and w' as a differential word diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, with maximal common prefix p.
The converse of diff : word -> word -> delta is
patch : delta -> word -> word: w' may be retrieved from w and d = diff w w' as w' = patch d w.
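The definitions of diff and patch can be sketched directly from the equations above. The code is standard OCaml; representing letters as chars, the helper explode, and the error raised when a delta climbs above the root are assumptions of this sketch, and the ZEN library may implement these functions differently.

```ocaml
(* Hypothetical sketch: letters are assumed to be chars. *)
type word = char list
type delta = int * word

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2,
   p being the maximal common prefix of w and w'. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | a :: r, b :: r' when a = b -> diff r r'   (* still inside the prefix p *)
  | _ -> (List.length w, w')                  (* w is now w1, w' is now w2 *)

(* patch (n, u) w: go up n letters from the end of w, then down along u. *)
let patch ((n, u) : delta) (w : word) : word =
  let keep = List.length w - n in
  if keep < 0 then invalid_arg "patch: delta goes above the root";
  List.filteri (fun i _ -> i < keep) w @ u

(* explode "am" = ['a'; 'm'] *)
let explode s = List.init (String.length s) (String.get s)

let w = explode "amiable"
let w' = explode "amber"
let d = diff w w'          (* common prefix "am": up 5 letters, down "ber" *)
let back = patch d w       (* recovers w' *)
```

With w = amiable and w' = amber, the common prefix is am, so the differential word goes up five letters (iable) and down along ber.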
-23-
The automaton structure
type input = word;
type delta = (int * word)
and address = [ Global of delta | Local of delta ];
type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list (input * address);
type automaton = (array auto * delta);
type backtrack = (input * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)
-24-
The transducer structure
type input = word and output = word;
type delta = (int * word)
and address = [ Global of delta | Local of delta ];
type trans = [ State of (bool * deter * choices) ]
and deter = list (letter * trans)
and choices = list (input * output * address);
type transducer = (array trans * delta);
type backtrack = (input * output * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)
-25-
Next: hierarchical/modular automata (see Raajiv's talk?)