Computational Linguistics
From Zen to Aum
Gérard Huet
Chalmers University, May 21st, 2004
The Zen toolkit - Generic technology
A few specific applicative techniques:
• Local processing of focused data
• Sharing
• Lexical trees
• Differential words
• Finite transducers as lexicon morphisms
• Search by resumption coroutines
• Multiset ordering convergence
Automata Mista - AuM
We represent finite-state automata by a mixed structure: a deterministic skeleton decorated by non-deterministic transitions.
The first component is a forest of lexical trees, used as covering trees of the state transition graph. The rest of the transitions are represented as annotations stating that on a certain input (a word, possibly empty, allowing ε-transitions), the automaton goes to a state designated by a virtual address. There are two kinds of addresses, local and global. A global address is given by an integer (indexing into the forest array) and a word. A local address has the same structure, but now acts as a differential word. Its first component indexes into an array representing the access path in the current tree (necessary because of sharing).
Differential words
type delta = (int * word);
A differential word is a notation for retrieving a word w' from another word w sharing a common prefix. It denotes the minimal path connecting the words in a tree, as a sequence of ups and downs: if d = (n, u) we go up n times and then down along word u.
We compute the difference between w and w' as a differential word diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2, with maximal common prefix p.
The converse of diff : word -> word -> delta is patch : delta -> word -> word: w' may be retrieved from w and d = diff w w' as w' = patch d w.
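The definitions of diff and patch above can be sketched as follows. This is a minimal illustration in standard OCaml syntax (the slides use the revised syntax), not the toolkit's own code. It assumes that letters are characters and that words are lists of letters; the helpers `word_of_string` and `take` are hypothetical names introduced only for this example.

```ocaml
(* Assumption: letters are characters, words are lists of letters. *)
type letter = char
type word = letter list
type delta = int * word

(* Hypothetical helper: turn a string into a word. *)
let word_of_string (s : string) : word =
  List.init (String.length s) (String.get s)

(* Hypothetical helper: the first k elements of a list. *)
let rec take k = function
  | x :: rest when k > 0 -> x :: take (k - 1) rest
  | _ -> []

(* diff w w' = (|w1|, w2) where w = p.w1 and w' = p.w2,
   p being the maximal common prefix: strip equal leading letters. *)
let rec diff (w : word) (w' : word) : delta =
  match w, w' with
  | c :: r, c' :: r' when c = c' -> diff r r'
  | _ -> (List.length w, w')

(* patch (n, u) w: go up n times (drop the last n letters of w),
   then go down along u (append u). *)
let patch ((n, u) : delta) (w : word) : word =
  let keep = List.length w - n in
  if n < 0 || keep < 0 then invalid_arg "patch: cannot go up that far"
  else take keep w @ u
```

For instance, with w = "walking" and w' = "walked" the common prefix is "walk", so diff yields (3, "ed"), and patching "walking" with that differential word gives back "walked".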
The automaton structure
type input = word;
type delta = (int * word)
and address = [ Global of delta | Local of delta ];
type auto = [ State of (bool * deter * choices) ]
and deter = list (letter * auto)
and choices = list (input * address);
type automaton = (array auto * delta);
type backtrack = (input * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)
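To make the two kinds of virtual addresses concrete, here is a hedged sketch of how they might be resolved. It restates the types in standard OCaml syntax and assumes that letters are characters. The functions `walk` and `resolve` are hypothetical names, and the access stack is assumed to be a list with the current state first, so that a local address (n, u) goes up n states and then down along u.

```ocaml
(* The slide's types, transcribed into standard OCaml syntax.
   Assumption: letters are characters. *)
type letter = char
type word = letter list
type input = word
type delta = int * word
type address = Global of delta | Local of delta
type auto = State of (bool * deter * choices)
and deter = (letter * auto) list
and choices = (input * address) list
type automaton = auto array * delta

(* Follow the deterministic skeleton from state st along word u. *)
let rec walk (st : auto) (u : word) : auto option =
  let State (_, det, _) = st in
  match u with
  | [] -> Some st
  | c :: rest ->
    (match List.assoc_opt c det with
     | Some next -> walk next rest
     | None -> None)

(* Resolve a virtual address (hypothetical helper).
   stack is the access stack, current state first, root last. *)
let resolve (forest : auto array) (stack : auto list) (addr : address)
  : auto option =
  match addr with
  | Global (n, u) ->
    (* Index into the forest array, then go down along u. *)
    if n >= 0 && n < Array.length forest then walk forest.(n) u else None
  | Local (n, u) ->
    (* Go up n states on the access path, then down along u. *)
    if n < 0 then None
    else
      (match List.nth_opt stack n with
       | Some ancestor -> walk ancestor u
       | None -> None)
```

Keeping the access path explicit is what lets a shared subtree be re-entered from different contexts: the same local address resolves relative to whichever path led to the current state.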
Completeness
Every non-deterministic automaton (possibly with ε-transitions) may be represented as a flat aum (with empty deterministic structure).
Every deterministic automaton may be represented as an aum whose choice annotations State (b, [], [([], address)]) do not give rise to backtracking.
Every aum has a minimal representation, obtained by maximal sharing. N.B. Sharing the local virtual addresses does not necessarily correspond to equivalence by bisimulation.
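The first claim can be illustrated with a small sketch in standard OCaml syntax. The record type `nfa` and the function `flat_aum` are hypothetical, introduced only for this example: each NFA state becomes one entry of the forest with an empty deterministic part, and each transition becomes a choice pointing to a global address with an empty word.

```ocaml
(* The slide's types in standard OCaml syntax; letters assumed to be chars. *)
type letter = char
type word = letter list
type input = word
type delta = int * word
type address = Global of delta | Local of delta
type auto = State of (bool * deter * choices)
and deter = (letter * auto) list
and choices = (input * address) list
type automaton = auto array * delta

(* Hypothetical description of a non-deterministic automaton.
   A transition labelled with the empty word is an epsilon-transition. *)
type nfa = {
  size : int;                              (* states are 0 .. size - 1 *)
  start : int;
  accepting : int list;
  transitions : (int * word * int) list;   (* (source, input, target) *)
}

(* Flat aum: no deterministic skeleton, every transition is a choice. *)
let flat_aum (m : nfa) : automaton =
  let state i =
    let choices =
      List.filter_map
        (fun (src, inp, dst) ->
           if src = i then Some (inp, Global (dst, [])) else None)
        m.transitions
    in
    State (List.mem i m.accepting, [], choices)
  in
  (Array.init m.size state, (m.start, []))
```

In this encoding all the work is done by backtracking over choices; the deterministic skeleton only starts to pay off once common prefixes are factored into lexical trees.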
The transducer structure
type input = word and output = word;
type delta = (int * word)
and address = [ Global of delta | Local of delta ];
type trans = [ State of (bool * deter * choices) ]
and deter = list (letter * trans)
and choices = list (input * output * address);
type transducer = (array trans * delta);
type backtrack = (input * output * delta * choices)
and resumption = list backtrack; (* coroutine resumptions *)
[Figure: layout of an AuM. Labels: forest, stack, letters a1 ... an, current word, dag.]
Memorisation of the current access
The access stack [sn; sn−1; ... s0] is necessary to interpret local virtual addresses. It may be convenient to store as well the current access word word = [an; ... a1], stacked and unstacked along the local accesses. We may thus distinguish two output constructors: Absolute of word and Relative of delta. In the latter case, the output is computed by patch applied to word.
Applications:
• Inflected forms dictionary used as lemmatizer (regular plural: δ = (1, ['s']))
• Unglue (δ = (0, []))
• Segment (δ = (0, u))
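The two output constructors can be sketched as follows, in standard OCaml syntax. This is an illustration under assumptions, not the toolkit's code: letters are taken to be characters, `emit` and `take` are hypothetical names, and the current access word is assumed to be kept as a stack [an; ... a1] with the most recent letter first, so it is reversed before patch is applied.

```ocaml
(* Assumption: letters are characters, words are lists of letters. *)
type letter = char
type word = letter list
type delta = int * word

(* The two output constructors named on the slide. *)
type output = Absolute of word | Relative of delta

(* Hypothetical helper: the first k elements of a list. *)
let rec take k = function
  | x :: rest when k > 0 -> x :: take (k - 1) rest
  | _ -> []

(* patch (n, u) w: drop the last n letters of w, then append u. *)
let patch ((n, u) : delta) (w : word) : word =
  let keep = List.length w - n in
  if n < 0 || keep < 0 then invalid_arg "patch: cannot go up that far"
  else take keep w @ u

(* Compute the output word. access is the current access word kept
   as a stack, most recent letter first (an assumption of this sketch). *)
let emit (access : word) (out : output) : word =
  match out with
  | Absolute w -> w
  | Relative d -> patch d (List.rev access)
```

With δ = (0, []) the output is the access word itself, and with δ = (0, u) it is the access word extended by u, which matches the shape of the Unglue and Segment cases listed above. A relative output costs only a small pair per state, where an absolute one would store a full word.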
Modular aums
An aum is given by a pair in (array auto * delta).
We make them modular by making the global addresses relocatable, and possibly interpreting success states by continuations. Continuations are implemented as ε-transitions, i.e. extra choices, with empty input.
Now it is easy to compile regular expressions into aums, as follows:
• The base case is any aum, its size the size of its array.
• If A = (arrayA, deltaA) is of size a and B = (arrayB, deltaB) is of size b, A·B is obtained by relocating B by a, continuing A by a+deltaB, starting at deltaA, of size a+b.
• If A = (arrayA, deltaA) is of size a and B = (arrayB, deltaB) is of size b, A+B is obtained by relocating B by a, starting at a+b+1, where we put State (False, [], [([], deltaA); ([], a+deltaB)]), of size a+b+1.
• If A = (arrayA, deltaA) is of size a, then A∗ is obtained by continuing A by deltaA, making its starting node accepting, of size a.
These transformations ought to be effected before sharing.
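The concatenation and union rules above can be sketched as follows, in standard OCaml syntax. This is one possible reading, stated with its assumptions: "a+deltaB" is taken to mean shifting the forest index of deltaB by a; "continuing A by d" is taken to mean that every accepting state of A stops accepting and gains an ε-choice to the global address d; and, since OCaml arrays are 0-based, the new entry state of a union sits at index a+b. The names `map_states`, `relocate`, `continue_by`, `concat` and `union` are hypothetical, and the star case is left out.

```ocaml
(* The slide's types in standard OCaml syntax; letters assumed to be chars. *)
type letter = char
type word = letter list
type input = word
type delta = int * word
type address = Global of delta | Local of delta
type auto = State of (bool * deter * choices)
and deter = (letter * auto) list
and choices = (input * address) list
type automaton = auto array * delta

(* Apply f to every state of a lexical tree, children first. *)
let rec map_states (f : auto -> auto) (State (b, det, ch)) : auto =
  f (State (b, List.map (fun (c, s) -> (c, map_states f s)) det, ch))

(* Relocation by k: shift the forest index of every global address.
   Local addresses are relative to the access path and stay unchanged. *)
let relocate (k : int) : auto -> auto =
  map_states (fun (State (b, det, ch)) ->
      let shift = function
        | (inp, Global (n, u)) -> (inp, Global (n + k, u))
        | local -> local
      in
      State (b, det, List.map shift ch))

(* Continuation by d: success states become epsilon-choices to d. *)
let continue_by (d : delta) : auto -> auto =
  map_states (fun (State (b, det, ch)) ->
      if b then State (false, det, ch @ [ ([], Global d) ])
      else State (b, det, ch))

(* A.B: B relocated by a, A continued by a + deltaB, start at deltaA. *)
let concat ((arr_a, start_a) : automaton) ((arr_b, (n_b, u_b)) : automaton)
  : automaton =
  let a = Array.length arr_a in
  let left = Array.map (continue_by (a + n_b, u_b)) arr_a in
  let right = Array.map (relocate a) arr_b in
  (Array.append left right, start_a)

(* A+B: B relocated by a, plus a fresh entry state with two epsilon-choices.
   With 0-based arrays the entry state is at index a + b. *)
let union ((arr_a, start_a) : automaton) ((arr_b, (n_b, u_b)) : automaton)
  : automaton =
  let a = Array.length arr_a and b = Array.length arr_b in
  let entry =
    State (false, [], [ ([], Global start_a); ([], Global (a + n_b, u_b)) ])
  in
  (Array.concat [ arr_a; Array.map (relocate a) arr_b; [| entry |] ],
   (a + b, []))
```

Both functions rebuild whole trees, which is consistent with the remark that these transformations should be applied before sharing: once subtrees are shared, rewriting one occurrence would affect every context that uses it.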
Conclusion
Automata mista offer an elegant applicative solution to many finite-state processing problems, typically the treatment of lexicon representation, phonology, morphology and segmentation in computational linguistics. The deterministic spanning tree of their state space is then naturally the dictionary of inflected forms of words, which is thus placed at the center of the computer treatment of language.