database: An experiment in data reverse engineering
Gerard Huet
INRIA-Rocquencourt
78150 LeChesnay France
Abstract
We present an experiment in data reverse engi-
neering in the eld of computational linguistics. We
explain a methodology which preserves to a great ex-
tent theoriginalinputformat,allowingparallelacqui-
sition/update of the data with processing at a more
structuredrepresentation level. We motivate the use
for such applications of Objective Caml, a functional
programminglanguagewithstrongstatictyping, para-
metric modules andmeta-linguistic technology.
1 Introduction
This paperdescribesan experiment conducted by
the author to turn aloosely structuredtextual San-
skrit to French lexicon into a comprehensive lexical
database,usedasuniquesourceforavarietyofuses:
a high-quality printable 320 pages Sanskrit-to-
Frenchdictionary, withentrieslistedbothin de-
vanagar scriptand in roman transcription with
diacritics
a Web site givinga fully tagged hypertextview
ofthesamedocument
various search toolsof this web site, notablyan
indexenginereturningallentries(andthedeclen-
sions thereof) matching agivenprex, a simpli-
ed index where diacritics maybeomittedfrom
the query, and agrammaticalengine computing
declensionsofagivenstem.
This database,freelyavailable onInternet[6],is now
used asatestbench forcomputationallinguistics[8].
Inarstpartofthispaperwedescribethemethod-
ology used. In asecond part, we discuss the appro-
priatenessoffunctionalprogrammingandoftheW3C
and otherInternetstandardsforthistypeofapplica-
2 From loose text processing to
context-free syntax analysis
The initial form of the lexicon was a source T
E X
documenteditedasatextleundertheEmacseditor.
Itwasthus asemi-structureddocument, inthe sense
thatlexicalentriesweredescribedusingacombination
ofunstructuredtextwithT
E
Xmacros. Thiseortwas
startedfromscratchin1993,andhadgrownbymiddle
1996toatextleof1.2Mb,listing8000entries.Here
isasampleentryofthedocumentatthattime:
\word{$kumAra$}{kum\=ara} m.
gar{\c c}on, jeune homme
\or prince; page; cavalier
\or myth. np. de Kum\=ara ``Prince'',
\'epith. de Skanda
-- n. or pur
-- \fem{kum\=ar{\=\i}\/} adolescente, jeune
fille, vierge.
Let us give a few comments on the above entry.
Thespecialstring$kumAra$isactuallyanotationfor
the devanagar representation. It is interpreted by
a specic processor devnag, originally due to Franz
Velthuis[12], which in a rst pass on the document
analyses the Sanskrit words in syllables, combines
theirconsonantsintermsofdevanagarligatures,and
compiles the corresponding sequences of ligatures in
termsof sequences of glyphs in alow-levelMetafont
glyphalphabet. ThecorrespondingCprocessortrans-
forms thus $kumAra$ into acryptic {\dn \7{k}mAr}
which can then beinterpreted byT
E
X enriched with
thednfont.
Despite theuse ofafew macroslikeor, wordand
fem,thesourcewasextremelylowlevel,therewasno
clearnotionofscopeofthevariousconstituents,there
were two distinct notations for diacritics, one using
thedevnagconventionsuch askumAra,theotherone
usingT
E
Xinputconventionsforromanalphabetwith
check of consistency. Sanskrit words freelyappeared
in theFrenchtext,whereaccentswereencodedinthe
T
E
X way. Structural markers such as the separator
of grammaticalroles used directly thetypographical
encoding -- instead of an explicit macro, and low-
level typographical details such as the italic correc-
tion\/appearedexplicitly. It washardlypossibleto
debug typingmistakes,which couldresultinT
E Xer-
rors showingupmanypagesafterthetypo. Needless
to say, thedocument had become extremely hard to
maintain. A moredisciplined wayof proceeding was
clearlyin order.
Therstsignicantimprovementwastoisolateall
the text proper from the T
E
X meta-notation, to use
systematicallymacrosforthestructuralunits,and to
terminateeveryentryandsub-entrywithexplicitclos-
ingmarkers. Thisrstpasswaseectedmechanically
byEmacsmacros. Finally,Frenchaccentswerecoded
in the ISO-LATIN1 standard8-bitencoding. Here is
theresultonoursampleentry:
\word{$kumAra$}{kum\=ara}
\sm \sem{garcon, jeune homme}
\or \sem{prince; page; cavalier}
\or \myth{np. de Kum\=ara ``Prince'',
epith. de Skanda}
\role \sn \sem{or pur}
\role \fem{kum\=ar{\=\i}\/} \sem{adolescente,
jeune fille, vierge}
\fin
Weremarkthatwestillhaveredundancyofdiacrit-
ics encodings, and undisciplined intermixing of San-
skrit namessuch asKum\=arawithFrench text. But
at leastthe structure of the dictionaryentries is im-
plicit intheT
E
X macros.
The next step was to make explicit the lexicon
structure by parsing the source as belonging to a
context-free language identifying structural regulari-
ties. Actually 3 context-free grammars were identi-
ed, correspondingrespectivelytothestructuresofa
genericlexicon,ofSanskritquotations,andofFrench
sentences. These 3 grammars are each dened with
their own lexical analyser, so that diacritic encod-
ingsmayappearonlyin theSanskritquotations,and
accented letters may appear only in the French sen-
tences. Beforegivingdetails onthisgrammaticalap-
paratus, let us give the nal source of our example
entry:
\word{kumaara}
\or \sem{prince; page; cavalier}
\or \myth{np. de \npd{Kumaara} "Prince",
epith. de \np{Skanda}}
\role \sn \sem{or pur}
\role \fem{kumaarii} \sem{adolescente,
jeune fille, vierge}
\fin
The transformation between the previous rep-
resentation and the nal one was eected semi-
mechanically, byamixture ofmutualadjustmentbe-
tween the lexicon source and the grammar descrip-
tion. Some of these adjustments were aided by me-
chanical tools such asEmacs macros or other trans-
ducers. Others werehard totrack. Forinstance, the
explicittaggingofpropernamesinSanskrit(withthe
labelnpforused occurrencesandnpdfordened oc-
currences)waseasyfornameswherediacriticsoccur,
suchasKum\=araabove,buthardfortheotherssuch
as Skanda, for which only human intervention may
decide whether the name belongs to Sanskrit or to
French.
Thetechnologyusedforthedescriptionofthelex-
icalanalysersbeingnaturallytto describingregular
transducers, itwasthen easyto adapt it to the gen-
eration of the redundant diacritics encodings from a
uniquecanonicaldescription,whichsubsumesthetwo
previous representations kumAra and kum\=ara into
theunique kumaara, achievingone of the stated ob-
jectives.
WechoseObjectiveCaml[14,19,1]asourprogram-
minglanguageforthistransductionactivity. Weshall
justifythis technologicalchoiceat theend ofthis ar-
ticle,after describingtherestof ourprocessingtools.
For themoment, we remark that the Ocaml prepro-
cessorCamlp4[16] is ideallysuited to thedescription
ofparsersinahighlevelnotationsmoothlyintegrated
inthe syntaxof thisvarietyof theML programming
language.
Forinstance,hereisaportionofthesourcecodeof
thetransducerfromdiacritics encodingsalaVelthuis
intoT
E
X notation:
value tex = Gram.Entry.create "skt to tex";
GEXTEND Gram
tex:
[ [ LETTER "a"; LETTER "a" -> "\\=a"
| LETTER "a" -> "a"
| LETTER "i"; LETTER "i" -> "{\\=\\i}"
| LETTER "i" -> "i"
| LETTER "u"; LETTER "u" -> "\\=u"
| "."; LETTER "t" -> "{\\d t}"
| "."; LETTER "d" -> "{\\d d}"
| "."; LETTER "s" -> "{\\d s}"
...
| i = LETTER -> i
| i = INT -> i
| EOI -> raise Exit ] ];
END;
whereLETTERandINTaredenedaslexicaltokensin
ourcustomlexicalanalyserTrans.lexer,andGramis
agrammarmodule,initialisedasfollows:
module Gram = Grammar.Make
(struct value lexer = Trans.lexer (); end);
That is, the Camlp4 standard library Grammar pro-
vides a generic functor (parametric module) Make
which may be instanciated by a specic lexer struc-
tureinordertogenerateamoduleGramwhichcontains
ascomponentsallthenecessaryparsingtoolssuchas
entry creation (Gram.Entry.create). One of these
componentsisaparserGram.Entry.parsewhichisa
function from entries (such as tex dened above) to
stream analysers. Inourcase, astream analyserisa
function from streams to strings. It is now easy to
program agenericfunction transducer,which given
anentryreturnsafunction fromcharacterstreamsto
strings,andthespecictransducerfromdiacriticsen-
codingstoT
E
Xencodingsissimplyobtainedbypartial
application:
value skt_to_tex = transducer tex;
This veryelegantimplementation of the ideasex-
posed in thearticle by de Rauglaudre and Mauny[2]
is furthermore quite eÆcient. Once the concepts are
mastered, weare ledto verysmoothprogramswhich
areselfdocumentedandeasytomaintain. Ourreverse
engineering exercise useddozens of such transducing
engines, seamlessly embedded in the rest of our ML
modules, and whose strong static typing guarantees
correctexecution.
3 Fromcontext-freeconcretesyntaxto
strongly typed abstract syntax
Let usstatetherequirementsofourparser. First,
itshouldpreservealltheinformationcontainedinthe
database,includingtypographicalhintssuchasitalic
correction and hyphenation pragmas; it should also
preserve comments, which containvaluable informa-
rmationorabsenceinclassicaldictionaries,etc.
Another requirementis that it should as much as
possiblepreservetheexibilityoftheoriginalformat;
theoppositerequirement,thatitshouldnotbeoverly
complex,dened atrade-obetweenpreservingvalu-
able notational exibility and avoiding baroque and
ad-hocdetails.
Thistrade-owasmadeexplicitbytherequirement
ofdesigninganabstractsyntax, deninganalgebraic
representationofthelexicon;thisabstractsyntaxwas
made concrete as a set of recursive ML type deni-
tions;to everynon-terminalofthegrammaroughtto
correspondauniquetype,the variouschoices forthe
corresponding grammar entry corresponding to the
various constructors of the relevant type; listscorre-
spondedtotheKleenestaroperation,whereasoption
types corresponded to epsilon unions; modulo these
regularoperations,weaimedatisomorphismbetween
abstractandconcretesyntax;eachgrammarrulewas
associatedwithasemanticaction,constructingavalue
of the appropriate type, usually the corresponding
constructor, applied recursively to the values associ-
atedtothesemanticactionsofthenon-terminalsap-
pearing inthe production;the abstractsyntaxcorre-
spondingtoagiveninputstringwasthusconstructed
asa well-typed ML value synthesized by bottom-up
parsing.
The design of this abstract syntax was really the
creative activity underlying this reverse engineering
process, similar to the denition of the DTD[24] of
ourdictionarystructure. Itshouldbestressedatthis
pointthatanaprioridesignofthisDTDfromscratch
would havebeenhopeless, in the sense that itwould
havehadtoberevisedincessantlyalongwiththedic-
tionarygrowth,whereasstartingfroman8000entries
existing dictionary gave us a much better insurance
that most peculiarities and exceptions were already
present-weindeedbelievethatthestructurewecame
upwithwillscaleupwithminoradjustmentstofuture
growthofthelexicaldatabase.
Let us give a concrete example of this abstract
syntax, together with a typical set of grammar pro-
ductions. There are two kindsof entries in our dic-
tionary,genuineentriescontaininglemmaticinforma-
tionaboutalexical item, and cross-referenceswhich
pointtypicallytoanorthographicvarianttoitsproper
lemma. This dictinction correspondsto two produc-
tionsinthedictionarygrammarpertainingtothenon-
terminalentry:
entry:
| s = syntax; c = crossref
-> Crossref(s,c)
] ] ;
these twoproductionsproducebothvaluesof anML
type entry, constructed respectively with data con-
structors Entry and Crossref. Thus wend in the
interfacemodule Dictionaryadatatypedeclaration:
type entry =
[ Entry of (syntax * usage * option cogs)
| Crossref of (syntax * crossref)
];
We can thus witness the expected isomorphism be-
tween the parse trees of the concrete syntax non-
terminal entry and the values of the abstract syn-
tax typeentry. Thisparallelcorrespondancemaybe
understood top-down acrossall the structure. Thus
an Entry value has 3 sub-components, two manda-
tory ones,correspondingroughlyto thesyntacticin-
formationconcerningtheentry(orthographicstemfor
eachgrammaticalr^ole,etymologicalinformation,etc.)
and toitsusageinformation (meanings,expectedva-
lency of dependent items, etc.), the third one being
optional(cognateetymologiesinotherIndo-European
languages such as Greek, Latin, Germanic, English,
French, Slavic, Hindi, etc.) Weremark that the op-
tional concretesyntaxcomponentOPT cogshastype
option cogs, where the polymorphictype construc-
toroptionistheOCamllibrarysumconstructor,de-
nedas:
type option 'a = [ None | Some of 'a ];
Similarly,Ocaml grammarsallowtheKleenestarop-
erator LIST x to stand for values of type list t,
where the non-terminal x constructs values of type
list t.
Weshall not detailfurther the abstract structure
ofourdictionary,whichisfullyexplainedinatechni-
cal report[7]; letus just remark that it is highly non
trivial: thethreemutually recursivegrammarsden-
ing itssyntaxcompriseatotalof 321productions, of
which 14 pertain to Sanskrit quotations, 54 pertain
to French sentences, and 253 pertain to the generic
dictionary structure. It would havebeenhopeless to
guessbeforehandthisstructure,whereasitturnedout
to berelatively straightforwardto induce this struc-
turefromthedictionaryrawdataitself,afterafewit-
erationsforsmoothingoutirregularitiesandmistakes.
Wealso remarkthat ourlexicalanalysersarecon-
sistent with T
E
X lexical conventions, and that for
nalsourcesusedin compilingthedictionary. Butour
parserenforcesmanymorestructuralinvariantsthan
anarbitraryT
E
Xdocument. Thustyposandotherin-
putmistakesarespotted byourparserexactlywhere
these invariants are violated, with an explicit error
messagegivingthepreciselocationofthemistake,al-
lowingcontinuous growth ofthe dictionarywith rea-
sonably good debugging facilities (mistakes indicate
thepreciselineand characternumberwherethesyn-
taxisviolated).
Wenishthissectionbypretty-printingourexam-
ple entry above, as the ML representation of its ab-
stractsyntax:
(Entry
(((Word "kumaara"), [], None),
(Subst
[(Meaning
(Unpos (Genderorth (([Mas], None))),
[([], (Sem
[(Affirm [[(Unit "garcon")];
[(Unit "jeune"); (Unit "homme")]])]));
([], (Sem
[(Affirm [[(Unit "prince")]]);
(Affirm [[(Unit "page")]]);
(Affirm [[(Unit "cavalier")]])]));
([], (Myth
[(Affirm [[(Unit "np."); (Unit "de");
(Npd "Kumaara");
(Quotation (Affirm [[(Unit "Prince")]]))];
[(Unit "epith."); (Unit "de");
(Np "Skanda")]])]))]);
(Meaning
(Unpos (Genderorth (([Neu], None))),
[([], (Sem [(Affirm [[(Unit "or");
(Unit "pur")]])]))]);
(Meaning
(Unpos (Genderorth (([Fem],
Some "kumaarii"))),
[([], (Sem [(Affirm
[[(Unit "adolescente")]; [(Unit "jeune");
(Unit "fille")];
[(Unit "vierge")]])]))])]),
None, true))
Alternatively,wecouldgiveanXMLrepresentation
ofthesame:
<Entry>
<Triple>
<Word> "kumaara" </Word>
...
</Triple></Entry>
formating instructions
The main procedureof ourdictionary processoris
a read-eval-loop, which parses successive items, con-
structs their abstract syntax, and then proceeds in
pretty-printing recursively this structure in terms of
low-leveltypographicalinstructions.
Thus, in the caseof ourexampleentryabove,the
generated T
E
X le contains the following low-level
code:
{\dn \7{k}mAr}~{\sl kum\=ara} m.
garcon, jeune homme
\(\mid\:\)prince; page; cavalier
\(\mid\:\)myth. np. de Kum\=ara ``Prince'',
epith. de Skanda
--- n. or pur
--- f. {\sl kum\=ar{\=\i}} adolescente,
jeune fille, vierge.
Onemaywonderat thispointwhywetookallthe
troubletocompilehigh-levelT
E
Xtextintoalow-level
typographicallyequivalentone! Butnowweareready
toreaptherewardofourabstraction,byreplacingeas-
ilythebackend(T
E
Xgenerator)byanotherformating
generator,orbyanarbitrarycomputationonthedic-
tionary structure, for that matter. Furthermore, by
properuseofthemodularityconstructsofOCaml,we
mayfactorouttherecursivetraversalofthestructure
from the backend generator,reduced tolow-levelop-
erationsontheterminalstrings. Letusexplainalittle
this modularstructure.
The main componentgrind isa parametricmod-
ule,orfunctorinMLterminology. Hereisexactlythe
moduletypegrind.mliin thesyntaxofOcaml:
module Grind : functor
(Process:Proc.Process_signature) -> sig end;
with interface module Proc specifying the expected
signatureofaProcess:
module type Process_signature = sig
value process_header :
(Sanskrit.skt * Sanskrit.skt) -> unit;
value process_entry :
Dictionary.entry -> unit;
value prelude : unit -> unit;
value postlude : unit -> unit;
end;
That is,therearetwosortsofitemsin thedatabase,
namely headers and entries. The grinder will start
trywithroutineprocess_entry,andwillconcludeby
calling the process postlude. The module interface
Dictionarydescribes thedictionary abstract syntax
asaset ofmutuallyrecursiveMLdatatypes,whereas
module Sanskrit holds the private representation
structures of Sanskrit nouns (seen from Dictionary
asan abstract type, insuring that only the Sanskrit
lexicalanalysermayconstructvaluesoftypeskt).
A typicalprocessistheprintingprocess
Print_dict,itselfafunctor. Hereisitsinterface:
module Print_dict : functor
(Printer:Print.Printer_signature)
-> Proc.Process_signature;
Ittakesas argumentaPrintermodule, which spec-
ieslow-levelprintingprimitivesforagivenmedium,
anddenes theprintingofentries asageneric recur-
sion over its abstract syntax. Thus we may dene
thetypesettingprimitivestogenerateaT
E
Xsourcein
module Print_tex,andobtainaT
E
X processorbya
simpleinstanciation:
module Process_tex = Print_dict(Print_tex);
Butsimilarly,wemaynowdenetheprimitivesto
generatethe HTML source ofa Web document, and
obtainanHTMLprocessorProcess_htmlas:
module Process_html = Print_dict(Print_html);
It isverysatisfyingindeed tohavesuch sharingin
thetoolsthat build twoultimatelyverydierentob-
jects, abook with professionaltypographical quality
ononehand,andaWebsite tforhypertextnaviga-
tionandsearch ontheotherhand. Butmaybeafew
wordsareinordertoexplainhowindeedthehypertext
structure is implicitin the abstract syntax represen-
tation.
5 Scoping and name management
There are several notionsof scoping in the lexical
database. Theyemergedprogressively, rstfrom the
needtoprovidecorrecthypertextreferencetoSanskrit
quotations, and thus complete decoration by deni-
tionanchors. Sincewewantdirectaccessthroughthe
whole document, this givesa notion of global scope
ofdictionaryentries andpropernamedenitions and
otherexplicitlybindingnames. Theimportantmeta-
notion which emerged was that certain slots of the
abstractstructurealgebraconstructorswherebinding
thesortofSanskritreferences.
Thusanumberofsuch\bindingconstructors"were
identied. Conversely,two unaryreference construc-
tors wereidentied, correspondingin theFrench text
to the quotations to Sanskrit common and proper
namesrespectively.
Theprocessingofsuchbindingslotsstaystranspar-
entintheT
E
Xprinting;intheHTMLversion,itgen-
eratesappropriateURLanchorsandreferencesrespec-
tively. Thegeneratingphaseconstructsatrieindexof
the binding references, warning for duplicated bind-
ings. Asecondcheckingphaseaccessesthispersistent
structureateveryreferenceforwarningofmissingan-
chors. Wethusenforceinthissecond steptheoverall
constructionofaconsistentWebsite.
At this point we reach the concept of a module
type signature, corresponding roughly to a DTD in
XML parlance, with constructors binding in certain
arguments. An abstract syntax value,with namesin
binding and reference slots, may be considered as a
kindofgraphrepresentation.
Thenextnotionwasinducedfromafurther exten-
sion of our computationallinguistics tools, namelya
declension grammaticalengine. Theproblem was to
generate all exed forms corresponding to the basic
stems in thelexicon. InSanskrit, thedeclensionofa
substantive(or adjective, pronoun, and number) de-
pendsontheterminalendingofitsstemstringandon
its gender;in order toautomatethis computationby
a simple traversal of the dictionary structure, a new
scopeconstraintarose: eachstempossibly associated
with an entry had to be associated with its possible
genders. Here thescopingconstraintismorecomplex
than statically determining global binding slots, and
itcorrespondstoadynamictop-downtraversalofthe
abstractsyntaxtree. Duringthistraversalthecurrent
stem is remembered,sothat each genderdeclaration
may registerthe properinformation in atrie associ-
ating witheachentryallsuch pairs(stem,gender). A
nalpassonthetriecallsadeclensionengine,inorder
to build a big trie of all exed words. Furthermore,
thegenderdeclarationsgeneratereferencestothede-
clensionengineseenasaCGIbinary,associatedtothe
appropriatearguments. Thiswaythedeclensionsofa
givenentryare dynamically computablefrom within
thedictionaryhyper-structure.
Weshall notgivefull details onsuch tools, which
aremorefullydocumentedinreference[8],buttheba-
sic idea is to use the abstract syntax of the lexicon
entries to make all kinds of computations, such as
dering of the entries, a not entirely trivial check in
Sanskrit,whereforinstanceagenericnasalisationno-
tation (the so-called anunasika) stands for the nasal
consonant which ishomophonic (savarn
.
a) to the fol-
lowingconsonant.
WeendthissectionbygivingtheHTMLrepresen-
tationobtainedforourexampleentry:
<P CLASS="dico">
<A CLASS="defn"; NAME="kum=ara">
<I>kumāra</I></A>
<A CLASS="defr";
HREF="http:XX/dicdecl?query=kumaara &
gender=Mas">m.</A>
garçon, jeune homme; fils
| prince; page; cavalier
| myth. np. de <A CLASS="defn";
NAME="*kum=ara">Kumāra</A>
«Prince», épith. de
<A HREF="s.html#*skanda">Skanda</A> —
<A CLASS="defr";
HREF="http:XX/dicdecl?query=kumaara &
gender=Neu">n.</A>
or pur —
<A CLASS="defr";
HREF="http:XX/dicdecl?query=kumaarii &
gender=Fem">f.</A>
<A CLASS="defn"; NAME="kum=ar=i">
<I>kumārī</I></A>
adolescente, jeune fille, vierge.
The string XX above stands for the URL of the
CGI binary directory of the intended HTTP server,
and is installation dependent. Note how invocations
ofthe grammaticalenginedicdecl areprecomputed
asHTMLreferencestriggeredbyclickingonthegen-
derspecications. Thusallpossibledeclensionsofthe
stemsassociatedwithanentryarejustoneclickaway
fromthedictionarynavigator.
6 The compiling chains
Theoverallarchitectureofoursoftwareplatformis
acollectionof processes, called as parametersto the
genericGrindfunctor,inordertomakesomecompu-
tationduringtherecursivetraversalofthevariousen-
tries,computedontheywhileparsingthedatabase;
global information is kept in tries structure, which
maintainamaximally sharedrepresentation. It is to
beremarkedthattheprogressiveconstructionofthese
tries iscompletely applicative, beingimplemented as
E
sourceofthepaperdictionary,availableinPSorPDF
format, and a consistentWeb site which implements
a sort of hypertext encyclopedia, with various exe-
cutable families of links, permitting easy navigation
to quotations, to etymologicalparents, to synonyms
andantonyms,etc,andalsospecicgrammaticaltab-
ulations such as declensiontables. We nally stress
thatthesetwo`prettyprinters',whichdeliversuchdif-
ferentnalproducts,areactuallytwoinstanciationsof
a generic pretty-printing functor with backend mod-
ulesgivingtherepresentationsinrespectivelyT
E Xand
HTML of lowleveltypographicalprimitives(such as
blank, line, printident, etc). We thus fully use the
moduleconstructionsofObjectiveCaml.
Itistobenotedthatatsomepointtheencodingof
Frenchsentences,whichoriginallywasjust anencap-
sulationofISO-LATIN1strings,hadtobedroppedin
favorofUnicodeUTF-8encodings,in orderto repre-
sent diacritics, for which weused a fontdesigned by
Chris Fynn[4]. But the corresponding modication
consistedinjustpipingthe8-bitsstringsintoanISO-
LATIN1toUTF-8transducer,amatterof15minutes
programmingwithourgenerictransductionfunction.
Thus we had the best of both worlds - Unicode en-
coding of the Web pages, allowing arbitrary fontsin
thedictionaryWebsite,whilepreservingISO-LATIN
acquisition with AZERTY (french) keyboards in the
databasesource.
7 Implementation notes
7.1 ML as a software engineering
workhorse
We believe that Objective Caml, together with
thesyntaxfacilitiesofitscompanionmacro-processor
Camlp4, is ideally suited to such processing. Its
stream library is extremely eÆcient, and the whole
1.7Mbofsourcedatabaseiscompletelyprocessedinto
saythewebsiteinlessthan1mnona500MHzPC.We
recall that ObjectiveCaml is available bothin byte-
codeversion,foreasydebugging,andinoptimisedna-
tivecodecompetitivewith Cprogramming. Theker-
nelofObjectiveCamlbeingitselfwritten inANSIC,
itis availableonallcurrentplatforms.
OneofthemainstrengthsofOCamlistheversatil-
ityof itsprogrammingparadigms: one mayprogram
in a purely applicative functional style, or one may
mixapplicativeandimperativefeatures. Forinstance,
wemadeanextensiveuseofmutabledata structures
tivewayasfunctionalzippers[5].
OCaml may indeed be considered as the modern
successor of both LISP-Scheme and Pascal-Modula.
Itbenetsfrom thesafetyofstrongstatictyping,but
polymorphism together with the synthesis of princi-
pal types by the compiler relieves the programmer
fromcumbersomeexplicittypedeclarations. Itsmod-
ule system is aspecially powerful abstraction mech-
anism for the design of complex systems. It has an
elegantobjectmodel,wellintegratedwiththerest of
the language, although we did not use objectorien-
tation in our application. It ought to be remarked
thatthemechanismsavailableforobjectorientedpro-
grammingdonotincuranypenaltyifonedoesnotuse
them.
Thetotalcodewrittensofarforourplatformcom-
prises45modules, totalingabout10000linesofcode.
Thiswholesystemwasdevelopedsolelybytheauthor
in about 6months asapart-time activity. Thepro-
grams were developed directly with the native code
compiler,withonlyoccasionaluseofbytecodeexecu-
tionand virtuallyno debugging: thewhole develop-
mentcycle consistediniteratingsourceedition,com-
pilation until syntax errors and type inconsistencies
were xed, then native code execution. The depen-
dencyanalysistoolof OCaml,coupledwiththeUnix
makecommand,provedtobeinvaluable.
The onlyother software packagesused forthe de-
velopment of our platform are the Emacs text ed-
itor (with its OCaml mode Tuareg) and the T
E X-
Metafont-L A
T
E X2
"
text formatter (with the devnag
preprocessor[12] fordisplayingdevanagartextunder
theBabelpackage).Thisdemandsanappropriatefont
serverfortheTrueTypefontformat,adiÆcultyforold
versionsofXwindowsonUNIX/Linux platforms.
7.2 Web and other standards
For our Web tools we used the Web standard
HTML 4.0[13] with Cascading Style Sheets in order
toseparatetheformatinginstructionsfromthestruc-
turing constructs. We used UTF-8 encodings corre-
spondingtotheUnicode[23]standard2.1,fordisplay-
ingdiacriticsintheromanindic font[4].
A general remark concerning the W3C standards
isthattheyareevolvingtoofast,andthat interoper-
abilityacrossplatformsisfarfromsatisfactory. Since
alarge percentageof the Web users use browsers in
theoriginalversionwhichwasinstalledatthetimeof
purchase of their platform, they will not be able to
correctlyinterpretmorerecentversionsofthevarious
features(theso-calleddeprecatedoptionsinW3Cjar-
gon)whicheverybrowserwillaccept,andnewerfunc-
tionalitieswhichmaybeunknowntoolderplatforms.
EvenmatureHTMLfeaturesmayhavequitedierent
look-and-feelundersayNetscapeNavigatorandInter-
netExplorerbrowsersintheirvariousversions.Inthis
compromise we refrained from using ad-hoc browser
scripting languages such as Javascript, mobile code
programminglanguagessuchasJava,andad-hocfea-
turessuchascookies,andwelimitedourselvesinthe
dynamic capabilities of our Web interface to remote
procedurecallsgeneratingHTMLpages(theso-called
CGI-bins).
A related remark concerns XML[24], which is
rapidly imposing itself as the standard data inter-
changeformat,especiallyintheareaoflinguisticand
philology resources such as digital dictionaries and
corpuses[22]. Thismeans that XML versions of digi-
tal lexicons such as our sanskrit dictionay will have
to be provided in the future in order to be usable
in moregenericlinguistic platforms. Suchdocuments
will becompliantwithpreciseDTDs explainingtheir
logical structure, which will be themselves instances
of generic standard formats. However, we are not
convinced at all that the production of these docu-
ments, as well as computations performed on them,
will need to use allthe layersof resourcedescription
and other frameworks which are weekly churned out
by theW3 Consortiumandthe various vendors. We
rather believe that general programming languages
such as OCamlare a moresolid, moreuniform, and
moreeÆcientalternativetosuchspecicsoftwareen-
gineeringtools.
8 Conclusion
Reverseengineeringisoftenreferredtoin thecon-
text oftransferringsomelegacysystemtoanewpro-
gramminglanguage,systemorplatform,orofporting
by hook or crook some obsolete database to an up-
to-date format. Here we show an application where
thereverseengineeringoperationleadstoatranslator
(actuallyafamilyoftranslatorsandprocessingtools)
which allows the development of a naturallanguage
processing platform centered on a lexical database
whose format is only slightly modied. This means
that the development of this platform may process
asynchronously from the subsequent updates to the
database, whose development may continue, without
changing too much its input conventions. Further-
ture is progressively extracted from the data. For
instance, comments in the original source may pro-
gressively ndtheir way in the structure, ascitation
indexesforinstance. This ispossiblebecause,in our
approach, even thoughall computations are doneon
theabstractsyntaxtreerepresentation,westillretain
theoriginal concrete syntax text le asthedatabase
source.
Thereisageneralneedforsuchadaptorsintheeld
ofcomputational linguistics, where resourcessuch as
lexiconsanddigitalisedcorpusesarelong-terminvest-
mentsoftendevelopedbyindependentteams. Rather
than attempting a complete standardization of their
formats,anunendingquestanywayinaworldoffast-
changingtechnologies, our approach permits parallel
developmentswithoutchangingtheeditingtoolsand
acquisitionprocesses. Itisofupmostimportancehow-
ever that these translation processes be clean, well
documentedprogramswhich maybemaintainedand
adapted over long evolution cycles. We have shown
that a general-purpose language such as Objective
Caml, with its extensive meta-syntactic features, its
powerfultype discipline and module system, and its
eÆcientandportablebackend,isanidealtoolforsuch
engineeringtasks. Weseelittlejusticationindeedin
theuseoflessgeneraland poorlyeÆcientcomputing
paradigms,suchastheso-called\scriptinglanguages".
References
[1] Guy Cousineau and Michel Mauny. \The Func-
tional Approach to Programming". Cambridge
UniversityPress,1998.
[2] Daniel de Rauglaudre and Michel Mauny.
\Parsers in ML". Proceedings of Conference on
LISP and Functional Programming (1992), pp.
76{85.
[3] Christiane Fellbaum, Ed. \Wordnet, an Elec-
tronic Lexical Database." MIT Press, 1998. See
also: http://www.cogsci.princeton.edu/~wn.
[4] Christopher J. Fynn. \How to use Diacritics
for Romanised Indic Text on the WWW". See:
http://www.user.dicon.co.uk/~cfynn/
sanskrit.htm.
[5] Gerard Huet. \The Zipper". J. Functional Pro-
Available fromtheSanskritWebSite:
http://pauillac.inria.fr/~huet/SKT/.
[7] Gerard Huet. \Structure of a Sanskrit Dictio-
nary."INRIAResearchReport toappear, 2001.
[8] Gerard Huet. \Computational Linguistics for
Sanskrit: a Software Engineering Approach".
Toappear, anniversaryvolume in honorof Rod
Burstall,2001.
[9] Donald Knuth. The T
E
Xbook. Addison-Wesley,
1984.
[10] LeslieLamport.L A
T
E
X-ADocumentPreparation
System.Addison-Wesley,1986.
[11] Leslie Lamport et al. L A
T
E X2
"
- The macro
package for T
E
X. Edition 1.6, Dec. 1994.
Available from CTAN archive sites, such as
ftp://ftp.tex.ac.uk/.
[12] AnshumanPandey. Devanagar forT
E
X, version
1.6, Aug. 1998. Available from CTAN archive
sites,suchasftp://ftp.tex.ac.uk/.
[13] Dave Raggett, ArnaudLe Horsand Ian Jacobs,
Eds.HTML 4.0 Specication. W3CRecommen-
dation,April24,1998.
http://www.w3.org/TR/REC-html40.
[14] PierreWeisand XavierLeroy.Lelangage Caml.
2emeedition,Dunod, Paris,1999.
[15] AcrobatReader,anAdobeproduct,freelydown-
loadablefrom: http://www.adobe.com/
products/acrobat/readstep.html.
[16] TheCamlp4preprocessor.Available from:
http://caml.inria.fr/camlp4/.
[17] TheCommonGatewayInterface.See:
http://www.w3.org/CGI/Overview.html.
[18] Ghostscript.See: http://www.cs.wisc.edu/
~ghost/index.html.
[19] ObjectiveCaml.Availablefrom:
http://caml.inria.fr/ocaml/index.html.
[20] PDF, the portable document format, an Adobe
product.See: http://www.adobe.com/
products/acrobat/adobepdf.html.
[21] Postscript,anAdobeproduct.See: http://
http://www.uic.edu/orgs/tei/.
[23] TheUnicodestandard2.1. See:
http://charts.unicode.org/.
[24] The XML norm, version 1.1, W3C consortium.
See: http://www.w3c.org/documents/XML/.