We explain a methodology which preserves to a great ex- tent theoriginalinputformat,allowingparallelacqui- sition/update of the data with processing at a more structuredrepresentation level

(1)

database: An experiment in data reverse engineering

Gerard Huet

INRIA-Rocquencourt

78150 LeChesnay France

Abstract

We present an experiment in data reverse engi-

neering in the eld of computational linguistics. We

explain a methodology which preserves to a great ex-

tent theoriginalinputformat,allowingparallelacqui-

sition/update of the data with processing at a more

structuredrepresentation level. We motivate the use

for such applications of Objective Caml, a functional

programminglanguagewithstrongstatictyping, para-

metric modules andmeta-linguistic technology.

1 Introduction

This paperdescribesan experiment conducted by

the author to turn aloosely structuredtextual San-

skrit to French lexicon into a comprehensive lexical

database,usedasuniquesourceforavarietyofuses:

a high-quality printable 320 pages Sanskrit-to-

Frenchdictionary, withentrieslistedbothin de-

vanagar scriptand in roman transcription with

diacritics

a Web site givinga fully tagged hypertextview

ofthesamedocument

various search toolsof this web site, notablyan

indexenginereturningallentries(andthedeclen-

sions thereof) matching agivenprex, a simpli-

ed index where diacritics maybeomittedfrom

the query, and agrammaticalengine computing

declensionsofagivenstem.

This database,freelyavailable onInternet[6],is now

used asatestbench forcomputationallinguistics[8].

Inarstpartofthispaperwedescribethemethod-

ology used. In asecond part, we discuss the appro-

priatenessoffunctionalprogrammingandoftheW3C

and otherInternetstandardsforthistypeofapplica-

2 From loose text processing to

context-free syntax analysis

The initial form of the lexicon was a source T

E X

documenteditedasatextleundertheEmacseditor.

Itwasthus asemi-structureddocument, inthe sense

thatlexicalentriesweredescribedusingacombination

ofunstructuredtextwithT

E

Xmacros. Thiseortwas

startedfromscratchin1993,andhadgrownbymiddle

1996toatextleof1.2Mb,listing8000entries.Here

isasampleentryofthedocumentatthattime:

\word{$kumAra$}{kum\=ara} m.

gar{\c c}on, jeune homme

\or prince; page; cavalier

\or myth. np. de Kum\=ara ``Prince'',

\'epith. de Skanda

-- n. or pur

-- \fem{kum\=ar{\=\i}\/} adolescente, jeune

fille, vierge.

Let us give a few comments on the above entry.

Thespecialstring$kumAra$isactuallyanotationfor

the devanagar representation. It is interpreted by

a specic processor devnag, originally due to Franz

Velthuis[12], which in a rst pass on the document

analyses the Sanskrit words in syllables, combines

theirconsonantsintermsofdevanagarligatures,and

compiles the corresponding sequences of ligatures in

termsof sequences of glyphs in alow-levelMetafont

glyphalphabet. ThecorrespondingCprocessortrans-

forms thus $kumAra$ into acryptic {\dn \7{k}mAr}

which can then beinterpreted byT

E

X enriched with

thednfont.

Despite theuse ofafew macroslikeor, wordand

fem,thesourcewasextremelylowlevel,therewasno

clearnotionofscopeofthevariousconstituents,there

were two distinct notations for diacritics, one using

thedevnagconventionsuch askumAra,theotherone

usingT

E

Xinputconventionsforromanalphabetwith

(2)

check of consistency. Sanskrit words freelyappeared

in theFrenchtext,whereaccentswereencodedinthe

T

E

X way. Structural markers such as the separator

of grammaticalroles used directly thetypographical

encoding -- instead of an explicit macro, and low-

level typographical details such as the italic correc-

tion\/appearedexplicitly. It washardlypossibleto

debug typingmistakes,which couldresultinT

E Xer-

rors showingupmanypagesafterthetypo. Needless

to say, thedocument had become extremely hard to

maintain. A moredisciplined wayof proceeding was

clearlyin order.

Therstsignicantimprovementwastoisolateall

the text proper from the T

E

X meta-notation, to use

systematicallymacrosforthestructuralunits,and to

terminateeveryentryandsub-entrywithexplicitclos-

ingmarkers. Thisrstpasswaseectedmechanically

byEmacsmacros. Finally,Frenchaccentswerecoded

in the ISO-LATIN1 standard8-bitencoding. Here is

theresultonoursampleentry:

\word{$kumAra$}{kum\=ara}

\sm \sem{garcon, jeune homme}

\or \sem{prince; page; cavalier}

\or \myth{np. de Kum\=ara ``Prince'',

epith. de Skanda}

\role \sn \sem{or pur}

\role \fem{kum\=ar{\=\i}\/} \sem{adolescente,

jeune fille, vierge}

\fin

Weremarkthatwestillhaveredundancyofdiacrit-

ics encodings, and undisciplined intermixing of San-

skrit namessuch asKum\=arawithFrench text. But

at leastthe structure of the dictionaryentries is im-

plicit intheT

E

X macros.

The next step was to make explicit the lexicon

structure by parsing the source as belonging to a

context-free language identifying structural regulari-

ties. Actually 3 context-free grammars were identi-

ed, correspondingrespectivelytothestructuresofa

genericlexicon,ofSanskritquotations,andofFrench

sentences. These 3 grammars are each dened with

their own lexical analyser, so that diacritic encod-

ingsmayappearonlyin theSanskritquotations,and

accented letters may appear only in the French sen-

tences. Beforegivingdetails onthisgrammaticalap-

paratus, let us give the nal source of our example

entry:

\word{kumaara}

\or \sem{prince; page; cavalier}

\or \myth{np. de \npd{Kumaara} "Prince",

epith. de \np{Skanda}}

\role \sn \sem{or pur}

\role \fem{kumaarii} \sem{adolescente,

jeune fille, vierge}

\fin

The transformation between the previous rep-

resentation and the nal one was eected semi-

mechanically, byamixture ofmutualadjustmentbe-

tween the lexicon source and the grammar descrip-

tion. Some of these adjustments were aided by me-

chanical tools such asEmacs macros or other trans-

ducers. Others werehard totrack. Forinstance, the

explicittaggingofpropernamesinSanskrit(withthe

labelnpforused occurrencesandnpdfordened oc-

currences)waseasyfornameswherediacriticsoccur,

suchasKum\=araabove,buthardfortheotherssuch

as Skanda, for which only human intervention may

decide whether the name belongs to Sanskrit or to

French.

Thetechnologyusedforthedescriptionofthelex-

icalanalysersbeingnaturallytto describingregular

transducers, itwasthen easyto adapt it to the gen-

eration of the redundant diacritics encodings from a

uniquecanonicaldescription,whichsubsumesthetwo

previous representations kumAra and kum\=ara into

theunique kumaara, achievingone of the stated ob-

jectives.

WechoseObjectiveCaml[14,19,1]asourprogram-

minglanguageforthistransductionactivity. Weshall

justifythis technologicalchoiceat theend ofthis ar-

ticle,after describingtherestof ourprocessingtools.

For themoment, we remark that the Ocaml prepro-

cessorCamlp4[16] is ideallysuited to thedescription

ofparsersinahighlevelnotationsmoothlyintegrated

inthe syntaxof thisvarietyof theML programming

language.

Forinstance,hereisaportionofthesourcecodeof

thetransducerfromdiacritics encodingsalaVelthuis

intoT

E

X notation:

value tex = Gram.Entry.create "skt to tex";

GEXTEND Gram

tex:

[ [ LETTER "a"; LETTER "a" -> "\\=a"

| LETTER "a" -> "a"

| LETTER "i"; LETTER "i" -> "{\\=\\i}"

| LETTER "i" -> "i"

| LETTER "u"; LETTER "u" -> "\\=u"

(3)

| "."; LETTER "t" -> "{\\d t}"

| "."; LETTER "d" -> "{\\d d}"

| "."; LETTER "s" -> "{\\d s}"

...

| i = LETTER -> i

| i = INT -> i

| EOI -> raise Exit ] ];

END;

whereLETTERandINTaredenedaslexicaltokensin

ourcustomlexicalanalyserTrans.lexer,andGramis

agrammarmodule,initialisedasfollows:

module Gram = Grammar.Make

(struct value lexer = Trans.lexer (); end);

That is, the Camlp4 standard library Grammar pro-

vides a generic functor (parametric module) Make

which may be instanciated by a specic lexer struc-

tureinordertogenerateamoduleGramwhichcontains

ascomponentsallthenecessaryparsingtoolssuchas

entry creation (Gram.Entry.create). One of these

componentsisaparserGram.Entry.parsewhichisa

function from entries (such as tex dened above) to

stream analysers. Inourcase, astream analyserisa

function from streams to strings. It is now easy to

program agenericfunction transducer,which given

anentryreturnsafunction fromcharacterstreamsto

strings,andthespecictransducerfromdiacriticsen-

codingstoT

E

Xencodingsissimplyobtainedbypartial

application:

value skt_to_tex = transducer tex;

This veryelegantimplementation of the ideasex-

posed in thearticle by de Rauglaudre and Mauny[2]

is furthermore quite eÆcient. Once the concepts are

mastered, weare ledto verysmoothprogramswhich

areselfdocumentedandeasytomaintain. Ourreverse

engineering exercise useddozens of such transducing

engines, seamlessly embedded in the rest of our ML

modules, and whose strong static typing guarantees

correctexecution.

3 Fromcontext-freeconcretesyntaxto

strongly typed abstract syntax

Let usstatetherequirementsofourparser. First,

itshouldpreservealltheinformationcontainedinthe

database,includingtypographicalhintssuchasitalic

correction and hyphenation pragmas; it should also

preserve comments, which containvaluable informa-

rmationorabsenceinclassicaldictionaries,etc.

Another requirementis that it should as much as

possiblepreservetheexibilityoftheoriginalformat;

theoppositerequirement,thatitshouldnotbeoverly

complex,dened atrade-obetweenpreservingvalu-

able notational exibility and avoiding baroque and

ad-hocdetails.

Thistrade-owasmadeexplicitbytherequirement

ofdesigninganabstractsyntax, deninganalgebraic

representationofthelexicon;thisabstractsyntaxwas

made concrete as a set of recursive ML type deni-

tions;to everynon-terminalofthegrammaroughtto

correspondauniquetype,the variouschoices forthe

corresponding grammar entry corresponding to the

various constructors of the relevant type; listscorre-

spondedtotheKleenestaroperation,whereasoption

types corresponded to epsilon unions; modulo these

regularoperations,weaimedatisomorphismbetween

abstractandconcretesyntax;eachgrammarrulewas

associatedwithasemanticaction,constructingavalue

of the appropriate type, usually the corresponding

constructor, applied recursively to the values associ-

atedtothesemanticactionsofthenon-terminalsap-

pearing inthe production;the abstractsyntaxcorre-

spondingtoagiveninputstringwasthusconstructed

asa well-typed ML value synthesized by bottom-up

parsing.

The design of this abstract syntax was really the

creative activity underlying this reverse engineering

process, similar to the denition of the DTD[24] of

ourdictionarystructure. Itshouldbestressedatthis

pointthatanaprioridesignofthisDTDfromscratch

would havebeenhopeless, in the sense that itwould

havehadtoberevisedincessantlyalongwiththedic-

tionarygrowth,whereasstartingfroman8000entries

existing dictionary gave us a much better insurance

that most peculiarities and exceptions were already

present-weindeedbelievethatthestructurewecame

upwithwillscaleupwithminoradjustmentstofuture

growthofthelexicaldatabase.

Let us give a concrete example of this abstract

syntax, together with a typical set of grammar pro-

ductions. There are two kindsof entries in our dic-

tionary,genuineentriescontaininglemmaticinforma-

tionaboutalexical item, and cross-referenceswhich

pointtypicallytoanorthographicvarianttoitsproper

lemma. This dictinction correspondsto two produc-

tionsinthedictionarygrammarpertainingtothenon-

terminalentry:

entry:

(4)

| s = syntax; c = crossref

-> Crossref(s,c)

] ] ;

these twoproductionsproducebothvaluesof anML

type entry, constructed respectively with data con-

structors Entry and Crossref. Thus wend in the

interfacemodule Dictionaryadatatypedeclaration:

type entry =

[ Entry of (syntax * usage * option cogs)

| Crossref of (syntax * crossref)

];

We can thus witness the expected isomorphism be-

tween the parse trees of the concrete syntax non-

terminal entry and the values of the abstract syn-

tax typeentry. Thisparallelcorrespondancemaybe

understood top-down acrossall the structure. Thus

an Entry value has 3 sub-components, two manda-

tory ones,correspondingroughlyto thesyntacticin-

formationconcerningtheentry(orthographicstemfor

eachgrammaticalr^ole,etymologicalinformation,etc.)

and toitsusageinformation (meanings,expectedva-

lency of dependent items, etc.), the third one being

optional(cognateetymologiesinotherIndo-European

languages such as Greek, Latin, Germanic, English,

French, Slavic, Hindi, etc.) Weremark that the op-

tional concretesyntaxcomponentOPT cogshastype

option cogs, where the polymorphictype construc-

toroptionistheOCamllibrarysumconstructor,de-

nedas:

type option 'a = [ None | Some of 'a ];

Similarly,Ocaml grammarsallowtheKleenestarop-

erator LIST x to stand for values of type list t,

where the non-terminal x constructs values of type

list t.

Weshall not detailfurther the abstract structure

ofourdictionary,whichisfullyexplainedinatechni-

cal report[7]; letus just remark that it is highly non

trivial: thethreemutually recursivegrammarsden-

ing itssyntaxcompriseatotalof 321productions, of

which 14 pertain to Sanskrit quotations, 54 pertain

to French sentences, and 253 pertain to the generic

dictionary structure. It would havebeenhopeless to

guessbeforehandthisstructure,whereasitturnedout

to berelatively straightforwardto induce this struc-

turefromthedictionaryrawdataitself,afterafewit-

erationsforsmoothingoutirregularitiesandmistakes.

Wealso remarkthat ourlexicalanalysersarecon-

sistent with T

E

X lexical conventions, and that for

nalsourcesusedin compilingthedictionary. Butour

parserenforcesmanymorestructuralinvariantsthan

anarbitraryT

E

Xdocument. Thustyposandotherin-

putmistakesarespotted byourparserexactlywhere

these invariants are violated, with an explicit error

messagegivingthepreciselocationofthemistake,al-

lowingcontinuous growth ofthe dictionarywith rea-

sonably good debugging facilities (mistakes indicate

thepreciselineand characternumberwherethesyn-

taxisviolated).

Wenishthissectionbypretty-printingourexam-

ple entry above, as the ML representation of its ab-

stractsyntax:

(Entry

(((Word "kumaara"), [], None),

(Subst

[(Meaning

(Unpos (Genderorth (([Mas], None))),

[([], (Sem

[(Affirm [[(Unit "garcon")];

[(Unit "jeune"); (Unit "homme")]])]));

([], (Sem

[(Affirm [[(Unit "prince")]]);

(Affirm [[(Unit "page")]]);

(Affirm [[(Unit "cavalier")]])]));

([], (Myth

[(Affirm [[(Unit "np."); (Unit "de");

(Npd "Kumaara");

(Quotation (Affirm [[(Unit "Prince")]]))];

[(Unit "epith."); (Unit "de");

(Np "Skanda")]])]))]);

(Meaning

(Unpos (Genderorth (([Neu], None))),

[([], (Sem [(Affirm [[(Unit "or");

(Unit "pur")]])]))]);

(Meaning

(Unpos (Genderorth (([Fem],

Some "kumaarii"))),

[([], (Sem [(Affirm

[[(Unit "adolescente")]; [(Unit "jeune");

(Unit "fille")];

[(Unit "vierge")]])]))])]),

None, true))

Alternatively,wecouldgiveanXMLrepresentation

ofthesame:

<Entry>

<Word> "kumaara" </Word>

...

</Triple></Entry>

(5)

formating instructions

The main procedureof ourdictionary processoris

a read-eval-loop, which parses successive items, con-

structs their abstract syntax, and then proceeds in

pretty-printing recursively this structure in terms of

low-leveltypographicalinstructions.

Thus, in the caseof ourexampleentryabove,the

generated T

E

X le contains the following low-level

code:

{\dn \7{k}mAr}~{\sl kum\=ara} m.

garcon, jeune homme

$\mid\:$prince; page; cavalier

$\mid\:$myth. np. de Kum\=ara ``Prince'',

epith. de Skanda

--- n. or pur

--- f. {\sl kum\=ar{\=\i}} adolescente,

jeune fille, vierge.

Onemaywonderat thispointwhywetookallthe

troubletocompilehigh-levelT

E

Xtextintoalow-level

typographicallyequivalentone! Butnowweareready

toreaptherewardofourabstraction,byreplacingeas-

ilythebackend(T

E

Xgenerator)byanotherformating

generator,orbyanarbitrarycomputationonthedic-

tionary structure, for that matter. Furthermore, by

properuseofthemodularityconstructsofOCaml,we

mayfactorouttherecursivetraversalofthestructure

from the backend generator,reduced tolow-levelop-

erationsontheterminalstrings. Letusexplainalittle

this modularstructure.

The main componentgrind isa parametricmod-

ule,orfunctorinMLterminology. Hereisexactlythe

moduletypegrind.mliin thesyntaxofOcaml:

module Grind : functor

(Process:Proc.Process_signature) -> sig end;

with interface module Proc specifying the expected

signatureofaProcess:

module type Process_signature = sig

value process_header :

(Sanskrit.skt * Sanskrit.skt) -> unit;

value process_entry :

Dictionary.entry -> unit;

value prelude : unit -> unit;

value postlude : unit -> unit;

end;

That is,therearetwosortsofitemsin thedatabase,

namely headers and entries. The grinder will start

trywithroutineprocess_entry,andwillconcludeby

calling the process postlude. The module interface

Dictionarydescribes thedictionary abstract syntax

asaset ofmutuallyrecursiveMLdatatypes,whereas

module Sanskrit holds the private representation

structures of Sanskrit nouns (seen from Dictionary

asan abstract type, insuring that only the Sanskrit

lexicalanalysermayconstructvaluesoftypeskt).

A typicalprocessistheprintingprocess

Print_dict,itselfafunctor. Hereisitsinterface:

module Print_dict : functor

(Printer:Print.Printer_signature)

-> Proc.Process_signature;

Ittakesas argumentaPrintermodule, which spec-

ieslow-levelprintingprimitivesforagivenmedium,

anddenes theprintingofentries asageneric recur-

sion over its abstract syntax. Thus we may dene

thetypesettingprimitivestogenerateaT

E

Xsourcein

module Print_tex,andobtainaT

E

X processorbya

simpleinstanciation:

module Process_tex = Print_dict(Print_tex);

Butsimilarly,wemaynowdenetheprimitivesto

generatethe HTML source ofa Web document, and

obtainanHTMLprocessorProcess_htmlas:

module Process_html = Print_dict(Print_html);

It isverysatisfyingindeed tohavesuch sharingin

thetoolsthat build twoultimatelyverydierentob-

jects, abook with professionaltypographical quality

ononehand,andaWebsite tforhypertextnaviga-

tionandsearch ontheotherhand. Butmaybeafew

wordsareinordertoexplainhowindeedthehypertext

structure is implicitin the abstract syntax represen-

tation.

5 Scoping and name management

There are several notionsof scoping in the lexical

database. Theyemergedprogressively, rstfrom the

needtoprovidecorrecthypertextreferencetoSanskrit

quotations, and thus complete decoration by deni-

tionanchors. Sincewewantdirectaccessthroughthe

whole document, this givesa notion of global scope

ofdictionaryentries andpropernamedenitions and

otherexplicitlybindingnames. Theimportantmeta-

notion which emerged was that certain slots of the

abstractstructurealgebraconstructorswherebinding

(6)

thesortofSanskritreferences.

Thusanumberofsuch\bindingconstructors"were

identied. Conversely,two unaryreference construc-

tors wereidentied, correspondingin theFrench text

to the quotations to Sanskrit common and proper

namesrespectively.

Theprocessingofsuchbindingslotsstaystranspar-

entintheT

E

Xprinting;intheHTMLversion,itgen-

eratesappropriateURLanchorsandreferencesrespec-

tively. Thegeneratingphaseconstructsatrieindexof

the binding references, warning for duplicated bind-

ings. Asecondcheckingphaseaccessesthispersistent

structureateveryreferenceforwarningofmissingan-

chors. Wethusenforceinthissecond steptheoverall

constructionofaconsistentWebsite.

At this point we reach the concept of a module

type signature, corresponding roughly to a DTD in

XML parlance, with constructors binding in certain

arguments. An abstract syntax value,with namesin

binding and reference slots, may be considered as a

kindofgraphrepresentation.

Thenextnotionwasinducedfromafurther exten-

sion of our computationallinguistics tools, namelya

declension grammaticalengine. Theproblem was to

generate all exed forms corresponding to the basic

stems in thelexicon. InSanskrit, thedeclensionofa

substantive(or adjective, pronoun, and number) de-

pendsontheterminalendingofitsstemstringandon

its gender;in order toautomatethis computationby

a simple traversal of the dictionary structure, a new

scopeconstraintarose: eachstempossibly associated

with an entry had to be associated with its possible

genders. Here thescopingconstraintismorecomplex

than statically determining global binding slots, and

itcorrespondstoadynamictop-downtraversalofthe

abstractsyntaxtree. Duringthistraversalthecurrent

stem is remembered,sothat each genderdeclaration

may registerthe properinformation in atrie associ-

ating witheachentryallsuch pairs(stem,gender). A

nalpassonthetriecallsadeclensionengine,inorder

to build a big trie of all exed words. Furthermore,

thegenderdeclarationsgeneratereferencestothede-

clensionengineseenasaCGIbinary,associatedtothe

appropriatearguments. Thiswaythedeclensionsofa

givenentryare dynamically computablefrom within

thedictionaryhyper-structure.

Weshall notgivefull details onsuch tools, which

aremorefullydocumentedinreference[8],buttheba-

sic idea is to use the abstract syntax of the lexicon

entries to make all kinds of computations, such as

dering of the entries, a not entirely trivial check in

Sanskrit,whereforinstanceagenericnasalisationno-

tation (the so-called anunasika) stands for the nasal

consonant which ishomophonic (savarn

.

a) to the fol-

lowingconsonant.

WeendthissectionbygivingtheHTMLrepresen-

tationobtainedforourexampleentry:

kumāra</A>

<A CLASS="defr";

HREF="http:XX/dicdecl?query=kumaara &

gender=Mas">m.</A>

garçon, jeune homme; fils

| prince; page; cavalier

| myth. np. de <A CLASS="defn";

NAME="*kum=ara">Kumāra</A>

«Prince», épith. de

<A HREF="s.html#*skanda">Skanda</A> —

<A CLASS="defr";

HREF="http:XX/dicdecl?query=kumaara &

gender=Neu">n.</A>

or pur —

<A CLASS="defr";

HREF="http:XX/dicdecl?query=kumaarii &

gender=Fem">f.</A>

kumārī</A>

adolescente, jeune fille, vierge.

The string XX above stands for the URL of the

CGI binary directory of the intended HTTP server,

and is installation dependent. Note how invocations

ofthe grammaticalenginedicdecl areprecomputed

asHTMLreferencestriggeredbyclickingonthegen-

derspecications. Thusallpossibledeclensionsofthe

stemsassociatedwithanentryarejustoneclickaway

fromthedictionarynavigator.

6 The compiling chains

Theoverallarchitectureofoursoftwareplatformis

acollectionof processes, called as parametersto the

genericGrindfunctor,inordertomakesomecompu-

tationduringtherecursivetraversalofthevariousen-

tries,computedontheywhileparsingthedatabase;

global information is kept in tries structure, which

maintainamaximally sharedrepresentation. It is to

beremarkedthattheprogressiveconstructionofthese

tries iscompletely applicative, beingimplemented as

(7)

E

sourceofthepaperdictionary,availableinPSorPDF

format, and a consistentWeb site which implements

a sort of hypertext encyclopedia, with various exe-

cutable families of links, permitting easy navigation

to quotations, to etymologicalparents, to synonyms

andantonyms,etc,andalsospecicgrammaticaltab-

ulations such as declensiontables. We nally stress

thatthesetwo`prettyprinters',whichdeliversuchdif-

ferentnalproducts,areactuallytwoinstanciationsof

a generic pretty-printing functor with backend mod-

ulesgivingtherepresentationsinrespectivelyT

E Xand

HTML of lowleveltypographicalprimitives(such as

blank, line, printident, etc). We thus fully use the

moduleconstructionsofObjectiveCaml.

Itistobenotedthatatsomepointtheencodingof

Frenchsentences,whichoriginallywasjust anencap-

sulationofISO-LATIN1strings,hadtobedroppedin

favorofUnicodeUTF-8encodings,in orderto repre-

sent diacritics, for which weused a fontdesigned by

Chris Fynn[4]. But the corresponding modication

consistedinjustpipingthe8-bitsstringsintoanISO-

LATIN1toUTF-8transducer,amatterof15minutes

programmingwithourgenerictransductionfunction.

Thus we had the best of both worlds - Unicode en-

coding of the Web pages, allowing arbitrary fontsin

thedictionaryWebsite,whilepreservingISO-LATIN

acquisition with AZERTY (french) keyboards in the

databasesource.

7 Implementation notes

7.1 ML as a software engineering

workhorse

We believe that Objective Caml, together with

thesyntaxfacilitiesofitscompanionmacro-processor

Camlp4, is ideally suited to such processing. Its

stream library is extremely eÆcient, and the whole

1.7Mbofsourcedatabaseiscompletelyprocessedinto

saythewebsiteinlessthan1mnona500MHzPC.We

recall that ObjectiveCaml is available bothin byte-

codeversion,foreasydebugging,andinoptimisedna-

tivecodecompetitivewith Cprogramming. Theker-

nelofObjectiveCamlbeingitselfwritten inANSIC,

itis availableonallcurrentplatforms.

OneofthemainstrengthsofOCamlistheversatil-

ityof itsprogrammingparadigms: one mayprogram

in a purely applicative functional style, or one may

mixapplicativeandimperativefeatures. Forinstance,

wemadeanextensiveuseofmutabledata structures

tivewayasfunctionalzippers[5].

OCaml may indeed be considered as the modern

successor of both LISP-Scheme and Pascal-Modula.

Itbenetsfrom thesafetyofstrongstatictyping,but

polymorphism together with the synthesis of princi-

pal types by the compiler relieves the programmer

fromcumbersomeexplicittypedeclarations. Itsmod-

ule system is aspecially powerful abstraction mech-

anism for the design of complex systems. It has an

elegantobjectmodel,wellintegratedwiththerest of

the language, although we did not use objectorien-

tation in our application. It ought to be remarked

thatthemechanismsavailableforobjectorientedpro-

grammingdonotincuranypenaltyifonedoesnotuse

them.

Thetotalcodewrittensofarforourplatformcom-

prises45modules, totalingabout10000linesofcode.

Thiswholesystemwasdevelopedsolelybytheauthor

in about 6months asapart-time activity. Thepro-

grams were developed directly with the native code

compiler,withonlyoccasionaluseofbytecodeexecu-

tionand virtuallyno debugging: thewhole develop-

mentcycle consistediniteratingsourceedition,com-

pilation until syntax errors and type inconsistencies

were xed, then native code execution. The depen-

dencyanalysistoolof OCaml,coupledwiththeUnix

makecommand,provedtobeinvaluable.

The onlyother software packagesused forthe de-

velopment of our platform are the Emacs text ed-

itor (with its OCaml mode Tuareg) and the T

E X-

Metafont-L A

T

E X2

"

text formatter (with the devnag

preprocessor[12] fordisplayingdevanagartextunder

theBabelpackage).Thisdemandsanappropriatefont

serverfortheTrueTypefontformat,adiÆcultyforold

versionsofXwindowsonUNIX/Linux platforms.

7.2 Web and other standards

For our Web tools we used the Web standard

HTML 4.0[13] with Cascading Style Sheets in order

toseparatetheformatinginstructionsfromthestruc-

turing constructs. We used UTF-8 encodings corre-

spondingtotheUnicode[23]standard2.1,fordisplay-

ingdiacriticsintheromanindic font[4].

A general remark concerning the W3C standards

isthattheyareevolvingtoofast,andthat interoper-

abilityacrossplatformsisfarfromsatisfactory. Since

alarge percentageof the Web users use browsers in

theoriginalversionwhichwasinstalledatthetimeof

purchase of their platform, they will not be able to

correctlyinterpretmorerecentversionsofthevarious

(8)

features(theso-calleddeprecatedoptionsinW3Cjar-

gon)whicheverybrowserwillaccept,andnewerfunc-

tionalitieswhichmaybeunknowntoolderplatforms.

EvenmatureHTMLfeaturesmayhavequitedierent

look-and-feelundersayNetscapeNavigatorandInter-

netExplorerbrowsersintheirvariousversions.Inthis

compromise we refrained from using ad-hoc browser

scripting languages such as Javascript, mobile code

programminglanguagessuchasJava,andad-hocfea-

turessuchascookies,andwelimitedourselvesinthe

dynamic capabilities of our Web interface to remote

procedurecallsgeneratingHTMLpages(theso-called

CGI-bins).

A related remark concerns XML[24], which is

rapidly imposing itself as the standard data inter-

changeformat,especiallyintheareaoflinguisticand

philology resources such as digital dictionaries and

corpuses[22]. Thismeans that XML versions of digi-

tal lexicons such as our sanskrit dictionay will have

to be provided in the future in order to be usable

in moregenericlinguistic platforms. Suchdocuments

will becompliantwithpreciseDTDs explainingtheir

logical structure, which will be themselves instances

of generic standard formats. However, we are not

convinced at all that the production of these docu-

ments, as well as computations performed on them,

will need to use allthe layersof resourcedescription

and other frameworks which are weekly churned out

by theW3 Consortiumandthe various vendors. We

rather believe that general programming languages

such as OCamlare a moresolid, moreuniform, and

moreeÆcientalternativetosuchspecicsoftwareen-

gineeringtools.

8 Conclusion

Reverseengineeringisoftenreferredtoin thecon-

text oftransferringsomelegacysystemtoanewpro-

gramminglanguage,systemorplatform,orofporting

by hook or crook some obsolete database to an up-

to-date format. Here we show an application where

thereverseengineeringoperationleadstoatranslator

(actuallyafamilyoftranslatorsandprocessingtools)

which allows the development of a naturallanguage

processing platform centered on a lexical database

whose format is only slightly modied. This means

that the development of this platform may process

asynchronously from the subsequent updates to the

database, whose development may continue, without

changing too much its input conventions. Further-

ture is progressively extracted from the data. For

instance, comments in the original source may pro-

gressively ndtheir way in the structure, ascitation

indexesforinstance. This ispossiblebecause,in our

approach, even thoughall computations are doneon

theabstractsyntaxtreerepresentation,westillretain

theoriginal concrete syntax text le asthedatabase

source.

Thereisageneralneedforsuchadaptorsintheeld

ofcomputational linguistics, where resourcessuch as

lexiconsanddigitalisedcorpusesarelong-terminvest-

mentsoftendevelopedbyindependentteams. Rather

than attempting a complete standardization of their

formats,anunendingquestanywayinaworldoffast-

changingtechnologies, our approach permits parallel

developmentswithoutchangingtheeditingtoolsand

acquisitionprocesses. Itisofupmostimportancehow-

ever that these translation processes be clean, well

documentedprogramswhich maybemaintainedand

adapted over long evolution cycles. We have shown

that a general-purpose language such as Objective

Caml, with its extensive meta-syntactic features, its

powerfultype discipline and module system, and its

eÆcientandportablebackend,isanidealtoolforsuch

engineeringtasks. Weseelittlejusticationindeedin

theuseoflessgeneraland poorlyeÆcientcomputing

paradigms,suchastheso-called\scriptinglanguages".

References

[1] Guy Cousineau and Michel Mauny. \The Func-

tional Approach to Programming". Cambridge

UniversityPress,1998.

[2] Daniel de Rauglaudre and Michel Mauny.

\Parsers in ML". Proceedings of Conference on

LISP and Functional Programming (1992), pp.

76{85.

[3] Christiane Fellbaum, Ed. \Wordnet, an Elec-

tronic Lexical Database." MIT Press, 1998. See

also: http://www.cogsci.princeton.edu/~wn.

[4] Christopher J. Fynn. \How to use Diacritics

for Romanised Indic Text on the WWW". See:

http://www.user.dicon.co.uk/~cfynn/

sanskrit.htm.

[5] Gerard Huet. \The Zipper". J. Functional Pro-

(9)

Available fromtheSanskritWebSite:

http://pauillac.inria.fr/~huet/SKT/.

[7] Gerard Huet. \Structure of a Sanskrit Dictio-

nary."INRIAResearchReport toappear, 2001.

[8] Gerard Huet. \Computational Linguistics for

Sanskrit: a Software Engineering Approach".

Toappear, anniversaryvolume in honorof Rod

Burstall,2001.

[9] Donald Knuth. The T

E

Xbook. Addison-Wesley,

1984.

[10] LeslieLamport.L A

T

E

X-ADocumentPreparation

System.Addison-Wesley,1986.

[11] Leslie Lamport et al. L A

T

E X2

"

- The macro

package for T

E

X. Edition 1.6, Dec. 1994.

Available from CTAN archive sites, such as

ftp://ftp.tex.ac.uk/.

[12] AnshumanPandey. Devanagar forT

E

X, version

1.6, Aug. 1998. Available from CTAN archive

sites,suchasftp://ftp.tex.ac.uk/.

[13] Dave Raggett, ArnaudLe Horsand Ian Jacobs,

Eds.HTML 4.0 Specication. W3CRecommen-

dation,April24,1998.

http://www.w3.org/TR/REC-html40.

[14] PierreWeisand XavierLeroy.Lelangage Caml.

2emeedition,Dunod, Paris,1999.

[15] AcrobatReader,anAdobeproduct,freelydown-

loadablefrom: http://www.adobe.com/

products/acrobat/readstep.html.

[16] TheCamlp4preprocessor.Available from:

http://caml.inria.fr/camlp4/.

[17] TheCommonGatewayInterface.See:

http://www.w3.org/CGI/Overview.html.

[18] Ghostscript.See: http://www.cs.wisc.edu/

~ghost/index.html.

[19] ObjectiveCaml.Availablefrom:

http://caml.inria.fr/ocaml/index.html.

[20] PDF, the portable document format, an Adobe

product.See: http://www.adobe.com/

products/acrobat/adobepdf.html.

[21] Postscript,anAdobeproduct.See: http://

http://www.uic.edu/orgs/tei/.

[23] TheUnicodestandard2.1. See:

http://charts.unicode.org/.

[24] The XML norm, version 1.1, W3C consortium.

See: http://www.w3c.org/documents/XML/.