(1)

Design of an Electronic Sanskrit Reader

SALA XXI, Konstanz, October 2001

Gérard Huet

INRIA

(2)

History

• 1994 - personal lexicon in TeX

• 1996 - available on Internet

• 1998 - 10000 entries - invariants design

• 1999 - reverse engineering

• 2000 - Hypertext version on the Web, sandhi processor, grammatical engine

(3)
(4)
(5)
(6)

Processing chains

[Diagram. Labels: Data base source; Abstract structure; devnag, tex, dvi, ps, pdf; Index a ... h; Web site (html); CGI bin (OCaml); Internet]

(7)

Close-up view of functionalities

[Diagram. Labels: Dico DB; Grind; Entries; Flex; Flexed; Index a ... h; Index Engine; Grammar Engine; CGI; Dico.ps; Dico.pdf; Web site (html)]

(8)

Tools used

Printable document

• Knuth TeX, Metafont, LaTeX2e

• Velthuis devnag font & ligature comp

• Adobe PostScript, PDF, Acrobat

Hypertext document

• W3C HTTP, HTML, CSS

• Unicode UTF-8

• Chris Fynn Indic Times Font

Processing & Search

• INRIA Objective Caml

(9)

Each entry is a (typed) tree

Entry

syntax usage opt cogs

Type entry = [ | Crossref ]

....

N.B. Syntax is really morphology; usage is part-of-speech roles plus meanings.

(10)

Grammatical information

type gender = [ Mas | Neu | Fem | Any ];

type number = [ Singular | Dual | Plural ];

type case = [ Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc ];
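The declarations above appear to use the bracketed revised syntax. As a minimal sketch, here are the same three types in standard OCaml syntax, together with an enumeration of every (number, case) pair. The names `numbers`, `cases`, and `slots` are assumptions for illustration and do not come from the lexicon's source.

```ocaml
(* The three types from the slide, in standard OCaml syntax. *)
type gender = Mas | Neu | Fem | Any
type number = Singular | Dual | Plural
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

(* Hypothetical helper lists, in the order the constructors are declared. *)
let numbers = [ Singular; Dual; Plural ]
let cases = [ Nom; Acc; Ins; Dat; Abl; Gen; Loc; Voc ]

(* Every (number, case) pair: one slot per ending of a declension table. *)
let slots : (number * case) list =
  List.concat (List.map (fun n -> List.map (fun c -> (n, c)) cases) numbers)

let () = Printf.printf "%d slots\n" (List.length slots)
```

Three numbers times eight cases give 24 slots per gender.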

(11)

The verb system

type voice = [ Active | Reflexive ]

and mode = [ Indicative | Imperative | Causative | Intensive | Desiderative ]

and tense = [ Present of mode | Perfect | Imperfect | Aorist | Future ]

and nominal = [ Pp | Ppr of voice | Ppft | Ger | Infi | Peri ]

and verbal = [ Conjug of (tense * voice)

| Passive | Absolutive | Conditional | Precative | Optative of voice

| Nominal of nominal

| Derived of (verbal * verbal) ];
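Because verbal forms are typed trees, one can compute on them by pattern matching. The sketch below restates the types in standard OCaml syntax and collects the voices mentioned anywhere in a `verbal` value. The function `voices` is a hypothetical example, and no claim is made here about what `Derived` means linguistically.

```ocaml
type voice = Active | Reflexive
type mode = Indicative | Imperative | Causative | Intensive | Desiderative
type tense = Present of mode | Perfect | Imperfect | Aorist | Future
type nominal = Pp | Ppr of voice | Ppft | Ger | Infi | Peri
type verbal =
  | Conjug of (tense * voice)
  | Passive | Absolutive | Conditional | Precative
  | Optative of voice
  | Nominal of nominal
  | Derived of (verbal * verbal)

(* Hypothetical traversal: list the voices carried by a verbal form,
   left to right, descending into both parts of a Derived form. *)
let rec voices = function
  | Conjug (_, v) | Optative v | Nominal (Ppr v) -> [ v ]
  | Derived (a, b) -> voices a @ voices b
  | Passive | Absolutive | Conditional | Precative | Nominal _ -> []

let example = Derived (Conjug (Present Indicative, Active), Optative Reflexive)
```

The compiler checks that the match covers every constructor, so adding a new verbal form forces each such function to be revisited.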

(12)

Governance templates (Grammatical valence)

\word{ga.n} ... \sem{imputer qqc. <acc.> à qqn. <loc.>}

\ca{chandayati} … \sem{gratifier qqn. <acc.> de <i.>}

\word{niyuj} ... \sem{confier qqc. <acc.> à qqn. <loc.>}

\root{krii} ... \sem{acheter (qqc. <acc.> à qqn. <g. abl.>)}

(13)

Key points

• Each entry is a structured piece of data on which one may compute

• Consistency and completeness checks:

  - every reference is well defined once, there is no dangling reference

  - etymological origins, when known, are systematically listed

  - lexicographic ordering at every level is mechanically enforced

• Specialised views are easily extracted

• Search engines are easily programmable

• Maintenance and transfer to new technologies are ensured

• Independence from input format, diacritics conventions, etc.

• The technology scales to much larger corpora
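The consistency checks listed above can be sketched on a toy lexicon. Everything here is an assumption for illustration: the simplified `entry` record, the sample entries and their references, and the use of OCaml's built-in string order in place of the Sanskrit lexicographic order.

```ocaml
(* Hypothetical, much simplified entry: a headword and the headwords it cites. *)
type entry = { head : string; refs : string list }

let heads lexicon = List.map (fun e -> e.head) lexicon

(* Headwords defined more than once. *)
let duplicates lexicon =
  let rec go = function
    | a :: (b :: _ as rest) -> if a = b then a :: go rest else go rest
    | _ -> []
  in
  List.sort_uniq compare (go (List.sort compare (heads lexicon)))

(* References that point to no entry. *)
let dangling lexicon =
  let hs = heads lexicon in
  List.concat
    (List.map (fun e -> List.filter (fun r -> not (List.mem r hs)) e.refs) lexicon)

(* Entries must appear in strictly increasing order (byte order here). *)
let rec ordered = function
  | a :: (b :: _ as rest) -> compare a.head b.head < 0 && ordered rest
  | _ -> true

(* Toy data: the reference to "yuj" is deliberately left undefined. *)
let toy =
  [ { head = "gopa"; refs = [] };
    { head = "kusuma"; refs = [ "gopa" ] };
    { head = "niyuj"; refs = [ "yuj" ] } ]
```

On `toy`, `dangling` reports the missing `yuj`, while `duplicates` is empty and `ordered` holds.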

(14)

Generic reuse of the technology

The structure of the dictionary separates three layers as much as possible:

• Sanskrit

• French

• generic dictionary structure

Thus the French meanings, at the leaves, could be replaced by, e.g., English definitions or glosses.
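One way to picture this separation is an entry type that is parametric in the type of its glosses. This is a sketch under assumptions: the record `entry`, the function `map_glosses`, and the one-word translation table are hypothetical, and the real entry structure is far richer.

```ocaml
(* Hypothetical simplification: a stem plus glosses of an arbitrary type 'g. *)
type 'g entry = { stem : string; glosses : 'g list }

(* Replace the glosses at the leaves; the Sanskrit layer is untouched. *)
let map_glosses f e = { stem = e.stem; glosses = List.map f e.glosses }

let french = { stem = "krii"; glosses = [ "acheter" ] }

(* Toy translation table; unknown glosses are passed through unchanged. *)
let translate = function "acheter" -> "to buy" | g -> g

let english = map_glosses translate french
```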

(15)

Morphological analysis, sandhi

• Sanskrit is pronounced as written

• … and thus is written as pronounced

• Phonetic alliteration is rendered by morphological junction (sandhi)

• The sentence is formed of words joined by external sandhi

• Compound words are also formed by external sandhi

• Whereas flexion, prefixing and suffixing use internal sandhi

• External sandhi is local, internal sandhi is less so

• Sandhi analysis is non-deterministic and sometimes involves semantics

(16)

Grammatical engine

• In Sanskrit, declension is determined by stem and gender

• Sanskrit is very regular, since the classical language was frozen by Pânini (4th century BC) who invented context-free notation

• But it spans about 35 centuries, and thus there are many exceptions

• Substantive (adjectives, pronouns, numerals) declension may be arranged in 84 tables of 24 endings (3 numbers * 8 cases)

• Then internal sandhi is applied to a stem and an ending

• Two applications:

  - online declension of words given with gender (cgi-bin)
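A minimal sketch of table-driven declension, restricted to the singular of a masculine stem in -a. The endings are the standard paradigm in Velthuis transliteration, not data taken from the lexicon, and `join` is a deliberately naive stand-in for internal sandhi. All names are assumptions.

```ocaml
type case = Nom | Acc | Ins | Dat | Abl | Gen | Loc | Voc

(* Singular endings of a masculine -a stem, already merged with the stem vowel. *)
let a_stem_masc_singular = function
  | Nom -> "as" | Acc -> "am" | Ins -> "ena" | Dat -> "aaya"
  | Abl -> "aat" | Gen -> "asya" | Loc -> "e" | Voc -> "a"

(* Naive stand-in for internal sandhi: drop the final "a", append the ending. *)
let join stem ending =
  let n = String.length stem in
  if n > 0 && stem.[n - 1] = 'a' then String.sub stem 0 (n - 1) ^ ending
  else invalid_arg "join: expected a stem in -a"

(* One row of the declension table: each case paired with its form. *)
let decline_singular stem =
  List.map
    (fun c -> (c, join stem (a_stem_masc_singular c)))
    [ Nom; Acc; Ins; Dat; Abl; Gen; Loc; Voc ]
```

For example, `List.assoc Nom (decline_singular "k.r.s.na")` yields `k.r.s.nas`. A real engine would replace `join` with the internal sandhi rules, which this stub ignores.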

(17)

Interactions lexicon-grammar

• The index engine, when given a string which is not a stem defined in one of the entries of the lexicon, attempts to find it within the flexed forms persistent database, and if found there will propose the corresponding lexicon entry or entries

• From within the lexicon, the grammatical engine may be called online as a CGI that lists the declensions of a given stem. It is directly accessible from the gender declarations, thanks to an important scoping invariant:

  - every substantive stem is within the scope of one or more genders

  - every gender declaration is within the scope of a unique substantive stem
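The fallback behaviour of the index engine can be sketched as a two-step lookup. The association lists below stand in for the lexicon index and the persistent flexed-forms database; their contents and the name `lookup` are assumptions for illustration.

```ocaml
(* Toy data: stems that have an entry, and flexed forms mapped to their stems. *)
let stems = [ "kusuma"; "gopa"; "k.r.s.na" ]

let flexed =
  [ ("kusumam", [ "kusuma" ]);
    ("gopiibhyas", [ "gopa" ]);
    ("k.r.s.nas", [ "k.r.s.na" ]) ]

(* Try the string as a stem first, then fall back to the flexed forms;
   the result is the list of entries to propose (possibly empty). *)
let lookup s =
  if List.mem s stems then [ s ]
  else match List.assoc_opt s flexed with Some entries -> entries | None -> []
```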

(18)

Inverting external sandhi

• External sandhi rules are of a finite-state nature

• The flexed forms lexicon index may be seen as the graph of a deterministic finite automaton recognizing its words

• This tree may be uniformly decorated by relevant sandhi rules seen as non-deterministic choice points

• This structure may be evaluated as a finite-state transducer graph
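To make the non-determinism concrete, here is a small sketch that inverts external sandhi by naive backtracking over a word list. It is not the transducer construction described above, only a simpler search that returns the same kind of answer. The rule format `left|right -> result`, the sample rules and words, and all names are assumptions for illustration.

```ocaml
(* A rule u|v -> w: final u of a word followed by initial v surfaces as w. *)
type rule = { left : string; right : string; result : string }

let rules =
  [ { left = "m"; right = "n"; result = ".mn" };
    { left = "s"; right = "\"s"; result = ".h\"s" };
    { left = "m"; right = "g"; result = ".mg" };
    { left = "s"; right = "k"; result = ".hk" };
    { left = "as"; right = "d"; result = "od" } ]

(* Toy flexed-forms list, in Velthuis transliteration. *)
let lexicon =
  [ "om"; "namas"; "\"sivaaya"; "kusumam"; "gopiibhyas"; "k.r.s.nas"; "dadati" ]

let starts_with ~prefix s =
  let lp = String.length prefix in
  String.length s >= lp && String.sub s 0 lp = prefix

let ends_with ~suffix s =
  let ls = String.length suffix and l = String.length s in
  l >= ls && String.sub s (l - ls) ls = suffix

(* All segmentations of a chunk, each a list of (word, rule applied after it). *)
let rec segment chunk =
  let final = if List.mem chunk lexicon then [ [ (chunk, None) ] ] else [] in
  let try_word word r =
    if not (ends_with ~suffix:r.left word) then []
    else
      let stem = String.sub word 0 (String.length word - String.length r.left) in
      let surface = stem ^ r.result in
      if not (starts_with ~prefix:surface chunk) then []
      else
        let n = String.length surface in
        (* Undo the rule: the next word starts with r.right. *)
        let rest = r.right ^ String.sub chunk n (String.length chunk - n) in
        if String.length rest >= String.length chunk then [] (* must shrink *)
        else List.map (fun tail -> (word, Some r) :: tail) (segment rest)
  in
  final
  @ List.concat (List.map (fun w -> List.concat (List.map (try_word w) rules)) lexicon)

(* Keep only the words of each segmentation. *)
let words chunk = List.map (List.map fst) (segment chunk)
```

The search returns every segmentation the rules allow, so a chunk with several readings yields several lists. Scanning the whole word list at each step is what the decorated automaton avoids.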

(19)

Examples of segmentation

Chunk: o.mnama.h"sivaaya may be segmented as:

• [ om with sandhi m|n -> .mn ]

• [ namas with sandhi s|"s -> .h"s ]

• [ "sivaaya with no sandhi ]

Chunk: kusuma.mgopiibhya.hk.r.s.nodadati may be segmented as:

• [ kusumam with sandhi m|g -> .mg]

• [ gopiibhyas with sandhi s|k -> .hk]

• [ k.r.s.nas with sandhi as|d -> od]

• [ dadati with no sandhi]

(20)

From segments to tagged lemmas

Chunk: kusuma.mgopiibhya.hk.r.s.nodadati may be lemmatized with tags as:

• [ kusumam < acc. sg. n. of kusuma | nom. sg. n. of kusuma

| voc. sg. n. of kusuma > with sandhi m|g -> .mg ]

• [ gopiibhyas < abl. pl. f. of gopa

| dat. pl. f. of gopa > with sandhi s|k -> .hk ]

• [ k.r.s.nas < nom. sg. m. of k.r.s.na > with sandhi as|d -> od ]

• [ dadati <…> with no sandhi]
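The tagging step can be sketched as a lookup from each segmented form to its possible analyses, printed in the notation used above. The table reproduces only the analyses shown on this slide. The record `tag`, the table `analyses`, and the function `lemmatize` are hypothetical names, and the analysis of dadati is left empty because the slide elides it.

```ocaml
(* One morphological analysis of a form; tags are kept as printed abbreviations. *)
type tag = { case : string; number : string; gender : string; stem : string }

let tag case number gender stem = { case; number; gender; stem }

(* Analyses copied from the slide; forms not listed have no known analysis. *)
let analyses =
  [ ("kusumam",
     [ tag "acc." "sg." "n." "kusuma";
       tag "nom." "sg." "n." "kusuma";
       tag "voc." "sg." "n." "kusuma" ]);
    ("gopiibhyas",
     [ tag "abl." "pl." "f." "gopa"; tag "dat." "pl." "f." "gopa" ]);
    ("k.r.s.nas", [ tag "nom." "sg." "m." "k.r.s.na" ]) ]

let show_tag t = Printf.sprintf "%s %s %s of %s" t.case t.number t.gender t.stem

(* Render a form with all its analyses, separated by "|". *)
let lemmatize form =
  match List.assoc_opt form analyses with
  | Some tags ->
      Printf.sprintf "%s < %s >" form (String.concat " | " (List.map show_tag tags))
  | None -> form ^ " < >"
```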

(21)

Future work

• Verb conjugation tables preparation - full flexed forms database

• Fixing sandhi analysis for bahuvrihi compounds

• Choice of taggings from concord and valency constraints

• Semantic guidance from ontology classification

• We shall then be able to index corpora semi-automatically, towards:

  - computer-aided concordance of corpus

  - computer-aided preparation of critical editions

  - statistical analysis of corpus (co-occurrence, style, etc.)

  - computer-aided accretion of lexicon

  - fully indexed citations

  - extraction of corpus-specific lexicons

  - diachrony control of lexical information
