Using the TEI framework as a possible serialization for LMF

HAL Id: inria-00511769

https://hal.inria.fr/inria-00511769

Submitted on 26 Aug 2010

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Using the TEI framework as a possible serialization for LMF

Laurent Romary

To cite this version:

Laurent Romary. Using the TEI framework as a possible serialization for LMF. Rendering endangered languages lexicons interoperable through standards harmonization., Aug 2010, Nijmegen, Netherlands.

⟨inria-00511769⟩

Using the TEI framework as a possible serialization for LMF

Laurent Romary, INRIA & HUB-IDSL

RELISH workshop, Aug 4-5, 2010; Nijmegen

Executive summary

• Issue: identifying an “appropriate” serialization for LMF

Serialization: mapping a lexical model onto a concrete (computer) representation (field based, XML, etc.)

Wide consensus, maintenance, flexibility, cohesion with other standardization activities

• The TEI is an ideal basis for defining standardized XML formats for lexical data

TEI as an infrastructure

Customization facilities: ODD, classes, pointing mechanisms, etc.

TEI as a reference vocabulary

Print dictionary chapter (PD)

TEI as an application of LMF

• Workplan proposal

Defining the ideal LMF compliant subset of the TEI PD chapter

Suggesting extensions to the PD chapter

• Convergence…

Contribution to making ISO and TEI work closer together

The TEI at a glance

• Started in 1987

• Organized as a consortium: 5 hosts, board, council

• Edition P5 of the guidelines: more than 500 elements covering various text genres and structures

– Genericity: header, text structure, pointing mechanisms, paragraph level elements, surface entities

– Precise and flexible documentation

– Maintenance: 2 releases per year

• Wide community of users: default format for most text-based projects worldwide

– Cf. papers from Przepiorkowski, or Erjavec at LREC

• And yes, it is XML based… e.g. preconfigured in Oxygen

Intermezzo — an XML tutorial

• XML is about awful angle brackets (serialization)

<gramGrp>

<gen>f</gen>

<num>p</num>

</gramGrp>

• XML is about beautiful trees (model)

• Issues

– Specifying structures

– Providing semantics
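The two views on this slide can be made concrete with a short Python sketch. It assumes nothing beyond the standard-library module xml.etree.ElementTree, and the helper name describe is invented for this illustration: the <gramGrp> fragment is first held as a string of angle brackets, then parsed and walked as a tree.

```python
import xml.etree.ElementTree as ET

# The serialization: the <gramGrp> fragment from the slide, as a plain string.
serialized = "<gramGrp><gen>f</gen><num>p</num></gramGrp>"

# The model: parsing turns the string into a tree of element nodes.
root = ET.fromstring(serialized)

def describe(node, depth=0):
    """Return one indented line per element, showing its tag and text."""
    text = (node.text or "").strip()
    lines = ["  " * depth + node.tag + (": " + text if text else "")]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines

tree_lines = describe(root)
print("\n".join(tree_lines))
```

Neither view says which elements are allowed where, or what <gen> means: those are the two issues listed above.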

Basic concepts of the TEI technical platforms

A specification language: ODD (One Document Does it all)

– Literate programming (Knuth)

– Generation of both schemas and documentation

DTD, RelaxNG, W3C schemas

HTML, pdf, ePub, docx

– Provides extended customization facilities

– <equiv>

Natural link with ISOCat

Modules

– Each schema specification is a combination of internal or external modules

E.g. ISO-TEI Feature-Structure module

Classes (shared behaviours or semantics)

– Model classes

– Attribute classes

From ODD to documentation

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-gen.html

TEI and “dictionaries”

• The TEI Print Dictionary (PD) chapter

– Initially designed by N. Ide and J. Veronis

– Accounts for both presentational and editorial (“content”) issues

• Cf. <entry>, <entryFree>, … and <dictScrap>

– Based on a hierarchical abstract model (crystals)

• <form>: for characterising the orthographic or phonetic form of the word

<orth>, <pron>, etc.

• <gramGrp>: grammatical features

May characterize an entry, a specific form or a specific sense

<pos>, <gen>, generic <gram> feature

• <sense>: iterative and recursive

May contain definitions, examples, etymological information, translations, etc.

• Main characteristic (drawback?): +very+ flexible

Examples

<entry>

<form>一乘顯性教</form>

<sense>One of the five divisions made by 圭峰 Guifeng of the Huayan 華嚴 or Avataṃsaka School; v. 五教.</sense>

</entry>

Source: http://buddhistinformatics.ddbc.edu.tw/glossaries/; thanks to Marcus Bingenheimer

<entry>

<form>眾生不可思議</form>

<sense>

<usg type="dom">術語</usg>

<def>四事不可思議之一。見不可思議條。</def>

<xr>不可思議</xr>

</sense>

</entry>

Examples – cont.

<entry>

<form type="lemma">

<orth>chat</orth>

</form>

<gramGrp>

<pos>noun</pos>

<gen>masculine</gen>

</gramGrp>

<form type="inflected">

<orth>chat</orth>

<gramGrp>

<number>singular</number>

</gramGrp>

</form>

<form type="inflected">

<orth>chats</orth>

<gramGrp>

<number>plural</number>

</gramGrp>

</form>

</entry>

Customizing an entry

<entry>

<form>

<orth>table</orth>

</form>

<gramGrp>

<pos>n.</pos>

<gen>f.</gen>

</gramGrp>

<def>Pièce de mobilier…</def>

<cit>

<quote>Une table de cuisine</quote>

</cit>

</entry>

Constraining content: f., f, fem, feminin, feminine, … should all reduce to one agreed value
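One way to picture what constraining content means here: the competing spellings of the gender value are mapped to a single agreed value. The Python sketch below is an illustration only. The table GENDER_VARIANTS, the target value "feminine" and the function normalize_gender are assumptions made for this example, not part of the TEI or of LMF; in an actual TEI customization the constraint would more naturally be stated in the ODD specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical normalization table: the variants come from the slide,
# the target value "feminine" is an assumption made for this sketch.
GENDER_VARIANTS = {
    "f.": "feminine",
    "f": "feminine",
    "fem": "feminine",
    "feminin": "feminine",
    "feminine": "feminine",
}

def normalize_gender(entry):
    """Rewrite every <gen> value in place; return the values left unrecognized."""
    unknown = []
    for gen in entry.iter("gen"):
        value = (gen.text or "").strip().lower()
        if value in GENDER_VARIANTS:
            gen.text = GENDER_VARIANTS[value]
        else:
            unknown.append(value)
    return unknown

entry = ET.fromstring(
    "<entry><form><orth>table</orth></form>"
    "<gramGrp><pos>n.</pos><gen>f.</gen></gramGrp></entry>"
)
unknown = normalize_gender(entry)
print(entry.find("gramGrp/gen").text, unknown)
```

Values missing from the table are reported rather than silently rewritten, so that the list of accepted values stays an explicit editorial decision.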

Illustrating classes: tei.gramInfo

• Grammatical information in a dictionary entry

– E.g.:

<entry>

<form>

<orth>luire</orth>

</form>

<gramGrp>

<pos>verb</pos>

<subc>intransitive</subc>

</gramGrp>

</entry>

– Rather homogeneous set of elements

• <pos>, <gen>, <number>, <case>, etc.

– May also appear in <form>

Overall picture

[Figure: the class tei.gramInfo shown between the element <pos> and the element <gramGrp>.]

Declaring the class: tei.gramInfo

<classSpec xmlns="http://www.tei-c.org/ns/1.0" module="dictionaries-decl"

id="GRAMINFO" type="model" ident="tei.gramInfo">

<gloss>grammatical information</gloss>

<desc>groups those elements allowed within a <gi>gramGrp</gi> element in a dictionary.</desc>

</classSpec>

<pos> belongs to tei.gramInfo

<elementSpec module="dictionaries" id="POS" ident="pos">

<gloss>part of speech</gloss>

<desc>indicates the part of speech assigned to a dictionary headword (noun, verb, adjective, etc.)</desc>

<classes>

<memberOf key="tei.dictionaryParts"/>

<memberOf key="tei.gramInfo"/>

<memberOf key="tei.dictionaries"/>

</classes>

<content> … </content>

<exemplum> … </exemplum>

</elementSpec>

Content model for <gramGrp>

<elementSpec module="dictionaries" id="GRAMGRP" ident="gramGrp">

<gloss>grammatical information group</gloss>

<content>

<rng:zeroOrMore

xmlns:rng="http://relaxng.org/ns/structure/1.0">

<rng:choice>

<rng:text/>

<rng:ref name="tei.phrase"/>

<rng:ref name="tei.inter"/>

<rng:ref name="tei.gramInfo"/>

<rng:ref name="tei.Incl"/>

</rng:choice>

</rng:zeroOrMore>

</content>

</elementSpec>
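The effect of the class mechanism can be sketched outside ODD: a child element is acceptable inside <gramGrp> if it belongs to one of the classes that the content model references. The Python fragment below is a deliberately simplified illustration. The membership table covers only tei.gramInfo and the elements named on these slides, the inclusion of <subc> is an assumption, and the other referenced classes (tei.phrase, tei.inter, tei.Incl) as well as text content are left out.

```python
import xml.etree.ElementTree as ET

# Simplified membership table: element name -> classes it belongs to.
# Only tei.gramInfo is modelled; <subc> is included on the assumption that
# it belongs to the same class, since the slides use it inside <gramGrp>.
CLASS_MEMBERS = {
    "pos": {"tei.gramInfo"},
    "gen": {"tei.gramInfo"},
    "number": {"tei.gramInfo"},
    "case": {"tei.gramInfo"},
    "subc": {"tei.gramInfo"},
}

# Classes referenced by the <gramGrp> content model that this sketch checks.
GRAMGRP_ALLOWED_CLASSES = {"tei.gramInfo"}

def unexpected_children(gram_grp):
    """Return the tags of child elements that belong to no allowed class."""
    return [
        child.tag
        for child in gram_grp
        if not (CLASS_MEMBERS.get(child.tag, set()) & GRAMGRP_ALLOWED_CLASSES)
    ]

ok = ET.fromstring("<gramGrp><pos>verb</pos><subc>intransitive</subc></gramGrp>")
bad = ET.fromstring("<gramGrp><pos>noun</pos><orth>chat</orth></gramGrp>")
print(unexpected_children(ok), unexpected_children(bad))
```

Adding a new grammatical element then only requires declaring its class membership; the content model of <gramGrp> itself does not change.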

LMF at a glance

• LMF – Lexical Markup Framework

– ISO standard 24613 (published Oct. 2008)

– Edited within ISO committee TC 37/SC 4

• Technical content

– Focus on providing a core meta-model with extensions

– Potentially agnostic with regard to serialisation

Isomorphism => interoperability

– Default syntax to exemplify its possible use, room for improvement…

• Can the TEI be seen as a conformant implementation of LMF?

LMF architecture — playing Lego

[Figure: diagram of the LMF core model. A Lexical DB holds Global Info and Lexical Entry components; a Lexical Entry holds Form and Sense components. Lexical extensions plug into the core; the lexical extension for morphology shown here adds Lexical Entry Morphology, Paradigm and Flexion. The links carry cardinalities such as 1..1, 0..1 and 0..n.]

Example: designing a full-form lexicon

[Figure: meta-model for a full-form lexicon, built from the components Lexical DB, Global Info, Entry, Morphology, Paradigm and Inflexion, linked with cardinalities such as 1..1, 0..1 and 0..n.]

Decorating the model

[Figure: the full-form lexicon model (Lexical DB, Global Info, Entry, Morphology, Paradigm, Inflexion), with its components decorated by the data categories /lemma/, /part of speech/, /word form/, /gender/, /number/, /tense/ and /paradigmId/.]

Why is the TEI a good idea for serialising LMF?

• Basic structure already defined

• Provision of additional tags

Surface annotation (e.g. names, dates, abbreviations, alternatives)

Cf. <equiv>: equivalences to ISOCat when needed

• Integration of lexical data in a textual macro-structure

Creating an edited version of a lexicon

Grammar books, teaching material, scientific papers

• Interoperability with other lexical sources

Community of users: sharing a common culture of TEI tags rather than constantly worrying about mappings

Sharing tools: e.g. stylesheets, editors, etc. (cf. Roma)

Note: continuity between dictionary and lexical sources

A typical entry

<entry>

<form>

<orth>demigod</orth>

<pron>...</pron>

</form>

<gramGrp>

<pos>n</pos>

</gramGrp>

<sense n='1'>

<sense n='a'>

<def>a being who is part mortal, part god</def>

</sense>

<sense n='b'>

<def>a lesser deity</def>

</sense>

</sense>

<sense n='2'>

<def>a godlike person</def>

</sense>

</entry>
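Because <sense> is recursive, any tool that reads such an entry has to follow the nesting. The Python sketch below shows one way to do so with the standard-library xml.etree.ElementTree module. The function name flatten_senses and the dotted labels ("1.a") are choices made for this illustration, and the <pron> element is left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET

# The entry from the slide, without <pron>.
ENTRY = """
<entry>
  <form><orth>demigod</orth></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n="1">
    <sense n="a"><def>a being who is part mortal, part god</def></sense>
    <sense n="b"><def>a lesser deity</def></sense>
  </sense>
  <sense n="2"><def>a godlike person</def></sense>
</entry>
"""

def flatten_senses(node, prefix=""):
    """Yield (label, definition) pairs, following nested <sense> elements."""
    for sense in node.findall("sense"):
        label = prefix + sense.get("n", "?")
        definition = sense.find("def")
        if definition is not None:
            yield label, definition.text
        # Recurse: a <sense> may itself contain further <sense> elements.
        yield from flatten_senses(sense, label + ".")

senses = list(flatten_senses(ET.fromstring(ENTRY)))
for label, text in senses:
    print(label, text)
```

Sense 1 carries no definition of its own, so only its sub-senses and sense 2 are reported.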

Identifying the meta-model components

[Figure: the LMF meta-model components (Lexical DB, Global Info, Lexical entry, Morphology, Form, Sense) matched against the TEI elements <entry>, <sense>, <gramGrp> and <form>.]

Data categories

[Figure: the same meta-model (Lexical DB, Global Info, Lexical entry, Morphology, Form, Sense), with data categories and their TEI elements attached to the components:]

/orthography/ (<orth>), /pronunciation/ (<pron>), /hyphenization/ (<hyph>), /syllabification/ (<syll>), /stress pattern/ (<stress>)

/part of speech/ (<pos>), /inflexional class/ (<itype>), /gender/ (<gen>), /number/ (<number>), /case/ (<case>), /person/ (<per>), /tense/ (<tns>), /mood/ (<mood>)

/definition/ (<def>), /example/ (<eg>), /usage/ (<usg>), /etymology/ (<etym>)
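The pairing of data categories with TEI elements on this slide can be held in a simple lookup table. The Python sketch below copies the pairs from the slide and uses them to read the data categories off a TEI entry. The function name data_categories and the sample entry (taken from the earlier "chat" example) are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

# Data category -> TEI element name, copied from the slide.
DATA_CATEGORIES = {
    "orthography": "orth",
    "pronunciation": "pron",
    "hyphenization": "hyph",
    "syllabification": "syll",
    "stress pattern": "stress",
    "part of speech": "pos",
    "inflexional class": "itype",
    "gender": "gen",
    "number": "number",
    "case": "case",
    "person": "per",
    "tense": "tns",
    "mood": "mood",
    "definition": "def",
    "example": "eg",
    "usage": "usg",
    "etymology": "etym",
}

def data_categories(entry):
    """Return (category, value) pairs for every mapped element, in document order."""
    by_tag = {tag: category for category, tag in DATA_CATEGORIES.items()}
    return [
        (by_tag[element.tag], (element.text or "").strip())
        for element in entry.iter()
        if element.tag in by_tag
    ]

entry = ET.fromstring(
    "<entry><form><orth>chat</orth></form>"
    "<gramGrp><pos>noun</pos><gen>masculine</gen></gramGrp></entry>"
)
pairs = data_categories(entry)
print(pairs)
```

Structural elements such as <entry>, <form> and <gramGrp> are skipped: they correspond to meta-model components, not to data categories.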

Customizing the TEI lexical model

• Constraining the TEI model

– Sub-setting the TEI default dictionary module

– Providing additional rules (XSLT, Schematron)

– Constraining possible values

• Complementing the TEI model

– Defining additional data categories

– Defining missing LMF extensions as TEI components

Make use of the class mechanisms

A natural implementation of the LMF extension mechanisms

Towards a joint ISO-TEI activity

Contributing to convergence, with a pragmatic perspective

Benefiting from advantages of both sides

– TEI reactivity and community support

– ISO stability and international validation

LMF serialization seen as

– A subset of the TEI when equivalent constructs exist

– An extension of the TEI for missing constructs (e.g. syntax)

Some concrete work on the table…

– Aside activities: specifying LL-LIF in ODD

Reference

– L. Romary, “Standardization of the formal representation of lexical information for NLP”

– http://hal.archives-ouvertes.fr/hal-00436328/fr/
