• Aucun résultat trouvé

Wake up, standOff!

N/A
N/A
Protected

Academic year: 2021

Partager "Wake up, standOff!"

Copied!
21
0
0

Texte intégral

(1)

HAL Id: hal-01374102

https://hal.inria.fr/hal-01374102

Submitted on 29 Sep 2016

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Distributed under a Creative Commons Attribution| 4.0 International License

Wake up, standOff!

Piotr Banski, Bertrand Gaiffe, Patrice Lopez, Simon Meoni, Laurent Romary, Thomas Schmidt, Peter Stadler, Andreas Witt

To cite this version:

Piotr Banski, Bertrand Gaiffe, Patrice Lopez, Simon Meoni, Laurent Romary, et al.. Wake up, standOff!. TEI Conference 2016, Sep 2016, Vienna, Austria. �hal-01374102�

(2)

Wake up, standOff!

Piotr Banski, Bertrand Gaiffe, Patrice Lopez, Simon Meoni, Laurent Romary, Thomas Schmidt, Peter Stadler, Andreas Witt

(3)

Overview

• The way towards a <standOff> element in the

TEI architecture

– Relation to ISO 24624 Transcription of Spoken

Language

• Implementation issues

– Reflecting the open annotation model

– Open cans of worms (header, annotation body)

(4)

The simple picture

Inline annotation:

Intertwined with the source text

Stand off annotation:

Source text is referenced from outside

Embedded stand off annotation:

Stand off annotations attached to the same document as the source

(5)

Why embedded stand-off annotation?

• Each time the source document is seen as the

reference organisational unit

– Corpus management

– Transmission workflow

– Multiple annotation layers

– Competing annotations

(6)

Standoff: A long-standing issue

• The idea of standoff annotation is not new in general

– Thompson & McKelvie, 1997

• Standoff annotation has been a core concept in the TEI guidelines since

the beginning

– Cf. Chapter: Linking, Segmentation, and Alignment

– Availability of <anchor>, <span>, <interp>, <link>, @ana

• But: not integrated in the TEI architecture

– Stand-off elements can appear anywhere in a TEI document – Usual trade-off between on-site vs. grouping (<back>)

• The NLP community has also developed its own means

– GraF (Ide & Suderman 2007) , Paula (Zeldes et al. 2009), etc.

• Need for a proper, and inclusive, treatment of standoff annotations in the

TEI

(7)

Embedded standoff: Basic concept

• Building up an autonomous document containing primary source and

additional annotations

– Annotations are conveyed with their specific meta-data

– Annotations have their specific place in the TEI document architecture – Standoff annotations may be recursively organized

– Standoff annotations may point to textual as well as facsimile content – Well-defined elementary annotation units

– Coherence with existing models (Open Annotation, ISO TC 37) should be ensured

• Typical use-cases

– Annotated corpora

• Treebanks

– Text mining

• Named entity recognition, keyword/terms extraction

– Human annotations on a document

• critical editions, patent examination, peer review…

(8)

Timeline

• 2011: Paper by Thomas Schmidt in jTEI (https://jtei.revues.org/142)

• August 2012: new tickets by Javier Pose (EPO)

• January 2014: Workshop in Berlin

– Draft of a first proposal

– Setting-up a github environment

• 2012-2016: ISO 24624 project (Editor: Thomas Schmidt)

– Need for a annotation grouping component (<annotationBlock>)

• May 2015: Council meeting in Ann Arbor

– Several updates to the proposal

– Stabilisation of element names

• March 2016: TEI release 6.0.0

– New element <annotationBlock> for interlinear annotation

• August 2016: publication of ISO 24624 Transcription of Spoken

Language

(9)

Annotations in TEI: <standOff>

TEI teiHeader facsimile standOff teiHeader facsimile listAnnotation annotationBlock

Meta-data related to the

annotation, such as annotator, revisions of the annotations, availability

Recursive construct: allows the organisation of annotations par

method, annotator, campaign <div>-like component for structuring complex series of annotations

Elementary annotation unit

(10)

Application: interlinear annotation

• Encoding interlinear annotation as inline content (in <text>)

<annotationBlock who="#SPK0" start="#T9"end="#T12" xml:id="au1“> <u xml:id="u1">

<seg xml:id="seg45"type="utterance" subtype="declarative">

<w xml:id="w43">Nee</w> <pcxml:id="pc3">,</pc> <w xml:id="w44">hab</w> <w xml:id="w45">kein</w> <w xml:id="w46">Führerschein</w>

</seg> </u>

<spanGrp type="en">

<span from="#T9" to="#T12">No, I don't have a driver's license.</span> </spanGrp>

<spanGrp type="pos“>

<span from="#w43"to="#w43">NE</span> <span from="#pc3" to="#pc3">$,</span>

<span from="#w44"to="#w44">VAIMP</span> <span from="#w45"to="#w45">PIAT</span> <span from="#w46"to="#w46">NN</span> </spanGrp>

</annotationBlock>

(11)

Standoff interlinear annotation

• Encoding interlinear annotation as stand-off markup

– In <standOff>

<annotationBlockinst="#u1">

<spanGrpxmlns="http://www.tei-c.org/ns/1.0"type="en">

<spanfrom="#T9" to="#T12">No, I don't have a driver's license.</span> </spanGrp>

<spanGrpxmlns="http://www.tei-c.org/ns/1.0"type="pos"> <spanfrom="#w43" to="#w43">NE</span>

<spanfrom="#pc3" to="#pc3">$,</span>

<spanfrom="#w44" to="#w44">VAIMP</span> <spanfrom="#w45" to="#w45">PIAT</span> <spanfrom="#w46" to="#w46">NN</span> </spanGrp>

</annotationBlock>

– In <body>

<uxml:id="u1"who="#SPK0" start="#T9"end="#T12">

<seg xml:id="seg45" type="utterance" subtype="declarative"> <w xml:id="w43">Nee</w><pc xml:id="pc3">,</pc>

<w xml:id="w44">hab</w> <w xml:id="w45">kein</w> <w xml:id="w46">Führerschein</w>

(12)

Going further: mapping the Open

Annotation model

annotation body

target

<span type=“” from=“” to=“”>

Any TEI object (with @xml:id) or <surface> <bibl>, <person>, <place>, <fs>, <note>, <body>, MAF, SynAF

<interp type=“” inst=”” ana=“”>

document 1..n 0..n R e p r e s e n t i n g w h a t i s b e i n g o m p o n e n t s i n t h e D o c u m e n t e r a l ( p o s s i b l y r e c u r s i v e )

<zone type="" corresp="#_theSurface"

(13)

Going deeper into <standOff>

TEI teiHeader facsimile standOff teiHeader facsimile listAnnotation annotationBlock spanGrp-span-zone interpGrp-interp

(14)

Systematizing the use of <span> and

<interp> in <annotationBlock>

• <span>

– Close semantic to the notion of target in the OA model – Identifies a markable within the full-text of the document – Requires a precise guidance concerning pointing options – Kind-of business as usual

• <interp>

– Extended usage

– @type: provides the type of the annotation • Cf. @type on the parent standOff element

– @resp: the entity who is responsible for this annotation

– @inst: lists the components (span or surface) to be annotated – @ana: points to annotation content (body in OA speak)

(15)

Prototypical example

Dates in a named entity recognition context

<annotationBlock>

<date xml:id="E4N1" from=“1944-08-17“ to=“1944-08-25”>

17 - 25 août 1944</date>

<interp ana="#E4N1" inst="#d1e173"/>

<span xml:id="d1e173" from="#E4T6" to="#E4T10" />

</annotationBlock>

(16)

Example from the ANR Termith project

<annotationBlock>

<fs>

<f name="lemma"><string>corpus</string></f>

<f name="pos"><symbol value="NOM"/></f>

</fs>

<interp/>

<span target="#t1"/>

</annotationBlock>

(17)

Can we make the model more implicit?

<annotationBlock inst=“#t1”>

<fs>

<f name="lemma"><string>corpus</string></f>

<f name="pos"><symbol value="NOM"/></f>

</fs>

</annotationBlock>

• Closer to the speech transcription version

• Risks:

– Loosing the link with the OA model (hindrance to automation)

– Allowing all types of possible (creative) encodings

(18)

Issues (many)

• Which header do we need?

– Standoff annotation usually requires very restricted meta-data

– If we adopt the TEI header, we need to make it more flexible…

• Should we have a convergence with biblFull (where profileDesc is missed, BTW, SF:533, deeply ambered)

– Stand-off annotations may be generated by humans and machines

• how to put <author> (editionStmt) and <appInfo> (encodingDesc) at the same place?

• How do we provide guidance concerning annotations?

– Mapping the OA model to precise TEI constructs?

– Allowing a wide variety of possible vocabularies depending on the use

case?

• TBX entries, MathML, full-text annotation (<body>?)

(19)

Leaving dust under the carpet for

today: pointing mechanisms

1. Offset based mechanism: string-range(…)

– not stable in case the original text is modified. The

annotation needs to be rebuilt

2. word tokenisation <p><w>.</w><w>,</w></p>

– may generate an insane amount of data

3. <span xml:id=“s1” to=“#a1”/> + <anchor xml:id=“a1”/>

Example:

<p>….

<span to="#a2"/><span to="#a1"/>le petit chat<anchor

xml:id="a1"/> est mort <anchor xml:id="a2"/>...

</p>

(20)

Next steps

• Finalising the content model of <annotationBlock>

– Completely open model?

– Constrained with specific model classes? (OA)

– Alternation between the two (or more) options

• Gathering reference example from existing

implementations

– Istex, Termith, EPO, IDS

• Finalising the graft in the guidelines

– Section in chapter 16 Linking, Segmentation, and

Alignment?

(21)

MERCI !

Références

Documents relatifs

The article then explores some significant models of organization that emerged in previous periods in which the renewal of communist politics was closely linked to attempts to

In contrast, the case of split spin groups, corre- sponding to invariants of Quad n ∩ I 3 (meaning that the Witt classes of the forms must be in I 3 ), is very much open, and has

Paris, le 31 janvier 2022 – Les chambres de recours de l’OEB ont confirmé que seules des personnes physiques pouvaient être désignées comme inventeurs dans une demande de

In this section, three MAC protocols are modeled using the framework previously introduced to compare their power consumption, latency and reliability. The three evaluated pro-

Dans un contexte inédit en Europe, une commission composée d’experts grecs et internationaux est mise en place au sein du Parlement grec avec pour mission d’enquêter sur la

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des

First, we combine income tax micro-files, household surveys (housing and wealth surveys) and national accounts to provide series of pre-tax national income by

[r]