Preprocessing and textometry tools

Academic year: 2021

Oral History: Users and their Scholarly Practices in a Multidisciplinary World

CLARIN-EU

Florentina Armaselu, Luxembourg Centre for Contemporary and Digital History (C²DH), University of

Summary

Keywords: pre-process, analyse, interpret, evaluate

• Overview of the session
• Pre-processing data for textometric analysis
• Short introduction (+ demo) to textometric analysis with TXM
• A few guidelines on the TXM hands-on

Overview of the session

Goals

• familiarise the participants with textometric analysis and the TXM software;
• encourage reflection on this type of analysis applied to (oral) history;
• collect feedback on the use of language technology in (oral) history research.

Session activities (2 hours)

• presentation + TXM demo (20 min.);
• TXM hands-on experiment (1 hour and 20 min.);
• discussions (10 min.);

Pre-processing data for textometric analysis - Sample corpus

[Figure panels: Metadata; Text]

BLACKIMMIGRANTSEN: interview transcriptions from the collection Black immigrants to Britain, 1890-1975, UK Data Archive, Study Number 4936 (Thompson, P.).

10 interviews, 1973, 1975; interviewees: 2 women, 9 men.

Key topics: arrival in Britain, family, leisure, religion, politics, marriage and children, education, prejudice and race riots, etc.

Pre-processing data for textometric analysis - Workflow

[Workflow diagram. Panels: Metadata; Text. Input: XML-TEI transcriptions. Output: transformed XML-TEI transcriptions.]

Steps shown in the diagram:

• (1) identify speakers and speakers’ roles; (2) clean data (XSLT);
• (1) to lower case (XSLT); (2) POS tagging + lemmatisation (TreeTagger).

Sample transcription excerpts shown in the diagram:

… where we used to live in Sprules (?) Road …

… she came up here and - to [missing] College …
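
The cleaning and lower-casing steps were done with XSLT in the session; the actual stylesheet rules are not shown in the slides. As a rough illustration only, the Python sketch below assumes that "cleaning" means dropping transcriber markers such as `(?)` and `[missing]`, which are taken from the two sample excerpts. The function name `clean_utterance` and both patterns are hypothetical.

```python
import re

# Assumed transcriber markers, inferred from the sample excerpts only;
# the real XSLT cleaning rules may differ.
UNCERTAIN = re.compile(r"\s*\(\?\)")      # e.g. "Sprules (?) Road"
MISSING = re.compile(r"\s*\[missing\]")   # e.g. "to [missing] College"


def clean_utterance(text: str) -> str:
    """Remove the assumed markers, normalise whitespace, convert to lower case."""
    text = UNCERTAIN.sub("", text)
    text = MISSING.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


if __name__ == "__main__":
    print(clean_utterance("where we used to live in Sprules (?) Road"))
    print(clean_utterance("she came up here and - to [missing] College"))
```

POS tagging and lemmatisation are left out here, since in the workflow they are delegated to TreeTagger.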

Short introduction to the textometric analysis with TXM

What is textometry?

Methodology allowing quantitative and qualitative analysis of textual corpora, by combining developments in lexicometric and statistical research with corpus technologies (Unicode, XML, TEI, NLP, CQP, R).

What is TXM?

Open-source platform (Heiden et al., 2010) used for the analysis of large bodies of texts in various fields of the humanities (history, literature, geography, linguistics, sociology, political sciences), which makes it possible to:

• import from different textual sources, e.g. raw text combined with flat metadata (CSV), raw XML/w+metadata, XML-TEI BFM;
• export results in CSV for lists and tables, or in graphic format (SVG, JPEG, etc.) for diagrams;
• manage NLP tools for processing the input files during the import process (e.g. TreeTagger for lemmatisation and POS tagging);
• build a sub-corpus or a partition based on metadata (date, author, genre, etc.) or structural units (text, section, etc.) of a corpus;
• query for patterns of words and word properties (via the CQP search engine);
• build frequency lists, KWIC concordances and co-occurrence scores for words and word properties;
• compute specificity scores for words/properties in a sub-corpus or a partition,
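
To make two of the listed operations concrete, here is a minimal Python sketch of a frequency list and a KWIC (keyword in context) concordance over a tokenised text. It is not TXM's implementation or API; the function names `frequency_list` and `kwic` and the `window` parameter are assumptions chosen for illustration.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the corpus.
    tokens = "we used to live in london and we used to work there".split()
    for left, key, right in kwic(tokens, "used", window=2):
        print(f"{left:>12} | {key} | {right}")
```

In TXM the same kind of lookup is expressed as a CQP query over words and their properties (e.g. lemma or POS tag) rather than as a plain string match.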

Short introduction to the textometric analysis with TXM

• Create sub-corpus and partition using structural properties
• Build queries and look for co-occurrences of words/properties
• Build concordances and visualise contexts at the document level
• Compute specificity scores and draw diagrams
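
The first two operations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how TXM works internally: utterances are represented as dictionaries with hypothetical keys (`role`, `tokens`), and co-occurrence is reduced to raw counts within a fixed window, whereas TXM computes statistical co-occurrence scores.

```python
from collections import Counter, defaultdict


def build_partition(utterances, prop):
    """Group tokens into parts according to the value of one property."""
    parts = defaultdict(list)
    for utt in utterances:
        parts[utt[prop]].extend(utt["tokens"])
    return dict(parts)


def cooccurrences(tokens, pivot, window=2):
    """Count words appearing within `window` tokens of each pivot occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == pivot:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


if __name__ == "__main__":
    # Hypothetical utterances; "role" stands in for a speaker-role property.
    utterances = [
        {"role": "interviewer", "tokens": ["when", "did", "you", "arrive"]},
        {"role": "interviewee", "tokens": ["i", "arrived", "in", "britain", "in", "1950"]},
    ]
    parts = build_partition(utterances, "role")
    print(cooccurrences(parts["interviewee"], "in", window=1))
```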

Short introduction to the textometric analysis with TXM

Specificities: a probabilistic model (Lafon, 1980) based on the hypergeometric distribution, which makes it possible to:

• study the frequency distribution of words/properties in a (sub-)corpus divided into several parts;
• compare the parts in terms of specific (excess/deficit) or basic use of words/properties.

Specificity score

(see also TXM Manual, 2015: §11.9; Bernard and Bohet, 2017: 68-78)

• sign: positive (+) if the observed frequency f_i(w_k) of word w_k in part i is higher than in a “normal” distribution, negative (-) if it is lower (taking into account the size of part i compared to the whole);
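
A minimal Python sketch of a signed score of this kind follows. It assumes the usual hypergeometric setting: a corpus of `T` tokens, a part of `t` tokens, a word occurring `F` times overall and `f` times in the part. The "normal" frequency is taken to be the expected value `F * t / T`, and the magnitude is taken as the negative base-10 logarithm of the tail probability; both choices and all names are assumptions for illustration and may not match TXM's exact computation.

```python
from math import comb, log10


def hypergeom_pmf(k, T, t, F):
    """P(X = k): probability that a part of t tokens drawn from T tokens
    contains exactly k of the F occurrences of the word."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)


def specificity(f, T, t, F):
    """Signed score: positive for excess use, negative for deficit."""
    lowest = max(0, t - (T - F))   # smallest possible count in the part
    highest = min(F, t)            # largest possible count in the part
    expected = F * t / T
    if f >= expected:
        # Excess: probability of observing f or more occurrences.
        p = sum(hypergeom_pmf(k, T, t, F) for k in range(f, highest + 1))
        sign = 1
    else:
        # Deficit: probability of observing f or fewer occurrences.
        p = sum(hypergeom_pmf(k, T, t, F) for k in range(lowest, f + 1))
        sign = -1
    p = min(max(p, 1e-300), 1.0)   # guard against underflow and rounding
    return sign * -log10(p)


if __name__ == "__main__":
    # Hypothetical counts: 1000-token corpus, 100-token part, word seen 20 times.
    print(specificity(10, 1000, 100, 20))  # over-represented in the part
    print(specificity(0, 1000, 100, 20))   # under-represented in the part
```

With these hypothetical counts the expected frequency in the part is 2, so 10 occurrences yield a positive score and 0 occurrences a negative one.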

A few guidelines on the TXM experiment

Materials:

• TXM tutorial;
• task descriptions.

During the experiment, please pay attention to the following aspects:

• proposed tasks;
• hypotheses (and possibly new questions) that may be formulated based on the linguistic phenomena observed in the studied corpus;
• the role played by the language technology in formulating these hypotheses or new questions, and potentially its “added value” (if applicable);
• possible limitations, biases, etc. of the approach or data sample;
• general reflections on the application of this type of analysis to (oral) history

References

Bernard, M., Bohet, B. (2017). Littérométrie. Outils numériques pour l’analyse des textes littéraires. Presses Sorbonne Nouvelle.

Heiden, S., Magué, J-P., Pincemin, B. (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In Sergio Bolasco, Isabella Chiari, Luca Giuliano (Ed.), Proc. of 10th International Conference on the Statistical Analysis of Textual Data - JADT 2010 (Vol. 2, p. 1021-1032). Edizioni Universitarie di Lettere Economia Diritto, Roma, Italy. https://halshs.archives-ouvertes.fr/halshs-00549779/fr/.

Lafon, P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, N°1, p. 127-165. http://www.persee.fr/doc/mots_0243-6450_1980_num_1_1_1008.

TEI: Text Encoding Initiative. http://www.tei-c.org/.

TXM User Manual 0.7 - June 2015. http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf.

TXM Website: http://textometrie.ens-lyon.fr (accessed May 15, 2018).

XML: Extensible Markup Language. https://www.w3.org/XML/.

XSLT: Extensible Stylesheet Language Transformations.
