Preprocessing and textometry tools

Oral History: Users and their Scholarly Practices in a Multidisciplinary World

CLARIN-EU

Florentina Armaselu, Luxembourg Centre for Contemporary and Digital History (C2DH), University of

Summary

Keywords: pre-process, analyse, interpret, evaluate

• Overview of the session
• Pre-processing data for textometric analysis
• Short introduction (+ demo) to the textometric analysis with TXM
• A few guidelines on the TXM hands-on


Overview of the session

Goals

• familiarise the participants with textometric analysis and the TXM software;
• encourage reflection on this type of analysis applied to (oral) history;
• collect feedback on the use of language technology in (oral) history research.

Session activities (2 hours)

• presentation + TXM demo (20 min.);
• TXM hands-on experiment (1 hour and 20 min.);
• discussions (10 min.);


Pre-processing data for textometric analysis - Sample corpus

[Figure: sample transcription, showing its metadata and text]

BLACKIMMIGRANTSEN: interview transcriptions from the collection Black immigrants to Britain, 1890-1975, UK Data Archive, Study Number 4936 (Thompson, P.).

• 10 interviews, 1973, 1975; interviewees: 2 women, 9 men.
• Key topics: arrival in Britain, family, leisure, religion, politics, marriage and children, education, prejudice and race riots, etc.


Pre-processing data for textometric analysis - Workflow

[Diagram: XML-TEI transcriptions (metadata, text) are converted into transformed XML-TEI transcriptions in two stages]

Stage 1:
(1) identify speakers and speakers' roles;
(2) clean data (XSLT).

Stage 2:
(1) to lower case (XSLT);
(2) POS tagging + lemmatisation (TreeTagger).

Sample transcription fragments:

… where we used to live in Sprules (?) Road …

… she came up here and - to [missing] College …
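The cleaning and lower-casing steps can be illustrated with a minimal sketch. The slides state that these steps were done with XSLT; the Python below is a substitute written only for illustration. The function name `clean_fragment` and the choice of which markers to strip, "(?)" and "[missing]", are assumptions based on the two sample fragments, not a description of the actual stylesheet.

```python
import re

# Transcription markers assumed (for illustration only) to be removed
# during cleaning: "(?)" for uncertain words, "[missing]" for gaps.
MARKERS = re.compile(r"\(\?\)|\[missing\]")

def clean_fragment(text):
    """Strip the assumed markers, collapse whitespace, lower-case."""
    text = MARKERS.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

print(clean_fragment("where we used to live in Sprules (?) Road"))
print(clean_fragment("she came up here and - to [missing] College"))
```

Lower-casing before tagging merges variants such as "Road" and "road" into one form, at the cost of losing the capitalisation cue for proper nouns.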


Short introduction to the textometric analysis with TXM

What is textometry?

• A methodology that allows quantitative and qualitative analysis of textual corpora by combining developments in lexicometric and statistical research with corpus technologies (Unicode, XML, TEI, NLP, CQP, R).

What is TXM?

• An open-source platform (Heiden et al., 2010) used for the analysis of large bodies of texts in various fields of the humanities (history, literature, geography, linguistics, sociology, political sciences), which makes it possible to:

• import from different textual sources, e.g. raw text combined with flat metadata (CSV), raw XML/w+metadata, XML-TEI BFM; export results in CSV for lists and tables or in graphic format (SVG, JPEG, etc.) for diagrams;
• manage NLP tools for processing the input files during the import process (e.g. TreeTagger for lemmatisation and POS tagging);
• build a sub-corpus or a partition based on metadata (date, author, genre, etc.) or structural units (text, section, etc.) of a corpus;
• query for patterns of words and word properties (via the CQP search engine);
• build frequency lists, KWIC concordances and co-occurrence scores for words and word properties;
• compute specificity scores for words/properties in a sub-corpus or a partition,
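To make two of the notions listed above more concrete, here is a minimal Python sketch of a frequency list and a KWIC (keyword in context) concordance. It does not use TXM or CQP; the function names `tokenize`, `frequency_list` and `kwic`, the simple regular-expression tokeniser, and the sample string (assembled from the two transcription fragments shown earlier) are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Very simple tokeniser: lower-case alphanumeric word forms."""
    return re.findall(r"[a-z0-9']+", text.lower())

def frequency_list(tokens):
    """Word forms with their counts, most frequent first."""
    return Counter(tokens).most_common()

def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

sample = "where we used to live in Sprules Road she came up here and to College"
tokens = tokenize(sample)
print(frequency_list(tokens)[:3])
for left, key, right in kwic(tokens, "to", window=2):
    print(f"{left:>12} | {key} | {right}")
```

A real concordancer would query annotated tokens (word form, lemma, POS) rather than plain strings, which is what the CQP engine provides in TXM.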


[Screenshots of TXM:]

• Create sub-corpus and partition using structural properties
• Build queries and look for co-occurrences of words/properties
• Build concordances and visualise contexts at the document level
• Compute specificity scores and draw diagrams
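As a rough illustration of the first two operations, the sketch below partitions a toy corpus by a structural property and counts the words that co-occur with a pivot word. Everything in it is hypothetical: the `utterances` data are invented, the "role" property and the function names `partition` and `cooccurrences` are assumptions, and the raw counts it produces are not the statistical co-occurrence scores that TXM computes.

```python
from collections import Counter, defaultdict

# Invented toy data: (speaker role, lower-cased tokens) per utterance.
utterances = [
    ("interviewer", ["when", "did", "you", "arrive", "in", "britain"]),
    ("interviewee", ["we", "used", "to", "live", "in", "london"]),
    ("interviewee", ["we", "used", "to", "go", "to", "church"]),
]

def partition(utts):
    """Group tokens into parts according to the speaker role."""
    parts = defaultdict(list)
    for role, tokens in utts:
        parts[role].extend(tokens)
    return dict(parts)

def cooccurrences(tokens, pivot, window=2):
    """Count words found within `window` tokens of each pivot hit.
    Note: contexts may cross utterance boundaries in this sketch."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == pivot:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(context)
    return counts

parts = partition(utterances)
print({role: len(toks) for role, toks in parts.items()})
print(cooccurrences(parts["interviewee"], "used").most_common())
```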


Specificities: a probabilistic model (Lafon, 1980) based on hypergeometric distribution formulae, which makes it possible to:

• study the frequency distribution of words/properties in a (sub-)corpus divided into several parts;
• compare the parts in terms of specific (excess/deficit) or basic use of words/properties.

Specificity score (see also TXM Manual, 2015: §11.9; Bernard and Bohet, 2017: 68-78)

• sign: + or - if the observed frequency f_i(w_k) is higher or lower, respectively, than in a "normal" distribution (taking into account the size of part i compared to the whole);
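The idea behind the sign of the score can be sketched in Python with a hypergeometric model. This is a simplified illustration in the spirit of Lafon's model, not TXM's implementation. The parameter names are assumptions: `T` is the corpus size in tokens, `F` the total frequency of the word, `t` the size of part i, and `f` the observed frequency of the word in that part. Using a base-10 logarithm of the tail probability as the magnitude is also an assumption made here for illustration.

```python
from math import comb, log10

def hypergeom_pmf(k, T, F, t):
    """P(X = k): probability of k occurrences of a word in a part of
    t tokens, when the word occurs F times in a corpus of T tokens."""
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)

def specificity(f, T, F, t):
    """Signed score: positive for excess, negative for deficit,
    relative to the frequency expected from the size of the part."""
    expected = F * t / T
    lowest, highest = max(0, t - (T - F)), min(F, t)
    if f >= expected:
        # Excess: probability of observing f or more occurrences.
        p = sum(hypergeom_pmf(k, T, F, t) for k in range(f, highest + 1))
        return -log10(min(p, 1.0))
    # Deficit: probability of observing f or fewer occurrences.
    p = sum(hypergeom_pmf(k, T, F, t) for k in range(lowest, f + 1))
    return log10(min(p, 1.0))

# Toy figures: a word with 10 occurrences in a 1000-token corpus;
# the part under study has 100 tokens, so about 1 occurrence is expected.
print(specificity(5, T=1000, F=10, t=100))  # excess, positive score
print(specificity(0, T=1000, F=10, t=100))  # deficit, negative score
```

Scores close to zero correspond to the "basic" use mentioned above: the word is neither over- nor under-represented in the part.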


A few guidelines on the TXM experiment

Materials:

• TXM tutorial;
• task descriptions.

During the experiment, please pay attention to the following aspects:

• proposed tasks;
• hypotheses (and possibly new questions) that may be formulated based on the linguistic phenomena observed in the studied corpus;
• the role played by language technology in formulating these hypotheses or new questions, and potentially its "added value" (if applicable);
• possible limitations, bias, etc. of the approach or data sample;
• general reflections on the application of this type of analysis to (oral) history


References

• Bernard, M., Bohet, B. (2017). Littérométrie. Outils numériques pour l'analyse des textes littéraires. Presses Sorbonne Nouvelle.
• Heiden, S., Magué, J-P., Pincemin, B. (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In Sergio Bolasco, Isabella Chiari, Luca Giuliano (Ed.), Proc. of 10th International Conference on the Statistical Analysis of Textual Data - JADT 2010 (Vol. 2, p. 1021-1032). Edizioni Universitarie di Lettere Economia Diritto, Roma, Italy. https://halshs.archives-ouvertes.fr/halshs-00549779/fr/.
• Lafon, P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, N°1, p. 127-165. http://www.persee.fr/doc/mots_0243-6450_1980_num_1_1_1008.
• TEI: Text Encoding Initiative. http://www.tei-c.org/.
• TXM User Manual 0.7 - June 2015. http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf.
• TXM Website: http://textometrie.ens-lyon.fr (accessed May 15, 2018).
• XML: Extensible Markup Language. https://www.w3.org/XML/.
• XSLT: Extensible Stylesheet Language Transformations.
