Oral History: Users and their Scholarly Practices in a Multidisciplinary World
CLARIN-EU
Preprocessing and textometry tools
Florentina Armaselu, Luxembourg Centre for Contemporary and Digital History (C²DH), University of
Summary
Keywords: pre-process, analyse, interpret, evaluate
• Overview of the session
• Pre-processing data for textometric analysis
• Short introduction (+ demo) to the textometric analysis with TXM
• A few guidelines on the TXM hands-on
Overview of the session
Goals
• familiarise the participants with the textometric analysis and the TXM software;
• encourage reflection on this type of analysis applied to (oral) history;
• collect feedback on the use of language technology in (oral) history research.
Session activities (2 hours)
• Presentation + TXM demo (20 min.);
• TXM hands-on experiment (1 hour and 20 min.);
• Discussions (10 min.).
Pre-processing data for textometric analysis - Sample corpus
[Figure panels: Metadata; Text]
BLACKIMMIGRANTSEN: interview transcriptions from the collection Black immigrants to Britain, 1890-1975, UK Data Archive, Study Number 4936 (Thompson, P.).
• 10 interviews, 1973, 1975; interviewees: 2 women, 9 men.
• Key topics: arrival in Britain, family, leisure, religion, politics, marriage and children, education, prejudice and race riots, etc.
Pre-processing data for textometric analysis - Workflow
[Figure panels: Metadata; Text]
Workflow steps (from the diagram):
• (1) identify speakers and speakers’ roles; (2) clean data (XSLT);
• (1) convert to lower case (XSLT); (2) POS tagging + lemmatisation (TreeTagger).
Input: XML-TEI transcriptions; output: transformed XML-TEI transcriptions.
Transcription samples:
… where we used to live in Sprules (?) Road …
… she came up here and - to [missing] College …
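A minimal Python sketch of the clean-up and lower-casing steps, for illustration only: the session's own transformations are XSLT stylesheets that are not shown here. The function name `clean_utterance` is hypothetical, and the choice to drop the "(?)" and "[missing]" markers and the isolated hyphen is an assumption about what "clean data" covers.

```python
import re

def clean_utterance(text):
    """Remove transcription markers and lower-case one utterance.

    Assumption: "(?)" flags an uncertain word, "[missing]" flags a gap,
    and a free-standing "-" marks a hesitation; all three are dropped.
    """
    text = re.sub(r"\(\?\)|\[missing\]", " ", text)  # drop the markers
    text = re.sub(r"\s-\s", " ", text)               # drop isolated hyphens
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text.lower()

samples = [
    "where we used to live in Sprules (?) Road",
    "she came up here and - to [missing] College",
]
cleaned = [clean_utterance(s) for s in samples]
for line in cleaned:
    print(line)
```

POS tagging and lemmatisation are left out of the sketch, since in the workflow above they are delegated to TreeTagger.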
Short introduction to the textometric analysis with TXM
What is textometry?
• Methodology allowing quantitative and qualitative analysis of textual corpora, by combining developments in lexicometric and statistical research with corpus technologies (Unicode, XML, TEI, NLP, CQP, R).
What is TXM?
• Open-source platform (Heiden et al., 2010) used for the analysis of large bodies of texts in various fields of the humanities (history, literature, geography, linguistics, sociology, political sciences), which makes it possible to:
• import from different textual sources, e.g. raw text combined with flat metadata (CSV), raw XML/w+metadata, XML-TEI BFM; export results in CSV for lists and tables or in graphic format (SVG, JPEG, etc.) for diagrams;
• manage NLP tools for processing the input files during the import process (e.g. TreeTagger for lemmatisation and POS tagging);
• build a sub-corpus or a partition based on metadata (date, author, genre, etc.) or structural units (text, section, etc.) of a corpus;
• query for patterns of words and word properties (via the CQP search engine);
• build frequency lists, KWIC concordances and co-occurrence scores for words and word properties;
• compute specificity scores for words/properties in a sub-corpus or a partition,
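To make two of the operations listed above concrete, here is a small Python sketch of a frequency list and a KWIC (keyword in context) concordance. It does not use TXM or CQP; the function names `tokenize`, `frequency_list` and `kwic` are hypothetical, and the naive word tokenisation is an assumption. The input is built from the two transcription samples shown on the workflow slide.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (naive tokeniser)."""
    return re.findall(r"\w+", text.lower())

def frequency_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

text = ("where we used to live in Sprules Road "
        "she came up here and to College")
tokens = tokenize(text)

print(frequency_list(tokens)[:3])
for left, key, right in kwic(tokens, "to", width=2):
    print(f"{left:>12} | {key} | {right}")
```

In TXM the same results come from the built-in commands, with queries written in CQP syntax rather than plain string matching.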
Short introduction to the textometric analysis with TXM
• Create sub-corpus and partition using structural properties
• Build queries and look for co-occurrences of words/properties
• Build concordances and visualise contexts at the document level
• Compute specificity scores and draw diagrams
Short introduction to the textometric analysis with TXM
Specificities: probabilistic model (Lafon, 1980) using hypergeometric distribution formulae, which makes it possible to:
• study the frequency distribution of words/properties in a (sub-)corpus divided into several parts;
• compare the parts, in terms of specific (excess/deficit) or basic use of words/properties.
Specificities score (see also TXM Manual, 2015: §11.9; Bernard and Bohet, 2017: 68-78):
• sign: (+/-) if the observed frequency f_i(w_k) is higher/lower than in a “normal” distribution (taking into account the size of part i compared to the whole);
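The hypergeometric model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TXM's implementation: the names `hypergeom_pmf` and `specificity` and the symbols T (corpus size), t (size of part i), F (total frequency of the word) and f (its frequency in part i) are chosen here, and the observed frequency is compared with the expected value F·t/T as a simplification (Lafon's article reasons with the mode of the distribution). The figures in the usage lines are arbitrary, not taken from the sample corpus.

```python
from math import comb

def hypergeom_pmf(k, T, t, F):
    """P(X = k): probability of k occurrences of a word in a part of
    t tokens, when the word occurs F times in a corpus of T tokens."""
    if k < max(0, t - (T - F)) or k > min(t, F):
        return 0.0
    return comb(F, k) * comb(T - F, t - k) / comb(T, t)

def specificity(f, T, t, F):
    """Return (sign, probability) for an observed frequency f.

    "+": f is at or above the expected value, probability = P(X >= f).
    "-": f is below the expected value,       probability = P(X <= f).
    The smaller the probability, the more specific the word is to the part.
    """
    expected = F * t / T
    if f >= expected:
        ks = range(f, min(t, F) + 1)
        sign = "+"
    else:
        ks = range(max(0, t - (T - F)), f + 1)
        sign = "-"
    return sign, sum(hypergeom_pmf(k, T, t, F) for k in ks)

# Arbitrary illustration: corpus of 1000 tokens, part of 100 tokens,
# a word occurring 20 times overall (expected frequency in the part: 2).
print(specificity(10, T=1000, t=100, F=20))  # excess use in the part
print(specificity(0, T=1000, t=100, F=20))   # deficit use in the part
```

Taking the size of the part into account is what distinguishes this from comparing raw frequencies: the same f can be an excess in a small part and a deficit in a large one.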
A few guidelines on the TXM experiment
Materials:
• TXM tutorial;
• task descriptions.
During the experiment, please pay attention to the following aspects:
• proposed tasks;
• hypotheses (and possibly new questions) that may be formulated based on the linguistic phenomena observed in the studied corpus;
• the role played by the language technology in formulating these hypotheses or new questions, and potentially its “added value” (if applicable);
• possible limitations, bias, etc. of the approach or data sample;
• general reflections on the application of this type of analysis to (oral) history.
References
• Bernard, M., Bohet, B. (2017). Littérométrie. Outils numériques pour l’analyse des textes littéraires. Presses Sorbonne Nouvelle.
• Heiden, S., Magué, J-P., Pincemin, B. (2010). TXM : Une plateforme logicielle open-source pour la textométrie – conception et développement. In Sergio Bolasco, Isabella Chiari, Luca Giuliano (Ed.), Proc. of 10th International Conference on the Statistical Analysis of Textual Data - JADT 2010 (Vol. 2, p. 1021-1032). Edizioni Universitarie di Lettere Economia Diritto, Roma, Italy. https://halshs.archives-ouvertes.fr/halshs-00549779/fr/. TXM Website: http://textometrie.ens-lyon.fr (accessed May 15, 2018).
• Lafon, P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, N°1, p. 127-165. http://www.persee.fr/doc/mots_0243-6450_1980_num_1_1_1008.
• TEI: Text Encoding Initiative. http://www.tei-c.org/.
• TXM User Manual 0.7 - June 2015. http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf.
• XML: Extensible Markup Language. https://www.w3.org/XML/.
• XSLT: Extensible Stylesheet Language Transformations.