FROM TEXT TO BITS
MAKING THE PRINTED TREATISES OF A MONK FROM THE 16TH CENTURY “UNDERSTANDABLE” TO THE COMPUTER
Bastien DUBUISSON
Ph.D. Candidate in History
Greene’s Institute, Oxford, 19/10/19
@DubuissonBasti1
“machine-readable”
■ ≠ Machine Translation
■ ≠ Natural Language Processing (NLP)
■ “Data in a data format that can be automatically [read and] processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data”
1. Create a digital corpus ex nihilo
2. Text → Structured Data
3. Digital inquiries
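As a minimal sketch of what "Text → Structured Data" can mean in practice, the snippet below stores one edition as a JSON record. The bibliographic values come from the slides; the field names are illustrative assumptions, not the schema actually used in the project, and the transcription itself is left empty.

```python
import json

# Hypothetical record layout: the field names are assumptions made for this sketch.
record = {
    "author": "Johannes Scheckmann",
    "title": "Epitome alias medulla Gestorum Trevirorum",
    "place": "Metz",
    "year": 1517,
    "language": "la",
    "text": "",  # the OCR transcription would go here
}

# Serialize to JSON (machine-readable) and read it back.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
```

Once the corpus is in a form like this, a script can filter by year or author without a human reading each file.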
The Context
■ 1512: Tunica Christi
■ “Ostensiones reliquiarum”
■ “Heilthumsschriften”
Johannes ENEN, Medulla Gestorum Trevirensium, 1514, 2nd edition in 1515
→ Translated into Latin by Johannes SCHECKMANN (†1531): Epitome alias medulla Gestorum Trevirorum, Metz, 1517
OPTICAL CHARACTER RECOGNITION
Creating Digital Copies of Texts
Digital “Editions”
■ Raw text format → character strings (scripting languages)
■ Manual typing (txt, docx, …)
■ PROBLEM: large corpus of (longer) texts
Optical Character Recognition (OCR)
■ Convert images of handwritten/printed text into machine-encoded text
■ Kraken (http://kraken.re)
■ Neural Network: “machine learning system that simulates a large number of interconnected processing units that resemble abstract versions of neurons”
Optical Character Recognition (OCR)
■ Avoid the silver-bullet syndrome:
- High-quality images (400 dpi) → IIIF
- Pre-processing (deskewing, denoising) → GIMP
Optical Character Recognition (OCR)
- Train the model
a. 1 model = 1 typeface
b. length of a text
c. abbreviations
→ TIME-CONSUMING ACTIVITY
1. Humanistic Typeface (3/4 books)
2. Gothic Typeface (1/4 books)
WHAT CAN BE LOST
Adapting Texts for Counting Words
Tokenization
■ Slicing a text into tokens
■ Latin:
- appended conjunctions and adverbs (-que; -ne; -ve), e.g.: dominusque → dominus; que
- contraction of reflexive pronouns with idiomatic reinforcements, e.g.: seipsum → se; ipsum
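The two Latin cases above can be sketched in a few lines of Python. This is a toy tokenizer under stated assumptions, not the tool used for the corpus: the exception list and the compound table are illustrative and far from complete, and only -que is split, because splitting -ne and -ve blindly would damage ordinary forms such as "nomine" or "breve".

```python
import re

# Words that merely end in "que" and must stay whole (illustrative, incomplete).
QUE_EXCEPTIONS = {
    "atque", "neque", "quoque", "itaque", "denique",
    "usque", "namque", "quisque", "undique", "utique", "quinque",
}

# Reflexive compounds to split (illustrative, incomplete).
REFLEXIVE_COMPOUNDS = {
    "seipsum": ("se", "ipsum"),
    "seipsam": ("se", "ipsam"),
    "seipso": ("se", "ipso"),
}


def tokenize(text):
    """Lowercase the text and split it into tokens, detaching enclitic -que."""
    tokens = []
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word in REFLEXIVE_COMPOUNDS:
            tokens.extend(REFLEXIVE_COMPOUNDS[word])
        elif word.endswith("que") and len(word) > 3 and word not in QUE_EXCEPTIONS:
            tokens.extend([word[:-3], "que"])
        else:
            tokens.append(word)
    return tokens


print(tokenize("Dominusque seipsum atque"))
```

Handling -ne and -ve reliably would probably need a lexicon check on the remaining stem, which is left out here.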
Lemmatization
■ Latin = highly inflected language
■ Lemma = dictionary form of a word
e.g.: amabit; amaverunt; amabantur → amo
■ TreeTagger
→ Unrecognized words
- Neo-Latin forms
- German words
- Inconsistent spelling
→ Remaining unknown lemmas
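To make the "unknown lemma" problem concrete, here is a deliberately naive lookup lemmatizer. It is not how TreeTagger works (TreeTagger is a trained statistical tagger); the tiny lexicon, the `UNKNOWN` label, and the helper names are assumptions made for this sketch only.

```python
# Toy lexicon built from the slide's own example (form -> lemma).
LEXICON = {
    "amabit": "amo",
    "amaverunt": "amo",
    "amabantur": "amo",
}

UNKNOWN = "<unknown>"  # placeholder label chosen for this sketch


def lemmatize(tokens, lexicon=LEXICON):
    """Pair each token with its lemma, or UNKNOWN when the form is not listed."""
    return [(token, lexicon.get(token.lower(), UNKNOWN)) for token in tokens]


def unknown_rate(pairs):
    """Share of tokens whose lemma could not be resolved."""
    if not pairs:
        return 0.0
    return sum(1 for _, lemma in pairs if lemma == UNKNOWN) / len(pairs)


pairs = lemmatize(["amabit", "Heilthumsschriften"])
print(pairs, unknown_rate(pairs))
```

Tracking the unknown rate per text is one simple way to see how much vocabulary (Neo-Latin, German, variant spellings) still escapes the lemmatizer.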
WHAT CAN BE FOUND
Discoveries by Means of Digital Inquiries
Text Comparisons
■ Scheckmann’s Heilthumsschriften → chronicles
■ Use of hagiographic texts?
→ Identify common passages (even if not verbatim)
→ textreuse package (“optimal alignments”)
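The idea behind finding shared passages that are not verbatim can be illustrated with a word-level local alignment (Smith-Waterman style scoring). This is a Python stand-in written for illustration, not the API of the textreuse package, which is an R library; the scoring values and the two sample phrases are invented for the example and are not quotations from the corpus.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two token lists (Smith-Waterman)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                       # start a new local alignment
                previous[j - 1] + step,  # align the two words
                previous[j] + gap,       # skip a word in a
                current[j - 1] + gap,    # skip a word in b
            )
            best = max(best, current[j])
        previous = current
    return best


# Invented sample phrases that differ by one word in the middle.
left = "corpus sancti in ecclesia translatum est".split()
right = "corpus beati in ecclesia translatum fuit".split()
print(local_alignment_score(left, right))
```

Because a mismatch only lowers the score instead of ending the match, a passage with a substituted word still scores higher than any of its verbatim fragments.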
Text Comparisons
■ Passages on relics and miracles (inventio, translatio, miraculum,…)
■ Altar dedications (St Maximin & St Paulin)
■ Gesta Treverorum
■ Flores Epytaphii Sanctorum (Thiofrid of Echternach, 12th cent.)
Stylometry (Authorship Attribution)
■ Stylometry: “the quantitative study of literary style through computational distant reading methods (...), [that is] based on the observation that authors tend to write in relatively consistent, recognizable and unique ways”
– Laramée, D., Introduction to Stylometry with Python
■ Disputed authorship of:
- Historia Excidii Sancti Maximini
- Supplicatio ad Caesarem
Stylometry
■ “Stop words” (or “function words”)
→ used unconsciously
→ topic-independent
■ “Impostors” (Nicholas of Cusa & Erasmus of Rotterdam)
■ 200 MFW (most frequent words) → 65 MFFW (most frequent function words)
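The reduction from most frequent words to most frequent function words can be sketched as a filter followed by relative frequencies per text. The function-word list below is a tiny illustrative assumption, as are the function names; the resulting vectors are the kind of input a principal component analysis would take, which is not implemented here.

```python
from collections import Counter

# Illustrative Latin function words; a real list would be much longer.
FUNCTION_WORDS = {"et", "in", "ad", "non", "cum", "sed", "ut", "que"}


def most_frequent_words(texts, n):
    """The n most frequent words across all texts (texts: name -> token list)."""
    counts = Counter(word for tokens in texts.values() for word in tokens)
    return [word for word, _ in counts.most_common(n)]


def feature_matrix(texts, n_mfw, function_words=FUNCTION_WORDS):
    """Keep only function words among the MFW and compute relative frequencies."""
    features = [w for w in most_frequent_words(texts, n_mfw) if w in function_words]
    matrix = {}
    for name, tokens in texts.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty texts
        matrix[name] = [counts[w] / total for w in features]
    return features, matrix


sample = {"A": "et in et rex".split(), "B": "et ad urbs ad".split()}
features, matrix = feature_matrix(sample, n_mfw=2)
print(features, matrix)
```

Relative rather than absolute frequencies keep long and short texts comparable.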
Stylometry
[Figure: plot with axes PC1 (x) and PC2 (y)]
Concluding note
bastien.dubuisson@uni.lu
@DubuissonBasti1
■ Missing ‘traditional’ study
■ Creation of new digital corpora
■ Thoughtful use of digital tools
■ Challenge for historical languages
■ New research opportunities