
From Text to Bits: Making the Treaties of a Monk from the 16th Century "Understandable" to the Computer




(1)

FROM TEXT TO BITS

MAKING THE PRINTED TREATIES OF A MONK FROM THE 16TH CENTURY “UNDERSTANDABLE” TO THE COMPUTER

Bastien DUBUISSON

Ph.D. Candidate in History

Greene’s Institute, Oxford, 19/10/19

@DubuissonBasti1

(2)

“machine-readable”

■ ≠ Machine Translation

■ ≠ Natural Language Processing (NLP)

“Data in a data format that can be automatically [read and] processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data”

1. Create a digital corpus ex nihilo

2. Text → Structured Data

3. Digital inquiries
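The step from text to structured data can be illustrated with a minimal Python sketch. It assumes a transcription is available as one string per page; the function name `pages_to_records`, the record fields, and the placeholder strings are all hypothetical and do not come from the corpus itself.

```python
import json

# Placeholder transcription: one string per page (not quoted from any source).
raw_pages = [
    "first line of page one\nsecond line of page one",
    "first line of page two",
]

def pages_to_records(pages, work_id):
    """Turn a list of page strings into line-level records (hypothetical schema)."""
    records = []
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            records.append(
                {"work": work_id, "page": page_no, "line": line_no, "text": line}
            )
    return records

records = pages_to_records(raw_pages, "example-work")
# JSON is one of the machine-readable formats named in the definition above.
as_json = json.dumps(records, ensure_ascii=False, indent=2)
print(as_json)
```

Each line of text now carries explicit coordinates (work, page, line), which is what allows later queries to point back to a precise place in the printed book.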

(3)
(4)

The Context

1512: Tunica Christi

“Ostensiones reliquiarum”

“Heilthumsschriften”

Johannes ENEN, Medulla Gestorum Trevirensium, 1514, 2nd edition in 1515

→ Translated into Latin by Johannes SCHECKMANN (†1531): Epitome alias medulla Gestorum Trevirorum, Metz, 1517

(5)

OPTICAL CHARACTER RECOGNITION

Creating Digital Copies of Texts

(6)

Digital “Editions”

■ Raw text format → character strings (scripting languages)

■ Manually typing (txt, docx,…)

■ PROBLEM: large corpus of (longer) texts

(7)

Optical Character Recognition (OCR)

■ Convert images of handwritten/printed text into machine-encoded text

Kraken (http://kraken.re)

■ Neural Network: “machine learning system that simulates a large number of interconnected processing units that resemble abstract version of neurons”

(8)

Optical Character Recognition (OCR)

■ Avoid a Silver Bullet Syndrome:

- High-quality images (400 dpi) → IIIF

- Pre-processing (deskewing, denoising) → GIMP

(9)

Optical Character Recognition (OCR)

- Train the model

a. 1 model = 1 typeface

b. length of a text

c. abbreviations

→ TIME-CONSUMING ACTIVITY

1. Humanistic Typeface (3/4 books)

2. Gothic Typeface (1/4 books)

(10)

WHAT CAN BE LOST

Adapting Texts for Counting Words

(11)

Tokenization

■ Slicing a text into tokens

■ Latin:

- appended conjunctions and adverbs (-que; -ne; -ve), e.g.: dominusque → dominus; que

- contraction of reflexive pronouns with idiomatic reinforcements, e.g.: seipsum → se; ipsum
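The two Latin-specific rules above can be sketched in Python as follows. This is a simplified illustration, not the tokenizer used in the project: it assumes a set of known word forms (`KNOWN_FORMS`, a hypothetical toy lexicon) and only splits an enclitic when the remaining stem is a known form, so that words such as "nomine" are not cut into "nomi" + "ne".

```python
import re

ENCLITICS = ("que", "ne", "ve")
# Hypothetical toy lexicon; a real one would hold many thousands of forms.
KNOWN_FORMS = {"dominus", "nomine", "se", "ipsum"}
# Reflexive pronoun fused with a form of "ipse", e.g. seipsum -> se + ipsum.
REFLEXIVE = re.compile(r"^(se)(ips[a-z]+)$")

def split_word(word, known_forms):
    """Split one lowercase word into one or two tokens."""
    if word in known_forms:
        return [word]
    fused = REFLEXIVE.match(word)
    if fused:
        return [fused.group(1), fused.group(2)]
    for enclitic in ENCLITICS:
        stem = word[: -len(enclitic)]
        # Split only when the stem is attested, to avoid false positives.
        if word.endswith(enclitic) and stem in known_forms:
            return [stem, enclitic]
    return [word]

def tokenize(text, known_forms=KNOWN_FORMS):
    """Lowercase the text, keep alphabetic runs, and apply the splitting rules."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        tokens.extend(split_word(word, known_forms))
    return tokens

print(tokenize("Dominusque seipsum nomine"))
```

The lexicon check is a design choice made for this sketch: a purely suffix-based rule would be shorter but would wrongly split many ordinary words ending in -ne or -ve.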

(12)

Lemmatization

Latin = highly inflected language

Lemma = dictionary form of a word

e.g.: amabit; amaverunt; amabantur → amo

TreeTagger

→ Unrecognized words

- Neolatin forms

- German words

- Inconsistent spelling

→ Remaining unknown lemmas
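The idea of lemmatization, and the way unknown forms surface, can be shown with a toy lookup in Python. This is not how TreeTagger works internally (it relies on a trained model); the table `FORM_TO_LEMMA`, the u/v and i/j normalization, the `"<unknown>"` placeholder, and the sample word "heiltum" are assumptions made for the illustration.

```python
# Toy form-to-lemma table built from the example on the slide.
FORM_TO_LEMMA = {"amabit": "amo", "amaverunt": "amo", "amabantur": "amo"}
UNKNOWN = "<unknown>"  # arbitrary placeholder chosen for this sketch

def normalize(form):
    """Reduce spelling variation: lowercase, j -> i, v -> u (assumed convention)."""
    return form.lower().replace("j", "i").replace("v", "u")

# Index the table by normalized spelling so variant spellings still match.
LOOKUP = {normalize(form): lemma for form, lemma in FORM_TO_LEMMA.items()}

def lemmatize(tokens):
    """Return (token, lemma) pairs; unmatched tokens get the UNKNOWN marker."""
    return [(tok, LOOKUP.get(normalize(tok), UNKNOWN)) for tok in tokens]

def unknown_forms(tokens):
    """List the distinct tokens that still need manual attention."""
    return sorted({tok for tok, lemma in lemmatize(tokens) if lemma == UNKNOWN})

sample = ["Amabit", "amauerunt", "heiltum"]  # "heiltum": hypothetical German word
print(lemmatize(sample))
print(unknown_forms(sample))
```

Collecting the unknown forms in one list is useful in practice because they can then be reviewed by hand and added to the lexicon in a single pass.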

(13)

WHAT CAN BE FOUND

Discoveries by Means of Digital Inquiries

(14)

Text Comparisons

■ Scheckmann’s Heilthumsschriften → chronicles

■ Use of hagiographic texts?

→ Identify common passages (even if not verbatim)

→ textreuse package (“Optimal alignments”)
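To show what an optimal local alignment between two passages looks like, here is a minimal Python sketch of the Smith-Waterman algorithm applied to word tokens. It is a stand-in written for illustration, not the textreuse package (which is an R library); the scoring values and the English sample sentences are arbitrary assumptions.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment over two token lists.

    Returns (score, aligned_a, aligned_b); "-" marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, out_a[::-1], out_b[::-1]

# Invented sample sentences, used only to exercise the function.
text_a = "in the year the relic was found in the church".split()
text_b = "they say the relic was later found in the crypt".split()
score, ali_a, ali_b = local_align(text_a, text_b)
print(score)
print(" ".join(ali_a))
print(" ".join(ali_b))
```

Because gaps and mismatches are tolerated at a cost, the shared passage is recovered even though one version inserts an extra word, which is the "even if not verbatim" case.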

(15)

Text Comparisons

■ Passages on relics and miracles (inventio, translatio, miraculum,…)

■ Altar dedications (St Maximin & St Paulin)

Gesta Treverorum

Flores Epytaphii Sanctorum (Thiofrid of Echternach, 12th cent.)

(16)

Stylometry (Authorship Attribution)

■ Stylometry: “the quantitative study of literary style through computational distant reading methods (...), [that is] based on the observation that authors tend to write in relatively consistent, recognizable and unique ways”

– Laramée, D., Introduction to Stylometry with Python

■ Disputed authorship of:

- Historia Excidii Sancti Maximini

- Supplicatio ad Caesarem

(17)

Stylometry

■ “Stop words” (or “function words”)

→ used unconsciously

→ topic independent

■ “Impostors” (Nicholas of Cusa & Erasmus of Rotterdam)

■ 200 MFW → 65 MFFW
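One plausible reading of the reduction above is: take the most frequent words of the whole corpus, keep only those that are function words, and describe each text by the relative frequencies of the words that remain. The Python sketch below follows that reading; the function names, the tiny stop-word list, and the sample texts are hypothetical, and the actual feature selection used in the study may differ.

```python
from collections import Counter

def most_frequent_function_words(texts, stop_words, n_mfw):
    """Take the n_mfw most frequent words of the corpus, keep only stop words."""
    counts = Counter(tok for tokens in texts.values() for tok in tokens)
    mfw = [word for word, _ in counts.most_common(n_mfw)]
    return [word for word in mfw if word in stop_words]

def profile(tokens, features):
    """Relative frequency of each feature word within one text."""
    if not tokens:
        return [0.0 for _ in features]
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in features]

# Invented toy corpus and stop-word list, for illustration only.
texts = {
    "A": "et in et ad rex et".split(),
    "B": "in ad in urbs".split(),
}
stop_words = {"et", "in", "ad"}

features = most_frequent_function_words(texts, stop_words, n_mfw=3)
profiles = {name: profile(tokens, features) for name, tokens in texts.items()}
print(features)
print(profiles)
```

The resulting vectors, one per text, are the kind of input that a dimensionality reduction such as principal component analysis can then project onto two axes.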

(18)

Stylometry

[Figure: scatter plot of the first two principal components, PC1 (x-axis) and PC2 (y-axis)]

(19)
(20)
(21)
(22)
(23)
(24)

Concluding note

bastien.dubuisson@uni.lu

@DubuissonBasti1

■ Missing ‘traditional’ study

■ Creation of new digital corpora

■ Thoughtful use of digital tools

■ Challenge for historical languages

■ New research opportunities
