Detailed Contents

Contents . . . i

Acknowledgements . . . iii

List of Figures . . . v

List of Tables . . . vii

List of Abbreviations . . . ix

Introduction . . . 1

1 Motivation . . . 3

2 Objectives . . . 7

2.1 Research questions . . . 8

2.2 Method . . . 10

2.3 Use case . . . 12

3 Structure . . . 15

I Information Extraction . . . 19

1 Background . . . 21

1.1 Natural language processing . . . 21

1.2 Information extraction . . . 24

1.3 Related fields . . . 26

2 Named-Entity Recognition . . . 29

2.1 Task definition . . . 29

2.2 Typologies . . . 31

2.2.1 Named-entity typology . . . 31

2.2.2 Types of NER systems . . . 34

2.3 Entity ambiguity and disambiguation . . . 35

2.3.1 Synonymy . . . 36

2.3.2 Homonymy and polysemy . . . 37

2.3.3 Metonymy . . . 38

2.3.4 Disambiguation . . . 38

3 Relations and Events . . . 40

3.1.1 Typology of relations . . . 42

3.1.2 Relation detection systems . . . 43

3.2 Event extraction and temporal analysis . . . 44

3.2.1 Event annotation . . . 46

3.2.2 Event extraction systems . . . 46

3.3 Template filling . . . 48

II Semantic Enrichment with Linked Data . . . 51

1 Making Sense of the Web . . . 53

1.1 The original vision . . . 54

1.2 Data structure and interoperability . . . 56

1.2.1 XML and RDF . . . 56

1.2.2 SPARQL . . . 59

1.2.3 SKOS . . . 61

1.3 From the Semantic Web to Linked Data . . . 61

2 Semantic Resources . . . 64

2.1 Knowledge bases . . . 64

2.1.1 DBpedia . . . 65

2.1.2 YAGO . . . 66

2.1.3 Freebase . . . 66

2.1.4 Wikidata . . . 67

2.1.5 ConceptNet . . . 67

2.2 Ontologies . . . 70

2.2.1 Web Ontology Language . . . 71

2.2.2 Limits of ontologies . . . 72

2.3 Identifiers . . . 73

2.3.1 Uniform resource identifiers . . . 73

2.3.2 Identifiers and locators . . . 75

3 Enriching Content . . . 76

3.1 Terminology . . . 77

3.1.1 Terms and concepts . . . 78

3.1.2 Terms and entities . . . 80

3.2 Entity linking . . . 81

3.2.1 Wikification . . . 82

3.2.2 Semantic Annotation . . . 83

3.2.3 Knowledge Base Population . . . 84

3.3 Semantic relatedness . . . 85

III The Humanities and Empirical Content . . . 89

1.1 Deterministic and empirical data . . . 92

1.2 Crossover application domains . . . 93

1.3 Specificities of the humanities . . . 95

2 Digital Humanities . . . 96

2.1 Context . . . 96

2.1.1 From humanities computing to digital humanities . . . 97

2.1.2 The era of digitisation . . . 97

2.1.3 Information extraction for cultural heritage . . . 98

2.2 Close and distant reading . . . 100

2.2.1 Close reading and New Criticism . . . 100

2.2.2 Distant reading or the end of theory . . . 101

2.2.3 Reconciling the two approaches . . . 102

2.3 Critiques . . . 103

2.3.1 Over-interpretation . . . 103

2.3.2 The Hype cycle . . . 104

2.3.3 Picking the low-hanging fruit . . . 106

3 Historische Kranten . . . 107

3.1 Structure . . . 109

3.2 Linguistic distribution . . . 112

3.2.1 Hard-coded language tag . . . 112

3.2.2 Periodical titles . . . 113

3.2.3 Language detection . . . 114

3.3 People and needs . . . 118

3.3.1 Stakeholders . . . 118

3.3.2 Field survey . . . 119

3.3.3 Specifications . . . 121

IV Quality, Language, and Time . . . 123

1 Data Quality . . . 125

1.1 Fitness for use . . . 126

1.2 Optical character recognition . . . 128

1.3 Linked Open Data . . . 131

1.3.1 owl:sameAs and identity . . . 132

1.3.2 Quality of DBpedia . . . 134

2 Multilingualism . . . 137

2.1 Language-independent information extraction . . . 138

2.1.1 Multilingual NER . . . 140

2.1.2 Other cross-lingual applications . . . 143

2.3 Multilingual corpora . . . 146

3 Language Evolution . . . 147

3.1 The generative lexicon . . . 147

3.2 Stratified timescales . . . 148

3.2.1 Application to empirical databases . . . 149

3.2.2 Application to language evolution . . . 150

3.3 Concept drift . . . 151

3.3.1 Application to place names . . . 153

3.3.2 Emergence and salience of concepts . . . 154

V Knowledge Discovery . . . 157

1 MERCKX: A Knowledge Extractor . . . 159

1.1 Similar tools . . . 160

1.1.1 DBpedia Spotlight . . . 161

1.1.2 OpenCalais . . . 162

1.1.3 AlchemyAPI . . . 163

1.1.4 Stanford NER . . . 164

1.1.5 AIDA . . . 164

1.1.6 Zemanta . . . 165

1.1.7 Babelfy . . . 165

1.2 Components . . . 166

1.2.1 Python and NLTK . . . 167

1.2.2 X-Link . . . 168

1.2.3 DBpedia dump . . . 169

1.3 Workflow . . . 171

1.3.1 Download . . . 171

1.3.2 Dictionary . . . 172

1.3.3 Tokenisation, spotting, and annotation . . . 174

2 Evaluation . . . 175

2.1 Preliminary assessment . . . 176

2.1.1 Linguistic coverage . . . 176

2.1.2 SQuaRE analysis . . . 177

2.2 Methodology . . . 179

2.2.1 Objective . . . 179

2.2.2 Metrics . . . 180

2.2.3 Corpus . . . 182

2.3 Results and discussion . . . 185

2.3.1 Quantitative analysis . . . 186

2.3.2 Qualitative analysis . . . 187

3 Validation . . . 190

3.1 Beyond search engines . . . 190

3.2 Applications . . . 192

3.2.1 Search suggestions . . . 193

3.2.2 Related resources and data visualisation . . . 195

3.3 Generalisation . . . 197

3.3.1 Other languages . . . 198

3.3.2 Other domains . . . 200

3.3.3 Other entities . . . 203

Conclusions . . . 207

1 Overview . . . 209

2 Outcomes . . . 213

2.1 Main findings . . . 213

2.2 Limitations . . . 215

2.3 Operational recommendations . . . 216

3 Perspectives . . . 218

3.1 Implementation . . . 218

3.2 Extrinsic evaluation . . . 219

3.3 Other applications . . . 220

A Source Code . . . 223

B Guidelines for Annotators . . . 225

C Follow-up . . . 226

Detailed Contents . . . 227