Contents . . . i
Acknowledgements . . . iii
List of Figures . . . v
List of Tables . . . vii
List of Abbreviations . . . ix
Introduction . . . 1
1 Motivation . . . 3
2 Objectives . . . 7
2.1 Research questions . . . 8
2.2 Method . . . 10
2.3 Use case . . . 12
3 Structure . . . 15
I Information Extraction . . . 19
1 Background . . . 21
1.1 Natural language processing . . . 21
1.2 Information extraction . . . 24
1.3 Related fields . . . 26
2 Named-Entity Recognition . . . 29
2.1 Task definition . . . 29
2.2 Typologies . . . 31
2.2.1 Named-entity typology . . . 31
2.2.2 Types of NER systems . . . 34
2.3 Entity ambiguity and disambiguation . . . 35
2.3.1 Synonymy . . . 36
2.3.2 Homonymy and polysemy . . . 37
2.3.3 Metonymy . . . 38
2.3.4 Disambiguation . . . 38
3 Relations and Events . . . 40
3.1.1 Typology of relations . . . 42
3.1.2 Relation detection systems . . . 43
3.2 Event extraction and temporal analysis . . . 44
3.2.1 Event annotation . . . 46
3.2.2 Event extraction systems . . . 46
3.3 Template filling . . . 48
II Semantic Enrichment with Linked Data . . . 51
1 Making Sense of the Web . . . 53
1.1 The original vision . . . 54
1.2 Data structure and interoperability . . . 56
1.2.1 XML and RDF . . . 56
1.2.2 SPARQL . . . 59
1.2.3 SKOS . . . 61
1.3 From the Semantic Web to Linked Data . . . 61
2 Semantic Resources . . . 64
2.1 Knowledge bases . . . 64
2.1.1 DBpedia . . . 65
2.1.2 YAGO . . . 66
2.1.3 Freebase . . . 66
2.1.4 Wikidata . . . 67
2.1.5 ConceptNet . . . 67
2.2 Ontologies . . . 70
2.2.1 Web Ontology Language . . . 71
2.2.2 Limits of ontologies . . . 72
2.3 Identifiers . . . 73
2.3.1 Uniform resource identifiers . . . 73
2.3.2 Identifiers and locators . . . 75
3 Enriching Content . . . 76
3.1 Terminology . . . 77
3.1.1 Terms and concepts . . . 78
3.1.2 Terms and entities . . . 80
3.2 Entity linking . . . 81
3.2.1 Wikification . . . 82
3.2.2 Semantic Annotation . . . 83
3.2.3 Knowledge Base Population . . . 84
3.3 Semantic relatedness . . . 85
III The Humanities and Empirical Content . . . 89
1.1 Deterministic and empirical data . . . 92
1.2 Crossover application domains . . . 93
1.3 Specificities of the humanities . . . 95
2 Digital Humanities . . . 96
2.1 Context . . . 96
2.1.1 From humanities computing to digital humanities . . . 97
2.1.2 The era of digitisation . . . 97
2.1.3 Information extraction for cultural heritage . . . 98
2.2 Close and distant reading . . . 100
2.2.1 Close reading and New Criticism . . . 100
2.2.2 Distant reading or the end of theory . . . 101
2.2.3 Reconciling the two approaches . . . 102
2.3 Critiques . . . 103
2.3.1 Over-interpretation . . . 103
2.3.2 The Hype cycle . . . 104
2.3.3 Picking the low-hanging fruit . . . 106
3 Historische Kranten . . . 107
3.1 Structure . . . 109
3.2 Linguistic distribution . . . 112
3.2.1 Hard-coded language tag . . . 112
3.2.2 Periodical titles . . . 113
3.2.3 Language detection . . . 114
3.3 People and needs . . . 118
3.3.1 Stakeholders . . . 118
3.3.2 Field survey . . . 119
3.3.3 Specifications . . . 121
IV Quality, Language, and Time . . . 123
1 Data Quality . . . 125
1.1 Fitness for use . . . 126
1.2 Optical character recognition . . . 128
1.3 Linked Open Data . . . 131
1.3.1 owl:sameAs and identity . . . 132
1.3.2 Quality of DBpedia . . . 134
2 Multilingualism . . . 137
2.1 Language-independent information extraction . . . 138
2.1.1 Multilingual NER . . . 140
2.1.2 Other cross-lingual applications . . . 143
2.3 Multilingual corpora . . . 146
3 Language Evolution . . . 147
3.1 The generative lexicon . . . 147
3.2 Stratified timescales . . . 148
3.2.1 Application to empirical databases . . . 149
3.2.2 Application to language evolution . . . 150
3.3 Concept drift . . . 151
3.3.1 Application to place names . . . 153
3.3.2 Emergence and salience of concepts . . . 154
V Knowledge Discovery . . . 157
1 MERCKX: A Knowledge Extractor . . . 159
1.1 Similar tools . . . 160
1.1.1 DBpedia Spotlight . . . 161
1.1.2 OpenCalais . . . 162
1.1.3 AlchemyAPI . . . 163
1.1.4 Stanford NER . . . 164
1.1.5 AIDA . . . 164
1.1.6 Zemanta . . . 165
1.1.7 Babelfy . . . 165
1.2 Components . . . 166
1.2.1 Python and NLTK . . . 167
1.2.2 X-Link . . . 168
1.2.3 DBpedia dump . . . 169
1.3 Workflow . . . 171
1.3.1 Download . . . 171
1.3.2 Dictionary . . . 172
1.3.3 Tokenisation, spotting, and annotation . . . 174
2 Evaluation . . . 175
2.1 Preliminary assessment . . . 176
2.1.1 Linguistic coverage . . . 176
2.1.2 SQuaRE analysis . . . 177
2.2 Methodology . . . 179
2.2.1 Objective . . . 179
2.2.2 Metrics . . . 180
2.2.3 Corpus . . . 182
2.3 Results and discussion . . . 185
2.3.1 Quantitative analysis . . . 186
2.3.2 Qualitative analysis . . . 187
3 Validation . . . 190
3.1 Beyond search engines . . . 190
3.2 Applications . . . 192
3.2.1 Search suggestions . . . 193
3.2.2 Related resources and data visualisation . . . 195
3.3 Generalisation . . . 197
3.3.1 Other languages . . . 198
3.3.2 Other domains . . . 200
3.3.3 Other entities . . . 203
Conclusions . . . 207
1 Overview . . . 209
2 Outcomes . . . 213
2.1 Main findings . . . 213
2.2 Limitations . . . 215
2.3 Operational recommendations . . . 216
3 Perspectives . . . 218
3.1 Implementation . . . 218
3.2 Extrinsic evaluation . . . 219
3.3 Other applications . . . 220
A Source Code . . . 223
B Guidelines for Annotators . . . 225
C Follow-up . . . 226
Detailed Contents . . . 227