
Introducing spatial coverage in a semantic repository model


Thesis

Reference

Introducing spatial coverage in a semantic repository model

TARDY, Camille

Abstract

In this thesis, we propose a model for semantic digital libraries with a geospatial context and a definition of coverage as a key concept. We present the document and spatial resource models.

We define the annotation model and, more particularly, the geographic coverage, which details and defines the location of each resource taking its type into account. Finally, we present the query model and matching process, where the geospatial context is an essential feature. To validate this model, we develop some use cases and implementations. We first focus on annotating documents and precisely locating the documents within the spatial resource. To do so, we describe the implementation of the annotation model, presented in the digital library model, especially the alignment of geo-semantic knowledge resources. Then we present the methodology and implementation of a new technique to extract geographic information and place semantics from tags originating from volunteered geographic information (VGI) sources. This technique is based on a categorisation system, with a non-statistical knowledge-based approach. This extraction can partly automate the [...]

TARDY, Camille. Introducing spatial coverage in a semantic repository model. Doctoral thesis: Univ. Genève, 2017, no. GSEM 40

URN : urn:nbn:ch:unige-1106527

DOI : 10.13097/archive-ouverte/unige:110652

Available at:

http://archive-ouverte.unige.ch/unige:110652

Disclaimer: layout of this document may differ from the published version.


Introducing Spatial Coverage in a Semantic Repository Model

THESIS

presented to the Faculty of Economics and Management of the University of Geneva


by

Camille Tardy

Under the direction of Dr. Laurent Moccozet and Prof. Gilles Falquet

to obtain the title of

Docteur ès systèmes d’information

Jury members:

Prof. Dimitri Konstantas, University of Geneva, President of the jury

Prof. Javier Nogueras-Iso, University of Zaragoza, Spain

Prof. Giovanna Di Marzo Serugendo, University of Geneva

Dr. Claudine Métral, University of Geneva

Thesis no. 40

Geneva, 30 January 2017


The Faculty of Economics and Management, on the recommendation of the jury, has authorised the printing of this thesis, without thereby expressing any opinion on the propositions stated in it, which are the sole responsibility of their author.

Geneva, 30 January 2017

The Dean

Maria-Pia Victoria-Feser

Printed from the author's manuscript


Table of Contents

Table of Contents

List of Figures

Acronyms and Abbreviations

Résumé

Abstract

Acknowledgements

Chapter 1 Introduction

1.1 Collaboration and Resource Sharing within a Spatial Context

1.2 Motivation and Problem Statement

Chapter 2 State of the Art and Related Work

2.1 Geographic Information System and Geographic Information Retrieval Example

2.2 Indexing, Matching and Ranking

2.3 Summary table

2.4 Spatial Visualisation of Documents

Chapter 3 Model

3.1 The Repository Model

3.1.1 Document Model

3.1.2 Annotation Vocabulary Model

3.1.3 Spatial Resource Model

3.2 The Coverage Model

3.2.1 Coverage Definition

3.2.2 Spatial Coverage and Spatial Content

3.2.3 Coverage Model

3.3 Method

3.4 Query and Matching

3.4.1 Query Model

3.4.2 Matching Algorithm

3.5 Summary

Chapter 4 Use Cases for Associating Documents and City Objects

4.1 Building the Annotation Model

4.1.1 Method implementation

4.1.2 Building the Core: Implementation Choices

4.1.3 Aligning the Ontologies

4.2 Annotating the Spatial Resource

4.2.1 Identification of Spatial Objects

4.2.2 Sub-Object Identification

4.3 Resource Annotation Techniques

4.4 Locate Resources

4.4.1 Align Documents and Spatial Objects

4.4.2 Example

4.5 Use Case: Geographic Coverage in Web Search Queries

4.6 Summary

Chapter 5 Use Case: Creating a New Technique for Extracting Geographic Information from Tags

5.1 Method Presentation

5.2 Algorithm

5.2.1 Geo Process

5.2.2 Word Sense Process

5.2.3 Disambiguation & Tags Extraction

5.3 Implementation

5.4 Validation

5.5 Summary

Chapter 6 Conclusion and Future Work

6.1 Contribution

6.2 Future Work

Appendix: Techniques and tools mentioned in this thesis

References


List of Figures

Figure 1 Smart city infographic

Figure 2 Illustration of decision taking in an urban context

Figure 3 SPIRIT document search space

Figure 4 Example of GIPSY index output for a document

Figure 5 GeoTracker user interface

Figure 6 STEWARD user interface

Figure 7 PIV user interface

Figure 8 Geooreka architecture

Figure 9 GeoWorlds process example

Figure 10 VisGets user interface

Figure 11 World Explorer user interface

Figure 12 Harmony’s anchoring example

Figure 13 Arrigo Reloaded project example

Figure 14 Topos geospatial interface

Figure 15 Example of air quality model visualisation

Figure 16 CityGML composition

Figure 17 DELOS and DL.org domain concept map

Figure 18 DELOS and DL.org resource domain concept map

Figure 19 5S simple library structure

Figure 20 Repository basis bricks

Figure 21 Repository resources connections

Figure 22 Document model

Figure 23 Document annotation model

Figure 24 Annotation vocabulary facets

Figure 25 Annotation vocabulary model

Figure 26 Resource's coverage table

Figure 27 “Tank Man” near Tiananmen Square, by Jeff Widener, 1989

Figure 28 Algerian delegation arriving at the hotel for the signature of the “Accords d'Evian”, © AFP/STF 1962

Figure 29 Coverage model

Figure 30 Query model

Figure 31 Document relevance scale

Figure 32 Ontologies alignment example

Figure 33 Ontologies alignment schema

Figure 34 Example of identification

Figure 35 Example of storey identification

Figure 36 Link document-3D city model

Figure 37 IsGroupObject algorithm

Figure 38 Geneva international airport

Figure 39 University of Geneva “Uni Dufour” building

Figure 40 Ranking weight scale

Figure 41 CG and DCG for DuckDuckGo

Figure 42 CG and DCG for our ranking algorithm

Figure 43 Comparing CGs

Figure 44 Comparing DCGs

Figure 45 “legislation eau potable” DCGs full comparison

Figure 46 "formation continue informatique" DCGs comparison

Figure 47 Geographic coverage - ranking test result for region = France

Figure 48 Geographic coverage - ranking test result for region = Switzerland

Figure 49 Photo "VioleTT Pi" from Ludtz

Figure 50 Methodology global schema

Figure 51 Geo process

Figure 52 Word sense process

Figure 53 Disambiguation process

Figure 54 Extraction and selection

Figure 55 Methodology complete schema

Figure 56 Wikidata "concert" ancestors

Figure 57 Photo "Street #9" from Alonso Ormeño

Figure 58 Validation view of photo tags

Figure 59 Prototype photo result view

Figure 60 Photo "Où est-tu?" from Mr Brique

Figure 61 Photo "Linda Naeff exhibition @ Musée de Carouge" from Eric


Acronyms and Abbreviations

3DCM 3D City Model

CG Cumulated Gain

DC Dublin Core

DCG Discounted Cumulated Gain

DL Digital Library/ies

DLMS Digital Library Management System

EXIF Exchangeable Image File Format

GIR Geographic Information Retrieval

GIS Geographic Information System

IPTC International Press Telecommunications Council

IR Information Retrieval

LOD Level of Detail

MBR Minimum Bounding Rectangle

NER Named Entity Recognition

NERC Named Entity Recognition and Classification

OWL Web Ontology Language

OGC Open Geospatial Consortium

POI Point of Interest

POS Part-of-Speech

PPGIS Public Participation Geographic Information System

RDF Resource Description Framework

VGI Volunteered Geographic Information

W3C World Wide Web Consortium

WOEID Where on Earth Identifier

WS Word Sense

WSD Word Sense Disambiguation

XML Extensible Markup Language


Résumé

Decision-making is a universal task that often requires gathering heterogeneous data from different domains of activity. Digital libraries are well-known tools for storing, managing and handling heterogeneous data. They are often used to support and facilitate collaboration and decision-making tasks. To carry out these tasks, the users of these libraries should be presented with the necessary information in context, without having to process it in order to read it in an understandable way. All the relations between the resources contained in the digital library must appear clearly to users so that they can easily identify complementary resources.

Knowledge can vary temporally and spatially, so it is pertinent to take into account the spatio-temporal coverage (or scope) of the entities that make up the knowledge resources. In the context of digital libraries, knowledge resources such as ontologies or thesauri are used to annotate the resources managed by the library. The spatio-temporal coverage can therefore also be computed for the resources annotated by the knowledge resources.

In this thesis we propose a model of a semantic digital library with a geospatial context and a definition of coverage as a key concept. We present the document and spatial resource models.

We define the annotation model and, more particularly, the geographic coverage, which details and defines the location of each resource taking its type into account. Finally, we detail the query model and the matching process, in which the geospatial context is a key feature.

To validate this model, we develop use cases and partial implementations. To begin with, we focus on annotating documents and, specifically, on locating documents within a spatial resource. To do so, we describe the implementation of the annotation model, presented in the digital library model, and especially the alignment of geo-semantic knowledge resources.

We then present the methodology and implementation of a new technique for extracting geographic information and place semantics from tags originating from VGI (volunteered geographic information) sources.

This technique is based on a categorisation system with a non-statistical, knowledge-based approach. This extraction can partly automate the creation of geographic coverages for digital library resources, or be used to semantically enrich or complete 3D models or geographic services.


Abstract

Decision-making is a universal task that often calls for the gathering of heterogeneous data from different domains of activity. Digital libraries are a well-known tool for storing, managing and handling heterogeneous data. They are often used to support collaboration and decision-making tasks. In order to complete those tasks, digital library users should be presented with all the necessary information in context, without having to process it in order to understand it. All the relations between resources in the digital library should appear clearly, so that the user can easily identify complementary resources.

Knowledge can vary temporally and spatially, so it is pertinent to take into account the spatio-temporal coverage (or scope) of the entities that compose the knowledge resources. In the context of digital libraries, knowledge resources such as ontologies or thesauri are used to annotate the resources managed by the library. The spatio-temporal coverage can therefore also be computed for the resources annotated by the knowledge resources.

In this thesis, we propose a model for semantic digital libraries with a geospatial context and a definition of coverage as a key concept. We present the document and spatial resource models. We define the annotation model and, more particularly, the geographic coverage, which details and defines the location of each resource taking its type into account. Finally, we present the query model and matching process, where the geospatial context is an essential feature.

To validate this model, we develop some use cases and implementations. We first focus on annotating documents and precisely locating the documents within the spatial resource. To do so, we describe the implementation of the annotation model, presented in the digital library model, especially the alignment of geo-semantic knowledge resources.

Then we present the methodology and implementation of a new technique to extract geographic information and place semantics from tags originating from volunteered geographic information (VGI) sources. This technique is based on a categorisation system, with a non-statistical knowledge-based approach. This extraction can partly automate the definition of the geographic coverage for the digital library resources, or be used to semantically enhance or complete 3D models and geo services.


Acknowledgements

First of all, I would like to thank Philippe, without whom I would neither have started nor finished anything. Thank you for your support and your patience, day after day, throughout all these years.

Thanks also to my parents, my sister and my grandmother for their support and encouragement. Thanks to my uncle for his advice and our discussions.

Thanks to my colleagues for all the moments we shared, the discussions, the mutual help and the advice.

Thanks to the whole CUI team for their unfailing help; they make a researcher's life much easier. Special thanks to Marie-France Culebras, Nicolas Mayencourt, Daniel Aguillero and Elie Zagury.

Thanks to all the members of my jury for your advice and for reading and correcting this work: Professors Dimitri Konstantas, Giovanna Di Marzo Serugendo, Claudine Métral and especially Javier Nogueras-Iso of the University of Zaragoza.

I would also like to give special thanks to my supervisors, Laurent Moccozet and Gilles Falquet. Thank you for allowing me to take part in this adventure and for trusting me. Thank you for your guidance, your availability, our discussions, and for enabling me to get this far despite my doubts. Thank you both also for allowing me to broaden my horizons by taking part in very enriching projects: Laurent, for letting me collaborate with you on the e-learning projects, and Gilles, for our collaboration in continuing education.


Chapter 1

Introduction

1.1 Collaboration and Resource Sharing within a Spatial Context

Nowadays, more and more tasks rely on collaboration and, more importantly, on collaboration between different domains of activity. For example, in the urban or construction domain, different professions such as politicians, electricians, builders and solicitors must access the same resources in order to take a decision. Sharing knowledge has always been a key issue in research. With the democratisation of digital tools, more possibilities have emerged to solve this issue.

In a large-scale context, digital libraries (DL) are a renowned tool for storing and managing documents and resources. Candela et al. have defined the main concepts and foundation of digital libraries. They describe digital libraries “as a tool at the centre of intellectual activity having no logical, conceptual, physical, temporal, or personal borders or barriers to information” [1]. Today digital libraries are capable of handling a wide range of resources, and can manipulate them within complex processes. They also handle user management and collaboration through digital library management systems (DLMS). Many digital library systems (DLS) and infrastructures have been presented in recent research, such as BRICKS [2], some of them introducing semantics, as seen in the 5S framework [3] or Inspire [4]. An example of a cross-domain semantic DLS implementation can be seen in the Papyrus [5] project, which brings together the history and news domains in a news archive library.

Digital libraries can be enhanced with geographic information systems (GIS) to handle the storage and usage of spatial data. In the context of decision-making and knowledge sharing, GIS make it possible to spatially contextualise numerous types of information such as images, web pages, text documents, etc. The spatial contextualisation enables collaboration between different domains of activity.

The geographic axis is a transversal aspect unrelated to any domain but used by many. As a common ground for visualisation and browsing, the spatial context is often rendered in existing GIS as a cartographic 2D map of the world.

The combination of GIS and digital libraries is a pertinent solution for the design of a collaborative cross-domain tool that can manage heterogeneous resources with a spatial axis.


1.2 Motivation and Problem Statement

Nowadays, decision-making often involves many domains of activity and needs to handle heterogeneous corpora of resources. Decision-making is a key issue in our society, with concepts like smart cities that aim at finding ways to manage current challenges such as ecology, governance, technology, data accessibility, economy, and social implication. Albino et al. [6] present the different existing definitions of smart cities in the literature, and some existing projects. The authors identify the main characteristics of smart cities as: an infrastructure that enables social development and political efficiency; urban development through business or creative activity; the social inclusion of residents and the social capital of the city; and the environment as a strategic component for the future. Figure 1 shows the infographic representation of the application domains of smart cities, from the World Smart City¹ community built by ISO², ITU³ and IEC⁴, the three global standards institutions.

Figure 1 Smart city infographic

1 http://www.worldsmartcity.org

2 http://www.iso.org/iso/home.html

3 http://www.itu.int/en/Pages/default.aspx

4 http://www.iec.ch

In this context we have seen the emergence of initiatives to share data openly. In May 2016, the European Council voted for open access to all scientific papers by 2020⁵. The aim is that open access to scientific publications will lead to optimal re-use of research data, and that embedding open science in society will make science more responsive to societal and economic expectations. The open access policy supports the open science⁶ initiative.

More globally, the European Union has legislated on open data in Europe since 2003 and has offered an open data portal⁷ since 2012, giving free access to datasets from the EU institutions. The portal also lists a series of applications, third-party or not, that use the available datasets. Those applications cover a wide range of domains: space research, pharmaceuticals, marine biology, quality of life, law, integrity and corruption watch, etc. Each EU member state government also provides its own open data portal. The open data portal in the US was opened in 2009, following President Obama's signing of the “Memorandum on Transparency and Open Government”⁸. Open data platforms offer datasets in structured formats such as JSON, CSV, XML, KML or GeoJSON, or in unstructured formats such as PDF. The users of such systems must then process the raw data to make it readable, e.g. by turning a dataset into a chart or by integrating it into a map when the data contains explicit geographic information.

As we have seen, the data is freely available, through digital libraries, for anyone to use. Another aspect of sharing and openness is open consultation, where citizens are openly offered input into the decision-making process. For example, on the European Union platform, three open consultations are running⁹ at the time of writing on diverse European research programmes. In Europe, the British government is very active in this area, and at the time of writing offers 94 open consultations on diverse laws, regulations or other government decisions. In the same context, the IEEE¹⁰ organisation launched the smart city challenge in 2014, where users could propose solutions for any city in four domains: energy, communication infrastructure, traffic systems or buildings. For those consultations, the additional resources made available are always textual, such as PDF files or links to web pages.

Some cities and territories have also implemented GIS services for users to visualise geo-localised public information. For example, land registry services like the SITG¹¹ (Système d’Information du Territoire à Genève) in Geneva, Switzerland, or the IGN (Institut national de l’information géographique et forestière) in France with its geoportail¹², offer access to open data and proprietary data through a cartographic service. They process the datasets into cartographic layers and enable users to create their own maps by combining different layers. Both systems also provide a 3D view of their territory. However, those services are usually national services and so only offer certain datasets; not all of them are accessible to everyone, as some are reserved for a paid service.

5 http://www.consilium.europa.eu/en/meetings/compet/2016/05/26-27/

6 https://ec.europa.eu/digital-single-market/en/open-science

7 https://data.europa.eu/euodp/en/data

8 https://www.whitehouse.gov/the-press-office/transparency-and-open-government

9 https://ec.europa.eu/research/consultations/index.cfm?pg=list

10 https://www.ieee.org/index.html

11 http://ge.ch/sitg/

Complementary to the previously described systems, there are also systems called Public Participation GIS (PPGIS), as detailed in [7]. Those systems are geospatial tools that inform planning processes with public knowledge, as they ask users to provide geographic information about their perception of a place. They are well suited to involving users in decision-making processes such as the open consultations.

In the context of decision-making, users should be presented with all the necessary information in context, without having to process it in order to understand it. Furthermore, to continue the example of the open consultation, citizens should be presented with other legal texts related to the one currently discussed. The links and relations to other texts or complementary resources should be made clear in a way that contextualises the text currently examined by citizens, in order to simplify their decision-making.

In Figure 2, we depict an example of stakeholders and resources for the process of decision-making for a given task in an urban context. As we can see in our example, four stakeholders are involved: a citizen, a politician, an urban planner and an ecologist. In order to make their decision, they need information contained in three corpora: institutions, transportation and ecology. Each of them holds heterogeneous resources such as maps, text documents, images, videos, etc.

12 https://www.geoportail.gouv.fr


Figure 2 Illustration of decision taking in an urban context

This example clearly depicts the need for such systems to be able to handle multi-domain knowledge and heterogeneous types of resources. It is rare that a decision can be made efficiently by only taking into account the domain in question and not the surrounding implications. The stakeholders will also benefit from being able to filter those resources to keep the ones that are relevant for their task according to their geographic and temporal validity.

Knowledge can vary temporally and spatially; a term or its definition can be valid in a given geographic area and in a given temporal range. For example, the term “trunk” denotes a car boot in the U.S. whereas in the U.K. it denotes a piece of luggage. It is thus important to take this variation of scope in knowledge into account to be efficient. In digital libraries, a domain is represented by its knowledge in the form of ontologies or thesauri, for example. The handling of multiple domains in digital libraries can be done through the knowledge of each domain. When ontologies are used as the knowledge base, their alignment links the library’s domains together.
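To make the idea of scoped knowledge concrete, the following is a minimal Python sketch; it is not taken from the thesis. Each sense of a term carries the regions and the range of years in which it is assumed to be valid, and a lookup returns only the senses valid for a given place and date. The class `ScopedSense`, the function `senses_for`, the region codes and the year ranges are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ScopedSense:
    term: str
    definition: str
    regions: FrozenSet[str]  # hypothetical region codes where the sense is valid
    valid_from: int          # first year of validity (illustrative value)
    valid_to: int            # last year of validity (illustrative value)


def senses_for(senses: List[ScopedSense], term: str, region: str, year: int) -> List[str]:
    """Return the definitions of `term` that are valid in `region` during `year`."""
    return [
        s.definition
        for s in senses
        if s.term == term and region in s.regions and s.valid_from <= year <= s.valid_to
    ]


# Toy vocabulary built from the "trunk" example in the text; the years are made up.
VOCABULARY = [
    ScopedSense("trunk", "car boot", frozenset({"US"}), 1900, 2100),
    ScopedSense("trunk", "piece of luggage", frozenset({"UK"}), 1900, 2100),
]
```

Calling `senses_for(VOCABULARY, "trunk", "US", 2017)` keeps only the sense whose spatial and temporal scope contains the requested place and year.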

In this thesis, we use the urban domain as the context for our research and examples, as it relies on the spatial axis and is a multi-disciplinary domain.

As in our example in Figure 2, a multi-disciplinary domain implies a heterogeneous public that must be able to handle the interaction of different vocabularies and knowledge.

In this work, we seek to establish precise links between a document and city objects that are either directly referenced in the document or considered as relevant. We aim to be able to group those relevant city objects using a logic paradigm and semantic entities.


We propose a model of semantic digital library supported by GIS to answer the following research questions:

1. How to present, qualify and define the geo-spatial context of any resource: text, image, dataset, 3D models etc. in digital libraries?

Our contribution, presented in Chapter 3, is a proposed solution for this question. It defines a model of digital library with a semantic core, based on geographic semantics. We present the document and spatial resource model. We define the annotation model and more particularly the geographic coverage that details and defines the location of each resource taking into account its type.

Finally, we detail the query model and matching process, where the geo-spatial context is a key feature. They are both part of the information retrieval process, the former to define the interest of the search, and the latter to filter the available resources according to the query.

2. How to validate this model? How to use this model to localise documents in 3D scenes and to semantically enhance 3D models?

We developed different use cases and implementations to demonstrate the feasibility of our model. In Chapter 4, we focus on annotating documents and precisely locating the documents within the spatial resources. We describe the implementation of the annotation model and notably the ontologies alignment.

In Chapter 5, we present the methodology and implementation of a new technique to extract geographic information and place semantics from tags originating from volunteered geographic information (VGI) sources. This extraction can then be used to semantically enhance or complete 3D models and geo services.

In the following chapter, we present the state of research on GIS, geographic information retrieval, and the spatial visualisation of documents in 3D environments.


Chapter 2

State of the Art and Related Work

The geospatial and temporal contextualisation of information has emerged as an important research and development direction to increase the quality of search engines. Well-known web search engines such as Yahoo!¹³ or Google¹⁴ have already implemented techniques to take the user location into account when processing queries. In fact, there exist many systems that enable users to query a repository of documents according to spatial and/or temporal contexts. Those systems are also known as Geographic Information Systems (GIS) and Geographic Information Retrieval (GIR) systems. Jones et al. [8] briefly present the current issues and state of the GIR domain. They highlight the following key issues: the detection and disambiguation of geographic references, the interpretation of fuzzy geographic terminology, spatial and textual indexing, geographic relevance ranking and interfaces. In this work, we address the detection and disambiguation of geographic references and the interpretation of fuzzy geographic terminology.

GIR systems merge traditional information retrieval (IR) issues with those brought by the geographic and/or temporal dimension. The result ranking in GIR is not only influenced by the keyword search, but also by the match between the query and the geographic and temporal scope of the document. This brings forward the need to define a matching algorithm that aggregates the keyword matching result with the scope matching results. The aggregation must take into account the fact that the two results come from different scales, and so cannot simply be added together.
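One common way to combine scores that come from different scales is to rescale each of them to [0, 1] before taking a weighted sum. The sketch below illustrates that general idea only; it is not the matching algorithm defined later in this thesis, and the function names, the min-max normalisation and the weight `alpha` are assumptions made for the example.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them all to 0.0."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 0.0 for doc in scores}
    return {doc: (value - low) / (high - low) for doc, value in scores.items()}


def rank(keyword: Dict[str, float], scope: Dict[str, float], alpha: float = 0.5) -> List[str]:
    """Rank documents by a weighted sum of normalised keyword and scope scores.

    Both dictionaries are assumed to hold a score for the same set of documents.
    """
    kw, sc = min_max(keyword), min_max(scope)
    combined = {doc: alpha * kw[doc] + (1 - alpha) * sc[doc] for doc in keyword}
    return sorted(combined, key=combined.get, reverse=True)
```

With `alpha = 1.0` the ranking falls back to pure keyword matching, and with `alpha = 0.0` to pure scope matching, which makes the trade-off between the two criteria explicit.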

Traditional IR is based on the appearance of the query keywords in the document. However, the temporal and geographic scopes need to be extracted as annotation or metadata of the document. The identification of the scopes can be done in different ways. In the case of web resources, as cited in the SPIRIT project [9], the web scope is defined as “the geographic area the creator of the web resource intends to reach”. The authors propose to use the link/URL structure of the resource and describe two steps to identify the resource location.

13 http://local.yahoo.com

14 http://google.com

First, the Power value is the fraction of web pages from the location that contain a link to the resource. Power thus reflects the interest for the resource in a given location. Then, the Spread value measures the distribution of the Power for the resource. A location is in the scope of the resource if its Spread is over a given threshold and if the Spread value of the location's ancestor is below this threshold. For example, if for a given resource the Spread value is high for Geneva City but low for Geneva Canton, then only Geneva City should be considered in the scope of the resource. The authors also propose to extract the geographical scope from the resource content. Such methods can be applied to any type of resource and are detailed below.
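The Spread threshold rule described above can be sketched as follows. This is a hedged illustration rather than the SPIRIT implementation: the Spread values are taken as precomputed inputs, only the direct parent is checked as "the ancestor", equality with the threshold is treated as "below", and the function name `scope_locations` and the sample values are hypothetical.

```python
from typing import Dict, List, Optional


def scope_locations(
    spread: Dict[str, float],
    parent: Dict[str, Optional[str]],
    threshold: float,
) -> List[str]:
    """Keep locations whose Spread exceeds the threshold while their parent's does not."""
    selected = []
    for location, value in spread.items():
        if value <= threshold:
            continue  # Spread too low: the location is not in scope
        ancestor = parent.get(location)
        # A missing ancestor, or an ancestor without a Spread value, counts as below.
        if ancestor is None or spread.get(ancestor, 0.0) <= threshold:
            selected.append(location)
    return sorted(selected)


# Illustrative values only, mirroring the Geneva example in the text.
SPREAD = {"Geneva City": 0.8, "Geneva Canton": 0.3, "Switzerland": 0.1}
PARENT = {"Geneva City": "Geneva Canton", "Geneva Canton": "Switzerland"}
```

With these sample values and a threshold of 0.5, only Geneva City is retained, because the Spread of its parent, Geneva Canton, stays below the threshold.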

The simplest way to identify the geographic and temporal scopes is to extract geographic and/or temporal entities from the document metadata, as done in the GeoTracker, VisGets and World Explorer projects, or to extract the information from its content via identification techniques from the named entity recognition and classification (NERC) field [10], such as named entity recognition (NER) or part-of-speech (POS) tagging. NER can identify entities like persons, organisations, locations, expressions of time, etc., and POS tagging identifies grammatical and syntactic categories such as noun, verb, adjective, adverb, pronoun, etc. However, these techniques can be too vague on their own, as they gather all geographic/temporal entities mentioned in the document and not only those specific to its scope. A way to make up for this flaw is to combine them with manual annotation, or with ontologies, thesauri or databases built by experts. The entities extracted using NER are then filtered using the ontology or thesaurus, as seen in the SPIRIT, STEWARD and GIPSY projects. Lastly, NER can be combined with a spatial expression interpretation process to decode expressions such as “south of” or “adjacent”, as used in the GeoSem, PIV and SINAI projects.
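The filtering step, in which candidate entities are checked against an expert-built resource, can be illustrated with a small Python sketch. It makes strong simplifying assumptions: a regular expression over capitalised words stands in for a real NER component, a three-entry dictionary stands in for the ontology or thesaurus, and the names `GAZETTEER`, `candidate_entities` and `geographic_entities` are hypothetical.

```python
import re
from typing import Dict, List

# Toy stand-in for an expert-built thesaurus of place names (assumption).
GAZETTEER: Dict[str, str] = {
    "geneva": "city",
    "zurich": "city",
    "switzerland": "country",
}


def candidate_entities(text: str) -> List[str]:
    """Crude stand-in for NER: return runs of capitalised words."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)


def geographic_entities(text: str, gazetteer: Dict[str, str] = GAZETTEER) -> List[str]:
    """Keep only the candidates that the gazetteer knows as places, without duplicates."""
    found: List[str] = []
    for candidate in candidate_entities(text):
        if candidate.lower() in gazetteer and candidate not in found:
            found.append(candidate)
    return found
```

On the sentence "Laurent moved from Zurich to Geneva in March.", the stand-in recogniser proposes four capitalised candidates, and the gazetteer filter keeps only the two place names.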

The last important point is the presentation of the documents. In GIS, the geographic and/or temporal aspect must be explicit in the display of query results. Likewise, the interface for browsing and querying the repository should be adapted. A common way to make the geographic and/or temporal dimension explicit is to display the documents on 2D maps of the world according to their scope, as is done in the GeoTracker, STEWARD, PIV and GeoWorlds projects. Most systems use a fixed point in space; others, such as the PIV project, use a polygon as a visual footprint for the document. Another common way of displaying those aspects of the repository, either complementary or stand-alone, is through the querying interface. As seen in the World Explorer project and more particularly in the VisGet project, a time slider and an interactive map are used to define the query scope. The user navigates the map and zooms to display and select the geographic scope in the map window. The selected documents are then often simply listed and not integrated into the map.

However, other, less common visualisation techniques exist, such as the GIPSY index result. This project uses a 3D map canvas of the document-wide scope and highlights, through elevation peaks, the precise entities composing the document scope. An example is shown in Figure 4.

We present below, in more detail, the GIR systems and projects cited above.

2.1 Geographic Information System and Geographic Information Retrieval Example

The SPIRIT project [11], [12] proposes an information retrieval system that takes into account the geographic semantics of queries and documents. For instance, a query about towns in Switzerland should retrieve documents about Geneva, Zurich, etc. The system is based on geographic parsing and indexing of documents, which assigns a spatial footprint to each of them. The document footprint is computed from the place names extracted from the document content and matched with the geo-ontology. The user gives the query as text, in three parts: subject, place name and relation. The system then translates the place name and spatial relation into a geometric footprint for the query, based on a spatial ontology. The query-document matching process combines textual matching with geographical matching. The SPIRIT project introduces the notion of query footprint to indicate the portion of the map relevant to the user’s query. This work also deals extensively with the resolution of the frequent ambiguities that arise in the naming of geographic entities.

Figure 3 SPIRIT document search space

GIPSY [13], the Georeferenced Information Processing System, is a non-domain-specific system that geographically indexes full-text documents. This system does not provide a querying engine or browsing interface, as it is designed solely for indexing. Its algorithms determine coordinates from place names in the text, using a geographic thesaurus. The thesaurus contains 200,000 entries such as place names, feature types or land use types. The algorithm can interpret relations to place names such as “south of” or “adjacent”. The system returns a 3D grid of the general location, composed of a superposition of polygons, with peaks where areas are identified in the text. An example of the output, taken from [13], is shown in Figure 4, where it is labelled: “Surface plot produced from the State Water Project text which talks about Santa Barbara County, San Luis Obispo, and the Santa Ynez Valley area at some length”.

Figure 4 Example of GIPSY index output for a document.

The GeoSem system [14] extracts and interprets the spatial expressions in documents and document passages, and allows the query and ranking of those documents and passages. The spatial expressions are more complex than simple place names. The system can interpret spatial operations such as “north of”, and spatial expressions such as “all the canton of Switzerland”. The system was implemented using the LinguaStream15 platform [15] to perform the semantic analysis of the geographic expressions. GeoSem returns extracts of documents in response to a geographic query, as each extract can be assigned a specific footprint.

GeoTracker [16] is a middleware system for RSS feed aggregators and browsers. It enables users to query along a geographic and a temporal axis by presenting the feed items on a world map. There is no query, properly speaking; instead, each user defines a profile with their interests. The authors assume that the location information for each RSS feed item is explicitly given in the item. If several locations are present in the item, each receives a pin linking to the item. The output is shown in Figure 5, as presented in [16].

15 Available at: http://users.info.unicaen.fr/fbilhaut/linguastream.htm


Figure 5 GeoTracker user interface

STEWARD [17] is a spatio-textual search engine for unstructured text documents, particularly web pages. Unlike other similar search engines, it does not assign the same scope to web pages according to their link structure, but finds the references to geographic locations in every document and registers them as the document's scope. Each georeference registered in the scope is assigned a weight.

The system uses a hybrid approach of part-of-speech and named-entity recognition techniques to identify the georeferences in the document. The process also contains a semantic disambiguation algorithm. The user can query the system using both location and keywords. An example of STEWARD user interface is shown in Figure 6.

Figure 6 STEWARD user interface


In the cultural heritage domain, the PIV project [18] proposes a repository of spatio-temporalised contents that gathers heterogeneous types of resources representing human modes of expression. The system builds a semantic tag system to associate direct and indirect locations with the documents, as well as their evolution in time or temporal references. Only one location is kept for each document scope. Gaio et al. have developed a process to interpret spatial expressions and translate them into a polygon. The PIV project also uses the LinguaStream platform [15] for its geo-semantic processing of textual data. Gaio et al. have developed a query engine that allows users to define a location of interest when creating a query. The system only manages geographic content, so it does not include content and domain search and relies on other library management systems to do so.

Figure 7 PIV user interface

Geooreka [19] is a web search engine integrated with a GIS database. Users select an area on a map to query the system. The system uses the zoom level chosen by the user to determine the type of place to use as the query scope (for example country, region or city). The higher the zoom level, the fewer toponyms are selected. The selected toponyms are then extracted and associated with the query keywords as pairs (Theme - Toponym). The pairs are then filtered and compared to each other by computing a probability weight. For the 20 best-scored pairs, the system computes a Borda count [20] to determine which pair will be used to query the web search engine, Google or Yahoo!.
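To make the last step concrete, the following is a minimal sketch of a Borda count. The source does not state which rankings Geooreka aggregates, so the three rankings and the pair names below are purely hypothetical; only the Borda rule itself (a candidate ranked at position p among n candidates receives n - 1 - p points) is standard.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def borda_count(rankings: Sequence[List[str]]) -> Dict[str, int]:
    """Sum Borda points over several rankings of the same candidates.
    In a ranking of n candidates, the first earns n - 1 points and the
    last earns 0."""
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)


# Hypothetical rankings of (Theme - Toponym) pairs by three criteria.
rankings = [
    ["museum-Geneva", "museum-Lausanne", "museum-Bern"],
    ["museum-Lausanne", "museum-Geneva", "museum-Bern"],
    ["museum-Geneva", "museum-Bern", "museum-Lausanne"],
]

scores = borda_count(rankings)
best_pair = max(scores, key=scores.get)
print(scores, best_pair)
```

Here the pair with the highest total is the one that would be sent to the web search engine.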


Figure 8 Geooreka architecture

The SINAI GIR system [21] uses GATE [22] to detect geo-entities via NER, which are then verified using Geonames. To complement the NER results, the SINAI system includes a process to detect and recognise topological spatial relationships. It uses Lemur16 as an IR engine and to build a document index. The IR results are re-ranked using this document index and a geographic index built using GATE NER and validated by Geonames. By interpreting the spatial expressions and the types of the locations in the query, the filtering process calculates the coordinates of the corresponding bounding box and filters the pre-selected documents.

GeoWorlds [23] is a collaborative GIR system. It allows users to select a geographic region on a GIS display and returns to the user a list of documents associated with the chosen region. Inversely, once a document is selected, the region attached to it is highlighted in the GIS. The documents are harvested from information spaces maintained by specialised groups and data warehouses.

The document manager module of GeoWorlds allows users to organise and annotate the documents. GeoWorlds focuses on the disaster domain and provides a data analysis module. Figure 9 shows the process of generating an analysis report with GeoWorlds.

16 https://www.lemurproject.org/lemur.php


Figure 9 GeoWorlds process example

The VisGet project [24] serves to browse news items from RSS feeds. The system is based on three axes: temporal, geographic and topic. To add geographic information to an RSS item that does not specify any, the project uses location information from the textual part of the item and identifies the corresponding coordinates with Geonames. The time dimension of the query is defined using a bar chart. Its spatial dimension is defined by zooming on a 2D map of the world. Finally, the topic dimension is defined using a tag cloud. The query dimensions are displayed in Figure 10.

Figure 10 VisGets user interface

The World Explorer project [25] allows users to geographically query the Flickr photo database, as shown in Figure 11. A visualisation tool shows the highest-scored tags on a world map according to the zoom level. The tags are chosen to identify a region according to a statistical process that takes into consideration the number of unique tag creators as well as tag repetition. A user can then retrieve the pictures related to a tag and a place. This project uses localisation and geographic information about concepts to retrieve geo-localised resources. Users can only browse the map; they cannot input search keywords.

Figure 11 World Explorer user interface

As we have seen, GISs are used to contextualise resources and information spatially. In GIS, the geocoding process essentially consists, as described for the systems presented above, in finding and extracting place names, geo entities or simply coordinates from textual data, such as street addresses or building names in the document content. The geographic footprint of the resource is then constructed by grouping the entities found. The problem we wish to address here is similar, but instead of finding place names in a document, we seek to establish precise links between a document and city objects that are either directly referenced in the document or considered relevant. We aim to group those relevant city objects using a logic paradigm and semantic entities.

Each of the previously presented systems is either centred on a specific domain such as history or news, or not related to any domain. In this research we aim at enabling semantic cross-domain collaboration, to allow users from different expertise domains to actively and efficiently share knowledge. We will present in the following chapters a multi-domain integration model through ontology alignment.

Finally, unlike most of the systems presented here, which handle a single document format, we propose a model that can handle heterogeneous documents.


2.2 Indexing, Matching and Ranking

In this section we review in more detail the indexing, matching and ranking for the systems presented in section 2.1.

The SPIRIT project runs two indexes: a textual index that processes both spatial and non-spatial terms, and a spatio-textual index that combines text and spatial indexing according to the document's footprint. In the first index, the matching depends on an exact match between the query terms and the document, whereas the second index uses the geometric footprint for matching. The text relevance is based on the BM25 algorithm [26], and the spatial relevance on the distance between the query and document footprints. The two relevance scores are combined to form the final relevance score. The ranking is done using the geo-ontology to retrieve the geometric footprints of places and comparing them to the geometry of the query footprint.
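As an illustration of how a text score and a spatial score can be combined, the sketch below uses assumptions that are not taken from SPIRIT: footprints are reduced to centroids, the distance is mapped to a score by an exponential decay with a hypothetical `scale` parameter, the text score is assumed to be already normalised to [0, 1], and the combination is a weighted sum with a hypothetical weight `alpha`.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def spatial_score(query_centroid: Point, doc_centroid: Point,
                  scale: float = 50.0) -> float:
    """Map the Euclidean distance between two footprint centroids to a
    score in (0, 1]; a smaller distance gives a higher score."""
    distance = math.dist(query_centroid, doc_centroid)
    return math.exp(-distance / scale)


def combined_score(text_score: float, spatial: float,
                   alpha: float = 0.5) -> float:
    """Weighted sum of a normalised text score and a spatial score."""
    return alpha * text_score + (1.0 - alpha) * spatial


query = (0.0, 0.0)
near_doc = combined_score(0.6, spatial_score(query, (3.0, 4.0)))
far_doc = combined_score(0.6, spatial_score(query, (300.0, 400.0)))
print(near_doc > far_doc)
```

With equal text scores, the document whose footprint is closer to the query footprint is ranked first.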

The GeoSem project uses a linguistic analyser of spatial expressions and a text analysis tool to extract geographic locations. It then translates the locations found into a semantic representation, used as an index for the documents and document passages. The relevance is calculated depending on quantification, when the query mentions a quantity concept, or on granularity. The quantification relevance weight is computed as a proportion: for instance, “a quarter” corresponds to a relevance of 25%. The granularity relevance is computed using the hierarchy between the geo entities of the query and those of the document index.

The STEWARD project handles web documents. The indexing process stores each document in a database with its URL, its metadata, and the ASCII and HTML versions of the document. The geo-locations of the resource are extracted and selected using the TF-IDF statistic along with NER and POS techniques, and compared with a geodatabase to differentiate the geo entities from the other extracted entities.

The ranking is calculated according to the frequency and distribution of the keywords and references to geolocation in the document.

The PIV project gathers the extracted spatial features in an index. Each feature is stored with its name, its interpretation and its geometric shape. The geometric shape of each spatial feature is recovered using GISs, as a point for a building, a line for a road, etc. Its shape is then simplified to a minimum bounding rectangle (MBR). The authors developed an algorithm that determines how to transform the MBR of a spatial feature so that it reflects the spatial relations found in the document. The retrieval process selects the documents or paragraphs to return according to the result of mapping the geo features of the query to the geo scope of the document. There is no ranking of the selected documents or paragraphs.
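The MBR simplification can be sketched as follows. This is only an illustration, not the PIV algorithm: the tuple layout of the rectangle, the helper names and the use of a plain intersection test as the query-document mapping are assumptions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
MBR = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def minimum_bounding_rectangle(points: Sequence[Point]) -> MBR:
    """Smallest axis-aligned rectangle containing all the points of a
    shape (a single point, the vertices of a line or of a polygon)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: MBR, b: MBR) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical road stored as a line, and two hypothetical query areas.
road = [(0.0, 0.0), (4.0, 1.0), (6.0, 3.0)]
road_mbr = minimum_bounding_rectangle(road)

print(road_mbr)
print(intersects(road_mbr, (5.0, 2.0, 8.0, 8.0)))
print(intersects(road_mbr, (7.0, 4.0, 9.0, 9.0)))
```

The MBR trades precision for a very cheap overlap test, which is why it is a common simplification for spatial indexes.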


The SINAI project uses GATE to detect geo entities in documents, and validates the identification with Geonames and manual rules for spatial relationship identification. The project uses the Lemur system as an information retrieval engine. The project holds two indexes: a document index and a geographic index. The document index stores the stems of the words of each document. A stem is the root or roots of a word, together with any derivational affixes, to which inflectional affixes are added. The geographic index stores the list of all locations detected in the collection. The documents returned by the Lemur engine are re-ranked according to filtering rules. The new rank is influenced by the weight of the corresponding filtering rule.

The GeoWorlds project returns results imported from web search engines. The results are indexed in a table with their URL, their source and their title. Rows are sorted according to the search engine ranking. The system computes two classifications: a textual classification and a place name classification. The textual classification is done using keyword extraction. The place name classification is done by extracting place names from the document and comparing them with those extracted from the map (the query). The final ranking is computed as a cross product of the two classifications.

The extraction process of the VisGet project assumes that each RSS item contains a title, a description, tags, a date and time of publication, and a geo-location. If no location is explicit, the system uses the Geonames web service to extract geographic information from the RSS item and transforms it into GeoRSS. There is no ranking in the result display, and each RSS item is pinned to the map according to its location.

The World Explorer project uses photos, users and tags as its dataset. The tags are indexed for each photo in the dataset and are associated with the coordinates of the picture they were extracted from. The world is then divided into tiles at different granularities (zoom levels). For each tile, the system retrieves a geographic cluster of photos and their tags. Each tag receives a score within each cluster. The score combines term frequency, using TF-IDF with the number of times the tag is used in the cluster, and user frequency, that is, the percentage of photographers in the cluster who use the tag.

If the tag’s score is above a given threshold, the tag is selected to appear in the corresponding tile. The user queries the system by zooming on the map. The system retrieves between 1 and 4 tiles that fit the display area and shows the corresponding tag cloud on the map.
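The scoring and selection steps can be sketched as below. The exact formula used by World Explorer is not given here, so this sketch makes several labelled assumptions: the score is the product of the tag count in the cluster, a smoothed inverse cluster frequency, and the fraction of photographers using the tag; the data, the names and the threshold are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Photo = Tuple[str, List[str]]  # (photographer id, tags of the photo)


def tag_scores(cluster: List[Photo],
               all_clusters: List[List[Photo]]) -> Dict[str, float]:
    """Score each tag of a cluster as tf * idf * user_frequency
    (assumed combination, with a smoothed idf)."""
    users = {user for user, _ in cluster}
    tags = {tag for _, photo_tags in cluster for tag in photo_tags}
    scores = {}
    for tag in tags:
        tf = sum(photo_tags.count(tag) for _, photo_tags in cluster)
        df = sum(1 for c in all_clusters
                 if any(tag in photo_tags for _, photo_tags in c))
        idf = math.log((1 + len(all_clusters)) / (1 + df)) + 1.0
        tag_users = {user for user, photo_tags in cluster if tag in photo_tags}
        scores[tag] = tf * idf * len(tag_users) / len(users)
    return scores


cluster_a = [("u1", ["jetdeau", "lake"]), ("u2", ["jetdeau"]),
             ("u1", ["lake", "holiday"])]
cluster_b = [("u3", ["holiday", "beach"]), ("u4", ["holiday"])]

scores = tag_scores(cluster_a, [cluster_a, cluster_b])
threshold = 1.0  # hypothetical display threshold
displayed = sorted(tag for tag, s in scores.items() if s > threshold)
print(displayed)
```

A tag used by a single prolific photographer, or one that is common to every cluster, is penalised, so that only tags characteristic of the tile are displayed.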

The GIPSY project only handles the indexing process. The system extracts geo-locations from the resource content and retrieves their corresponding coordinates from a thesaurus. The system generates a matrix as the index of the resource. The matrix represents the geographic scope of the document as a 3D grid. Each extracted geo-location appears as a peak on the grid according to its coordinates.
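A much simplified sketch of such a grid index is given below. GIPSY superposes polygons, whereas this sketch assumes that the thesaurus returns a single point per place and simply counts mentions per grid cell; the function name, the thesaurus content (approximate, illustrative coordinates) and the grid size are hypothetical.

```python
from typing import Dict, List, Tuple

Bounds = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def build_scope_grid(mentions: List[str],
                     thesaurus: Dict[str, Tuple[float, float]],
                     bounds: Bounds, size: int = 4) -> List[List[int]]:
    """Accumulate place-name mentions in a size x size grid; the cell
    value plays the role of the elevation of a peak."""
    min_lon, min_lat, max_lon, max_lat = bounds
    grid = [[0] * size for _ in range(size)]
    for name in mentions:
        if name not in thesaurus:
            continue  # unknown place names are ignored
        lon, lat = thesaurus[name]
        if not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            continue  # outside the indexed area
        col = min(int((lon - min_lon) / (max_lon - min_lon) * size), size - 1)
        row = min(int((lat - min_lat) / (max_lat - min_lat) * size), size - 1)
        grid[row][col] += 1
    return grid


# Approximate, illustrative coordinates (longitude, latitude).
thesaurus = {"Santa Barbara": (-119.70, 34.42),
             "San Luis Obispo": (-120.66, 35.28)}
mentions = ["Santa Barbara", "San Luis Obispo", "Santa Barbara", "Atlantis"]

grid = build_scope_grid(mentions, thesaurus, (-121.0, 34.0, -119.0, 36.0))
for row in grid:
    print(row)
```

The highest cell corresponds to the place mentioned most often, which is the intuition behind the peaks of the surface plot.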

The Geooreka project did not involve ranking or information retrieval processes.

2.3 Summary table

Below is a summary table of all the previously presented systems.

| System | Geo tools | Indexing, matching and ranking process | Document format handling |
|---|---|---|---|
| SPIRIT [11,12] | Spatial ontology | Text and spatial indexing according to the document footprint. Two relevance scores (text and spatial) that are combined for the final ranking. Text matching is based on exact match. Spatial matching is based on inclusion according to a spatial ontology. | Web pages with footprints |
| GIPSY [13] | Geographic thesaurus | No query engine, the system handles the geographic indexing only. Extracts the place names in the text and determines the coordinates using the geographic thesaurus. | Text documents |
| GEOSEM SYSTEM [14] | LinguaStream, linguistic analyser of spatial expressions. | Extracts spatial expressions from the documents, to index them. The system can be queried using geographic queries. Relevance is calculated using quantification. | Text documents |
| GEO TRACKER [16] | Embedded map for the user to query the system. | Middleware system that allows querying on spatial and temporal axes. Compares the location information embedded in the RSS items to the map selected by the user. | RSS feed |
| STEWARD [17] | Semantic disambiguation algorithm. Geo database | Spatio-textual search engine. Generates the document scope from the body of the document using POS and NER. Each scope gets assigned a weight. The ranking is computed taking into account the frequency and distribution of the keywords. | Unstructured text documents (Web pages) |
| PIV [18] | LinguaStream | Builds a semantic tag system. Translates the document scope into an MBR. Retrieval process returns the matching between the document scope and the query geo features. There is no ranking of the results. | Heterogeneous type of resources |
| GEOOREKA [19] | GIS database | The query scope is determined by the zoom level set by the user. The higher the zoom, the fewer the toponyms selected. The system associates the toponyms with the selected theme as pairs to query web search engines. Only the best pair (determined using a Borda count) is used for the final query. | No resources are directly handled. |
| SINAI GIR [21] | GATE to detect geo-entities, verified by Geonames. Lemur system as an IR engine. | Each document gets a geographical and a document index. The system interprets spatial expressions in the query to determine the query bounding box used to filter the pre-selected documents. | Text documents |
| GEO WORLDS [23] | GIS display | Users can annotate and organise the documents. The system generates two indexes for each document using keyword extraction: textual and place names. The ranking is the result of a cross product using the two indexes. | Documents are web pages harvested from information spaces. |
| VISGET [24] | Geonames | Defines the geographic scope of an item by extracting place names from its content and identifying them with Geonames. There is no ranking of the results. | News items in RSS feeds |
| WORLD EXPLORER [25] | Yahoo Where On Earth ID (WOEID) | Tags are indexed on a map using the photo coordinates. The tags identify a region using a statistical approach, taking into account the number of unique tag creators and the tag repetition. Users can retrieve photos using the tags on the map. The results are not ranked. | Photos and tags from Flickr |


Most of the presented projects use as their index a kind of geographic database that contains all the locations associated with a document. The retrieval process is then either an exact match or a test for inclusion in the query footprint. The query footprint takes a textual or a geometric form, i.e. a polygon of coordinates.

We propose a geographic indexing and geographic ranking process based on the semantics and geography of places. We want to provide a finer querying mechanism that supports more complex geographic queries than the inclusion test against a footprint polygon used by most of the presented systems. To do so, we introduce a query and indexing engine based on a semantic tool that combines a geo gazetteer and a knowledge base of building and place semantics.

2.4 Spatial Visualisation of Documents

As we have seen in the previously described projects, the most common way of displaying the geographic dimension of a repository is through 2D maps. However, the most natural visualisation environment is a 3D, or at least 2.5D, environment, as the world is three-dimensional. 3D environments have been used to organise documents, as in the Bead system [27], where Chalmers builds a 3D landscape from the similarities and dissimilarities of documents in a corpus. The assumption is that a retrieval task is better achieved if the relationships within a corpus are visible.

Annotation, and more precisely the placement of resources, in 2D scenes is done using single points in space or flat polygons on the map, whereas within a 3D scene a resource can be attached to 3D objects or parts of objects. There is substantial research in the area of annotation and data integration for 3D models. We show through the examples presented below the use of 3D models, and particularly 3D city models, as tools for data visualisation and interpretation.

The Harmony project [28] proposes an anchoring solution to link a document repository to objects in a 3D scene.


Figure 12 Harmony’s anchoring example

More precisely, research on 3D annotation has brought forward ways of linking heterogeneous and dynamic information to 3D objects. Havemann et al. [29] describe a markup method which attaches information to parts of 3D models. This makes it possible to attach hyperlinks to parts of 3D objects and so link them to web documents, as shown in Figure 13.

Figure 13 Arrigo Reloaded project example

If we focus on the geographic domain, and more precisely the urban domain, we can find many projects implementing annotation and data integration in 3D urban models. The Topos 3D spatial hypermedia system [30] provides a geospatial interface, as shown in Figure 14, for the exchange and organisation of information and for facilitating collaboration among users. The example scenario given is the on-site collaboration of the parties involved in a building construction. Topos also integrates a GPS to enable the matching of the real site and its 3D representation. This work combines spatial hypermedia, GIS and a collaborative virtual environment.


Figure 14 Topos geospatial interface

The use of 3D city models to facilitate the interpretation of data has led to a visualisation tool that displays air quality data in a collaborative environment [31]. This system allows the comparison of different scenarios to facilitate collaboration. An example of the data visualisation is depicted in Figure 15. A similar system that generates 3D noise calculations and simulations within a 3D city model can be seen in [31]. This system visualises the noise impact in the city from the street level up to the building roofs. Such a system makes it possible to simulate how urban modifications affect the noise level. Metral et al. [33] developed an ontology of 3D visualisation techniques in 3D city models. Depending on the dataset format, and with the use of the ontology, systems will be able to select the most appropriate visualisation technique automatically.


Figure 15 Example of air quality model visualisation

In the catalogue of existing 3D modelling languages, 3D semantic formats have been created to semantically enhance 3D representations. They integrate, within the language, descriptive semantics of the object or the domain. In the context of urbanism, 3D languages are used to model cities or neighbourhoods. In the context of GIS or urban GIS, the user environment and/or the browsing interface of the repository can be built using a 3D modelling language to create a 3D view of the repository scope.

In this context, the most pertinent language is CityGML, which is used to describe 3D city models. Kolbe describes the CityGML data model adopted by the Open Geospatial Consortium (OGC) [34]. This XML-based language adds semantics to the 3D shape and texture and so represents the following aspects of a city model: semantics, geometry, topology and appearance. The language can represent five levels of detail (LOD). At the highest level, each building part, such as windows and doors, as well as indoor details can be represented. It also enables the definition of groups and parts of buildings. The semantics describe the most important geographic features: buildings, water bodies, vegetation, city furniture and land use, as described in Figure 16. For example, the building semantics hold information on its function and its usage. Finally, this language allows the description of the 3D spatial properties and interrelationships of the objects. It is a complete language and is now the international standard for representing, storing and exchanging 3D urban objects.
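To illustrate how the building semantics can be read from such a model, the sketch below parses a hand-written fragment with the Python standard library. It assumes the CityGML 2.0 namespaces; the fragment is not a complete, schema-valid model, and the identifier and the plain-text values of `bldg:function` and `bldg:usage` are hypothetical (real datasets usually use code-list values).

```python
import xml.etree.ElementTree as ET

# Namespaces assumed to be those of CityGML 2.0.
NS = {
    "core": "http://www.opengis.net/citygml/2.0",
    "bldg": "http://www.opengis.net/citygml/building/2.0",
    "gml": "http://www.opengis.net/gml",
}

SNIPPET = """<core:CityModel
    xmlns:core="http://www.opengis.net/citygml/2.0"
    xmlns:bldg="http://www.opengis.net/citygml/building/2.0"
    xmlns:gml="http://www.opengis.net/gml">
  <core:cityObjectMember>
    <bldg:Building gml:id="bldg_hospital_1">
      <bldg:function>hospital</bldg:function>
      <bldg:usage>health care</bldg:usage>
    </bldg:Building>
  </core:cityObjectMember>
</core:CityModel>"""

root = ET.fromstring(SNIPPET)
buildings = []
for building in root.iterfind(".//bldg:Building", NS):
    buildings.append({
        # gml:id is a namespaced attribute, hence the {uri}name form.
        "id": building.get("{http://www.opengis.net/gml}id"),
        "function": building.findtext("bldg:function", namespaces=NS),
        "usage": building.findtext("bldg:usage", namespaces=NS),
    })

print(buildings)
```

Each building is thus addressable as an object with its own identifier and semantics, which is what makes it usable as an anchor for documents.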


Figure 16 CityGML composition

Many models are already available in CityGML format. Goetz presents a method to automatically generate CityGML models at the high levels of detail (LOD) 3 and 4, level 4 being the highest available LOD and including the interior definition [35]. The automated generation uses crowd-sourced data from OpenStreetMap17 (OSM) and the proposed feature service IndoorOSM [36], [37].

Goetz et al. present in [38] the automated generation of CityGML models from OSM for LOD 1 and 2.

As seen here, most GISs and GIR systems use 2D cartography as their browsing and querying interface. However, a 3D environment offers more natural and more precise interaction to users, as a finer level of detail can be achieved. For example, using a CityGML model, each part of a building can be accessed as an object: a particular storey, door or window.

We aim to build on the advances in 3D annotation seen in this section by using 3D city models as the browsing and querying interface. The documents of the corpus will be attached, according to our definition of geographic scope, to the objects or parts of objects in the city model. Users will then be able to query or browse documents by navigating in the city model, and to clearly visualise the geographic scopes of the documents.

17 https://www.openstreetmap.org


Chapter 3

Model

Many frameworks and models for digital libraries exist, such as DELOS [39] and DL.org [40], the 5S framework [3], [41], and domain-oriented models like Inspire [4] for high-energy physics and BRICKS [2] for cultural heritage.

With DELOS, and later DL.org as an enhancement of the DELOS model, the authors propose a reference model for digital libraries that results from the work of European researchers on digital libraries. DELOS defines digital libraries as a three-tier framework composed of: a digital library management system (DLMS), which provides the generic infrastructure and produces and administers the digital library system (DLS); a DLS, which is the system that provides the functionality needed by the digital library; and the digital library itself, which manages the content and provides the functionalities to the users. DELOS defines the domains that compose the digital library, as shown in Figure 17, and more precisely the resource model in Figure 18. A DL domain is composed of resources. Resources are, among others, annotated, described and expressed using information objects in all their forms, such as documents, metadata, images, annotations, queries and result sets. A resource can be part of another resource and can be grouped in a set of resources.

Figure 17 DELOS and DL.org domain concept map


Figure 18 DELOS and DL.org resource domain concept map

The 5S framework divides a DL into five layers: Streams, Structures, Spaces, Scenarios and Societies, as depicted in Figure 19 [42]. Streams are static or dynamic sequences of elements of any type, for example video delivered to users or document viewing. Structures specify how parts are organised, for example user relationships or taxonomies. Spaces represent sets of objects along with the related operations, such as a document space. Scenarios are sequences of events that involve actions altering a computation and influencing future events, such as workflows or dataflows.

Figure 19 5S simple library structure


Our research is based on the concepts and domains highlighted by both the DELOS and 5S models. However, neither of those generic models provides a geographic axis for the digital library. To fill this gap, we introduce in this chapter a high-level repository model that includes a transversal geographic axis.

This repository is based on four elements, as depicted in Figure 20. Those elements are the three repository resources and a transversal notion of spatiotemporal coverage. Our notion of coverage is defined as follows: the geographic and temporal context of a resource refers to the spatial and temporal regions in which the resource is true or must be true, in the real world or in a fiction. We use here both meanings of “true”: something that is correct or accurate, and something that is real or genuine. The coverage model is detailed in section 3.2.

The three repository resources are defined below.

Figure 20 Repository basis bricks

3.1 The Repository Model

The repository is composed of three repository resources: a spatial resource, documents and the annotation vocabulary. As pictured in Figure 21, they are linked as follows: the annotation vocabulary annotates the documents; it also identifies the city objects from the spatial resource; the documents are located in the spatial resource. The following examples describe the links between the repository resources.

• annotates(d1, [“health”, “hospital”, “medicine”, …]); where d1 is a document and the concepts are taken from the annotation vocabulary.

• identifies(obj1, HUG); where obj1 is an object within the spatial resource and HUG is an instance that represents the public hospital in Geneva in the geographic ontology that composes the spatial vocabulary.


• coverage(d1, obj1); the document d1 is associated with obj1, the object is defined as its spatial coverage.
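The three links can be pictured with a small in-memory sketch. It is only an illustration of the relations listed above, not the repository implementation: the class, its dictionary-based storage and the helper `documents_located_at` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Repository:
    """Toy store for the three links between repository resources."""
    annotations: Dict[str, Set[str]] = field(default_factory=dict)  # document -> concepts
    identities: Dict[str, str] = field(default_factory=dict)        # city object -> ontology instance
    coverages: Dict[str, Set[str]] = field(default_factory=dict)    # document -> city objects

    def annotates(self, doc: str, concepts: Iterable[str]) -> None:
        self.annotations.setdefault(doc, set()).update(concepts)

    def identifies(self, obj: str, instance: str) -> None:
        self.identities[obj] = instance

    def coverage(self, doc: str, obj: str) -> None:
        self.coverages.setdefault(doc, set()).add(obj)

    def documents_located_at(self, instance: str) -> Set[str]:
        """Documents whose coverage contains an object identified
        by the given ontology instance."""
        objects = {o for o, i in self.identities.items() if i == instance}
        return {d for d, objs in self.coverages.items() if objs & objects}


repo = Repository()
repo.annotates("d1", ["health", "hospital", "medicine"])
repo.identifies("obj1", "HUG")
repo.coverage("d1", "obj1")

print(repo.documents_located_at("HUG"))
```

Keeping the identification of city objects separate from the coverage of documents means that a document can be retrieved through the geographic ontology without storing any place name in the document itself.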

Figure 21 Repository resources connections

We first present the document model and annotation model.

3.1.1 Document Model

The document model represents the non-spatial resources stored in the repository. The document model and the document annotation model are inspired by the DELOS model as depicted in [39] and 5S framework [3]. Our major contribution is on the annotation model and particularly on the coverage annotation.

In DELOS, the space and time coverage of documents is stored in metadata using standards such as the Dublin Core (DC) [43]. Dublin Core proposes a “coverage” property that is composed of a space attribute and a time attribute. It is defined as follows in the DC metadata registry18: “The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant”. The spatial attribute has range dcterms:Location19, which is defined as a spatial region or named place, and the time attribute has range dcterms:PeriodOfTime20, which is defined as an interval of time that is named or defined by its start and end dates. In other cases, such as in the 5S framework [3], the system defines a facet called “Spaces” that contains metric or vector spaces which allow the location of a resource in 4D, the fourth dimension

18 http://purl.org/dc/elements/1.1/coverage

19 http://purl.org/dc/terms/Location

20 http://purl.org/dc/terms/PeriodOfTime
