
IRI Discrepancies

In the document The DART-Europe E-theses Portal (Pages 60-66)

Motivating Examples

3.4 IRI Discrepancies

3.4.1 Use Case 4 - Luxembourg Country (Logical Representation)

This use case is based on part of a real DBpedia RDF graph. RDF Graph 5 represents the RDF resource Luxembourg with different types of IRIs and literals and two ontology1

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:ns8="http://dbpedia.org/ontology/PopulatedPlace/"
    xmlns:dbo="http://dbpedia.org/ontology/"
    xmlns:dcterm="http://schema.org/"
    xmlns:dct="http://purl.org/dc/terms/" >
  <rdf:Description rdf:about="http://dbpedia.org/resource/Luxembourg">
    <dcterm:about rdf:resource="http://dbpedia.org/resource/Category:Luxembourg" />
    <dct:subject rdf:resource="http://dbpedia.org/resource/Category:Luxembourg" />
  </rdf:Description>
</rdf:RDF>

Namespace duplication


Figure 3.8: Sub-part of RDF Serialization for the RDF Graph 5 in Figure 3.7 with Physical disparities due to IRI discrepancies (concerning problems 11 and 12).

[Figure 3.7 graph: nodes http://dbpedia.org/resource/Luxembourg and http://dbpedia.org/ontology/Place; literals "Luxemburgo" and "Luxembourg"; edges rdf:type, wdrs:describedby, owl:sameAs]

Figure 3.7: RDF Graph 5 describing the Luxembourg RDF resource

3.4.2 Use Case 5 - Luxembourg Country (Physical Representation)

Use case 5 presents a possible serialization of RDF Graph 5, introduced in use case 4 (Section 3.4.1, Figure 3.7). The graph is encoded in RDF/XML format to show the namespaces linked with the resources (Figure 3.8).
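To make the link between the physical and logical representations concrete, the following minimal sketch expands a reduced copy of the Figure 3.8 serialization back into triples. It is an illustration only, not the tooling used in this work: the function name `extract_triples` is hypothetical, the unused prefixes of Figure 3.8 are omitted, and the sketch only handles `rdf:Description` elements whose properties carry an `rdf:resource` attribute.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Reduced copy of the Figure 3.8 serialization (unused prefixes omitted).
RDF_XML = """<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterm="http://schema.org/"
    xmlns:dct="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Luxembourg">
    <dcterm:about rdf:resource="http://dbpedia.org/resource/Category:Luxembourg" />
    <dct:subject rdf:resource="http://dbpedia.org/resource/Category:Luxembourg" />
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) IRI triples from simple RDF/XML.

    Hypothetical helper: it covers only the shape used in Figure 3.8,
    not the full RDF/XML syntax.
    """
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree stores qualified names as "{namespace}local".
            namespace, local = prop.tag[1:].split("}", 1)
            obj = prop.get(f"{{{RDF_NS}}}resource")
            triples.append((subject, namespace + local, obj))
    return triples


if __name__ == "__main__":
    for triple in extract_triples(RDF_XML):
        print(triple)
```

Once the prefixes are expanded, the two statements share the same subject and object and differ only in the predicate IRI, which is where the namespace choice becomes visible.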


Figure 3.9: RDF Graph with IRI discrepancies - IRI identity

3.4.3 Challenges in Use Cases 4 and 5

Consider Figures 3.9 and 3.10, which represent RDF Graph 5 with several IRIs describing the same resource (e.g., Luxembourg): Figure 3.9 highlights an IRI identity problem, whereas Figure 3.10 reflects an IRI coreference problem. In other words, several types of identities (in Figure 3.9) and references (in Figure 3.10) are introduced to give extra information about one resource, but not all the IRIs carry the same information about that resource.

• Problem 11 - IRI Identity: where two different IRIs are used to designate the same resource in different ways. Consider, for instance, DBpedia's description of the resource “Luxembourg” in Figure 3.9: http://dbpedia.org/resource/Luxembourg, http://en.wikipedia.org/wiki/Luxembourg, and http://dbpedia.org/data/Luxembourg.nt (cf. Figure 3.9.a, b and d) represent the same resource in different ways: the first is an identifier, the second is a Web page, and the last is a document representation in N-Triples format.

• Problem 12 - IRI Coreference: where two different IRIs are used to designate the same resource in the same way. Following the example of DBpedia in Figure 3.10, DBpedia uses different IRIs that provide information about the resource “Luxembourg” in order to describe it. Also, DBpedia uses vocabularies for the predicates to connect the statements


Figure 3.10: RDF Graph with IRI discrepancies - IRI reference
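One possible way to make coreferent IRIs (Problem 12) explicit is to group them into equivalence classes by following owl:sameAs links. The sketch below does this with a small union-find structure. It is a hedged illustration, not the method developed in this work: the function name `coreference_classes` is hypothetical, and the two `http://example.org/...` IRIs are invented placeholders, since the IRIs shown in Figure 3.10 are not recoverable here.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Sample statements; the example.org IRIs are hypothetical placeholders.
SAMPLE = [
    ("http://dbpedia.org/resource/Luxembourg", OWL_SAME_AS,
     "http://example.org/geo/Luxembourg"),
    ("http://example.org/geo/Luxembourg", OWL_SAME_AS,
     "http://example.org/stats/LU"),
    ("http://dbpedia.org/resource/Luxembourg", RDF_TYPE,
     "http://dbpedia.org/ontology/Place"),
]


def coreference_classes(triples):
    """Group IRIs connected by owl:sameAs into equivalence classes.

    Returns a dict mapping a representative IRI (the lexicographically
    smallest member, an arbitrary but deterministic choice) to the set
    of IRIs in its class.
    """
    parent = {}

    def find(iri):
        parent.setdefault(iri, iri)
        while parent[iri] != iri:
            parent[iri] = parent[parent[iri]]  # path halving
            iri = parent[iri]
        return iri

    for subject, predicate, obj in triples:
        if predicate != OWL_SAME_AS:
            continue
        root_s, root_o = find(subject), find(obj)
        if root_s != root_o:
            # Keep the smaller IRI as the root so the result is stable.
            small, large = sorted((root_s, root_o))
            parent[large] = small

    classes = {}
    for iri in list(parent):
        classes.setdefault(find(iri), set()).add(iri)
    return classes


if __name__ == "__main__":
    for representative, members in coreference_classes(SAMPLE).items():
        print(representative, sorted(members))
```

Choosing the lexicographically smallest IRI as representative is only a convention for reproducibility; any deterministic rule would serve the same purpose.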

In short, various types of semantic ambiguities and IRI discrepancies can occur in an RDF description. For example, the fact that the same semantic information¹ can be described in totally different ways can seriously complicate RDF data processing such as indexing, storage, and querying (making it more difficult, for example, to define proper indexing structures based on syntactic cues or to formulate meaningful SPARQL queries). Furthermore, semantic ambiguities and IRI discrepancies in RDF may produce different kinds of logical redundancies (RDF graph level) and physical disparities (RDF serialization level) in the RDF descriptions which, on their own, can place a heavy burden on RDF processing and on the development of RDF databases and solutions (processing time, loading time, similarity measuring, mapping, alignment, and versioning) [Gea04, THTC+15, THTCL16].

3.4.4 IRI Discrepancies creating Logical (Graph) Redundancies

Consider now the example given in Figure 3.11. Here, one can also identify various logical redundancies occurring in the forms of both RDF graph node duplications and edge duplications:

¹ Recall that the semantic information of an RDF statement refers not only to the values of the subject/predicate/object nodes/edges in the statement, but also to the meaning of the statement as a whole, such that the meaning of a literal/blank/IRI node/edge depends on the subject/predicate/object nodes/edges it connects with in the containing statement.

[Figure 3.11 graph: nodes http://dbpedia.org/resource/Luxembourg and http://dbpedia.org/ontology/Place; edges rdf:type, wdrs:describedby, owl:sameAs]

Figure 3.11: RDF Graph with Logical redundancies due to IRI discrepancies (concerning problems 3 and 4)

• Problem 13 - Node Duplication based on IRI discrepancies: where equivalent IRI nodes, designating equivalent subjects and/or objects, appear more than once. For instance, Figures 3.11.a, b, d, and e highlight different node duplications with: identifier IRI, document IRI, document representation IRI, and ontology IRI respectively.

• Problem 14 - Edge Duplication based on IRI discrepancies: where equivalent IRI edges, designating equivalent RDF predicates, appear more than once, such as in Figure 3.11.c, which highlights an edge duplication with a concept IRI.
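The two problems above can be illustrated with a minimal sketch that rewrites every IRI to a chosen canonical form and then drops the statements that become identical. This is an illustration under stated assumptions, not the normalization procedure of this work: the function name `normalize` and the `CANONICAL` table are hypothetical, and the table simply treats the three Luxembourg IRIs listed under Problem 11 as equivalent.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PLACE = "http://dbpedia.org/ontology/Place"
LUXEMBOURG = "http://dbpedia.org/resource/Luxembourg"

# Hypothetical equivalence table: variant IRI -> canonical IRI.
CANONICAL = {
    "http://en.wikipedia.org/wiki/Luxembourg": LUXEMBOURG,
    "http://dbpedia.org/data/Luxembourg.nt": LUXEMBOURG,
}

# The same statement expressed with three IRI variants of the subject.
TRIPLES = [
    (LUXEMBOURG, RDF_TYPE, PLACE),
    ("http://en.wikipedia.org/wiki/Luxembourg", RDF_TYPE, PLACE),
    ("http://dbpedia.org/data/Luxembourg.nt", RDF_TYPE, PLACE),
]


def normalize(triples, canonical):
    """Rewrite subjects, predicates and objects to their canonical IRIs
    and remove the duplicates this produces, keeping first-seen order."""
    seen = set()
    result = []
    for triple in triples:
        rewritten = tuple(canonical.get(term, term) for term in triple)
        if rewritten not in seen:
            seen.add(rewritten)
            result.append(rewritten)
    return result


if __name__ == "__main__":
    for triple in normalize(TRIPLES, CANONICAL):
        print(triple)
```

Because the table is applied to all three positions of a statement, the same function also collapses duplicated edges when equivalent predicate IRIs are listed in it. Note that collapsing an identifier, a Web page and a document into one node discards the distinction between them, so the table should only contain IRIs that are meant to be treated as equivalent.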

3.4.5 IRI Discrepancies creating Physical (Serialization) Disparities

IRI discrepancies can also produce disparities at the RDF serialization level, namely producing duplicate namespaces in the same RDF file. More formally:

• Problem 15 - Namespace Duplication based on IRI discrepancies: where two different namespaces are used to designate the same vocabulary, e.g., in Figure 3.8: http://schema.org/ and http://purl.org/dc/terms/ point to the same vocabulary.
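As a rough illustration of how such duplicates could be spotted, the sketch below lists the namespace declarations of an RDF/XML document and groups the ones that a lookup table marks as the same vocabulary. The names `declared_namespaces`, `duplicate_namespaces` and `SAME_VOCABULARY` are hypothetical; the table is an assumption that only encodes the equivalence stated in the example above, and a real detector would need a curated source for it.

```python
import io
import xml.etree.ElementTree as ET

# Namespace declarations copied from the Figure 3.8 serialization.
RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:ns8="http://dbpedia.org/ontology/PopulatedPlace/"
    xmlns:dbo="http://dbpedia.org/ontology/"
    xmlns:dcterm="http://schema.org/"
    xmlns:dct="http://purl.org/dc/terms/">
</rdf:RDF>
"""

# Hypothetical table: namespace IRI -> IRI of the vocabulary it stands for.
SAME_VOCABULARY = {"http://schema.org/": "http://purl.org/dc/terms/"}


def declared_namespaces(rdf_xml):
    """Return the (prefix, namespace IRI) pairs declared in the document."""
    source = io.BytesIO(rdf_xml.encode("utf-8"))
    # With events=("start-ns",), iterparse yields one (prefix, uri) pair
    # for each namespace declaration it encounters.
    return [pair for _event, pair in ET.iterparse(source, events=("start-ns",))]


def duplicate_namespaces(pairs, same_vocabulary):
    """Group declarations by vocabulary and keep the groups in which
    more than one distinct namespace IRI designates that vocabulary."""
    groups = {}
    for prefix, uri in pairs:
        vocabulary = same_vocabulary.get(uri, uri)
        groups.setdefault(vocabulary, []).append((prefix, uri))
    return {
        vocabulary: members
        for vocabulary, members in groups.items()
        if len({uri for _prefix, uri in members}) > 1
    }


if __name__ == "__main__":
    pairs = declared_namespaces(RDF_XML)
    for vocabulary, members in duplicate_namespaces(pairs, SAME_VOCABULARY).items():
        print(vocabulary, members)
```

Run on the Figure 3.8 declarations, the only group reported is the one containing the `dcterm` and `dct` prefixes.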

of our research. From these use cases, we identified 15 research challenges, which we call problems; they broadly fall into our four levels: logical redundancies, physical disparities, semantic ambiguities and IRI discrepancies.

One can clearly see the compound effect of overlooking the different kinds of RDF logical duplications and physical disparities that can result from syntactic redundancies (Sections 3.1 and 3.2), semantic ambiguities (Section 3.3) and IRI discrepancies (Section 3.4): all of them represent the same (syntactic) or equivalent (semantic or coreferenced) RDF information, which needs to be normalized into unified and unambiguous statements.

Against this background, the next chapter introduces our first contribution, which addresses the first two levels of challenges (logical redundancies and physical disparities) through a syntactic evaluation for RDF Normalization. Semantic ambiguities and IRI discrepancies are covered in the following chapters.
