Guarded Structural Indexes


Theory and Application to Relational RDF Databases

Thesis submitted in fulfillment of the requirements for the degree of Doctor of Engineering Sciences, under the supervision of Professor Stijn Vansummeren

François Picalausa


Acknowledgments

This thesis would have been impossible without the mentoring of my advisor Stijn Vansummeren. I have always been impressed by Stijn’s sharp insights during our discussions and his boundless enthusiasm. For making the whole experience instructive and thoroughly enjoyable, thank you!

I also want to thank George Fletcher, Jan Hidders, and Yongming Luo for the numerous interesting discussions and fruitful collaborations. They should receive due credit for the substantial contributions they made to the results of this thesis.

I am also grateful to the members of my comité d’accompagnement, Esteban Zimányi and Thierry Massart, for providing many useful comments on my work. It was also great fun to work and discuss with my colleagues, Anh Vu, Boris, Fred, Geoffroy, Gary, Mich, and Serge. Thank you for the stimulating environment you created. Furthermore, I am grateful to the FNRS for the financial support they provided for the realization of this work.

Finally, I want to thank my friends Avit, Anne-Sophie, Nycos, and Jehan for the great moments we spent together, as well as my family for their constant support.


List of Figures

1.1 Fragment of a music fans RDF database.

1.2 Graph about academic relations.

2.1 Example of databases in a business setting.

2.2 Tree-shaped query evaluation plan.

2.3 Illustration of a join tree.

3.1 Fragment of a music fans RDF dataset.

3.2 Graph view illustration.

4.1 Distribution of values in the subject position.

4.2 Distribution of values in the predicate position.

4.3 Distribution of values in the object position.

6.1 Example of fact simulation.

6.2 Illustration of a join forest.

6.3 Illustration of the unrolled database construction.

7.1 Illustration of a guarded structural index.

8.1 Illustration of the GLG of a database.

8.2 Illustration of fact bisimulation.

8.3 Illustration of the fact graph of a database.

8.4 Direct and indirect representation of equality types between facts.

8.5 Illustration of the FPG of a database.

8.6 Per iteration computation time of GuardedKS.

8.7 Per iteration computation time of optGuardedKS.

8.8 Performance gain of optGuardedKS over GuardedKS.

8.9 Number of fact k-bisimulation partitions versus k.

9.1 Number of read requests for different query processing strategies.


List of Tables

4.1 Size of each dataset, and number of distinct subjects, predicates, and objects in each dataset.

4.2 Size of the intersection of the sets of values in the subject, predicate and object location.

4.3 Size of self-joins, in millions.

4.4 p-values of the Kolmogorov-Smirnov tests for power-law distribution.

5.1 Use of individual and, filter, opt, union, and graph operators.

5.2 Distribution of sets of operators.

5.3 Top 20 predicate IRIs in triple patterns.

8.1 Characteristics of RDF datasets for experimental evaluation of guarded (bi)simulation computation.

8.2 Size of graph representation of RDF datasets.

8.3 The number of fact k-bisimulation blocks.

8.4 Number of fact k-simulation blocks, for k = 2, k = 3.

9.1 Number of read requests for SAINT-DB and RDF-3X on the CHAIN dataset.

9.2 Number of read requests for SAINT-DB and RDF-3X on the LUBM and SOUTHAMPTON datasets.


List of Symbols

Symbol Meaning Definition Page

Databases, atoms, facts, and conjunctive queries

U Universe of atomic data values 9

V Universe of variables 9

S Database schema 9

α Maximum arity 9

a. . .g Atom 9

s. . .z Fact 9

EQ Set of all equality types 7.2 115

eqtp(a,b) Equality type 1 6.13 85

a, b, . . . Tuple 10

db Database 2.1 10

A Set of atoms 10

f Mapping –

dom(f) Domain of mapping f –

im(f) Image of mapping f –

◦ Functions composition –

f|_C Restriction to domain¹ 10

f(a) Point-wise mapping of terms 10

f(A) Mapping of atoms in a set 10

rel(a) Relation of a 10

terms(a) Terms of a^{2 3} 10

var(a) Variables of a 1 1 10

val(a) Atomic data values of a 1 1 10

|a| Arity of a 1 10

a.i i-th term of a 1 10

Q(x)←a₁, . . . ,a_n Conjunctive query 2.3 11

1. Also applies to restrictions to terms of an atom.

2. Also applies to tuples.

3. Also applies to conjunctive queries.


body(Q) Body of a conjunctive query 11

head(Q) Head of a conjunctive query 11

µ Valuation, embedding 2.6 12

Ω Set of valuations 13

true Non-empty set of embeddings 12

false Empty set of embeddings 12

l Compatibility of valuations 2.11 13

⋈ Join 13

⟦Q⟧^db Query evaluation semantics⁴ 2.12 13

V Indistinguishability 6.27 90

V^k Indistinguishability up to k 6.36 92

ν Fact assignment 6.39 93

|= Fact assignment consistency 6.41 93

RDF and SPARQL

I IRIs 18

B Blank nodes 18

L Literals 18

IB IRIs and blank nodes 18

IBL IRIs, blank nodes, and literals 18

G RDF graph 2.26 18

γ Dataset assignment 2.26 18

D= (G, γ) RDF dataset 2.26 18

select SELECT query 19

ask ASK query 19

construct CONSTRUCT query 19

describe DESCRIBE query 19

P Graph pattern 19

and AND operator 2.28 20

union UNION operator 2.28 20

opt OPTIONAL operator 2.28 20

filter FILTER operator 2.28 20

graph GRAPH operator 2.28 20

select x where P Basic graph pattern query 2.27 20

Graphs, forests and trees

G= (Σ,∆, V, E,lab) Graph 22

Σ Finite set of node labels 22

∆ Finite set of edge labels 22

V, V(G) Finite set of nodes of graph G 22

E, E(G) Finite set of edges of graph G 22

4. Also applies to atoms, joins, graph patterns, and SPARQL queries.


lab Node labelling function 22

v−→^a _Gw Edge in graph G 22

F Join forest 2.13 14

ecc Eccentricity of a node 6.30 91

rad Radius of a graph 6.30 91

height Height of a forest⁵ 6.32 91

Hypergraphs

H Hypergraph 2.15 15

N Set of nodes of an hypergraph 2.15 15

E Set of hyperedges 2.15 15

Modal simulation and bisimulation

R Bisimulation relation 2.30 23

≈ Bisimilar 23

bisim(G, v) Bisimilarity block 2.31 23

S Simulation relation 2.33 24

Simulated 24

∼ Similar 24

sim(G, v) Similarity block 24

R_k⊆ · · · ⊆ R₀ k-bisimulation relation 2.34 25

≈^k k-bisimilar 25

S_k⊆ · · · ⊆ S₀ k-simulation relation 2.35 25

^k k-simulated 26

∼^k k-similar 26

Guarded simulation and bisimulation

db|_X Restriction of a database 81

I Guarded bisimulation relation 6.8 81

≈_g Guardedly bisimilar 81

gblocks Guarded bisimilarity blocks 8.2 127

J Guarded simulation relation 6.11 83

_g Guardedly simulated 83

∼_g Guardedly similar 83

gsim Guarded similarity block 7.17 123

gsblocks Guarded similarity blocks 8.1 127

T Fact bisimulation relation 8.7 132

≈_f Fact bisimilar 132

T_k⊆ · · · ⊆ T₀ Fact k-bisimulation relation 6.23 144

≈^k_f Fact k-bisimilar 144

F Fact simulation relation 6.15 85

_f Fact simulated 85

5. Also applies to atoms in an acyclic conjunctive query, and to acyclic conjunctive queries.

∼_f Fact similar 85

F_k⊆ · · · ⊆ F₀ Fact k-simulation relation 6.23 89

^k_f Fact k-simulated 89

∼^k_f Fact k-similar 89

k-fsim Fact k-similarity block 7.10 120

Fact relation 6.17 86

F[J] Guarded to fact simulation 86

J[F] Fact to guarded simulation 86

Database unrolling

canondb_Q Canonical database 6.54 101

δ Canonical mapping 6.54 101

unrolldb Unrolled database 6.54 101

x^◦ White copy of variable x 102

x^• Black copy of variable x 102

ρV Black and white atom copy 102

w_V Covering 6.56 102

F∩ Maximum intersections 6.57 102

unrolldb Unrolled databases 6.59 103

I_Q,F Unrolling mappings 6.59 103

Guarded structural index

I Guarded structural index 7.2 115

ξ Embedding 7.4 116

[s]k Index Block 7.11 120

simk Guarded k-simulation index 7.11 120

sim Guarded simulation index 7.18 123

Simulation and bisimulation computation

GLG Guarded list graph 8.3 129

GLG_∅ GLG w/o empty lists 163

FG Fact graph 8.11 135

FG_∅ FG w/o empty equality types 136

FSG Fact sublist graph 8.17 139

FSG_∅ FSG w/o empty sublist 144

FPG Fact projection graph 8.30 148

FPG_∅ FPG w/o empty projection 153

◦ Composition of equality types 137

gl Guarded list 129

gsl Guarded sublists 138

minpos_s First occurrence of terms 138

restr Restrictions 138

proj Projection 148


step_≈ Restrictions simulation step 8.20 140

step Facts simulation step 8.34 140


List of Acronyms

Acronym Definition Page

BGP Basic Graph Pattern 20

BSBM Berlin SPARQL Benchmark 51

CPF Conjunctive Pattern with Filters 65

CQ Conjunctive Query 11

DBMS Database Management System 50

FO First Order logic 5

IRI Internationalized Resource Identifier 18

JSON JavaScript Object Notation 1

LUBM Lehigh University Benchmark 51

OLAP Online Analytical Processing 49

OWD opt-well-Designed 67

PSACQ Pure and Strict Acyclic Conjunctive Query 90

RDF Resource Description Framework 1

RDF-3X RDF Triple eXpress 34

RDFS RDF Schema 41

SPARQL SPARQL Protocol And Query Language 3

UNF Union Normal Form 67

UWD union-well-Designed 68

XML eXtensible Markup Language 1

1 Introduction

With the rising popularity of the World Wide Web, a flurry of new data formats have been proposed in recent years to allow applications on the Web to interchange data in a flexible and scalable way. In particular, in contrast to classical relational databases where each piece of information has a rigid predefined structure shared by all applications, data on the Web describes the structure of the data alongside the data itself. Hence, these new data formats are known as self-describing, or semi-structured, data formats. Examples of such data formats that have gained traction in recent years include the eXtensible Markup Language (XML) [18], the Resource Description Framework (RDF) [61, 68], and the JavaScript Object Notation (JSON) [32].

As the volume of semi-structured data has increased over the years, the need for efficient solutions to store and query semi-structured data has also increased. This has led to the creation of specialized semi-structured database engines. These engines typically exploit the particular shape of the semi-structured data to provide the necessary performance. For instance, many semi-structured data formats can be seen as being graph-shaped, that is, constituted of a set of nodes connected by edges. In this context, graph-specific storage techniques such as the pre-computation of paths [86] (successive edges of the graph), the grouping of nodes in clusters according to their distance from one another [123], as well as the use of efficient representation techniques for nodes and edges [48] have been applied to semi-structured data.

In addition to these specialized engines that focus on exploiting the graph structure, different proposals have been made to store semi-structured data directly in relational databases [9,109]. In fact, many state-of-the-art research prototypes as well as commercial database engines for RDF are built upon relational database engines, as we will see in Chapter 3. This may be due to the many well-known and well-refined techniques developed in relational databases for processing large amounts of data from external memory.

Triples

Subject    Predicate  Object
work1234   Composer   Debussy
user8604   likes      work1234
user3789   name       Umi
user3789   likes      work1234
user8604   name       Teppei
user8604   friendOf   user3789

(a) Relational perspective

[Graph drawing: each subject and object of the table is a node, and each triple is an edge from its subject to its object, labeled with its predicate.]

(b) Graph perspective

Figure 1.1: A small fragment of an RDF database concerning music fans.

Both the graph-based and the relational perspectives on storing and querying semi-structured data face the challenge of providing suitable efficiency for ever increasing amounts of data. Indeed, spurred by efforts like the Linking Open Data project [62], increasingly large volumes of data are being published in RDF. By the end of 2011, there were over 31 billion triples published by data sources participating in the Linking Open Data project¹, with many individual data sources counting hundreds of millions of triples. XML too has been the publication format of choice for many well-known large-scale datasets such as Uniprot²; the Open Government Initiative data³; and the Penn Treebank⁴.

Given that both the graph-based and the relational perspectives on storing and querying semi-structured data each have their own (complementary) strengths and weaknesses with respect to performance and efficiency, the question arises: is it possible to marry the two approaches, combining their strengths and eliminating as much as possible their weaknesses? In this thesis, we provide a partial answer to this question by studying the generalization of structural indexing—a technique of importance in the graph-based storage of XML and RDF data—to arbitrary relational databases. Our study focuses on the application of this generalization to obtain a structural indexing method that is faithful to querying RDF data.

1. http://www4.wiwiss.fu-berlin.de/lodcloud/state/

2. http://www.uniprot.org/

3. http://www.data.gov/

4. http://www.cis.upenn.edu/~treebank/


Graph-based and relational perspectives on RDF data. To illustrate why the graph-based perspective is not fully faithful to querying RDF data, let us briefly introduce both the relational and graph-based perspectives on RDF.

Informally, from a relational perspective, RDF data are triples of the form (subject, predicate, object) that are stored in a single large ternary database table. From a graph-based perspective, in contrast, the subjects and objects of the triples are seen as nodes in a graph and each triple corresponds to an edge from its subject to its object, labeled with its predicate. These perspectives are illustrated in Figure 1.1, where the left-hand side (a) gives the relational representation of an RDF music fans database, and the right-hand side (b) gives its corresponding graph representation.

There is, however, a discrepancy between the two approaches that comes from the fact that RDF does not make any distinction between the roles of predicates and subjects or objects. For instance, extending our example, one could add a triple (friendOf, is-a, socialRelationship) to give more information on the predicate friendOf that appears in (user8604, friendOf, user3789). In the graph representation, however, this would add the new nodes friendOf and socialRelationship with an is-a edge between them. Note that there are now two representations of friendOf in the graph: once as a node, and once as an edge label. Triples like (friendOf, is-a, socialRelationship), where the subject is the predicate of some other triples, are fairly common in practice, in particular in vocabularies concerned with ontologies [19, 88]. In addition, the SPARQL Protocol and Query Language (SPARQL) [100], which is the standard query language for RDF, does not distinguish between predicates and subjects or objects: it allows information in the predicate position to be freely combined with information in the subject or object position. Most graph-based techniques, and in particular the structural indexes that have previously been developed for RDF data [75, 123, 137], however, assume that the set of predicates is fixed and distinct from the set of nodes. As such, they do not allow queries where predicate information needs to be joined with subjects or objects, and are hence not fully faithful to RDF.
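The discrepancy can be made concrete with a small sketch. The triples are those of Figure 1.1 plus the extra triple discussed above; the helper names `graph_view` and `predicate_subject_join` are hypothetical and introduced only for this illustration.

```python
# Triples of Figure 1.1, extended with the triple discussed in the text.
TRIPLES = [
    ("work1234", "Composer", "Debussy"),
    ("user8604", "likes", "work1234"),
    ("user3789", "name", "Umi"),
    ("user3789", "likes", "work1234"),
    ("user8604", "name", "Teppei"),
    ("user8604", "friendOf", "user3789"),
    ("friendOf", "is-a", "socialRelationship"),
]


def graph_view(triples):
    """Graph perspective: subjects and objects are nodes, predicates are edge labels."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    edge_labels = {p for _, p, _ in triples}
    return nodes, edge_labels


def predicate_subject_join(triples):
    """Relational perspective: self-join of the triple table on predicate = subject."""
    return [(t, u) for t in triples for u in triples if t[1] == u[0]]


nodes, edge_labels = graph_view(TRIPLES)
both_roles = nodes & edge_labels          # values that are both a node and an edge label
joined = predicate_subject_join(TRIPLES)  # trivial to express on the ternary table
```

In the relational view the join on predicate and subject is an ordinary self-join, whereas in the graph view friendOf shows up in `both_roles`, that is, as two unrelated objects.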

Structural Indexing. Towards the generalization of efficient graph-based database techniques and their convergence with relational database technologies, in this thesis, we study the application of structural indexing to relational databases. This will result in what we call a guarded (structural) index.

The key idea behind traditional structural indexing is that for many practical classes Q of queries (e.g., reachability queries, XPath queries [40, 52], queries expressed in modal or temporal logics [15], . . . ) it is possible to group together the nodes of an input graph G to obtain a more compact representation of G, called the structural index for G (with respect to Q). The grouping is done in such a way that any query Q ∈ Q can be answered directly on the structural index of G instead of on G itself [40, 67, 89, 102, 122]. Since this index is typically (much) smaller than G itself, this way of processing Q can be significantly faster than evaluating Q over G itself.

[Figure: graphs G, H, and R. In G, nodes 1 to 4 are labeled prof, node 5 is labeled phd, and nodes 6 and 7 are labeled stud. The nodes of R are {1}, {2,3}, {4}, {5}, and {6,7}. Edges are labeled adv or sup.]

Figure 1.2: Graph about academic relations between professors, PhD students, and bachelor students, with advisor-of and supervises relationships. For convenient later reference, the nodes of G are given an identity (1, 2, 3, . . .). The dotted lines indicate a bisimulation between G and H. The graph R is the bisimulation reduct of G.

Example 1.1. To illustrate, Figure 1.2 shows a graph G and a more compact representation R. Observe that each node of R is actually a set of nodes in G, and that there is an edge between sets V and W in R if there is a corresponding edge between v ∈ V and w ∈ W in G. Clearly, R has fewer nodes and edges than G. Further observe that to evaluate a query Q such as “select all professors that advised someone who is currently a professor who is advising a PhD student”, it suffices to evaluate Q on R: the resulting node {2,3} of Q on R is exactly the set of nodes resulting from Q on G.
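The following minimal sketch replays Example 1.1. The node labels and the grouping are those of Figure 1.2, but the edge set is an assumption: it is one set of adv and sup edges that is consistent with the example, not necessarily the exact drawing. The function names are hypothetical.

```python
# Node labels of G as given in Figure 1.2.
LABEL = {1: "prof", 2: "prof", 3: "prof", 4: "prof", 5: "phd", 6: "stud", 7: "stud"}
# ASSUMPTION: an edge set consistent with Example 1.1 (not taken from the source).
EDGES = {(1, "adv", 2), (1, "adv", 3), (2, "adv", 4), (3, "adv", 4),
         (4, "adv", 5), (5, "sup", 6), (5, "sup", 7)}
# Grouping of the nodes of G that forms the nodes of R.
BLOCKS = [frozenset({1}), frozenset({2, 3}), frozenset({4}),
          frozenset({5}), frozenset({6, 7})]


def quotient(labels, edges, blocks):
    """Build R: one node per block, one edge per pair of blocks linked in G."""
    block_of = {v: b for b in blocks for v in b}
    q_labels = {b: labels[next(iter(b))] for b in blocks}
    q_edges = {(block_of[v], a, block_of[w]) for v, a, w in edges}
    return q_labels, q_edges


def query(labels, edges):
    """Professors who advised a professor who is advising a PhD student."""
    def advisees(v, wanted):
        return {w for (u, a, w) in edges
                if u == v and a == "adv" and labels[w] == wanted}
    return {v for v in labels
            if labels[v] == "prof"
            and any(advisees(w, "phd") for w in advisees(v, "prof"))}


r_labels, r_edges = quotient(LABEL, EDGES, BLOCKS)
on_g = query(LABEL, EDGES)        # evaluated on the full graph G
on_r = query(r_labels, r_edges)   # evaluated on the compact graph R
expanded = set().union(*on_r)     # nodes of G represented by the answer on R
```

Under these assumptions, the same query function runs unchanged on G and on R, and expanding the blocks returned on R gives back the answer on G.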

Obviously, the grouping of nodes in the index must be done in such a way that the right information can be retrieved from the index when processing a query. The notions of simulation and of bisimulation are fundamental for this purpose. Essentially, bisimulation characterizes when two nodes in a graph share basic structural characteristics such as labels and neighborhood connectivity. Simulation relates pairs of nodes (v, w) such that w has at least the basic structural characteristics of v. In the context of XML, simulation- and bisimulation-based indexes are known to be covering for different restrictions of the XPath query language [89, 102]. That is, given a query of the restricted query language, its evaluation on the structural index will provide exactly the nodes that would be returned had the query been evaluated on the original data. Stated differently, the query cannot distinguish between nodes that have been grouped in the index. We also say that the query is invariant under bisimulation. The formal definition of (bi)simulation and its use to obtain a compact representation of a graph is given in Chapter 2.
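As an informal preview, a grouping of this kind can be computed by naive partition refinement. The sketch below makes two assumptions that are not taken from the text: bisimilarity is taken with respect to outgoing edges only, and the toy graph is one edge set consistent with Example 1.1. It is not the algorithm developed in this thesis.

```python
def bisimulation_partition(labels, edges):
    """Naive partition refinement for (forward) bisimilarity on a labeled graph.

    ASSUMPTION: only outgoing edges are inspected. Nodes start grouped by label
    and are split until every block agrees on the (edge label, target block)
    pairs of its members.
    """
    block = dict(labels)  # node -> identifier of its current block
    while True:
        ids = {}
        refined = {}
        for v in labels:
            out = frozenset((a, block[w]) for (u, a, w) in edges if u == v)
            # The signature keeps the old block, so blocks are only ever split.
            refined[v] = ids.setdefault((block[v], out), len(ids))
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = refined
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, set()).add(v)
    return {frozenset(g) for g in groups.values()}


# Toy graph: labels from Figure 1.2, edges assumed as described above.
LABEL = {1: "prof", 2: "prof", 3: "prof", 4: "prof", 5: "phd", 6: "stud", 7: "stud"}
EDGES = {(1, "adv", 2), (1, "adv", 3), (2, "adv", 4), (3, "adv", 4),
         (4, "adv", 5), (5, "sup", 6), (5, "sup", 7)}

partition = bisimulation_partition(LABEL, EDGES)
```

On this toy graph the refinement stabilizes after two splitting rounds and yields the five blocks that serve as the nodes of R in Figure 1.2.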

Towards a generalization of structural indexing to relational databases, and in the hope of developing a structural index that is faithful to the RDF data model, it is natural to consider known generalizations of bisimulation to relational structures such as those formulated by the logic community. Specifically, we consider the so-called guarded fragment of first order logic (FO), which was shown to be characterized by a tractable generalized notion of guarded bisimulation [10, 97]. Its connection to databases is also well-known. For instance, Leinders et al. [77] and, indirectly, Flum et al. [41] have shown that the semi-join variant of Codd’s relational algebra is equally expressive as the guarded fragment. Moreover, Gottlob et al. have established the expressive equivalence of the acyclic conjunctive queries, a tractable query language for relational databases, and the conjunctive fragment of guarded FO [47].

Despite this generalization of the notion of bisimulation to guarded bisimulation, to the best of our knowledge, no other work has tackled the problem of structural indexing for relational data.

Organization and contributions. After the introduction of the necessary formal background in Chapter 2, the organization of this thesis and its main contributions are as follows.

In Part I we give an overview of existing techniques for storing and querying RDF data, and perform a study of the characteristic properties exhibited by real-world RDF data and SPARQL queries. Apart from motivating several design choices in later chapters, this study is interesting in its own right and provides insights that can more generally be used for the development of new practical RDF storage solutions and practical heuristics for processing SPARQL queries. Concretely:

– We start with a survey of the state of the art in the relational and graph-based approaches to storing RDF data in Chapter 3. In particular we elaborate on the relational representation of RDF data, in Section 3.2, and on structural indexing, in Section 3.3.

– In Chapter 4, we study the characteristics of real-world and synthetic RDF data. Our main result is that, far from the uniform distribution of data expected by some query processing engines, in many RDF databases the distributions of subjects, predicates and objects follow a so-called power-law.

– In Chapter 5 we study the characteristics of real-world SPARQL queries. We find that the majority of the real-world SPARQL queries are conjunctive, and that 99.99% of these are acyclic. For these queries, efficient evaluation algorithms are known [46]. For non-conjunctive SPARQL queries we propose a syntactic criterion called well-behavedness in Section 5.3. We show that well-behaved queries can not only be evaluated efficiently, but in addition more than 75% of the SPARQL queries posed in practice are well-behaved.

From our results in Chapter 5 we retain in particular that the answering of acyclic conjunctive relational queries is a key problem in answering SPARQL queries. This motivates the development of a structural index specifically geared towards the acyclic conjunctive queries.

Our development follows the methodology proposed by Fletcher et al. for the design of covering structural indexes. This methodology hinges on three main components [40]: (1) a so-called structural characterization of indistinguishability of data objects by queries of the target query language; (2) an efficient algorithm to group together data that cannot be distinguished by any query of the target language; (3) a data structure (i.e., the index) that exploits this grouping to support query answering by means of that index instead of reverting to the full database.

Of the three main components needed to build a structural index, Part II provides two: a structural characterization of the acyclic conjunctive queries, and an index structure that can help in answering conjunctive queries and is especially efficient for acyclic ones. These two components are complemented by an efficient grouping algorithm in Part III. Concretely:

– We propose a structural characterization of acyclic conjunctive queries in Chapter 6. In particular, we show in Section 6.3 that the acyclic conjunctive queries are invariant under the notion of guarded simulation—a variant of guarded bisimulation that we introduce. Moreover, we show that guarded simulation characterizes exactly query indistinguishability for the acyclic conjunctive queries. Furthermore, in Section 6.4, we show that acyclic conjunctive queries are exactly the conjunctive queries that are invariant under guarded simulation. These results lead us to conclude that guarded simulation is the “right” notion for constructing a structural index for the acyclic conjunctive queries.

– Then, in Chapter 7, we describe an index structure that summarizes relational data. More specifically, in Section 7.3, we specify how to group relational data suitably to support arbitrary (not necessarily acyclic) conjunctive queries. In Section 7.4, we provide a notion of pruning power that is similar to the notion of covering for normal structural indexes. We use this notion to tie our index to data elements invariant under guarded simulation. We also note that data elements invariant under guarded bisimulation are also invariant under guarded simulation, and that while guarded bisimulation may lead to larger indexes it may prove more efficient computationally.

In Part III, we present algorithms geared towards the practical realization of guarded structural indexes. To this end, we first develop an algorithm to group data according to guarded simulation and guarded bisimulation. Then, building on our three components of a guarded structural index, we show in the context of SPARQL how a relational query engine can integrate a structural index for additional performance.

– In Chapter 8, we devise efficient algorithms for computing the guarded bisimulation partition and the guarded similarity partition of a database. That is, these algorithms group relational data according to guarded (bi)similarity. While the guarded bisimilarity partitioning problem had been studied from a theoretical point of view [49, 65], we show in Section 8.2.1 that unfortunately the proposed algorithm does not scale to large databases. The core idea of this existing algorithm is to reduce a database to a graph on which normal bisimulation corresponds to guarded bisimulation. Efficient algorithms that solve the normal bisimilarity partitioning problem can then in principle be used to solve the guarded bisimilarity partitioning problem. In Section 8.2.3 (resp. 8.4.1), we propose reductions which are more space-efficient, but whose bisimulation partitions (resp. similarity partitions) still correspond to the guarded bisimilarity partition (resp. guarded similarity partition) of the original database.

– Finally, in Chapter 9, we consider the problem of integrating our guarded structural indexes with relational query processing engines. We first show in Section 7.5 how the index structure that we introduce in Chapter 7 is specialized to integrate our structural characterization of Chapter 6. In Section 9.2, we devise query processing strategies that can be integrated with a query optimizer. We conclude this chapter with an experimental evaluation of our query processing techniques.

We conclude in Chapter 10 with discussion and pointers for future work.

Acknowledgments and related publications. This work draws from discussions and collaborations with my supervisor, Stijn Vansummeren, and my colleagues from the Netherlands, Jan Hidders, George H.L. Fletcher, Yongming Luo, and Paul de Bra. In particular:

– The state of the art on RDF storage contained in Chapter 3 is mainly an excerpt from the book chapter by Luo, myself, Fletcher, Hidders, and Vansummeren, Storing and indexing massive RDF datasets, chapter 2 in De Virgilio et al., editors, Semantic Search over the Web, pages 29–58, Springer, 2012.

– The empirical analysis of SPARQL queries of Chapter 5 is based on the article by myself and Vansummeren, What are real SPARQL queries like?, in Proceedings of Semantic Web Information Management 2011, pages 7:1–7:6, ACM, 2011.

– The structural characterisation of Chapter 6 significantly extends and corrects the communication of Fletcher, Hidders, Luo, myself, Vansummeren, and De Bra, On guarded simulations and acyclic first-order languages, presented at the Symposium on Database Programming Languages 2011.

– Chapters 7 and 9 draw from and extend the article by myself, Luo, Fletcher, Hidders, and Vansummeren, A structural approach to indexing triples, in Simperl et al. eds., The Semantic Web: Research and Applications, Lecture Notes in Computer Science volume 7295, pages 406–421, Springer, 2012.

– The results on guarded bisimilarity partitioning in Chapter 8 are due to myself and Vansummeren, and are currently under review for publication.

2 Preliminaries

In this chapter, we present the theoretical background required for a good understanding of the development that follows in later chapters. In particular, we review relational databases and their standard query languages as well as the specialization of relational databases to RDF and SPARQL. We also recall the notions of (modal) simulation and bisimulation.

Throughout, we assume familiarity with the syntax and semantics of first order logic [80]. Several well-known results are stated informally without elaboration, allowing us to focus on the notational conventions used. For more detailed discussion, we refer the interested reader to standard textbooks on database theory (e.g., [4]) and modal logic or bisimulation (e.g., [15, 111]).

2.1 Relational Databases

2.1.1 Atoms, Facts, Databases, and Queries

From the outset, we assume given a fixed universe U of atomic data values, as well as a fixed universe V of variables, and a fixed, finite set S of relation symbols, all pairwise disjoint. We refer to S also as the schema. Every relation symbol r ∈ S is associated with a natural number called the arity of r. We use α to denote the maximum arity of a relation symbol in S.

A term is either an atomic data value or a variable. An atom is an expression of the form r(a₁, . . . , a_k) with r ∈ S a relation symbol; k the arity of relation symbol r; and each of the a₁, . . . , a_k ∈ V ∪ U a term. A fact is an atom that contains no variables.

Database db1

Project
     PID  Mgr    Auditor
s1   1    Mary   John
s2   2    John   Mary
s3   3    Sue    Sue

WorksOn
     Emp      Proj
t1   Mary     1
t2   John     2
t3   Sue      3
t4   Jeffrey  3
t5   Cathy    3

Database db2

Project
     PID  Mgr    Auditor
u1   a    Liv    Rob
u2   b    Rob    Liv
u3   c    Bill   Bill
u4   d    Ellen  Fred
u5   e    Fred   Ellen

WorksOn
     Emp    Proj
v1   Liv    a
v2   Rob    b
v3   Bill   c
v4   Bob    c
v5   Ellen  d
v6   Fred   e

Figure 2.1: Two company databases. For future reference, facts are labeled with identifiers (s1, s2, . . .).

Definition 2.1. A relational database over S is a finite set db of facts.
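One possible encoding of these notions is sketched below. The class and function names (`Var`, `Atom`, `make_database`) are hypothetical, and data values are assumed to be Python strings and integers; the facts are those of database db1 in Figure 2.1, in the column order of the figure.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    """A variable from V; every other term is treated as a data value from U."""
    name: str


Term = Union[Var, str, int]


@dataclass(frozen=True)
class Atom:
    rel: str                # relation symbol r
    args: Tuple[Term, ...]  # terms a_1, ..., a_k

    def is_fact(self) -> bool:
        # A fact is an atom that contains no variables.
        return not any(isinstance(t, Var) for t in self.args)


# Schema S of Figure 2.1: relation symbol -> arity.
SCHEMA = {"Project": 3, "WorksOn": 2}
MAX_ARITY = max(SCHEMA.values())  # the maximum arity, written alpha in the text


def make_database(atoms):
    """Return a database (a finite set of facts) after checking it against SCHEMA."""
    db = frozenset(atoms)
    for a in db:
        if SCHEMA.get(a.rel) != len(a.args):
            raise ValueError(f"arity mismatch in {a}")
        if not a.is_fact():
            raise ValueError(f"{a} contains a variable")
    return db


db1 = make_database(
    [Atom("Project", row)
     for row in [(1, "Mary", "John"), (2, "John", "Mary"), (3, "Sue", "Sue")]]
    + [Atom("WorksOn", row)
       for row in [("Mary", 1), ("John", 2), ("Sue", 3), ("Jeffrey", 3), ("Cathy", 3)]]
)
```

Representing a database as a frozen set of facts mirrors Definition 2.1 directly: duplicates collapse and the order of facts is irrelevant.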

In what follows, we will range over atoms and facts by boldface letters drawn respectively from the beginning and from the end of the alphabet. We write rel(a) for the relation symbol r of atom a = r(a₁, . . . , a_k); terms(a) for the set {a₁, . . . , a_k} of all terms occurring in a; var(a) for the set terms(a) ∩ V of variables occurring in a; and val(a) for the set terms(a) ∩ U of all atomic data values occurring in a. In addition, we write |a| for the arity k of r and write a.i for the i-th term a_i in a, provided 1 ≤ i ≤ |a|. An atom a is said to be built over a set A ⊆ V ∪ U (or simply to be over A) if terms(a) ⊆ A. A database is built over A ⊆ U if all of its facts are built over A. We denote tuples (a₁, . . . , a_k) as a, and give the natural semantics to terms(a), var(a), val(a), |a|, and a.i. We also say that a tuple a is built over (or simply is over) a set A ⊆ V ∪ U if terms(a) ⊆ A.

If f: A → B is a mapping and a is an atom over A then we denote by f(a) the atom r(f(a₁), . . . , f(a_k)) obtained by applying f point-wise to each term in a = r(a₁, . . . , a_k). We also denote by f|_C the restriction of the domain of f to the set of terms C and, extending this notation to atoms, denote by f|_a the restriction of the domain of f to the set terms(a). Finally, given a set A = {a₁, . . . , a_n} of atoms, we write f(A) for the set of atoms {f(a) | a ∈ A, terms(a) ⊆ dom(f)}.
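The notation above translates almost literally into code. In this sketch an atom is assumed to be encoded as a pair (relation symbol, tuple of terms) and a mapping f as a Python dict; all function names are hypothetical stand-ins for the notation of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable; any other term is treated as an atomic data value."""
    name: str


def atom(rel_symbol, *args):
    return (rel_symbol, tuple(args))


def rel(a):      # rel(a): relation symbol of a
    return a[0]


def terms(a):    # terms(a): set of all terms occurring in a
    return set(a[1])


def var(a):      # var(a): variables occurring in a
    return {t for t in a[1] if isinstance(t, Var)}


def val(a):      # val(a): atomic data values occurring in a
    return terms(a) - var(a)


def arity(a):    # |a|
    return len(a[1])


def ith(a, i):   # a.i, defined for 1 <= i <= |a|
    if not 1 <= i <= arity(a):
        raise IndexError(i)
    return a[1][i - 1]


def apply(f, a):          # f(a): apply f point-wise to each term of a
    return (rel(a), tuple(f[t] for t in a[1]))


def restrict(f, c):       # f|_C: restriction of the domain of f to C
    return {t: f[t] for t in f if t in c}


def apply_set(f, atoms):  # f(A): only atoms whose terms all lie in dom(f)
    return {apply(f, a) for a in atoms if terms(a) <= set(f)}


x, y, z = Var("x"), Var("y"), Var("z")
a = atom("r", x, 5, y, x)
b = atom("r", z, 1, z, z)
f = {x: 2, 5: 5, y: "a"}   # a mapping defined on the terms of a, but not on those of b
image = apply_set(f, {a, b})
```

Note that `apply_set` silently skips `b`, because z and 1 are outside dom(f); this matches the side condition terms(a) ⊆ dom(f) in the definition of f(A).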


Definition 2.2. A relational query of arity n ≥ 0 (or simply query for short) is a function ϕ that maps each relational database to a subset of Uⁿ.

2.1.2 Conjunctive queries

The class of conjunctive queries (CQ for short) has been recognized early in the study of database query languages as a particularly important and practical fragment of first-order logic (FO) [24]. As the basic language for expressing join patterns between database objects, CQs have since continued to play a central role in query language design across all major data models: relational, complex object, object-oriented, semi-structured, XML, graph, and RDF data (as we will see) [3, 4]. While the expressiveness of CQs is limited, they are the source of many optimization techniques [24, 46]. The formal definition of conjunctive queries is as follows.

Definition 2.3 (Conjunctive query). A conjunctive query Q is an expression of the form Q(x) ← a₁, . . . , a_n with every a_i an atom and x a tuple over the set ⋃{var(a_i) | 1 ≤ i ≤ n} of the variables mentioned in the atoms. The set {a₁, . . . , a_n} is called the body of Q and x is called the head of Q.

Let Q be a conjunctive query. We write body(Q) and head(Q) for the body and the head of Q, respectively. We further write terms(Q) for the set of all terms occurring in atoms in body(Q); var(Q) for the set terms(Q) ∩ V of variables mentioned in Q; and val(Q) for the set terms(Q) ∩ U of values mentioned in Q. Finally, we sometimes also write Q(x) to indicate that x is the head of Q.

Definition 2.4 (Pure conjunctive query). An atom a is pure if val(a) = ∅. A set of atoms is pure if every atom in it is pure. A conjunctive query Q is pure if body(Q) is pure.

Stated differently, val(Q) = ∅, i.e., all the atoms of a pure conjunctive query are built over the set V of variables.

Example 2.5. The following is a pure conjunctive query:

Q(worker) ← Project(pid, mgr, mgr), WorksOn(pid, worker).

The intention of this query when applied to the databases of Figure 2.1, as we will show, is to retrieve all the employees who work on a project that is managed and audited by the same person. The atoms of this query are a₁ = Project(pid, mgr, mgr) and a₂ = WorksOn(pid, worker), where pid, mgr, and worker are variables.


We now formally define the semantics of conjunctive queries.

Definition 2.6 (Valuation). A valuation µ is a partial function µ : V → U. As usual, we write dom(µ) to denote the set of variables on which µ is defined.

Given an atom a and a valuation µ with var(a) ⊆ dom(µ), we write µ(a) for the fact obtained by replacing each variable x occurring in a by µ(x). For example, if a = r(x, 5, y, x), µ(x) = 2, and µ(y) = a then µ(a) = r(2, 5, a, 2).

We similarly write µ(a) for the tuple of values obtained by replacing variables in the tuple a over U ∪ V according to µ.

Definition 2.7 (Embedding). A valuation µ is an embedding of a set of atoms A in a database db if var(A) ⊆ dom(µ) and µ(a) ∈ db for all a ∈ A. A valuation µ is an embedding of a conjunctive query Q in a database db if it is an embedding of body(Q) in db.

Example 2.8. It is easily verified that the function µ mapping

pid ↦ 3, mgr ↦ Sue, worker ↦ Jeffrey

is an embedding of the conjunctive query Q of Example 2.5 in database db₁ of Figure 2.1.

The result of a conjunctive query on a database is as follows:

Definition 2.9 (Conjunctive query semantics). The result of conjunctive query Q(x) on database db is the set Q(db) := {µ(x) | µ is an embedding of Q in db}.

As such, every conjunctive query defines a relational query.

Example 2.10. Continuing with our query Q of Example 2.5, the database db₁ of Figure 2.1, and referring to the embedding of Example 2.8, it is easy to verify that Jeffrey ∈ Q(db₁). In particular, Q(db₁) = {Sue, Jeffrey, Cathy}.
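Definition 2.9 can be read almost literally as a (very inefficient) procedure: enumerate candidate valuations and keep those that are embeddings. The sketch below does exactly that. Everything concrete in it is an assumption made for illustration: variables are strings starting with "?", atoms and facts are (relation, tuple) pairs, and the small database is hypothetical. It is not db₁ of Figure 2.1, which is not reproduced here, although it does contain the two facts used by the embedding of Example 2.8.

```python
from itertools import product


def is_var(t):
    # Assumed convention: variables are strings that start with "?".
    return isinstance(t, str) and t.startswith("?")


def evaluate(head, body, db):
    """Q(db) as the set of mu(x), for mu ranging over embeddings of Q in db."""
    variables = sorted({t for _, args in body for t in args if is_var(t)})
    # An embedding can only use values that occur in db, so it suffices to
    # try every valuation from var(Q) to the values appearing in db.
    domain = sorted({v for _, args in db for v in args}, key=repr)
    result = set()
    for values in product(domain, repeat=len(variables)):
        mu = dict(zip(variables, values))
        image = {(rel, tuple(mu.get(t, t) for t in args)) for rel, args in body}
        if image <= db:  # mu(a) is in db for every atom a of the body
            result.add(tuple(mu[x] for x in head))
    return result


# Hypothetical database (NOT the db1 of Figure 2.1).
db = {
    ("Project", (3, "Sue", "Sue")),
    ("Project", (4, "Bob", "Sue")),
    ("WorksOn", (3, "Jeffrey")),
    ("WorksOn", (4, "Ann")),
}
body = [("Project", ("?pid", "?mgr", "?mgr")), ("WorksOn", ("?pid", "?worker"))]
answers = evaluate(("?worker",), body, db)
```

On this toy database only project 3 is managed and audited by the same person, so Jeffrey is the only answer. With an empty head the function returns either the set containing the empty tuple or the empty set, which matches the true/false reading of boolean conjunctive queries.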

When the head of a conjunctive query Q() ← a₁, . . . , a_n is empty, we say that Q is a boolean conjunctive query, and write Q ← a₁, . . . , a_n, omitting the head from the notation. We also adopt the following semantics: Q(db) = true if there is an embedding of Q in db, and Q(db) = false otherwise.


Join-based semantics Alternatively, the semantics of a conjunctive query can be built bottom up. This semantics proposes a more operational view and will be used when describing query processing in Chapter 9.

Definition 2.11. Two valuations µ₁ and µ₂ are compatible, denoted µ₁ ∼ µ₂, when for all common variables x ∈ dom(µ₁) ∩ dom(µ₂) it is the case that µ₁(x) = µ₂(x).

Clearly, if µ₁ and µ₂ are compatible, then µ₁ ∪ µ₂ is again a valuation.

Given two sets of valuations Ω₁ and Ω₂, the join of Ω₁ and Ω₂ is defined as Ω₁ ⋈ Ω₂ = {µ₁ ∪ µ₂ | µ₁ ∈ Ω₁, µ₂ ∈ Ω₂, µ₁ ∼ µ₂}. The projection of a set of valuations Ω to a tuple x over V is defined as π_x(Ω) = {µ|_var(x) | µ ∈ Ω}.

We are now ready to give the following alternative “bottom-up” definition of the semantics of conjunctive queries.

Definition 2.12. Let Q(x) ← a₁, . . . , a_n be a conjunctive query and let db be a database. The semantics of the evaluation of Q on db, denoted ⟦Q⟧_db, is defined inductively as follows:

⟦a⟧_db := {µ : var(a) → U | µ(a) ∈ db},

⟦{a₁, . . . , a_n}⟧_db := ⟦a₁⟧_db ⋈ · · · ⋈ ⟦a_n⟧_db,

⟦Q⟧_db := ⟦Q(x) ← a₁, . . . , a_n⟧_db := π_x(⟦{a₁, . . . , a_n}⟧_db).

The result of evaluating Q on db under this semantics is the set Q(db) = {µ(x) | µ ∈ ⟦Q⟧_db}.

This definition of the semantics is equivalent to that of Definition 2.9. Indeed, for a query Q(x) ← a₁, . . . , a_n, the valuations in the set ⟦{a₁, . . . , a_n}⟧_db are embeddings of Q into db, and the subsequent projection corresponds to the application of the embedding to x. Conversely, each embedding µ of Q into db is by definition a valuation. Clearly, µ|_a_i ∈ ⟦a_i⟧_db for every a_i, and hence the restriction of µ to var(Q) belongs to ⟦{a₁, . . . , a_n}⟧_db.
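The bottom-up semantics translates directly into three small operations: matching a single atom against the database, joining sets of compatible valuations, and projecting on the head. The following is a minimal sketch under assumed conventions that are not part of the thesis: variables are strings starting with "?", valuations are dictionaries, atoms and facts are (relation, tuple) pairs, and the toy database is hypothetical.

```python
from functools import reduce


def is_var(t):
    # Assumed convention: variables are strings that start with "?".
    return isinstance(t, str) and t.startswith("?")


def match(atom, db):
    """All valuations mu with domain var(a) such that mu(a) is a fact of db."""
    rel, args = atom
    result = []
    for fact_rel, fact_args in db:
        if fact_rel != rel or len(fact_args) != len(args):
            continue
        mu = {}
        for t, v in zip(args, fact_args):
            # A variable must be bound consistently; a value must match exactly.
            expected = mu.setdefault(t, v) if is_var(t) else t
            if expected != v:
                break
        else:
            result.append(mu)
    return result


def compatible(mu1, mu2):
    """The two valuations agree on every variable they share."""
    return all(mu1[x] == mu2[x] for x in mu1.keys() & mu2.keys())


def join(omega1, omega2):
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]


def evaluate(head, body, db):
    # The singleton holding the empty valuation is neutral for the join.
    joined = reduce(join, (match(a, db) for a in body), [{}])
    projected = [{x: mu[x] for x in head} for mu in joined]  # projection on x
    return {tuple(mu[x] for x in head) for mu in projected}


# Hypothetical database, for illustration only.
db = {
    ("Project", (3, "Sue", "Sue")),
    ("Project", (4, "Bob", "Sue")),
    ("WorksOn", (3, "Jeffrey")),
    ("WorksOn", (4, "Ann")),
}
body = [("Project", ("?pid", "?mgr", "?mgr")), ("WorksOn", ("?pid", "?worker"))]
answers = evaluate(("?worker",), body, db)
```

Unlike a naive enumeration of all valuations, this evaluation only ever builds valuations that already match some fact, which is what makes the join-based view the more operational one.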

2.1.3 Acyclic Conjunctive Queries

The problem of evaluating an arbitrary given conjunctive query on an arbitrary given database, that is, the combined complexity [126] of conjunctive query evaluation, is known to be NP-complete [24]. This implies that to process an arbitrary conjunctive query one can essentially not do better than trying all possible combinations of mapping query atoms to database facts (assuming P ≠ NP). For certain queries exhibiting a particular structure, however, a better strategy exists. To illustrate, consider the following conjunctive query:

Q(p₂) ← Project(p₁, m₁, m₂), Project(p₂, m₂, m₃), Project(p₃, m₃, m₄).

It selects the projects p₂ whose manager m₂ is the auditor of some other project, and whose auditor m₃ is the manager of some other project. To evaluate this query efficiently, we can start by looking for Project facts t whose manager m₂ is the auditor of some other Project fact t′. The facts t that have a matching t′ are all collected in a temporary relation Project′. We then look for the Project′ facts u whose auditor is the manager in some fact u′ of Project. We collect these facts u in the temporary relation Project″. The answer of the query is exactly the set of all projects p₂ in Project″.

Figure 2.2: Tree-shaped query evaluation plan.

Figure 2.2 shows a corresponding, tree-shaped, evaluation plan. Observe that facts of the Project′ and Project″ relations are all facts coming from the bottom left relation Project(p₂, m₂, m₃): this relation is essentially filtered bottom-up through joins with other relations. Clearly, the size of Project′ and Project″ in this tree is bounded by the size of the Project relation, and the computation of these relations is in PTime. Following the exposition of Abiteboul et al. [4], we now generalize this intuition to capture the queries that similarly exhibit a well-behaved, tree-shaped execution plan.
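The filtering strategy just described can be sketched in a few lines. The Project relation below is hypothetical data invented for this illustration, with each fact written as a (project, manager, auditor) triple; the names project_1 and project_2 stand for the temporary relations Project′ and Project″.

```python
# Hypothetical Project relation: (project id, manager, auditor).
project = {
    ("p1", "ann", "bob"),
    ("p2", "bob", "cat"),
    ("p3", "cat", "dan"),
    ("p4", "eve", "eve"),
}

# Project': the facts t whose manager is the auditor in some fact t'.
auditors = {auditor for _, _, auditor in project}
project_1 = {t for t in project if t[1] in auditors}

# Project'': the facts u of Project' whose auditor is the manager in some
# fact u' of Project.
managers = {manager for _, manager, _ in project}
project_2 = {u for u in project_1 if u[2] in managers}

# The answer consists of the project ids that survive both filters.
answer = {pid for pid, _, _ in project_2}

# For comparison only: the naive evaluation that tries every triple of facts.
naive = {
    b[0]
    for a in project
    for b in project
    for c in project
    if a[2] == b[1] and b[2] == c[1]
}
```

Each filtering step scans the relation once against a precomputed set, so the intermediate relations never grow beyond the size of Project, whereas the naive evaluation inspects every combination of three facts. Note that p4 qualifies because nothing in the query forces the three facts to be distinct.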

Definition 2.13 (Join forest). Let A be a finite set of atoms. A join forest for A is a forest F (i.e., an acyclic undirected graph) with set of nodes A such that, for each pair of atoms a, b ∈ A that have variables in common, the following two conditions hold:

1. a and b belong to the same connected component of F; and

2. every variable in var(a) ∩ var(b) appears in every node of the (unique) path from a to b in F.

The depth of a join forest F is the length of the longest path between any two connected nodes in F.


Project(p₂, m₂, m₃)
├── Project(p₁, m₁, m₂)
└── Project(p₃, m₃, m₄)

Figure 2.3: A join tree for the set {Project(p₁, m₁, m₂), Project(p₃, m₃, m₄), Project(p₂, m₂, m₃)}.

A join forest is called a join tree if F is a tree. A forest is a join forest for a conjunctive query Q if it is a join forest for body(Q).

To illustrate, Figure 2.3 shows a join tree for the set {Project(p₁, m₁, m₂), Project(p₃, m₃, m₄), Project(p₂, m₂, m₃)}.

Definition 2.14 (Acyclic Conjunctive Query). A set of atoms is acyclic if there exists a join forest for it. A conjunctive query Q is acyclic if body(Q) is acyclic.

In particular, Figure 2.3 is a join tree for the conjunctive query Q(m₂) ← Project(p₁, m₁, m₂), Project(p₃, m₃, m₄), Project(p₂, m₂, m₃), which is, hence, acyclic.

In contrast with arbitrary conjunctive queries, the combined complexity of acyclic conjunctive query evaluation is known to be in PTime [134].
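To make the two conditions of Definition 2.13 tangible, the sketch below checks whether a given undirected graph over a set of atoms is a join forest. It only verifies a candidate forest; it does not search for one. The encoding is an assumption for illustration: each atom is identified by a hypothetical label (A1, A2, A3) and represented by its set of variables.

```python
from itertools import combinations


def is_join_forest(atoms, edges):
    """atoms: dict mapping an atom label to its set of variables.
    edges: iterable of (label, label) pairs forming the candidate forest."""
    adjacent = {a: set() for a in atoms}
    parent = {a: a for a in atoms}

    def find(a):  # root of a's component (simple union-find, no compression)
        while parent[a] != a:
            a = parent[a]
        return a

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # this edge closes a cycle: not a forest
        parent[root_a] = root_b
        adjacent[a].add(b)
        adjacent[b].add(a)

    def path(a, b):  # the unique path from a to b, or None if disconnected
        stack, seen = [(a, [a])], {a}
        while stack:
            node, nodes_so_far = stack.pop()
            if node == b:
                return nodes_so_far
            for nxt in adjacent[node] - seen:
                seen.add(nxt)
                stack.append((nxt, nodes_so_far + [nxt]))
        return None

    for a, b in combinations(atoms, 2):
        shared = atoms[a] & atoms[b]
        if not shared:
            continue
        p = path(a, b)
        # Condition 1: same component. Condition 2: shared variables occur
        # in every node on the path.
        if p is None or any(not shared <= atoms[n] for n in p):
            return False
    return True


# The three Project atoms, labelled A1, A2, A3 for this sketch.
atoms = {
    "A1": {"p1", "m1", "m2"},
    "A2": {"p2", "m2", "m3"},
    "A3": {"p3", "m3", "m4"},
}
tree_of_figure = [("A2", "A1"), ("A2", "A3")]
other_tree = [("A1", "A3"), ("A3", "A2")]
```

Here tree_of_figure corresponds to the shape of Figure 2.3 and passes the check, while other_tree fails because the path from A1 to A2 goes through A3, which does not contain the shared variable m2.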

Hypergraphs and Acyclicity Because of their low complexity, the class of acyclic conjunctive queries has been studied extensively in the literature [37, 46, 134]. These studies have led to many distinct, but equivalent, characterisations of acyclicity [4]. In addition to the characterisation in terms of join forests, we will use another characterisation formulated in terms of hypergraphs. We follow the exposition by Fagin [37].

A hypergraph generalizes the classical notion of an undirected graph, by allowing edges to connect more than two nodes at the same time.

Definition 2.15. A hypergraph H is a pair (N, E), where N is a set of nodes and E is a set of edges (also called hyperedges), which are arbitrary nonempty subsets of N.

The classical notion of a path in an undirected graph is generalized to hypergraphs as follows.


Definition 2.16. A path from a node s to a node t in a hypergraph (N, E) is a sequence of k ≥ 1 edges E₁, . . . , E_k ∈ E such that: s ∈ E₁, t ∈ E_k, and E_i ∩ E_{i+1} ≠ ∅, for every 1 ≤ i < k. Two nodes (or two edges) are connected if there is a path from one to the other. A set of nodes (or a set of edges) is connected if all of its pairs of nodes (resp. edges) are connected.
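Whether two nodes are connected in the sense of Definition 2.16 can be decided by a simple search over hyperedges. The sketch below is an illustration with hypothetical edges: it starts from the edges containing s and repeatedly moves to edges that intersect an edge already reached.

```python
def connected(s, t, edges):
    """True if there is a path E_1, ..., E_k with s in E_1, t in E_k, and
    consecutive edges sharing at least one node."""
    edges = [frozenset(e) for e in edges]
    frontier = [e for e in edges if s in e]
    seen = set(frontier)
    while frontier:
        e = frontier.pop()
        if t in e:
            return True
        for f in edges:
            if f not in seen and e & f:  # f overlaps an edge already reached
                seen.add(f)
                frontier.append(f)
    return False


# Hypothetical hypergraph with two components.
edges = [{"a", "b", "c"}, {"c", "d"}, {"e", "f"}]
```

With these edges, a and d are connected through the shared node c, whereas a and e are not. A node that occurs in no edge is connected to nothing, since a path needs at least one edge.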

Each conjunctive query naturally induces a hypergraph, as follows.

Definition 2.17 (Induced hypergraph). The hypergraph induced by a conjunctive query Q is the hypergraph (N, E) with

– N =var(Q), the set of all variables mentioned in Q; and

– E = {var(a) | a ∈ body(Q)} the set of hyperedges induced by atoms of Q.

Example 2.18. Consider the query of Example 2.5. The hypergraph induced by this query has, for its nodes, the set {pid, mgr, worker}. It comprises two hyperedges: {pid, mgr} and {pid, worker}, induced by the atoms a₁ and a₂ respectively.
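The construction of Definition 2.17 is short enough to write down directly. In the sketch below, the conventions are assumptions made for illustration: variables are strings starting with "?", atoms are (relation, tuple) pairs, and atoms without variables contribute no hyperedge, since hyperedges must be nonempty.

```python
def is_var(t):
    # Assumed convention: variables are strings that start with "?".
    return isinstance(t, str) and t.startswith("?")


def induced_hypergraph(body):
    """Return (nodes, edges): nodes are the variables of the query and each
    atom contributes the set of its variables as a hyperedge."""
    edges = {frozenset(t for t in args if is_var(t)) for _, args in body}
    edges.discard(frozenset())  # hyperedges are nonempty by definition
    nodes = set().union(*edges)
    return nodes, edges


# Body of the query of Example 2.5.
body = [("Project", ("?pid", "?mgr", "?mgr")), ("WorksOn", ("?pid", "?worker"))]
nodes, edges = induced_hypergraph(body)
```

For this body the repeated variable mgr collapses into a single node of its hyperedge, which is why the Project atom induces a hyperedge of size two rather than three.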

Definition 2.19. The reduction of the hypergraph (N, E) is obtained by removing from E each edge that is a proper subset of another edge. A hypergraph is reduced if it is equal to its reduction.

Definition 2.20. Given a hypergraph (N, E), a set of partial edges generated by a set of nodes M ⊆ N is obtained by intersecting the edges in E with M. That is, the set of partial edges generated by M is the reduction of the set {E ∩ M | E ∈ E} − {∅} of edges. A set B is said to be a node-generated set of partial edges if B is the set of partial edges generated by M ⊆ N, for some M.

Example 2.21. Once again, we consider the conjunctive query of Example 2.5 and its induced hypergraph. The set of partial edges generated by the set M = {pid, mgr} of nodes is the single hyperedge {pid, mgr}. Indeed, observe that the intersection of the hyperedge {pid, worker} with M is exactly the partial hyperedge {pid}. This partial hyperedge is contained in the partial hyperedge {pid, mgr} and is therefore not in the reduction.
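Definitions 2.19 and 2.20 amount to two set comprehensions. The following sketch, with hyperedges encoded as frozensets (an encoding chosen for this illustration), recomputes the partial edges of Example 2.21.

```python
def reduction(edges):
    """Drop every edge that is a proper subset of another edge."""
    edges = {frozenset(e) for e in edges}
    return {e for e in edges if not any(e < f for f in edges)}


def partial_edges(edges, m):
    """Partial edges generated by the node set m: intersect every edge with
    m, discard the empty set, and take the reduction of what remains."""
    m = frozenset(m)
    return reduction({frozenset(e) & m for e in edges} - {frozenset()})


# Hyperedges induced by the query of Example 2.5.
edges = [{"pid", "mgr"}, {"pid", "worker"}]
generated = partial_edges(edges, {"pid", "mgr"})
```

The intermediate set contains both {pid, mgr} and {pid}; the reduction then removes {pid}, leaving the single partial edge stated in the example.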

Recall that in an undirected graph, an articulation point is a node whose removal increases the number of connected components. For hypergraphs, the notion of articulation point is generalized to the notion of articulation set.

Definition 2.22. Let ℱ be a connected, reduced set of partial edges, and let E and F be in ℱ. Let Q = E ∩ F. We say that Q is an articulation set of ℱ if the set of partial edges {G − Q | G ∈ ℱ} − {∅} is not connected.


It is well-known that a connected undirected graph is cyclic if and only if one of its subgraphs with at least two distinct edges has no articulation point.

Indeed, the “only if” direction is readily obtained. For the “if” direction, consider a subgraph G that is connected, has at least two distinct edges, and does not have an articulation point. Pick a node n arbitrarily. Then n must have at least two neighbors in the subgraph (if not, n is either isolated, or the single neighbor of n is an articulation point since G has at least two distinct edges). Pick p and q to be distinct neighbors of n in G. Since there is no increase in the number of connected components when we remove n from G, there is a path P in G from p to q that does not traverse n. Hence, the sequence P, n, p is a cycle from p to p. Thus the graph that contains G is cyclic.

Put in the contrapositive, a classical connected undirected graph is acyclic if, and only if, every subgraph that does not have an articulation point has fewer than two edges. By considering subgraphs in the hypergraph setting to correspond to node-generated sets of partial hyperedges, this characterization of acyclicity yields the following generalization of acyclicity to hypergraphs.

Definition 2.23 (Hypergraph Acyclicity). A block of a reduced hypergraph is a connected, node-generated set of partial edges with no articulation set. A block is trivial if it contains fewer than two members. A reduced hypergraph is acyclic if all its blocks are trivial. A hypergraph is said to be acyclic if its reduction is.

Observe that no block can be formed from exactly two partial edges. Indeed, these two edges are either disconnected or their intersection forms an articulation set.

Example 2.24. Consider a hypergraph H with the following edges:

E₁ = {a, b, c}

E₂ = {a, c, d}

E₃ = {a, b, d}

E₄ = {b, c, d}

Note that H itself equals the set of partial hyperedges of H generated by the set {a, b, c, d}. This set is clearly connected and reduced. Furthermore, it has no articulation set, and it is not trivial. Therefore, H itself forms a non-trivial block of H. Hence H is a cyclic hypergraph.

Now, consider a hypergraph H′ with the following edges:

E₁ = {a, b, c}

E₂ = {a, d, e}

E₃ = {a, b, e}

Observe that the set of partial hyperedges generated by {a, b, c, d, e} has an articulation set {a, b}. Similar observations can be made for blocks generated from any subset of {a, b, c, d, e}. Therefore, H′ is an acyclic hypergraph.
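For hypergraphs as small as those of Example 2.24, Definition 2.23 can be checked by brute force: enumerate every node set M, compute the partial edges it generates, and test whether they form a non-trivial block. The sketch below does this literally. It is exponential in the number of nodes and is meant only as an executable reading of the definitions, not as a practical acyclicity test; the function names are chosen for this illustration.

```python
from itertools import combinations


def reduction(edges):
    edges = {frozenset(e) for e in edges}
    return {e for e in edges if not any(e < f for f in edges)}


def partial_edges(edges, m):
    m = frozenset(m)
    return reduction({e & m for e in edges} - {frozenset()})


def is_connected(edges):
    """Every edge can be reached from the first one via overlapping edges."""
    edges = list(edges)
    if len(edges) <= 1:
        return True
    seen, stack = {edges[0]}, [edges[0]]
    while stack:
        e = stack.pop()
        for f in edges:
            if f not in seen and e & f:
                seen.add(f)
                stack.append(f)
    return len(seen) == len(edges)


def has_articulation_set(edges):
    """Some intersection Q of two edges disconnects the set when removed."""
    for e, f in combinations(edges, 2):
        q = e & f
        if not is_connected({g - q for g in edges} - {frozenset()}):
            return True
    return False


def is_acyclic(edges):
    edges = reduction(edges)
    nodes = sorted(set().union(*edges))
    for size in range(1, len(nodes) + 1):
        for m in combinations(nodes, size):
            block = partial_edges(edges, m)
            if (len(block) >= 2 and is_connected(block)
                    and not has_articulation_set(block)):
                return False  # found a non-trivial block
    return True


# The two hypergraphs of Example 2.24.
h = [{"a", "b", "c"}, {"a", "c", "d"}, {"a", "b", "d"}, {"b", "c", "d"}]
h_prime = [{"a", "b", "c"}, {"a", "d", "e"}, {"a", "b", "e"}]
```

Running is_acyclic on h finds the non-trivial block generated by {a, b, c, d} and reports the hypergraph as cyclic, while every node-generated set of h_prime is either trivial, disconnected, or has an articulation set.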

For conjunctive queries, the notion of acyclicity in terms of blocks of the induced hypergraph is equivalent to that in terms of join forests [4]:

Proposition 2.25. Let Q be a conjunctive query. The following are equivalent:

1. The hypergraph of Q is acyclic; and

2. Q has a join forest.

2.2 RDF and SPARQL

In this section, we present the RDF data model along with the SPARQL query language. We follow the official specifications [61, 68, 100] as well as the notation and exposition of Pérez et al. [98].

2.2.1 RDF

Data in RDF is built from three disjoint sets I, B, L that are part of the universe U. These sets are called Internationalized Resource Identifiers (IRIs), blank nodes, and literals, respectively. Elements of I ∪ B are collectively called resources. For convenience we will use shortcuts like IBL and IB to denote the unions I ∪ B ∪ L and I ∪ B, respectively.

All information in RDF is represented as triples of the form (s, p, o), where s is called the subject, p is called the predicate, and o is called the object. To be valid, it is required that s ∈ IB; p ∈ I; and o ∈ IBL. As we saw in the introduction (Figure 1.1), RDF triples can also be seen as graphs. For this reason, finite sets of RDF triples are also called RDF graphs.
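The validity condition on triples is a simple membership test once the three disjoint sets are told apart. The sketch below models I, B, and L with three hypothetical wrapper classes (an encoding assumed for this illustration, with made-up example.org identifiers) and checks s ∈ IB, p ∈ I, and o ∈ IBL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str


def is_valid_triple(s, p, o):
    """Subject in I or B, predicate in I, object in I, B, or L."""
    return (
        isinstance(s, (IRI, BlankNode))
        and isinstance(p, IRI)
        and isinstance(o, (IRI, BlankNode, Literal))
    )


# Hypothetical identifiers, for illustration only.
alice = IRI("http://example.org/alice")
likes = IRI("http://example.org/likes")
somebody = BlankNode("b0")
name = Literal("Alice")

# An RDF graph is then just a finite set of valid triples.
graph = {(alice, likes, somebody), (somebody, likes, name)}
```

Because the three classes are distinct types, the disjointness of I, B, and L holds by construction in this encoding. A literal in subject or predicate position, or a blank node in predicate position, is rejected.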

Definition 2.26. An RDF dataset is a pair D = (G, γ) with G an RDF graph and γ a function that assigns an RDF graph γ(i) to each IRI i in a finite set dom(γ) ⊆ I.

When the RDF graph G is clear from the context, we use the terms RDF dataset and RDF graph interchangeably to refer to G.

2.2.2 SPARQL

SPARQL is a query language for RDF, proposed and standardized by the World Wide Web Consortium [100].


Abstracting away from its concrete syntax, we can view a SPARQL query Q syntactically as a 4-tuple of the form

(query-type, dataset-clause, pattern P, solution-modifier)

At the heart of Q lies the graph pattern P that searches for specific subgraphs in the input RDF dataset. Its result is a (multi-)set of valuations (Definition 2.6), each of which associates variables to elements of IBL. The dataset-clause is optional and specifies the input RDF dataset to use during pattern matching. If it is absent, the query processor itself determines the dataset to use. The optional solution-modifier allows sorting of the valuations obtained from the pattern matching, as well as returning only a specific window of valuations (e.g., valuations 1 to 10). The result is a list L of valuations. The actual output of the SPARQL query is then determined by the query-type:

– select queries return projections of valuations from L (in order);

– ask queries return a boolean: true if the graph pattern P could be matched in the input RDF dataset, and false otherwise;

– construct queries construct a new finite set of RDF triples based on the valuations in L; and

– describe queries return a set of RDF triples that describes the IRIs and blank nodes found in L. The exact content of this description is implementation-dependent.

We provide here a formal definition of SPARQL graph patterns and of select queries, but refer to the SPARQL recommendation [100] for a complete description of the syntax and semantics of query-type, dataset-clause, and solution-modifier. Following the example of Pérez et al. [98], our definition of SPARQL graph patterns does not use the concrete syntax of SPARQL, but introduces an abstract syntax that is easier to use. All SPARQL queries in concrete syntax can be represented in this abstract syntax in a straightforward manner (see also Example 2.29 below).

Graph Patterns Recall that V denotes the set of variables, which is disjoint from U (and hence disjoint from the sets I, B, and L). A triple pattern is an element of (I ∪ V) × (I ∪ V) × (I ∪ L ∪ V). A graph pattern is an expression that can be generated by the following grammar:

P ::= t | P₁ and P₂ | P₁ union P₂ | P₁ opt P₂
    | P filter R | graph i P | graph x P

Here, t ranges over triple patterns, i ranges over IRIs in I, and x ranges over variables in V. R ranges over SPARQL filter constraints. We refer
