Dependencies: Making Ontology Based Data Access Work

(1)

Dependencies: Making Ontology Based Data Access Work in Practice

Mariano Rodr´ıguez-Muro and Diego Calvanese KRDB Research Centre, Free University of Bozen-Bolzano

Piazza Domenicani 3, Bolzano, Italy {rodriguez,calvanese}@inf.unibz.it

Abstract. Query answering in Ontology Based Data Access (OBDA) exploits the knowledge of an ontology’s TBox to deal with incompleteness of the ABox (or data source). Current query-answering techniques with DL-Lite require exponential size query reformulations, or expensive data pre-processing, and hence may not be suitable for data intensive scenarios. Also, these techniques present severe redundancy issues when dealing with ABoxes that are already (partially) complete. It has been shown that addressing redundancy is not only required for tractable implementations of decision procedures, but may also allow for sizable improvements in execution times. Considering the previous observations, in this paper we present two complementary sets of results that aim at improving query answering performance in OBDA systems. First, we show that we can characterize completeness of an ABox by means of dependencies, and that we can use these to optimize DL-Lite TBoxes. Second, we show that in OBDA systems we can create ABox repositories that appear to be complete w.r.t. a significant portion of any DL- Lite TBox. The combination of these results allows us to design OBDA systems in which redundancy is minimal, the exponential aspect of query answering is notably reduced and that can be implemented efficiently using existing RDBMSs.

1 Introduction

The current approaches to Ontology Based Data Access (OBDA) with lightweight Description Logics (DLs) of theDL-Litefamily [1] rely on query reformulation. These techniques are based on the idea of using the ontology to rewrite a given query into a new query that, when evaluated over the data sources, returns the certain answers to the original query. Experiments with unions of conjunctive queries (UCQs) have shown that reformulations may be very large, and that the execution of these reformulations suffers from poor performance. This triggered the development of alternative reformulation techniques [6, 9], in which the focus has been on the reduction of the number of generated queries/rules. These techniques have shown some success, however query reformulation in all of them is still worst-case exponential in the size of the original query. Alternative approaches [4] use the expansion of the extensional layer of the ontology (i.e., the ABox) w.r.t. the intensional knowledge (i.e., the TBox) to avoid query reformulation almost entirely. However, the cost of data expansion imposes severe limitations on the system.

We believe that approaching the problem of the high cost of query answering in OBDA systems requires a change of focus: namely, from the ’number of queries’ perspective, to

(2)

the perspective that takes into account the ’duplication in the answers’ appearing in query results under SQL multiset semantics. Duplication in results is a sign ofredundancyin the reasoning process; it not only generated not only by the reformulation procedures as traditionally thought, since also techniques based on ABox expansion show this problem. Instead, redundancy is the consequence of ignoring the semantics of the data sources. In particular, when the data in a source (that is used to populate the ABox of the ontology) already satisfies an inclusion assertion of the TBox (i.e., iscompletew.r.t. such an inclusion assertion), then using that inclusion assertion during query answering might generate redundant answers [8]. It has been noted [3] that it is not hard to find examples in which the runtime of decision procedures can change from exponential to polynomial if redundancy is addressed, and this is also the case in OBDA query answering. In this paper we address both problems, redundancy and the exponential blow-up of query reformulations, by following two complementary directions.

First, we present an approach to take into account completeness of the data with respect to the TBox. We characterize completeness usingABox dependenciesand show that it is possible to use dependencies to optimize the TBox in order to avoid redundant computations independently of the reasoning technique. Second, we focus on how we can optimally complete ABoxes in OBDA systems by relying on the fact that in OBDA systems it is possible to manipulate not only the data, but also the mappings and database schema. This allows us to conceive procedures to store an ABox in a source in such a way that itappearsto be complete with respect to a significant portion of the TBox, but without actuallyexpandingthe data. We present two such procedures, one for general and one for ’virtual’ OBDA systems, both designed to take advantage of the features of modern RDBMSs effectively. Our results allow for the design of systems that can delegate reasoning tasks (e.g., dealing with hierarchies, existentially quantified individuals, etc.) to stages of the reasoning process where these tasks can be handled most effectively. The result is a (sometimes dramatic) reduction of the exponential runtime and an increase in the quality of the answers due to the reduction of duplication.

The rest of the paper is organized as follows: Section 2 gives technical preliminaries.

Section 3 presents our first main contribution, a general technique for optimizing TBoxes w.r.t. dependencies. Section 4 introduces data dependencies in OBDA systems, describing why it is natural to expect completeness of ABoxes. Section 5 presents two techniques for completing ABoxes in OBDA systems, and Section 6 concludes the paper.

2 Preliminaries

In the rest of the paper, we assume a fixed vocabularyV ofatomic concepts, denotedA (possibly with subscripts), andatomic roles, denotedP, representing unary and binary relations, respectively, and an alphabetΓ of(object) constants.

Databases. In the following, we regard adatabase (DB)as a pairD=hR,Ii, whereR is a relational schema andIis an instance ofR. The active domainΓDofDis the set of constants appearing inI, which we callvalue constants. An SQL queryϕover a DB schemaRis a mapping from a DB instanceIofRto a set of tuples.

DL-Liteontologies. We introduce the DLDL-Lite_F, on which we base our results.

InDL-Lite_F, abasic role, denotedR, is an expression of the formP orP⁻, and a

(3)

basic concept, denotedB, is an expression of the formAor∃R. Anontologyis a pair O=hT,AiwhereT is a TBox andAan ABox. A TBox is a finite set of assertions of the formB₁ v B₂, called(positive) inclusions,B₁ v ¬B₂, calleddisjointness assertions, or(functR), calledfunctionality assertions. An ABox is a finite set of membership assertionsof the formA(c)orP(c, c⁰), wherec, c⁰∈Γ.

Queries over ontologies. Anatomis an expression of the formA(t)orP(t, t⁰), where tandt⁰areatom terms, i.e., variables or constants inΓ. An atom isgroundif it contains no variables. Aconjunctive query(CQ)qover an ontologyOis an expression of the formq(x)←β(x,y), wherexis a tuple of distinct variables, calleddistinguished,yis a tuple of distinct variables not occurring inx, callednon-distinguished, andβ(x,y)is aconjunctionof atoms with variables inxandy, whose predicates are atomic concepts and roles ofO. We callq(x)theheadof the query andβ(x,y)itsbody. Aunion of CQs(UCQ) is a set of CQs (called disjuncts) with the same head. Given a CQQwith bodyβ(z)and a tuplevof constants of the same arity asz, we call aground instance ofQthe setβ[z/v]of ground atoms obtained by replacing inβ(z)each variable with the corresponding constant fromv.

Semantics. Aninterpretation I = (∆Î,·Î)consists of a non-empty interpretation domain∆Îand aninterpretation function·Îthat assigns to each constantcan element cÎof∆Î, to each atomic conceptAa subsetAÎ of∆Î, and to each atomic roleP a binary relation over∆Î. Moreover, basic roles and basic concepts are interpreted as follows:(P⁻)Î={(o₂, o₁)|(o₁, o₂)∈PÎ}and(∃R)Î={o| ∃o⁰.(o, o⁰)∈RÎ}. An interpretationI is amodelofB₁vB₂ifBÎ₁ ⊆B₂Î, ofB₁v ¬B₂ifBÎ₁ ∩B₂Î =∅, and of(functR)if for eacho, o₁, o₂∈∆Îwe have that(o, o₁)∈RÎand(o, o₂)∈RÎ implieso1=o2. Also,Iis a model ofA(c)ifcÎ∈AÎ, and ofP(c, c⁰)if(cÎ, c^0I)∈PÎ. InDL-Lite_F, we adopt the Unique Name Assumption (UNA), which enforces that for each pair of constantso1,o2, ifo16=o2, thenoÎ₁ 6=oÎ₂. For aDL-Lite_Fassertionα(resp., a setΘofDL-Lite_Fassertions),I |=α(resp.,I |=Θ) denotes thatIis a model ofα (resp.,Θ). Amodel of an ontologyO=hT,Aiis an interpretationIsuch thatI |=T andI |=A. An ontology issatisfiableif it admits a model. An ontologyOentailsan assertionα, denotedO |=α, if every model ofOis also a model ofα. Similarly, for a TBoxT and an ABoxAinstead ofO. Thesaturationof a TBoxT, denotedsat(T), is the set ofDL-LiteF assertionsαs.t.T |=α. Notice thatsat(T)is finite, hence a TBox.

LetΓAdenote the set of constants appearing in an ABoxA. Theanswerto a CQ Q=q(x)←β(x,y)overO=hT,Aiin an interpretationI, denotedans(Q,O,I), is the set of tuplesc∈ΓA× · · · ×ΓAsuch that there exists a tuplec⁰∈ΓA× · · · ×ΓA

such that the ground atoms inβ[(x,y)/(c,c⁰)]are true inI. The answer to an UCQ QinIis the union of the answers to each CQ inQ. Thecertain answerstoQinO, denotedcert(Q,O), is the intersection of everyans(Q,O,I)for all modelsIforO.

Theanswer toQover an ABoxA, denotedeval(Q,A), is the answers toQoverA viewed as a DB instance. Aperfect reformulationofQw.r.t. a TBoxT is a queryQ⁰such that for every ABoxAsuch thathT,Aiis satisfiable,cert(Q,hT,Ai) =eval(Q⁰,A).

Mappings. We adopt the definitions for ontologies with mappings from [7]. First we extend interpretations to be able to create object constants from the value constants in a DBD. Given an alphabetΛoffunction symbolswe define the setτ(Λ, ΓD)ofobject termsas the set of all terms of the formf(d1, . . . , dn), wheref ∈Λ, the arity off is

(4)

n, andd₁, . . . , d_n ∈Γ_D. We setΓ =Γ_D∪τ(Λ, Γ_D), and we extend the interpretation function so that for eachc ∈τ(Λ, Γ_D)we have thatc^I ∈∆^I. We extend queries by allowing the use of predicate arguments that arevariable terms, i.e., expressions of the formf(t), wheref ∈Λwith aritynandtis ann-tuple of variables or value constants.

Given a TBoxT and a DBD, amapping (assertion)mforT is an expression of the formϕ(x);ψ(t)whereϕ(x)is an SQL query overDwith answer variablesx, and ψ(t)is a CQ over T without non-distinguished variables using variable terms over variables inx. We call the mappingsimpleif the body ofψ(t)consists of a single atom, andcomplexotherwise. A simple mappingis foran atomic conceptA(resp., atomic role P) if the atom in the body ofψ(t)hasA(resp.,P) as predicate symbol. In the following, we might abbreviate the queryψin a mapping by showing only its body. Avirtual ABox Vis a tuplehD,Mi, whereDis a DB andMa set of mappings, and anontology with mappingsis a tupleOM=hT,Vi, whereT is a TBox andV =hD,Miis a virtual ABox in whichMis a set of mappings forT.

An interpretationI satisfiesa mapping assertionϕ(x) ;ψ(t)w.r.t. a DBD = hR,Iiif for every tuplev∈ϕ(I)and for every ground atomXinψ[x/v]we have that:

ifXhas the formA(f(c)), then(f(c))Î ∈AÎ, and ifXhas the formP(f1(c1),f2(c2)), then ((f1(c1))Î,f2(c2)Î) ∈ PÎ. An interpretation I is a model ofV = hD,Mi, denotedI |=V, if it satisfies every mapping inMw.r.t.D. A virtual ABoxVentailsan ABox assertionα, denotedV |=α, if every model ofVis a model ofα.Iis a model of OM=hT,ViifI |=T andI |=V. As usual,OMissatisfiableif it admits a model.

We note that, in an ontology with mappingsOM=hT,hD,Vii, we can always replace Mby a set ofsimplemappings, while preserving the semantics ofOM. It suffices to spliteach complex mappingϕ;ψinto a set of simple mappings that share the same SQL queryϕ(see [7]). In the following, we assume to deal only with simple mappings.

Dependencies. ABox dependencies are assertions that restrict thesyntactic form of allowed ABoxes. In this paper, we focus on unary inclusion dependencies only. A (unary) inclusion dependencyis an assertion of the formB₁v_AB₂, whereB₁andB₂ are basic concepts. In the following, for a basic roleRand constantsc,c⁰,R(c, c⁰)stands forP(c, c⁰)ifR =P and forP(c⁰, c)ifR =P⁻. An ABoxAsatisfiesan inclusion dependencyσ, denotedA |=σ, if the following holds:(i)ifσisA1vAA2, then for allA1(c) ∈ Awe haveA2(c) ∈ A;(ii)ifσis∃R vA A, then for allR(c, c⁰)∈ A we haveA(c)∈ A;(iii)ifσisA vA ∃R, then for allA(c)∈ Athere existsc⁰ such thatR(c, c⁰) ∈ A.(iv)ifσis∃R1 vA ∃R2, then for allR1(c, c⁰) ∈ Athere exists c⁰⁰such thatR2(c, c⁰⁰) ∈ A. An ABoxAsatisfies a set of dependenciesΣ, denoted A |=Σ, ifA |= σfor eachσ ∈ Σ. A set of dependenciesΣentailsa dependency σ, denotedΣ |=σ, if for every ABoxAs.t.A |= Σwe also have thatA |=σ. The saturationof a setΣof dependencies, denotedsat(Σ), is the set of dependenciesσs.t.

Σ|=σ. Given two queriesQ1,Q2, we say thatQ1is contained inQ2relative toΣif eval(Q₁,A)⊆eval(Q₂,A)for each ABoxAs.t.A |=Σ.

3 Optimizing TBoxes w.r.t. Dependencies

In aDL-Lite_FontologyO=hT,Ai, the ABoxAmay beincompletew.r.t. the TBoxT, i.e., there may be assertionsB1vB2inT s.t.A 6|=B1vAB2. When computing the

(5)

certain answers to queries overO, the TBoxT is used to overcome such incompleteness.

However, an ABox may already be (partially) complete w.r.t. T, e.g., an ABox A satisfyingA₁v_A A₂is complete w.r.t.A₁vA₂. While ignoring completeness of an ABox is ’harmless’ in the theoretical analysis of reasoning overDL-Lite_Fontologies, in practice, it may introduceredundancy, which manifests itself ascontainment w.r.t.

dependencies among the disjuncts (CQs) of the perfect reformulation, making the contained disjunctsredundant. For example, letT andAbe as before, and letQbe q(x)←A2(x), then any perfect reformulation ofQmust includeq1=q(x)←A1(x) andq2=q(x)←A2(x)as disjuncts. However, sinceq1is contained inq2relative to A1vAA2, we have thatq1will not contribute new tuples w.r.t. those contributed byq2.

It is possible to use information about completeness of an ABox, expressed as a set of dependencies, to avoid redundancy in the reasoning process. One place to do this is during query reformulation, using techniques based on conjunctive query containment (CQC) with respect to dependencies to avoid the generation of redundant queries. However, this approach is expensive, since CQC is an NP-complete problem (even ignoring dependencies), and such optimizations would need to be performed every time a query is reformulated. We show now how we can improve efficiency by pre-processing the TBox before performing reformulation. In particular, given a TBoxT and a setΣof dependencies, we show how to compute a TBoxT⁰that is smaller thanT and such that for every queryQthe certain answers are preserved ifQis executed over an ABox that satisfiesΣ. Specifically, our objective is to determine when an inclusion assertion ofT isredundant w.r.t.Σ, and to do so we use the following auxiliary notions.

Definition 1. LetT be a TBox,B,Cbasic concepts, andΣa set of dependencies over T. AT-chain fromBtoCinT (resp., aΣ-chain fromBtoCinΣ) is a sequence of concept inclusion assertions(B_i v B_i⁰)ⁿ_i=0 inT (resp., a sequence of inclusion dependencies(BivAB_i⁰)ⁿ_i=0inΣ), for somen≥0, such that:B0=B,B⁰_n=C, and for1≤i≤n, we have thatB_i−1⁰ andBiare basic concepts s.t., either (i)B_i−1⁰ =Bi, or (ii)B_i−1⁰ =∃RandBi=∃R⁻, for some basic roleR.

Intuitively, when there is aT-chain fromBtoC, the existence of an instance ofBin a model implies the existence of an instance ofC. For aΣ-chain, this holds for ABox assertions. We useT-chains andΣ-chains to characterize redundancy as follows.

Definition 2. LetT be a TBox,B,Cbasic concepts, andΣa set of dependencies. The concept inclusion assertionB vCisdirectly redundant inT w.r.t.Σif (i)Σ|=B vA

C and (ii) for everyT-chain(Bi vB⁰_i)ⁿ_i=0withB⁰_n = B inT, there is aΣ-chain (BivAB⁰_i)ⁿ_i=0. Then,BvCisredundant inT w.r.t.Σif (a) it is directly redundant,

or (b) there existsB⁰ 6=Bs.t. (i)T |=B⁰vC, (ii)B⁰vCis not redundant inT w.r.t.

Σ, and (iii)B vB⁰is directly redundant inT w.r.t.Σ.

Given a TBoxT and a set of dependenciesΣ, we apply our notion of redundancy w.r.t.Σto the assertions in the saturation ofT to obtain a TBoxT⁰that is equivalent to T for certain answer computation.

Definition 3. Given a TBoxT and a set of dependenciesΣ overT, theoptimized version ofT w.r.t.Σ, denotedoptim(T, Σ), is the set of inclusion assertions{α ∈ sat(T)|αis not redundant insat(T)w.r.t.sat(Σ)}.

(6)

Correctness of using T⁰ = optim(T, Σ)instead of T when computing the certain answers to a query follows from the following theorem.

Theorem 1. Let T be a TBox andΣa set of dependencies overT. Then for every ABoxAsuch thatA |=Σand every UCQQoverT, we have thatcert(Q,hT,Ai) = cert(Q,hoptim(T, Σ),Ai).

Proof. First we note that during query answering, only the positive inclusions are relevant, hence we ignore disjointness and functionality assertions. Sincesat(T)adds to T only entailed assertions,cert(Q,hT,Ai) =cert(Q,hsat(T),Ai), for everyQand A, and we can assume w.l.o.g. thatT =sat(T). Moreover,cert(Q,hT,Ai)is equal to the evaluation ofQoverchase(T,A). (We refer to [1] for the definition of chase for a DL-LiteFontology.) Hence it suffices to show that for everyBvCthat is redundant with respect toΣ,chase(T,A) =chase(T \ {BvC},A). We show this by proving that ifB vCis redundant (hence, removed byoptim(T, Σ)), then there is always a chase(T,A)in whichB vCis never applicable. Assume by contradiction thatBvC is applicable to some assertionB(c)during some step inchase(T,A). We distinguish two cases that correspond to the cases of Definition 2.

(a) Case where B v C is directly redundant and hence Σ |= B vA C. We distinguish two subcases:(i)B(c)∈ A. SinceA |=BvAC, we haveC(c)∈ A, and henceBvCis not applicable toB(c). Contradiction.(ii)B(c)∈ A. Then there is a/ sequence of chase steps starting from some ABox assertionB⁰(c⁰)that generatesB(c).

Such a sequence requires aT-chain(B_i vB_i⁰)ⁿ_i=0withB₀=B⁰andB⁰_n =B, such that eachB_i vB⁰_iis applicable inchase(T,A). Then, by the second condition of direct redundancy, there is aΣ-chain(B_i vAB⁰_i)ⁿ_i=0. SinceA |=B₀ vAB⁰₀we have that B₀⁰(c⁰)∈ Aand henceB₀vB⁰₀is not applicable toB⁰(c⁰). Contradiction.

(b) Case whereB vChas been removed by Definition 2(b), and hence there exists B⁰ 6=Bsuch thatT |=BvB⁰. First we note that any two oblivious chase sequences forT andAproduce results that are equivalent w.r.t. query answering. Then it is enough to show that there exists somechase(T,A)in whichB⁰vCis always applied before B vCand in whichBvCis never applicable. Again, we distinguish two subcases:

(i)B(c)∈ A. Then, sinceBvB⁰is directly redundant, we have thatΣ|=B vAB⁰. SinceA |=Σ, we have thatB⁰(c)∈ A, and given thatB⁰ vCis always applied before BvC,C(c)is added tochase(T,A)before the application ofB vC, henceBvC is in fact not applicable. Contradiction.(ii)B(c)∈ A. Then, arguing as in Case/ (a).(ii), usingB vB⁰instead ofBvC, we can derive a contradiction. ut Complexity and implementation. Due to space limitations, we cannot provide a full description of how to compute optim(T, Σ). We just note that the checks that are required byoptim(T, Σ)can be reduced to computing reachability between two nodes in a DAG that represents the reachability relation of the chains inT andΣ. This operation can be done in linear time.

Consistency checking. Consistency checking may also suffer from redundancy when the ABox is already (partially) complete w.r.t.T. In this case, we need to consider, in addition to inclusion dependencies, alsofunctionalanddisjointnessdependencies.

Due to spaces limitations we cannot provide more details, and just note that using

(7)

these dependencies it is possible to extend the definitions to generate TBoxes that avoid redundant consistency checking operations.

4 Dependencies in OBDA Systems

The purpose of the current section is to complement our argument w.r.t. completeness of ABoxes by discussing when and why we can expect completeness in OBDA systems.

We start by observing that in OBDA systems, ABoxes are constructed, in general, from existing data that resides in some form of data repository. In order to create an ABox, the system requires some form of mappings from the source to the ontology. These may be explicit logical assertions as the ones used in this paper, or they may be implicitly defined through application code. Therefore, the source queries used in these mappings become crucial in determining the structure of the ABox. In particular, any dependencies that hold over the results of these queries will be reflected in the OBDA system as ABox dependencies.

Example 1. LetRbe a DB schema with the relation schemaemployeewith attributes id,dept, andsalary, that stores information about employees, their salaries, and the department they work for. LetMbe the following mappings:

SELECT id,dept FROM employee ; Employee(emp(id))∧

WORKS-FOR(emp(id),dept(dept)) SELECT id,dept FROM employee

WHERE salary > 1000

; M anager(emp(id))∧

MANAGES(emp(id),dept(dept)) whereEmployeeandManagerare atomic concepts andWORKS-FORandMANAGES are atomic roles. Then foreveryinstanceIofR, the virtual ABoxV =hhR,Ii,Mi satisfies the following dependencies:

ManagervAEmployee ManagervA∃MANAGES ∃WORKS-FORvA Employee

∃MANAGESvAManager EmployeevA ∃WORKS-FOR In particular, the dependency in Column 1 follows from the containment relation between the two SQL queries used in the mappings, and the remaining dependencies follow from the fact that we populateWORKS-FOR(resp.,MANAGES) using the same SQL query used to populateEmployee(resp.,Manager).

Turning our attention to the semantics of the data sources, we note that any given DB is based on some conceptual model. At the same time, if we associate the data of any given DB to the concepts and roles of a TBoxT, it follows that this data issemantically relatedto these concepts and roles, and that the conceptual model of the DB has some common aspects with the semantics ofT. It is precisely these common aspects that get manifested as dependencies between queries in the mappings and that give rise to completeness in ABoxes. Therefore, the degree of completeness of an ABox in an OBDA system is in direct relation with the closeness of the semantics of the conceptual model of the DB and the semantics of the TBox, and with the degree in which the DB itself complies to the conceptual model that was used to design it.

(8)

Example 2. To illustrate the previous observations we extend Example 1. First we note that the intended meaning of the data stored inRis as follows:(i)employees with a salary higher than 1,000 are managers,(ii)managersmanagethe department in which they are employed, and(iii)every employee works for a department. Then, any TBox that shares some of this semantics will present redundancy. For example, ifT is

ManagervEmployee Managerv ∃MANAGES Employeev ∃WORKS-FOR

∃MANAGES⁻vDepartment ∃WORKS-FOR⁻vDepartment then the first row of assertions is redundant w.r.t. Σ. Instead, the semantics of the assertions of the second row is not captured by the mappings. In an OBDA system with such components, we should reason only w.r.t.Department. This can be accomplished by optimizingT w.r.t. Σusing the technique presented in Section 3.

5 Dependency Induction

We focus now on procedures to complete ABoxes with respect to TBoxes. The final objective is to simplify reasoning by diverting certain aspects of the process (e.g., dealing with concept hierarchies and domain and range assertions) from the query reformulation stage to other stages of the query answering process where they can be handled more efficiently. We call these proceduresdependency induction proceduressince their result can be characterized by a set of dependencies that hold in the ABox(es) of the system.

Formally, given an OBDA systemO = hT,Vi, whereV = hhR,Ii,Mi, we call a dependency induction procedurea procedure that usesOto compute a virtual ABoxV⁰ such that the number of assertions inT for whichV⁰is complete is higher than those for V. An example of a dependency induction procedure isABox expansion, a procedure in which the data inIischasedw.r.t.T. The critical point in dependency induction procedures is the trade-off between the degree of completeness induced, the system’s performance, and the cost of the procedure. The purpose of the current section is to present two dependency induction mechanism that provide good trade-offs. Both of them are designed for the case in which the data sources are RDBMSs. Given a TBoxT and a (virtual) ABoxV, both techniques are able to generate a system where, ifT |=BvA, thenV⁰ |=B vA A. Hence,V⁰is complete for allDL-LiteF inferences except those involving mandatory participation assertions, e.g.,Bv ∃R.

Semantic Index. The first proposal is applicable in the context of general OBDA systems in which we are free to manipulateanyaspect of the system to improve query answering.

The basic idea of the semantic index is to associate a numeric value to each concept of the ontology in such a way that for any two conceptsA₁,A₂, ifT |=A₁vA₂, then the value ofA₁is less than that ofA₂. ABox membership assertions are then represented in the DB using these numeric values instead of concept names. This allows one to retrieve most of the implied instances of a concept by posing simple range queries to the DB (which are very efficient in modern RDBMSs). Our proposal is related to a technique for XPath query evaluation known asDynamic Intervals[2], however, while the latter deals with XML trees, we have to deal with concept hierarchies that are DAGs. Formally, a semantic index is defined as follows, whereC_T denotes the set of atomic concepts ofT.

(9)

Definition 4. Given a DL-LiteFTBoxT,a semantic index forT is a pair of mappings hidx,rangeiwithidx : C_T → Nandrange : C_T → 2^N^×^N, such that for each pair of atomic concepts A₁,A₂ inC_T, we have that T |= A₁ v A₂ iff there is a pair h`, hi ∈range(A2)such that`≤idx(A1)≤h.

Using a semantic indexhidx,rangeifor a TBoxT, we constructV=hR,Iiwith the completeness properties described above, by proceeding as follows. We define a DB schemaRwith a universal-like relationunivc[c1,idx]for storing ABox concept assertions, and one relationroleP[c1,c2]for each atomic roleP, such thatc1andc2 have typeconstantandidxhas typenumeric. Given an ABoxA, we constructIsuch that for eachA(c)∈ Awe havehc,idx(A)i ∈univcand for eachP(c, c⁰)∈ Awe have hc, c⁰i ∈roleP. The schema and the index allow us to define, for each atomic concept A, a set of range queries overDthat retrieves most constantscsuch thatO |=A(c).

E.g., if range(A) = {h2,35i}, we define ’SELECT c1 FROM univc WHERE idx

>= 2 AND idx <= 35’. We use these queries to define the mappings of the system as follows¹:(i)for each atomic conceptA and eachh`, hi ∈ range(A), we add the mappingσ`≤idx≤h(univc);A(c1);(ii)for each atomic roleP, we add the mapping role_P ;P(c₁, c₂);(iii)for each atomic roleP and each atomic conceptAsuch that T |=∃P v A(resp.,∃P⁻ v A), we add the mappingπ_c₁(role_P) ; A(c₁)(resp., π_c₂(role_P);A(c₂)).

Note that a semantic index can be trivially constructed by assigning to each concept a unique (arbitrary) value and a set of ranges that covers all the values of subsumed concepts. However, this is not effective for optimizing query answering since the size of Mdetermines exponentially the size of the final SQL query. To avoid an exponential blow-up, we generateidxandrangeusing the implied concept hierarchy as follows.

LetT be a TBox, andDthe minimal DAG that represents theimplied is-arelation between all atomic concepts ofT (i.e., the transitive reductof D)². Then we can constructidx in the following way:(a)Define a counteri = 0;(b)Visit the nodes inDin a depth-first fashion starting from the root nodes. At each step and given the nodeNvisited at that step,(1)ifidx(N)is undefined, setidx(N) =iandi=i+ 1, (2)else ifidx(N)is defined, backtrack until the next node for whichidx is undefined.

Now, to generaterange we proceed as follows:(a)Visit the nodes inDstarting from the leafs and going up. For each nodeN in the visit,(1)ifN is a leaf inD, then we setrange(N) ={hidx(N),idx(N)i}(2)ifNis not a leaf, then we setrange(N) = merge({hidx(N), idx(N)i} ∪S

Ni|Ni→N∈Drange(N_i)), wheremergeis a function that, given a setrof ranges, returns the minimal set r⁰ of ranges that has the same numeric coverage asr, e.g.,merge({h5,7i,h3,5i,h9,10i}) ={h3,7i,h9,10i}.

Performance of the Semantic Index. We now describe a preliminary evaluation of the performance of the Semantic Index that was carried out using data from theResource Index(RI) [5]. The RI is an application that offers semantic search services over a collection of 22 well known collections of biomedical data (i.e., documents). The semantics of the search is defined by the hierarchical information of≈200ontologies, including well known bio-medical ontologies like the Gene Ontology, SNOMEDCT, etc. The basic func-

1Here we use relational algebra expressions instead of SQL to simplify the exposition.

2We assume w.l.o.g. thatT does not contain a cyclic chain of basic concept inclusions.

(10)

tioning of the RI can be summarized as follows:(i)Usingnatural language processing, each document is analyzed and annotated with one or more concepts from the ontologies. The annotations can be seen as ABox assertions, e.g.,Cervical Cancer(⁰doc224⁰).

(ii)Users pose queries of the formq(x)←A1(x)∧ · · · ∧An(x), where eachAiis a concept from one of the ontologies in the collection. To compute the answers to the queries, the RI executes the original query over the expanded ABox data. The information in the RI’s ontologies amounts to≈3million concepts and≈2.5millionis-aassertions.

Currently, the number of annotations to be expanded is large, e.g., the annotation process generates≈181million annotations (≈14GB of data) only for the documents in the Clinical Trials.gov(CT) resource collection.

Our experimentation focused on the cost of query answering bya)traditional query reformulation,b)query answering using a Semantic Index, andc)query answering with ABox expansion. To carry out the experimentation we used part of the data from the RI, specifically, we used the full set ofis-aassertions from the ontology collection and all the document annotations for the CT resource. Also note that since there are no atomic roles in the RI, the Semantic Index technique can generate a complete virtual ABox (i.e.,optim(T, Σ) = ∅). We stored the data in a DB2 9.7 DB hosted in a Linux virtual machine with 4x2.67 Ghz Intel Xeon processors (only one core was used) and 4 GB of RAM available to DB2. Using this data we created a Semantic Index (time required: 27 s) and created a new table with the CT annotations extended with the Semantic Index. We issued several queries; the one we describe here isq(x) ← DNA Repair Gene(x)∧Antigen Gene(x)∧Cancer Gene(x)over the CT data. The query returns a total of 2 distinct resources. The results of each technique are as follows³: a)when rewriting using traditional query reformulation (see [1, 6]), the result is either one SQL query with 467874 disjuncts, or a union of 467874 SQL queries; non of these queries is executable by the engine;b)when applying the Semantic Index technique, the result of query reformulation is one single SQL query involving 3 range disjunctions; the query requires3.582s to execute (0.082s if the DB is warm, e.g., the indexes have been preloaded);c)if the ABox is expanded, the TBox is also discarded and the reformulation is equal to the original query; executing this query over the expanded data requires3s (0s if warm). With respect to the cost of the expansion, LePendu et al. indicate that a straightforward expansion for the CT resource collection requires≈7days, generates an additional≈126GB of data and, after a careful optimization of the process (including data partitioning, parallelization, etc.) this time can be reduced to≈40minutes. Given these results, we believe that the Semantic Index is a very promising option for semantic query answering. A system that uses the Semantic Index can provide a better performance than a system that relies only on query reformulation or ABox expansion.

T-Mappings for OBDA. The second technique we present is applicable in thevirtual OBDAcontext, where the data should stay in the data sourceand the only aspect of the virtual ABox we can manipulate are the mappings. Our proposal is based on the idea that, given an atomic conceptAand a basic conceptB, if we want to induce a dependency B vAA, we have to ensure that the mappings forAretrieve all the data retrieved by those forB. We call the techniqueT-mappings, and define it formally as follows.

3More information about these tests, including the SQL queries can be found athttps://

babbage.inf.unibz.it/trac/obdapublic/wiki/resourceindex

(11)

Definition 5. Given a TBoxT and a virtual ABoxV =hD,Mi, aT-mapping forT w.r.t.V is a virtual ABoxV⁰ =hD,M⁰isuch that for every ABox assertionαsuch thatV |=α, we have thatV⁰ |=α, and for each ground ABox assertionαsuch that hT,Vi |=αwe have thatV⁰ |=α.

AT-mappingM⁰can be constructed trivially by using existing mappings to generate new mappings that take into account the implications ofT as follows: InitializeM⁰=∅, and for each atomic conceptA,(i)for each mappingϕ;A⁰(f(x))∈ Msuch thatT |= A⁰ vA, addϕ;A(f(x))toM⁰, and(ii)for each mappingϕ;P(f(x),g(y))∈ M such thatT |= ∃P v A(resp.,T |= ∃P⁻ v A), addϕ ; A(f(x))(resp., ϕ ; A(g(y))}) toM⁰. However, suchT-mappings will not be efficient since the increased size ofMwill trigger a worst-case exponential blow-up during SQL query generation.

To avoid this situation we aim at creating succinct mappings by resorting to SQL query analysis as follows. LetD=hR,Iibe a DB instance,Σ_Da set of dependencies satisfied by D,SQ = {ϕ₁, . . . , ϕ_n} a set of SQL queries defined overR, andxa list of attribute names, thenmergesql(SQ, Σ_D,x)is the function that returns a new SQL queryϕthat isequivalentunderΣ_Dto the union of allϕ_i ∈SQ projected on x. E.g., ifSQ={’SELECT id FROM employee’,’SELECT id FROM employee JOIN information ON id’, ’SELECT id FROM external’} and ΣD is the inclusion dependency external[id] ⊆ employee[id], thenmergesql(SQ, ΣD,[id]) =

’SELECT id FROM employee’. Usingmergesql, we define two more auxiliary func- tions,mergemcandmergemr, which allow us tomergesets of mappings. LetSCbe a set of mappings for atomic concepts,SR a set of mappings for atomic roles, and Aan atomic concept. Let F be the set of all unique (up to variable renaming) variable terms f(x) that are used in a mapping in SC, then mergemc(SC, ΣD, A) = S

f_i(x_i)∈F{mergesql(SQfi, ΣD,xi) ; A(fi(xi))}, where SQfi is the set of SQL queriesϕof all the mappingsϕ ; A₁(f_i(x)) ∈ S_C wherex_i andxare unifiable.

Let F₁ (resp.,F₂) be the set of all unique (up to variable renaming) variable terms f(x)that are used as first (resp., second) component in the head of a mapping in S_R, thenmergemr(S_R, Σ_D,1, A) =S

fi∈F1{mergesql(SQ_f_i, Σ_D,x_i);A(f_i(x_i))}

(resp.,mergemr(SR, ΣD,2, A) = S

fi∈F2{mergesql(SQf_i, ΣD,xi) ; A(fi(xi))}), whereSQf_i is the set of SQL queriesϕof all the mappingsϕ;P(fi(x),g(y))(resp., ϕ;P(g(y),fi(x))) inSRwherexiandxunify. LetM(A)(resp.,M(P)) be the set of all mappings for the atomic conceptA(resp., for the atomic roleP) inM. Then, given a TBoxT and a setMof mappings, theT-mappings forMw.r.t.T is defined as M⁰=S

P∈V M(P)∪S

A∈V mergemc(mA, ΣD, A), withmA=S

T |=A_ivAM(Ai)∪

mergemr(mr, Σ_D,1, A)∪mergemr(mr⁻, Σ_D,2, A),mr=S

T |=∃PivAM(P_i), and mr⁻ =S

T |=∃P_i⁻vAM(Pi). Intuitively, this is the original role mappings inMunion the merging of all the relevant mappings for eachA(w.r.t. TBox implication).

Example 3. LetDbe the DB in Example 1,Mthe simple version of the mappings in that example (i.e., 4 simple mappings), andT the TBox in Example 2. Then theT- mappings forT isM⁰consisting of all the mappings inMplus the additional mapping m=’SELECT id,dept FROM employee’;Department(dept(dept))which we now explain. Note that∃WORKS-FOR⁻and∃MANAGES⁻are the only subsumees of Departmentw.r.t.T and hence, only their mappings (let them bem1,m2) are considered

(12)

for the mappings ofDepartmentbymergemc. Then given the containment between the queries ofm₁andm₂, the output ofmergemc({m₁, m₂},∅,Department)ism.

6 Conclusions and Future Work

In this paper we focused on issues of redundancy and performance in OBDA systems.

Several directions can be taken starting from the ideas presented here. First, although TBox pre-processing is the best place to first address redundancy, it is also necessary to apply redundancy elimination during reasoning, e.g., during query rewriting. Second, redundancy may also appear during consistency checking, i.e., when an ABox is sound w.r.t. to the TBox; this situation can also be characterized by ABox dependencies and should be addressed. With respect to the evaluation of our proposals, further experimentation is still required. In particular, it is necessary to provide a comprehensive benchmark of the techniques discussed in this paper in comparison to other proposals.

As future work we are now exploring extensions of the techniques to more expressive languages, e.g.,DL-LiteAwhich supports also role inclusions, and SPARQL queries over RDFS ontologies. In this last context, we believe that our techniques can be used to provide high-performance SPARQL end points with sound and complete RDFS entailment regime support, without relying on inference materialization, as is usually done.

Acknowledgements. We thank Dr. Paea LePendu for providing us with the RI data and supporting us in understanding this application. This research has been partially supported by the EU under the ICT Collaborative Project ACSI grant n. FP7-257593.

References

1. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: TheDL-Litefamily. J. of Automated Reasoning, 39(3):385–429, 2007.

2. D. DeHaan, D. Toman, M. P. Consens, and M. T. ¨Ozsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. InProc. of ACM SIGMOD, pages 623–634, 2003.

3. G. Gottlob and C. Ferm¨uller. Removing redundancy from a clause.Artif. Intell., 61(2), 1993.

4. R. Kontchakov, C. Lutz, D. Toman, F. Wolter, and M. Zakharyaschev. The combined approach to query answering inDL-Lite. InProc. of KR 2010, 2010.

5. P. LePendu, N. Noy, C. Jonquet, P. Alexander, N. Shah, and M. Musen. Optimize first, buy later: Analyzing metrics to ramp-up very large knowledge bases. InProc. of ISWC 2010, volume 6496 ofLNCS, pages 486–501. Springer, 2010.

6. H. P´erez-Urbina, B. Motik, and I. Horrocks. Tractable query answering and rewriting under description logic constraints.J. of Applied Logic, 8(2):186–209, 2010.

7. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies.J. on Data Semantics, X:133–173, 2008.

8. M. Rodr´ıguez-Muro.Tools and Techniques for Ontology Based Data Access in Lightweight Description Logics. PhD thesis, KRDB Research Centre for Knowledge and Data, Free Univ.

of Bozen-Bolzano, 2010.

9. R. Rosati and A. Almatelli. Improving query answering overDL-Liteontologies. InProc. of KR 2010, 2010.