Dependencies to Optimize Ontology Based Data Access


Mariano Rodríguez-Muro and Diego Calvanese

KRDB Research Centre for Knowledge and Data, Free University of Bozen-Bolzano, Piazza Domenicani 3, Bolzano, Italy
{rodriguez,calvanese}@inf.unibz.it

Abstract. Query answering in Ontology Based Data Access (OBDA) exploits the knowledge of an ontology's TBox to deal with incompleteness of the ABox (or data source). Current query-answering techniques with DL-Lite require exponential-size query reformulations or expensive data pre-processing. Moreover, these techniques present severe redundancy issues when dealing with ABoxes that are already (partially) complete. It has been shown that addressing redundancy is not only required for tractable implementations of decision procedures, but may also allow for sizable improvements in execution times. Based on these observations, in this paper we extend the results aimed at improving query answering performance in OBDA systems that were developed in [9] for DL-Lite_F to the case where role inclusions are also present in the TBox. Specifically, we first show that we can characterize completeness of an ABox by means of dependencies, and that we can use these to optimize DL-Lite_A TBoxes. Second, we show that in OBDA systems we can create ABox repositories that appear to be complete w.r.t. a significant portion of any DL-Lite_A TBox. The combination of these results allows us to design OBDA systems based on DL-Lite_A in which redundancy is minimal, the exponential aspect of query answering is notably reduced, and that can be implemented efficiently using existing RDBMSs.

1 Introduction

The current approaches to Ontology Based Data Access (OBDA) with lightweight Description Logics (DLs) of the DL-Lite family [2] rely on query reformulation. These techniques are based on the idea of using the ontology to rewrite a given query into a new query that, when evaluated over the data sources, returns the certain answers to the original query. Experiments with unions of conjunctive queries (UCQs) have shown that reformulations may be very large, and that the execution of these reformulations suffers from poor performance. This triggered the development of alternative reformulation techniques [6,10], in which the focus has been on the reduction of the number of generated queries/rules. These techniques have shown some success; however, query reformulation in all of them is still worst-case exponential in the size of the original query.

Alternative approaches [5] use the expansion of the extensional layer of the ontology (i.e., the ABox) w.r.t. the intensional knowledge (i.e., the TBox) to avoid query reformulation almost entirely. However, the cost of data expansion imposes severe limitations on the

* This work has been supported by the EU FP7-ICT Project ACSI (257593).


system. We believe that approaching the problem of the high cost of query answering in OBDA systems requires a change of focus: namely, from the 'number of queries' perspective to the perspective that takes into account the 'duplication in the answers' appearing in query results under SQL multiset semantics. Duplication in results is a sign of redundancy in the reasoning process; it is generated not only by the reformulation procedures, as traditionally thought, but also by techniques based on ABox expansion. Redundancy is instead the consequence of ignoring the semantics of the data sources. In particular, when the data in a source (used to populate the ABox of the ontology) already satisfies an inclusion assertion of the TBox (i.e., is complete w.r.t. such an inclusion assertion), then using that inclusion assertion during query answering might generate redundant answers [8]. As noted in [4], the runtime of decision procedures might change from exponential to polynomial if redundancy is addressed, and this is also the case in OBDA query answering. In [9], we addressed both problems, redundancy and the exponential blow-up of query reformulations, for the DL DL-Lite_F. We followed two complementary directions, and in this paper we extend both to deal also with the case where role inclusions are present in the TBox.

Specifically, in [9], we first presented an approach to take into account completeness of the data with respect to DL-Lite_F TBoxes. We characterized completeness using ABox dependencies and showed that it is possible to use dependencies to optimize the TBox in order to avoid redundant computations independently of the reasoning technique.

Second, we focused on how we can optimally complete ABoxes in OBDA systems by relying on the fact that in OBDA systems it is possible to manipulate not only the data, but also the mappings and database schema. This allows us to conceive procedures to store an ABox in a source in such a way that it appears to be complete with respect to a significant portion of the TBox, but without actually expanding the data. We presented two such procedures, one for general and one for 'virtual' OBDA systems, both designed to take advantage of the features of modern RDBMSs effectively. These results allow for the design of systems that can delegate reasoning tasks (e.g., dealing with hierarchies, existentially quantified individuals, etc.) to stages of the reasoning process where these tasks can be handled most effectively. The result is a (sometimes dramatic) reduction of the exponential runtime and an increase in the quality of the answers due to the reduction of duplication. Here, we extend the TBox optimization procedure and one of the ABox completion mechanisms to DL-Lite_A ontologies, in which role inclusions are allowed.

The rest of the paper is organized as follows: Section 2 gives technical preliminaries. Section 3 presents our extension to DL-Lite_A of the general technique for optimizing TBoxes w.r.t. dependencies. Section 4 introduces data dependencies in OBDA systems, describing why it is natural to expect completeness of ABoxes. Section 5 presents our extension of one of the techniques for completing ABoxes in OBDA systems to allow for DL-Lite_A ontologies. Section 6 concludes the paper.

2 Preliminaries

In the rest of the paper, we assume a fixed vocabulary $V$ of atomic concepts, denoted $A$ (possibly with subscripts), and atomic roles, denoted $P$, representing unary and binary relations, respectively, and an alphabet $\Gamma$ of (object) constants.


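Databases. In the following, we regard a database (DB) as a pair $\mathcal{D} = \langle \mathcal{R}, I \rangle$, where $\mathcal{R}$ is a relational schema and $I$ is an instance of $\mathcal{R}$. The active domain $\Gamma_{\mathcal{D}}$ of $\mathcal{D}$ is the set of constants appearing in $I$, which we call value constants. An SQL query $\varphi$ over a DB schema $\mathcal{R}$ is a mapping from a DB instance $I$ of $\mathcal{R}$ to a set of tuples.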

DL-Lite ontologies. We introduce the DL DL-Lite_A, on which we base our results. In DL-Lite_A, a basic role, denoted $R$, is an expression of the form $P$ or $P^-$, and a basic concept, denoted $B$, is an expression of the form $A$ or $\exists R$. An ontology is a pair $\mathcal{O} = \langle \mathcal{T}, \mathcal{A} \rangle$ where $\mathcal{T}$ is a TBox and $\mathcal{A}$ an ABox. A TBox is a finite set of (positive) inclusions $B_1 \sqsubseteq B_2$ or $R_1 \sqsubseteq R_2$, disjointness assertions $B_1 \sqsubseteq \neg B_2$, and functionality assertions $(\mathsf{funct}\ R)$. An ABox is a finite set of membership assertions $A(c)$ or $P(c, c')$, where $c, c' \in \Gamma$. Moreover, DL-Lite_A imposes the syntactic restriction that a role $P$ declared functional, via $(\mathsf{funct}\ P)$, or inverse functional, via $(\mathsf{funct}\ P^-)$, cannot be specialized, i.e., cannot appear in the right-hand side of a role inclusion assertion $R \sqsubseteq P$ or $R \sqsubseteq P^-$.

Queries over ontologies. An atom is an expression of the form $A(t)$ or $P(t, t')$, where $t$ and $t'$ are atom terms, i.e., variables or constants in $\Gamma$. An atom is ground if it contains no variables. A conjunctive query (CQ) $q$ over an ontology $\mathcal{O}$ is an expression of the form $q(\mathbf{x}) \leftarrow \beta(\mathbf{x}, \mathbf{y})$, where $\mathbf{x}$ is a tuple of distinct variables, called distinguished, $\mathbf{y}$ is a tuple of distinct variables not occurring in $\mathbf{x}$, called non-distinguished, and $\beta(\mathbf{x}, \mathbf{y})$ is a conjunction of atoms with variables in $\mathbf{x}$ and $\mathbf{y}$, whose predicates are atomic concepts and roles of $\mathcal{O}$. We call $q(\mathbf{x})$ the head of the query and $\beta(\mathbf{x}, \mathbf{y})$ its body. A union of CQs (UCQ) is a set of CQs (called disjuncts) with the same head. Given a CQ $Q$ with body $\beta(\mathbf{z})$ and a tuple $\mathbf{v}$ of constants of the same arity as $\mathbf{z}$, we call a ground instance of $Q$ the set $\beta[\mathbf{z}/\mathbf{v}]$ of ground atoms obtained by replacing in $\beta(\mathbf{z})$ each variable with the corresponding constant from $\mathbf{v}$.

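Semantics. An interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ consists of a non-empty interpretation domain $\Delta^{\mathcal{I}}$ and an interpretation function $\cdot^{\mathcal{I}}$ that assigns to each constant $c$ an element $c^{\mathcal{I}}$ of $\Delta^{\mathcal{I}}$, to each atomic concept $A$ a subset $A^{\mathcal{I}}$ of $\Delta^{\mathcal{I}}$, and to each atomic role $P$ a binary relation $P^{\mathcal{I}}$ over $\Delta^{\mathcal{I}}$. Moreover, basic roles and basic concepts are interpreted as follows: $(P^-)^{\mathcal{I}} = \{(o_2, o_1) \mid (o_1, o_2) \in P^{\mathcal{I}}\}$ and $(\exists R)^{\mathcal{I}} = \{o \mid \exists o'.\ (o, o') \in R^{\mathcal{I}}\}$.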

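An interpretation $\mathcal{I}$ is a model of $B_1 \sqsubseteq B_2$ if $B_1^{\mathcal{I}} \subseteq B_2^{\mathcal{I}}$, of $R_1 \sqsubseteq R_2$ if $R_1^{\mathcal{I}} \subseteq R_2^{\mathcal{I}}$, of $B_1 \sqsubseteq \neg B_2$ if $B_1^{\mathcal{I}} \cap B_2^{\mathcal{I}} = \emptyset$, and of $(\mathsf{funct}\ R)$ if for each $o, o_1, o_2 \in \Delta^{\mathcal{I}}$ we have that $(o, o_1) \in R^{\mathcal{I}}$ and $(o, o_2) \in R^{\mathcal{I}}$ implies $o_1 = o_2$. Also, $\mathcal{I}$ is a model of $A(c)$ if $c^{\mathcal{I}} \in A^{\mathcal{I}}$, and of $P(c, c')$ if $(c^{\mathcal{I}}, c'^{\mathcal{I}}) \in P^{\mathcal{I}}$. In DL-Lite_A, we adopt the Unique Name Assumption (UNA), which enforces that for each pair of constants $o_1$, $o_2$, if $o_1 \neq o_2$, then $o_1^{\mathcal{I}} \neq o_2^{\mathcal{I}}$. For a DL-Lite_A assertion $\alpha$ (resp., a set $\Theta$ of DL-Lite_A assertions), $\mathcal{I} \models \alpha$ (resp., $\mathcal{I} \models \Theta$) denotes that $\mathcal{I}$ is a model of $\alpha$ (resp., $\Theta$). A model of an ontology $\mathcal{O} = \langle \mathcal{T}, \mathcal{A} \rangle$ is an interpretation $\mathcal{I}$ such that $\mathcal{I} \models \mathcal{T}$ and $\mathcal{I} \models \mathcal{A}$. An ontology is satisfiable if it admits a model. An ontology $\mathcal{O}$ entails an assertion $\alpha$, denoted $\mathcal{O} \models \alpha$, if every model of $\mathcal{O}$ is also a model of $\alpha$. Similarly, for a TBox $\mathcal{T}$ and an ABox $\mathcal{A}$ instead of $\mathcal{O}$. The saturation of a TBox $\mathcal{T}$, denoted $\mathit{sat}(\mathcal{T})$, is the set of DL-Lite_A assertions $\alpha$ s.t. $\mathcal{T} \models \alpha$. Notice that $\mathit{sat}(\mathcal{T})$ is finite, hence a TBox.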

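Let $\Gamma_{\mathcal{A}}$ denote the set of constants appearing in an ABox $\mathcal{A}$. The answer to a CQ $Q = q(\mathbf{x}) \leftarrow \beta(\mathbf{x}, \mathbf{y})$ over $\mathcal{O} = \langle \mathcal{T}, \mathcal{A} \rangle$ in an interpretation $\mathcal{I}$, denoted $\mathit{ans}(Q, \mathcal{O}, \mathcal{I})$, is the set of tuples $\mathbf{c} \in \Gamma_{\mathcal{A}} \times \cdots \times \Gamma_{\mathcal{A}}$ such that there exists a tuple $\mathbf{c}' \in \Gamma_{\mathcal{A}} \times \cdots \times \Gamma_{\mathcal{A}}$ such that the ground atoms in $\beta[(\mathbf{x}, \mathbf{y})/(\mathbf{c}, \mathbf{c}')]$ are true in $\mathcal{I}$. The answer to a UCQ $Q$ in $\mathcal{I}$ is the union of the answers to each CQ in $Q$. The certain answers to $Q$ in $\mathcal{O}$, denoted $\mathit{cert}(Q, \mathcal{O})$, is the intersection of $\mathit{ans}(Q, \mathcal{O}, \mathcal{I})$ over all models $\mathcal{I}$ of $\mathcal{O}$. The answer to $Q$ over an ABox $\mathcal{A}$, denoted $\mathit{eval}(Q, \mathcal{A})$, is the answer to $Q$ over $\mathcal{A}$ viewed as a DB instance. A perfect reformulation of $Q$ w.r.t. a TBox $\mathcal{T}$ is a query $Q'$ such that for every ABox $\mathcal{A}$ such that $\langle \mathcal{T}, \mathcal{A} \rangle$ is satisfiable, $\mathit{cert}(Q, \langle \mathcal{T}, \mathcal{A} \rangle) = \mathit{eval}(Q', \mathcal{A})$.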

Mappings. We adopt the definitions for ontologies with mappings from [7]. First we extend interpretations to be able to create object constants from the value constants in a DB $\mathcal{D}$. Given an alphabet $\Lambda$ of function symbols, we define the set $\tau(\Lambda, \Gamma_{\mathcal{D}})$ of object terms as the set of all terms of the form $f(d_1, \ldots, d_n)$, where $f \in \Lambda$, the arity of $f$ is $n$, and $d_1, \ldots, d_n \in \Gamma_{\mathcal{D}}$. We set $\Gamma = \Gamma_{\mathcal{D}} \cup \tau(\Lambda, \Gamma_{\mathcal{D}})$, and we extend the interpretation function so that for each $c \in \tau(\Lambda, \Gamma_{\mathcal{D}})$ we have that $c^{\mathcal{I}} \in \Delta^{\mathcal{I}}$. We extend queries by allowing the use of predicate arguments that are variable terms, i.e., expressions of the form $f(\mathbf{t})$, where $f \in \Lambda$ with arity $n$ and $\mathbf{t}$ is an $n$-tuple of variables or value constants.

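Given a TBox $\mathcal{T}$ and a DB $\mathcal{D}$, a mapping (assertion) $m$ for $\mathcal{T}$ is an expression of the form $\varphi(\mathbf{x}) \leadsto \psi(\mathbf{t})$, where $\varphi(\mathbf{x})$ is an SQL query over $\mathcal{D}$ with answer variables $\mathbf{x}$, and $\psi(\mathbf{t})$ is a CQ over $\mathcal{T}$ without non-distinguished variables using variable terms over variables in $\mathbf{x}$. We call the mapping simple if the body of $\psi(\mathbf{t})$ consists of a single atom, and complex otherwise. A simple mapping is for an atomic concept $A$ (resp., atomic role $P$) if the atom in the body of $\psi(\mathbf{t})$ has $A$ (resp., $P$) as predicate symbol. In the following, we might abbreviate the query $\psi$ in a mapping by showing only its body. A virtual ABox $\mathcal{V}$ is a tuple $\langle \mathcal{D}, \mathcal{M} \rangle$, where $\mathcal{D}$ is a DB and $\mathcal{M}$ a set of mappings, and an ontology with mappings is a tuple $\mathcal{O}_{\mathcal{M}} = \langle \mathcal{T}, \mathcal{V} \rangle$, where $\mathcal{T}$ is a TBox and $\mathcal{V} = \langle \mathcal{D}, \mathcal{M} \rangle$ is a virtual ABox in which $\mathcal{M}$ is a set of mappings for $\mathcal{T}$.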

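An interpretation $\mathcal{I}$ satisfies a mapping assertion $\varphi(\mathbf{x}) \leadsto \psi(\mathbf{t})$ w.r.t. a DB $\mathcal{D} = \langle \mathcal{R}, I \rangle$ if for every tuple $\mathbf{v} \in \varphi(I)$ and for every ground atom $X$ in $\psi[\mathbf{x}/\mathbf{v}]$ we have that: if $X$ has the form $A(f(\mathbf{c}))$, then $(f(\mathbf{c}))^{\mathcal{I}} \in A^{\mathcal{I}}$, and if $X$ has the form $P(f_1(\mathbf{c}_1), f_2(\mathbf{c}_2))$, then $((f_1(\mathbf{c}_1))^{\mathcal{I}}, (f_2(\mathbf{c}_2))^{\mathcal{I}}) \in P^{\mathcal{I}}$. An interpretation $\mathcal{I}$ is a model of $\mathcal{V} = \langle \mathcal{D}, \mathcal{M} \rangle$, denoted $\mathcal{I} \models \mathcal{V}$, if it satisfies every mapping in $\mathcal{M}$ w.r.t. $\mathcal{D}$. A virtual ABox $\mathcal{V}$ entails an ABox assertion $\alpha$, denoted $\mathcal{V} \models \alpha$, if every model of $\mathcal{V}$ is a model of $\alpha$. $\mathcal{I}$ is a model of $\mathcal{O}_{\mathcal{M}} = \langle \mathcal{T}, \mathcal{V} \rangle$ if $\mathcal{I} \models \mathcal{T}$ and $\mathcal{I} \models \mathcal{V}$. As usual, $\mathcal{O}_{\mathcal{M}}$ is satisfiable if it admits a model.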

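We note that, in an ontology with mappings $\mathcal{O}_{\mathcal{M}} = \langle \mathcal{T}, \langle \mathcal{D}, \mathcal{M} \rangle \rangle$, we can always replace $\mathcal{M}$ by a set of simple mappings, while preserving the semantics of $\mathcal{O}_{\mathcal{M}}$. It suffices to split each complex mapping $\varphi \leadsto \psi$ into a set of simple mappings that share the same SQL query $\varphi$ (see [7]). In the following, we assume to deal only with simple mappings.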

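Dependencies. ABox dependencies are assertions that restrict the syntactic form of allowed ABoxes. In this paper, we focus on unary and binary inclusion dependencies only. A unary (resp., binary) inclusion dependency is an assertion of the form $B_1 \sqsubseteq_A B_2$, where $B_1$ and $B_2$ are basic concepts (resp., $R_1 \sqsubseteq_A R_2$, where $R_1$ and $R_2$ are basic roles). In the following, for a basic role $R$ and constants $c$, $c'$, $R(c, c')$ stands for $P(c, c')$ if $R = P$ and for $P(c', c)$ if $R = P^-$. An ABox $\mathcal{A}$ satisfies an inclusion dependency $\sigma$, denoted $\mathcal{A} \models \sigma$, if the following holds: (i) if $\sigma$ is $A_1 \sqsubseteq_A A_2$, then for all $A_1(c) \in \mathcal{A}$ we have $A_2(c) \in \mathcal{A}$; (ii) if $\sigma$ is $\exists R \sqsubseteq_A A$, then for all $R(c, c') \in \mathcal{A}$ we have $A(c) \in \mathcal{A}$; (iii) if $\sigma$ is $A \sqsubseteq_A \exists R$, then for all $A(c) \in \mathcal{A}$ there exists $c'$ such that $R(c, c') \in \mathcal{A}$; (iv) if $\sigma$ is $\exists R_1 \sqsubseteq_A \exists R_2$, then for all $R_1(c, c') \in \mathcal{A}$ there exists $c''$ such that $R_2(c, c'') \in \mathcal{A}$; (v) if $\sigma$ is $R_1 \sqsubseteq_A R_2$, then for all $R_1(c, c') \in \mathcal{A}$ we have $R_2(c, c') \in \mathcal{A}$. An ABox $\mathcal{A}$ satisfies a set of dependencies $\Sigma$, denoted $\mathcal{A} \models \Sigma$, if $\mathcal{A} \models \sigma$ for each $\sigma \in \Sigma$. A set of dependencies $\Sigma$ entails a dependency $\sigma$, denoted $\Sigma \models \sigma$, if for every ABox $\mathcal{A}$ s.t. $\mathcal{A} \models \Sigma$ we also have that $\mathcal{A} \models \sigma$. The saturation of a set $\Sigma$ of dependencies, denoted $\mathit{sat}(\Sigma)$, is the set of dependencies $\sigma$ s.t. $\Sigma \models \sigma$. Given two queries $Q_1$, $Q_2$, we say that $Q_1$ is contained in $Q_2$ relative to $\Sigma$ if $\mathit{eval}(Q_1, \mathcal{A}) \subseteq \mathit{eval}(Q_2, \mathcal{A})$ for each ABox $\mathcal{A}$ s.t. $\mathcal{A} \models \Sigma$.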
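The satisfaction conditions (i)-(v) can be illustrated with a minimal sketch. The encoding below is an assumption made for this example only (it is not part of the paper): an ABox is a pair of sets of tuples, a basic role is a pair `(name, is_inverse)`, and a basic concept is either an atomic name or `("exists", role)`. Under this encoding, cases (i)-(iv) all reduce to a subset test between the constants that are syntactically members of the two basic concepts, and case (v) to a subset test between role extensions.

```python
class ABox:
    """Hypothetical container: concept assertions (A, c) and role assertions (P, c, c2)."""

    def __init__(self, concepts, roles):
        self.concepts = set(concepts)
        self.roles = set(roles)


def role_pairs(abox, role):
    """Pairs (c, c2) such that R(c, c2) is in the ABox, for a basic role R = P or P^-."""
    name, inverse = role
    return {(b, a) if inverse else (a, b)
            for (p, a, b) in abox.roles if p == name}


def members(abox, concept):
    """Constants asserted (syntactically) to belong to a basic concept A or exists R."""
    if isinstance(concept, str):
        return {c for (a, c) in abox.concepts if a == concept}
    _tag, role = concept  # ("exists", role)
    return {a for (a, _b) in role_pairs(abox, role)}


def satisfies_unary(abox, b1, b2):
    """Cases (i)-(iv): every constant in B1 is also in B2."""
    return members(abox, b1) <= members(abox, b2)


def satisfies_binary(abox, r1, r2):
    """Case (v): every pair in R1 is also in R2."""
    return role_pairs(abox, r1) <= role_pairs(abox, r2)


# Illustrative data (invented for this sketch).
abox = ABox(
    concepts={("Manager", "e1"), ("Employee", "e1"), ("Employee", "e2")},
    roles={("WORKS-FOR", "e1", "d1"), ("WORKS-FOR", "e2", "d1"),
           ("MANAGES", "e1", "d1")},
)
works_for = ("WORKS-FOR", False)
works_for_inv = ("WORKS-FOR", True)
manages = ("MANAGES", False)
```

Note that these checks are purely syntactic: they inspect only the assertions stored in the ABox and involve no reasoning over models.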

3 Optimizing TBoxes w.r.t. Dependencies

In a DL-Lite_A ontology $\mathcal{O} = \langle \mathcal{T}, \mathcal{A} \rangle$, the ABox $\mathcal{A}$ may be incomplete w.r.t. the TBox $\mathcal{T}$, i.e., there may be assertions $B_1 \sqsubseteq B_2$ in $\mathcal{T}$ s.t. $\mathcal{A} \not\models B_1 \sqsubseteq_A B_2$. When computing the certain answers to queries over $\mathcal{O}$, the TBox $\mathcal{T}$ is used to overcome such incompleteness. However, an ABox may already be (partially) complete w.r.t. $\mathcal{T}$, e.g., an ABox $\mathcal{A}$ satisfying $A_1 \sqsubseteq_A A_2$ is complete w.r.t. $A_1 \sqsubseteq A_2$. While ignoring completeness of an ABox is 'harmless' in the theoretical analysis of reasoning over DL-Lite_A ontologies, in practice it introduces redundancy, which manifests itself as containment w.r.t. dependencies among the disjuncts (CQs) of the perfect reformulation, making the contained disjuncts redundant. For example, let $\mathcal{T}$ and $\mathcal{A}$ be as before, and let $Q$ be $q(x) \leftarrow A_2(x)$; then any perfect reformulation of $Q$ must include $q_1 = q(x) \leftarrow A_1(x)$ and $q_2 = q(x) \leftarrow A_2(x)$ as disjuncts. However, since $q_1$ is contained in $q_2$ relative to $A_1 \sqsubseteq_A A_2$, we have that $q_1$ will not contribute new tuples w.r.t. those contributed by $q_2$.
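The effect of such a redundant disjunct under multiset semantics can be seen in a small sketch. The ABox contents below are invented for illustration, and a `Counter` stands in for the bag of tuples an RDBMS would return for a `UNION ALL` of the disjuncts; neither is taken from the paper.

```python
from collections import Counter

# Illustrative ABox that already satisfies the dependency A1 ⊑_A A2.
abox = {"A1": ["a", "b"], "A2": ["a", "b", "c"]}


def eval_atom(concept):
    """Bag of constants returned by the query q(x) <- concept(x)."""
    return Counter(abox[concept])


# Perfect reformulation of q(x) <- A2(x) w.r.t. A1 ⊑ A2: the disjuncts q1 and q2.
reformulated = eval_atom("A1") + eval_atom("A2")
# Evaluation after dropping the disjunct q1, which is contained in q2.
optimized = eval_atom("A2")

same_answers = set(reformulated) == set(optimized)
duplicates = sum(reformulated.values()) - len(reformulated)
```

Here both evaluations return the same set of answers, while the reformulated query returns two extra copies (of `a` and `b`) that have to be computed and then discarded.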

It is possible to use information about completeness of an ABox, expressed as a set of dependencies, to avoid redundancy in the reasoning process. One place to do this is during query reformulation, using techniques based on conjunctive query containment (CQC) with respect to dependencies to avoid the generation of redundant queries. However, this approach is expensive, since CQC is an NP-complete problem (even ignoring dependencies), and such optimizations would need to be performed every time a query is reformulated. We show now how we can improve efficiency by pre-processing the TBox before performing reformulation. In particular, given a TBox $\mathcal{T}$ and a set $\Sigma$ of dependencies, we show how to compute a TBox $\mathcal{T}'$ that is smaller than $\mathcal{T}$ and such that for every query $Q$ the certain answers are preserved if $Q$ is executed over an ABox that satisfies $\Sigma$. Specifically, our objective is to determine when an inclusion assertion of $\mathcal{T}$ is redundant w.r.t. $\Sigma$, and to do so we use the following auxiliary notions.

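Definition 1. Let $\mathcal{T}$ be a TBox, $B$, $C$ basic concepts, $R$, $S$ basic roles, and $\Sigma$ a set of dependencies over $\mathcal{T}$. A $\mathcal{T}$-chain from $B$ to $C$ in $\mathcal{T}$ (resp., a $\Sigma$-chain from $B$ to $C$ in $\Sigma$) is a sequence of inclusion assertions $(B_i \sqsubseteq B'_i)_{i=0}^{n}$ in $\mathcal{T}$ (resp., a sequence of inclusion dependencies $(B_i \sqsubseteq_A B'_i)_{i=0}^{n}$ in $\Sigma$), for some $n \geq 0$, such that: $B_0 = B$, $B'_n = C$, and for $1 \leq i \leq n$, we have that $B'_{i-1}$ and $B_i$ are basic concepts s.t. either (i) $B'_{i-1} = B_i$, or (ii) $B'_{i-1} = \exists R'$ and $B_i = \exists R'^-$, for some basic role $R'$. A $\mathcal{T}$-chain from $R$ to $S$ in $\mathcal{T}$ (resp., a $\Sigma$-chain from $R$ to $S$ in $\Sigma$) is a sequence of inclusion assertions $(R_i \sqsubseteq R'_i)_{i=0}^{n}$ in $\mathcal{T}$ (resp., a sequence of inclusion dependencies $(R_i \sqsubseteq_A R'_i)_{i=0}^{n}$ in $\Sigma$), for some $n \geq 0$, such that: $R_0 = R$, $R'_n = S$, and for $1 \leq i \leq n$, we have that $R'_{i-1} = R_i$.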

Intuitively, when there is a $\mathcal{T}$-chain from $B$ to $C$, the existence of an instance of $B$ in a model implies the existence of an instance of $C$. For a $\Sigma$-chain, this holds for ABox assertions. We use $\mathcal{T}$-chains and $\Sigma$-chains to characterize redundancy as follows.


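Definition 2. Let $\mathcal{T}$ be a TBox, $B$, $C$ basic concepts, $R$, $S$ basic roles, and $\Sigma$ a set of dependencies. The inclusion assertion $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is directly redundant in $\mathcal{T}$ w.r.t. $\Sigma$ if (i) $\Sigma \models B \sqsubseteq_A C$ (resp., $\Sigma \models R \sqsubseteq_A S$) and (ii) for every $\mathcal{T}$-chain $(B_i \sqsubseteq B'_i)_{i=0}^{n}$ with $B'_n = B$ in $\mathcal{T}$ (resp., for every $\mathcal{T}$-chain $(B_i \sqsubseteq B'_i)_{i=0}^{n}$ with $B'_n = \exists R$ and for every $\mathcal{T}$-chain $(R_i \sqsubseteq R'_i)_{i=0}^{m}$ with $R'_m = R$), there is a $\Sigma$-chain $(B_i \sqsubseteq_A B'_i)_{i=0}^{n}$ (resp., a $\Sigma$-chain $(B_i \sqsubseteq_A B'_i)_{i=0}^{n}$ and a $\Sigma$-chain $(R_i \sqsubseteq_A R'_i)_{i=0}^{m}$). Then, $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is redundant in $\mathcal{T}$ w.r.t. $\Sigma$ if (a) it is directly redundant, or (b) there exists $B' \neq B$ (resp., $R' \neq R$) s.t. (i) $\mathcal{T} \models B' \sqsubseteq C$ (resp., $\mathcal{T} \models R' \sqsubseteq S$), (ii) $B' \sqsubseteq C$ (resp., $R' \sqsubseteq S$) is not directly redundant in $\mathcal{T}$ w.r.t. $\Sigma$, and (iii) $B \sqsubseteq B'$ (resp., $R \sqsubseteq R'$) is directly redundant in $\mathcal{T}$ w.r.t. $\Sigma$.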

Given a TBox $\mathcal{T}$ and a set of dependencies $\Sigma$, we apply our notion of redundancy w.r.t. $\Sigma$ to the assertions in the saturation of $\mathcal{T}$ to obtain a TBox $\mathcal{T}'$ that is equivalent to $\mathcal{T}$ for certain answer computation.

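Definition 3. Given a TBox $\mathcal{T}$ and a set of dependencies $\Sigma$ over $\mathcal{T}$, the optimized version of $\mathcal{T}$ w.r.t. $\Sigma$, denoted $\mathit{optim}(\mathcal{T}, \Sigma)$, is the set of inclusion assertions $\{\alpha \in \mathit{sat}(\mathcal{T}) \mid \alpha \text{ is not redundant in } \mathit{sat}(\mathcal{T}) \text{ w.r.t. } \mathit{sat}(\Sigma)\}$.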

Correctness of using $\mathcal{T}' = \mathit{optim}(\mathcal{T}, \Sigma)$ instead of $\mathcal{T}$ when computing the certain answers to a query follows from the following theorem.

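Theorem 1. Let $\mathcal{T}$ be a TBox and $\Sigma$ a set of dependencies over $\mathcal{T}$. Then for every ABox $\mathcal{A}$ such that $\mathcal{A} \models \Sigma$ and every UCQ $Q$ over $\mathcal{T}$, we have that $\mathit{cert}(Q, \langle \mathcal{T}, \mathcal{A} \rangle) = \mathit{cert}(Q, \langle \mathit{optim}(\mathcal{T}, \Sigma), \mathcal{A} \rangle)$.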

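Proof. First we note that during query answering, only the positive inclusions are relevant, hence we ignore disjointness and functionality assertions. Since $\mathit{sat}(\mathcal{T})$ adds to $\mathcal{T}$ only entailed assertions, $\mathit{cert}(Q, \langle \mathcal{T}, \mathcal{A} \rangle) = \mathit{cert}(Q, \langle \mathit{sat}(\mathcal{T}), \mathcal{A} \rangle)$ for every $Q$ and $\mathcal{A}$, and we can assume w.l.o.g. that $\mathcal{T} = \mathit{sat}(\mathcal{T})$. Moreover, $\mathit{cert}(Q, \langle \mathcal{T}, \mathcal{A} \rangle)$ is equal to the evaluation of $Q$ over $\mathit{chase}(\mathcal{T}, \mathcal{A})$. (We refer to [2] for the definition of chase for a DL-Lite_F ontology.) Hence it suffices to show that for every $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) that is redundant with respect to $\Sigma$, $\mathit{chase}(\mathcal{T}, \mathcal{A}) = \mathit{chase}(\mathcal{T} \setminus \{B \sqsubseteq C\}, \mathcal{A})$.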

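We show this by proving that if $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is redundant (hence, removed by $\mathit{optim}(\mathcal{T}, \Sigma)$), then there is always a $\mathit{chase}(\mathcal{T}, \mathcal{A})$ in which $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is never applicable. Assume by contradiction that $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is applicable to some assertion $B(c)$ (resp., $R(c, c')$) during some step in $\mathit{chase}(\mathcal{T}, \mathcal{A})$. We distinguish two cases that correspond to the cases of Definition 2.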

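(a) Case where $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is directly redundant, and hence $\Sigma \models B \sqsubseteq_A C$ (resp., $\Sigma \models R \sqsubseteq_A S$). We distinguish two subcases: (i) $B(c) \in \mathcal{A}$ (resp., $R(c, c') \in \mathcal{A}$). Since $\mathcal{A} \models B \sqsubseteq_A C$ (resp., $\mathcal{A} \models R \sqsubseteq_A S$), we have $C(c) \in \mathcal{A}$ (resp., $S(c, c') \in \mathcal{A}$), and hence $B \sqsubseteq C$ is not applicable to $B(c)$ (resp., $R \sqsubseteq S$ is not applicable to $R(c, c')$). Contradiction. (ii) $B(c) \notin \mathcal{A}$ (resp., $R(c, c') \notin \mathcal{A}$). Then there is a sequence of chase steps starting from some ABox assertion $B'(c')$ (resp., $R(a, a')$ or $B(a)$) that generates $B(c)$ (resp., $R(c, c')$). Such a sequence requires a $\mathcal{T}$-chain $(B_i \sqsubseteq B'_i)_{i=0}^{n}$ with $B_0 = B'$ and $B'_n = B$ (resp., a $\mathcal{T}$-chain $(R_i \sqsubseteq R'_i)_{i=0}^{n}$ with $R_0 = R'$ and $R'_n = R$, or a $\mathcal{T}$-chain $(B_i \sqsubseteq B'_i)_{i=0}^{n}$ with $B_0 = B$ and $B'_n = \exists R$), such that each $B_i \sqsubseteq B'_i$ (resp., each $R_i \sqsubseteq R'_i$ or each $B_i \sqsubseteq B'_i$) is applicable in $\mathit{chase}(\mathcal{T}, \mathcal{A})$. Then, by the second condition of direct redundancy, there is a $\Sigma$-chain $(B_i \sqsubseteq_A B'_i)_{i=0}^{n}$ (resp., a $\Sigma$-chain $(R_i \sqsubseteq_A R'_i)_{i=0}^{n}$ or a $\Sigma$-chain $(B_i \sqsubseteq_A B'_i)_{i=0}^{n}$). Since $\mathcal{A} \models B_0 \sqsubseteq_A B'_0$ (resp., $\mathcal{A} \models R_0 \sqsubseteq_A R'_0$ or $\mathcal{A} \models B_0 \sqsubseteq_A B'_0$), we have that $B'_0(c') \in \mathcal{A}$ (resp., $R'_0(a, a') \in \mathcal{A}$ or $B'_0(a) \in \mathcal{A}$), and hence $B_0 \sqsubseteq B'_0$ is not applicable to $B'(c')$ (resp., $R_0 \sqsubseteq R'_0$ is not applicable to $R'(a, a')$, or $B_0 \sqsubseteq B'_0$ is not applicable to $B'(a)$). Contradiction.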

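(b) Case where $B \sqsubseteq C$ has been removed by Definition 2(b), and hence there exists $B' \neq B$ such that $\mathcal{T} \models B \sqsubseteq B'$ (resp., $R' \neq R$ s.t. $\mathcal{T} \models R \sqsubseteq R'$). First we note that any two oblivious chase sequences for $\mathcal{T}$ and $\mathcal{A}$ produce results that are equivalent w.r.t. query answering. Then it is enough to show that there exists some $\mathit{chase}(\mathcal{T}, \mathcal{A})$ in which $B' \sqsubseteq C$ (resp., $R' \sqsubseteq S$) is always applied before $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) and in which $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is never applicable. Again, we distinguish two subcases: (i) $B(c) \in \mathcal{A}$ (resp., $R(c, c') \in \mathcal{A}$). Then, since $B \sqsubseteq B'$ is directly redundant, we have that $\Sigma \models B \sqsubseteq_A B'$. Since $\mathcal{A} \models \Sigma$, we have that $B'(c) \in \mathcal{A}$ (resp., $R'(c, c') \in \mathcal{A}$), and given that $B' \sqsubseteq C$ (resp., $R' \sqsubseteq S$) is always applied before $B \sqsubseteq C$ (resp., $R \sqsubseteq S$), $C(c)$ (resp., $S(c, c')$) is added to $\mathit{chase}(\mathcal{T}, \mathcal{A})$ before the application of $B \sqsubseteq C$ (resp., $R \sqsubseteq S$), hence $B \sqsubseteq C$ (resp., $R \sqsubseteq S$) is in fact not applicable. Contradiction. (ii) $B(c) \notin \mathcal{A}$ (resp., $R(c, c') \notin \mathcal{A}$). Then, arguing as in Case (a).(ii), using $B \sqsubseteq B'$ instead of $B \sqsubseteq C$ (resp., $R \sqsubseteq R'$ instead of $R \sqsubseteq S$), we can derive a contradiction. $\Box$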

Complexity and implementation. Due to space limitations, we cannot provide a full description of how to compute $\mathit{optim}(\mathcal{T}, \Sigma)$. We just note that the checks that are required by $\mathit{optim}(\mathcal{T}, \Sigma)$ can be reduced to computing reachability between two nodes in a DAG that represents the reachability relation of the chains in $\mathcal{T}$ and $\Sigma$. This operation can be done in linear time.
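The paper does not spell out the procedure, so the following is only a sketch of the reachability primitive it refers to, not of the full computation of the optimized TBox. The graph encoding (one node per basic concept or role, one edge per inclusion) and the string labels are assumptions made for this illustration; the edges used are two of the dependencies of Example 1.

```python
from collections import deque


def reachable(edges, source, target):
    """Breadth-first search: is there a directed path from source to target?

    Runs in time linear in the number of nodes and edges.
    A node is considered reachable from itself (empty path).
    """
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical dependency graph: an edge (X, Y) stands for X ⊑_A Y.
sigma_edges = [
    ("Manager", "Employee"),
    ("Employee", "exists WORKS-FOR"),
]
```

With these edges, `reachable(sigma_edges, "Manager", "exists WORKS-FOR")` holds, reflecting that the two dependencies together entail that every manager stored in the ABox also has a stored WORKS-FOR assertion.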

Consistency checking. Consistency checking may also suffer from redundancy when the ABox is already (partially) complete w.r.t. $\mathcal{T}$. In this case, we need to consider, in addition to inclusion dependencies, also functional and disjointness dependencies. Due to space limitations we cannot provide more details, and just note that using these dependencies it is possible to extend the definitions to generate TBoxes that avoid redundant consistency checking operations.

4 Dependencies in OBDA Systems

The purpose of the current section is to complement our argument w.r.t. completeness of ABoxes by discussing when and why we can expect completeness in OBDA systems.

We start by observing that in OBDA systems, ABoxes are constructed, in general, from existing data that resides in some form of data repository. In order to create an ABox, the system requires some form of mappings from the source to the ontology. These may be explicit logical assertions as the ones used in this paper, or they may be implicitly defined through application code. Therefore, the source queries used in these mappings become crucial in determining the structure of the ABox. In particular, any dependencies that hold over the results of these queries will be reflected in the OBDA system as ABox dependencies.


Example 1. Let $\mathcal{R}$ be a DB schema with the relation schema employee with attributes id, dept, and salary, that stores information about employees, their salaries, and the department they work for. Let $\mathcal{M}$ be the following mappings:

    SELECT id, dept FROM employee
        $\leadsto$ Employee(emp(id)) $\land$ WORKS-FOR(emp(id), dept(dept))

    SELECT id, dept FROM employee WHERE salary > 1000
        $\leadsto$ Manager(emp(id)) $\land$ MANAGES(emp(id), dept(dept))

where Employee and Manager are atomic concepts and WORKS-FOR and MANAGES are atomic roles. Then for every instance $I$ of $\mathcal{R}$, the virtual ABox $\mathcal{V} = \langle \langle \mathcal{R}, I \rangle, \mathcal{M} \rangle$ satisfies the following dependencies:

    Manager $\sqsubseteq_A$ Employee    Manager $\sqsubseteq_A$ $\exists$MANAGES    $\exists$WORKS-FOR $\sqsubseteq_A$ Employee
                                        $\exists$MANAGES $\sqsubseteq_A$ Manager    Employee $\sqsubseteq_A$ $\exists$WORKS-FOR

In particular, the dependency in Column 1 follows from the containment relation between the two SQL queries used in the mappings, and the remaining dependencies follow from the fact that we populate WORKS-FOR (resp., MANAGES) using the same SQL query used to populate Employee (resp., Manager).
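The example can be replayed on a concrete instance. The sketch below uses an in-memory SQLite database with three invented rows (the data and the helper names are assumptions for illustration); it evaluates the two source queries of the mappings, builds the corresponding sets of ABox assertions with the object terms emp(id) and dept(dept), and checks the five dependencies on that one instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "sales", 900), (2, "sales", 1500), (3, "it", 2000)],  # invented rows
)

# Source queries of the two mappings.
rows_all = conn.execute("SELECT id, dept FROM employee").fetchall()
rows_mgr = conn.execute(
    "SELECT id, dept FROM employee WHERE salary > 1000"
).fetchall()


def emp(i):
    """Object term emp(id), rendered as a string."""
    return f"emp({i})"


def dept(d):
    """Object term dept(dept), rendered as a string."""
    return f"dept({d})"


# Assertions of the virtual ABox induced by the mappings.
employee = {emp(i) for i, _ in rows_all}
works_for = {(emp(i), dept(d)) for i, d in rows_all}
manager = {emp(i) for i, _ in rows_mgr}
manages = {(emp(i), dept(d)) for i, d in rows_mgr}

dependencies_hold = {
    "Manager ⊑_A Employee": manager <= employee,
    "Manager ⊑_A ∃MANAGES": manager <= {s for s, _ in manages},
    "∃MANAGES ⊑_A Manager": {s for s, _ in manages} <= manager,
    "∃WORKS-FOR ⊑_A Employee": {s for s, _ in works_for} <= employee,
    "Employee ⊑_A ∃WORKS-FOR": employee <= {s for s, _ in works_for},
}
```

A check on one instance is of course only an illustration; the claim in the example is that the dependencies hold for every instance, which follows from the structure of the SQL queries rather than from the data.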

Turning our attention to the semantics of the data sources, we note that any given DB is based on some conceptual model. At the same time, if we associate the data of any given DB to the concepts and roles of a TBox $\mathcal{T}$, it follows that this data is semantically related to these concepts and roles, and that the conceptual model of the DB has some common aspects with the semantics of $\mathcal{T}$. It is precisely these common aspects that manifest themselves as dependencies between queries in the mappings and that give rise to completeness in ABoxes. Therefore, the degree of completeness of an ABox in an OBDA system is directly related to how close the semantics of the conceptual model of the DB is to the semantics of the TBox, and to the degree to which the DB itself complies with the conceptual model that was used to design it.

Example 2. To illustrate the previous observations we extend Example 1. First we note that the intended meaning of the data stored in $\mathcal{R}$ is as follows: (i) employees with a salary higher than 1,000 are managers, (ii) managers manage the department in which they are employed, and (iii) every employee works for a department. Then, any TBox that shares some of this semantics will present redundancy. For example, if $\mathcal{T}$ is

    Manager $\sqsubseteq$ Employee    Manager $\sqsubseteq$ $\exists$MANAGES    Employee $\sqsubseteq$ $\exists$WORKS-FOR
    $\exists$MANAGES$^-$ $\sqsubseteq$ Department    $\exists$WORKS-FOR$^-$ $\sqsubseteq$ Department

then the first row of assertions is redundant w.r.t. $\Sigma$. Instead, the semantics of the assertions of the second row is not captured by the mappings. In an OBDA system with such components, we should reason only w.r.t. Department. This can be accomplished by optimizing $\mathcal{T}$ w.r.t. $\Sigma$ using the technique presented in Section 3.

5 Dependency Induction

We focus now on procedures to complete ABoxes with respect to TBoxes. The final objective is to simplify reasoning by diverting certain aspects of the process (e.g., dealing with concept/role hierarchies and domain and range assertions) from the query reformulation stage to other stages of the query answering process where they can be handled more efficiently. We call these procedures dependency induction procedures, since their result can be characterized by a set of dependencies that hold in the ABox(es) of the system. Formally, given an OBDA system $\mathcal{O} = \langle \mathcal{T}, \mathcal{V} \rangle$, where $\mathcal{V} = \langle \langle \mathcal{R}, I \rangle, \mathcal{M} \rangle$, we call a dependency induction procedure a procedure that uses $\mathcal{O}$ to compute a virtual ABox $\mathcal{V}'$ such that the number of assertions in $\mathcal{T}$ for which $\mathcal{V}'$ is complete is higher than the number for $\mathcal{V}$. An example of a dependency induction procedure is ABox expansion, a procedure in which the data in $I$ is chased w.r.t. $\mathcal{T}$. The critical point in dependency induction procedures is the trade-off between the degree of completeness induced, the system's performance, and the cost of the procedure. In [9] we presented two dependency induction mechanisms that provide good trade-offs. Both of them are designed for the case in which the data sources are RDBMSs. In the current paper we extend one of these procedures, the semantic index technique, to the DL-Lite_A setting. In particular, given a DL-Lite_A TBox $\mathcal{T}$ and a virtual ABox $\mathcal{V}$, the extended semantic index is able to generate a virtual ABox $\mathcal{V}'$ where, if $\mathcal{T} \models B \sqsubseteq A$ (resp., $\mathcal{T} \models R_1 \sqsubseteq R_2$), then $\mathcal{V}' \models B \sqsubseteq_A A$ (resp., $\mathcal{V}' \models R_1 \sqsubseteq_A R_2$). Hence, $\mathcal{V}'$ is complete for all DL-Lite_A inferences except those involving mandatory participation assertions, e.g., $B \sqsubseteq \exists R$.

Semantic Index. This technique applies in the context of general OBDA systems in which we are free to manipulate any aspect of the system to improve query answering. The basic idea is to encode the implied is-a relationships of $\mathcal{T}$ in the values of numeric indexes that we assign to concept and role names. ABox membership assertions are then inserted in the DB using these numeric values s.t. one can retrieve most of the implied instances of any concept or role by posing simple range queries to the DB (which are very efficient in modern RDBMSs). Our proposal is related to techniques for managing large transitive relations in knowledge bases (e.g., the is-a hierarchy) [1]; however, our interest is not in managing hierarchies but in querying the associated instance data. Our proposal is also related to a technique for XPath query evaluation known as Dynamic Intervals [3]; however, while the latter deals with XML trees, we have to deal with hierarchies that are DAGs. Formally, a semantic index is defined as follows.

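Definition 4. Given a DL-Lite_A TBox $\mathcal{T}$ and its vocabulary $V$, a semantic index for $\mathcal{T}$ is a pair of mappings $\langle \mathit{idx}, \mathit{range} \rangle$ with $\mathit{idx} : V \to \mathbb{N}$ and $\mathit{range} : V \to 2^{\mathbb{N} \times \mathbb{N}}$, such that, for each pair $E_1$, $E_2$ of atomic concepts or atomic roles in $V$, we have that $\mathcal{T} \models E_1 \sqsubseteq E_2$ iff there is a pair $\langle \ell, h \rangle \in \mathit{range}(E_2)$ such that $\ell \leq \mathit{idx}(E_1) \leq h$.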
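The defining condition of a semantic index can be read as a simple lookup. The sketch below hard-codes the idx and range values that Fig. 1 of this paper reports for the TBox of Example 3; the dictionary encoding and the function name `subsumed_by` are assumptions made for illustration.

```python
# idx and range values as reported in Fig. 1 (concepts A-D, roles R, S, M).
IDX = {"A": 1, "B": 2, "C": 3, "D": 4, "R": 5, "S": 6, "M": 7}
RANGE = {
    "A": [(1, 3)], "B": [(2, 2)], "C": [(3, 3)], "D": [(3, 4)],
    "R": [(5, 7)], "S": [(6, 6)], "M": [(7, 7)],
}


def subsumed_by(e1, e2):
    """True iff idx(e1) falls in one of the ranges of e2, i.e. T |= e1 ⊑ e2."""
    return any(lo <= IDX[e1] <= hi for lo, hi in RANGE[e2])
```

For instance, `subsumed_by("C", "D")` holds because idx(C) = 3 lies in the range (3, 4) of D, whereas `subsumed_by("B", "D")` does not, since idx(B) = 2 lies outside it.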

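Using a semantic index $\langle \mathit{idx}, \mathit{range} \rangle$ for a TBox $\mathcal{T}$, we construct $\mathcal{V} = \langle \mathcal{R}, I \rangle$ with the completeness properties described above by proceeding as follows. We define a DB schema $\mathcal{R}$ with a universal-like relation $T_C[c_1, \mathit{idx}]$ for storing ABox concept assertions, and a relation $T_R[c_1, c_2, \mathit{idx}]$ for storing ABox role assertions, s.t. $c_1$ and $c_2$ have type constant and $\mathit{idx}$ has type numeric. Given an ABox $\mathcal{A}$, we construct $I$ such that for each $A(c) \in \mathcal{A}$ we have $\langle c, \mathit{idx}(A) \rangle \in T_C$ and for each $P(c, c') \in \mathcal{A}$ we have $\langle c, c', \mathit{idx}(P) \rangle \in T_R$. The schema and the index allow us to define, for each atomic concept $A$ and each atomic role $P$, a set of range queries over $\mathcal{D}$ that retrieves most constants $c$, $c'$ such that $\mathcal{O} \models A(c)$ or $\mathcal{O} \models P(c, c')$. E.g., if $\mathit{range}(A) = \{\langle 2, 35 \rangle\}$, we define 'SELECT c1 FROM TC WHERE idx >= 2 AND idx <= 35'. We use these queries to define the mappings of the system as follows¹: (i) for each atomic concept $A$ and each $\langle \ell, h \rangle \in \mathit{range}(A)$, we add the mapping $\sigma_{\ell \leq \mathit{idx} \leq h}(T_C) \leadsto A(c_1)$; (ii) for each atomic role $P$ and each $\langle \ell, h \rangle \in \mathit{range}(P)$, we add the mapping $\sigma_{\ell \leq \mathit{idx} \leq h}(T_R) \leadsto P(c_1, c_2)$; (iii) for each pair of atomic roles $P$, $P'$ such that $\mathcal{T} \models P'^- \sqsubseteq P$ and each $\langle \ell, h \rangle \in \mathit{range}(P')$, we add the mapping $\sigma_{\ell \leq \mathit{idx} \leq h}(T_R) \leadsto P(c_2, c_1)$; (iv) for each atomic concept $A$, each atomic role $P$ s.t. $\mathcal{T} \models \exists P \sqsubseteq A$ (resp., $\exists P^- \sqsubseteq A$) and each $\langle \ell, h \rangle \in \mathit{range}(P)$, we add the mapping $\sigma_{\ell \leq \mathit{idx} \leq h}(T_R) \leadsto A(c_1)$ (resp., $\sigma_{\ell \leq \mathit{idx} \leq h}(T_R) \leadsto A(c_2)$); (v) last, we replace any pair of mappings $\sigma_{\ell \leq \mathit{idx} \leq h}(T_C) \leadsto A(c_1)$ and $\sigma_{\ell' \leq \mathit{idx} \leq h'}(T_C) \leadsto A(c_1)$ such that $\ell' \leq h$ and $\ell \leq h'$ by the mapping $\sigma_{\min(\ell, \ell') \leq \mathit{idx} \leq \max(h, h')}(T_C) \leadsto A(c_1)$ (similarly for role mappings).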
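The storage scheme and the range-query mappings can be tried out with a minimal sketch on an in-memory SQLite database. The idx values and ranges are those reported in Fig. 1 for the TBox of Example 3; the ABox assertions, the Python helper names, and the column types are assumptions made for illustration.

```python
import sqlite3

# idx values as reported in Fig. 1.
IDX = {"A": 1, "B": 2, "C": 3, "D": 4, "R": 5, "S": 6, "M": 7}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TC (c1 TEXT, idx INTEGER)")
conn.execute("CREATE TABLE TR (c1 TEXT, c2 TEXT, idx INTEGER)")

# Invented ABox: B(b1), C(c1), D(d1), S(x, y).
abox_concepts = [("B", "b1"), ("C", "c1"), ("D", "d1")]
abox_roles = [("S", "x", "y")]
conn.executemany("INSERT INTO TC VALUES (?, ?)",
                 [(c, IDX[a]) for a, c in abox_concepts])
conn.executemany("INSERT INTO TR VALUES (?, ?, ?)",
                 [(c, c2, IDX[p]) for p, c, c2 in abox_roles])


def concept_instances(lo, hi):
    """Mapping of kind (i): range selection over TC."""
    rows = conn.execute(
        "SELECT c1 FROM TC WHERE idx >= ? AND idx <= ?", (lo, hi))
    return {r[0] for r in rows}


def role_instances(lo, hi):
    """Mapping of kind (ii): range selection over TR."""
    rows = conn.execute(
        "SELECT c1, c2 FROM TR WHERE idx >= ? AND idx <= ?", (lo, hi))
    return {(r[0], r[1]) for r in rows}


def role_domain(lo, hi):
    """Mapping of kind (iv): first column of a range selection over TR."""
    rows = conn.execute(
        "SELECT c1 FROM TR WHERE idx >= ? AND idx <= ?", (lo, hi))
    return {r[0] for r in rows}


instances_of_A = concept_instances(1, 3)                       # range(A) = (1,3)
instances_of_R = role_instances(5, 7)                          # range(R) = (5,7)
instances_of_D = concept_instances(3, 4) | role_domain(5, 7)   # D also gets ∃R
```

Only the asserted facts are stored, each exactly once; the subsumees are picked up by the ranges at query time, so nothing is materialized for A or R.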

A semantic index can be trivially constructed by assigning to each concept and role a unique (arbitrary) value and a set of ranges that covers all the values of their subsumees. However, this is not effective for optimizing query answering, since the size of $\mathcal{M}$ determines exponentially the size of the final SQL query. To avoid an exponential blow-up, we create $\langle \mathit{idx}, \mathit{range} \rangle$ using the implied concept and role hierarchy as follows.

Let $\mathcal{T}$ be a TBox, and $D_C$ the minimal DAG that represents the implied is-a relation between all atomic concepts of $\mathcal{T}$ (i.e., the transitive reduct of the concept hierarchy)². Then we can construct $\mathit{idx}$ by initializing a counter $i = 0$, and visiting the nodes in $D_C$ in a depth-first fashion starting from the root nodes. At each step and given the node $N$ visited at that step, if $\mathit{idx}(N)$ is undefined, set $\mathit{idx}(N) = i$ and $i = i + 1$, else if $\mathit{idx}(N)$ is defined, backtrack until the next node for which $\mathit{idx}$ is undefined. Now, to generate $\mathit{range}$ we visit the nodes in $D_C$ starting from the leaves and going up. For each node $N$ in the visit, if $N$ is a leaf in $D_C$, then we set $\mathit{range}(N) = \{\langle \mathit{idx}(N), \mathit{idx}(N) \rangle\}$, and if $N$ is not a leaf, then we set $\mathit{range}(N) = \mathit{merge}(\{\langle \mathit{idx}(N), \mathit{idx}(N) \rangle\} \cup \bigcup_{N_i \mid N_i \to N \in D_C} \mathit{range}(N_i))$, where $\mathit{merge}$ is a function that, given a set $r$ of ranges, returns the minimal set $r'$ of ranges that has equal coverage as $r$, e.g., $\mathit{merge}(\{\langle 5, 7 \rangle, \langle 3, 5 \rangle, \langle 9, 10 \rangle\}) = \{\langle 3, 7 \rangle, \langle 9, 10 \rangle\}$.
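The construction can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the DAG is given as a dictionary from each node to its direct subsumees, the counter start is a parameter (set to 1 here so that the output matches the values shown in Fig. 1, whereas the text initializes it to 0), and `merge` fuses ranges that overlap or are adjacent over the integers, which is what "minimal set with equal coverage" is taken to mean.

```python
def build_idx(children, roots, start=1):
    """Depth-first numbering of a DAG; returns (idx, next free counter value)."""
    idx = {}
    counter = start

    def visit(node):
        nonlocal counter
        if node in idx:          # already numbered: backtrack
            return
        idx[node] = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)

    for root in roots:
        visit(root)
    return idx, counter


def merge(ranges):
    """Minimal list of integer ranges with the same coverage as the input."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:   # overlapping or adjacent
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def build_range(children, idx):
    """Bottom-up computation of range(N) from the ranges of direct subsumees."""
    ranges = {}

    def compute(node):
        if node not in ranges:
            collected = [(idx[node], idx[node])]
            for child in children.get(node, []):
                collected.extend(compute(child))
            ranges[node] = merge(collected)
        return ranges[node]

    for node in idx:
        compute(node)
    return ranges


# Concept and role hierarchies of Example 3 (node -> direct subsumees).
concept_children = {"A": ["B", "C"], "D": ["C"]}
role_children = {"R": ["S", "M"]}

concept_idx, next_free = build_idx(concept_children, roots=["A", "D"])
role_idx, _ = build_idx(role_children, roots=["R"], start=next_free)
concept_range = build_range(concept_children, concept_idx)
role_range = build_range(role_children, role_idx)
```

On the hierarchies of Example 3 this yields range(A) = (1, 3), range(D) = (3, 4), and range(R) = (5, 7); each is a single interval, which keeps the number of generated mappings small.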

We proceed in exactly the same way with the DAG $D_R$ representing the role hierarchy.

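Example 3. Let $A$, $B$, $C$, $D$ be atomic concepts, let $R$, $S$, $M$ be atomic roles, and consider the TBox $\mathcal{T} = \{B \sqsubseteq A,\ C \sqsubseteq A,\ C \sqsubseteq D,\ D \sqsubseteq \exists R,\ \exists R \sqsubseteq D,\ S \sqsubseteq R,\ M \sqsubseteq R\}$. Let the DAGs $D_C$ and $D_R$ for $\mathcal{T}$ be the ones depicted in Fig. 1. The technique generates $\mathit{idx}$ and $\mathit{range}$ as indicated in Fig. 1, generates the mappings in Fig. 3, and for any ABox, the technique generates a virtual ABox $\mathcal{V}$ that satisfies all the dependencies $\Sigma$ in Fig. 2. Then, we will have that in query answering, rewriting is only necessary w.r.t. $D \sqsubseteq \exists R$, which would be the output of $\mathit{optim}(\mathcal{T}, \Sigma)$.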

Our evaluation of the semantic index technique, described in [9], shows its effectiveness in improving the cost and efficiency of query answering.

6 Conclusions and Future Work

In this paper we focused on issues of redundancy and performance in OBDA systems.

Several directions can be taken starting from the ideas presented here. First, although

¹ Here we use relational algebra expressions instead of SQL to simplify the exposition.

² We assume w.l.o.g. that $\mathcal{T}$ does not contain a cyclic chain of basic concept or role inclusions. Under such an assumption $D_C$ is unique.


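    Node   idx   range
    A      1     {(1,3)}
    B      2     {(2,2)}
    C      3     {(3,3)}
    D      4     {(3,4)}
    R      5     {(5,7)}
    S      6     {(6,6)}
    M      7     {(7,7)}

Fig. 1. $D_C$, $D_R$ and the values for $\mathit{idx}$ and $\mathit{range}$.

    $B \sqsubseteq_A A$    $C \sqsubseteq_A A$    $C \sqsubseteq_A D$    $\exists R \sqsubseteq_A D$    $S \sqsubseteq_A R$    $M \sqsubseteq_A R$

Fig. 2. The induced dependencies.

    $\sigma_{1 \leq \mathit{idx} \leq 3}(T_C) \leadsto A(c_1)$        $\sigma_{5 \leq \mathit{idx} \leq 7}(T_R) \leadsto R(c_1, c_2)$
    $\sigma_{2 \leq \mathit{idx} \leq 2}(T_C) \leadsto B(c_1)$        $\sigma_{6 \leq \mathit{idx} \leq 6}(T_R) \leadsto S(c_1, c_2)$
    $\sigma_{3 \leq \mathit{idx} \leq 3}(T_C) \leadsto C(c_1)$        $\sigma_{7 \leq \mathit{idx} \leq 7}(T_R) \leadsto M(c_1, c_2)$
    $\sigma_{3 \leq \mathit{idx} \leq 4}(T_C) \leadsto D(c_1)$        $\sigma_{5 \leq \mathit{idx} \leq 7}(T_R) \leadsto D(c_1)$

Fig. 3. The mappings created by the technique.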

TBox pre-processing is the best place to first address redundancy, it is also necessary to apply redundancy elimination during reasoning, e.g., during query rewriting. Second, redundancy may also appear during consistency checking, i.e., when an ABox is sound w.r.t. the TBox; this can also be characterized with dependencies. With respect to evaluation, in [9] we presented preliminary experiments that show that the semantic index technique can provide excellent performance at a fraction of the cost of ABox expansion; however, further experimentation is still required. In particular, it is necessary to provide a comprehensive benchmark of the techniques discussed in this paper in comparison to other proposals. We are also exploring the context of SPARQL queries over RDFS ontologies; here we believe that our techniques can be used to provide high-performance SPARQL end points with sound and complete RDFS entailment regime support, without relying on inference materialization, as is usually done.

References

1. R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient management of transitive relationships in large data and knowledge bases. In Proc. of ACM SIGMOD, pages 253–262, 1989.

2. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385–429, 2007.

3. D. DeHaan, D. Toman, M. P. Consens, and M. T. Özsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. In Proc. of ACM SIGMOD, 2003.

4. G. Gottlob and C. G. Fermüller. Removing redundancy from a clause. Artif. Intell., 61, 1993.

5. R. Kontchakov, C. Lutz, D. Toman, F. Wolter, and M. Zakharyaschev. The combined approach to query answering in DL-Lite. In Proc. of KR 2010, 2010.

6. H. Pérez-Urbina, B. Motik, and I. Horrocks. Tractable query answering and rewriting under description logic constraints. J. of Applied Logic, 8(2):186–209, 2010.

7. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. J. on Data Semantics, X:133–173, 2008.

8. M. Rodríguez-Muro. Tools and Techniques for Ontology Based Data Access in Lightweight Description Logics. PhD thesis, KRDB Research Centre, Free Univ. of Bozen-Bolzano, 2010.

9. M. Rodríguez-Muro and D. Calvanese. Dependencies: Making ontology based data access work in practice. In Proc. of AMW 2011, 2011.

10. R. Rosati and A. Almatelli. Improving query answering over DL-Lite ontologies. In Proc. of KR 2010, 2010.