Ontology-Based Access to Probabilistic Data

(1)

Jean Christoph Jung and Carsten Lutz Universit¨at Bremen, Germany {jeanjung,clu}@informatik.uni-bremen.de

Abstract. We propose a framework for querying probabilistic data in the presence of an ontology, arguing that the interplay of probabilities and ontologies is fruitful in applications such as managing data that was extracted from the web. The prime inference problem is computing answer probabilities, and we show that it can be implemented using standard probabilistic database systems, similar to traditional ontology- based data access. We demonstrate that query rewriting into first-order logic is an important tool for our framework. First, it is used to establish aPTimevs. #Pdichotomy for the data complexity of this problem by lifting a corresponding result from probabilistic databases. Then, we use it to characterize which pairs of query and TBox are inPTime. Finally, it is shown that non-existence of such a rewriting implies #P-hardness.

1 Introduction

In recent years,ontology based data access(OBDA) has become an active area of description logic research. In OBDA, an ontology provides a semantics forincom- pletedata with the aim of facilitating the computation of more complete answers to queries. There are applications, though, in which it is necessary to query data that is not only incomplete, but also uncertain. For instance, data extracted from web sources [24] such as an estate agents’ web page is typically incomplete.

It is also uncertain because web sources tend to be unreliable and extraction tools are based on heuristic decisions and thus significantly error prone. In this paper, we propose an extension of OBDA that captures uncertain data through a probabilistic data model and replaces the computation of certain answers with computing the probabilities of certain answers. In brief, our approach relates to probabilistic database systems (PDBMSs) in the same way that traditional OBDA relates to RDBMSs.

Framework. We consider probabilistic ABox formalisms that are inspired by data models from the currently very active area of probabilistic databases [7,32].

Specifically,pABoxesenrich classical ABoxes with probabilities that are attached to probabilistic events, and with event expressions that are attached to ABox assertions. For example, a pABox assertion SoccerPlayer(messi) can be associated with an event expression e₁∨e₂, where e₁ and e₂ represent events such as ‘web extraction tool x correctly analyzed webpage y stating that Messi is a soccer player’. Event e1 can then be associated with probability .7, and e2

with .9. Events are assumed to be probabilistically independent, which results

(2)

in a straightforward semantics that is similar to well-known probabilistic versions of datalog [28,12]. Ontologies are represented by description logic TBoxes. In this setting, which we callontology based access to probabilistic data (pOBDA), we are interested in computing the probabilities of certain answers to conjunctive queries (CQs). Note that uncertainty occurs only in the data, but neither in the ontology nor in the query. We believe that pOBDA is of general interest and potentially useful for a wide range of applications including the management of data extracted from the web, machine translation, and dealing with data that arises from sensor networks. All these applications can potentially benefit from a fruitful interplay between ontologies and probabilities; in particular, the ontology can help to reduce the uncertainty of the data.

Contributions. The main aim of this paper is to study the data complexity of pOBDA. More precisely, we pursue anon-uniformapproach as recently initiated in [27], which aims at fully classifying the data complexity of every pair (q,T) that consists of a CQ q and a TBox formulated in a fixed ‘master DL’. As a central tool of our analysis, we usequery rewriting into first-order (FO) queries, which is an important technique for traditional OBDA [6]. We start with showing that FO-rewritings from traditional OBDA are useful also in the context of pOBDA: for any pABoxA, the probability that a tupleais a certain answer to qover a pABoxArelative toT is identical to the probability thatais an answer to the FO-rewriting qT of q and T over A viewed as a probabilistic database.

This lifting of FO-rewritings to the probabilistic case immediately implies that one can implement pOBDA based on existing PDBMSs such as MayBMS, Trio, and MystiQ [1,35,5].

Lifting also allows us to carry over the dichotomy betweenPTimeand #P- hardness for computing the probabilities of answers to unions of conjunctive queries (UCQs) over probabilistic databases recently obtained by Dalvi, Suciu, and Schnaitter [8] to our pOBDA framework, provided that we restrict ourselves to TBoxes formulated in (the core version of) DL-Lite and to ipABoxes, which are a special case of pABoxes in which all ABox assertions are probabilistically independent. Based on a careful syntactic analysis, we provide a concrete characterization of those CQsqand DL-Lite TBoxesT for which computing answer probabilities is in PTime. We then proceed to showing that query rewriting is a complete tool for proving PTime data complexity in pOBDA, in the following sense: we replace DL-Lite with the strictly more expressive description logic ELI where, in contrast to DL-Lite, rewritings into first-order queries do not exist for every CQ q and TBox T; we then prove that if some (q,T) does not have a rewriting, then computing answer probabilities for q relative to T is #P-hard. Thus, if it is possible at all to prove that some (q,T) has PTime data complexity, then this can always be done using query rewriting. Both in DL-Lite andELI, the class of queries and TBoxes withPTimedata complexity is relatively small. This negative result is relativized by the fact that answer probabilities can often be efficiently approximated. In particular, all pairs (q,T) admit approximation in terms of a fully polynomial randomized approximation scheme (FPRAS)wheneverqis FO-rewritable relative toT. We provide a brief

(3)

discussion of such approximations and refer to the full version [19] of this paper for more information.

Related Work.The probabilistic ABox formalism studied in this paper is inspired by the probabilistic database models in [9], but can also be viewed as a variation of probabilistic versions of datalog and Prolog, see [28,12] and references therein.

There have recently been other approaches to combining ontologies and uncertainty for data access [11,14], with a different semantics; the setup considered by Gottlob, Lukasiewicz, and Simari in [14] is close in spirit to the framework studied here, but also allows probabilities in the TBox and has a different, rather intricate semantics based on Markov logic. In fact, we deliberately avoid probabilities in the ontology because (i) this results in a simple and fundamental, yet useful formalism that still admits a very transparent semantics and (ii) it enables the use of standard PDBMSs for query answering in the presence of ontologies.

Note that the analogous property of being implementable using state-of-the-art RDBMSs is a favorable feature of traditional OBDA [6]. There has also been a large number of proposals for enriching description logic TBoxes (instead of ABoxes) with probabilities, see [25,26] and the references therein. Our running application example is web data extraction, in the spirit of [16] to store extracted web data in a probabilistic database. Note that it has also been proposed to inte- grate both probabilities and ontologies directly into the data extraction tool [13].

We believe that both approaches can be useful and could even be orchestrated to play together.

This paper is a condensed version of [19]. For the missing proofs and some additional material we refer the reader to the long version of [19].

2 Preliminaries

We use standard notation for the syntax and semantics of description logics (DLs) and refer to [3] for full details. As usual, C, D denote (potentially) com- posite concepts, A, B concept names, r, s role names, R and S role names or their inverse, anda, bindividual names. WhenR=r⁻, thenR⁻ denotesr. We consider the ontology languages DL-Lite and ELI. Regarding the former, we concentrate on the dialect DL-Litecore, where TBoxes are finite sets ofconcept inclusions (CIs) B vB⁰ andBuB⁰ v ⊥with B and B⁰ concepts of the form

∃r,∃r⁻,>orA. InELI, aTBox is a finite set of CIsCvDwhereCandDare (potentially) compound concepts of the form>,A,C⁰uD⁰,∃r.C⁰, and∃r⁻.C⁰. As usual, anABox is a finite set of concept assertions A(a) and role assertions r(a, b). We use Ind(A) to denote the set of individual names used in the ABoxA and writer⁻(a, b)∈ Aforr(b, a)∈ A. The semantics of DLs is based on interpretationsI= (∆Î,·Î); we adopt the unique name assumption (UNA), i.e., we requireaÎ6=bÎ for all individualsa6=b.

Conjunctive queries (CQs) take the form ∃y.ϕ(x,y), withϕ a conjunction of atoms of the formA(t) andr(t, t⁰) and wherex,ydenote (tuples of) variables andt, t⁰denoteterms, i.e., a variable or an individual name. We call the variables

(4)

in xtheanswer variables and those inythequantified variables. The set of all variables in a CQqis denoted byvar(q) and the set of all terms inqbyterm(q).

A CQqisn-aryif it hasnanswer variables andBooleanif it is 0-ary. Whenever convenient, we treat a CQ as a set of atoms and sometimes write r⁻(t, t⁰) ∈q instead ofr(t⁰, t)∈q.

Let I be an interpretation and q a CQ with answer variables x1, . . . , xk. For individual names a = a1· · ·ak, an a-match for q in I is a mapping π : term(q)→∆Î such thatπ(xi) =ai for 1≤i≤k,π(a) = aÎ for all individual names a ∈ term(q), π(t) ∈ AÎ for all A(t)∈ q, and (π(t1), π(t2)) ∈ rÎ for all r(t1, t2)∈q. We writeI |=q[a] if there is ana-match ofqin Iand let ans(q,I) denote the set of alla withI |=q[a]. For a TBoxT and an ABoxA, we write T,A |=q[a] if I |=q[a] for all models I ofT andA. The setcertT(q,A) of all certain answers consists of all tuplesaoverInd(A) withT,A |=q[a].

3 Probabilistic OBDA

We introduce our framework for probabilistic OBDA, starting with the definition of a rather general, probabilistic version of ABoxes. LetEbe a countably infinite set ofatomic (probabilistic) events. An event expressionis built up from atomic events using the Boolean operators ¬, ∧, ∨. We use expr(E) to denote the set of all event expressions over E. A probability assignment for E ⊆ E is a map E→[0,1].

Definition 1 (pABox). Aprobabilistic ABox (pABox)is of the form(A, e, p) withAan ABox, ea mapA →expr(E), and pa probability assignment for E_A, the atomic events inA.

Example 1. We consider as a running example an information extraction tool that is gathering data from the web, see [16] for a similar setup. Assume we are gathering data about soccer players and the clubs they play for in the current 2012 season, and we want to represent the result as a pABox.

(1) The tool processes a newspaper article stating that ‘Messi is the soul of the Argentinian national soccer team’. Because the exact meaning of this phrase is unclear (it could refer to a soccer player, a coach, a mascot), it generates the assertionPlayer(messi) associated with the atomic event expressione₁ with p(e₁) = 0.7. The evente₁ represents that the phrase was interpreted correctly.

(2) The tool finds the Wikipedia page on Lionel Messi, which states that he is a soccer player. Since Wikipedia is typically reliable and up to date, but not always correct, it updates the expression associated withPlayer(messi) toe1∨e2

and associatese2 withp(e2) = 0.95.

(3) The tool finds an HTML table on the homepage of FC Barcelona saying that the top scorers of the season are Messi, Villa, and Pedro. It is not stated whether the table refers to the 2011 or the 2012 season, and consequently we generate the ABox assertionsplaysfor(x,FCbarca) forx∈ {messi,villa,pedro}all associated with the same atomic event expressione3withp(e3) = 0.5. Intuitively, the evente3 is that the table refers to 2012.

(5)

(4) Still processing the table, the tool applies the background knowledge that top scorers are typically strikers. It generates three assertions Striker(x) with x ∈ {messi,villa,pedro}, associated with atomic events e₄, e⁰₄, and e⁰⁰₄. It sets p(e4) = p(e⁰₄) =p(e⁰⁰₄) = 0.8. The probability is higher than in (3) since being a striker is a more stable property than playing for a certain club, thus this information does not depend so much on whether the table is from 2011 or 2012.

(5) The tool processes the twitter message ‘Villa was the only one to score a goal in the match between Barca and Real’. It infers that Villa plays either for Barcelona or for Madrid, generating the assertions playsfor(villa,FCbarca) and playsfor(villa,realmadrid). The first assertion is associated with the evente5, the second one with¬e5. It setsp(e5) = 0.5.

We now definte the semantics of OBDA over pABoxes. EachE⊆E_Acan be viewed as a truth assignment that makes all events in E true and all events in E_A\E false.

Definition 2. Let (A, e, p) be a pABox. For each E ⊆ E_A, define a corresponding non-probabilistic ABox AE := {α ∈ A | E |= e(α)}. The function p represents a probability distribution on 2^E^A, by setting for eachE⊆E_A:

p(E) =Y

e∈E

p(e)· Y

e∈EA\E

(1−p(e)).

The probability of an answera∈Ind(A)ⁿ to ann-ary conjunctive queryqover a pABoxA and TBoxT is

p_A,T(a∈q) = X

E⊆EA:a∈certT(q,AE)

p(E).

For Boolean CQs q, we write p(A,T |= q) instead of p_A,T(() ∈ q), where () denotes the empty tuple.

Example 2. Consider again the web data extraction example discussed above.

To illustrate how ontologies can help to reduce uncertainty, we use the DL-Lite TBox

T ={ ∃playsforvPlayer Playerv ∃playsfor

∃playsfor⁻vSoccerClub StrikervPlayer } Consider the following subcases considered above.

(1) + (3) The resulting pABox comprises the following assertions with associated event expressions:

Player(messi) e₁ playsfor(messi,FCbarca) e₃ playsfor(villa,FCbarca) e3 playsfor(pedro,FCbarca) e3

with p(e1) = 0.7 and p(e3) = 0.5. Because of the statement ∃playsfor vPlayer, usingT (instead of an empty TBox) increases the probability ofmessito be an answer to the queryPlayer(x) from 0.7 to 0.85.

(6)

(5) The resulting pABox is

playsfor(villa,FCbarca) e5 playsfor(villa,realmadrid) ¬e5

with p(e5) = 0.5. Although Player(villa) does not occur in the data, the probability of villa to be an answer to the queryPlayer(x) is 1 (again by the TBox- statement ∃playsforvPlayer).

(3)+(4) This results in the pABox

playsfor(messi,FCbarca) e3 Striker(messi) e4

playsfor(villa,FCbarca) e3 Striker(villa) e⁰₄ playsfor(pedro,FCbarca) e3 Striker(pedro) e⁰⁰₄

withp(e3) = 0.5 andp(e4) =p(e⁰₄) =p(e⁰⁰₄) = 0.8. Due to the last three CIs inT, each ofmessi,villa,pedrois an answer to the CQ∃y.playsfor(x, y)∧SoccerClub(y) with probability 0.9.

Related Models in Probabilistic Databases. Nowadays, there is an abundance of probabilistic data models that provide compact representation of distributions over potentially large sets of possible worlds, see [15,30,2] and the references therein. Our pABoxes can be viewed as an open world version of the probabilistic data model studied by Dalvi and Suciu in [9]. It is as a less succinct version ofc- tables, a traditional data model for probabilistic databases due to Imielinski and Lipski [18]. Since we are working with an open world semantics, pABoxes instead represent a distribution overpossible world descriptions. Each such description may have any number of models. Note that our semantics is similar to the semantics of (“type 2”) probabilistic first-order and description logics [17,26].

Dealing with Inconsistencies. Of course, some of the ABoxesAE might be inconsistent w.r.t. the TBox T used. In this case, it may be undesirable to let them contribute to the probabilities of answers. For example, if we use the pABox

Striker(messi) e₁ Goalie(messi) e₂

with p(e₁) = 0.8 and p(e₂) = 0.3 and the TBox GoalieuStriker v ⊥, then messi is an answer to the query SoccerClub(x) with probability 0.24 while one would probably expect it to be zero (which is the result when the empty TBox is used). We follow Antova, Koch, and Olteanu and advocate a pragmatic solution based on rescaling [2]. More specifically, we remove those ABoxesAE that are inconsistent w.r.t.T and rescale the remaining set of ABoxes so that they sum up to probability one. In other words, we set

pb_A,T(a∈q) = p_A,T(a∈q)−p(A,T |=⊥) 1−p(A,T |=⊥)

where⊥is a Boolean query that is entailed exactly by those ABoxesAthat are inconsistent w.r.t.T. The rescaled probabilitypb_A,T(a∈q) can be computed in PTimewhen this is the case both forp_A,T(a∈q) andp(A,T |=⊥). Note that

(7)

rescaling results in some effects that might be unexpected such as reducing the probability of messi to be an answer toStriker(x) from 0.8 to≈0.74 when the above TBox is added.

In the remainder of the paper, for simplicity we will only admit TBoxesT such that all ABoxesAare consistent w.r.t.T.

4 Query Rewriting

The aim of this section is to show that FO-rewriting, a prominent approach to traditional OBDA, are fruitful also in the case of computing probabilities of certain answers inprobabilisticOBDA. In particular, we use it to lift thePTime vs. #P dichotomy result on probabilistic databases recently obtained by Dalvi, Suciu, and Schnaitter [8] to probabilistic OBDA in DL-Lite.

4.1 Lifting FO-Rewritings to probabilistic OBDA

A first-order query q_T is anFO-rewriting of a CQqand an FO-TBoxT (i.e., a first-order theory) if cert_T(q,A) =ans(q_T,I_A) for every ABoxA, where I_A to denotes the ABoxAviewed as an interpretation. The query rewriting approach to traditional OBDA consists of computing, given a CQ q and a TBox T, an FO-rewritingq_T and then handing it over for execution to a relational database system that stores the ABoxA.

The following observation states that FO-rewritings from traditional OBDA are also useful in probabilistic OBDA. We use p^d_A(a ∈ q) to denote the probability that a is an answer to the query q given the pABox A viewed as a probabilistic database in the sense of Dalvi and Suciu [8]. More specifically,

p^d_A(a∈q) = X

E⊆EA|a∈ans(q,IAE)

p(E)

The following is immediate from the definitions.

Theorem 1 (Lifting). Let T be an FO-TBox, A a pABox, q an n-ary CQ, a ∈ Ind(A)ⁿ a candidate answer for q, and q_T an FO-rewriting of q relative toT. Thenp_A,T(a∈q) =p^d_A(a∈q_T).

From an application perspective, Theorem 1 enables the use of probabilistic database systems such as MayBMS, Trio, and MystiQ for implementing probabilistic OBDA [1,35,5]. Note that it might be necessary to adapt pABoxes in an appropriate way in order to match the data models of these systems. However, such modifications do not impair applicability of Theorem 1.

From a theoretical viewpoint, Theorem 1 establishes query rewriting as a useful tool for analyzing data complexity in probabilistic OBDA. We say that a CQ q is in PTime relative to a TBox T if there is a polytime algorithm that, given an ABox A and a candidate answer a ∈ Ind(A)ⁿ to q, computes p_A,T(a ∈ q). We say that q is #P-hard relative to T if the afore mentioned

(8)

problem is hard for the counting complexity class #P [33]. We pursue a non- uniform approach to the complexity of query answering in probabilistic OBDA, as recently initiated in [27]: ideally, we would like to understand the precise complexity of every CQq relative to every TBoxT, against the background of some preferably expressive ‘master logic’ used forT.

Unsurprisingly, pABoxes are too strong a formalism to admitany tractable queries worth mentioning. Ann-ary CQ q is trivial for a TBox T iff for every ABoxA, we havecert_T(A, q) =Ind(A)ⁿ.

Theorem 2. Over pABoxes, every CQqis#P-hard relative to every first-order TBoxT for which it is nontrivial.

Theorem 2 motivates the study of more lightweight probabilistic ABox formalisms. While pABoxes (roughly) correspond to c-tables, which are among the most expressive probabilistic data models, we now move to the other end of the spectrum and introduce ipABoxes as a counterpart oftuple independent databases [9,12]. Argueably, the latter are the most inexpressive probabilistic data model that is still useful.

Definition 3 (ipABox). An assertion-independent pABox (or ipABox) is a probabilistic ABox in which all event expressions are atomic and where each atomic event expression is associated with at most one ABox assertion.

To save notation, we write ipABoxes in the form (A, p) whereAis an ABox and pis a mapA →[0,1] that assigns a probability to each ABox assertion. In this representation, the events are implicit (one atomic event per ABox assertion) and we writep(α) to denote the probability of the event associated with the assertion α. Analogously to pABoxes, all events (and thus all assertions) are independent.

Note that the answer probability of a∈Ind(A)ⁿ to ann-ary conjunctive query qover an ipABox (A, p) relative to a TBoxT simplifies to

p_A,T(a∈q) = X

A⁰⊆A:a∈certT(q,A⁰)

p(A⁰) wherep(A⁰) =Q

α∈A⁰p(α)·Q

α∈A\A⁰(1−p(α)).

Reconsidering our web data extraction example, it turns out that cases (1) and (4) yield ipABoxes, whereas cases (2), (3), and (5) do not. We refer to [32]

for a discussion of the usefulness of ipABoxes/tuple independent databases. For the remainder of the paper, we assume that only ipABoxes are admitted unless explicitly noted otherwise.

4.2 Lifting the PTimevs. #P Dichotomy

We now use Theorem 1 to lift aPTimevs. #Pdichotomy recently obtained in the area of probabilistic databases to probabilistic OBDA in DL-Lite. Note that, for any CQ and DL-Lite TBox, an FO-rewriting is guaranteed to exist [6]. The central observation is that, by Theorem 1, computing the probability of answers to a CQ q relative to a TBox T over ipABoxes is exactly the same problem as computing the probability of answers to q_T over (ipABoxes viewed as) tuple

(9)

r s

t t⁰ A

q1

s

r

t A

q2

s s

r q3

Fig. 1.Example queries

independent databases. We can thus analyze the complexity of CQs/TBoxes over ipABoxes by analyzing the complexity of their rewritings. In particular, standard rewriting techniques produce for each CQ and DL-Lite TBox an FO- rewriting that is a union of conjunctive queries (a UCQ) and thus, together with Theorem 1, Dalvi, Suciu and Schnaitter’sPTime vs. #P dichotomy for UCQs over tuple independent databases [8] immediately yields the following.

Theorem 3 (Abstract Dichotomy). Letq be a CQ andT a DL-Lite TBox.

Thenq is in PTimerelative to T orq is#P-hard relative toT.

Note that Theorem 3 actually holds for every DL that enjoys FO-rewritability.

Although interesting from a theoretical perspective, Theorem 3 is not fully satis- factory as it does not tell uswhichCQs are inPTimerelative to which TBoxes.

In the remainder of this section, we carry out a careful inspection of the FO- rewritings obtained in our framework and of the dichotomy result obtained by Dalvi, Suciu and Schnaitter, which results in a more concrete formulation of the dichotomy stated in Theorem 3 and provides a transparent characterization of the PTime cases. For simplicity, we concentrate on CQs that are connected, Boolean, and do not contain individual names.

For two CQsq, q⁰and a TBoxT, we say thatqT-impliesq⁰and writeqv_T q⁰ when certT(q,A)⊆certT(q⁰,A) for all ABoxes A; we say that q andq⁰ are T- equivalent and writeq≡T q⁰ ifqvT q⁰ andq⁰vT q; we say thatqisT-minimal if there is noq⁰ (qsuch thatq≡T q⁰. WhenT is empty, we simply drop it from the introduced notation, writing for exampleqvq⁰ and speaking ofminimality.

To have more control over the effect of the TBox, we will generally work with CQsqand TBoxesT such thatqisT-minimal. This is without loss of generality because for every CQ q and TBox T, we can find a CQ q⁰ that is T-minimal and such that q ≡_T q⁰ [4]; note that the answer probabilities relative toT are identical forqandq⁰.

We now introduce a class of queries that will play a crucial role in our analysis.

Definition 4 (Simple Tree Queries). A CQ q is a simple tree if there is a variable x_r ∈var(q) that occurs in every atom in q, i.e., all atoms in q are of the form A(xr),r(xr, y), or r(y, xr) (y =xr is possible). Such a variable xr is called a root variable.

As examples, consider the CQs in Figure 1, which are all simple tree queries.

The following result shows why simple tree queries are important. A UCQqbis reduced if for all disjunctsq, q⁰ ofq,b qvq⁰ impliesq=q⁰.

(10)

Theorem 4. Let q be a CQ and T a DL-Lite TBox such that q isT-minimal and not a simple tree query. Thenq is#P-hard relative to T.

Proof.(sketch) LetqT be a UCQ that is an FO-rewriting ofqrelative toT. By definition of FO-rewritings, we can w.l.o.g. assume that q occurs as a disjunct ofqT. The following is shown in [8]:

1. if a minimal CQ does not contain a variable that occurs in all atoms, then it is #P-hard over tuple independent databases;

2. if a reduced UCQqbcontains a CQ that is #P-hard over tuple independent databases, thenqbis also hard over tuple independent databases.

Note that since q is T-minimal, it is also minimal. By Points 1 and 2 above, it thus suffices to show that qT can be converted into an equivalent reduced UCQ such thatq is still a disjunct, which amounts to proving that there is no disjunctq⁰ inq_T such that qvq⁰ andq⁰ 6vq. The details of the proof, which is

surprisingly subtle, are given in the appendix. o

To finish anlyzing the dichotomy, it thus remains to analyze simple tree queries. We say that a role R isT-generated in a CQ q if one of the following holds: (i) there is an atom R(xr, y) ∈ q and y 6= xr; (ii) there is an atom A(xr) ∈ q and T |=∃R vA; (iii) there is an atom S(x, y)∈ q with xa root variable and such thaty6=xoccurs only in this atom, andT |=∃Rv ∃S. The concrete version of our dichotomy result is as follows. Its proof is based on a careful analysis of FO-rewritings and the results in [10].

Theorem 5 (Concrete Dichotomy).LetT be a DL-Lite TBox. AT-minimal CQ qis in PTimerelative toT iff

1. qis a simple tree query, and

2. if r and r⁻ are T-generated in q, then {r(x, y)} v_T q or q is of the form {S1(x, y), . . . , Sk(x, y)} for rolesS1, . . . , Sk.

Otherwise, qis#P-hard relative to T.

As examples, consider again the queriesq₁,q₂, andq₃in Figure 1 and let T_∅ be the empty TBox. All CQs are T_∅-minimal, q₁ and q₂ are in PTime, and q₃ is

#P-hard (all relative to T_∅). Now consider the TBoxT ={∃sv ∃r}. Then q₁ isT-minimal and still inPTime;q₂ isT-minimal, and is now #P-hard because bothsands⁻ isT-generated. The CQq3 can be madeT-minimal by dropping ther-atom, and is inPTimerelative toT.

Approximations.Theorems 4 and 5 show that only very simple pairs (q,T) can be answered in PTime. Hence, it is reasonable to consider the approximation of answer probabilities. Of particular relevance arefully-polynomial randomized approximation schemes (FPRASes), see [22,34] for a definition and more details.

In [9] it is observed that there is an FPRAS for every UCQ over a probabilistic database. As a consequence of Theorem 1, there is thus also an FPRAS for every pair (q,T) such that q is FO-rewritable relative to T; in particular, there are FPRASes for every CQ and DL-Lite ontology even if additional features such

(11)

as role inclusions are admitted. This observation clearly gives hope for practical feasibility of probabilistic OBDA. In the full paper, we also consider the existence of FPRASes for more expressive ontology languages [19].

5 Beyond Query Rewriting

A CQ qis FO-rewritable relative to a TBoxT if there is an FO-rewriting of q relative toT. The aim of this section is to establish that, in a sense, FO-rewriting is acomplete tool for provingPTime results for CQ answering in probabilistic OBDA: we show that whenever a CQqis not FO-rewritable relative to a TBox T, thenqis #P-hard relative toT; thus, when a query is inPTimerelative to a TBoxT, then this canalwaysbe shown via FO-rewritability. To achieve this goal, we selectELIas the TBox language because it properly generalizes DL-Lite (as in the previous sections we disregard⊥) and, unlike DL-Lite, also embraces non FO-rewritable CQs/TBoxes. Note that, in traditional OBDA, there is a drastic difference in data complexity of CQ-answering between DL-Lite and ELI: the former is in AC0 while the latter isPTime-complete.

We focus on Boolean CQs q that are rooted, i.e., q involves at least one individual name and is connected. This is a natural case since, for any non- Boolean connected CQ q(x) and potential answer a, the probabilityp_A,T(a ∈ q(x)) thatais a certain answer toqw.r.t.AandT is identical to the probability p(A,T |=q[a]) thatAandT entail the rooted Boolean CQq[a]. The following theorem says that there is no hope forPTimealgorithms in the case a query q is not FO-rewritable relative toT.

Theorem 6. If a Boolean rooted CQqis not FO-rewritable relative to anELI- TBoxT, thenq is#P-hard relative to T.

As a by-product of Theorem 6 we obtain the following dichotomy.

Theorem 7 (ELI dichotomy).Let qbe a rooted Boolean CQ andT anELI- TBox. Then qis in PTimerelative to T or#P-hard relative toT.

6 Conclusion

We have introduced a framework for ontology-based access to probabilistic data that can be implemented using existing probabilistic database system, and we have analyzed the data complexity of computing answer probabilities in this framework. There are various opportunities for future work. For example, it would be interesting to extend theconcrete dichotomy: on the one hand, it can be extended to CQs that involve constants and are not necessarily connected;

on the other hand, one could study more expressive versions of DL-Lite that, for example, allow role hierarchy statements in the TBox. It would also be worth- while to add means to express uncertainty to the TBox formalism instead of admitting it only in the ABox; this is done for example in [28,12], but it remains to be seen whether the semantics used there is appropriate for our purposes.

(12)

References

1. Antova, L., Jansen, T., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: Proc. of ICDE. 983–992 (2008)

2. Antova, L., Koch, C., Olteanu, D.: 10¹⁰⁶worlds and beyond: efficient representation and processing of incomplete information. VLDB J. 18(5), 1021–1040 (2009) 3. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.

(eds.): The Description Logic Handbook. Cambridge University Press (2003) 4. Bienvenu, M., Lutz, C., Wolter, F.: Query containment in description logics recon-

sidered. In: Proc. of KR (2012)

5. Boulos, J., Dalvi, N.N., Mandhani, B., Mathur, S., R´e, C., Suciu, D.: MYSTIQ:

a system for finding more answers by using probabilities. In: Proc. of SIGMOD.

891–893 (2005)

6. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: TheDL-Litefamily.

J. Autom. Reasoning 39(3), 385–429 (2007)

7. Dalvi, N.N., R´e, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Com- mun. ACM 52(7), 86–94 (2009)

8. Dalvi, N.N., Schnaitter, K., Suciu, D.: Computing query probability with incidence algebras. In: Proc. of PODS. 203–214. ACM (2010)

9. Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB J. 16(4), 523–544 (2007)

10. Dalvi, N.N., Suciu, D.: The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 30 (2012)

11. Finger, M., Wassermann, R., Cozman, F.G.: Satisfiability inELwith sets of probabilistic ABoxes. Proc. of DL. CEUR-WS, Vol. 745 (2011)

12. Fuhr, N., R¨olleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15(1), 32–66 (1997) 13. Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G., Schallhart, C., Sellers, A.J., Wang, C.: Diadem: domain-centric, intelligent, automated data extraction methodology. In Proc. of WWW. 267–270. ACM (2012) 14. Gottlob, G., Lukasiewicz, T., Simari, G.I.: CQ answering in probabilistic datalog+/−ontologies. In Proc. of RR. Vol. 6902 of LNCS, 77–92. Springer (2011) 15. Green, T.J., Tannen, V.: Models for incomplete and probabilistic information.

IEEE Data Engineering Bulletin 29(1), 17–24 (2006)

16. Gupta, R., Sarawagi, S.: Creating probabilistic databases from information extraction models. In Proc. of VLDB. 965–976. ACM (2006)

17. Halpern, J.Y.: An analysis of first-order logics of probability. Artif. Intell. 46(3), 311–350 (1990)

18. Imielinski, T., Jr., W.L.: Incomplete information in relational databases. J. of the ACM 31(4), 761–791 (1984)

19. Jung, J.C., Lutz, C.: Ontology-based access to probabilistic data with OWL QL.

In: Proc. of ISWC. pp. 182–197. Springer (2012)

20. Jerrum, M., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci. 43, 169–188 (1986) 21. Karger, D.R.: A randomized fully polynomial time approximation scheme for the

all-terminal network reliability problem. SIAM J. Comput. 29(2), 492–514 (1999) 22. Karp, R.M., Luby, M.: Monte-carlo algorithms for enumeration and reliability prob-

lems. In Proc. of FoCS. 56–64. IEEE Computer Society (1983)

(13)

23. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com- bined approach to query answering in DL-Lite. In Proc. of KR. AAAI Press (2010) 24. Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey

of web data extraction tools. SIGMOD Record 31(2), 84–93 (2002)

25. Lukasiewicz, T., Straccia, U.: Managing uncertainty and vagueness in description logics for the semantic web. J. Web Sem. 6(4), 291–308 (2008)

26. Lutz, C., Schr¨oder, L.: Probabilistic description logics for subjective uncertainty.

In Proc. of KR. AAAI Press (2010)

27. Lutz, C., Wolter, F.: Non-uniform data complexity of query answering in description logics. In: Proc. of KR. AAAI Press (2012)

28. Raedt, L.D., Kimmig, A., Toivonen, H.: Problog: a probabilistic prolog and its application in link discovery. In Proc. of IJCAI. 2468–2473. AAAI Press (2007) 29. Rossman, B.: Homomorphism preservation theorems. J. ACM. 55(3). 1–54 (2008).

30. Sarma, A.D., Benjelloun, O., Halevy, A.Y., Widom, J.: Working models for uncertain data. In: Proc. of ICDE. IEEE Computer Society (2006)

31. Straccia, U.: Top-k retrieval for ontology mediated access to relational databases.

In: Information Sciences. 108, 1–23 (2012).

32. Suciu, D., Olteanu, D., R´e, C., Koch, C.: Probabilistic Databases. Synthesis Lec- tures on Data Management, Morgan & Claypool Publishers (2011)

33. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J.

Comput. 8(3), 410–421 (1979)

34. Vazirani, V.V.: Approximation algorithms. Springer (2001)

35. Widom, J.: Trio: A system for integrated management of data, accuracy, and lin- eage. In Proc. of CIDR. 262–276 (2005)

36. Zenklusen, R., Laumanns, M.: High-confidence estimation of smalls-treliabilities in directed acyclic networks. Networks 57(4), 376–388 (2011)