Computing the Skyline of a Relational Table Based on a Query Lattice

(1)

Based on a Query Lattice

Nicolas Spyratos, Tsuyoshi Sugibuchi, Ekaterina Simonenko, and Carlo Meghini

Laboratoire de Recherche en Informatique, Universit´e Paris-Sud 11, France {Nicolas.Spyratos, Tsuyoshi.Sugibuchi, Ekaterina.Simonenko}@lri.fr

Istituto di Scienza e Tecnologie della Informazione del CNR, Pisa, Italy [email protected]

Abstract. We propose a novel approach to computing the skyline set of a relational tableR, with respect to preferences expressed over one or more numerical attributes. Our approach is based on what we call the query lattice of R, and our basic algorithm constructs the skyline set as the union of the answers to a subset of queries from that lattice - hencewithoutdirectly accessing the table R. Therefore, in contrast to all existing techniques, our approach is independent of how the table R is implemented or how its tuples are indexed. We demonstrate the generality of our approach by computing the skyline set of the join of two tables based on the product of their individual query lattices - thereforewithout performing the join. The paper presents basic concepts and algorithms leaving experimentation and performance evaluation to a forthcoming paper.

Keywords: skyline, relational table, query lattice

1 Introduction

In many multicriteria decision-making applications, dominance analysis is an important aspect. As an example, consider a person looking for a vacation package using two criteria, or “attributes”: hotel rating and price. Intuitively, a package P =hr, pi is better than a package P⁰ =hr⁰, p⁰i ifP is better than P⁰ in one attribute and not worse than P⁰ in the other attribute. If this is the case then we say thatP dominatesP⁰.

For example, consider the following three packages:

– P1=h2,100i, P2=h3,130i, P3=h2,120i

Since a higher rating and a lower price are more preferable,P1dominatesP3. On the other hand,P1andP2don’t dominate each other becauseP1has a lower rating than P2 and P2 has a higher price. Similarly, P2 andP3 don’t dominate each other becauseP2has a higher rating thanP3andP3 has a lower price.

A package, or “tuple” that is not dominated by any other tuple is said to be a skyline tuple or to be in the skyline. The tuples in the skyline are the best

(2)

possible trade-offs among the attribute values appearing in the tuples. Thus in our example, packagesP1andP2 are in the skyline, whileP3 is not.

In order to conduct a skyline analysis, two items must be specified:

1. A set of attributes over which preferences are expressed (such as Hotel Rating and Price, in our example).

2. An ordering of the attribute values (either total or partial) according to which preference is expressed (such as the ordering of the integers for Hotel Rating to express that ratingris preferred to ratingr⁰ifr > r⁰; and similarly for Price to express that pricepis preferred to pricep⁰ ifp < p⁰).

In recent years, skyline analysis has gained considerable interest in the area of information systems in general, and in the area of databases in particular. How- ever, skyline analysis (i.e. computing non dominated points) existed well before the concept appeared in database research; it is known as the maximum vector problem or the Pareto optimum [11][18]. The popularity of skyline analysis in the area of information systems is mainly due to its applicability for decision making applications. Indeed, as information systems store larger and larger volumes of data today, data management and in particular query processing present difficult challenges. From the user viewpoint, large volumes of data imply answers of large size. By returning the best tuples (in terms of user preferences), the skyline query relieves the user from having to deal with answers of large size in order to find the best tuples.

The skyline operator was first introduced in [3], where the authors also present two basic algorithms: the Block Nested Loops (BNL) and the Divide and Conquer (D&C). In order to improve the performance of BNL algorithm, the SFS (sort-filter-skyline) algorithm was proposed in [5]. SFS runs on data sorted according to a monotonic function (namely, entropy descending). Such sorting guarantees the non-dominance of each object by those that follow in the order. Therefore, once an object is put into the buffer window, it can be reported as part of the skyline. Not only this makes the SFS algorithm progressive, but also allows to reduce the number of comparisons needed, since SFS compares only against the non-dominated tuples, whereas BNL often compares against dominated tuples [7]. Despite this improvement, all objects have to be scanned by the algorithm at least once. Succeeding approaches tend to avoid scanning the complete data set. Namely, SaLSa [1] uses the minimal coordinate of each object as a sorting function, and during the filter-scan step checks if all remain- ing objects are dominated by a so-called stop object, which can be determined inO(1) from the data accessed so far. The shortcoming is that the performance of SaLSa algorithm is affected by the data distribution and increasing dimen- sionality, since in higher dimensions instances of the problem the pruning power of the stop object is limited [29].

More generally, all sort-based techniques share the same drawback, namely the number of computations during the filter-scan step, as every input object should be compared with the skyline points in the buffer (which can potentially become large).

(3)

An alternative to the sort-based techniques is the use of indexes, which allows to avoid scanning all the input objects. The basic idea is to rely on an index in order to determine dominance between tuples, and to exclude tuples from further processing as early as possible. Two index-based algorithms, Bitmap and Index were first introduced in [21]. Bitmap uses the bitmap encoding of the data so that the dominating points are determined by a bit-wise “and” operation.

The Index approach partitions the objects into a set of lists. Each list is sorted by minimum coordinate and indexed by a B-tree. The objects are accessed in batches defined by the values of the minimal coordinate, while the algorithm computes local skylines in each “batch” of the lists and then merges them into a global skyline. However, besides the computation cost of Bitmap, and the neces- sity to construct a B-tree for every combination of dimensions having potential interest for the user, the order in which skyline points are returned by these algorithms is fixed and depends on the data distribution, so it cannot be adapted to the users preferences.

Two other index-based skyline algorithms, NN (nearest-neighbor) [13] and BBS (branch-and-bound skyline) [19], are based on the observation that the object closest to the origin has to be part of the skyline. Nearest neighbor search is used to retrieve such point by using the R-tree. The pitfall of the NN algorithm is that in order to iteratively find the next nearest neighbors it divides the data set into overlapping partitions, and therefore duplicates have to be removed by traversing the R-tree multiple times. To avoid that, BBS rather accesses partially dominated nodes of the R-tree.

The main drawback of all index-based approaches is that not all data can be indexed (namely when data is dynamically produced). Also, R-trees and other multidimensional indexes have their own limitations, namely the curse of dimen- sionality.

Concerning related work specifically targeting the multi-relational skyline (or skyline join), two progressive algorithms are proposed in [9]. The idea is to com- bine the join with nested-loop and sort-merge algorithms. However, each relation has to accessed multiple times in order to compute the skyline for each join value, and then the global skyline. In addition, each input object has to be scanned at least once.

More recent work also considers how to compute the skyline over the join of two or more relational tables without actually computing the join [23]. Apart from its applicability to computing skylines in a centralized database, the interest of such work lies in the fact that it is also applicable to distributed environments.

Earlier work related to this topic can be found in [1][2][8][10][20][25][26].

A lattice-based approach to single-relation skyline computation is introduced in [15], with the aim of proposing a data-distribution independent algorithm.

A lattice structure is used to answer skyline queries over dimensions with low- cardinality domains, or those that can be mapped to low-cardinality domains (such asPrice, that can be mapped to price ranges). The principle is to organize all the values combinations into a lattice based on the dominance relationship, and then to retrieve those that (a) are present in the input data set, and (b)

(4)

are not reachable by the dominance relationship from another element of the lattice, also belonging to the data set. However, no early pruning is done, so the entire data set has to be read twice in order to determine the skyline tuples, and the skyline join problem is not investigated. Also, the mapping of the values of a domain to a set of ranges has to be carefully tuned in order to deliver a meaningful skyline result, which is not discussed in the paper.

Several variants of skyline were introduced in [19], such as constrained, sub- space and dynamic skyline queries (see also [6][16][22][27][28]). Skyline queries have also been studied in various other domains, outside traditional databases.

These include probabilistic skyline computations over uncertain data [17](e.g.

data in sensor networks); skyline computations over incomplete data [12](e.g.

data with missing values); over data whose attributes have partially-ordered domains [4](e.g. preferences expressed by users online); over stream data[14]; or even bandwidth-constrained skyline computations over mobile devices [24].

In this paper, we present a novel approach to computing skylines which represents a major deviation from existing approaches. Indeed, instead of accessing individual tuples in a database table, our approach relies on the definition of skyline as the union of the answers to a set of queries. In doing so, our basic algorithm avoids accessing the table directly: access to the table is through queries, hence independent of how the table is implemented or how its tuples are indexed.

Given a relational tableR, our approach is based on what we call the query lattice ofR; and our basic algorithm constructs the skyline set as the union of the answers to a subset of queries from that lattice - hencewithoutdirectly accessing the tableR. We demonstrate the generality of our approach by computing the skyline of the join of two tables based on the product of their individual query lattices - therefore without performing the join. The paper presents basic concepts and algorithms leaving experimentation and performance evaluation to a forthcoming paper.

The paper is organized as follows. In section 2 we give some preliminary definitions and introduce our notation. In section 3 we present our basic algorithm for computing skylines through queries. In section 4 we apply our approach to computing skylines over joins, thus demonstrating the generality of the approach.

Finally, in section 5, we offer some concluding remarks and discuss further research.

2 Basic definitions

LetRbe a relational table, withA1, . . . , Anas attributes. LetB={B1, . . . , Bk}, k≤n,be the set of preference attributes,that is a set of attributes of the table whose domains are numeric and over which preferences are declared.

A preference over Bi is an expression of one of two forms: Bi → min or Bi →max.If the preferenceBi→min is expressed by a user of the table, then this is interpreted as follows: given two valuesxandyin the domain ofBi, xis preferred toyorxpreceeds yiffx < y; and similarly, if the preferenceBi→max

(5)

is expressed by a user, then this is interpreted as follows: given two valuesxand y in the domain ofBi, xis preferred to y orxpreceeds yiffx < y;

In order to simplify the presentation, and without loss of generality, we shall consider only one form of preference, namely Bi →min. However, all methods discussed in this paper can be applied with any combination of the preferences Bi →min andBi →max.Therefore, from now on, given two valuesxandy in the domain ofBi,we shall say thatxis preferred to y orxpreceeds y iffx < y.

Definition 1 (Pareto domination) LetB={B1, . . . , B_k} be a set of preference attributes of a relational table R and let s and t be tuples of R. We say that s is equivalent to t,denoted as s≡t iffs.B_i =t.B_i for all B_i∈ B. Moreover, we say that s Pareto dominates t, denoted ass <_{P a}t,iffs6≡t and for allB_i ∈ B, s.Bi≤t.Bi.

In order to simplify the presentation we will simply say “dominates” instead of “Pareto dominates”, and we shall drop the subscript in the notation, writing s < tinstead ofs <P at.

We shall callPareto preference query,or simplypreference query overR,any expression of the form (B1=b1)∧. . .∧(Bk=bk), where eachbi is a value in the domain of attribute Bi.For simplicity of notation we shall denote a preference query simply byhb1, . . . , b_ki.

Note that hb1, . . . , bki denotes also a tuple in the projection of R over the preference attributes; however, context will always disambiguate. Also note that a preference query hb1, . . . , b_ki returns the set of tuples inR whose projection over the preference attributes is the tuple hb1, . . . , b_ki; therefore the answer to each query of the formhb₁, . . . , b_kiis a Pareto equivalence class.

It is easy to verify that Pareto domination is irreflexive (i.e., s < sis false for each tuples) and transitive (i.e.,s < tand t < uimplys < u for all tuples s, tandu), hence a strict order overR.A partial order≤overRcan be defined from Pareto domination as follows:

s≤tiff (s < tor s=t)

for all tuplessandtin R.We shall say thatsPareto preceeds t,or simply that spreceeds t,whenevers≤t.

Clearly, Pareto precedence defines a partial order also over preference queries.

Moreover, given preference queriessandt we define the following operations:

– s⊗t=hmin{s.B₁, t.B₁}, . . . ,min{s.B_k, t.B_k}i – s⊕t=hmax{s.B1, t.B₁}, . . . ,max{s.Bk, t.B_k}i

It is easy to check that these operations make the set of preference queries over Rinto a complete lattice, with⊗defining the least upper bound and⊕defining the greatest lower bound of any two preference queries. We notice that this lattice may be infinite, as some of the domains of the preference attributes may be infinite. However, if we require that each bi in a preference query be in the active domain ofBi (i.e.if we require that eachbi appear inR), then the lattice

(6)

becomes finite and therefore it has a top and a bottom query, denoted as>and

⊥,respectively. These extreme elements are given by:

>=hm1, . . . , mki

⊥=hM1, . . . , M_ki

where mi and Mi are the minimum value and the maximum value appearing in the active domain of Bi, respectively. We shall call this (finite) lattice the query lattice,of R and we shall denote it as (Q,≤), where Qis the (finite) set of preference queries over R.

In this paper, we shall use the query lattice for two purposes: (a) as a tool for computing skylines of relational tables, and (b) as a means for comparing our approach to existing approaches. First, however, let’s define skylines formally.

Definition 2 The skylineof a tableR over preference attributesB,denoted by SKY(R,B), is the set of tuples fromR defined as follows:

SKY(R,B) ={t∈R| 6 ∃s∈R: s≤t}

In other words, the skyline of Ris the set of non-dominated tuples ofR.

As customary, given a queryqand a relationR,we will let ans(q, R) stand for the answer of q over R,that is the set of tuples obtained by asking query q against relation R.Moreover, we define a skyline query of R over preference attributes B,to be a queryq(R,B) overR whose answer is a skyline ofR over B, that is:

ans(q(R,B), R) =SKY(R,B)

Skyline queries will play a central role in our approach to skyline computation.

3 Computing the skyline of a relational table

In this Section, we present an algorithm for computing the skyline of a given table R. Our algorithm obtains the skyline by constructing a skyline query. To find the skyline query, our algorithm traverses part of the query lattice and collects a set of non-dominated queries whose disjunction is the skyline query.

In contrast to existing methods that actually construct the skyline, and as such are sensitive to the ways the tableR is implemented or how its tuples are accessed, our algorithm obtains the skyline while making no assumptions on how the relationRis accessed by the system while processing the skyline query.

Our algorithm uses the notion of successors of a query, defined as follows.

First, for each preference attribute Bi and value bi in the domain of Bi, let’s denote assucc(bi) the successor ofbiin the linear order of the domain ofBi.For instance, letBi be the Hotel Rating attribute of our earlier example, having as domain the interval [1,5]. As this interval is linearly ordered by the <relation, we havesucc(1) = 2,succ(2) = 3,and so on. This makessucca partial function over the domain ofBi,undefined for the maximumMi.Now, we extend thesucc function to preference queries as follows.

(7)

Letqbe a preference queryq=hb1, . . . , bkisuch thatq6=⊥.This means that bj6=Mj for at least onej∈[1, k], whereMj is the maximum value in the active domain ofBj.With no loss of generality, we shall assume that themvalues ofq that are not maximal, where 1≤m≤k,occur in the firstmpositions ofq,(i.e., q =hb1, . . . , bm, Mm+1, . . . , Mki). Then, the successors of q, succ(q), is defined to be the set of queriessucc(q) ={q1, . . . , q_m}, such that, for all 1≤i≤m,

qi=hb1, . . . , bi−1,succ(bi), bi+1, . . . , bm, Mm+1, . . . , Mki

Clearly, the succ function is undefined on the bottom of the lattice ⊥. The following Lemma gives two important properties of the successors of q for the establishment of the correctness of the following algorithm for computing a skyline query of the tableR.

Lemma 1. Let qbe a preference query. Then, for each queryq⁰ ∈succ(q) : 1. q≤q⁰

2. there is no queryq⁰⁰ such that q < q⁰⁰< q⁰.

We are now ready to give the algorithm for computing a skyline query of a table R.

AlgorithmSkyline Query over a Single Table (SQST)

Input A non-empty table R, a non-empty setB = {B₁, . . . , B_k} of preference attributes in R,and the projections ofRoverB₁, . . . , B_k.

Data We use the following variables for accumulating data during the execution of the algorithm:

– The variableF is a set variable calledfrontier; it is initialized to empty, and it is used to accumulate the queries whose disjunction will be the result of the algorithm (i.e., whose disjunction will be the skyline query).

– The variable C is a set variable containing the set of all current candidate queries; it is initialized to the top>of the query lattice.

– The variable S is a set variable used to accumulate successors of current candidate queries.

– The variable C⁰ is a (auxiliary) set variable for accumulating candidate queries for the next while-loop iteration in the algorithm; it is initialized to empty at the beginning of each iteration.

Output A set of preference queries overR,whose disjunction is a skyline query.

Method

C← {>};F ← ∅ whileC6=∅ do

for allc∈C such thatans(c, R)6=∅do C←C\{c};F ←F∪ {c}

(8)

end for C⁰← ∅

for allc∈C do

for alls∈succ(c) such that no query inF Pareto dominatessdo C⁰←C⁰∪ {s}

end for end for C←C⁰ end while return F

Informally, the algorithm works as follows:

– If the top query>of the lattice is non-empty (i.e.if its answer overRis non- empty) then the algorithm terminates;>is the output of the algorithm, and the answer of>overR is the skyline.

– Otherwise, each successor querycof>might be a candidate for contributing to the skyline, and this is checked as follows:

• if the querycis non-empty then it is added toF (i.e.to the variable that accumulates the queries contributing to the skyline);

• otherwise, each successor of c that is not dominated by a query in F becomes a candidate, by being added to the variableC⁰ (i.e.to the variable that accumulates all candidate queries for the next iteration). The so selected successors are finally transferred to the variableC.

This process is repeated until there is no more candidate left (i.e.until C is empty).

Upon termination of the algorithm,F contains all queries q such that: (a) the answer toqis non-empty, and (b)qis not dominated by another query. The disjunction of all queries inF is then the skyline query, and the answer of the skyline query over Ris the skyline ofR.

R

HID Price Rating h1 200 2 h2 150 5 h3 100 3 h4 300 2 h5 350 1 h6 100 8

Fig. 1.Example relation and query lattice.

Let us illustrate the algorithm by using the example tableR in Fig 1. We assume the preference attributes to be P rice and Rating and the preference

(9)

to be P rice →min and Rating →min. The projections of R over P rice and Rating,sorted in ascending order, are as follows:

– P rice: 100,150,200,300,350 – Rating: 1,2,3,5,8

The diagram in Fig 1 shows a part of the query lattice derived fromR. In this diagram, queries having non-empty answers are emphasized by bold letters.

Queries contributing to the skyline are enclosed by rounded rectangles. Gray triangles in the diagram represent areas dominated by queries in the skyline. As we can see in the diagram, the top of the query lattice is>=h100,1i. Therefore, we start the first iteration of the algorithm with C={h100,1i} andF =∅. At the end of each subsequent iteration of the while-loop, the contents ofCandF change as follows:

– end of 1st iteration:C={h150,1i,h100,2i}, F =∅

– end of 2nd iteration:C={h200,1i,h150,2i,h100,3i}, F =∅

In the third iteration, the queryh100,3i ∈Chas a non-empty result therefore h100,3i leaves C and enters F. We then consider the successors of the queries left inC.h150,3i ∈succ(h150,2i) is dominated byh100,3i ∈F thereforeh150,3i is omitted fromC⁰,which accumulates candidate queries for the next iteration.

– end of 3rd iteration:C={h300,1i,h200,2i}, F ={h100,3i}

– end of 4th iteration:C={h350,1i}, F ={h100,3i,h200,2i}

– end of 5th iteration:C=∅, F ={h100,3i,h200,2i,h350,1i}

(the algorithm stops here)

The correctness of the algorithm is easily established by observing that the algorithm explores the lattice completely (this is guaranteed by the second property of thesucc function in the above Lemma), retaining only maximal queries (this is guaranteed by the test on dominance performed by the algorithm and also by the fact that the successors of a query are all dominated by it, as stated by the first property of thesucc function in the above Lemma).

Formally, we will denote asSQST(R,B) the result of the SQST algorithm having R andB as the input non-empty relation and preference attributes, respectively. On the basis of the above observations, we state the following:

Proposition 1 For every relationR and preference attributesB overR, ans(_

SQST(R,B), R) =SKY(R,B)

We note that, as all queries inF are conjunctive and any two queries inF differ in at least one conjunct, the answers making up the skyline actually form a partition of the skyline. This is an interesting observation when combined with the notion of rank of a query in the query lattice.

(10)

Definition 3 (Rank of a query) The rank of a query q in the query lattice is defined as follows: if q is the root query then rank(q) = 0 else rank(q) is the maximum length of path from the root query to q

Clearly, the higher the rank of a query the less the tuples in its answer are preferred.

Now, as the skyline ofR is partitioned by the answers to the queries inF, one can ask new kinds of queries. For example, one can ask the query: “give me the best tuples from the skyline”. The answer to this query will be the answer to the query of lowest rank inF. Similarly, one can ask the query: “give me the ranks of all tuples in the skyline”. This query will return a set of ranks, thereby giving a useful information as to how far are the tuples of the skyline from the most preferred tuples. A detailed discussion of the relationship between “most preferred” and “non-dominated” tuples is given in a forthcoming paper.

4 Skylines of joins

We now consider the computation of the skyline over the join of two tables R1

andR2.To this end, we introduce the necessary concepts.

LetR₁ andR₂ be relations withA¹₁, . . . , Aⁿ₁¹ andA¹₂, . . . , Aⁿ₂² as attributes, respectively, and letB_i ={B¹_i, . . . , B^k_iⁱ}, k_i ≤n_i, be a set of attributes ofR_i, called thepreference attributes ofR_i,fori= 1,2.

We shall denote as (Q1,≤1) and (Q2,≤2) the query lattices overR1andR2, respectively. Moreover, ⊗i and⊕i will stand for the least upper bound and the greatest lower bound of any two preference queries overRi,respectively.

LetJ_i={J_i¹, . . . , J_i^l},be a set of attributes ofR_idisjoint form the preference attributesBi,fori= 1,2.AjoinoverJ1andJ2is a relational equijoinR1./eR2, whose join expression e is given by J₁¹ =J₂¹, . . . , J₁^l =J₂^l. In order to simplify the model and with no loss of generality, we will consider the join attributes to be the same for the two relations, that isJ1=J2,and moreover to consist of a single attributeJ,that iseis given byJ =J.

Intuitively, a preference query over a joinR₁./_eR₂is a (k₁+k₂)-tuple whose firstk₁ elements make up a query inQ₁ and whose lastk₂elements make up a query in Q₂. In order to simplify notation, we will commit a slight abuse and writehq₁, q₂ito represent a preference query overR₁./_eR₂,whereq₁∈Q₁and q₂∈Q₂.As a consequence, the set of preference queries overR₁./_eR₂,that we shall denote as Q./, is given by the Cartesian product of the set of preference queries over R1 andR2,that is:

Q./=Q1×Q2

Now, let ≤_./ be the Pareto preference relation over Q_./. As we have already seen in the previous Section, (Q_./,≤_./) is a complete lattice. Moreover, it is not difficult to see that:

(11)

Proposition 2 (Q./,≤./)is the product of(Q1,≤1)and(Q2,≤2).That is, let- tingqandq⁰ be preference queries inQ./ such thatq=hq1, q2iandq⁰=hq₁⁰, q₂⁰i, whereq1, q₁⁰ ∈Q1 andq2, q⁰₂∈Q2, we have:

1. q≤_./q⁰ iffq₁≤₁q⁰₁ andq₂≤₂q⁰₂ 2. q⊗./q⁰ =hq1⊗1q₁⁰, q2⊗2q₂⁰i 3. q⊕./q⁰ =hq1⊕1q₁⁰, q2⊕2q₂⁰i

From now on we shall simplify notation by omitting subscripts, unless this creates ambiguity.

We shall calljoin values the set of tuplesV =X₁∩X₂, where:

Xi =πJ(Ri) fori= 1,2

By definition of join, a tuple t in Ri contributes to the join if and only if its projection over the join attributeJ is inV,that ist.J∈V,fori= 1,2.Likewise, a queryqinQ_imay contribute to the skyline of the join only if it occurs in the join. For each join valuev∈V,we define thev−partition ofR_i, denotedS_i(v), as follows :

S_i(v) =π_B_i(σ_J=v(R_i)) fori= 1,2

In practice, each v−partition includes the queries that contribute to the join R1./ R2.In particular:

πB1∪B2(R1./ R2) = [

v∈V

S1(v)×S2(v)

v−partitions play an important role in determining the skyline of the joinR₁./

R2 without computing the join.

Proposition 3 A queryhq1, q2i ∈πB₁∪B₂(SKY(R1./ R2,B1∪ B2))iff for some join value v∈V, qi ∈πB_i(SKY(Si(v),Bi))for i= 1,2 and for no other v⁰∈V there exists skylines q_i⁰∈πBi(SKY(Si(v⁰),Bi))such that q_i⁰≤qi fori= 1,2.

Proof: (→) If for some join value v∈V, q_i ∈π_B_i(SKY(S_i(v),Bi))for i= 1,2 then hq1, q₂i ∈ π_B₁_∪B₂(R₁ ./ R₂), and moreover for 1 in Proposition 2 there exists no other query hq₁⁰, q₂⁰i ∈ π_B₁_∪B₂(R₁ ./ R₂) where q⁰_i ∈ S_i(v) for i = 1,2 such that hq₁⁰, q₂⁰i ≤ hq₁, q₂i. If for no other v⁰ ∈ V there exists skylines q⁰_i ∈ π_B_i(SKY(S_i(v⁰),B_i)) such that q_i⁰ ≤ q_i for i = 1,2 then again from 1 in Proposition 2 there exists no hq₁⁰, q₂⁰i such that hq⁰₁, q⁰₂i ≤ hq1, q2i. Hence hq1, q2i ∈π_B₁_∪B₂(SKY(R1./ R2,B1∪ B2)).

(←) Suppose not. Then, either (a) for no join valuev∈V, qi∈π_B_i(SKY(Si(v),Bi)) fori= 1,2or (b) there exists a join valuev⁰∈V and skylinesq_i⁰ ∈π_B_i(SKY(Si(v⁰),Bi)) such thatq⁰_i≤qi fori= 1,2.In case (a), there are two sub-cases: (a1) for no join value v ∈V, qi ∈Si(v) fori = 1,2. In this case, hq1, q2i 6∈ π_B₁_∪B₂(R1 ./ R2), against the hypothesis. (a2) for some join value v ∈ V, qi ∈ Si(v), but either q1 6∈ πB₁(SKY(S1(v),B1)) or q2 6∈ πB₂(SKY(S2(v),B2)). Then let q⁰_i be such that q⁰_i ∈πBi(SKY(Si(v),Bi)). Such q⁰_i are guaranteed to exist because Si(v)is finite and partially ordered by Pareto preference. By hypothesis, eitherq⁰₁6=q1or

(12)

q⁰₂6=q2. Then hq₁⁰, q₂⁰i ∈π_B₁_∪B₂(R1 ./ R2) and by 1 in Proposition 2hq⁰₁, q⁰₂i ≤ hq1, q2i,thereforehq1, q2i 6∈π_B₁_∪B₂(SKY(R1./ R2,B1∪B2)),against the hypothesis. (b) If there exists a join valuev⁰∈V and skylinesq_i⁰ ∈πB_i(SKY(Si(v⁰),Bi)) such thatq⁰_i≤qi fori= 1,2thenhq₁⁰, q₂⁰i ∈πB₁∪B₂(R1./ R2)and by 1 in Propo- sition 2 hq₁⁰, q₂⁰i ≤ hq1, q2i,thereforehq1, q2i 6∈πB1∪B2(SKY(R1./ R2,B1∪ B2)), against the hypothesis.

We now provide an algorithm for computing the skyline queries over the join of two tablesR1 andR2 (without computing the join.). The algorithm exploits Proposition 3, and uses the SQST algorithm for computing skylines over v- partitions of the given tables. As a result, the algorithm returns two sets of queries, one overR₁,the other overR₂,for selecting from each table the tuples that will generate the skyline of the joinR₁./ R₂,and only those.

AlgorithmThe Skyline Queries over a Join (SQJ)

Input Non-empty relations R1 andR2,non-empty setsB1andB2 of preference attributes in R1 andR2,respectively, and the join valuesV.

Data Gis a set variable initialized to empty and used to accumulate all candidate skyline queries, resulting from the Cartesian products of the skyline queries over the same join value. Ris a set variable where the skyline of Gis computed.Pi

and Fi (for i = 1,2) are auxiliary set variables, used to store v-partitions and final results, respectively.

Output Two sets of queries, one overR₁and the other overR₂. Method

G← ∅

for allv∈V do

P₁←SQST(S₁(v),B₁) P2←SQST(S2(v),B2) G←G∪(P1×P2× {v}) end for

R←SQST(G,B1∪ B2) F1←π_B₁_∪{J}(R) F2←π_B₂_∪{J}(R) return F1, F2

We note that the algorithm operates in three passes:

1. In the first pass, it gathers (in G) all candidate results by looping over all join values. This is in fact required by the first condition of Proposition 3, which states thathq1, q2iis in a skyline query of the join if bothq1 and q2

are skyline queries over thev-partitions for the same join value v.

2. In the second pass, it removes from Gthe compound queries that are dominated by some other compound query, as required by the second condition of Proposition 3.

(13)

3. Finally, it slices the compound queries vertically, in order to obtain queries over R1 (these are stored in F1) and queries over R2 (in F2). Notice that the join attributeJ must be transferred all along in order to generate the correct queries.

Clearly, there is no other way of proceeding since it is necessary to obtain all candidate queries in order to apply the second condition of Proposition 3.

Formally, we will denote asSQJ(R₁, R₂,B₁,B₂)_i thei−th result of theSQJ algorithm having R₁ and R₂ as the input non-empty relations and B₁ and B2 as preference attributes, respectively. For readability, we will abbreviate SQJ(R1, R2,B1,B2)i asSQJ_i.

On the basis of the above observations and of Proposition 3, we therefore state the following:

Proposition 4 For every pair of relationsR1 andR2 and preference attributes B1 andB2 over them,

SQST(ans(_

SQJ₁, R1)./ans(_

SQJ₂, R2),B1∪B2) =SKY(R1./ R2,B1∪B2) Let us demonstrate the SQJ algorithm by using table R1 (hotels) and R2

(restaurants) in Fig. 2. Suppose we want to find best combinations of hotels and restaurants in the same “Location”, by minimizing “Price”, “Rating”, “Dis- tance” and “Location” (we took this example from [23]). In this case, the preference attributes areB1={P rice, Rating},B2={Distance, Ranking}and the join attribute isJ ={Location}. From the intersection of values appearing in theLocationattribute inR₁ andR₂, we can obtain join valuesV ={A, B, C}.

In the algorithm, firstly we gather candidate skyline queries for each join value. In the example, for a join valueA∈V, we obtain v-partition S1(A) and S2(A). Then we compute skylines P1, P2 (emphasized in tables S1,S2 by bold letters) from S1(A), S2(A) by the applying the SQST algorithm. Finally we make a Cartesian productP1×P2× {A}and append it toGthat accumulates candidate skyline queries.

After iterating this candidate gathering process for every join value, we apply SQST toGto obtain the skyline (emphasized in tablesGby bold letters) over the joinR1./ R2. It is important to note the difference of size betweenR1./ R2

and G. In this example,R1 ./ R2 produces 12 tuples but Gcontains 8 tuples.

Therefore, we can compute the skyline fromGwith less cost than by computing directly fromR1./ R2.

In the example,hh5, r5iis not a skyline in the join because for the join value B,we haveh2dominatingh5andr1dominatingr5.On the other hand, neither h6 norr4 are skyline in their table, but they form a skyline in the join because they are skylines in their join group and there is no other group in which both are dominated (h6 is dominated by a query h3 in the A group, whereas r4 is dominated byr2 in theC group, and there is no single group in which both are dominated.)

(14)

R₁(Hotels)

HID Price Rating Location

h₁ 100 8 A

h₂ 150 5 B

h₃ 200 1 A

h₄ 400 2 A

h₅ 300 7 C

h₆ 350 3 B

S₁

Price Rating S₁(A)

100 8

P₁^A={h100,8i,h200,1i}

200 1

400 2

S₁(B) 150 5

P₁^B={h150,5i,h350,3i}

350 3

S₁(C) 300 7 P₁^C={h300,7i}

R₂(Restaurants)

RID Distance Ranking Location

r₁ 150 4 B

r₂ 250 2 C

r₃ 500 1 A

r₄ 400 3 B

r₅ 200 5 C

r6 500 6 A

S₂

Distance Ranking

S2(A) 500 1

P₂^A={h500,1i}

500 6

S2(B) 150 4

P₂^B={h150,4i,h400,3i}

400 3

S2(C) 250 2

P₂^C={h250,2i,h200,5i}

200 5

G

Price Rating Distance Ranking Location

100 8 500 1 A

P₁^A×P₂^A× {A}

200 1 500 1 A

150 5 150 4 B

P₁^B×P₂^B× {B}

150 5 400 3 B

350 3 150 4 B

350 3 400 3 B

300 7 250 2 C

P₁^C×P₂^C× {C}

300 7 200 5 C

F1

Price Rating Location

100 8 A

200 1 A

150 5 B

350 3 B

300 7 C

F2

Distance Ranking Location

500 1 A

150 4 B

400 3 B

250 2 C

Fig. 2.An example of skyline over the join of two tables

5 Concluding Remarks

We have seen a novel approach to computing the skyline of a relational table with respect to preferences expressed over one or more attributes with ordered domains. Our approach is based on what we called the query lattice of the table, and our basic algorithm constructs the skyline as the union of the answers to a subset of queries from that lattice — hencewithout directly accessing the table R. Therefore, in contrast to all existing techniques, our approach is independent of how the table R is implemented or how its tuples are indexed. We have demonstrated the generality of our approach by computing the skyline over the join of two tables based on the product of their individual query lattices — thereforewithout performing the join.

We note that our method is applicable to a computational geometry setting as well. Indeed, a discrete (finite)n-dimensional Euclidean spaceS can be thought of as a relational tableT(S) in which: (a) the attributes are the ndimensions

(15)

of S and (b) each tuple of T(S) represents a point in S. Moreover, the answer to a query q = hb1, . . . , bki from the query lattice of T(S) is the set of all points of S such that (B1 = b1)∧. . .∧(Bk = bk), whereB1, . . . , Bk are the corresponding dimensions; in other words, the answer to q is the set of points having the same coordinate values over the dimensionsB1, . . . , Bk. Additionally, our method can be applied to the Cartesian product of two or more spaces through the product lattice of the individual spaces. We are currently pursuing work in two different directions, namely refining skyline analysis and applying our approach to a distributed setting:

1. Refining skyline analysis

As we mentioned in section 3, the skyline query returned by our basic algorithm is the disjunction of a set of queries from the query lattice, say q₁∧. . .∧q_m; and the answers to these queries actually partition the skyline into disjoint subsets. Moreover, these queries have different ranks, in general.

Therefore it now becomes possible to ask finer queries regarding the skyline such as “give me the best tuples of the skyline” (meaning the answer to the queryqi of highest rank), or “return the skyline by presenting the answers toqi’s in increasing order of rank”, and so on. In this respect, we would also like to investigate in more detail the relationship between “most preferred tuple” and “non-dominated tuple”.

2. Applying our method to a distributed setting

As information systems store bigger and bigger volumes of data today, data management and in particular query processing presents difficult challenges.

From the user viewpoint, big volumes of data imply answers of large size.

By returning the best tuples (in terms of user preferences), the skyline query relieves the user from having to deal with answers of large size; and having the possibility to further refine the skyline (as mentioned above) further contributes in that direction. However, in recent years, data management and data storage have become increasingly distributed, and distribution presents additional challenges for query processing. Adapting the skyline operator to a distributed setting is one of the research lines that we are currently pursuing.

We believe that our approach to skyline computation through query lattices is particularly well suited in a distributed environment, where computation can be distributed and recomposed in the form of the product lattice.

References

1. I. Bartolini, P. Ciaccia, and M. Patella. SaLSa: computing the skyline without scanning the whole sky. In Proceedings of CIKM, 2006.

2. I. Bartolini, P. Ciaccia, and M. Patella. Efficient sort-based skyline evaluation.

ACM Trans. Database Syst., 33(4), 2008.

3. S. B¨orzs¨onyi, D. Kossmann, and K. Stocker. The skyline operator. In Proceed- ings of ICDE, pages 421-430, 2001.

4. C.-Y. Chan, P.-K. Eng, K.-L. Tan. Stratified computation of skylines with partially-ordered domains. In Proc. of SIGMOD 2005, pp. 03–214, 2008.

(16)

5. J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In Proceedings of ICDE, pages 717-816, 2003.

6. B. Cui, H. Lu, Q. Xu, L. Chen, Y. Dai, Y. Zhou. Parallel Distributed Processing of Constrained Skyline Queries by Filtering. In Proc. of ICDE 2008, pp. 546- 555, 2008.

7. P. Godfrey, R. Shipley, and Jarek Gryz. Algorithms and analyses for maximal vector computation. The VLDB Journal, vol 16(1), pp. 5–28, 2007.

8. W. Jin, M. Ester, Z. Hu, and J. Han. The multi-relational skyline operator. In Proceedings of ICDE, 2007.

9. W. Jin, M. Morse, J. Patel, M. Ester, and Z. Hu. Evaluating skylines in the presence of equi-joins. In Proc. of ICDE, 2010.

10. N. Koudas, C. Li, A. K. H. Tung, and R. Vernica. Relaxing join and selection queries. In Proceedings of VLDB, 2006.

11. H.T. Kung, F. Luccio, F.P. Preparata. On finding the maxima of a set of vectors. Journal of the ACM, vol. 22(4), pp. 469–476, 1975.

12. M.E. Khalefa, M.F. Mokbel, J.J. Levandoski. Skyline Query Processing for Incomplete Data. In Proc. of ICDE 2008, pp. 556-565, 2008.

13. D. Kossmann , F. Ramsak , S. Rost. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. In Proc. of VLDB 2002, pp. 275–286, 2002.

14. X. Lin , Y. Yuan , W. Wangnicta , S. Wales. Stabbing the sky: Efficient skyline computation over sliding windows. In Proc. of ICDE 2005, pp. 502–513, 2005.

15. M. Morse , J. Patel , H. V. Jagadish. Efficient Skyline Computation over Low- Cardinality Domains. In Proc. of VLDB 2007, pp. 267–278, 2007.

16. J. Pei , W. Jin , M. Ester , Y. Tao. Catching the best views of skyline: A semantic approach based on decisive subspaces. In Proc. of VLDB 2005, pp.

253–264, 2005.

17. J. Pei , B. Jiang , X. Lin , Y. Yuan. Probabilistic skylines on uncertain data.

In Proc. of VLDB 2007, pp. 15–26, 2007.

18. F.P. Preparata, M.I. Shamos. Computational Geometry Computational Geom- etry, Springer-Verlag, 1985.

19. D. Papadias, Yufei Tao, Greg Fu, Bernhard Seeger. An Optimal and Progressive Algorithm for Skyline Queries. In Proc. of SIGMOD 2003, pp. 467–478, 2003.

20. V. Raghavan, E. Rundensteiner. Progressive result generation for multi-criteria decision support queries. In Proceedings of ICDE, 2010.

21. K.-L. Tan, P.-K. Eng, B.C. Ooi. Efficient Progressive Skyline Computation. In Proc. of VLDB 2001, pp. 301-310, 2001.

22. Y. Tao, X. Xiao, and J. Pei. Subsky: Efficient computation of skylines in subspaces. In Proceedings of ICDE, 2006.

23. A. Vlachou, C. Doulkeridis, N. Polyzotis. Skyline query processing over joins.

In Proceedings of SIGMOD 2011, pp. 73–84, 2011.

24. A. Vlachou, K. Nørv˚ag. Bandwidth-constrained distributed skyline computation. In Proc. of MobiDE 2009, pp. 17–24, 2009.

25. D. Sun, S. Wu, J. Li, and A. K. H. Tung. Skyline-join in distributed databases.

In ICDE Workshops, 2008.

26. Q. Wan, R. C.-W. Wong, I. F. Ilyas, M. T. ¨Ozsu, and Y. Peng. Creating competitive products. PVLDB, vol. 2(1), 898-909, 2009.

27. P. Wu , C. Zhang , Y. Feng , B.Y. Zhao , D. Agrawal , A.E. Abbadi. Parallelizing skyline queries for scalable distribution. In Proc. of EDBT 2006, pp. 112-130, 2006.

28. Y. Yuan , X. Lin , Q. Liu , W. Wang , J.X. Yu , Q. Zhang. Efficient Computation of the Skyline Cube. In Proc. of VLDB 2005, pp. 241–252, 2005.

(17)

29. S. Zhang , N. Mamoulis , S.W. Cheung. Scalable skyline computation using object-based space partitioning. In Proc. of SIGMOD 2009, pp. 483-494, 2009.