Optimal Top-k Generation of Attribute Combinations based on Ranked Lists

(1)

Optimal Top-k Generation of

Attribute Combinations based on Ranked Lists

Jiaheng Lu,

Renmin University of China

Joint work with Pierre Senellart, Chunbin Lin, Xiaoyong

Du, Shan Wang, and Xinxing Chen

(2)

Motivation & Problem Statement

Goal: Select a combination of three players

including forward, center, and guard positions.

Methods

select players with the highest score in each group calculate the average scores of players across all games

… …

Limitation:

overlook team spirit Game

IDScore

(3)

Motivation & Problem Statement

consider their combined scores in the same game

Select the top-k combinations according to top-m aggregate scores

Top-k,m Problem

Tuple aggregation function

Instance aggregation function

(4)

Motivation & Problem Statement

Top-1,2: select the top-1 combination of players

according to top-2 aggregate scores for games where they played together.

F2C1G1 is the best combination, since (21.51 + 18.76) is the highest overall score.

(5)

 Difference between top-k queries and top-k,m queries

Top-k Top-k,m

Return the top-k tuples Return the top-k

combinations of attributes

Can be transformed into a SQL

Cannot be transformed into a

SQL

(6)

Application

XML keyword refinement

Example Example

Q = {DB;UC Irvine; 2002}

Groups:

G1 = {"DB"; "database"}, G2={"UCI";"UC Irvine"}

G3 = {"2002"}.

Answer:

Q’={DB, UCI, 2002}

Consider a top-1,2 query

(7)

Application (Cont.)

• Evidence combination mining in medical databases

• Package recommendation systems

• …

(8)

 Motivation & Problem Statement

 Top-k,m Query Processing

 Experimental Results

 Conclusion

Outline

(9)

Top-k,m Query Processing

Access Model: Sorted Accesses

(a, 9.0) (b, 8.7) (c, 8.7) (d, 7.4) (i, 5.3)

……

(i, 8.8)

(f, 6.9) (a, 7.5)

(d, 4.7)

(c, 7.9)

……

(10)

Top-k,m Query Processing

Access Model: Random Accesses

(a, 9.0) (b, 8.7) (c, 8.7) (d, 7.4) (i, 5.3)

……

(i, 8.8)

(f, 6.9) (a, 7.5)

(d, 4.7)

(c, 7.9)

……

(11)

Top-k,m Query Processing

Baseline Method: ETA

Compute top-m tuples for each

combination

Threshold Algorithm (TA)

Calculate aggregate score for each

combination

Return the top-k combinations

(12)

Top-k,m Query Processing

Upper and Lower bounds Algorithm: ULA

Consider top-m seen match instances

Lower Bound

Consider threshold value and top-m match instances

Upper Bound

Compute the upper and lower bounds for

each combination

Termination condition:

k combinations meet the hit-condition

(13)

(G1, 9.3) (G2, 8.3)

(G5,7.8) (G11,7.3)

(G2, 7.9) (G1,7.0)

(G4,8.0) (G8,7.3)

(G8, 3.0) (G4, 2.6)

(G4, 1.8) (G2, 1.5)

(G11, 4.2) (G5, 3.3)

(G2, 4.4) (G1, 2.3)

Upper and Lower bounds Algorithm: ULA

A1B1

(14)

(G1, 9.3) (G2, 8.3)

(G5,7.8) (G11,7.3)

(G2, 7.9) (G1,7.0)

(G4,8.0) (G8,7.3)

(G8, 3.0) (G4, 2.6)

(G4, 1.8) (G2, 1.5)

(G11, 4.2) (G5, 3.3)

(G2, 4.4) (G1, 2.3)

Upper and Lower bounds Algorithm: ULA

A1B1 U: 34.4 L: 32.5 A1B2 U: 34.6

L: 22.2 A2B1 U: 31.4

L: 20.5 A2B2 U: 31.6

L: 9.8 L: 32.5

A1B1 （ G1, 9.3+7.0 ） , （ G2, 7.9+8.3 ）

U: 31.4

A2B1 Threshold value=7.8+7.9=15.7, 15.7*2=31.4

(15)

(G1, 9.3) (G2, 8.3)

(G5,7.8) (G11,7.3)

(G2, 7.9) (G1,7.0)

(G4,8.0) (G8,7.3)

(G8, 3.0) (G4, 2.6)

(G4, 1.8) (G2, 1.5)

(G11, 4.2) (G5, 3.3)

(G2, 4.4) (G1, 2.3)

Upper and Lower bounds Algorithm: ULA

A1B1 U: 32.5 L: 32.5 A1B2 U: 31.2

L: 22.2

A1B1

(16)

Can we run fast?

(17)

Optimization heuristics (1)

Pruning combinations without computing the bounds

(A3,B2) is dominated by (A2,B1)

6.3<7.1 and 8.0<8.2

(18)

Optimization heuristics (2)

Reducing the number of accesses

Avoiding both sorted and random accesses for specific lists

(A1,B1)and(A1,B2) cannot be part of answers, all sorted

accesses and random accesses on list A1 are unnecessary.

(19)

Optimization heuristics (3)

Reducing the number of accesses

Reducing random accesses across two lists

(A1,B1,C1)and(A1,B1,C2) cannot be part of answers, random

accesses between A1 and B1 are unnecessary.

(20)

Optimization heuristics (4)

Reducing the number of accesses

Eliminating random accesses for specific tuples

Random access from L

_e

to L

_t

for tuple x is useless

(21)

Prune dominated combinations

Compute upper and lower bounds for unterminated combinations

Terminate combinations by reducing number of accesses

Until k

combinations meet hit-

condition

Top-k,m Query Processing

ULA+

(22)

Interesting theoretical results

Optimality properties

Instance Optimality Instance Optimality

 If wild guesses are not allowed, and the size of each group is treated as a constant, then ULA and ULA+ are instance-optimal.

The upper bound of the optimality ratio is tight The upper bound of the optimality ratio is tight

for every instance there exist two constants a and b

such that cost(A) <= a*cost(A’) + b

(23)

Interesting theoretical results (Cont.)

Optimality properties

No Instance Optimal Algorithms No Instance Optimal Algorithms

 If wild guesses are allowed, Then there is no deterministic algorithm that is instance-optimal.

(24)

 Motivation & Problem Statement

 Top-k,m Query Processing

 Experimental Results

 Conclusion

Outline

(25)

Experimental Results

Experimental Setup

Language: Java; OS: Windows XP; CPU: 2.0GHz; Disk:320GB

Data sets

(26)

Experimental Results

Experimental results on NBA and YQL datasets

ULA+ outperforms ETA by 1-2 orders of magnitude both in running

time and access number.

(27)

Experimental Results

Performance of optimization to reduce combinations

More than 60% combinations are pruned without computing their bounds

(28)

Experimental Results

Performance of different optimizations

Combination of all optimizations has the most powerful pruning capability.

(29)

Experimental Results

Experimental results on XML DBLP dataset

XULA and XULA+ perform better than XETA and scale well in both

running time and number of accesses.

(30)

U. Güntzer etc, VLDB2000 S. Nepal etc, ICDE1999

Fagin etc, PODS 2001

Top-k with both random and sorted accesses

R. Fagin etc, JCSS2003 N. Mamoulis etc, TDS2007

Fagin etc, PODS 2001

Top-k with only sorted accesses

Related Works

(31)

I. F. Ilya etc, VLDB2002

Top-k with no need for exact aggregate score

C. Li etc, SIGMOD2006 M. L. Yiu etc, DKE2008

Ad-hoc top-k queries

Related Works

N. Bruno etc, ICDE2002

K. C. C. Chang etc, SIGMOD2002

Top-k with sorted access on restricted lists

(32)

Related Works

T. Deng, W, Fan and F. Geerts, On the Complexity of Package

Recommendation Problems PODS 2012

Top-k Package recommendation

(33)

 Motivation & Problem Statement

 Top-k,m Query Processing

 Experimental Results

 Conclusion

Outline

(34)

Conclusion

• Propose a new problem called top-k,m query evaluation

• Developed a family of efficient algorithms, including ULA and ULA+

• Study the optimality properties of our algorithms

• Apply top-k,m query to the context of XML

keyword query refinement

(35)