Optimal Top-k Generation of
Attribute Combinations based on Ranked Lists
Jiaheng Lu,
Renmin University of China
Joint work with Pierre Senellart, Chunbin Lin, Xiaoyong
Du, Shan Wang, and Xinxing Chen
Motivation & Problem Statement
Goal: Select a combination of three players
including forward, center, and guard positions.
Methods
select players with the highest score in each group calculate the average scores of players across all games
… …
Limitation:
overlook team spirit Game
IDScore
Motivation & Problem Statement
consider their combined scores in the same game
Select the top-k combinations according to top-m aggregate scores
Top-k,m Problem
Tuple aggregation function
Instance aggregation function
Motivation & Problem Statement
Top-1,2: select the top-1 combination of players
according to top-2 aggregate scores for games where they played together.
F2C1G1 is the best combination, since (21.51 + 18.76) is the highest overall score.
Difference between top-k queries and top-k,m queries
Top-k Top-k,m
Return the top-k tuples Return the top-k
combinations of attributes
Can be transformed into a SQL
Cannot be transformed into a
SQL
Application
XML keyword refinement
Example Example
Q = {DB;UC Irvine; 2002}
Groups:
G1 = {"DB"; "database"}, G2={"UCI";"UC Irvine"}
G3 = {"2002"}.
Answer:
Q’={DB, UCI, 2002}
Consider a top-1,2 query
Application (Cont.)
• Evidence combination mining in medical databases
• Package recommendation systems
• …
Motivation & Problem Statement
Top-k,m Query Processing
Experimental Results
Conclusion
Outline
Top-k,m Query Processing
Access Model: Sorted Accesses
(a, 9.0) (b, 8.7) (c, 8.7) (d, 7.4) (i, 5.3)
……
(i, 8.8)
(f, 6.9) (a, 7.5)
(d, 4.7)
(c, 7.9)
……
Top-k,m Query Processing
Access Model: Random Accesses
(a, 9.0) (b, 8.7) (c, 8.7) (d, 7.4) (i, 5.3)
……
(i, 8.8)
(f, 6.9) (a, 7.5)
(d, 4.7)
(c, 7.9)
……
Top-k,m Query Processing
Baseline Method: ETA
Compute top-m tuples for each
combination
Threshold Algorithm (TA)
Calculate aggregate score for each
combination
Return the top-k combinations
Top-k,m Query Processing
Upper and Lower bounds Algorithm: ULA
Consider top-m seen match instances
Lower Bound
Consider threshold value and top-m match instances
Upper Bound
Compute the upper and lower bounds for
each combination
Termination condition:
k combinations meet the hit-condition
(G1, 9.3) (G2, 8.3)
(G5,7.8) (G11,7.3)
(G2, 7.9) (G1,7.0)
(G4,8.0) (G8,7.3)
(G8, 3.0) (G4, 2.6)
(G4, 1.8) (G2, 1.5)
(G11, 4.2) (G5, 3.3)
(G2, 4.4) (G1, 2.3)
Upper and Lower bounds Algorithm: ULA
A1B1
(G1, 9.3) (G2, 8.3)
(G5,7.8) (G11,7.3)
(G2, 7.9) (G1,7.0)
(G4,8.0) (G8,7.3)
(G8, 3.0) (G4, 2.6)
(G4, 1.8) (G2, 1.5)
(G11, 4.2) (G5, 3.3)
(G2, 4.4) (G1, 2.3)
Upper and Lower bounds Algorithm: ULA
A1B1 U: 34.4 L: 32.5 A1B2 U: 34.6
L: 22.2 A2B1 U: 31.4
L: 20.5 A2B2 U: 31.6
L: 9.8 L: 32.5
A1B1 ( G1, 9.3+7.0 ) , ( G2, 7.9+8.3 )
U: 31.4
A2B1 Threshold value=7.8+7.9=15.7, 15.7*2=31.4
(G1, 9.3) (G2, 8.3)
(G5,7.8) (G11,7.3)
(G2, 7.9) (G1,7.0)
(G4,8.0) (G8,7.3)
(G8, 3.0) (G4, 2.6)
(G4, 1.8) (G2, 1.5)
(G11, 4.2) (G5, 3.3)
(G2, 4.4) (G1, 2.3)
Upper and Lower bounds Algorithm: ULA
A1B1 U: 32.5 L: 32.5 A1B2 U: 31.2
L: 22.2
A1B1
Can we run fast?
Optimization heuristics (1)
Pruning combinations without computing the bounds
(A3,B2) is dominated by (A2,B1)
6.3<7.1 and 8.0<8.2
Optimization heuristics (2)
Reducing the number of accesses
Avoiding both sorted and random accesses for specific lists
(A1,B1)and(A1,B2) cannot be part of answers, all sorted
accesses and random accesses on list A1 are unnecessary.
Optimization heuristics (3)
Reducing the number of accesses
Reducing random accesses across two lists
(A1,B1,C1)and(A1,B1,C2) cannot be part of answers, random
accesses between A1 and B1 are unnecessary.
Optimization heuristics (4)
Reducing the number of accesses
Eliminating random accesses for specific tuples
Random access from L
eto L
tfor tuple x is useless
Prune dominated combinations
Compute upper and lower bounds for unterminated combinations
Terminate combinations by reducing number of accesses
Until k
combinations meet hit-
condition
Top-k,m Query Processing
ULA+
Interesting theoretical results
Optimality properties
Instance Optimality Instance Optimality
If wild guesses are not allowed, and the size of each group is treated as a constant, then ULA and ULA+ are instance-optimal.
The upper bound of the optimality ratio is tight The upper bound of the optimality ratio is tight
for every instance there exist two constants a and b
such that cost(A) <= a*cost(A’) + b
Interesting theoretical results (Cont.)
Optimality properties
No Instance Optimal Algorithms No Instance Optimal Algorithms
If wild guesses are allowed, Then there is no deterministic algorithm that is instance-optimal.
Motivation & Problem Statement
Top-k,m Query Processing
Experimental Results
Conclusion
Outline
Experimental Results
Experimental Setup
Language: Java; OS: Windows XP; CPU: 2.0GHz; Disk:320GB
Data sets
Experimental Results
Experimental results on NBA and YQL datasets
ULA+ outperforms ETA by 1-2 orders of magnitude both in running
time and access number.
Experimental Results
Performance of optimization to reduce combinations
More than 60% combinations are pruned without computing their bounds
Experimental Results
Performance of different optimizations
Combination of all optimizations has the most powerful pruning capability.
Experimental Results
Experimental results on XML DBLP dataset
XULA and XULA+ perform better than XETA and scale well in both
running time and number of accesses.
U. Güntzer etc, VLDB2000 S. Nepal etc, ICDE1999
Fagin etc, PODS 2001
Top-k with both random and sorted accesses
R. Fagin etc, JCSS2003 N. Mamoulis etc, TDS2007
Fagin etc, PODS 2001
Top-k with only sorted accesses
Related Works
I. F. Ilya etc, VLDB2002
Top-k with no need for exact aggregate score
C. Li etc, SIGMOD2006 M. L. Yiu etc, DKE2008
Ad-hoc top-k queries
Related Works
N. Bruno etc, ICDE2002
K. C. C. Chang etc, SIGMOD2002
Top-k with sorted access on restricted lists
Related Works
T. Deng, W, Fan and F. Geerts, On the Complexity of Package
Recommendation Problems PODS 2012
Top-k Package recommendation