Top-k Queries over Uncertain Scores


Qing Liu, Debabrota Basu, Talel Abdessalem, Stéphane Bressan


Introduction

- Modern recommendation systems rely on some form of collaborative, user-sourced (crowdsourced) collection of information.

- Crowdsourcing platforms let users:

  - easily announce their needs to the crowd and get access to the information they need;

  - choose the highest-quality or most competitively priced options.

- Example: TripAdvisor

  - uses crowdsourced information, e.g., user-generated ratings and reviews, to recommend travel plans, hotels, vacation rentals and restaurants.


- Crowdsourcing and the collaborative economy:

  - communities or crowds rent, share, or sell products or services.


- Independent collection of information leads to uncertainty and diversity.

- Objects (services, vacation rentals, restaurants, ...) have uncertain scores (quality, price, ...).


- Ranking is one of the building blocks of recommendation.

- A top-k query returns the sequence of the k objects with the highest scores, given a database of objects ranked by their scores for the feature of interest.

  - Example: the price of apartments.

- With uncertain scores, a top-k query can only return an uncertain result.
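
The difference between a certain and an uncertain answer can be sketched in a few lines of Python. The object names, scores and intervals below are made up for illustration, the uniform distribution is an assumption, and `top_k` and `sample_top_k` are hypothetical helpers rather than anything defined in the talk.

```python
import random

def top_k(scores, k):
    """Return the names of the k highest-scoring objects, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Deterministic scores: the answer is a single sequence.
certain = {"a": 0.9, "b": 0.7, "c": 0.4}
print(top_k(certain, 2))  # ['a', 'b']

# Uncertain scores, given here as hypothetical intervals (assumed uniform).
intervals = {"a": (0.5, 0.9), "b": (0.6, 0.8), "c": (0.1, 0.4)}

def sample_top_k(intervals, k, rng):
    """Draw one score per object, then answer the top-k query on that draw."""
    drawn = {name: rng.uniform(lo, hi) for name, (lo, hi) in intervals.items()}
    return top_k(drawn, k)

rng = random.Random(0)
# Collect the distinct answers seen over repeated draws.
answers = {tuple(sample_top_k(intervals, 2, rng)) for _ in range(200)}
print(answers)
```

Because the intervals of `a` and `b` overlap, repeated draws produce more than one top-2 sequence, which is why the result of the query is itself uncertain.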


Related Work

- Soliman, Ilyas and Ben-David [Soliman and Ilyas, 2009] study top-k queries over objects with uncertain scores given as probability distributions.

- In this paper, we consider probabilistic top-k queries under the top-k semantics of [Soliman and Ilyas, 2009].


Problem Definition

- O: a set of n objects;

- s(o_i): the score of an object o_i ∈ O;

- X_i: a random variable equal to s(o_i);

- f_i: the bounded continuous probability density function of X_i;

- π^(k) = [o_1, ..., o_k]: a sequence of k objects in O;

- Pr(π^(k)): the probability that π^(k) is the top-k sequence:

\[
\Pr(\pi^{(k)}) = \int_{-\infty}^{\infty} \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_k} f_1(x_1) \cdots f_n(x_n) \, dx_n \cdots dx_1 \qquad (1)
\]

- Objective: the probabilistic top-k sequence is the π^(k) that maximizes Pr(π^(k)).
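
One way to get a feel for Equation (1) is to approximate it by sampling. The sketch below is not the method of the paper. It assumes every f_i is uniform on an interval [l_i, u_i], draws complete score vectors, and counts how often a given prefix comes out on top. The function name and its parameters are illustrative.

```python
import random

def estimate_prefix_probability(intervals, prefix, trials=20000, seed=0):
    """Monte Carlo estimate of Pr(prefix is the top-k sequence).

    intervals: list of (l_i, u_i) bounds, one assumed-uniform density per object.
    prefix:    list of object indices, best first (its length is k).
    """
    rng = random.Random(seed)
    k = len(prefix)
    hits = 0
    for _ in range(trials):
        # One joint draw of all n scores.
        scores = [rng.uniform(lo, hi) for lo, hi in intervals]
        # Rank objects by drawn score, highest first.
        ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if ranking[:k] == list(prefix):
            hits += 1
    return hits / trials

# Three identically distributed objects: every top-2 prefix has probability 1/6.
print(estimate_prefix_probability([(0.0, 1.0)] * 3, [0, 1]))
```

The estimate only becomes reliable when the target probability is not tiny relative to 1/trials, which is one reason a plain sampler of this kind is only a starting point.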


Solutions

- Naive: calculate Pr(π^(k)) for every possible sequence π^(k) and return the π^(k) with the highest Pr(π^(k)).

  - n!/(n−k)! possible sequences to examine.

- Branch-and-Bound [Soliman et al., 2010]: prune some of the candidate sequences π^(k).

  - Worst case: n!/(n−k)! possible sequences to examine.

- Soliman’s Algorithm [Soliman et al., 2010]: searches the space of candidate probabilistic top-k sequences using a Markov chain Monte Carlo algorithm.

- In this paper, we explore variants of Markov chain Monte Carlo algorithms.
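
The size of the naive search space is easy to check numerically. The sketch below counts the n!/(n−k)! candidate sequences and runs the exhaustive search for a caller-supplied probability function. `prob` is a hypothetical stand-in for an evaluator of Pr(π^(k)), which the slides do not specify.

```python
import math
from itertools import permutations

def num_candidate_sequences(n, k):
    """Number of ordered k-sequences drawn from n objects: n! / (n - k)!."""
    return math.factorial(n) // math.factorial(n - k)

def naive_top_k(n, k, prob):
    """Exhaustive search: evaluate prob on every k-sequence and keep the best.

    prob is assumed to map a tuple of object indices to Pr(pi^(k)).
    """
    return max(permutations(range(n), k), key=prob)

print(num_candidate_sequences(7, 3))  # 210
```

Even for modest n and k the count grows quickly, which is the motivation for pruning or sampling the space instead of enumerating it.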


Markov chain Monte Carlo Algorithms

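- Soliman’s Algorithm

  - Initial state: a ranking of the n objects

  - Candidate state:

    [Figure: a ranking o_1, ..., o_7 with the top-k prefix marked. One example exchanges o_2 and o_3, giving o_1, o_3, o_2, o_4, o_5, o_6, o_7 (labels: Pr(o_2 < o_3), Pr(o_2 < o_4)). Another exchanges o_5 and o_6, giving o_1, o_2, o_3, o_4, o_6, o_5, o_7 (labels: Pr(o_6 > o_5), Pr(o_6 > o_4)).]

  - Acceptance probability:

\[
\alpha = \min\left( \frac{\Pr(\pi^{(k)}_{t+1}) \cdot \Pr(\pi_t \mid \pi_{t+1})}{\Pr(\pi^{(k)}_t) \cdot \Pr(\pi_{t+1} \mid \pi_t)},\ 1 \right)
\]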
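
The acceptance rule has the standard Metropolis-Hastings form, which can be written as a small helper. This is a generic sketch rather than Soliman's implementation. The argument names are illustrative, and accepting any move out of a zero-probability state is an assumed convention.

```python
import random

def acceptance_probability(p_new, p_old, q_back, q_forward):
    """alpha = min(p_new * q_back / (p_old * q_forward), 1).

    p_new, p_old: Pr of the candidate and current top-k sequences.
    q_back:       proposal probability Pr(pi_t | pi_{t+1}).
    q_forward:    proposal probability Pr(pi_{t+1} | pi_t).
    """
    denominator = p_old * q_forward
    if denominator == 0:
        # Assumed convention: always leave a state of probability zero.
        return 1.0
    return min(p_new * q_back / denominator, 1.0)

def accept(alpha, rng):
    """Accept the candidate with probability alpha."""
    return rng.random() < alpha

rng = random.Random(0)
alpha = acceptance_probability(p_new=0.1, p_old=0.2, q_back=0.5, q_forward=0.5)
print(alpha, accept(alpha, rng))
```

With a symmetric proposal the two q terms cancel and the rule reduces to the ratio of the target probabilities.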


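- Swap and SwapEXP Algorithm

  - Candidate state:

    [Figure: a ranking o_1, ..., o_7 with the top-k prefix marked; the candidate exchanges o_2 and o_5, giving o_1, o_5, o_3, o_4, o_2, o_6, o_7.]

  - Acceptance probability:

    Swap:

\[
\alpha = \min\left( \frac{\Pr(\pi^{(k)}_{t+1}) \cdot \frac{1}{kn}}{\Pr(\pi^{(k)}_t) \cdot \frac{1}{kn}},\ 1 \right) = \min\left( \frac{\Pr(\pi^{(k)}_{t+1})}{\Pr(\pi^{(k)}_t)},\ 1 \right)
\]

    SwapEXP:

\[
\alpha = \min\left( \frac{\widehat{\Pr}(\pi^{(k)}_{t+1})}{\widehat{\Pr}(\pi^{(k)}_t)},\ 1 \right) = \min\left( \exp\left(\beta\left(\Pr(\pi^{(k)}_{t+1}) - \Pr(\pi^{(k)}_t)\right)\right),\ 1 \right),
\qquad \widehat{\Pr}(\pi^{(k)}) = C_\beta^{-1} \exp\left(\beta \Pr(\pi^{(k)})\right)
\]

  - SwapEXP is more likely to reject the “worse” candidate state.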
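
The two acceptance rules can be compared side by side in code. In the sketch below the proposal picks one of the k top positions and one of the n positions uniformly, which is an assumption read off the 1/(kn) factor in the Swap formula. The function names and the value of β in the usage line are illustrative.

```python
import math
import random

def propose_swap(ranking, k, rng):
    """Exchange a uniformly chosen top-k position with a uniformly chosen position.

    Assumed proposal: each (i, j) pair has probability 1 / (k * n).
    """
    i = rng.randrange(k)
    j = rng.randrange(len(ranking))
    candidate = list(ranking)
    candidate[i], candidate[j] = candidate[j], candidate[i]
    return candidate

def alpha_swap(p_new, p_old):
    """Swap: ratio of target probabilities (the 1/(kn) factors cancel)."""
    if p_old == 0:
        return 1.0
    return min(p_new / p_old, 1.0)

def alpha_swap_exp(p_new, p_old, beta):
    """SwapEXP: min(exp(beta * (p_new - p_old)), 1)."""
    if p_new >= p_old:
        return 1.0  # exp of a non-negative number is at least 1
    return math.exp(beta * (p_new - p_old))

rng = random.Random(0)
print(propose_swap(list(range(7)), 3, rng))
print(alpha_swap(1e-7, 2e-7), alpha_swap_exp(1e-7, 2e-7, beta=1e7))
```

How strongly SwapEXP penalizes a worse candidate depends on β: the larger β is relative to the probabilities involved, the smaller the acceptance probability for a downhill move.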


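- ReSample and ReSampleEXP Algorithm

  - Candidate state:

    [Figure: current state with scores o_1: 9, o_2: 8, o_3: 6, o_4: 5, o_5: 4, o_6: 3, o_7: 2 and the top-k prefix marked; in the candidate state the score of o_5 becomes 7, giving o_1: 9, o_2: 8, o_5: 7, o_3: 6, o_4: 5, o_6: 3, o_7: 2.]

  - Acceptance probability:

    ReSample:

\[
\alpha = \min\left( \frac{\Pr(\pi^{(k)}_{t+1}) \cdot \Pr(\pi_t \mid \pi_{t+1})}{\Pr(\pi^{(k)}_t) \cdot \Pr(\pi_{t+1} \mid \pi_t)},\ 1 \right)
\]

    ReSampleEXP:

\[
\alpha = \min\left( \frac{\Pr(\pi_t \mid \pi_{t+1})}{\Pr(\pi_{t+1} \mid \pi_t)} \cdot \exp\left(\beta\left(\Pr(\pi^{(k)}_{t+1}) - \Pr(\pi^{(k)}_t)\right)\right),\ 1 \right)
\]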
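
The candidate state in the figure can be reproduced by redrawing a single score and re-ranking. The sketch below assumes uniform score intervals and a uniformly chosen object. Both are assumptions, the helper names are illustrative, and the proposal probabilities needed for the acceptance rule are not covered.

```python
import random

def rank(scores):
    """Return object indices ordered by score, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def resample_one(scores, intervals, rng):
    """Redraw the score of one uniformly chosen object and re-rank.

    intervals[i] = (l_i, u_i) is the assumed uniform range of object i.
    Returns (new_scores, new_ranking); the input list is left unchanged.
    """
    i = rng.randrange(len(scores))
    lo, hi = intervals[i]
    new_scores = list(scores)
    new_scores[i] = rng.uniform(lo, hi)
    return new_scores, rank(new_scores)

# The example from the figure: o_5 (index 4) moves from score 4 to score 7.
before = [9, 8, 6, 5, 4, 3, 2]
after = [9, 8, 6, 5, 7, 3, 2]
print(rank(before))  # [0, 1, 2, 3, 4, 5, 6]
print(rank(after))   # [0, 1, 4, 2, 3, 5, 6]
```

Only one score changes per step, so consecutive states differ by the position of a single object.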


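- ReSampleAll Algorithm

  - Candidate state:

    [Figure: current state with scores o_1: 9, o_2: 8, o_3: 6, o_4: 5, o_5: 4, o_6: 3, o_7: 2 and the top-k prefix marked; in the candidate state every score is redrawn, giving o_3: 10, o_2: 9, o_5: 8, o_1: 6, o_7: 4, o_6: 3, o_4: 2.]

  - Acceptance probability: ReSampleAll: α = 1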
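
Since every candidate is accepted, a ReSampleAll-style chain amounts to repeated independent draws of all scores. The sketch below assumes uniform score intervals and reports the most frequently observed top-k prefix. The frequency counting and the names used are assumptions for illustration, not details taken from the slides.

```python
import heapq
import random
from collections import Counter

def resample_all(intervals, k, steps, seed=0):
    """Redraw every score at each step and count the resulting top-k prefixes.

    intervals[i] = (l_i, u_i) is the assumed uniform range of object i.
    Returns (most frequent prefix, its relative frequency).
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(intervals)
    for _ in range(steps):
        scores = [rng.uniform(lo, hi) for lo, hi in intervals]
        # Partial selection of the k best indices, highest score first.
        prefix = tuple(heapq.nlargest(k, range(n), key=scores.__getitem__))
        counts[prefix] += 1
    best, hits = counts.most_common(1)[0]
    return best, hits / steps

# Non-overlapping intervals leave only one possible top-2 sequence.
print(resample_all([(0.8, 1.0), (0.5, 0.7), (0.0, 0.3)], k=2, steps=100))
```

Using `heapq.nlargest` avoids sorting all n scores when only the first k positions matter.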


Performance Evaluation

- Datasets: synthetic datasets

Table: Distributions

                Setting 1      Setting 2     Setting 3
median score    G(0.5, 0.05)   G(0.5, 0.2)   U[0, 1]
width           G(0.5, 0.05)   G(0.5, 0.2)   U[0, 1]

[Figure: three plots, each with a horizontal axis from 0 to 1 and a vertical axis from 0 to 8.]

  - Default: uniform score distributions; the median score of o_i is (l_i + u_i)/2 and its width is u_i − l_i.

- Metrics

  - Probability of the probabilistic top-k sequence (higher is better)

  - Convergence of the Markov chains (Gelman-Rubin convergence diagnostic)

  - Efficiency (complexity and runtime)
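
For readers unfamiliar with the convergence metric, the sketch below computes the basic Gelman-Rubin potential scale reduction factor for several chains of a scalar quantity. Which scalar the experiments track is not stated on the slides, so the inputs here are placeholders, and the handling of zero within-chain variance is an assumed convention.

```python
import math
import statistics

def gelman_rubin(chains):
    """Basic Gelman-Rubin diagnostic for m equal-length chains of scalars.

    W = mean within-chain variance, B = n * variance of the chain means,
    R = sqrt(((n - 1) / n * W + B / n) / W). Values near 1 suggest convergence.
    """
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(means)
    if within == 0:
        # Assumed convention for degenerate chains.
        return 1.0 if between == 0 else math.inf
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

# Chains stuck in different regions give a value far above 1.
print(gelman_rubin([[0, 1, 0, 1], [10, 11, 10, 11]]))
```

In practice the diagnostic is computed at increasing chain lengths and plotted against the chain length.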


Effectiveness of Six Algorithms (Probability)

[Figure: probability (log scale, 10⁻⁸ to 10⁻⁵) versus chain length (0 to 10 × 10⁴) for Soliman, Swap, SwapEXP, ReSample, ReSampleEXP and ReSampleAll. Panels: (a) Dataset5, (b) Dataset21.]

Convergence of the Markov Chains

[Figure: Gelman-Rubin diagnostic (0 to 10) versus chain length (0 to 10 × 10⁴). Panels: (e) Dataset5, (f) Dataset21, (g) Dataset5, (h) Dataset21.]


Efficiency

Table: Worst Case Time Complexity of Generating Next State

                   Soliman   Swap(EXP)   ReSample(EXP)   ReSampleAll
Time Complexity    O(nk)     O(1)        O(n)            O(n log k)

Table: Runtime Per Step of the Algorithms (seconds)

Runtime Per Step   0.0058   1.9128   0.1163   0.0523   0.0071   0.9056


Conclusion

- We explore the design space of Metropolis-Hastings Markov chain Monte Carlo algorithms.

- We verify through extensive experiments that the proposed algorithms are more effective than the state-of-the-art approach.

- ReSampleAll is the best, since it samples directly from the target distribution instead of depending on “local” information.


Thank you! Questions?

Top-k Queries Uncertain Scores

MCMC


References

Soliman, M. A. and Ilyas, I. F. (2009). Ranking with uncertain scores. In ICDE, pages 317–328.

Soliman, M. A., Ilyas, I. F., and Ben-David, S. (2010). Supporting ranking queries on uncertain and incomplete data. The VLDB Journal, 19(4):477–501.