Coreset and sampling approaches for the analysis of very large data sets

Christian Sohler


Big Data - Examples

IceCube

• World's biggest neutrino observatory

• Understanding of basic concepts in astrophysics

• Supernovas

• Dark matter

• Black holes

Data size

• approx. 10 TeraByte per data set


Image: Wikipedia


Very Large Networks

Social Networks

• Reflect social structures in detail

• Example question: Can we distinguish democratic countries from totalitarian ones by looking at their Facebook structure?

Data size

• GigaByte up to TeraByte (only the graph)

• Data exchange (movies, pictures, etc.) in the PetaByte range

Source: [1]


Big Data - Examples

The World Wide Web

• Giant source of information

• Very fast access through search engines

• Automatic scoring of the quality of web pages

Data size

• 500 ExaByte in 2009

(Richard Wray, The Guardian)


Image: Danard Vincente


Big Data Analytics

Primary Requirements

• Sequential processing

• Sublinear memory requirement

• Sublinear, linear or near-linear running time

Secondary Requirements

• Energy efficiency

• Easy to distribute/parallelize

• Limited communication


Image: Roger Smith


Outline

Part I: Unstructured Data

• Clustering

• Definition of coresets

• Streaming & distributed algorithms via coresets

• Coreset constructions

Part II: Semi-Structured Data

• Property Testing

• An introductory example

• Local vs. global structure of planar graphs



Cluster analysis

Clustering

• Partition a set of objects into groups such that (ideally) (a) objects in the same group are similar to each other and (b) objects in different groups are dissimilar to each other

Representation of objects

• Objects are represented by real-valued vectors / sets of points



k-Means Clustering

Definition of k-Means Clustering

• Input: Set P of n d-dimensional points

• Output: Set C of k points (centers) that minimize

cost(P,C) = Σ_{p∈P} min_{c∈C} ||p-c||²

• The corresponding clustering can be obtained by assigning each point to its nearest center (see the sketch below)
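As a concrete reference for the notation, the following is a small illustrative sketch (our own, not from the slides) of the cost function and of the clustering induced by a center set, written in Python/NumPy.

```python
import numpy as np

def kmeans_cost(P, C):
    """k-means cost: sum over points of the squared distance to the nearest center.

    P: (n, d) array of points, C: (k, d) array of centers.
    """
    # pairwise squared Euclidean distances, shape (n, k)
    sq_dists = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

def assign_clusters(P, C):
    """Assign each point to its nearest center (the clustering induced by C)."""
    sq_dists = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.normal(size=(1000, 2))
    C = rng.normal(size=(3, 2))
    print(kmeans_cost(P, C), assign_clusters(P, C)[:10])
```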



Coresets – the general idea

Coreset

• A coreset is an approximation of the data with respect to an optimization problem such that a "good" solution on the coreset is almost as good as a "good" solution on the original point set

• There is no common definition

• Many approaches can be viewed as coreset approaches (sampling, etc.)


Coresets – a definition

Coresets for k-Means Clustering [Har-Peled, Mazumdar, 2004]

A small weighted set S is called (k,ε)-coreset for P, if for every set C of k centers we have

(1-ε) cost(P,C) ≤ cost(S,C) ≤ (1+ε) cost(P,C)

In the above definition, the weight of a point is interpreted as its multiplicity.


Properties of coresets

• Fast answering of cost queries (see the sketch below)

• Optimal solution on the coreset yields a (1+O(ε))-approximation

• α-approximation on the coreset yields an α⋅(1+O(ε))-approximation
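To make the definition and these properties concrete, here is a small illustrative sketch (our own names and setup, not from the slides) of the weighted cost function, together with an empirical check of the coreset inequality on a few randomly chosen center sets. Weights are treated as (possibly fractional) multiplicities.

```python
import numpy as np

def weighted_kmeans_cost(S, w, C):
    """Cost of the weighted point set (S, w) w.r.t. centers C; weights act as multiplicities."""
    sq_dists = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return (w * sq_dists.min(axis=1)).sum()

def looks_like_coreset(P, S, w, eps, k, trials=100, seed=0):
    """Empirically test (1-eps)*cost(P,C) <= cost(S,C) <= (1+eps)*cost(P,C) on random center sets."""
    rng = np.random.default_rng(seed)
    ones = np.ones(len(P))
    for _ in range(trials):
        # random candidate centers near data points
        C = P[rng.choice(len(P), size=k, replace=False)] + rng.normal(scale=0.1, size=(k, P.shape[1]))
        cp = weighted_kmeans_cost(P, ones, C)
        cs = weighted_kmeans_cost(S, w, C)
        if not ((1 - eps) * cp <= cs <= (1 + eps) * cp):
            return False
    return True
```

Such a check can only falsify, never certify, the coreset property, since the definition quantifies over all center sets C.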


First Technique: Bounded Movement of Points

Lemma

• Let Opt be the cost of an optimal k-means solution

• If we move a set of points P by an overall l₂²-distance of O(ε² Opt) to obtain Q, then we have for every set of k centers C:

(1-ε) cost(P,C) ≤ cost(Q,C) ≤ (1+ε) cost(P,C)

Main idea [Har-Peled, Mazumdar, 2004]

• Move points such that many are in the same position

• Replace multiple points by weighted points


Theorem [Har-Peled, Mazumdar, 2004]

• There is a coreset of size O(k log n / ε^d) for the k-means problem.

Proof idea:

• Compute a partition of the space into squares s.t. the diagonal of a square is at most ε² times its l₂²-distance to the nearest center of an optimal solution

• Move all points to the center of the square

• Since every point in a cell with l₂²-distance D contributes at least D to the optimal solution, we have that the overall movement is ε² Opt => Coreset!

[Figure: a grid cell at l₂²-distance D from the closest optimal center, with diagonal at most ε²D]


• Smallest cells have diagonal length ε²Opt/n

• Cell size increases exponentially

• Furthest point has distance at most Opt

• => O(k log n / ε^d) cells


Final observation:

• The construction works if we replace the optimal solution by an O(1)-approximation (a rough sketch of the bounded-movement idea follows below)
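The following sketch illustrates the bounded-movement idea in code (our own simplified variant, not the exact Har-Peled–Mazumdar construction): each point is snapped to a dyadic grid cell whose side length scales with ε times the point's distance to the nearest center of a given O(1)-approximation, and coinciding cells are merged into weighted points.

```python
import numpy as np
from collections import defaultdict

def movement_coreset(P, approx_centers, eps):
    """Bounded-movement sketch: snap points to dyadic grid cells of side
    ~ eps * dist(p, nearest approximate center) / sqrt(d), then merge the
    points of each cell into one weighted point. Illustrative only."""
    n, dim = P.shape
    dist = np.sqrt(((P[:, None, :] - approx_centers[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
    dist = np.maximum(dist, dist[dist > 0].min())   # avoid a zero cell size for points on a center
    cells = defaultdict(list)
    for p, dp in zip(P, dist):
        level = int(np.floor(np.log2(eps * dp / np.sqrt(dim))))
        side = 2.0 ** level                         # dyadic side length, <= eps * dp / sqrt(d)
        key = (level,) + tuple(np.floor(p / side).astype(int))
        cells[key].append(p)
    S = np.array([np.mean(pts, axis=0) for pts in cells.values()])
    w = np.array([len(pts) for pts in cells.values()], dtype=float)
    return S, w
```

Each point moves by at most the diagonal of its cell, i.e. by O(ε)·dist(p, C_approx), so the total squared movement is O(ε²)·cost(P, C_approx) = O(ε²·Opt), which is exactly the condition of the lemma above.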


A different construction [Frahling, S., 2005]

• Idea: Distribute the "movement cost" equally among all cells

• Movement cost = (l₂ cell diameter)² ⋅ #points in cell

Algorithm

• Guess the cost of an optimal solution, fix an appropriate δ

• Call a cell δ-heavy if (l₂ cell diameter)² ⋅ #points in cell > δ Opt

• Put a coreset point in each minimal δ-heavy cell

• Move each point to a coreset point in the smallest heavy cell it is contained in (see the sketch below)

Advantage

• We do not need to compute an O(1)-approximation

• Gives the first algorithm for insertion/deletion streams
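Below is a rough sketch (ours) of the δ-heavy-cell rule on nested dyadic grids over [0,1)^d: for each point we find the finest cell containing it whose movement cost (squared diameter times point count) exceeds δ·Opt, and snap the point to that cell's center. It only illustrates the rule and does not reproduce the streaming machinery of the Frahling–Sohler algorithm.

```python
import numpy as np
from collections import Counter

def heavy_cell_coreset(P, opt_guess, delta, levels=20):
    """delta-heavy-cell sketch. P: (n, d) array with coordinates in [0, 1);
    opt_guess: guessed optimal k-means cost. Returns weighted representatives."""
    n, d = P.shape
    # number of points per grid cell on each level (level l has cell side 2^-l)
    counts = [Counter(map(tuple, np.floor(P * 2 ** l).astype(int))) for l in range(levels + 1)]
    weights = Counter()
    for p in P:
        chosen = (0, tuple([0] * d))                 # fall back to the single root cell
        for l in range(levels, -1, -1):              # look for the finest heavy cell containing p
            cell = tuple(np.floor(p * 2 ** l).astype(int))
            diam_sq = d * 4.0 ** (-l)                # squared diameter of a level-l cell
            if diam_sq * counts[l][cell] > delta * opt_guess:
                chosen = (l, cell)
                break
        weights[chosen] += 1
    S = np.array([(np.array(cell) + 0.5) * 2.0 ** (-l) for (l, cell) in weights])
    w = np.array([weights[key] for key in weights], dtype=float)
    return S, w
```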



Theorem [Frahling, S., 2005]

• The construction just presented gives a coreset for the k-means problem of size O(k log n / ε^d).


Theorem [Har-Peled, Kushal, 2005]

• There is a coreset of size O(k³/ε^(d+1)) for the k-means problem.

1D Coreset:

• Partition the points into O(k²/ε²) intervals such that for each interval the 1-means cost is bounded by O(ε² Opt/k²)


• Replace all points Q in each interval by at most two weighted points S such that, if the nearest center is not contained in the interval, then cost(Q,c) = cost(S,c)


• Error only within the intervals that contain a center (see the 1D sketch below)
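One standard way to realize the two-point replacement (our illustration, with assumed names) is to place two weighted points at the interval's mean ± standard deviation. This preserves the number of points, the mean, and the 1D cost with respect to any single center, so the equality above holds whenever all of Q is served by one center outside the interval.

```python
import numpy as np

def two_point_replacement(Q):
    """Replace 1D points Q by two weighted points with the same count, mean,
    and sum of squared deviations (illustrative sketch)."""
    Q = np.asarray(Q, dtype=float)
    mu, sigma = Q.mean(), Q.std()                # population standard deviation
    S = np.array([mu - sigma, mu + sigma])
    w = np.array([len(Q) / 2, len(Q) / 2])
    return S, w

def cost_1d(points, weights, c):
    return float((weights * (points - c) ** 2).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Q = rng.uniform(3.0, 4.0, size=50)           # points of one interval
    S, w = two_point_replacement(Q)
    for c in (-2.0, 0.0, 10.0):                  # centers outside the interval
        exact = cost_1d(Q, np.ones(len(Q)), c)
        approx = cost_1d(S, w, c)
        assert abs(exact - approx) < 1e-8 * max(1.0, exact)
    print("two-point replacement preserves the cost for single centers")
```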


Coresets for higher dimensions:

• Move points to O(1/ε^(d-1)) lines through each center of an O(1)-approximation (overall l₂²-movement is ε² Opt)

• Compute a coreset for each line using the 1D construction


Second Technique: Sampling

Lemma

Let 0<ε<1. Let P be a point set in the unit ball. Let C be a fixed set of k centers inside the unit ball. Let S be a sample set taken from P uniformly at random. Then

Pr[ |cost(P,C) - (|P|/|S|)⋅cost(S,C)| > ε⋅n ] < 2⋅exp(-ε²⋅|S|/12)

Proof:

• Apply Hoeffding's bound


Theorem [Chen, 2009]

Let 0<ε<1. There is an (ε,k)-coreset for the k-means problem of size Õ(dk² log n/ε²).

Algorithm:

• Partition space around an O(1)-approximation into exponentially increasing rings

• From each ring take a uniform sample

• Weight the sample appropriately (see the sketch below)

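The following sketch (our own, with simplified constants and names) illustrates the ring-based uniform sampling: points are grouped by their nearest approximate center and by exponentially increasing distance rings around it, and each ring contributes a uniform sample that is reweighted to stand for all points of the ring.

```python
import numpy as np

def ring_sampling_coreset(P, approx_centers, approx_cost, samples_per_ring=100, seed=0):
    """Uniform sampling per exponential ring (simplified illustration)."""
    rng = np.random.default_rng(seed)
    n, _ = P.shape
    d2 = ((P[:, None, :] - approx_centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    dist = np.sqrt(d2.min(axis=1))
    base = np.sqrt(approx_cost / n)              # radius of the innermost ring (heuristic)
    ring = np.zeros(n, dtype=int)
    far = dist > base
    ring[far] = np.ceil(np.log2(dist[far] / base)).astype(int)
    S, w = [], []
    for key in set(zip(nearest.tolist(), ring.tolist())):
        idx = np.where((nearest == key[0]) & (ring == key[1]))[0]
        m = min(samples_per_ring, len(idx))
        chosen = rng.choice(idx, size=m, replace=False)
        S.append(P[chosen])
        w.append(np.full(m, len(idx) / m))       # each sample stands for |ring|/m points
    return np.vstack(S), np.concatenate(w)
```

Since the ring radii roughly double, only O(log n) rings per center are needed, which matches the log n factor in the size bound.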


Proof:

• Define a net of candidate solutions and apply the sampling lemma to all solutions from the net

• Charge the sampling error to the contribution of the points inside the ring

[Figure: minimum contribution of the points in the outer ring; scale such that the outer ball is the unit ball]


Technique 2.5: Importance Sampling

Idea [Feldman, Monemizadeh, S., 2005]

• Sample uniformly from the inner ring

• Sample all remaining points according to Pr[p] = cost({p},C_Opt) / Opt and assign weight 1/Pr[p]

• Analyze using Hoeffding's bound for all solutions on an appropriate net


Removing the inner ring [Langberg, Schulman, 2010]

• Pr[p] = cost({p},C_Opt) / (2 Opt) + 1/(2n)


Definition [Langberg, Schulman, 2010]

The sensitivity γ(p) of a point p∈P with respect to k-means clustering is defined as γ(p) = sup_{C={c1,..,ck}⊆R^d} cost({p},C) / cost(P,C).*

The total sensitivity Γ(P) of P is defined as Γ(P) = Σ_{p∈P} γ(p)

* assuming that cost({p},C) / cost(P,C) = 0 if cost({p},C) = 0.


Theorem [Feldman, Langberg, 2011]

• If we choose s ≥ (d*/ε²)⋅Γ(P)² points with probability proportional to γ(p), then w.p. at least 2/3 the resulting point set is a (k,ε)-coreset (see the sketch below).

• Here, d* is the VC dimension of the range space of the family of k balls.
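A common way to use such bounds in practice (a sketch under our own simplifying assumptions, not the exact Feldman–Langberg procedure) is to sample points i.i.d. with probability proportional to an upper bound on their sensitivity, derived here from an approximate solution as in the slides, and to reweight each sample by the inverse of its sampling probability.

```python
import numpy as np

def sensitivity_sample(P, approx_centers, m, seed=0):
    """Importance-sampling sketch: sample m points i.i.d. with probability
    q(p) = cost({p}, C_approx) / (2 * cost(P, C_approx)) + 1/(2n) and weight
    them by 1/(m*q(p)). Simplified illustration, not the published algorithm."""
    rng = np.random.default_rng(seed)
    n = len(P)
    d2 = ((P[:, None, :] - approx_centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    q = d2 / (2.0 * d2.sum()) + 1.0 / (2.0 * n)   # sampling distribution (sums to 1)
    idx = rng.choice(n, size=m, replace=True, p=q)
    S = P[idx]
    w = 1.0 / (m * q[idx])                        # inverse-probability weights
    return S, w
```

The weights make Σ w·cost({s},C) an unbiased estimate of cost(P,C) for every fixed C; uniform concentration over all center sets C is what the net / VC-dimension argument provides.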


Third Technique: Reducing the Variance

Coresets with Offset for k-Means [Feldman, Schmidt, S., 2013]

A pair (S,∆) of a small weighted set S and a value ∆ is called (k,ε)-coreset with offset for P, if for every set C of k centers we have

(1-ε) cost(P,C) ≤ cost(S,C) + ∆ ≤ (1+ε) cost(P,C)

In the above definition, the weight of a point is interpreted as its multiplicity.


Remarks

• ∆ may depend on P, k and ε

• This is not an additive approximation

• The variance in S is smaller than in P (by the amount of ∆)


The "Magic Formula" for k-Means [Zhang, Ramakrishnan, Livny, 96]

Σ_{p∈P} ||p-c||² = |P| ⋅ ||c-µ(P)||² + Σ_{p∈P} ||p-µ(P)||²,  where µ(P) = Σ_{p∈P} p / |P|


Exact coreset for 1-means: pair µ(P) with weight |P| and the value ∆ = Σ_{p∈P} ||p-µ(P)||²  (a numeric check follows below)
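The formula is easy to verify numerically; the following small sketch (our own) checks that the single weighted point µ(P) together with the offset ∆ reproduces the 1-means cost exactly for arbitrary centers.

```python
import numpy as np

def check_magic_formula(P, trials=10, seed=0):
    """Verify Σ||p-c||² = |P|·||c-µ(P)||² + Σ||p-µ(P)||² for random centers c."""
    rng = np.random.default_rng(seed)
    mu = P.mean(axis=0)
    delta = ((P - mu) ** 2).sum()                       # the offset ∆ (variance around the mean)
    for _ in range(trials):
        c = rng.normal(size=P.shape[1])
        lhs = ((P - c) ** 2).sum()                      # cost(P, {c})
        rhs = len(P) * ((c - mu) ** 2).sum() + delta    # cost of the weighted mean plus ∆
        assert np.isclose(lhs, rhs)
    return mu, len(P), delta                            # exact 1-means coreset with offset

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    print(check_magic_formula(rng.normal(size=(500, 3))))
```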


Definition

• Let Opt_k(P) denote the optimal clustering cost with k centers. A set P that satisfies Opt_1(P) ≤ (1+ε)⋅Opt_k(P) is called pseudorandom.

Intuition

• A set is pseudorandom if there is no good clustering (it behaves like a random point set).

• In particular, if we take an arbitrary partition into k clusters, their weighted means will be close to the mean of P


A construction for k>1

• We recursively partition P into subsets {Pi} by applying the following rule:

• Let Pi be the set with the highest cost Opt_1(Pi).
  If Pi is pseudorandom, then replace Pi by µ(Pi);
  else find an optimal clustering of Pi into k sets {Qj} and replace Pi by the {Qj}
  [Σ_j Opt_1(Qj) decreases by a factor of 1+ε]

• If Σ_j Opt_1(Qj) ≤ ε² Opt_k(P), then replace all {Qj} by {µj}



Theorem [Feldman, Schmidt, S., 2013]

• The constructed coreset has size k^O((1/ε)·log(1/ε)).


Fourth Technique: Linear Projections

Theorem [Feldman, Schmidt, S., 2013]

• There is a coreset with offset of size (k/ε)^O(1) for the k-means problem.

Main idea

• Replace the point set by its projection on an optimal-fit O(k/ε²)-subspace and compute a coreset for this subspace (actually, a slightly larger space)


Let P* be the projection of P on the best-fit subspace. Let C be a set of centers. Then cost(P, span(C)) ≈ cost(P*, span(C)) + ∆ (a numeric sketch follows below).


Let P* be the projection of P on the best fit subspace. For any set of centers C, the projections of P and P* on the linear span of C have „distance“ at most ε cost(P,span(C)).

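A rough numerical illustration of the projection idea (our own sketch; it demonstrates the decomposition only approximately and is not the paper's construction): project the data onto its top principal components, keep ∆ as the squared projection residual, and compare cost(P,C) with cost(P*,C) + ∆ for a random center set.

```python
import numpy as np

def project_with_offset(P, m):
    """Project P onto its best-fit m-dimensional subspace (via SVD) and return
    the projected points P* together with the offset delta = Σ||p - p*||²."""
    mean = P.mean(axis=0)
    X = P - mean
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P_star = mean + X @ Vt[:m].T @ Vt[:m]            # projection onto the top-m directions
    delta = ((P - P_star) ** 2).sum()
    return P_star, delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.normal(size=(2000, 20)) @ rng.normal(size=(20, 20))
    P_star, delta = project_with_offset(P, m=8)
    C = P[rng.choice(len(P), size=5, replace=False)]
    cost = lambda X, C: ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()
    # compare the two sides of the (approximate) decomposition
    print(cost(P, C), cost(P_star, C) + delta)
```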


Extensions

Other types of centers

• Most techniques extend to different types of centers like subspaces or any shapes that are contained in a low-dimensional subspace

Other distance measures

• The l₂²-distance can be replaced by the l_p^q-distance for constant p and q


Coresets and Approximation Algorithms

Theorem

• The k-means problem can be approximated within a factor of (1+ε) in time O(T(n,k,ε) + S(n,k,ε)^O(dk²)), where

• T(n,k,ε) is the time to compute a (k,ε)-coreset on an input of size n and

• S(n,k,ε) is the size of the coreset.


Coresets and Streaming Algorithms

[Bentley, Saxe, 80; Agarwal, Har-Peled, Varadarajan, 2004]

Assumption

• The union of two coresets for A and B is a coreset of A∪B

• It is possible to compute a coreset of a coreset


• Compute the clustering on the union of the stored coresets (a merge-and-reduce sketch follows below)
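The two assumptions yield the standard merge-and-reduce streaming scheme (Bentley–Saxe style): buffer incoming points, turn each full buffer into a coreset, and repeatedly merge two coresets of the same level and reduce them again, so that only O(log n) coresets are stored at any time. The sketch below is our own generic illustration; `build_coreset` is a placeholder for any of the constructions above.

```python
import numpy as np

def merge_reduce_stream(stream, build_coreset, buffer_size=1000):
    """Generic merge-and-reduce over a stream of d-dimensional points.

    build_coreset: function (points, weights) -> (points, weights) computing a
    coreset of a weighted point set (placeholder for any construction above).
    """
    buckets = {}                                     # level -> (points, weights)
    buf = []

    def insert(S, w, level):
        while level in buckets:
            T, v = buckets.pop(level)                # merge with the bucket at this level
            S, w = build_coreset(np.vstack([S, T]), np.concatenate([w, v]))
            level += 1                               # the reduced coreset moves one level up
        buckets[level] = (S, w)

    for p in stream:
        buf.append(p)
        if len(buf) == buffer_size:
            S, w = build_coreset(np.array(buf), np.ones(len(buf)))
            insert(S, w, 0)
            buf = []
    if buf:                                          # flush the last partial buffer
        S, w = build_coreset(np.array(buf), np.ones(len(buf)))
        insert(S, w, 0)
    # final summary: cluster the union of all stored coresets instead of the stream
    pts = np.vstack([S for S, _ in buckets.values()])
    wts = np.concatenate([w for _, w in buckets.values()])
    return pts, wts
```

Because a coreset of a union of coresets is again a coreset (with a slightly worse ε), only O(log n) small buckets are kept in memory at any time.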


Coresets and Distributed Algorithms / Cloud Computing

Scenario: Distributed Input Data

• Data is collected locally

• Coresets are computed locally

• Only the coresets are sent to a central computer

• Compute the clustering on this single machine


Improvements

• Implement a merge&reduce-like structure to avoid congestion

• Distributed computation of the clustering


Practicality

StreamKM++

[Ackermann, Märtens, Raupach, Swierkot, Lammersen, S., 2011]

• Merge&Reduce-based algorithm

• Coresets based on k-means++ seeding

• Coreset tree: allows fast approximate k-means++ sampling

BICO [Fichtenberger, Gillé, Schmidt, Schwiegelshohn, S., 2013]

• Based on BIRCH's architecture

• Modifications so that the algorithm computes a coreset

• A lot of heuristics to speed up performance



Comparison of streaming clustering algorithms

• Census data set from the UCI library

• 2.4 million data points

• 68 dimensions



Summary

Coresets

• Bounded movement

• Sampling / importance sampling

• Variance reduction

• Linear projections

• Easy-to-implement streaming & distributed algorithms

• High practicality
