Unsupervised Learning
No target class (neither Y nor G).
We are interested in relations in the data:
Clustering – Are the data organized in natural clusters? (Clustering, Segmentation)
EM algorithm for clustering (Dirichlet Process Mixture Models) (Spectral Clustering)
Association Rules – Are there some frequent combinations, implication relations? (Market Basket Analysis) – later
Other – The Elements of Statistical Learning, Chapter 14; SOM – Self-Organizing Maps
PCA – Principal Component Analysis: linear algebra; k linear combinations of features minimizing the reconstruction error (= the first k principal components).
Principal Curves and Surfaces, Kernel and Sparse Principal Components
ICA – Independent Component Analysis.
Clustering Example
[Figure: 'Scaled Data' – scatter plot of the scaled data, pitch vs. yawn.]
[Figure: '2 Clusters' – the same scatter plot with two clusters marked, pitch vs. yawn.]
[Figure: 'Pitch, Yawn, Roll Clustering' – scatter plot of roll, pitch vs. yawn with the clusters marked.]
We set the colour of the items; there is no colour in the training data.
We want to assign the same colour to nearby points.
K-means!
K–means
1: procedure K-means(X data, K the number of clusters)
2:   select randomly K centers of clusters µk
3:   # either random data points or random points in the feature space
4:   repeat
5:     for each data record do
6:       C(xi) ← argmin_{k∈{1,...,K}} d(xi, µk)
7:     end for
8:     for each cluster k do   # find new centers µk
9:       µk ← (Σ_{xi: C(xi)=k} xi) / |C(k)|
10:    end for
11:  until no change in assignment
12: end procedure
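The pseudocode above can be sketched in numpy; this is a minimal illustration (the function name k_means is ours, not a library API), using random data points as the initial centers:

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """A minimal numpy sketch of the K-means pseudocode; random data
    points serve as the initial centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = None
    for _ in range(max_iter):
        # C(x_i) <- argmin_k d(x_i, mu_k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_assign = d.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                          # no change in assignment
        assign = new_assign
        # mu_k <- mean of the records assigned to cluster k
        for k in range(K):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return centers, assign
```

On two well-separated blobs this recovers the blob structure regardless of which data points are drawn as initial centers.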
K–means
The t iterations of the K-means algorithm take O(tkpN) time.
Finding the global optimum is NP-hard.
The result depends on the initial values.
It may get stuck in a local minimum.
It may not be robust to data sampling (for example, different bootstrap samples may give very different clustering results).
Each record must belong to some cluster. Sensitive to outliers.
Later: Soft k-means with probabilistic (not 0–1) membership in a cluster.
An example of the use of the EM algorithm to estimate an unobserved cluster variable in a mixture of k Gaussians.
Distance measures
The most common distance measures:
Euclidean: d(xi, xj) = √(Σ_{r=1}^p (x_ir − x_jr)²)
Hamming (Manhattan): d(xi, xj) = Σ_{r=1}^p |x_ir − x_jr|
overlap (for categorical variables): d(xi, xj) = Σ_{r=1}^p (1 − δ(x_ir, x_jr))
cosine similarity: s(xi, xj) = Σ_{r=1}^p x_ir·x_jr / √(Σ_{r=1}^p x_ir² · Σ_{r=1}^p x_jr²)
cosine distance: d(xi, xj) = 1 − s(xi, xj)
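The measures above as numpy one-liners (a minimal sketch; the function names are ours):

```python
import numpy as np

def euclidean(xi, xj):
    # sqrt of the sum of squared coordinate differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    # sum of absolute coordinate differences
    return np.sum(np.abs(xi - xj))

def overlap(xi, xj):
    # for categorical vectors: the number of positions that differ
    return int(np.sum(xi != xj))

def cosine_distance(xi, xj):
    # 1 - cosine similarity
    return 1.0 - np.dot(xi, xj) / np.sqrt(np.dot(xi, xi) * np.dot(xj, xj))
```

Note that the cosine distance is 0 for any two vectors pointing in the same direction, regardless of their length.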
Distance – key issue, application dependent
The result depends on the choice of the distance measure d(xi, µk).
The choice is application dependent.
Scaling of the data is recommended.
Weights for equally important attributes are w_j = 1/d̂_j, where
d̂_j = (1/N²) Σ_{i1=1}^N Σ_{i2=1}^N d_j(x_{i1,j}, x_{i2,j}) = (1/N²) Σ_{i1=1}^N Σ_{i2=1}^N (x_{i1,j} − x_{i2,j})²
Total distance as a weighted sum of attribute distances.
The distance may be specified directly by a symmetric matrix with 0 on the diagonal; it should fulfill the triangle inequality
d(xi, xℓ) ≤ d(xi, xr) + d(xr, xℓ).
Alternative Ideas
Scaling may remove natural clusters – weighting attributes
Consider an internet shop offering socks and computers.
Compare: number of sales, standardized data, sales in $.
[Figure: three bar charts of Socks vs. Computers – number of sales (0–10), standardized data (0–1.2), and dollar sales (0–1500).]
Number of Clusters
We may focus on the within-cluster variation measure:
W(C) = (1/2) Σ_{k=1}^K Σ_{C(i)=k} Σ_{C(i′)=k} d(xi, x_{i′})
Notice that W(C) is decreasing also for uniformly distributed data.
We look for a small drop of W(C) as a function of K, or a maximal difference between W(C) on our data and on uniform data.
The total cluster variation is the sum of the between-cluster variation and the within-cluster variation:
T(C) = (1/2) Σ_{i,i′=1}^N d(xi, x_{i′}) = W(C) + B(C)
     = (1/2) Σ_{k=1}^K Σ_{C(i)=k} ( Σ_{C(i′)=k} d(xi, x_{i′}) ) + (1/2) Σ_{k=1}^K Σ_{C(i)=k} ( Σ_{C(i′)≠k} d(xi, x_{i′}) )
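The decomposition T(C) = W(C) + B(C) can be verified numerically; a small numpy sketch with an arbitrary random clustering (the function name is ours):

```python
import numpy as np

def within_between(X, assign):
    """W(C) and B(C) as defined above (with the 1/2 factor)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    same = assign[:, None] == assign[None, :]   # pairs in the same cluster
    return 0.5 * D[same].sum(), 0.5 * D[~same].sum()
```

For any assignment of points to clusters, W + B equals the total variation T, which does not depend on the clustering at all.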
GAP function for Number of Clusters
Denote W*_k the expected W for uniformly distributed data and k clusters, averaged over 20 runs.
GAP is the expected log(W*_k) minus the observed log(W(k)): G(k) = E[log(W*_k)] − log(W(k))
K* = argmin{ k | G(k) ≥ G(k+1) − s′_{k+1} }
s′_k = s_k √(1 + 1/20), where s_k is the standard deviation of log(W*_k)
Silhouette
For each data sample xi we define
a(i) = (1/(|C_i| − 1)) Σ_{j∈C_i, j≠i} d(i, j)   if |C_i| > 1
b(i) = min_{k≠C_i} (1/|C_k|) Σ_{j∈C_k} d(i, j)
Definition (Silhouette)
The silhouette is defined as
s(i) = (b(i) − a(i)) / max{a(i), b(i)}   if |C_i| > 1;   s(i) = 0 for |C_i| = 1.
The optimal number of clusters k may be selected by the silhouette score.
Definition (Silhouette Score)
The silhouette score is (1/N) Σ_{i=1}^N s(i).
The silhouette is always between −1 ≤ s(i) ≤ 1.
[Figure: silhouette analysis for KMeans clustering on sample data with n_clusters = 3 – left: silhouette plot of the silhouette coefficient values per cluster label; right: visualization of the clustered data in the two features.]
Note: one cluster (−1,1),(1,1), another cluster (0,−1.2),(0,−1.1); the point (0,0) is assigned to the first cluster but has a negative silhouette. https://stackoverflow.com/a/66751204
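The definitions above can be implemented in a few lines and checked on the five points from the note (a minimal sketch; the function name silhouette is ours, not the scikit-learn API):

```python
import numpy as np

def silhouette(X, assign):
    """Per-sample silhouette s(i), following the definition above."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.unique(assign)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = assign == assign[i]
        if own.sum() == 1:
            continue                           # s(i) = 0 for singleton clusters
        a = D[i, own].sum() / (own.sum() - 1)  # mean distance within own cluster
        # mean distance to the nearest other cluster
        b = min(D[i, assign == k].mean() for k in labels if k != assign[i])
        s[i] = (b - a) / max(a, b)
    return s
```

The silhouette score of a clustering is then s.mean(); the point (0,0) from the note indeed gets a negative s(i).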
Country Similarity Example
Data from a political science survey: values are average pairwise dissimilarities of countries from a questionnaire given to political science students.
K–medoids
1: procedure K-medoids(X data, K the number of clusters)
2:   select randomly K data samples to be the medoids of the clusters
3:   repeat
4:     for each data record do
5:       assign it to the closest cluster
6:     end for
7:     for each cluster k do   # find the new medoid i*_k ∈ C_k
8:       i*_k ← argmin_{i: C(i)=k} Σ_{C(i′)=k} d(xi, x_{i′})
9:     end for
10:  until no change in assignment
11: end procedure
Finding a medoid requires quadratic time, compared to the linear k-means update.
We may use any distance, for example the number of differences in binary attributes.
Complexity
The t iterations of K-medoids take O(tkpN²) time.
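Because only distances are used, the pseudocode can be sketched directly on a precomputed distance matrix; a minimal illustration (the function name is ours, not a library implementation):

```python
import numpy as np

def k_medoids(D, K, max_iter=100, seed=0):
    """A minimal sketch of the K-medoids pseudocode, working directly on
    a symmetric distance matrix D (no metric space needed)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=K, replace=False)
    assign = None
    for _ in range(max_iter):
        new_assign = D[:, medoids].argmin(axis=1)   # closest medoid
        if assign is not None and np.array_equal(new_assign, assign):
            break                                   # no change in assignment
        assign = new_assign
        for k in range(K):
            members = np.where(assign == k)[0]
            if len(members):
                # new medoid: the member minimizing total distance to its cluster
                medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
    return medoids, assign
```

Here D may come from any dissimilarity, e.g. the survey dissimilarities of countries; below we just use Euclidean distances for the check.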
Clusters of Countries
Survey of country dissimilarities.
Left: dissimilarities reordered and blocked according to 3-medoid clustering. The heat map is coded from most similar (dark red) to least similar (bright red).
Right: Two-dimensional multidimensional scaling plot with 3-medoid clusters indicated by different colors.
Multidimensional Scaling
The right figure on the previous slide was made by multidimensional scaling.
We know only the distances between countries, not a metric space.
We try to keep the proximity of countries (least squares scaling).
We choose the number of dimensions p.
Definition (Multidimensional Scaling)
For given data x1, . . . , xN with their distance matrix d, we search for projections (z1, . . . , zN) ∈ R^p of the data minimizing the stress function
S_D(z1, . . . , zN) = ( Σ_{i≠ℓ} (d[xi, xℓ] − ‖zi − zℓ‖)² )^{1/2}.
It is minimized by gradient descent.
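A gradient descent on the (squared) stress function can be sketched as follows; this is a minimal illustration with hand-picked learning rate and step count, not a tuned implementation:

```python
import numpy as np

def mds(D, p=2, lr=0.01, steps=4000, seed=0):
    """Least-squares multidimensional scaling: gradient descent on the
    squared stress from the definition above. A minimal sketch."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    Z = rng.normal(size=(N, p))                # random initial projections
    for _ in range(steps):
        delta = Z[:, None, :] - Z[None, :, :]  # z_i - z_l for all pairs
        dist = np.linalg.norm(delta, axis=-1)
        np.fill_diagonal(dist, 1.0)            # avoid division by zero
        coef = (dist - D) / dist               # residual per pair
        np.fill_diagonal(coef, 0.0)
        grad = 2 * (coef[:, :, None] * delta).sum(axis=1)
        Z -= lr * grad
    return Z
```

When the dissimilarities are exact Euclidean distances of low-dimensional points, the embedding distances are recovered almost perfectly.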
Hierarchical clustering – Bottom Up
Start with each data sample in its own cluster. Iteratively join the two nearest clusters.
Measures for the join:
closest points (single linkage)
maximally distant points (complete linkage)
average linkage: d_GA(C_A, C_B) = (1/(|C_A|·|C_B|)) Σ_{xi∈C_A, xj∈C_B} d(xi, xj)
The Ward distance minimizes the sum of squared differences within all clusters:
W(C_A, C_B) = Σ_{i∈C_A∪C_B} d(xi, µ_{A∪B})² − Σ_{i∈C_A} d(xi, µ_A)² − Σ_{i∈C_B} d(xi, µ_B)²
            = (|C_A|·|C_B|) / (|C_A| + |C_B|) · d(µ_A, µ_B)²
where the µ are the centers of the clusters (A, B, and the joined cluster).
It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
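The closed form on the right-hand side of the Ward distance can be checked numerically; a small sketch with two arbitrary random point sets as clusters A and B:

```python
import numpy as np

def ward_distance_direct(A, B):
    """Increase of the within-cluster sum of squares caused by merging A and B."""
    sq = lambda X: np.sum((X - X.mean(axis=0)) ** 2)
    return sq(np.vstack([A, B])) - sq(A) - sq(B)

def ward_distance_closed(A, B):
    """Closed form: |A|·|B| / (|A|+|B|) · d(mu_A, mu_B)^2."""
    nA, nB = len(A), len(B)
    return nA * nB / (nA + nB) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
```

Both expressions agree for any pair of point sets, which is why the Ward criterion can be evaluated from the cluster sizes and centers alone.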
Dendrograms
A dendrogram is the result plot of a hierarchical clustering.
Cutting the tree at a fixed height splits the samples at the leaves into clusters.
The length of the two legs of the U-link represents the distance between the child clusters.
[Figure: dendrograms of the same data using average linkage, complete linkage, and single linkage.]
Interpretation of Dendrograms – 2 and 9 are NOT close
Samples fused at the very bottom are close to each other.
[Figure: a dendrogram of nine samples and the corresponding scatter plots of the samples in the (X1, X2) plane.]
Mean Shift Clustering
Mean Shift Clustering
1: procedure MeanShiftClustering(X data, K(·) the kernel, λ the bandwidth)
2:   C ← ∅
3:   for each data record x do
4:     repeat   # shift the mean x to the weighted average
5:       m(x) ← Σ_{i=1}^N K(xi − x)·xi / Σ_{i=1}^N K(xi − x);  x ← m(x)
6:     until no change (convergence)
7:     add the final m(x) to C
8:   end for
9:   return pruned C
10: end procedure
Kernels:
flat kernel: the indicator of a λ-ball
Gaussian kernel: K(xi − x) = e^{−‖xi − x‖² / λ²}
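A minimal sketch of the procedure with the Gaussian kernel, pruning C by merging modes closer than the bandwidth (names and the pruning rule are our illustrative choices):

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, tol=1e-5, max_iter=300):
    """A minimal sketch of mean shift clustering with a Gaussian kernel
    (bandwidth plays the role of lambda)."""
    modes = X.copy()
    for i in range(len(X)):
        x = X[i].copy()
        for _ in range(max_iter):
            # weighted average of all points, weights K(x_i - x)
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / bandwidth ** 2)
            m = (w[:, None] * X).sum(axis=0) / w.sum()
            if np.linalg.norm(m - x) < tol:
                break                      # no change: x reached a mode
            x = m
        modes[i] = x
    # prune: merge modes closer than the bandwidth into one center
    centers = []
    for m in modes:
        if all(np.linalg.norm(m - c) >= bandwidth for c in centers):
            centers.append(m)
    return np.array(centers)
```

Unlike k-means, the number of clusters is not given in advance; it emerges from the bandwidth.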
Other Distance Measures
Correlation Proximity
[Figure: three observations plotted over 20 variable indices.]
Euclidean distance: observations 1 and 3 are close.
Correlation distance: observations 1 and 2 look very similar.
ρ_{X,Y} = corr(X, Y) = cov(X, Y) / (σ_X σ_Y) = E[(X − µ_X)(Y − µ_Y)] / (σ_X σ_Y)
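The contrast between the two distances can be reproduced with simple series: observation 2 below is a shifted copy of observation 1, while observation 3 stays close in value but differs in shape (these concrete series are ours, not the ones in the figure):

```python
import numpy as np

x1 = np.arange(20, dtype=float)            # observation 1: a rising trend
x2 = x1 + 15.0                             # observation 2: same shape, shifted up
x3 = x1 + np.array([0.5, -0.5] * 10)       # observation 3: close in value, different shape

def euclidean(a, b):
    return np.linalg.norm(a - b)

def correlation_distance(a, b):
    # 1 - Pearson correlation of the two series
    return 1.0 - np.corrcoef(a, b)[0, 1]
```

Euclidean distance declares 1 and 3 close and 1 and 2 far apart; correlation distance declares 1 and 2 nearly identical.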
Summary
K-means clustering – the basic one
The number of clusters: GAP, silhouette
The distance is crucial.
Consider standardization or weighting of the features.
K-medoids – does not need a metric, just a distance
Hierarchical clustering – different distance measures, dendrogram
Other approaches (mean shift clustering, Self-Organizing Maps).