
(1)

Unsupervised Learning

Pierre Gaillard – ENS Paris

February 27, 2020

(2)

Supervised vs unsupervised learning

Two main categories of machine learning algorithms:

- Supervised learning: predict an output $Y$ from some input data $X$. The training data has a known label $Y$.

Examples:
– $X$ is a picture, and $Y$ is a cat or a dog
– $X$ is a picture, and $Y \in \{0, \dots, 9\}$ is a digit
– $X$ are videos captured by a robot playing table tennis, and $Y$ are the parameters of the robot to return the ball correctly

- Unsupervised learning: training data is not labeled and does not have a known result.

Examples:
– detect change points in a non-stationary time series
– detect outliers
– clustering: group data into homogeneous groups
– principal component analysis: compress data without losing much information


(3)

Reminder of last class on supervised learning: OLS and logistic regression

(4)

Least squares linear regression

Given training data $(X_i, Y_i)$ for $i = 1, \dots, n$, with $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$, learn a predictor $f$ such that the expected square loss
$$\mathbb{E}\big[(f(X) - Y)^2\big]$$
is small.

We assume here that $f$ is a linear combination of the input $x = (x_1, \dots, x_d)$:
$$f_\theta(x) = \sum_{i=1}^d \theta_i x_i = \theta^\top x$$

[Figure: data points $(X, Y)$ and the fitted linear predictor $f(X)$.]

(5)

Ordinary Least Squares

Input $X \in \mathbb{R}^d$, output $Y \in \mathbb{R}$, and $\ell$ is the square loss: $\ell(a, y) = (a - y)^2$.

The Ordinary Least Squares regression (OLS) minimizes the empirical risk
$$\hat R_n(w) = \frac{1}{n} \sum_{i=1}^n (Y_i - w^\top X_i)^2.$$

This is minimized in $w \in \mathbb{R}^d$ when $X^\top X w - X^\top Y = 0$, where $X = [X_1, \dots, X_n]^\top \in \mathbb{R}^{n \times d}$ and $Y = [Y_1, \dots, Y_n]^\top \in \mathbb{R}^n$.

[Figure: data points $(X, Y)$ and the fitted predictor $f(X)$.]

Assuming $X$ is injective (i.e., $X^\top X$ is invertible), there is an exact solution
$$\hat w = (X^\top X)^{-1} X^\top Y.$$

What happens if $d \gg n$? (See the sketch below.)
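As an illustration, here is a minimal numpy sketch of the closed-form solution, and of what happens when $X^\top X$ is no longer invertible; the data, sizes, and variable names are made up for the example.

```python
import numpy as np

# Illustrative sketch: closed-form OLS on synthetic data (names and sizes are arbitrary).
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                    # design matrix, one row per observation
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Normal equations: w_hat solves X^T X w = X^T Y (valid when X^T X is invertible).
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# If d >> n (not the case here), X^T X is singular and the system has infinitely
# many minimizers; np.linalg.lstsq then returns the minimum-norm least-squares solution.
w_minnorm, *_ = np.linalg.lstsq(X, Y, rcond=None)
```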

(6)

Ordinary Least Squares: how to compute $\hat\theta_n$?

If the design matrix $X^\top X$ is invertible, the OLS estimator has the closed form:
$$\hat\theta_n \in \arg\min_{\theta} \hat R_n(\theta) = (X^\top X)^{-1} X^\top Y.$$

Question: how to compute it?

- Inversion of $(X^\top X)$ can be prohibitive (the cost is $O(d^3)$!).
- QR decomposition: we write $X = QR$, with $Q$ an orthogonal matrix and $R$ an upper-triangular matrix. One then only needs to solve the triangular linear system
$$R \hat\theta = Q^\top Y$$
by back-substitution (see the sketch below).

- Iterative approximation with convex optimization algorithms [Bottou, Curtis, and Nocedal].
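A possible sketch of the QR route with numpy and scipy; the sizes and names are illustrative, not taken from the slides.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Illustrative sketch of the QR approach: X = QR, then solve R theta = Q^T Y
# by back-substitution instead of forming (X^T X)^{-1}.
rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

Q, R = np.linalg.qr(X)                      # reduced QR: Q is n x d, R is d x d upper-triangular
theta_qr = solve_triangular(R, Q.T @ Y)     # back-substitution

# Sanity check against the normal-equation solution.
theta_ne = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(theta_qr, theta_ne)
```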

(7)

Classification

Given training data $(X_i, Y_i)$ for $i = 1, \dots, n$, with $X_i \in \mathbb{R}^d$ and $Y_i \in \{0, 1\}$, learn a classifier $f(x)$ such that
$$f(X_i) \begin{cases} \geq 0 & \text{if } Y_i = 1 \\ < 0 & \text{if } Y_i = 0 \end{cases}$$

[Figure: a linearly separable dataset with decision boundary $\theta^\top X = 0$ vs. a non-linearly separable dataset with decision boundary $f(X) = 0$.]

(8)

Linear classification

We would like to find the best linear classifier such that
$$f_\theta(X) = \theta^\top X \begin{cases} \geq 0 & \text{if } Y = 1 \\ < 0 & \text{if } Y = 0 \end{cases}$$

Empirical risk minimization with the binary loss?
$$\hat\theta_n = \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{Y_i \neq \mathbf{1}_{\theta^\top X_i \geq 0}}$$
This is not convex in $\theta$. Very hard to compute!

[Figure: decision boundary $\theta^\top X = 0$; the binary (0-1) error as a function of $w^\top X$.]

(9)

Logistic regression

Idea: replace the loss with a convex loss
$$\ell(\theta^\top X, y) = y \log\big(1 + e^{-\theta^\top X}\big) + (1 - y) \log\big(1 + e^{\theta^\top X}\big)$$

[Figure: binary, logistic, and hinge losses as a function of the prediction $\hat y$; decision boundary $\theta^\top X = 0$.]

Probabilistic interpretation: based on likelihood maximization of the model
$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-\theta^\top X}} \in [0, 1].$$
Satisfied for many distributions of $X \mid Y$: Bernoulli, Gaussian, Exponential, …
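To make the idea concrete, here is a small numpy sketch of logistic regression fitted by gradient descent on the convex loss above; the step size, iteration count, and toy data are arbitrary choices, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Gradient descent on the average logistic loss; labels y are in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)            # model: P(Y=1 | X) = 1 / (1 + exp(-theta^T X))
        grad = X.T @ (p - y) / len(y)     # gradient of the empirical logistic risk
        theta -= lr * grad
    return theta

# Toy usage on roughly linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
theta_hat = fit_logistic(X, y)
```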

(10)

Clustering

(11)

Outline

Reminder of last class on supervised learning: OLS and logistic regression

Clustering

What is clustering?

Clustering algorithms

Dimensionality Reduction Algorithms

(12)

What is clustering?

Goal: group together similar instances.
Requires data but no labels.
Useful when you don't know what you are looking for.

[Figure: scatter plot of unlabeled points forming a few natural groups, with cluster centers marked ×.]

Types of clustering algorithms:

- Model-based clustering: mixtures of Gaussians
- Hierarchical clustering: a hierarchy of nested clusters is built using a divisive or agglomerative approach
- Flat clustering (no hierarchy): k-means, spectral clustering


(15)

Examples of applications

Clustering has a variety of goals, which all relate to grouping or segmenting a collection of objects into homogeneous groups.

- Biology: group species and generate phylogenies
- Marketing: group similar shopping items and similar customers (market outreach, recommender systems)
- Computer science: image segmentation

(16)

What do we need for clustering?

Clustering does not need any labeled data. But it requires:

1. A proximity measure, which can be
– a similarity measure $s(X_i, X_j)$ between items, which is large if $X_i$ and $X_j$ are similar. Example: correlation
– a dissimilarity measure $d(X_i, X_j)$, which is large if $X_i$ and $X_j$ are different. Example: a distance derived from a metric, $d(X_i, X_j) = \|X_i - X_j\|$

2. A criterion to evaluate the clustering

3. A clustering algorithm optimizing the criterion

(17)

Proximity measure

Proximity matrix: the data may be represented directly in terms of the proximity between pairs of objects. This is the case for categorical variables.

Example: an experiment where participants are asked to judge how much objects differ from one another.
The proximity matrix is assumed to be symmetric: $D \leftarrow (D + D^\top)/2$.

Distance measure: the similarity can be measured by a metric
- Euclidean distance: $d(X_i, X_j) = \|X_i - X_j\|_2$
- Correlation
- Manhattan distance: $d(X_i, X_j) = \|X_i - X_j\|_1$
- Minkowski distance: $d(X_i, X_j) = \|X_i - X_j\|_p$

The results crucially depend on the choice of metric, which itself depends on the data (see the sketch below).
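A small sketch of how such dissimilarity matrices can be computed with scipy; the data and the metric names passed to `pdist` are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative sketch: pairwise dissimilarity matrices for the metrics above.
X = np.random.default_rng(0).normal(size=(6, 3))              # 6 items described by 3 features

D_euclidean = squareform(pdist(X, metric="euclidean"))        # ||Xi - Xj||_2
D_manhattan = squareform(pdist(X, metric="cityblock"))        # ||Xi - Xj||_1
D_minkowski = squareform(pdist(X, metric="minkowski", p=3))   # ||Xi - Xj||_p with p = 3
D_correlation = squareform(pdist(X, metric="correlation"))    # 1 - correlation (a dissimilarity)

# Symmetrizing a proximity matrix obtained from (hypothetical) human judgments: D <- (D + D^T) / 2
D = D_euclidean + 0.01 * np.random.default_rng(1).normal(size=(6, 6))
D_sym = (D + D.T) / 2
```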

(18)

Clustering performance evaluation

This is often a hard problem, much harder than for supervised learning!

A trade-off:
- Intra-cluster homogeneity: objects within the same cluster should be similar to one another.
- Inter-cluster heterogeneity: clusters should be well separated; their centers should be far from one another.

In many applications, we ask experts.

(19)

Examples of performance criteria

Known ground truth: mutual information. If $U$ and $V$ are two clusterings of $n$ objects, the mutual information measures their agreement as
$$\mathrm{MI}(U, V) = \sum_{i} \sum_{j} P(i, j) \log\left(\frac{P(i, j)}{P(i)\,P(j)}\right)$$
where $P(i) = |U_i|/n$ (resp. $P(j) = |V_j|/n$, and $P(i, j) = |U_i \cap V_j|/n$) is the probability that an object picked at random falls into class $U_i$ (resp. into class $V_j$, and into both classes).

Drawback: the ground truth is rarely known.

Unknown ground truth: the Silhouette score, defined as
$$\text{Score} = \frac{b - a}{\max\{a, b\}}$$
where
- $a$ is the mean distance between a sample and all other points in the same class,
- $b$ is the mean distance between a sample and all other points in the next nearest cluster.

The score is higher when clusters are dense and well separated, which relates to a standard notion of a cluster.
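Both kinds of criteria are available in scikit-learn; here is a minimal sketch on synthetic data, where the dataset and the choice of k-means with three clusters are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import mutual_info_score, silhouette_score

# Synthetic data with a known ground-truth labeling.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Known ground truth: mutual information between the true and predicted labelings.
print(mutual_info_score(y_true, y_pred))

# Unknown ground truth: silhouette score, computed from the data and predicted labels only.
print(silhouette_score(X, y_pred))
```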

(20)

How many clusters?

[Figure: the same dataset clustered with two different numbers of clusters.]

Possible approaches:

- Fix the number of clusters $k$ in advance.
- Choose the number of clusters by optimizing a performance criterion such as the silhouette score (see the sketch below).
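For instance, one can scan a range of candidate values of $k$ and keep the one with the best silhouette score; a minimal sketch, where the candidate range and synthetic data are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Try several values of k and keep the one maximizing the silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```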

(21)

Outline

Reminder of last class on supervised learning: OLS and logistic regression

Clustering

What is clustering?

Clustering algorithms

Dimensionality Reduction Algorithms

(22)

Clustering algorithms

Types of clustering algorithms:

- K-means: flat clustering
- Spectral clustering
- Combinatorial clustering
- Gaussian mixture: model-based clustering
- Hierarchical clustering: a hierarchy of nested clusters is built using a divisive or agglomerative approach
- Affinity propagation: based on message passing in a graph

Most of these are available off the shelf; see the sketch after this list.
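As a rough illustration, most of these families are exposed by scikit-learn; the sketch below just calls the corresponding estimators on synthetic data, with illustrative default parameters rather than tuned values.

```python
from sklearn.cluster import (AffinityPropagation, AgglomerativeClustering,
                             KMeans, SpectralClustering)
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labels_kmeans = KMeans(n_clusters=3, n_init=10).fit_predict(X)        # flat clustering
labels_spectral = SpectralClustering(n_clusters=3).fit_predict(X)     # flat, graph-based
labels_gmm = GaussianMixture(n_components=3).fit_predict(X)           # model-based (mixture of Gaussians)
labels_hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)    # hierarchical (agglomerative)
labels_ap = AffinityPropagation().fit_predict(X)                      # message passing in a graph
```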

(23)

K-means

[Figure: scatter plot of data points with cluster centers marked ×.]

- Initialization: sample $K$ points as cluster centers.
- Alternate:
1. Assign each point to its closest center.
2. Update each cluster center to the average of its assigned points.
- Stop when no point's assignment changes.

[Slides 24 to 27 repeat the K-means steps above, showing successive assignment and update iterations on the example data; a code sketch of the procedure follows below.]
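A minimal numpy sketch of the K-means iterations described above, with initialization by sampling $K$ data points; the function, variable names, and toy data are illustrative.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Minimal K-means: alternate assignment and center updates until assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialization: sample k data points
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # 1. Assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):               # stop when no assignment changes
            break
        assign = new_assign
        # 2. Update each center to the average of its assigned points.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

# Toy usage on three synthetic groups.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [4, 4], [0, 4])])
centers, labels = kmeans(X, k=3)
```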
