• Aucun résultat trouvé

slides

N/A
N/A
Protected

Academic year: 2022

Partager "slides"

Copied!
14
0
0

Texte intégral

(1)

DIFFRAC : a discriminative and flexible framework for clustering

Francis Bach Za¨ıd Harchaoui

INRIA - Ecole Normale Sup´erieure Telecom Paris

NIPS - November 2007

(2)

Summary

• Discriminative clustering = find labels that optimize linear separability

• Square loss for classification = cost function in closed form

• Optimization of the labels by convex relaxation

• Efficient optimization algorithm by partial dualization

• Application in semi-supervised learning

(3)

Classification with square loss

• n points x1, . . . , xn in Rd, represented in a matrix X ∈ Rn×d.

• Labels = partitions of {1, . . . , n} into k > 1 clusters, represented by indicator matrices

y ∈ {0, 1}n×k such that y1k = 1n

• Regularized linear regression problem of y given X : J(y, X, κ) = min

wRd×k, bR1×k 1

nky − Xw − 1nbk2F + κ tr ww, – Multi-label classification problems with square loss functions – Solution in closed form (with Πn = Inn11n1n ):

w = (XΠnX + nκIn)1XΠny and b = 1

n1n (y − Xw)

(4)

Discriminative clustering cost

• Discriminative clustering consists in finding labels such that they lead to best linear separation by a discriminative classifier (Xu et al., 2004, 2005)

• Use square loss for multi-class classification

• Main advantages

– minimizing the regularized cost in closed form

– including a bias term by simply centering the data

• Optimal value equal to J(y, X, κ) = tr yyA(X, κ), where

A(X, κ) = 1

n(In−X(XΠnX + nκI)1Xn

(5)

Diffrac

• Optimization problem: minimize tr yyA(X, κ) with respect to y (indicator matrices)

• The cost function only involves the matrix M = yy ∈ Rn×n (=

k-class equivalence matrix)

• Convex outer approximation for M

– M is positive semidefinite (denoted as M < 0)

– the diagonal of M is equal to 1n (denoted as diag(M) = 1n) – if M corresponds to at most k clusters, we have M < 1

k1n1n

• Convex set:

Ck = {M ∈ Rn×n, M = M, diag(M) = 1n, M > 0, M < 1

k1n1n }

(6)

Minimum cluster sizes

• Avoid trivial solution by imposing a minimum size λ0 for each cluster, through:

– Row sums: M1n > λ01n and M1n 6 (n − (k − 1)λ0)1n (same constraint as Xu et al., 2005).

– Eigenvalues: The sizes of the clusters are exactly the k largest eigenvalues of M ⇒ constraint equivalent to Pn

i=1 1λi(M)>λ0 > k, where λ1(M), . . . , λn(M) are the n eigenvalues of M.

∗ Non convex constraint

∗ Relaxed as Pn

i=1 φλ0i(M)) > k, where φλ0(κ) = min{κ/λ0, 1}

• Final convex relaxation: minimize tr A(X, κ)M such that M = M, diag(M) = 1n, M > 0, M < 1

k1n1n , Pn

i=1 φλ0i(M)) > k

(7)

Comparison with K-means

• DIFFRAC (κ = 0): minimize

tr Πn(In − X(XΠnX)1Xnyy

• K-Means: minimize (Zha et al., 2002, Bach & Jordan, 2004)

minµRk×d kX − yµk2F = tr(In − y(yy)1y)(ΠnX)(ΠnX)

0 10 20 30

−0.2 0 0.2 0.4 0.6 0.8 1

noise dimension

clustering error K−means

diffrac

(8)

Kernels

• The matrix A(X, κ) can be expressed only in terms of the Gram matrix K = XX.

A(K, κ) = κΠn(Ke + nκIn)1Πn

where Ke = Πnn is the “centered Gram matrix” of the points X.

• Additional relaxation to kernel PCA:

1. relaxing the constraints M < 1

k1n1n into M < 0 2. relaxing diag(M) = 1n into tr M = n

3. removing the constraint M > 0 and the constraints on the row sums.

• Important constraint: diag(M) = 1

(9)

Optimization by partial dualization - I

• Optimization problem:

min tr AM such that M = M, M < 0, tr M = n Φλ0(M) > k

diag(M) = 1n β1

M1n 6 (n − (k − 1)λ0)1n, M1n > λ01n β2, β3

M > 0 β4

M < 1n1

n

k β5, β6

• Partial dualization of constraints

– Kept constraints lead to simple spectral problem

(10)

Optimization by partial dualization - II

• Lagrangian equal to tr B(β)M − b(β) with

B(β) = A + Diag(β1) − 122 − β3)1121(β2 − β3) − β4 + 12β5ββ5

6

b(β) = β11 − (n − (k − 1)λ021 + λ0β31 + kβ6/2 + β51,

• Primal variable M, dual variables β1, β2, β3, β4, (β5, β6)

• Dual problem: max

β

(

M<0,trMmin=n,Φλ

0(M)>k tr B(β)M − b(β) )

• Minimization with respect to M leads to convex non differentiable spectral function in β

• Maximization with respect to β by projected subgradient or projected gradient (after smoothing)

(11)

Computational complexity - Rounding

• Constant times the matrix-vector operation with the matrix A

• Linear complexity in the number n of data points.

• For linear kernels with dimension d: O(d2n)

• For general kernels: O(n3) or O(m2n) using an incomplete Cholesky decomposition of rank m

• Rounding

– After the convex optimization, we obtain a low-rank matrix M ∈ Ck which is pointwise nonnegative with unit diagonal

– Spectral clustering algorithm on the matrix M (Ng & al., 2001) – NB : Diffrac works better than doing spectral clustering on A or

K!

(12)

Semi-supervised learning

• Equivalence matrices M allows simple inclusion of prior knowledge (Xu et al., 2004, De Bie and Cristianini, 2006)

• “must-link” constraints (positive constraints) : Mij = 1

– With a square loss ⇒ equivalent to grouping into chuncks

• “must-not-link” constraints (negative constraints) : Mij = 0

0 20 40

0 0.5 1

noise dimension

clustering error

20 % x n

0 20 40

0 0.5 1

40 % x n

noise dimension

K−means diffrac

(13)

Simulations

• Clustering classification datasets

– Performance measured by clustering error between 0 and 100(k−1) – Comparison with K-means and RCA (Bar-Hillel et al., 2003)

Dataset K-means Diffrac RCA

Mnist-linear 0% 5.6 ± 0.1 6.0 ± 0.4

Mnist-linear 20% 4.5 ± 0.3 3.6 ± 0.3 3.0 ± 0.2 Mnist-linear 40% 2.9 ± 0.3 2.2 ± 0.2 1.8 ± 0.4 Mnist-RBF 0% 5.6 ± 0.2 4.9 ± 0.2

Mnist-RBF 20% 4.6 ± 0.0 1.8 ± 0.4 4.1 ± 0.2 Mnist-RBF 40% 4.9 ± 0.0 0.9 ± 0.1 2.9 ± 0.1 Isolet-linear 0% 12.1 ± 0.6 12.3 ± 0.3

Isolet-linear 20% 10.5 ± 0.2 7.8 ± 0.8 9.5 ± 0.4 Isolet-linear 40% 9.2 ± 0.5 3.7 ± 0.2 7.0 ± 0.4 Isolet-RBF 0% 11.4 ± 0.4 11.0 ± 0.3

Isolet-RBF 20% 10.6 ± 0.0 7.5 ± 0.5 7.8 ± 0.5 Isolet-RBF 40% 10.0 ± 0.0 3.7 ± 1.0 6.9 ± 0.6

(14)

Simulations

• Semi-supervised classification

• Diffrac works with any amount of supervision – Diffrac works with any amount of supervision – Comparison with LDS (Chapelle & Zien, 2004)

0 50 100 150 200

0 0.2 0.4 0.6 0.8

Number of labelled training points

Test error

Learning curve on Coil100

DIFFRAC LDS

0 50 100 150

0.25 0.3 0.35 0.4 0.45 0.5

Number of labelled training points

Test error

Learning curve on BCI

DIFFRAC LDS

0 50 100 150 200

0.1 0.15 0.2 0.25 0.3 0.35

Number of labelled training points

Test error

Learning curve on Text

DIFFRAC LDS

Références

Documents relatifs

6 Laboratoire de physique th´eorique, CNRS, UPMC and ´ Ecole normale sup´erieure, Paris, France (Received 12 October 2013; revised manuscript received 11 February 2014; published

CENTRE DE MORPHOLOGIE MATHEMATIQUE Universit´e catholique de Louvain... Ecole nationale sup´erieure des mines

Vlasov-Poisson equation, strong magnetic field regime, finite Larmor radius scaling, electric drift, polarization drift, oscillations in time.. 1 Ecole Normale Sup´ ´ erieure,

• April - May 2016: No´ emie Fong (L3 student in the department of Computer Sciences of ´ Ecole normale sup´ erieure).. Bibliographic works on static analysis of

Ecole Nationale Sup´ ´ erieure de Techniques Avanc´ ees.. Programmation

J’en adresse copie aux chefs des ´ etablissements concern´ es au premier chef, l’ ´ Ecole normale sup´ erieure et l’Universit´ e Paris–Diderot Paris 7, qui auront ` a

1 Laboratoire Physique Statistique, CNRS, Ecole Normale Sup´ erieure, PSL Research University, Universit´ e Pierre et Marie Curie, Paris, France.. 2 Institut de Biologie de l’ENS,

2 CNRS, DMA, Ecole Normale Sup´ erieure, 45 rue d’Ulm, 75230 Paris Cedex 05, France The spectral properties of su(2) Hamiltonians are studied for energies near the critical