• Aucun résultat trouvé

TP APPC : Visualisation et réduction de dimension Stéphane Canu 9 Novembre 2021, ASI, INSA Rouen

N/A
N/A
Protected

Academic year: 2022

Partager "TP APPC : Visualisation et réduction de dimension Stéphane Canu 9 Novembre 2021, ASI, INSA Rouen"

Copied!
4
0
0

Texte intégral

(1)

TP APPC : Visualisation et réduction de dimension Stéphane Canu

9 Novembre 2021, ASI, INSA Rouen

The purpose of this notebook is compare different data visualization algorithm to represent the MNIST data set in two dimensions order to determine

• what are the best ones to visualize data in 2 dimensions ?

• what are the scalable ones ?

The interest of using the MNIST data set it that the labels are known. We are going to apply data visualization methods without the labels and use it to asses the quality of the resulting representation

1

. Import the needed material. The function plot_repr is provided in the file plot_repr.py

-150 -100 -50 0 50 100 150

-150 -100 -50 0 50 100 150

0 0 0

0 0

0 0 000 0

0 0 0

0 0

0 0 0

0 00

0 0

0 0 0 0 0 000

0 0 0 0 0

0

0 0

0

0 0 0

0 0 0

0 0

00 0 00

0

0 0 0 0

0 0

0 0

0

0 00

0 0

0 0 0 0 00

0 00 0 0

0

0 0

0 0 00 00 0 0 0 00

0 0 0 0 00 1

1 1 1 1

1 1

11 1 1 111 1 1

1 1 1

1 1

1

1 1 1 1

1 1

1 1

1 11 1 1

1 1 1 1

1 1 1 11 1 1

1 1

1 1

1 1 1

1 1

1 1

1

11 11 1 11

11 1

1 1 1 1 1

1 1 11

1 1

11 1

1 1 1

1 1

1 1

111 1 1

1 1

1 1 1

1

2 2

2 2

2

2 2

2 2

2

22 2 2

2 2 2 2 2

2 2 2 2

2 2

2

2

2 2

2 2 2

2 2

2 2

2 2 2

2 2 2

2 2

2

2

2 2

2 2

22 2 2 2

22 2

2

2 2 2

2 2

2 2 2

2 2

2

2 2 22

2

2 2

2 2

2 2 22

2

2 2

2 2 2 2

2 2

2 2

2

2 2 22 2 3

3 3 3

3

3 3

3 3

3

3 33

3 33 3 3 3

33 3 3

33 3 3

3 3 3 3 3

3 3 3

3 3 3

3 33

33 3

3 3

3

3 3 3 3

33 3

3 3 33

3 3 3

33

3 3

3

3 3

3 3

3 3

3

3 3

3 3 3 3 3 3 3

3 3

3 3 3

33 33

3 3 3

3 3

3333

4 4 4 4 4 4

4 4 4 4

4

4 4

4 4 4

4 4 4 4

4 4

4

4 4

4 44

4 4

4 4 4

4

4 4

4

4 4

4 4

4 4 44

4

4 4

4 4

4 4 4

4 4

4 4 4

4 44

4 44 4

4 4 44

4 4

4 4

4 4

4 4 44

44 4 4 4

44 4

4 4

4

4 4 4

4 4 4

4 4

4 4

5 5

5 5

5 5

5

5 5

5 5

5 5 5

5 5 5

5 5 5 5 5

5

5

5 5

5 5 5 5

5 5

5 5

5

5 5 5 5

5 5 5

5 5 5

5 55

5 5

5 55 5

5 5 5 5

5

5 5

5 55

5

5 5

5

5 5

5 5

5 5

5

5 5

5 55

5 5 5

5 5

5 5

5 5

5 5 5

5 5 55

5 55

5 66

66 66 6

6 6

6 6 66 6

6 6

6 6

6 6

6 66 6

6 6

6 6 6

6 6 6

6 6 6

6

6 6 6 6 6

6

6 6

6

66 6

6

6 6 6

6 6

6 6

6 6

6 66

6 6 66

6

6

6 6

6 6

66 6 6

6

66 6

6 6 6

6

6 6 6

6 6 6

6 6 666 6 6 6

6 6 6

7 7

7

7 7

77

7

7 77

7 7 7

7 7 7

7 7

7 7

7 7 7 7

7 7 7

7 7

7 7 7

7

7 7 7 77

77 77

7

7 7

7

7 7 7

7 7

7

7 7

7 7 7

7

7 7 7 7

7 7

777 7

7

7 7

7

7 7 7

7 7

7 7

7

7 7 7

7 7

7 7 7

7

77 7

7

7 7

7 7

7 7

8

8 8

8 8 8

8 8 8 8

8

8 8 8 8

8

8 8

888 8

8 8 8

8 8

8 8

8 8 8 8

8 8

8 88 88

8 8

8 8 8 8 88

8

88 8 8 8

8 8

8 8 8 8

8 8 8

8

88 8 8 8

8 8 8 8

8 8

8 88 8

88 8

8 8

8

8 8

8

8

8 8

8 8

8 8 88

8 8

8

9 9 99

9 9 9 9

9

9

9 9

9

9 9

9 9

999 9 9

99

9 9

99 9 9

9 9 9 9 9

9

9

9 9 9

9 9 9

9 9

9 9

9 9 9

9 9 9

9 9

9 9 9

9 9 9

9 9 9 9

9 9 99

9 9 9

9 9

9 9

9 9 9 9

9 9

9 9

999 9 9 9

99 9

9

99 99

9 9

Figure 1 : Expected result

Ex. 1 — Dimensionality reduction and manifold learning for data visaulization

1. Data preparation.

a) Load some relevant API

import numpy as np

from numpy.random import permutation import time

import pandas as pd

import matplotlib.pyplot as plt

fw, fh = plt.rcParams["figure.figsize"]

from sklearn.datasets import fetch_openml from sklearn.decomposition import PCA

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, TSNE,

SpectralEmbedding

from sklearn.metrics import silhouette_score, calinski_harabaz_score from plot_repr import plot_repr

b) Load MNIST data from scikit learn. Pick n = 2000 exemple randomly to begin with.

When you will have identify the scalable data vizualization methods, you can go up to n = 60, 000.

1The code is adapted fromscikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html.

1

(2)

Xini, yini = fetch_openml('mnist_784', version=1, return_X_y=True) Xini = Xini / 255.

n = 2000

choice = permutation(Xini.shape[0])[:n]

X, y = Xini[choice,:], yini[choice]

c) Scale and normalize

Xc = np.float64(X - np.mean(X,0))

d) Choose the number of dimensions to project on

ndim = 2

After selecting the relevant ones, create a frame for the resulting metrics

2

.

res = pd.DataFrame(index=["Computation time", "Silhouette score", "Calinski-Harabaz score"])

Why these score are relevant ?

2.

3. PCA (principal component analysis)

a) Compute the PCA of the MNIST data in 3 different ways by using SVD, eig and sklearn

3

.

mt = Xc.mean(axis=0)

Xf = Xc[:,np.nonzero(mt)[0]]

mt = Xf.mean(axis=0) st = np.std(Xf,axis=0) Xn = (Xf - mt)/st

print(np.sum(Xn.mean(axis=0))) print(np.sum(np.std(Xn,axis=0)))

# PCA with SVD

u, s, vh = np.linalg.svd(Xf, full_matrices=False) print(u.shape, s.shape, vh.shape)

fig, ax = plot_repr(u[:,0:2]*s[0:2], y, title="PCA with SVD") plt.show()

# PCA with eig

lam, v = np.linalg.eig(Xf.T@Xf)

fig, ax = plot_repr(Xf@v[:,0:2], y, title="PCA with eig") plt.show()

# PCA with sklearn t = time.time()

Xr = PCA(n_components=ndim).fit_transform(Xc) # eigenvalues ???

t = time.time() - t

fig, ax = plot_repr(Xr, y, title="PCA with sklearn")

res.loc[:, ax.get_title()] = (t, silhouette_score(Xr, y), calinski_harabaz_score(Xr, y))

plt.show()

b) Linear Discriminant Analysis do use the label so it should be compered with other methods. In this comparizon framework, LDA play the role of a gold standart

4

sklearn.discriminant_analysis.LinearDiscriminantAnalysis

2http://scikit-learn.org/stable/modules/clustering.html

3http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

4http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.

LinearDiscriminantAnalysis.html

2

(3)

t = time.time()

Xr = LinearDiscriminantAnalysis(n_components=2).fit_transform(Xc, y) t = time.time() - t

fig, ax = plot_repr(Xr, y, title="LDA")

res.loc[:, ax.get_title()] = (t, silhouette_score(Xr, y), calinski_harabaz_score(Xr, plt.show()y))

4. try now with Multidimensionnal Scaling (MDS)

5

.

t = time.time()

Xr = MDS(n_components=ndim, metric=True).fit_transform(X) t = time.time() - t

fig, ax = plot_repr(Xr, y, title="MDS")

res.loc[:, ax.get_title()] = (t, silhouette_score(Xr, y), calinski_harabaz_score(Xr, y)) plt.show()

5. Isomap

6

t = time.time()

Xr = Isomap(n_components=ndim, n_neighbors=10).fit_transform(X) t = time.time() - t

fig, ax = plot_repr(Xr, y, title="Isomap")

res.loc[:, ax.get_title()] = (t, silhouette_score(Xr, y), calinski_harabaz_score(Xr, y)) plt.show()

6. Locally Linear Embedding

7

.

t = time.time()

Xr = LocallyLinearEmbedding(n_components=ndim, n_neighbors=10, reg=0.001).fit_transform(

t = time.time() - tX)

fig, ax = plot_repr(Xr, y, title="LLE")

res.loc[:, ax.get_title()] = (t, silhouette_score(Xr, y), calinski_harabaz_score(Xr, y)) plt.show()

7. t-Distributed Stochastic Neighbor Embedding (t-SNE)

8

https://github.com/oreillymedia/t-SNE-tutorial

t = time.time()

#Xr = TSNE(n_components=ndim, n_iter=1000, init='pca',verbose=0).fit_transform(X) Xr = TSNE(n_components=ndim, n_iter=500, verbose=0).fit_transform(X)

t = time.time() - t

fig, ax = plot_repr(Xr, y, title="t-SNE")

res.loc[:, ax.get_title()] = (t, silhouette_score(Xr, y), calinski_harabaz_score(Xr, y)) plt.show()

8. Spectral Embedding

9

.

t = time.time()

Xr = SpectralEmbedding(n_components=ndim, affinity="nearest_neighbors", gamma=None).

fit_transform(X) t = time.time() - t

fig, ax = plot_repr(Xr, y, title="SE")

res.loc[:, ax.get_title()] = (t, silhouette_score(Xr, y), calinski_harabaz_score(Xr, y)) plt.show()

5http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html

6http://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html

7http://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.

html

8http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

9http://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html

3

(4)

9. Uniform Manifold Approximation and Projection (UMAP)

10

you may need the install umap first by using

pip install umap-learn

import umap t = time.time()

embedding = umap.UMAP(n_neighbors=10, min_dist=0.001,

metric='correlation').fit_transform(X) t = time.time() - t

fig, ax = plot_repr(embedding, y, title="umap")

res.loc[:, ax.get_title()] = (t, silhouette_score(embedding, y), calinski_harabaz_score(

embedding, y)) plt.show()

10. Visualize and compare the results of all these methods. What are you conclusions about their quality and scalability ?

plt.close("all")

fig, ax = plt.subplots(3, figsize=(fw, 3 * fh)) for i in range(len(res.index)):

ax[i].bar(res.columns.tolist(), res.iloc[i], color=plt.cm.Set2(i)) ax[i].axhline(color="k", linewidth=plt.rcParams["axes.linewidth"]) ax[i].set_ylabel(res.index[i])

plt.show() res

11. Your turn :

a) What are the list of hyperparameter of each of these methods ? What ar their influence ?

b) How long does it take to compute the t-SNE projection over 20 000 images ? c) find out another dimensionality reduction method

d) According to your experience, which is the best method and why ? e) Try to compare these methods on the COIL-20 data set

11

.

import scipy as sp

coil = sp.io.loadmat('Donnees_coil_20') X = coil['X']

describe the data and build the labels. . .

nc = 20

yc = np.zeros(nc*72) for i in range(1,nc+1):

yc[(i-1)*72:i*72-1] = i-1;

(n,p) = Xn.shape print((n,p))

10https://umap-learn.readthedocs.io/en/latest/

11http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php

4

Références

Documents relatifs

Our run submis- sions for this task explore a number of techniques to achieve this: probabilistic models of workers (i.e. estimating which workers are bad, and discounting their

Otherwise I’m going to stay at home and watch television. ) $ Jack and I are going to the Theatre.. There is a show of Mr

Pour le faire fonctionner, vous êtes supposé avoir déjà installé CVX (que vous pourrez télécharger à cette adresse : http://cvxr.com/cvx/). Le code suivant est disponible en

c) Essayez avec TPOT, après l’avoir installé (attention aux problèmes de versions) pip install deap update_checker tqdm stopit. pip install scikit-mdr skrebate pip

DBSCAN is a density based clustering algorithm that can neatly handle noise (the clue is in the name). Clusters are considered zones that are

Choisisez un réseaux de neurones pour lequel vous voulez générer un exemple adversaire 2.. Générez les données avec lesquelles vous

Any 1-connected projective plane like manifold of dimension 4 is homeomor- phic to the complex projective plane by Freedman’s homeomorphism classification of simply connected

Good Canadian data on the epi- demiology of domestic violence in the general population are readily available (eg, the 1993 Statistics Canada survey of 12300 adult women 2 ), as