
(1)

Factorisation matricielle non-négative pour l'apprentissage par transfert non-supervisé
(Non-negative matrix factorization for unsupervised transfer learning)

Mid-term evaluation

Ievgen REDKO

(2)

What is Transfer Learning?

• Transfer learning

Given a source domain $D_S$ and a learning task $T_S$, and a target domain $D_T$ with a target task $T_T$, transfer learning aims to improve the learning performance in $D_T$ using the knowledge gained from $D_S$ and $T_S$, where $D_S \neq D_T$ and $T_S \neq T_T$.

• Subspace paradigm

Simultaneously cluster the data into multiple subspaces to find a lower-dimensional subspace fitting each group of points.

[Figure: a source domain and a target domain linked by an arrow labelled "Transfer Learning?"]


(3)

Nonnegative Matrix Factorization

What is NMF?
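Since NMF is the building block for everything that follows, here is a minimal, illustrative NumPy sketch of the standard model $X \approx FG$ with the classical multiplicative updates; the variable names and the Frobenius-norm objective are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    """Factor a non-negative matrix X (m x n) as X ~ F @ G with F (m x k) >= 0 and
    G (k x n) >= 0, using multiplicative updates for the Frobenius reconstruction error."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    F, G = rng.random((m, k)), rng.random((k, n))
    for _ in range(n_iter):
        G *= (F.T @ X) / (F.T @ F @ G + eps)   # update G with F fixed
        F *= (X @ G.T) / (F @ G @ G.T + eps)   # update F with G fixed
    return F, G

# Toy usage: factor a random non-negative 20 x 30 matrix into 5 components.
X = np.abs(np.random.default_rng(1).normal(size=(20, 30)))
F, G = nmf(X, k=5)
print(np.linalg.norm(X - F @ G))  # reconstruction error
```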

(4)

Goals

• Study NMF models under different constraints

• Use NMF methods for representation learning and apply them further to unsupervised transfer learning


(5)

Plan

• What has been done (Contributions)

Part 1: Improving NMF

Non-negative Matrix Factorization with Orthogonality Constraints

Non-negative Matrix Factorization with Schatten p-norms Regularization

Part 2: NMF for transfer learning

Random Subspace NMF for Unsupervised Transfer Learning

Bridge Convex NMF for Unsupervised Transfer Learning

(6)

Non-negative Matrix Factorization with Orthogonality Constraints


(7)

Uni-Orthogonal Non-negative Matrix Factorization

• Uni-Orthogonal NMF takes the following form:

$$X_+ \approx F_+ G_+, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{k \times n}, \quad \text{s.t. } F^T F = I \text{ or } G^T G = I$$

• Why UONMF?

• With orthogonality constraints imposed on F, we obtain a dictionary with distinct basis vectors.

• With orthogonality constraints imposed on G, we force each data point to belong to a single cluster, i.e. a hard clustering of the data (see the short argument below).
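A short worked argument (not on the slide, but standard) for why orthogonality together with non-negativity of G yields a hard clustering:

$$G \ge 0,\ G G^T = I \;\Longrightarrow\; \langle g_i, g_j \rangle = \sum_{l} G_{il} G_{jl} = 0 \text{ for } i \neq j \;\Longrightarrow\; G_{il} G_{jl} = 0 \ \forall\, l,$$

since every term of the sum is non-negative. Hence each column of $G$ has at most one non-zero entry: every data point is assigned to exactly one basis vector.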

(8)

Bi-Orthogonal Non-negative Matrix Factorization

• Bi-Orthogonal NMF takes the following form:

$$X_+ \approx F_+ S_+ G_+, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ S \in \mathbb{R}^{k \times k},\ G \in \mathbb{R}^{k \times n}, \quad \text{s.t. } F^T F = I \text{ and } G^T G = I$$

• Why BONMF?

• It can be seen as a co-clustering approach, where F is a clustering of the features and G is a clustering of the data points.

• It yields a unique matrix factorization.


(9)

Gram-Schmidt Orthogonal Non-Negative Matrix Factorization (1)

• Calculate a set of basis vectors using standard NMF:

• Apply the projection operator from the Gram-Schmidt process to F:

(10)

Gram-Schmidt Orthogonal Non-Negative Matrix Factorization (2)

• Apply Semi-NMF with the basis matrix fixed to the orthogonalized basis:

• Orthogonality is measured by the following expression, which increases until it reaches 1:
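A rough, hedged sketch of the two-step idea described on these two slides, with stand-ins for the missing details: scikit-learn's NMF provides the initial basis, QR plays the role of the Gram-Schmidt orthogonalization, and column-wise non-negative least squares stands in for the Semi-NMF step with the basis held fixed. Function and parameter names are illustrative.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.decomposition import NMF

def gram_schmidt_onmf(X, k):
    """Sketch: (1) standard NMF basis, (2) orthonormalize it, (3) refit a
    non-negative coefficient matrix G with the orthonormal basis fixed."""
    F = NMF(n_components=k, init="nndsvda", max_iter=400).fit_transform(X)  # X ~ F @ H
    Q, _ = np.linalg.qr(F)                     # orthonormal basis spanning col(F)
    G = np.column_stack([nnls(Q, X[:, j])[0] for j in range(X.shape[1])])
    return Q, G                                # Q^T Q = I exactly, G >= 0
```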


(11)

Weighted Orthogonal Non-Negative Matrix Factorization (1)

• Consider the following objective function:

• Add a weighting parameter and solve over F and G:

(12)

Weighted Orthogonal Non-Negative Matrix Factorization (2)

• Update rules for ONMF with matrix G can be derived in two ways:

Following [Ding et al., 2005]:

$$F \leftarrow F \odot \left(\frac{XG^T}{FGG^T}\right), \qquad G \leftarrow G \odot \left(\frac{F^T X}{F^T X G^T G}\right)$$

Following [Mirzal, 2010]:

$$F \leftarrow F \odot \left(\frac{XG^T}{FGG^T}\right), \qquad G \leftarrow G \odot \left(\frac{F^T X + G}{F^T F G + G G^T G}\right)$$
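For concreteness, a small NumPy sketch in the spirit of the [Mirzal, 2010]-style rules above, with the orthogonality penalty $\frac{\alpha}{2}\|GG^T - I\|_F^2$ added to the Frobenius objective; the weighting parameter $\alpha$ corresponds to the weight introduced on the previous slide, and the exact form used in the thesis may differ.

```python
import numpy as np

def weighted_onmf(X, k, alpha=1.0, n_iter=300, eps=1e-9):
    """Multiplicative updates for min ||X - FG||_F^2 + (alpha/2)*||G G^T - I||_F^2
    with F, G >= 0; alpha trades reconstruction against orthogonality of G."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    F, G = rng.random((m, k)), rng.random((k, n))
    for _ in range(n_iter):
        F *= (X @ G.T) / (F @ G @ G.T + eps)
        G *= (F.T @ X + alpha * G) / (F.T @ F @ G + alpha * G @ G.T @ G + eps)
    return F, G

X = np.abs(np.random.default_rng(1).normal(size=(50, 80)))
F, G = weighted_onmf(X, k=6, alpha=5.0)
print(np.linalg.norm(G @ G.T - np.eye(6)))  # deviation of G from orthogonality
```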


(13)

Weighted Orthogonal Non-Negative Matrix Factorization (3)

• Differences between the two types of update rules:

                      [Ding et al., 2005]              [Mirzal, 2010]
Optimization type     Constrained optimization         Unconstrained optimization
Assumptions           Off-diagonal elements of the     None
                      Lagrangian multiplier matrix
                      are assumed to be 0

• In our case, the update rules for matrix F take the following form:

(14)

Evaluation criteria

Three criteria are used: entropy, purity, and sparseness.
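For reference, commonly used forms of these three criteria (the exact definitions on the slide may differ in normalization) are, for clusters $c_1, \dots, c_k$, ground-truth classes $t_1, \dots, t_q$ and $n$ points:

$$\mathrm{Purity} = \frac{1}{n}\sum_{r=1}^{k} \max_{j} |c_r \cap t_j|, \qquad \mathrm{Entropy} = -\frac{1}{n \log q}\sum_{r=1}^{k} \sum_{j=1}^{q} |c_r \cap t_j| \log \frac{|c_r \cap t_j|}{|c_r|},$$

and, for a vector $x \in \mathbb{R}^d$ (Hoyer's sparseness measure, applied to the columns of the factor matrices),

$$\mathrm{sparseness}(x) = \frac{\sqrt{d} - \|x\|_1 / \|x\|_2}{\sqrt{d} - 1}.$$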


(15)

Purity values on different data sets

(16)

Entropy values on different data sets


(17)

Sparseness values on different data sets

(18)

Graphical examples


(19)

Conclusions

• "Hard" orthogonality is not as beneficial as "soft" orthogonality

• Very sparse features can give poor results in terms of quality

• The orthogonality level of the basis vectors that produces the best-quality results is usually greater than 1.

(20)

Non-negative Matrix Factorization with Schatten p-norms Regularization


(21)

Background

• Regularization methods are often used to:

• prevent the model from overfitting

• obtain sparse representations of the features of a given data set.

• The most commonly used regularizations are the $l_1$ and $l_2$ norms.

$$\min J(x_1, \dots, x_n) \;\rightarrow\; \min J(x_1, \dots, x_n) + \sum_{i=1}^{n} \psi(x_i)$$

(22)

l 1 , l 2 and Schatten p-norms

If p = 1 we obtain a nuclear norm:

If p = 2 we obtain a Frobenius norm:
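For reference, the Schatten p-norm of a matrix $X$ with singular values $\sigma_1, \sigma_2, \dots$ is

$$\|X\|_{S_p} = \Big(\sum_i \sigma_i(X)^p\Big)^{1/p},$$

so $p = 1$ gives the nuclear norm $\|X\|_* = \sum_i \sigma_i(X)$ and $p = 2$ gives the Frobenius norm $\|X\|_F = \big(\sum_i \sigma_i(X)^2\big)^{1/2} = \big(\sum_{i,j} X_{ij}^2\big)^{1/2}$.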


(23)

Regularized NMF

• Regularized NMF takes the following form:

• Update rules that monotonically decrease the objective function are:

(24)

Example for the $l_1$ and $l_2$ norms

• For the $l_1$ norm:

• For the $l_2$ norm:


(25)

Update rules for Schatten p-norms Regularized NMF (1)

• Our model takes the following form:

• Direct update rules involve an SVD:
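To make the role of the SVD concrete: evaluating (and differentiating) a Schatten p-norm requires the singular values of the factor, which is what makes the direct update rules expensive. A small illustrative NumPy check:

```python
import numpy as np

def schatten_norm(X, p):
    """Schatten p-norm of X: the l_p norm of its singular values
    (p=1 gives the nuclear norm, p=2 the Frobenius norm)."""
    s = np.linalg.svd(X, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

X = np.random.default_rng(0).normal(size=(40, 30))
print(schatten_norm(X, 2), np.linalg.norm(X, "fro"))  # these two agree
```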

(26)

Update rules for Schatten p-norms Regularized NMF (2)

• We modify the objective function:

• The update rules then take the following form:


(27)

Experimental results on image data sets

Data sets:

• Yale (165 instances of 15 people)

• ORL (400 images of 40 people, each of size 32×32)

(28)

Purity and entropy as a function of p for PIE and USPS data sets


(29)

Conclusions

• Regularization with Schatten p-norms can be more effective than regularization with the $l_1$ and $l_2$ norms

• For big data sets the effect of regularization is more apparent

• The quality of a data set is not crucial.

(30)

Random Subspace NMF for Unsupervised Transfer Learning


(31)

Preliminary knowledge

• Standard NMF:

$$X \approx FG^T, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{n \times k}$$

• Convex NMF (the column vectors of F lie within the column space of X):

$$X \approx XWG^T, \quad X \in \mathbb{R}^{m \times n},\ W \in \mathbb{R}^{n \times k},\ G \in \mathbb{R}^{n \times k}$$

• Multilayer NMF (a system with many layers, i.e. a cascade connection of L mixing subsystems):

$$X \approx F_1 F_2 \cdots F_L\, G, \quad X \in \mathbb{R}^{m \times n},\ G \in \mathbb{R}^{k \times n}$$

(32)

Our approach: RS-NMF

Find initial partition and prototype matrix of the target task

Build a sequence of partitions in different subspaces of a source task (“knowledge decomposition”)

Find k nearest neighbors among them with respect to a target partition

Find “link” matrices between them

Use these “link” matrices to perform a final factorization

[Pipeline diagram:]

1. Clustering of the target task: $X_T \approx X_T W_T G_T^T$
2. Knowledge decomposition in the source task: $\{X_S^i\}_{i=1}^{M}$ with partitions $\{G_i\}_{i=1}^{M}$
3. k nearest neighbours with respect to the target partition: $\{G_i\}_{i=1}^{k} = N_k(G_T)$
4. "Link" matrices between them: $\{W_i\}_{i=1}^{k}$
5. Final factorization using the "link" matrices: $X_T \approx P_T W_1 \cdots W_k\, G^{*T}$


(33)

Initialization

• Let us consider two tasks $T_S$ and $T_T$ defined by two matrices $X_S$ and $X_T$

• We perform Convex NMF on $X_T$:

$$X_T \approx X_T W_T G_T^T$$

$G_T$ is an initial partition.

$P_T = X_T W_T$ is a matrix of basis vectors that are linear combinations of the original data points.

(34)

Knowledge decomposition

• Randomly choose features of $X_S$ and perform any arbitrary type of NMF on the resulting sequence of reduced matrices $\{X_S^i\}_{i=1}^{M}$

• This yields a sequence of partition matrices $\{G_i\}_{i=1}^{M}$ computed on the subspaces of $X_S$.
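A schematic sketch of this knowledge-decomposition step (illustrative only: the slide allows "any arbitrary type of NMF", and standard scikit-learn NMF is used here as a stand-in; all names are assumptions):

```python
import numpy as np
from sklearn.decomposition import NMF

def knowledge_decomposition(X_s, k, n_subspaces=20, n_features=None, seed=0):
    """Draw random feature subsets (rows) of the source matrix X_s, run an NMF on
    each reduced matrix X_s^i and keep the resulting partition matrices G_i."""
    rng = np.random.default_rng(seed)
    m, n = X_s.shape
    n_features = n_features or m // 2
    partitions = []
    for _ in range(n_subspaces):
        rows = rng.choice(m, size=n_features, replace=False)   # random subspace
        model = NMF(n_components=k, init="nndsvda", max_iter=300).fit(X_s[rows, :])
        partitions.append(model.components_)                   # G_i has shape (k, n)
    return partitions
```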


(35)

Random Subspace NMF Purity values

(36)

Defining neighborhood

• Simply use any arbitrary similarity measure (any divergence measure, or just a simple correlation function) to find the k nearest neighbours of the target task's partition $G_T$: $\{G_i\}_{i=1}^{k} = N_k(G_T)$

• We use a simple correlation function given by the following expression:

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$


(37)

Learning “link” matrices

• At this step we take each of the chosen matrices and perform an NMF of the following form:

$$G_i \approx W_i G_i^{*}, \quad G_i \in \mathbb{R}^{k \times n},\ W_i \in \mathbb{R}^{k \times k},\ G_i^{*} \in \mathbb{R}^{k \times n}, \quad \forall\, i = 1, \dots, k.$$

• The idea behind constructing this sequence of "link" matrices $\{W_i\}_{i=1}^{k}$ is that they capture the relationships between clusters and thus reflect the structure of a data set.

(38)

Final decomposition

• Finally we have a sequence of matrices $P_T, \{W_i\}_{i=1}^{k}$

• Performing a Multilayer NMF of the following form gives the final partition $G_T^{*}$:

$$X_T \approx P_T W_1 \cdots W_k\, G^{*T}$$


(39)

Evaluation criteria

Dunn's index (k denotes the number of clusters, i and j are cluster labels, $d(c_i, c_j)$ is the between-cluster distance between clusters $X_i$ and $X_j$, and $d(X_l)$ is the within-cluster distance of $X_l$):

$$\mathrm{Dunn} = \min_{1 \le i \le k} \left\{ \min_{j \ne i} \left\{ \frac{d(c_i, c_j)}{\max_{1 \le l \le k} d(X_l)} \right\} \right\}$$

Calinski-Harabasz index ($S_B$ is the between-cluster scatter matrix, $S_W$ is the within-cluster scatter matrix, $n_p$ is the number of clustered samples and k is the number of clusters).
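For reference, the usual form of the Calinski-Harabasz index is

$$CH = \frac{\operatorname{tr}(S_B)/(k-1)}{\operatorname{tr}(S_W)/(n_p - k)},$$

where larger values indicate better-separated, more compact clusters.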

(40)

Dunn’s index

for transfer between different data sets


(41)

Calinski-Harabasz index

for transfer between different data sets

(42)

Sparse Matrix Factorization [B. Neyshabur and R. Panigrahy, 2013]

• For a given binary matrix Y, minimizing the total sparsity of the decomposition $Y = \mathrm{sign}(X_1\, \mathrm{sign}(X_2\, \mathrm{sign}(\dots X_n)))$ is equivalent to the computations in a deep neural network, where each $X_i$ corresponds to the i-th layer.

• Learning "link" matrices can be seen as learning non-negative encoders between the target and the chosen partitions (i.e. injecting auxiliary knowledge into the corresponding layer of a deep neural network).

Why does it work?


(43)

• Common assumption: transfer learning is useful only for closely related data sets [Rosenstein et al., 2004].

Is it really so?!

• Transfer learning using Kolmogorov complexity [M. M. Mahmud and S. R. Ray, 2007]

Why does it work on these data sets?

(44)

Bridge Convex NMF for Unsupervised Transfer Learning


(45)

Background

Transfer learning depends on correlation …

(46)

Background

… but it can also be used when the connection is very tenuous

• Measuring similarity using universal distances

• Defining the right amount of knowledge to be transferred

• Most general transfer experiments


(47)

Kernel NMF(K-NMF)

• Kernel NMF is a natural extension of C-NMF. It seeks the following decomposition:

$$K \approx K W_+ G_+^T, \quad K \in \mathbb{R}^{n \times n},\ W \in \mathbb{R}^{n \times k},\ G \in \mathbb{R}^{n \times k}$$

where K is the Gram matrix of some arbitrary kernel function k.

• Why K-NMF?

• Sometimes clustering based on similarities between objects gives better results.

(48)

Relatedness measures

Kernel Target Alignment

captures the similarity between two Gram matrices

Rényi entropy-based alignment


(49)

Algorithm

• We calculate Gram matrices $K_S$ and $K_T$ using some arbitrary kernel function for the matrices $X_S$ and $X_T$ (the source and target task data)

• The kernel target alignment gives us an idea of the initial proximity of the two tasks:

$$A(K_S, K_T) = \frac{\langle K_S, K_T \rangle_F}{\sqrt{\langle K_S, K_S \rangle_F\, \langle K_T, K_T \rangle_F}}$$

• We adapt the kernel function in order to increase the alignment.
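A minimal NumPy sketch of this alignment computation (assuming the two Gram matrices have the same size, e.g. they are computed on equally sized samples; all names are illustrative):

```python
import numpy as np

def alignment(K_s, K_t):
    """Kernel target alignment A(K_S, K_T) = <K_S,K_T>_F / sqrt(<K_S,K_S>_F <K_T,K_T>_F)."""
    num = np.sum(K_s * K_t)                                   # Frobenius inner product
    return num / np.sqrt(np.sum(K_s * K_s) * np.sum(K_t * K_t))

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(100, 5)), rng.normal(size=(100, 5))
Ks, Kt = Xs @ Xs.T, Xt @ Xt.T                                 # linear-kernel Gram matrices
print(alignment(Ks, Kt))
```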

(50)

Why does it work?

• What happens when we perform the alignment optimization?

If we consider the following entropic relatedness measure, then it is easy to show that $A(K_S, K_T)$ is equivalent to $R_I(X_S, X_T)$ with respect to the type of inner product used.

Moreover, it can be proved that:


(51)

Complexity

For a given matrix, the complexity depends on the desired number of clusters and on the number of iterations used for the kernel-alignment optimization on the source task.

(52)

Results for cross-domain transfer

• DBI before transfer – Davies-Bouldin index for the baseline (Convex NMF applied to the target task)

• DBI after transfer – Davies-Bouldin index for our approach, Bridge Convex NMF

• OVA – optimal level of alignment where the best result was obtained


(53)

Simple example

[Figures: Glass data set; C-NMF of the Iris data set vs. BC-NMF of the Iris data set]

(54)

Conclusions

• Transfer learning can be applied to tasks with a tenuous connection

• Competitive performance on UCI data sets

The big question we still need to answer:

So, if the cognitive distance between two tasks is defined as:

then what is the smallest cognitive distance between two tasks for which transfer learning can be applied?


(55)

Future work

• Apply a hyper-optimization technique for imposing priors on the factors of NMF in order to control orthogonality

• Apply a hyper-optimization technique for choosing the trade-off parameters in Schatten p-norms NMF

• Use Hessian Schatten p-norm regularization with NMF

• Extend the RS-NMF algorithm to multitask transfer learning

• Find a possible range of suboptimal values of k beforehand, depending on the minimum correlation level in RS-NMF

(56)

Thank you for your attention!

Feel free to ask questions if you have any.


(57)
(58)

Transfer Learning vs Traditional ML


(59)

Gram-Schmidt Process

• The Gram-Schmidt process works as follows:
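In its standard form, for linearly independent vectors $v_1, \dots, v_k$:

$$u_1 = v_1, \qquad u_j = v_j - \sum_{i=1}^{j-1} \operatorname{proj}_{u_i}(v_j), \qquad \operatorname{proj}_{u}(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u, \qquad e_j = \frac{u_j}{\|u_j\|},$$

which produces an orthonormal set $e_1, \dots, e_k$ spanning the same subspace.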

(60)

Complexity study

• The first model, which involves an SVD, has complexity:

$$O\big(t\,(m^2 k + k^2 n + m k^2 + n k^2 + k^3 + n^3 + knm)\big)$$

• The model without SVD has complexity:

$$O\big(t\,(m^2 k + k^2 m + n k^2 + n k^2 + knm)\big)$$

• The second model should be preferred when the data matrix is large in terms of the number of instances, since it avoids the $n^3$ term coming from the SVD.


(61)

Kernels and Gram matrices

A kernel is a function k:

$$k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}, \quad (x, x') \mapsto k(x, x')$$

satisfying

$$\forall (x, x') \in \mathcal{X} \times \mathcal{X}, \quad k(x, x') = \langle \Phi(x), \Phi(x') \rangle,$$

where $\Phi$ maps into some dot-product space $H$, sometimes called the feature space.

The Gram matrix of a kernel function k with respect to a set of points $x_1, \dots, x_n$ is the $n \times n$ matrix $K$ with entries $K_{ij} = k(x_i, x_j)$.

(62)

Kernel functions

• Different similarity measures can be used as kernel functions. For instance:

Linear kernel: $k(x, x') = x^T x' + c$

Polynomial kernel: $k(x, x') = (a\, x^T x' + c)^d$

Gaussian kernel: $k(x, x') = \exp\left(-\dfrac{\|x - x'\|^2}{2\sigma^2}\right)$

… etc.
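As a small illustration, a NumPy sketch that builds the Gram matrix of the Gaussian kernel (function and parameter names are illustrative):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K with K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)),
    where the rows of X are the data points x_i."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

K = gaussian_gram(np.random.default_rng(0).normal(size=(10, 3)), sigma=2.0)
print(K.shape, np.allclose(K, K.T))   # (10, 10) True
```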

Apprentissage par factorisation matricielle (EPAT’14 – Carry-Le-Rouet 7-12 juin 2014)

