Factorisation matricielle non-négative pour l'apprentissage par transfert non-supervisé
Mid-term evaluation
Ievgen REDKO
What is Transfer Learning?
Transfer learning
Given a source domain $D_S$ and a learning task $T_S$, a target domain $D_T$ and a target task $T_T$, transfer learning aims to help improve the learning performance in $D_T$ using knowledge gained from $D_S$ and $T_S$, where $D_S \neq D_T$ and $T_S \neq T_T$.
Subspace paradigm
Simultaneously cluster the data into multiple subspaces to find a lower-dimensional subspace fitting each group of points.
[Figure: source domain and target domain linked by an arrow labeled "Transfer Learning?"]
Nonnegative Matrix Factorization
What is NMF?
NMF approximates a nonnegative data matrix by a product of nonnegative low-rank factors, which yields parts-based, easily interpretable representations.
Goals
Study NMF models under different constraints
Use NMF methods for representation learning and further apply them to unsupervised transfer learning
Outline
Work completed (Contributions)
Part 1: Improving NMF
Non-negative Matrix Factorization with Orthogonality Constraints
Non-negative Matrix Factorization with Schatten p-norms Regularization
Part 2: NMF for transfer learning
Random Subspace NMF for Unsupervised Transfer Learning
Bridge Convex NMF for Unsupervised Transfer Learning
Non-negative Matrix Factorization with Orthogonality Constraints
Uni-Orthogonal Non-negative Matrix Factorization
Uni-Orthogonal NMF (UONMF) takes the following form:
$$X_+ \approx F_+ G_+, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{k \times n}, \quad \text{s.t. } F^T F = I \text{ or } G^T G = I$$
Why UONMF?
• With orthogonality constraints imposed on F, we obtain a dictionary with distinct basis vectors.
• With orthogonality constraints imposed on G, we force each data point to belong to a single cluster (hard clustering).
Bi-Orthogonal Non-negative Matrix Factorization
Bi-Orthogonal NMF (BONMF) takes the following form:
$$X_+ \approx F_+ S_+ G_+, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ S \in \mathbb{R}^{k \times k},\ G \in \mathbb{R}^{k \times n}, \quad \text{s.t. } F^T F = I \text{ and } G^T G = I$$
Why BONMF?
• Can be seen as a co-clustering approach where F is a clustering of features and G is a clustering of data.
• Yields a unique matrix factorization.
Gram-Schmidt Orthogonal Non-Negative Matrix Factorization (1)
Calculate a set of basis vectors arising from Standard NMF:
Apply the projection operator from the Gram-Schmidt process, multiplied by a coefficient, to F:
Gram-Schmidt Orthogonal Non-Negative Matrix Factorization (2)
Apply Semi-NMF with the basis matrix fixed to the orthogonalized basis:
Orthogonality is measured by an expression that increases until it reaches 1.
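To make the three-step procedure concrete, here is a minimal numpy sketch, assuming Lee-Seung multiplicative updates for the initial NMF, a QR factorization for the Gram-Schmidt orthogonalization, and the Semi-NMF update of [Ding et al.] for re-fitting G with the basis fixed; all names and the stopping rule are illustrative, not the thesis implementation.

```python
import numpy as np

def gram_schmidt_onmf(X, k, n_iter=200, eps=1e-9, seed=None):
    """Sketch of the three-step idea: NMF -> Gram-Schmidt -> Semi-NMF refit."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((k, n)) + eps

    # Step 1: standard NMF with multiplicative updates.
    for _ in range(n_iter):
        F *= (X @ G.T) / (F @ G @ G.T + eps)
        G *= (F.T @ X) / (F.T @ F @ G + eps)

    # Step 2: orthonormalize the basis vectors (QR performs a Gram-Schmidt
    # orthogonalization of the columns of F).
    F_orth, _ = np.linalg.qr(F)

    # Step 3: Semi-NMF update of G with the orthogonal basis kept fixed
    # (F_orth may contain negative entries, G stays non-negative).
    pos = lambda A: (np.abs(A) + A) / 2
    neg = lambda A: (np.abs(A) - A) / 2
    FtX, FtF = F_orth.T @ X, F_orth.T @ F_orth
    for _ in range(n_iter):
        G *= np.sqrt((pos(FtX) + neg(FtF) @ G) /
                     (neg(FtX) + pos(FtF) @ G + eps))
    return F_orth, G
```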
Weighted Orthogonal Non-Negative Matrix Factorization (1)
Consider the following objective function:
Add a weighting parameter and solve over F and G:
Weighted Orthogonal Non-Negative Matrix Factorization (2)
Update rules for ONMF with matrix G can be derived in two ways:
Following [Ding et al., 2005]:
$$F \leftarrow F \odot \frac{XG^T}{FGG^T}, \qquad G \leftarrow G \odot \frac{F^T X}{F^T X\, G^T G}$$
Following [Mirzal, 2010]:
$$F \leftarrow F \odot \frac{XG^T}{FGG^T}, \qquad G \leftarrow G \odot \frac{F^T X + G}{F^T F G + G G^T G}$$
(⊙ and the fractions denote element-wise operations.)
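As an illustration, a small numpy sketch of the two G updates above (element-wise throughout); this is only one plausible transcription of the rules, with random initialization and a fixed iteration count chosen for simplicity.

```python
import numpy as np

def onmf(X, k, variant="ding", n_iter=300, eps=1e-9, seed=None):
    """Multiplicative updates for ONMF with orthogonality imposed on G."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((k, n)) + eps
    for _ in range(n_iter):
        F *= (X @ G.T) / (F @ G @ G.T + eps)
        if variant == "ding":       # rule attributed above to [Ding et al., 2005]
            G *= (F.T @ X) / (F.T @ X @ G.T @ G + eps)
        else:                       # rule attributed above to [Mirzal, 2010]
            G *= (F.T @ X + G) / (F.T @ F @ G + G @ G.T @ G + eps)
    return F, G
```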
Weighted Orthogonal Non-Negative Matrix Factorization (3)
Differences between the two types of update rules:
• Optimization type: constrained optimization in [Ding et al., 2005] vs. unconstrained optimization in [Mirzal, 2010].
• Assumptions: [Ding et al., 2005] assumes the off-diagonal elements of the Lagrange multiplier matrix are equal to 0; [Mirzal, 2010] makes no such assumption.
In our case the update rules for matrix F take the following form:
Evaluation criteria
The quality of the factorizations is evaluated with three criteria: entropy, purity and sparseness.
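For reference, a short numpy sketch of purity and (normalized) entropy under their standard definitions; the exact formulas used in the experiments are assumed to be the usual ones.

```python
import numpy as np

def purity_entropy(labels_true, labels_pred):
    """Standard clustering purity and normalized entropy."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    n = labels_true.size
    classes = np.unique(labels_true)
    purity, entropy = 0.0, 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        counts = np.array([(members == t).sum() for t in classes])
        purity += counts.max() / n
        p = counts[counts > 0] / counts.sum()
        entropy += (counts.sum() / n) * (-(p * np.log2(p)).sum())
    if len(classes) > 1:
        entropy /= np.log2(len(classes))     # normalize to [0, 1]
    return purity, entropy
```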
Purity values on different data sets
Entropy values on different data sets
Sparseness values on different data sets
Graphical examples
Conclusions
“Hard” orthogonality is not as beneficial as the “soft” one
Very sparse features can give poor results in terms of quality
The orthogonality level of the basis vectors that produces the best quality results is usually greater than 1.
Non-negative Matrix Factorization with Schatten p-norms Regularization
Background
Regularization methods are often used to:
prevent model overfitting
obtain sparse representations of features of a given data set.
The most widely used regularizations are the $\ell_1$ and $\ell_2$ norms.
$$\min J(x_1, \dots, x_n) \;\rightarrow\; \min J(x_1, \dots, x_n) + \sum_{i=1}^{n} \psi(x_i)$$
$\ell_1$, $\ell_2$ and Schatten p-norms
The Schatten p-norm of a matrix is the $\ell_p$ norm of the vector of its singular values, $\|A\|_{S_p} = \left(\sum_i \sigma_i^p\right)^{1/p}$.
If p = 1 we obtain the nuclear norm; if p = 2 we obtain the Frobenius norm.
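A small numpy illustration of these special cases, computing the Schatten p-norm from the singular values (the example matrix is arbitrary):

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm: the l_p norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(schatten_norm(A, 1))   # nuclear norm
print(schatten_norm(A, 2))   # Frobenius norm, equals np.linalg.norm(A, "fro")
```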
Regularized NMF
Regularized NMF takes the following form:
Update rules that monotonically decrease the objective function are:
Example for the $\ell_1$ and $\ell_2$ norms
For the $\ell_1$ norm:
For the $\ell_2$ norm:
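A minimal numpy sketch of $\ell_1$- and $\ell_2$-regularized NMF, assuming the usual penalized least-squares objective and the standard multiplicative updates (only the denominator of the G update changes); the exact rules used in the thesis may differ.

```python
import numpy as np

def regularized_nmf(X, k, penalty="l1", lam=0.1, n_iter=300, eps=1e-9, seed=None):
    """NMF with an l1 or l2 penalty on G, standard multiplicative form."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((k, n)) + eps
    for _ in range(n_iter):
        F *= (X @ G.T) / (F @ G @ G.T + eps)
        if penalty == "l1":
            G *= (F.T @ X) / (F.T @ F @ G + lam + eps)       # l1 adds a constant
        else:
            G *= (F.T @ X) / (F.T @ F @ G + lam * G + eps)   # l2 adds lam * G
    return F, G
```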
Update rules for Schatten p-norms Regularized NMF (1)
Our model takes the following form:
Direct update rules involve SVD:
Update rules for Schatten p-norms Regularized NMF (2)
Change our objective function:
Update rules take the following form:
Experimental results on image data sets
Data sets:
• Yale (165 instances of 15 people)
• ORL (400 images of 40 people, of size 32×32)
Purity and entropy as a function of p for PIE and USPS data sets
Conclusions
Regularization with Schatten p-norms can be more effective than regularization with the $\ell_1$ and $\ell_2$ norms
For large data sets the effect of regularization is clearer
The quality of a data set is not crucial.
Random Subspace NMF for Unsupervised Transfer Learning
Preliminary knowledge
Standard NMF:
$$X \approx FG^T, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{n \times k}$$
Convex NMF (the column vectors of F lie within the column space of X):
$$X \approx XWG^T, \quad X \in \mathbb{R}^{m \times n},\ W \in \mathbb{R}^{n \times k},\ G \in \mathbb{R}^{n \times k}$$
Multilayer NMF (we build up a system with many layers, i.e. a cascade connection of L mixing subsystems):
$$X \approx F_1 F_2 \cdots F_L G, \quad X \in \mathbb{R}^{m \times n},\ F_1 \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{k \times n}$$
Our approach: RS-NMF
Find initial partition and prototype matrix of the target task
Build a sequence of partitions in different subspaces of a source task (“knowledge decomposition”)
Find the k nearest neighbors among them with respect to the target partition
Find “link” matrices between them
Use these “link” matrices to perform a final factorization
[Diagram: clustering of the target task, $X_T \approx X_T W_T G_T^T$; knowledge decomposition in the source task, $\{X_S^i\}_{i=1}^M \rightarrow \{G_i\}_{i=1}^M$; selection of the k nearest neighbors of the target partition, $\{G_i\}_{i=1}^k = N_k(G_T)$; the "link" matrices $P_T, \{W_i\}_{i=1}^k$ between them; final factorization using the "link" matrices, $X_T \approx P_T W_1 \cdots W_k G_T^*$]
Initialization
Let us consider two tasks $T_S$ and $T_T$ defined by two matrices $X_S$ and $X_T$.
We perform Convex NMF on $X_T$:
$$X_T \approx X_T W_T G_T^T$$
$G_T$ is the initial partition.
$P_T = X_T W_T$ is a matrix of basis vectors that are linear combinations of the original data points.
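An illustrative numpy sketch of this initialization, using the Convex NMF updates of [Ding et al.] simplified under the assumption that $X_T$ is nonnegative (so the positive/negative splits of $X^T X$ vanish); initialization is random here rather than the k-means-based scheme usually recommended.

```python
import numpy as np

def convex_nmf(X, k, n_iter=300, eps=1e-9, seed=None):
    """Convex NMF: X ~ X W G^T with W, G >= 0 (simplified updates for X >= 0)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k)) + eps
    G = rng.random((n, k)) + eps
    A = X.T @ X                              # nonnegative here
    for _ in range(n_iter):
        G *= np.sqrt((A @ W) / (G @ W.T @ A @ W + eps))
        W *= np.sqrt((A @ G) / (A @ W @ G.T @ G + eps))
    return W, G                              # basis P = X @ W, partition G
```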
Knowledge decomposition
Randomly choose features of $X_S$ and perform an arbitrary type of NMF on the sequence of reduced matrices
Obtain a sequence of partition matrices that were calculated on the subspaces of X.
$$\{X_S^i\}_{i=1}^M \;\rightarrow\; \{G_i\}_{i=1}^M$$
Random Subspace NMF: purity values
Defining neighborhood
Simply use an arbitrary similarity measure (any divergence measure or just a simple correlation function) to find the k nearest neighbors of the target task's partition $G_T$.
We use a simple correlation function given by the following expression:
$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad \{G_i\}_{i=1}^k = N_k(G_T)$$
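A possible numpy sketch of this selection step. The slides do not specify how partitions computed on data sets of different sizes are compared; here each partition is summarized by its $k \times k$ matrix $GG^T$ before computing correlations, which is only one plausible choice.

```python
import numpy as np

def select_neighbors(G_T, G_list, k):
    """Return the k source partitions most correlated with the target partition."""
    summary = lambda G: (G @ G.T).ravel()          # shape-independent k*k summary
    s_t = summary(G_T)
    scores = [np.corrcoef(s_t, summary(G))[0, 1] for G in G_list]
    order = np.argsort(scores)[::-1]               # most correlated first
    return [G_list[i] for i in order[:k]]
```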
Learning “link” matrices
At this step we take each of the chosen matrices and perform an NMF of the following form:
$$G_i \approx W_i G_i^*, \quad G_i \in \mathbb{R}^{k \times n},\ W_i \in \mathbb{R}^{k \times k},\ G_i^* \in \mathbb{R}^{k \times n}, \quad \forall i = 1, \dots, k.$$
The idea behind constructing this sequence of "link" matrices $\{W_i\}_{i=1}^k$ is that they capture the relationships between clusters and thus reflect the structure of a data set.
Final decomposition
Finally we have a sequence of matrices $P_T, \{W_i\}_{i=1}^k$.
Performing a Multilayer NMF of the following form gives the final partition $G_T^*$:
$$X_T \approx P_T W_1 \cdots W_k G_T^*$$
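A sketch of this last step, assuming $P_T$ and the "link" matrices are kept fixed (all nonnegative) and only $G_T^*$ is re-estimated with a standard multiplicative update for the fixed, composed basis:

```python
import numpy as np

def final_factorization(X_T, P_T, W_list, n_iter=300, eps=1e-9, seed=None):
    """Estimate G* in X_T ~ P_T W_1 ... W_k G* with the basis fixed."""
    rng = np.random.default_rng(seed)
    B = P_T.copy()
    for W in W_list:                          # compose the multilayer basis
        B = B @ W
    G = rng.random((B.shape[1], X_T.shape[1])) + eps
    for _ in range(n_iter):
        G *= (B.T @ X_T) / (B.T @ B @ G + eps)
    return G
```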
Evaluation criteria
Dunn’s index (k denotes the number of clusters, i and j are cluster labels, d(c
i, c
j) defines the between-cluster distance between clusters X
iand X
j; d(X
k) represents the within-cluster of X
k.
Calinski-Harabasz index (S
Bis a between-cluster scatter matrix, S
Wis the internal scatter matrix, n
pis a number of clustered samples and k is a number of clusters.)
Dunn = min
1£i£k min d (c
i, c
j)
max 1£k£k (d (X
k)) ì í
î
ü ý þ ì í
ï îï
ü ý
ï
þï
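For illustration, a numpy sketch of Dunn's index under one common choice of distances (centroid distance between clusters, maximal pairwise distance within a cluster); the thesis may use different definitions of $d(c_i, c_j)$ and $d(X_l)$.

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn's index: smallest between-cluster distance over largest cluster diameter."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    diam = 0.0                                   # largest within-cluster diameter
    for c in clusters:
        P = X[labels == c]
        if len(P) > 1:
            d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
            diam = max(diam, d.max())
    dc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    between = dc[~np.eye(len(clusters), dtype=bool)].min()
    return between / diam
```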
Dunn’s index
for transfer between different data sets
Calinski-Harabasz index for transfer between different data sets
Why does it work?
Sparse Matrix Factorization [B. Neyshabur and R. Panigrahy, 2013]
For a given binary matrix Y, minimizing the total sparsity of the decomposition $Y = \mathrm{sign}(X_1\, \mathrm{sign}(X_2\, \mathrm{sign}(\cdots X_n)))$ is equivalent to the computations performed by a deep neural network, where each $X_i$ corresponds to the i-th layer.
Learning "link" matrices can be seen as learning nonnegative encoders between the target and the chosen partitions (i.e., injecting auxiliary knowledge into the corresponding layer of a deep neural network).
Why does it work on these data sets?
Common assumption: transfer learning is useful only for closely related data sets [Rosenstein et al., 2004].
Is it really so?!
Transfer learning using Kolmogorov complexity [M. M. Mahmud and S. R. Ray]
Bridge Convex NMF for Unsupervised Transfer Learning
Background
Transfer learning depends on correlation …
Background
… but it can also be used when the connection is very tenuous
• Measuring similarity using universal distances
• Defining the right amount of knowledge to be transferred
• Most general transfer experiments
Kernel NMF (K-NMF)
Kernel NMF is a natural extension of C-NMF. It seeks the following decomposition:
$$K \approx K W_+ G_+^T, \quad K \in \mathbb{R}^{n \times n},\ W \in \mathbb{R}^{n \times k},\ G \in \mathbb{R}^{n \times k},$$
where K is the Gram matrix of some arbitrary kernel function k.
Why K-NMF?
Sometimes clustering based on similarities between objects gives better results.
Relatedness measures
• Kernel Target Alignment: captures the similarity between two Gram matrices
• Rényi entropy based alignment
Algorithm
We calculate Gram matrices $K_S$ and $K_T$ using some arbitrary kernel function for the matrices $X_S$ and $X_T$ (source and target task data).
Kernel target alignment gives us an idea about the initial proximity of two tasks:
We adapt the kernel function in order to increase the alignment:
$$A(K_S, K_T) = \frac{\langle K_S, K_T \rangle_F}{\sqrt{\langle K_S, K_S \rangle_F \, \langle K_T, K_T \rangle_F}}$$
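A one-function numpy sketch of this alignment; it assumes the two Gram matrices have already been brought to the same size (e.g. computed on samples of equal size), a detail the slides do not spell out.

```python
import numpy as np

def kernel_target_alignment(K1, K2):
    """Alignment between two same-sized Gram matrices."""
    num = np.sum(K1 * K2)                                  # Frobenius inner product
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den
```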
Why does it work?
What happens when we perform alignment optimization?
If we consider the following entropic relatedness measure
then it is easy to show that $A(K_S, K_T)$ is equivalent to $R_I(X_S, X_T)$ with respect to the type of inner product used.
Moreover, it can be proved that:
Complexity
For a given matrix the complexity depends on:
• the desired number of clusters
• the number of iterations used for kernel alignment optimization
Results for cross-domain transfer
• DBI before transfer – Davies-Bouldin index for the baseline (Convex NMF applied to the target task)
• DBI after transfer – Davies-Bouldin index for our approach, Bridge Convex NMF
• OVA – optimal level of alignment where the best result was obtained
Simple example
[Figures: Glass data set; C-NMF of the Iris data set vs. BC-NMF of the Iris data set]
Conclusions
Transfer learning can be applied to tasks with tenuous connection
Competitive performance on UCI datasets
Big question we still need to answer:
So if the cognitive distance between two tasks is defined as
then what is the smallest cognitive distance between two tasks when transfer learning can be applied?
Future works
• Apply a hyper-optimization technique for imposing priors on the factors of NMF in order to control orthogonality
• Apply a hyper-optimization technique for choosing the trade-off parameters in Schatten p-norms NMF
• Use Hessian Schatten p-norm regularization with NMF
• Extend the RS-NMF algorithm to multitask transfer learning
• Find a possible range of suboptimal values of k beforehand, depending on the minimum correlation level in RS-NMF
Thank you for your attention!
Feel free to ask questions if you have any.
Transfer Learning vs Traditional ML
Gram-Schmidt Process
The Gram-Schmidt process works as follows: given vectors $v_1, \dots, v_k$, set $u_1 = v_1$ and $u_j = v_j - \sum_{i < j} \mathrm{proj}_{u_i}(v_j)$, where $\mathrm{proj}_u(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u$.
Complexity study
First model, which involves an SVD: $O(t(m^2 k + k^2 n + m k^2 + n k^2 + k^3 + n^3 + knm))$
Model without SVD: $O(t(m^2 k + k^2 m + n k^2 + n k^2 + knm))$
The second model should be applied if the data matrix is rather large in terms of the number of instances.
Kernels and Gram matrices
A kernel is a function k:
$$k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}, \quad (x, x') \mapsto k(x, x'),$$
satisfying
$$\forall (x, x') \in \mathcal{X} \times \mathcal{X}, \quad k(x, x') = \langle \Phi(x), \Phi(x') \rangle,$$
where $\Phi$ maps into some dot product space H, sometimes called the feature space.
The Gram matrix of a kernel function k w.r.t. a set of points $x_1, \dots, x_n$ is the matrix with entries $K_{ij} = k(x_i, x_j)$.
Kernel functions
Different similarity measures can be used as kernel functions. For instance:
• Linear kernel: $k(x, x') = x^T x' + c$
• Polynomial kernel: $k(x, x') = (a\, x^T x' + c)^d$
• Gaussian kernel: $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
• … etc.
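For instance, the Gram matrix of the Gaussian kernel can be computed as in this short numpy sketch (one sample per row of X):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K with K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T    # pairwise squared distances
    return np.exp(-d2 / (2 * sigma ** 2))
```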
Apprentissage par factorisation matricielle (EPAT’14 – Carry-Le-Rouet 7-12 juin 2014)