Factorisation matricielle non-négative pour l'apprentissage par transfert non-supervisé
Mid-term evaluation
Ievgen REDKO
What is Transfer Learning?
Transfer learning
Given a source domain $D_S$ and a learning task $T_S$, a target domain $D_T$ and a target task $T_T$, transfer learning aims to help improve the learning performance in $D_T$ using knowledge gained from $D_S$ and $T_S$, where $D_S \neq D_T$ and $T_S \neq T_T$.
Subspace paradigm
Simultaneously cluster the data into multiple subspaces to find a lower-dimensional subspace fitting each group of points.
[Figure: source domain and target domain linked by an arrow labeled "Transfer Learning?"]
Nonnegative Matrix Factorization
What is NMF?
NMF approximates a nonnegative data matrix by a product of nonnegative low-rank factors, which yields parts-based, easily interpretable representations.
Goals
Study NMF models under different constraints
Use NMF methods for representation learning and further apply them to unsupervised transfer learning
Outline
Work completed (Contributions)
Part 1: Improving NMF
Non-negative Matrix Factorization with Orthogonality Constraints
Non-negative Matrix Factorization with Schatten p-norms Regularization
Part 2: NMF for transfer learning
Random Subspace NMF for Unsupervised Transfer Learning
Bridge Convex NMF for Unsupervised Transfer Learning
Non-negative Matrix Factorization with Orthogonality Constraints
Uni-Orthogonal Non-negative Matrix Factorization
Uni-Orthogonal NMF (UONMF) takes the following form:
$$X_+ \approx F_+ G_+, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{k \times n}, \quad \text{s.t. } F^T F = I \text{ or } G^T G = I$$
Why UONMF?
• With orthogonality constraints imposed on F, we obtain a dictionary with distinct basis vectors.
• With orthogonality constraints imposed on G, we force each data point to belong to a single cluster (hard clustering).
Bi-Orthogonal Non-negative Matrix Factorization
Bi-Orthogonal NMF (BONMF) takes the following form:
$$X_+ \approx F_+ S_+ G_+, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ S \in \mathbb{R}^{k \times k},\ G \in \mathbb{R}^{k \times n}, \quad \text{s.t. } F^T F = I \text{ and } G^T G = I$$
Why BONMF?
• Can be seen as a co-clustering approach where F is a clustering of features and G is a clustering of data.
• Yields a unique matrix factorization.
Gram-Schmidt Orthogonal Non-Negative Matrix Factorization (1)
Calculate a set of basis vectors arising from Standard NMF:
Apply the projection operator from the Gram-Schmidt process, multiplied by a coefficient, to F:
Gram-Schmidt Orthogonal Non-Negative Matrix Factorization (2)
Apply Semi-NMF with the basis matrix fixed to the orthogonalized basis:
Orthogonality is measured by an expression that increases until it reaches 1.
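To make the three-step procedure concrete, here is a minimal numpy sketch, assuming Lee-Seung multiplicative updates for the initial NMF, a QR factorization for the Gram-Schmidt orthogonalization, and the Semi-NMF update of [Ding et al.] for re-fitting G with the basis fixed; all names and the stopping rule are illustrative, not the thesis implementation.

```python
import numpy as np

def gram_schmidt_onmf(X, k, n_iter=200, eps=1e-9, seed=None):
    """Sketch of the three-step idea: NMF -> Gram-Schmidt -> Semi-NMF refit."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((k, n)) + eps

    # Step 1: standard NMF with multiplicative updates.
    for _ in range(n_iter):
        F *= (X @ G.T) / (F @ G @ G.T + eps)
        G *= (F.T @ X) / (F.T @ F @ G + eps)

    # Step 2: orthonormalize the basis vectors (QR performs a Gram-Schmidt
    # orthogonalization of the columns of F).
    F_orth, _ = np.linalg.qr(F)

    # Step 3: Semi-NMF update of G with the orthogonal basis kept fixed
    # (F_orth may contain negative entries, G stays non-negative).
    pos = lambda A: (np.abs(A) + A) / 2
    neg = lambda A: (np.abs(A) - A) / 2
    FtX, FtF = F_orth.T @ X, F_orth.T @ F_orth
    for _ in range(n_iter):
        G *= np.sqrt((pos(FtX) + neg(FtF) @ G) /
                     (neg(FtX) + pos(FtF) @ G + eps))
    return F_orth, G
```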
Weighted Orthogonal Non-Negative Matrix Factorization (1)
Consider the following objective function:
Add a weighting parameter and solve over F and G:
Weighted Orthogonal Non-Negative Matrix Factorization (2)
Update rules for ONMF with matrix G can be derived in two ways:
Following [Ding et al., 2005]:
$$F \leftarrow F \odot \frac{XG^T}{FGG^T}, \qquad G \leftarrow G \odot \frac{F^T X}{F^T X\, G^T G}$$
Following [Mirzal, 2010]:
$$F \leftarrow F \odot \frac{XG^T}{FGG^T}, \qquad G \leftarrow G \odot \frac{F^T X + G}{F^T F G + G G^T G}$$
(⊙ and the fractions denote element-wise operations.)
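As an illustration, a small numpy sketch of the two G updates above (element-wise throughout); this is only one plausible transcription of the rules, with random initialization and a fixed iteration count chosen for simplicity.

```python
import numpy as np

def onmf(X, k, variant="ding", n_iter=300, eps=1e-9, seed=None):
    """Multiplicative updates for ONMF with orthogonality imposed on G."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((k, n)) + eps
    for _ in range(n_iter):
        F *= (X @ G.T) / (F @ G @ G.T + eps)
        if variant == "ding":       # rule attributed above to [Ding et al., 2005]
            G *= (F.T @ X) / (F.T @ X @ G.T @ G + eps)
        else:                       # rule attributed above to [Mirzal, 2010]
            G *= (F.T @ X + G) / (F.T @ F @ G + G @ G.T @ G + eps)
    return F, G
```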
Weighted Orthogonal Non-Negative Matrix Factorization (3)
Differences between the two types of update rules:
• Optimization type: constrained optimization in [Ding et al., 2005] vs. unconstrained optimization in [Mirzal, 2010].
• Assumptions: [Ding et al., 2005] assumes the off-diagonal elements of the Lagrange multiplier matrix are equal to 0; [Mirzal, 2010] makes no such assumption.
In our case the update rules for matrix F take the following form:
Evaluation criteria
The quality of the factorizations is evaluated with three criteria: entropy, purity and sparseness.
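For reference, a short numpy sketch of purity and (normalized) entropy under their standard definitions; the exact formulas used in the experiments are assumed to be the usual ones.

```python
import numpy as np

def purity_entropy(labels_true, labels_pred):
    """Standard clustering purity and normalized entropy."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    n = labels_true.size
    classes = np.unique(labels_true)
    purity, entropy = 0.0, 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        counts = np.array([(members == t).sum() for t in classes])
        purity += counts.max() / n
        p = counts[counts > 0] / counts.sum()
        entropy += (counts.sum() / n) * (-(p * np.log2(p)).sum())
    if len(classes) > 1:
        entropy /= np.log2(len(classes))     # normalize to [0, 1]
    return purity, entropy
```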
Purity values on different data sets
Entropy values on different data sets
Sparseness values on different data sets
Graphical examples
Conclusions
“Hard” orthogonality is not as beneficial as the “soft” one
Very sparse features can give poor results in terms of quality
The orthogonality level of the basis vectors that produces the best quality results is usually greater than 1.
Non-negative Matrix Factorization with Schatten p-norms Regularization
Background
Regularization methods are often used to:
prevent model overfitting
obtain sparse representations of features of a given data set.
The most widely used regularizations are the $\ell_1$ and $\ell_2$ norms.
$$\min J(x_1, \dots, x_n) \;\rightarrow\; \min J(x_1, \dots, x_n) + \sum_{i=1}^{n} \psi(x_i)$$
$\ell_1$, $\ell_2$ and Schatten p-norms
The Schatten p-norm of a matrix is the $\ell_p$ norm of the vector of its singular values, $\|A\|_{S_p} = \left(\sum_i \sigma_i^p\right)^{1/p}$.
If p = 1 we obtain the nuclear norm; if p = 2 we obtain the Frobenius norm.
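A small numpy illustration of these special cases, computing the Schatten p-norm from the singular values (the example matrix is arbitrary):

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm: the l_p norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(schatten_norm(A, 1))   # nuclear norm
print(schatten_norm(A, 2))   # Frobenius norm, equals np.linalg.norm(A, "fro")
```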
Regularized NMF
Regularized NMF takes the following form:
Update rules that monotonically decrease the objective function are:
Example for the $\ell_1$ and $\ell_2$ norms
For the $\ell_1$ norm:
For the $\ell_2$ norm:
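A minimal numpy sketch of $\ell_1$- and $\ell_2$-regularized NMF, assuming the usual penalized least-squares objective and the standard multiplicative updates (only the denominator of the G update changes); the exact rules used in the thesis may differ.

```python
import numpy as np

def regularized_nmf(X, k, penalty="l1", lam=0.1, n_iter=300, eps=1e-9, seed=None):
    """NMF with an l1 or l2 penalty on G, standard multiplicative form."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k)) + eps
    G = rng.random((k, n)) + eps
    for _ in range(n_iter):
        F *= (X @ G.T) / (F @ G @ G.T + eps)
        if penalty == "l1":
            G *= (F.T @ X) / (F.T @ F @ G + lam + eps)       # l1 adds a constant
        else:
            G *= (F.T @ X) / (F.T @ F @ G + lam * G + eps)   # l2 adds lam * G
    return F, G
```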
Update rules for Schatten p-norms Regularized NMF (1)
Our model takes the following form:
Direct update rules involve SVD:
Update rules for Schatten p-norms Regularized NMF (2)
Change our objective function:
Update rules take the following form:
Experimental results on image data sets
Data sets:
• Yale (165 instances of 15 people)
• ORL (400 images of 40 people, of size 32×32)
Purity and entropy as a function of p for PIE and USPS data sets
Conclusions
Regularization with Schatten p-norms can be more effective than regularization with the $\ell_1$ and $\ell_2$ norms
For large data sets the effect of regularization is clearer
The quality of a data set is not crucial.
Random Subspace NMF for Unsupervised Transfer Learning
Preliminary knowledge
Standard NMF:
$$X \approx FG^T, \quad X \in \mathbb{R}^{m \times n},\ F \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{n \times k}$$
Convex NMF (the column vectors of F lie within the column space of X):
$$X \approx XWG^T, \quad X \in \mathbb{R}^{m \times n},\ W \in \mathbb{R}^{n \times k},\ G \in \mathbb{R}^{n \times k}$$
Multilayer NMF (we build up a system with many layers, i.e. a cascade connection of L mixing subsystems):
$$X \approx F_1 F_2 \cdots F_L G, \quad X \in \mathbb{R}^{m \times n},\ F_1 \in \mathbb{R}^{m \times k},\ G \in \mathbb{R}^{k \times n}$$
Our approach: RS-NMF
Find initial partition and prototype matrix of the target task
Build a sequence of partitions in different subspaces of a source task (“knowledge decomposition”)
Find the k nearest neighbors among them with respect to the target partition
Find “link” matrices between them
Use these “link” matrices to perform a final factorization
[Diagram: clustering of the target task, $X_T \approx X_T W_T G_T^T$; knowledge decomposition in the source task, $\{X_S^i\}_{i=1}^M \rightarrow \{G_i\}_{i=1}^M$; selection of the k nearest neighbors of the target partition, $\{G_i\}_{i=1}^k = N_k(G_T)$; the "link" matrices $P_T, \{W_i\}_{i=1}^k$ between them; final factorization using the "link" matrices, $X_T \approx P_T W_1 \cdots W_k G_T^*$]
Initialization
Let us consider two tasks $T_S$ and $T_T$ defined by two matrices $X_S$ and $X_T$.
We perform Convex NMF on $X_T$:
$$X_T \approx X_T W_T G_T^T$$
$G_T$ is the initial partition.
$P_T = X_T W_T$ is a matrix of basis vectors that are linear combinations of the original data points.
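An illustrative numpy sketch of this initialization, using the Convex NMF updates of [Ding et al.] simplified under the assumption that $X_T$ is nonnegative (so the positive/negative splits of $X^T X$ vanish); initialization is random here rather than the k-means-based scheme usually recommended.

```python
import numpy as np

def convex_nmf(X, k, n_iter=300, eps=1e-9, seed=None):
    """Convex NMF: X ~ X W G^T with W, G >= 0 (simplified updates for X >= 0)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k)) + eps
    G = rng.random((n, k)) + eps
    A = X.T @ X                              # nonnegative here
    for _ in range(n_iter):
        G *= np.sqrt((A @ W) / (G @ W.T @ A @ W + eps))
        W *= np.sqrt((A @ G) / (A @ W @ G.T @ G + eps))
    return W, G                              # basis P = X @ W, partition G
```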
Knowledge decomposition
Randomly choose features of $X_S$ and perform an arbitrary type of NMF on the sequence of reduced matrices
Obtain a sequence of partition matrices that were calculated on the subspaces of X.
$$\{X_S^i\}_{i=1}^M \;\rightarrow\; \{G_i\}_{i=1}^M$$
Random Subspace NMF: purity values
Defining neighborhood
Simply use an arbitrary similarity measure (any divergence measure or just a simple correlation function) to find the k nearest neighbors of the target task's partition $G_T$.
We use a simple correlation function given by the following expression:
$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad \{G_i\}_{i=1}^k = N_k(G_T)$$
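A possible numpy sketch of this selection step. The slides do not specify how partitions computed on data sets of different sizes are compared; here each partition is summarized by its $k \times k$ matrix $GG^T$ before computing correlations, which is only one plausible choice.

```python
import numpy as np

def select_neighbors(G_T, G_list, k):
    """Return the k source partitions most correlated with the target partition."""
    summary = lambda G: (G @ G.T).ravel()          # shape-independent k*k summary
    s_t = summary(G_T)
    scores = [np.corrcoef(s_t, summary(G))[0, 1] for G in G_list]
    order = np.argsort(scores)[::-1]               # most correlated first
    return [G_list[i] for i in order[:k]]
```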
Learning “link” matrices
At this step we take each of the chosen matrices and perform an NMF of the following form:
$$G_i \approx W_i G_i^*, \quad G_i \in \mathbb{R}^{k \times n},\ W_i \in \mathbb{R}^{k \times k},\ G_i^* \in \mathbb{R}^{k \times n}, \quad \forall i = 1, \dots, k.$$
The idea behind constructing this sequence of "link" matrices $\{W_i\}_{i=1}^k$ is that they capture the relationships between clusters and thus reflect the structure of a data set.
Final decomposition
Finally we have a sequence of matrices $P_T, \{W_i\}_{i=1}^k$.
Performing a Multilayer NMF of the following form gives the final partition $G_T^*$:
$$X_T \approx P_T W_1 \cdots W_k G_T^*$$
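A sketch of this last step, assuming $P_T$ and the "link" matrices are kept fixed (all nonnegative) and only $G_T^*$ is re-estimated with a standard multiplicative update for the fixed, composed basis:

```python
import numpy as np

def final_factorization(X_T, P_T, W_list, n_iter=300, eps=1e-9, seed=None):
    """Estimate G* in X_T ~ P_T W_1 ... W_k G* with the basis fixed."""
    rng = np.random.default_rng(seed)
    B = P_T.copy()
    for W in W_list:                          # compose the multilayer basis
        B = B @ W
    G = rng.random((B.shape[1], X_T.shape[1])) + eps
    for _ in range(n_iter):
        G *= (B.T @ X_T) / (B.T @ B @ G + eps)
    return G
```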
Evaluation criteria
Dunn’s index (k denotes the number of clusters, i and j are cluster labels, d(c
i, c
j) defines the between-cluster distance between clusters X
iand X
j; d(X
k) represents the within-cluster of X
k.
Calinski-Harabasz index (S
Bis a between-cluster scatter matrix, S
Wis the internal scatter matrix, n
pis a number of clustered samples and k is a number of clusters.)
Dunn = min
1£i£k min d (c
i, c
j)
max 1£k£k (d (X
k)) ì í
î
ü ý þ ì í
ï îï
ü ý
ï
þï
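For illustration, a numpy sketch of Dunn's index under one common choice of distances (centroid distance between clusters, maximal pairwise distance within a cluster); the thesis may use different definitions of $d(c_i, c_j)$ and $d(X_l)$.

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn's index: smallest between-cluster distance over largest cluster diameter."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    diam = 0.0                                   # largest within-cluster diameter
    for c in clusters:
        P = X[labels == c]
        if len(P) > 1:
            d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
            diam = max(diam, d.max())
    dc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    between = dc[~np.eye(len(clusters), dtype=bool)].min()
    return between / diam
```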
Dunn’s index
for transfer between different data sets
Calinski-Harabasz index for transfer between different data sets
Why does it work?
Sparse Matrix Factorization [B. Neyshabur and R. Panigrahy, 2013]
For a given binary matrix Y, minimizing the total sparsity of the decomposition $Y = \mathrm{sign}(X_1\, \mathrm{sign}(X_2\, \mathrm{sign}(\cdots X_n)))$ is equivalent to the computations performed by a deep neural network, where each $X_i$ corresponds to the i-th layer.
Learning "link" matrices can be seen as learning nonnegative encoders between the target and the chosen partitions (i.e., injecting auxiliary knowledge into the corresponding layer of a deep neural network).
Why does it work on these data sets?
Common assumption: transfer learning is useful only for closely related data sets [Rosenstein et al., 2004].
Is it really so?!
Transfer learning using Kolmogorov complexity [M. M. Mahmud and S. R. Ray]
Bridge Convex NMF for Unsupervised Transfer Learning
Background
Transfer learning depends on correlation …
Background
… but it can also be used when the connection is very tenuous
• Measuring similarity using universal distances
• Defining the right amount of knowledge to be transferred
• Most general transfer experiments
Kernel NMF (K-NMF)
Kernel NMF is a natural extension of C-NMF. It seeks the following decomposition:
$$K \approx K W_+ G_+^T, \quad K \in \mathbb{R}^{n \times n},\ W \in \mathbb{R}^{n \times k},\ G \in \mathbb{R}^{n \times k},$$
where K is the Gram matrix of some arbitrary kernel function k.
Why K-NMF?
Sometimes clustering based on similarities between objects gives better results.
Relatedness measures
• Kernel Target Alignment: captures the similarity between two Gram matrices
• Rényi entropy based alignment
Algorithm
We calculate Gram matrices $K_S$ and $K_T$ using some arbitrary kernel function for the matrices $X_S$ and $X_T$ (source and target task data).
Kernel target alignment gives us an idea about the initial proximity of two tasks:
We adapt the kernel function in order to increase the alignment:
$$A(K_S, K_T) = \frac{\langle K_S, K_T \rangle_F}{\sqrt{\langle K_S, K_S \rangle_F \, \langle K_T, K_T \rangle_F}}$$
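A one-function numpy sketch of this alignment; it assumes the two Gram matrices have already been brought to the same size (e.g. computed on samples of equal size), a detail the slides do not spell out.

```python
import numpy as np

def kernel_target_alignment(K1, K2):
    """Alignment between two same-sized Gram matrices."""
    num = np.sum(K1 * K2)                                  # Frobenius inner product
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den
```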
Why does it work?
What happens when we perform alignment optimization?
If we consider the following entropic relatedness measure
then it is easy to show that $A(K_S, K_T)$ is equivalent to $R_I(X_S, X_T)$ with respect to the type of inner product used.
Moreover, it can be proved that:
Complexity
For a given matrix the complexity depends on:
• the desired number of clusters
• the number of iterations used for kernel alignment optimization
Results for cross-domain transfer
• DBI before transfer – Davies-Bouldin index for the baseline (Convex NMF applied to the target task)
• DBI after transfer – Davies-Bouldin index for our approach, Bridge Convex NMF
• OVA – optimal level of alignment where the best result was obtained
Simple example
[Figures: Glass data set; C-NMF of the Iris data set vs. BC-NMF of the Iris data set]
Conclusions
Transfer learning can be applied to tasks with tenuous connection
Competitive performance on UCI datasets
Big question we still need to answer:
So if the cognitive distance between two tasks is defined as
then what is the smallest cognitive distance between two tasks when transfer learning can be applied?
Future works
• Apply a hyper-optimization technique for imposing priors on the factors of NMF in order to control orthogonality
• Apply a hyper-optimization technique for choosing the trade-off parameters in Schatten p-norms NMF
• Use Hessian Schatten p-norm regularization with NMF
• Extend the RS-NMF algorithm to multitask transfer learning
• Find a possible range of suboptimal values of k beforehand, depending on the minimum correlation level in RS-NMF
Thank you for your attention!
Feel free to ask questions if you have any.
Transfer Learning vs Traditional ML
Gram-Schmidt Process
The Gram-Schmidt process works as follows: given vectors $v_1, \dots, v_k$, set $u_1 = v_1$ and $u_j = v_j - \sum_{i < j} \mathrm{proj}_{u_i}(v_j)$, where $\mathrm{proj}_u(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u$.
Complexity study
First model, which involves an SVD: $O(t(m^2 k + k^2 n + m k^2 + n k^2 + k^3 + n^3 + knm))$
Model without SVD: $O(t(m^2 k + k^2 m + n k^2 + n k^2 + knm))$
The second model should be applied if the data matrix is rather large in terms of the number of instances.
Kernels and Gram matrices
A kernel is a function k:
$$k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}, \quad (x, x') \mapsto k(x, x'),$$
satisfying
$$\forall (x, x') \in \mathcal{X} \times \mathcal{X}, \quad k(x, x') = \langle \Phi(x), \Phi(x') \rangle,$$
where $\Phi$ maps into some dot product space H, sometimes called the feature space.
The Gram matrix of a kernel function k w.r.t. a set of points $x_1, \dots, x_n$ is the matrix with entries $K_{ij} = k(x_i, x_j)$.
Kernel functions
Different similarity measures can be used as kernel functions. For instance:
• Linear kernel: $k(x, x') = x^T x' + c$
• Polynomial kernel: $k(x, x') = (a\, x^T x' + c)^d$
• Gaussian kernel: $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
• … etc.
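For instance, the Gram matrix of the Gaussian kernel can be computed as in this short numpy sketch (one sample per row of X):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K with K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T    # pairwise squared distances
    return np.exp(-d2 / (2 * sigma ** 2))
```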
Apprentissage par factorisation matricielle (EPAT’14 – Carry-Le-Rouet 7-12 juin 2014)