Méthodes pour l’apprentissage de données massives
Nathalie Villa-Vialaneix http://www.nathalievilla.org
in collaboration with Fabrice Rossi
Journées d’Études en Statistique
2-7 octobre 2016 - Fréjus
Statistical learning and massive data
Nathalie Villa-Vialaneix |Méthodes pour l’apprentissage de données massives 1/46
Outline
1 Introduction
2 Subsampling
  Bag of Little Bootstrap (BLB)
  Nyström approximation
3 Divide & Conquer
4 Online learning
  Online k-nearest neighbors
  Online kernel ridge regression
  Online bagging
Big Data?
Reference to the fast and recent increase of worldwide data storage.
[Figure: growth of worldwide data storage. Image from Wikimedia Commons, CC BY-SA 3.0, author: Myworkforwiki. One exabyte = 10^18 bytes.]
The 3V:
Volume: amount of data
Velocity: speed at which new data is generated
Variety: different types of data (text, images, videos, networks...)
Big Data?
Standard sizes (in bits):
42×10^6: the complete works of Shakespeare
6.4×10^9: capacity of the human genome (2 bits per base pair)
4.5×10^16: capacity of HD space in the Google server farm in 2004
2×10^17: storage space of Megaupload when it was shut down (2012)
2.4×10^18: storage space of the Facebook data warehouse in 2014, with an increase of 0.6×10^15 per day
1.2×10^20: storage space of the Google data warehouse in 2013
Source: https://en.wikipedia.org/wiki/Orders_of_magnitude_(data)
Why are Big Data seen as an opportunity?
economic opportunity: advertisements, recommendations, ...
social opportunity: better job profiling
find new solutions to existing problems: open data websites with challenges or publication of re-use, e.g., https://www.data.gouv.fr, https://ressources.data.sncf.com or https://data.toulouse-metropole.fr
When should we consider data as “big”?
We deal with Big Data when:
data are at Google scale (rare)
data are big compared to our computing capacities
... and depending on what we need to do with them
[R Core Team, 2015, Kane et al., 2013]:
“R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.”
Big Data and Statistics
Evolution of the problems posed to statistics:
“small n, small p” problems
large dimensional problems: $p > n$ or $p \gg n$
big data problems: n is very large
[Jordan, 2013]: scale statistical methods originally designed to deal with small n, and take advantage of parallel or distributed computing environments to deal with large/big n, while ensuring good properties (consistency, good approximations...).
This requires a closer cooperation between statisticians and computer scientists.
Purpose of this presentation
What we will discuss
Standard approaches used to scale statistical methods, with examples of applications to learning methods discussed in previous presentations.
What we will not discuss
Practical implementations on various computing environments, or programs in which these approaches can be used.
Organization of the talk
Overview of BLB
[Kleiner et al., 2012, Kleiner et al., 2014]
method used to scale any bootstrap estimation
consistency result demonstrated for bootstrap estimation
Here: we describe the approach in the simplified case of bagging (illustration with random forests)
Framework: a learning set $(X_i, Y_i)_{i=1,\dots,n}$. We want to define a predictor of $Y \in \mathbb{R}$ from $X$ given the learning set.
Standard bagging
Subsample with replacement B times from $(X_i, Y_i)_{i=1,\dots,n}$, giving samples $(X_i, Y_i)_{i\in\tau_1}, \dots, (X_i, Y_i)_{i\in\tau_B}$.
Train an estimator $\hat f_b$ on each sample, then aggregate:
$\hat f_{\mathrm{bootstrap}} = \frac{1}{B} \sum_{b=1}^{B} \hat f_b$
Advantage for Big Data: bootstrap estimators can be learned in parallel.
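The parallel structure of bagging can be sketched as follows — a minimal illustration with a hypothetical toy estimator (least-squares fit of a line); any learner could be plugged in, and the thread pool stands in for a real parallel or distributed environment:

```python
# Sketch of standard bagging: B independent bootstrap fits run in parallel,
# then the predictors are averaged. Toy estimator: simple linear regression.
import random
from concurrent.futures import ThreadPoolExecutor

def fit_linear(sample):
    # Closed-form least squares for y = a*x + b on one bootstrap sample.
    xs, ys = zip(*sample)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in sample) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def bagging(data, B, seed=0):
    rng = random.Random(seed)
    # Draw B bootstrap samples (size n, with replacement); fits are independent.
    samples = [rng.choices(data, k=len(data)) for _ in range(B)]
    with ThreadPoolExecutor() as pool:
        fits = list(pool.map(fit_linear, samples))
    # Aggregate: the bagged predictor averages the B individual predictors.
    def f_bagged(x):
        return sum(a * x + b for a, b in fits) / len(fits)
    return f_bagged

data = [(x / 10, 2 * (x / 10) + 1) for x in range(50)]  # noiseless y = 2x + 1
f = bagging(data, B=25)
print(round(f(1.0), 3))  # prints 3.0
```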
Problem with standard bagging
When n is big, the number of different observations in $\tau_b$ is $\simeq 0.63\,n$ ⇒ still BIG!
First solution: [Bickel et al., 1997] propose the “m-out-of-n” bootstrap: bootstrap samples have size m with $m \ll n$.
But: the quality of the estimator strongly depends on m!
Idea behind BLB
Use bootstrap samples of size n, but with a very small number of different observations in each of them.
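The $0.63\,n$ figure is the expected number of distinct observations in a size-n bootstrap sample, $n\left(1-(1-1/n)^n\right) \to n\,(1-e^{-1}) \approx 0.632\,n$. A quick check:

```python
# Expected fraction of distinct observations in a size-n bootstrap sample:
# 1 - (1 - 1/n)^n, which tends to 1 - 1/e as n grows.
import math
import random

n = 100_000
theory = 1 - (1 - 1 / n) ** n
print(round(theory, 4))                  # prints 0.6321, i.e. ~ 1 - exp(-1)

rng = random.Random(0)
boot = rng.choices(range(n), k=n)        # one bootstrap sample
print(round(len(set(boot)) / n, 3))      # empirical fraction, ~0.632
```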
Presentation of BLB
[Diagram: from the learning set $(X_1,Y_1),\dots,(X_n,Y_n)$, draw $B_1$ subsamples of size $m \ll n$ by sampling without replacement. Each subsample is over-sampled $B_2$ times by drawing multinomial weights $(n_1^{(b_1,b_2)},\dots,n_m^{(b_1,b_2)})$; each weighted sample yields an estimator $\hat f^{(b_1,b_2)}$. Averaging over $b_2$ (mean) gives $\hat f_{b_1}$, and averaging the $\hat f_{b_1}$ over $b_1$ (mean) gives $\hat f_{\mathrm{BLB}}$.]
What is over-sampling and why does it work?
BLB steps:
1 create $B_1$ samples (without replacement) of size $m \sim n^{\gamma}$ with $\gamma \in [0.5, 1]$: for $n = 10^6$ and $\gamma = 0.6$, a typical m is about 4 000, compared to about 630 000 different observations for a standard bootstrap sample
2 for every subsample $\tau_b$, repeat $B_2$ times:
  - over-sampling: assign weights $(n_1, \dots, n_m)$, simulated as $\mathcal{M}\left(n, m^{-1}\mathbf{1}_m\right)$, to the observations in $\tau_b$
  - estimation step: train an estimator with the weighted observations (if the learning algorithm allows a genuine processing of weights, the computational cost is low because m is small)
3 aggregate by averaging
Remark: the final sample size ($\sum_{i=1}^{m} n_i$) is equal to n (with replacement), as in standard bootstrap samples.
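The steps above can be sketched with a hypothetical toy statistic, the mean of Y, whose weighted version costs only O(m) per fit: each weighted sample is fully described by the m distinct points of the subsample and their multinomial counts.

```python
# Minimal BLB sketch (toy statistic: the mean of Y). Multinomial weights
# summing to n make each weighted sample behave like a size-n bootstrap
# sample while touching only m distinct observations.
import random

def blb_mean(y, B1, B2, gamma=0.6, seed=0):
    rng = random.Random(seed)
    n = len(y)
    m = round(n ** gamma)                       # subsample size m ~ n^gamma
    estimates = []
    for _ in range(B1):
        sub = rng.sample(y, m)                  # size m, WITHOUT replacement
        inner = []
        for _ in range(B2):
            # Weights (n_1, ..., n_m) ~ Multinomial(n, (1/m, ..., 1/m)).
            counts = [0] * m
            for _ in range(n):
                counts[rng.randrange(m)] += 1
            # Weighted mean: uses only the m distinct points and their counts.
            inner.append(sum(c * v for c, v in zip(counts, sub)) / n)
        estimates.append(sum(inner) / B2)       # aggregate over B2
    return sum(estimates) / B1                  # aggregate over B1

y = [float(i) for i in range(10_000)]           # true mean: 4999.5
print(round(blb_mean(y, B1=5, B2=3), 1))        # close to 4999.5
```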
Large scale kernel methods
Given a training set $(X_i, Y_i)_{i=1,\dots,n}$ ($X \in \mathcal{X}$ and $Y \in \mathbb{R}$ or $\{0, \dots, K-1\}$), kernel methods (such as SVM) are based on the definition of a kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is symmetric ($K(x, x') = K(x', x)$) and positive ($\sum_{i,i'=1}^{n} \alpha_i \alpha_{i'} K(x_i, x_{i'}) \geq 0$).
This is equivalent to embedding $\mathcal{X}$ into an (implicit) Hilbert space, in which standard (often linear) statistical methods can be used thanks to the kernel trick:
$K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$
Standard complexity (number of elementary operations) of kernel learning methods: $O(n^2)$ or even $O(n^3)$ (e.g., K-PCA).
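A toy numerical check of the two defining properties, using the Gaussian kernel as an example (numpy assumed): the resulting kernel matrix is symmetric and positive semidefinite.

```python
# The Gaussian kernel K(x, x') = exp(-|x - x'|^2) is symmetric and positive,
# so its kernel matrix on any sample is symmetric positive semidefinite.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)        # 50 x 50 kernel matrix

print(bool(np.allclose(K, K.T)))                   # True: symmetry
print(bool(np.linalg.eigvalsh(K).min() > -1e-8))   # True: positivity
```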
Low rank approximation solutions
Aim: approximate $\mathbf{K} = (K(X_i, X_{i'}))_{i,i'=1,\dots,n}$ with a low rank matrix (a matrix with rank $k \ll n$).
Then, use the approximation to train your predictor and correct it to “re-scale” it to n. Typical computational cost: $O(nk^2)$.
Sketch of Nyström approximation
[Williams and Seeger, 2000, Drineas and Mahoney, 2005]
Pick at random m observations in $\{1, \dots, n\}$ (without loss of generality, suppose that the first m ones have been chosen).
Re-write:
$\mathbf{K} = \begin{pmatrix} \mathbf{K}^{(m)} & \mathbf{K}^{(m,n-m)} \\ \mathbf{K}^{(n-m,m)} & \mathbf{K}^{(n-m,n-m)} \end{pmatrix}$ and $\mathbf{K}^{(n,m)} = \begin{pmatrix} \mathbf{K}^{(m)} \\ \mathbf{K}^{(n-m,m)} \end{pmatrix}$,
with $\mathbf{K}^{(n-m,m)} = (\mathbf{K}^{(m,n-m)})^{\top}$, and use $\mathbf{K}^{(m)}$ instead of $\mathbf{K}$.
Approximate spectral decomposition of K
Notations:
$\mathbf{K}$: eigenvectors $(v_j)_{j=1,\dots,n}$, eigenvalues $(\mu_j)_{j=1,\dots,n}$ (positive, in decreasing order)
$\mathbf{K}^{(m)}$: eigenvectors $(v_j^{(m)})_{j=1,\dots,m}$, eigenvalues $(\mu_j^{(m)})_{j=1,\dots,m}$ (positive, in decreasing order)
$\forall\, j = 1, \dots, m$: $\mu_j \simeq \frac{n}{m}\,\mu_j^{(m)}$ and $v_j \simeq \sqrt{\frac{m}{n}}\,\frac{1}{\mu_j^{(m)}}\,\mathbf{K}^{(n,m)} v_j^{(m)}$
complexity of the direct calculation: $O(n^3)$
complexity of the approximate solution: $O(m^3) + O(nm^2)$
Remark: when the rank of $\mathbf{K}$ is $< m$, the approximation is exact.
Notations: $\mu_j^{(n,m)} = \frac{n}{m}\,\mu_j^{(m)}$ and $v_j^{(n,m)} = \sqrt{\frac{m}{n}}\,\frac{1}{\mu_j^{(m)}}\,\mathbf{K}^{(n,m)} v_j^{(m)}$
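A numerical illustration of the eigenvalue approximation $\mu_j \simeq \frac{n}{m}\mu_j^{(m)}$ on a toy Gaussian kernel matrix (numpy assumed; the kernel and sizes are arbitrary choices for the sketch):

```python
# Nyström eigenvalue approximation: the top eigenvalues of the full n x n
# kernel matrix are estimated at O(m^3) cost from the m x m block K^(m).
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 100
x = rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)    # full n x n kernel matrix

idx = rng.choice(n, size=m, replace=False)     # the m sampled observations
K_m = K[np.ix_(idx, idx)]                      # m x m block K^(m)

mu = np.linalg.eigvalsh(K)[::-1]               # exact eigenvalues, decreasing
mu_approx = (n / m) * np.linalg.eigvalsh(K_m)[::-1]

print(np.round(mu[:3], 1))                     # O(n^3) computation
print(np.round(mu_approx[:3], 1))              # O(m^3), close to the above
```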
Low rank approximation of K
A rank k (for $k \leq m$) approximation of $\mathbf{K}$ is obtained with:
$\widetilde{\mathbf{K}} = \mathbf{K}^{(n,m)} \left(\mathbf{K}_k^{(m)}\right)^{+} \mathbf{K}^{(m,n)}$
in which $\mathbf{K}_k^{(m)}$ is the best rank k approximation of $\mathbf{K}^{(m)}$ and $\left(\mathbf{K}_k^{(m)}\right)^{+} = \sum_{j=1}^{k} \left(\mu_j^{(m)}\right)^{-1} v_j^{(m)} \left(v_j^{(m)}\right)^{\top}$ is its pseudo-inverse. Equivalently:
$\widetilde{\mathbf{K}} = \sum_{j=1}^{k} \mu_j^{(n,m)}\, v_j^{(n,m)} \left(v_j^{(n,m)}\right)^{\top}$
Quality of the approximation [Cortes et al., 2010, Bach, 2013]: with probability at least $1 - \delta$,
$\|\widetilde{\mathbf{K}} - \mathbf{K}\|_2 \leq \|\mathbf{K}_k - \mathbf{K}\|_2 + \frac{n}{\sqrt{m}} \max_{i=1,\dots,n} K_{ii} \left(2 + \log\frac{1}{\delta}\right)$
where $\mathbf{K}_k$ is the best rank k approximation of $\mathbf{K}$ and the $K_{ii}$ are the diagonal terms of $\mathbf{K}$.
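A sketch of the construction $\widetilde{\mathbf{K}} = \mathbf{K}^{(n,m)} (\mathbf{K}_k^{(m)})^{+} \mathbf{K}^{(m,n)}$ on a toy Gaussian kernel (numpy assumed; sizes are illustrative). Since $\mathbf{K}_k$ is the optimal rank k approximation, its spectral error $\mu_{k+1}$ lower-bounds the Nyström error, as in the quality bound:

```python
# Rank-k Nyström approximation K_tilde = K^(n,m) (K_k^(m))^+ K^(m,n),
# compared with the best rank-k error ||K_k - K||_2 = mu_{k+1}.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 300, 60, 8
x = rng.uniform(-3, 3, size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

idx = rng.choice(n, size=m, replace=False)
C = K[:, idx]                                    # K^(n,m), n x m
lam, v = np.linalg.eigh(K[np.ix_(idx, idx)])
lam, v = lam[::-1][:k], v[:, ::-1][:, :k]        # top-k spectrum of K^(m)
lam = np.maximum(lam, 1e-12)                     # numerical guard
W_pinv = (v / lam) @ v.T                         # pseudo-inverse (K_k^(m))^+
K_tilde = C @ W_pinv @ C.T                       # rank <= k

err_nystrom = np.linalg.norm(K - K_tilde, 2)     # spectral-norm error
err_best = np.linalg.eigvalsh(K)[::-1][k]        # ||K_k - K||_2 = mu_{k+1}
print(bool(err_best <= err_nystrom))             # True: K_k is optimal
```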
Application to kernel methods
Example: kernel ridge regression [Saunders et al., 1998]
KRR is a kernel regression method which simply performs a linear regression in the feature space, regularized by the $L_2$ norm:
$\arg\min_{w \in \mathcal{H}} \sum_{i=1}^{n} \left(Y_i - \langle w, \phi(X_i) \rangle_{\mathcal{H}}\right)^2 + \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2$
Solution:
$\hat f : x \in \mathcal{X} \mapsto \langle w, \phi(x) \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i K(X_i, x)$ with $(\alpha_1, \dots, \alpha_n)^{\top} = (\mathbf{K} + \lambda \mathbf{I}_n)^{-1} (Y_1, \dots, Y_n)^{\top}$
The inverse of $(\mathbf{K} + \lambda \mathbf{I}_n)$ is computationally very costly!
Nyström approximation: replace $\mathbf{K}$ by its Nyström rank k approximation $\mathbf{V}_k \boldsymbol{\Lambda}_k \mathbf{V}_k^{\top}$; $\alpha$ then has the explicit form:
$\tilde\alpha = \frac{1}{\lambda} \left( Y - \mathbf{V}_k \left(\lambda \mathbf{I}_k + \boldsymbol{\Lambda}_k \mathbf{V}_k^{\top} \mathbf{V}_k\right)^{-1} \boldsymbol{\Lambda}_k \mathbf{V}_k^{\top} Y \right)$
The quality of the prediction is also controlled under this approximation (see [Cortes et al., 2010, Bach, 2013]).
Similar methods exist to use the Nyström approximation in SVM.
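A sketch of Nyström-accelerated KRR on toy data (numpy assumed; the data, kernel, and sizes are illustrative choices): the exact solution inverts an $n \times n$ matrix, while the explicit formula for $\tilde\alpha$ only solves a $k \times k$ system.

```python
# Exact KRR vs. Nyström KRR: alpha = (K + lambda I_n)^-1 y needs an n x n
# solve; alpha_tilde only needs the k x k matrix lambda I_k + L_k V_k^T V_k.
import numpy as np

rng = np.random.default_rng(1)
n, m, k, lam_reg = 400, 80, 8, 1.0
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Exact KRR coefficients -- O(n^3).
alpha = np.linalg.solve(K + lam_reg * np.eye(n), y)

# Nyström rank-k spectrum: L_k (approx eigenvalues), V_k (approx eigenvectors).
idx = rng.choice(n, size=m, replace=False)
w, v = np.linalg.eigh(K[np.ix_(idx, idx)])
w, v = w[::-1][:k], v[:, ::-1][:, :k]
Lk = (n / m) * w
Vk = np.sqrt(m / n) * (K[:, idx] @ v) / w

# alpha_tilde = (1/lambda) (y - V_k (lambda I_k + L_k V_k^T V_k)^-1 L_k V_k^T y)
inner = lam_reg * np.eye(k) + Lk[:, None] * (Vk.T @ Vk)
alpha_tilde = (y - Vk @ np.linalg.solve(inner, Lk * (Vk.T @ y))) / lam_reg

# Fitted values from the two coefficient vectors stay close.
print(round(float(np.abs(K @ alpha - K @ alpha_tilde).mean()), 3))
```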