
Pack only the essentials: Adaptive dictionary learning for kernel ridge regression

Daniele Calandriello, Alessandro Lazaric, Michal Valko
SequeL team, INRIA Lille - Nord Europe, France

{daniele.calandriello, alessandro.lazaric, michal.valko}@inria.fr

1 Introduction

One of the major limits of kernel ridge regression (KRR) is that for $n$ samples, storing and manipulating the kernel matrix $\mathbf{K}_n$ requires $O(n^2)$ space, which rapidly becomes unfeasible for large $n$.

Many solutions focus on how to scale KRR by reducing its space (and time) complexity without compromising the prediction accuracy. A popular approach is to construct low-rank approximations of the kernel matrix by randomly selecting a subset of $m$ columns from $\mathbf{K}_n$, thus reducing the space complexity to $O(nm)$. These methods, often referred to as Nyström approximations, mostly differ in the distribution used to sample the columns of $\mathbf{K}_n$ and in the construction of the low-rank approximation.

Both of these choices significantly affect the accuracy of the resulting approximation [5]. Bach [2] showed that uniform sampling preserves the prediction accuracy of KRR (up to $\varepsilon$) only when the number of columns $m$ is proportional to the maximum degree of freedom of the kernel matrix. This may require sampling $O(n)$ columns in datasets with high coherence [4] (i.e., a kernel matrix with weakly correlated columns). Alternatively, Alaoui and Mahoney [1] showed that sampling columns according to their ridge leverage scores (RLS), a measure of the influence of each point on the regression, produces an accurate Nyström approximation with a number of columns $m$ proportional only to the average degrees of freedom of the matrix, called the effective dimension. Unfortunately, the complexity of computing the RLS is comparable to that of solving KRR itself, making this approach unfeasible. However, Alaoui and Mahoney [1] proposed a fast method to compute a constant-factor approximation of the RLS and showed that the resulting accuracy and space complexity are close to those of sampling with exact RLS, at the cost of an extra dependency on the inverse of the minimal eigenvalue of the kernel matrix. Unfortunately, the minimal eigenvalue can be arbitrarily small in many problems. Calandriello et al. [3] addressed this issue by processing the dataset incrementally and updating estimates of the ridge leverage scores, the effective dimension, and the Nyström approximation on the fly. Although the space complexity of the resulting algorithm (INK-ESTIMATE) no longer depends on the minimal eigenvalue, it introduces a dependency on the largest eigenvalue of $\mathbf{K}_n$, which in the worst case can be as large as $n$. This can potentially reduce the advantage of the method.

In this paper we introduce SQUEAK, a new algorithm that builds on INK-ESTIMATE but uses unnormalized RLS and an improved RLS estimator. As a consequence, the algorithm is simpler, does not need to compute an estimate of the effective dimension for normalization, and achieves a space complexity that is only a constant factor worse than sampling according to the exact RLS.

2 Background

Notation. We use curly capital letters $\mathcal{A}$ for collections and $|\mathcal{A}|$ for the number of entries in $\mathcal{A}$, upper-case bold letters $\mathbf{A}$ for matrices, and lower-case bold letters $\mathbf{a}$ for vectors. We denote by $[\mathbf{A}]_{ij}$ and $[\mathbf{a}]_i$ the $(i,j)$ element of a matrix and the $i$-th element of a vector, respectively. We use $\mathbf{e}_{n,i} \in \mathbb{R}^n$ for the $i$-th indicator vector of dimension $n$. Finally, the set of the first $n$ integers is $[n] := \{1, \ldots, n\}$.

Kernel regression. We consider a regression dataset $\mathcal{D} = \{(\mathbf{x}_t, y_t)\}_{t=1}^n$, with input $\mathbf{x}_t \in \mathcal{X} \subseteq \mathbb{R}^d$ and output $y_t = f^{\star}(\mathbf{x}_t) + \eta_t$, where $f^{\star}$ is an unknown target function and $\eta_t$ is zero-mean i.i.d. noise. We denote by $\mathcal{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a positive definite kernel function. Given the first $t$ samples in $\mathcal{D}$, the kernel matrix $\mathbf{K}_t \in \mathbb{R}^{t \times t}$ is obtained as $[\mathbf{K}_t]_{ij} = \mathcal{K}(\mathbf{x}_i, \mathbf{x}_j)$ for any $i,j \in [t]$, and we denote by $\mathbf{y}_t, \mathbf{f}_t^{\star} \in \mathbb{R}^t$ the vectors with components $y_i$ and $f^{\star}(\mathbf{x}_i)$, $i \in [t]$. Whenever a new point $\mathbf{x}_{t+1}$ arrives, the kernel matrix $\mathbf{K}_{t+1} \in \mathbb{R}^{(t+1) \times (t+1)}$ is obtained by bordering $\mathbf{K}_t$ as

$$\mathbf{K}_{t+1} = \begin{bmatrix} \mathbf{K}_t & \mathbf{k}_{t+1} \\ \mathbf{k}_{t+1}^{\mathsf{T}} & k_{t+1} \end{bmatrix}, \qquad (1)$$

where $\mathbf{k}_{t+1} \in \mathbb{R}^t$ is such that $[\mathbf{k}_{t+1}]_i = \mathcal{K}(\mathbf{x}_{t+1}, \mathbf{x}_i)$ for any $i \in [t]$ and $k_{t+1} = \mathcal{K}(\mathbf{x}_{t+1}, \mathbf{x}_{t+1})$.
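For concreteness, the bordering update in Eq. 1 amounts to appending one kernel column and one diagonal entry. Below is a minimal numpy sketch assuming a Gaussian (RBF) kernel; it is purely illustrative and not part of the original paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * sigma ** 2))

def border_kernel(K_t, X_t, x_new, kernel=rbf_kernel):
    """Extend K_t (t x t) to K_{t+1} when x_new arrives, as in Eq. 1."""
    k_vec = kernel(X_t, x_new[None, :])             # k_{t+1} in R^t (column vector)
    k_new = kernel(x_new[None, :], x_new[None, :])  # scalar K(x_{t+1}, x_{t+1})
    top = np.hstack([K_t, k_vec])
    bottom = np.hstack([k_vec.T, k_new])
    return np.vstack([top, bottom])
```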

At any time $t$, the objective of kernel regression is to find the vector $\widehat{\mathbf{w}}_t \in \mathbb{R}^t$ that minimizes the regularized quadratic loss

$$\widehat{\mathbf{w}}_t = \arg\min_{\mathbf{w}} \; \|\mathbf{y}_t - \mathbf{K}_t\mathbf{w}\|^2 + \mu\,\mathbf{w}^{\mathsf{T}}\mathbf{K}_t\mathbf{w} = (\mathbf{K}_t + \mu\mathbf{I})^{-1}\mathbf{y}_t, \qquad (2)$$

where $\mu \in \mathbb{R}$ is a regularization parameter. If $\mu$ is properly tuned, then $\widehat{\mathbf{w}}_t$ achieves a near-optimal risk $R(\widehat{\mathbf{w}}_t) = \mathbb{E}_{\eta}\big[\|\mathbf{f}_t^{\star} - \mathbf{K}_t\widehat{\mathbf{w}}_t\|_2^2\big]$. Nonetheless, the computation of the final $\widehat{\mathbf{w}}_n$ requires $O(n^3)$ time and $O(n^2)$ space, which is infeasible for large datasets.
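As a point of reference for the approximations that follow, the dense solve in Eq. 2 is a one-liner once $\mathbf{K}_t$ is available. The numpy sketch below is illustrative only; everything that follows is about avoiding this $O(t^3)$ computation.

```python
import numpy as np

def krr_fit_exact(K, y, mu):
    """Exact KRR weights w_hat = (K_t + mu I)^{-1} y_t (Eq. 2): O(t^3) time, O(t^2) space."""
    return np.linalg.solve(K + mu * np.eye(K.shape[0]), y)
```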

Nyström approximation. A common approach to reduce the complexity is to (randomly) select $m$ columns of $\mathbf{K}_t$ according to some distribution $p_t = \{p_{t,i}\}_{i=1}^t$ and construct the dictionary $\mathcal{I}_t = \{(i_j, \mathbf{k}_{t,i_j}, \widetilde{p}_{t,i_j})\}_{j=1}^m$, which contains the set of indices $i_j \in [t]$, the corresponding columns, and their weights. Given a dictionary $\mathcal{I}_t$, the regularized Nyström approximation of $\mathbf{K}_t$ is obtained as

$$\widetilde{\mathbf{K}}_t = \mathbf{K}_t\mathbf{S}_t\big(\mathbf{S}_t^{\mathsf{T}}\mathbf{K}_t\mathbf{S}_t + \gamma\mathbf{I}_m\big)^{-1}\mathbf{S}_t^{\mathsf{T}}\mathbf{K}_t, \qquad (3)$$

where the selection matrix $\mathbf{S}_t \in \mathbb{R}^{t \times m}$ is defined as $\mathbf{S}_t = [(q\widetilde{p}_{t,i_1})^{-1/2}\mathbf{e}_{t,i_1}, \ldots, (q\widetilde{p}_{t,i_m})^{-1/2}\mathbf{e}_{t,i_m}]$, $q$ is a constant, and $\gamma$ is a regularization term (possibly different from $\mu$). At this point, $\widetilde{\mathbf{K}}_t$ can be used to compute $\widetilde{\mathbf{w}}_t = (\widetilde{\mathbf{K}}_t + \mu\mathbf{I}_t)^{-1}\mathbf{y}_t$ efficiently using block inversion, reducing the complexity from $O(n^3)$ to $O(nm^2 + m^3)$ time and from $O(n^2)$ to $O(nm)$ space.
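The block-inversion step mentioned above can be spelled out, for instance, with the Woodbury identity, which avoids ever forming the $n \times n$ matrix $\widetilde{\mathbf{K}}_t + \mu\mathbf{I}$. The numpy sketch below is one illustrative way to do it (the arguments `dict_idx` and `dict_p` are assumed to hold the dictionary indices and weights); it is not the authors' code.

```python
import numpy as np

def nystrom_krr_weights(K, y, dict_idx, dict_p, q, gamma, mu):
    """Approximate KRR weights w_tilde = (K_tilde + mu I)^{-1} y, with
    K_tilde = K S (S^T K S + gamma I)^{-1} S^T K (Eq. 3), computed in
    O(n m^2 + m^3) time without materialising K_tilde."""
    m = len(dict_idx)
    w = (q * np.asarray(dict_p)) ** -0.5                 # column weights of S_t
    C = K[:, dict_idx] * w[None, :]                      # K_t S_t, shape n x m
    W = w[:, None] * K[np.ix_(dict_idx, dict_idx)] * w[None, :] + gamma * np.eye(m)
    # Woodbury: (C W^{-1} C^T + mu I)^{-1} y = (y - C (mu W + C^T C)^{-1} C^T y) / mu
    inner = np.linalg.solve(mu * W + C.T @ C, C.T @ y)
    return (y - C @ inner) / mu
```

Only the $m$ selected columns of $\mathbf{K}$ are ever touched, so in a streaming setting the dense matrix can be replaced by a routine that builds those columns on demand.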

Ridge leverage scores. The accuracy of $\widetilde{\mathbf{K}}_t$ is strictly related to the distribution $p_t$ used to construct the dictionary $\mathcal{I}_t$. In particular, Alaoui and Mahoney [1] showed that sampling according to the $\gamma$-ridge leverage scores (RLS) of $\mathbf{K}_t$ leads to an accurate Nyström approximation.

Definition 1. Given $\mathbf{K}_t = \mathbf{U}_t\boldsymbol{\Lambda}_t\mathbf{U}_t^{\mathsf{T}}$, the $\gamma$-ridge leverage score (RLS) of column $i \in [t]$ is

$$\tau_{t,i} = \mathbf{k}_{t,i}^{\mathsf{T}}(\mathbf{K}_t + \gamma\mathbf{I}_t)^{-1}\mathbf{e}_{t,i} = \mathbf{e}_{t,i}^{\mathsf{T}}\mathbf{K}_t(\mathbf{K}_t + \gamma\mathbf{I}_t)^{-1}\mathbf{e}_{t,i}. \qquad (4)$$

Furthermore, the effective dimension of the kernel is defined as $d_{\text{eff}}(\gamma)_t = \sum_{i=1}^t \tau_{t,i}$.

Similar to standard leverage scores (i.e., $\sum_j [\mathbf{U}]_{i,j}^2$), RLSs measure the importance of each point $\mathbf{x}_i$ for the kernel regression. Furthermore, the sum of the RLSs is the effective dimension $d_{\text{eff}}(\gamma)_t$, which measures the intrinsic capacity of the kernel $\mathbf{K}_t$ when its spectrum is soft-thresholded by a regularization $\gamma$. Using the RLS to construct a Nyström approximation leads to the following result.
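For intuition, the exact RLS and the effective dimension of Definition 1 can be computed directly from $\mathbf{K}_t$, at exactly the $O(t^3)$ cost the paper aims to avoid. A short, purely illustrative numpy sketch:

```python
import numpy as np

def exact_rls(K, gamma):
    """gamma-ridge leverage scores tau_{t,i} = [K (K + gamma I)^{-1}]_{ii} (Eq. 4)."""
    t = K.shape[0]
    return np.diag(K @ np.linalg.inv(K + gamma * np.eye(t)))

def effective_dimension(K, gamma):
    """d_eff(gamma)_t = sum_i tau_{t,i}."""
    return exact_rls(K, gamma).sum()
```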

Proposition 1 (Alaoui and Mahoney [1]). Let $\varepsilon \in [0,1]$ and let $\mathcal{I}_n$ be the dictionary built with $m$ columns selected proportionally to the RLSs $\{\tau_{n,i}\}$. If $m = O\big(\tfrac{1}{\varepsilon^2}\, d_{\text{eff}}(\gamma)_n \log\tfrac{n}{\delta}\big)$, the Nyström approximation $\widetilde{\mathbf{K}}_n$ is a $\gamma$-approximation of $\mathbf{K}_n$, that is $0 \preceq \mathbf{K}_n - \widetilde{\mathbf{K}}_n \preceq \tfrac{\gamma}{1-\varepsilon}\mathbf{K}_n(\mathbf{K}_n + \gamma\mathbf{I})^{-1} \preceq \tfrac{\gamma}{1-\varepsilon}\mathbf{I}$, and the risk of $\widetilde{\mathbf{w}}_n$ satisfies $R(\widetilde{\mathbf{w}}_n) \le \big(1 + \tfrac{\gamma}{\mu}\tfrac{1}{1-\varepsilon}\big) R(\widehat{\mathbf{w}}_n)$.

Unfortunately, computing exact RLS requires storing $\mathbf{K}_n$ and has the same $O(n^2)$ space requirement as solving Eq. 2. In the next section we introduce SQUEAK, an RLS-based incremental algorithm that preserves the accuracy of Prop. 1 without requiring the RLS to be known in advance, and that generates a dictionary only a constant factor larger than that of exact RLS sampling.

3 Incremental Nyström approximation with ridge leverage scores

SQUEAK (Alg. 1) builds on the INK-ESTIMATE algorithm [3], with the major algorithmic difference that the sampling probabilities are computed directly from the estimates $\widetilde{\tau}_{t,i}$, without renormalizing them by an estimate of $d_{\text{eff}}(\gamma)_t$. SQUEAK introduces two key elements: 1) an improved, accurate estimator of the RLS and 2) an incremental sampling scheme for the construction of the dictionary $\mathcal{I}_t$.

1) Estimation of RLS. We introduce an RLS estimator that improves on [3] and show that it can be computed efficiently. At any time $t$, let $Q_t = \sum_i Q_{t,i}$ be the number of columns $|\mathcal{I}_t|$ contained in the dictionary at time $t$, and $\mathbf{S}_t \in \mathbb{R}^{t \times Q_t}$ the selection matrix constructed so far. Let $\mathbf{S}_{t+1} \in \mathbb{R}^{(t+1) \times (Q_t + q)}$ be constructed as $[\mathbf{S}_t, (q)^{-1/2}\mathbf{e}_{t+1,t+1}, \ldots, (q)^{-1/2}\mathbf{e}_{t+1,t+1}]$ by adding $q$ copies of $\mathbf{e}_{t+1,t+1}$ to the selection matrix. Denoting $\alpha = (1+\varepsilon)/(1-\varepsilon)$, we define the RLS estimator as

$$\widetilde{\tau}_{t+1,i} = \frac{1+\varepsilon}{\alpha\gamma}\Big(k_{i,i} - \mathbf{k}_{t+1,i}^{\mathsf{T}}\mathbf{S}_{t+1}\big(\mathbf{S}_{t+1}^{\mathsf{T}}\mathbf{K}_{t+1}\mathbf{S}_{t+1} + \gamma\mathbf{I}\big)^{-1}\mathbf{S}_{t+1}^{\mathsf{T}}\mathbf{k}_{t+1,i}\Big). \qquad (5)$$
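Written out, the estimator in Eq. 5 only touches the kernel entries indexed by the dictionary and the new point. The following numpy sketch makes the assumed inputs explicit (`kernel` is a user-supplied callable, `X_dict` and `p_dict` hold the stored points and their probabilities); it is illustrative, not the authors' code.

```python
import numpy as np

def approx_rls(kernel, X_dict, p_dict, x_new, x_i, q, gamma, eps):
    """Illustrative Eq. 5 estimate for a single point x_i.
    X_dict: points currently in the dictionary (one row per stored copy),
    p_dict: their stored probabilities p~_{t,i}; x_new is the incoming point x_{t+1}."""
    alpha = (1 + eps) / (1 - eps)
    # Points selected by the augmented matrix S_{t+1}: the dictionary plus q copies of x_new.
    pts = np.vstack([X_dict, np.repeat(x_new[None, :], q, axis=0)])
    w = np.concatenate([(q * np.asarray(p_dict)) ** -0.5, np.full(q, q ** -0.5)])
    K_SS = w[:, None] * kernel(pts, pts) * w[None, :]           # S^T K_{t+1} S
    k_Si = w * kernel(pts, x_i[None, :]).ravel()                # S^T k_{t+1,i}
    k_ii = kernel(x_i[None, :], x_i[None, :])[0, 0]             # K(x_i, x_i)
    inner = np.linalg.solve(K_SS + gamma * np.eye(len(w)), k_Si)
    return (1 + eps) / (alpha * gamma) * (k_ii - k_Si @ inner)
```

Since only the $Q_t + q$ selected points enter the computation, the cost matches the $O(Q_t^3)$ time and $O(Q_t^2)$ space discussed below.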


Algorithm 1: The SQUEAK algorithm
Input: dataset $\mathcal{D}$, regularization $\gamma$, $\mu$, $q$
Output: $\widetilde{\mathbf{K}}_n$, $\widetilde{\mathbf{w}}_n$

1:  Initialize $\mathcal{I}_0$ as empty, $\widetilde{p}_{1,0} = 1$
2:  for $t = 0, \ldots, n-1$ do
3:      Receive the new column $[\mathbf{k}_{t+1}, k_{t+1}]$
4:      Compute the $\alpha$-approximate RLSs $\{\widetilde{\tau}_{t+1,i} : i \in \mathcal{I}_t \cup \{t+1\}\}$ using $\mathcal{I}_t$, $[\mathbf{k}_{t+1}, k_{t+1}]$, and Eq. 5
5:      Set $\widetilde{p}_{t+1,i} = \max\{\min\{\widetilde{\tau}_{t+1,i}, \widetilde{p}_{t,i}\}, \widetilde{p}_{t,i}/2\}$
6:      Initialize $\mathcal{I}_{t+1} = \emptyset$                          (DICT-UPDATE)
7:      for all $j \in \{1, \ldots, t\}$ do                                 (SHRINK)
8:          $Q_{t,j} = |\{i = j : i \in \mathcal{I}_t\}|$
9:          if $Q_{t,j} \neq 0$ then
10:             $Q_{t+1,j} \sim \mathcal{B}(\widetilde{p}_{t+1,j}/\widetilde{p}_{t,j},\, Q_{t,j})$
11:             Add $Q_{t+1,j}$ copies of $(j, \mathbf{k}_{t+1,j}, \widetilde{p}_{t+1,j})$ to $\mathcal{I}_{t+1}$
12:         end if
13:     end for
14:     $Q_{t+1,t+1} \sim \mathcal{B}(\widetilde{p}_{t+1,t+1},\, q)$        (EXPAND)
15:     Add $Q_{t+1,t+1}$ copies of $(t+1, \mathbf{k}_{t+1,t+1}, \widetilde{p}_{t+1,t+1})$ to $\mathcal{I}_{t+1}$
16: end for
17: Compute $\widetilde{\mathbf{K}}_n$ using $\mathcal{I}_n$ and Eq. 3
18: Compute $\widetilde{\mathbf{w}}_n$ using $\widetilde{\mathbf{K}}_n$ and $\mathbf{y}_n$

If $Q_t \ge q$, then $\widetilde{\tau}_{t+1,i}$ can be computed in $O(Q_t^3)$ time ($O(Q_t)$ to compute $\mathbf{k}_{t+1,i}^{\mathsf{T}}\mathbf{S}_{t+1}$ and $O(Q_t^3)$ to invert the inner matrix) and $O(Q_t^2)$ space. If $Q_t < q$, the same applies with $q$ replacing $Q_t$. Furthermore, we have the following guarantee.

Lemma 1. Assume that the dictionary $\mathcal{I}_t$ induces a $\gamma$-approximate kernel $\widetilde{\mathbf{K}}_t$. Then for all $i \in \mathcal{I}_t \cup \{t+1\}$, the estimate $\widetilde{\tau}_{t+1,i}$ computed using Eq. 5 is an $\alpha$-approximation of the RLS $\tau_{t+1,i}$, that is $\tau_{t+1,i}(\gamma)/\alpha \le \widetilde{\tau}_{t+1,i} \le \tau_{t+1,i}(\gamma)$.

2) Sequential sampling. At each time step $t$, SQUEAK receives a new column $[\mathbf{k}_{t+1}, k_{t+1}]$. This can be implemented either by having a separate process construct each column sequentially and stream it to SQUEAK, or by storing just the samples (with an additional $O(td)$ space complexity) and computing the column on arrival. Adding a new column to the matrix can either decrease the importance of previously observed columns (if they are correlated with the new column) or leave it unchanged (if they are orthogonal), and thus the RLS evolve as $\tau_{t+1,i} \le \tau_{t,i}$ [3, App. A, Lem. 4].

In the DICT-UPDATE loop, the dictionary is updated to reflect the change in importance of the old columns (e.g., $\widetilde{p}_{t,i}$ may decrease) and to add the new column proportionally to its RLS $\tau_{t+1,t+1}$. The dictionary $\mathcal{I}_t$ and the new column are used to compute the new approximate RLSs $\widetilde{\tau}_{t+1,i}$ as in Eq. 5, which in turn define the new sampling probabilities $\widetilde{p}_{t+1,i}$. The DICT-UPDATE phase is composed of two steps. For each index $i \in [t]$, the SHRINK step counts the number of copies $Q_{t,i}$ present in $\mathcal{I}_t$ and then draws a sample from the binomial $\mathcal{B}(\widetilde{p}_{t+1,i}/\widetilde{p}_{t,i}, Q_{t,i})$, where taking $\widetilde{p}_{t+1,i} = \min\{\widetilde{\tau}_{t+1,i}, \widetilde{p}_{t,i}\}$ ensures that the binomial probability at line 10 is well defined. The smaller $\widetilde{p}_{t+1,i}$ is compared to $\widetilde{p}_{t,i}$, the smaller $Q_{t+1,i}$ will be compared to $Q_{t,i}$. If the probability $\widetilde{p}_{t+1,i}$ continues to decrease over time, $Q_{t+1,i}$ may even drop to zero, in which case column $i$ is completely removed from the dictionary. Intuitively, the SHRINK step stochastically reduces the size of the dictionary to reflect the reduction of the RLSs. Conversely, the EXPAND step adds the new column to the dictionary with a number of copies (from 0 to $q$) that depends on its estimated relevance $\widetilde{p}_{t+1,t+1}$. Unlike in [3], the approximate probabilities $\widetilde{p}_{t,i}$ are not obtained by normalizing the approximate $\widetilde{\tau}_{t,i}$ by an estimate of the effective dimension, and thus they do not necessarily sum to one. Yet, we guarantee that $\widetilde{p}_{t,i} \le p_{t,i} \le 1$ by construction. Note that SQUEAK never re-estimates the RLS of a column dropped from $\mathcal{I}_t$. Moreover, computing Eq. 5 only requires constructing the kernel sub-matrix for the samples whose indices are in $\mathcal{I}_t$. Therefore, if we are only interested in estimating the approximate RLSs $\widetilde{\tau}_{t,i}$ and not the regression weights $\widetilde{\mathbf{w}}_t$, SQUEAK is the first RLS sampling algorithm that can operate in a single pass over the dataset (storing and accessing only the samples in $\mathcal{I}_t$ instead of the whole $\mathcal{D}_t$), without ever constructing the whole matrix. A sketch of the main loop is given below; Thm. 1 guarantees that SQUEAK succeeds in this task.
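To make the SHRINK and EXPAND steps concrete, the following numpy sketch reimplements the main loop of Alg. 1 under simplifying assumptions: all samples are kept in memory (the $O(td)$ option mentioned above), `kernel(A, B)` is a user-supplied callable returning the kernel matrix between the rows of A and B, and the Eq. 5 estimate is recomputed from scratch at every step. It is an illustration of the sampling scheme, not the authors' implementation.

```python
import numpy as np

def squeak(X, kernel, gamma, q, eps):
    """Sketch of Alg. 1. Returns the final dictionary as (index, p_tilde) pairs, with multiplicity."""
    alpha = (1 + eps) / (1 - eps)
    dictionary = []                                    # I_t: entries (index j, probability p~_{t,j})
    for t in range(len(X)):
        dict_idx = [j for j, _ in dictionary]
        dict_p = np.array([p for _, p in dictionary])
        # Points selected by S_{t+1}: current dictionary plus q copies of the new point x_t.
        pts = np.vstack([X[dict_idx], np.repeat(X[t][None, :], q, axis=0)])
        w = np.concatenate([(q * dict_p) ** -0.5, np.full(q, q ** -0.5)])
        A = np.linalg.inv(w[:, None] * kernel(pts, pts) * w[None, :] + gamma * np.eye(len(w)))

        def tau(i):
            # Eq. 5 estimate of the RLS of point i at time t+1.
            k_Si = w * kernel(pts, X[i][None, :]).ravel()
            k_ii = kernel(X[i][None, :], X[i][None, :])[0, 0]
            return (1 + eps) / (alpha * gamma) * (k_ii - k_Si @ A @ k_Si)

        new_dictionary = []
        # SHRINK: resample every old column with probability p~_{t+1,j} / p~_{t,j}.
        for j in sorted(set(dict_idx)):
            copies = dict_idx.count(j)
            p_old = dict_p[dict_idx.index(j)]
            p_new = max(min(tau(j), p_old), p_old / 2)        # line 5 of Alg. 1
            kept = np.random.binomial(copies, p_new / p_old)
            new_dictionary += [(j, p_new)] * kept
        # EXPAND: add the new column with at most q copies.
        p_new_pt = min(tau(t), 1.0)                           # previous probability of a new point is 1
        new_dictionary += [(t, p_new_pt)] * np.random.binomial(q, p_new_pt)
        dictionary = new_dictionary
    return dictionary
```

The returned dictionary can then be plugged into the Nyström routine sketched after Eq. 3 to obtain $\widetilde{\mathbf{K}}_n$ and $\widetilde{\mathbf{w}}_n$, as in lines 17-18 of Alg. 1.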


| Method | Time | $|\mathcal{I}_n|$ (total space $= O(n|\mathcal{I}_n|)$) | Acc. loss | Increm. |
|---|---|---|---|---|
| EXACT | $n^3$ | $n$ | $1$ | N/A |
| Bach [2] | $\frac{n\, d_{\max,n}^2}{\varepsilon^2} + \frac{d_{\max,n}^3}{\varepsilon^3}$ | $\frac{d_{\max,n}}{\varepsilon}$ | $(1+4\varepsilon)$ | No |
| Alaoui and Mahoney [1] | $n\,|\mathcal{I}_n|^2$ | $\frac{\lambda_{\min}+n\mu\varepsilon}{\lambda_{\min}-n\mu\varepsilon}\, d_{\text{eff}}(\gamma)_n + \frac{\operatorname{Tr}(\mathbf{K}_n)}{\mu\varepsilon}$ | $(1+2\varepsilon)^2$ | No |
| Calandriello et al. [3] | $\frac{\lambda_{\max}^2}{\gamma^2}\,\frac{n^2 d_{\text{eff}}(\gamma)_n^2}{\varepsilon^2}$ | $\frac{\lambda_{\max}}{\gamma}\,\frac{d_{\text{eff}}(\gamma)_n}{\varepsilon^2}$ | $(1+2\varepsilon)^2$ | Yes |
| SQUEAK | $\frac{n^2 d_{\text{eff}}(\gamma)_n^2}{\varepsilon^2}$ | $\frac{d_{\text{eff}}(\gamma)_n}{\varepsilon^2}$ | $(1+2\varepsilon)^2$ | Yes |
| RLS-SAMPLING | $\frac{n\, d_{\text{eff}}(\gamma)_n^2}{\varepsilon^2}$ | $\frac{d_{\text{eff}}(\gamma)_n}{\varepsilon^2}$ | $(1+2\varepsilon)^2$ | N/A |

Table 1: Comparison of Nyström methods. $\lambda_{\max}$ and $\lambda_{\min}$ refer to the largest and smallest eigenvalues of $\mathbf{K}_n$.

Theorem 1. Let $\alpha = \frac{1+\varepsilon}{1-\varepsilon}$ and $\gamma > 1$. For any $0 \le \varepsilon \le 1$ and $0 \le \delta \le 1$, if we run Alg. 1 with parameter $q = O\big(\tfrac{\alpha}{\varepsilon^2}\log\tfrac{n}{\delta}\big)$ to compute a sequence of random dictionaries $\mathcal{I}_t$, each with a random number of entries $|\mathcal{I}_t|$, then with probability $1-\delta$, for all iterations $t \in [n]$:

(1) The Nyström approximation $\widetilde{\mathbf{K}}_t$ (Eq. 3) associated with $\mathcal{I}_t$ is a $\gamma$-approximation of $\mathbf{K}_t$.
(2) The number of stored columns is $|\mathcal{I}_t| = \sum_i Q_{t,i} \le O\big(q\, d_{\text{eff}}(\gamma)_t\big) \le O\big(\tfrac{\alpha}{\varepsilon^2}\, d_{\text{eff}}(\gamma)_n \log\tfrac{n}{\delta}\big)$.
(3) The solution $\widetilde{\mathbf{w}}_t$ satisfies $R(\widetilde{\mathbf{w}}_t) \le \big(1 + \tfrac{\gamma}{\mu}\tfrac{1}{1-\varepsilon}\big) R(\widehat{\mathbf{w}}_t)$.

As the previous theorem holds for any $t \in [n]$, SQUEAK has any-time guarantees on its space complexity, approximation, and risk performance. In fact, (1) combined with Lem. 1 shows that, at all steps, the $\widetilde{\tau}_{t,i}$ are $\alpha$-approximate RLS estimates. Since adding a column to $\mathbf{K}_t$ can only increase the effective dimension (i.e., $d_{\text{eff}}(\gamma)_t \le d_{\text{eff}}(\gamma)_{t+1}$) [3, App. A, Lem. 5], from (2) we see that the number of columns stored by SQUEAK over the iterations never exceeds the budget $O(d_{\text{eff}}(\gamma)_n \log(n))$ required when sampling columns according to the exact RLS computed over the whole dataset. Notice that this is obtained by automatically increasing the dictionary size (and space occupation) over time to adapt to the growth of the effective dimension of the data, which does not need to be known in advance.

Furthermore, if the size of the dictionary grows too large w.r.t. the available memory, we can still terminate the algorithm knowing that the intermediate dictionary returned is a good approximation of the part of the dataset processed so far. We can also restart the process with a larger $\gamma$, since $d_{\text{eff}}(\gamma)_n$ is inversely proportional to $\gamma$. The trade-offs of this approach are quantified by (3), which shows that every solution $\widetilde{\mathbf{w}}_t$ incurs a risk only a factor roughly $(1 + \gamma/\mu)$ away from that of the corresponding exact solution $\widehat{\mathbf{w}}_t$. This means that choosing a small $\gamma < \mu$ allows us to achieve a risk close to the exact solution for a large range of $\mu$, at the cost of increased space, while a larger $\gamma$ requires less space but may prevent tuning $\mu$ optimally. Finally, it is important to notice that even in the worst case $d_{\text{eff}}(\gamma)_n = n$, SQUEAK requires only a $\log(n)$ factor more space than storing the whole matrix.

4 Discussion

Table 1 compares several Nyström approximation methods w.r.t. their space complexity and risk.

For all methods, we omit $O(\log(n))$ factors. The space complexity of uniform sampling [2] scales with the maximum degree of freedom $d_{\max}$. Since $d_{\max} = n \max_i \tau_{n,i} \ge \sum_i \tau_{n,i} = d_{\text{eff}}(\gamma)_n$, uniform sampling is often outperformed by RLS sampling. While Alaoui and Mahoney [1] also sample according to the RLS, their two-pass estimator is not very accurate. In particular, the first pass requires sampling $O\big(n\mu\varepsilon/(\lambda_{\min} - n\mu\varepsilon)\big)$ columns, which quickly grows above $n^2$ when $\lambda_{\min}$ becomes small. Finally, [3] requires the maximum dictionary size to be fixed in advance, which implies some knowledge of the effective dimension $d_{\text{eff}}(\gamma)_n$, and requires estimating both $\widetilde{\tau}_{t,i}$ and $\widetilde{d}_{\text{eff}}(\gamma)_t$. In particular, this extra estimation effort causes an additional $\lambda_{\max}/\gamma$ factor to appear in the space complexity. This factor cannot be easily estimated and leads to a space complexity of $n^3$ in the worst case. We also include RLS-SAMPLING, a fictitious algorithm that receives the exact RLS as input, as an ideal baseline for all RLS sampling algorithms. From the table we can therefore see that SQUEAK achieves the same space complexity (up to constant factors) as if the RLS were known in advance. Moreover, although in this paper we only considered fixed-design KRR, $\gamma$-approximation guarantees for $\widetilde{\mathbf{K}}_n$ are commonly used in similar problems such as random-design KRR or kernel PCA. Finally, with a more careful analysis, we can generalize SQUEAK and its guarantees to the distributed setting, where multiple machines construct dictionaries in parallel on separate datasets and then recursively merge them to construct a dictionary for the union of the datasets.


References

[1] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 2015.

[2] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, 2013.

[3] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Analysis of Nyström method with sequential ridge leverage scores. In Uncertainty in Artificial Intelligence, 2016.

[4] Alex Gittens and Michael W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning, 2013.

[5] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Neural Information Processing Systems, 2015.
