
HAL Id: hal-02019402

https://hal.archives-ouvertes.fr/hal-02019402v2

Preprint submitted on 28 May 2019


Efficient online learning with kernels for adversarial large scale problems

Rémi Jézéquel, Pierre Gaillard, Alessandro Rudi

To cite this version:

Rémi Jézéquel, Pierre Gaillard, Alessandro Rudi. Efficient online learning with kernels for adversarial large scale problems. 2019. hal-02019402v2


EFFICIENT ONLINE LEARNING WITH KERNELS FOR ADVERSARIAL LARGE SCALE PROBLEMS

Rémi Jézéquel Pierre Gaillard Alessandro Rudi

INRIA - Département d’Informatique de l’École Normale Supérieure PSL Research University

Paris, France

ABSTRACT

We are interested in a framework of online learning with kernels for low-dimensional but large-scale and potentially adversarial datasets. We study the computational and theoretical performance of online variations of kernel Ridge regression. Despite its simplicity, the algorithm we study is the first to achieve the optimal regret for a wide range of kernels with a per-round complexity of order $n^\alpha$ with $\alpha < 2$.

The algorithm we consider is based on approximating the kernel with the linear span of basis functions.

Our contribution is two-fold: 1) For the Gaussian kernel, we propose to build the basis beforehand (independently of the data) through a Taylor expansion. For $d$-dimensional inputs, we provide a (close to) optimal regret of order $O((\log n)^{d+1})$ with per-round time and space complexity $O((\log n)^{2d})$. This makes the algorithm a suitable choice as soon as $n \gg e^d$, which is likely to happen in scenarios with small-dimensional and large-scale datasets; 2) For general kernels with low effective dimension, the basis functions are updated sequentially in a data-adaptive fashion by sampling Nyström points. In this case, our algorithm improves the computational trade-off known for online kernel regression.

1 Introduction

Nowadays the volume and velocity of data flows are rapidly increasing. Consequently, many applications need to switch from batch to online procedures that can treat and adapt to data on the fly. Furthermore, to take advantage of very large datasets, non-parametric methods are gaining increasing momentum in practice. Yet the latter often suffer from slow rates of convergence and high computational complexity. At the same time, data is getting more complicated and simple stochastic assumptions, such as i.i.d. data, are often not satisfied. In this paper, we combine these different aspects arising from large-scale and arbitrary data. We build a non-parametric online procedure based on kernels which is efficient for large datasets and achieves close to optimal theoretical guarantees.

Online learning is a subfield of machine learning where some learner sequentially interacts with an environment and tries to learn and adapt on the fly to the observed data as one goes along. We consider the following sequential setting.

At each iteration $t \ge 1$, the learner receives an input $x_t \in \mathcal X$, makes a prediction $\hat y_t \in \mathbb R$, and the environment reveals the output $y_t \in \mathbb R$. The inputs $x_t$ and the outputs $y_t$ are sequentially chosen by the environment and can be arbitrary.

The learner's goal is to minimize its cumulative regret
$$R_n(f) := \sum_{t=1}^n (y_t - \hat y_t)^2 - \sum_{t=1}^n \big(y_t - f(x_t)\big)^2 \qquad (1)$$
uniformly over all functions $f$ in a space of functions $\mathcal H$. We will consider a Reproducing Kernel Hilbert Space (RKHS) $\mathcal H$ [see next section or 1, for more details]. It is worth noting here that all the properties of a RKHS are controlled by the associated kernel function $k : \mathcal X \times \mathcal X \to \mathbb R$, usually known in closed form, and that many function spaces of interest are (or are contained in) RKHS, e.g. when $\mathcal X \subseteq \mathbb R^d$: polynomials of arbitrary degree, band-limited functions, analytic functions with given decay at infinity, Sobolev spaces and many others [2].


Previous work. Kernel regression in a statistical setting has been widely studied by the statistics community. Our setting of online kernel regression with adversarial data is more recent. Most existing work focuses on the linear setting (i.e., the linear kernel). The first work on online linear regression dates back to [3]. [4] provided the minimax rates (together with an algorithm) and we refer the reader to the references therein for a recent overview of the literature in the linear case. We only recall the work relevant for this paper. [5, 6] designed the nonlinear Ridge forecaster (denoted AWV).

In linear regression (linear kernel), it achieves the optimal regret of order $O(d\log n)$ uniformly over all $\ell_2$-bounded vectors. The latter can be extended to kernels (see Definition (3)), which we refer to as Kernel-AWV. With regularization parameter $\lambda > 0$, it obtains a regret upper-bounded for all $f \in \mathcal H$ as
$$R_n(f) \lesssim \lambda \|f\|^2 + B^2 d_{\mathrm{eff}}(\lambda), \qquad \text{where} \quad d_{\mathrm{eff}}(\lambda) := \operatorname{Tr}\big(K_{nn}(K_{nn} + \lambda I_n)^{-1}\big) \qquad (2)$$
is the effective dimension, and where $K_{nn} := \big(k(x_i, x_j)\big)_{1 \le i,j \le n} \in \mathbb R^{n \times n}$ denotes the kernel matrix at time $n$. The above upper-bound on the regret is essentially optimal (see next section). Yet the per-round complexity and the space complexity of Kernel-AWV are $O(n^2)$. In this paper, we aim at reducing this complexity while keeping optimal regret guarantees.

Though the literature on online contextual learning is vast, little of it considers non-parametric function classes. Related work includes [7], which considers the Exponentially Weighted Average forecaster, and [8], which considers bounded Lipschitz function sets and Lipschitz loss functions, while here we focus on the square loss. Minimax rates for general function sets $\mathcal H$ are provided by [9]. RKHS spaces were first considered in [10], though they only obtain $O(\sqrt n)$ rates, which are suboptimal for our problem. More recently, a regret bound of the form (2) was proved by [11] for a clipped version of kernel Ridge regression and by [12] for a clipped version of Kernel Online Newton Step (KONS) for general exp-concave loss functions.

The computational complexity ($O(n^2)$ per round) of these algorithms is however prohibitive for large datasets. [12] and [13] provide approximations of KONS to get manageable complexities. However, these come with deteriorated regret guarantees. [12] improves the time and space complexities by a factor $\gamma \in (0,1)$, enlarging the regret upper-bound by $1/\gamma$. [13] designs an efficient approximation of KONS based on Nyström approximation [14, 15] and restarts, with per-round complexity $O(m^2)$ where $m$ is the number of Nyström points. Yet their regret bound suffers an additional multiplicative factor $m$ with respect to (2) because of the restarts. Furthermore, contrary to our results, the regret bounds of [12] and [13] are not with respect to all functions in $\mathcal H$ but only with respect to functions $f \in \mathcal H$ such that $f(x_t) \le C$ for all $t \ge 1$, where $C$ is a parameter of their algorithm. Since $C$ comes as a multiplicative factor in their bounds, their results are sensitive to outliers that may lead to a large $C$. Another relevant approximation scheme for online kernel learning was proposed by [16]. The authors consider online gradient descent algorithms, which they approximate using Nyström approximation or a Fourier basis. However, since they use general Lipschitz loss functions and consider $\ell_1$-bounded dual norms of functions $f$, their regret bounds of order $O(\sqrt n)$ are hardly comparable to ours and seem suboptimal in $n$ in our more restrictive setting with square loss and kernels with small effective dimension (such as the Gaussian kernel).

Contributions and outline of the paper. The main contribution of the paper is to analyse a variant of Kernel-AWV that we call PKAWV (see Definition (4)). Despite its simplicity, it is, to our knowledge, the first algorithm for online kernel regression that recovers the optimal regret (see bound (2)) with an improved space and time complexity of order $o(n^2)$ per round. Table 1 summarizes the regret rates and complexities obtained by our algorithm and the ones of [12, 13].

Our procedure consists simply in applying Kernel-AWV while, at time $t \ge 1$, approximating the RKHS $\mathcal H$ with a linear subspace $\tilde{\mathcal H}_t$ of smaller dimension. In Theorem 3, PKAWV suffers an additional approximation term with respect to the optimal bound of Kernel-AWV, which can be made small enough by properly choosing $\tilde{\mathcal H}_t$. To achieve the optimal regret with a low computational complexity, $\tilde{\mathcal H}_t$ needs to approximate $\mathcal H$ well and to be low dimensional with an easy-to-compute projection. We provide two relevant constructions for $\tilde{\mathcal H}_t$.

In Section 3.1, we focus on the Gaussian kernel, which we approximate by a finite set of basis functions. The functions are deterministic and chosen beforehand by the player, independently of the data. The number of functions included in the basis is a parameter to be optimized and fixes an approximation-computation trade-off. Theorem 4 shows that PKAWV satisfies (up to log factors) the optimal regret bound (2) while enjoying a per-round space and time complexity of $O\big(\log^{2d}\frac{n}{\lambda}\big)$. For the Gaussian kernel, this corresponds to $O\big(d_{\mathrm{eff}}(\lambda)^2\big)$, which is known to be optimal even in the statistical setting with i.i.d. data.

In Section 3.2, we consider data-adaptive approximation spaces $\tilde{\mathcal H}_t$ based on Nyström approximation. At time $t \ge 1$, we approximate any RKHS $\mathcal H$ by sampling a subset of the input vectors $\{x_1,\dots,x_t\}$. If the kernel satisfies the capacity condition $d_{\mathrm{eff}}(\lambda) \le (n/\lambda)^\gamma$ for $\gamma \in (0,1)$, the optimal regret is then of order $d_{\mathrm{eff}}(\lambda) = O(n^{\gamma/(1+\gamma)})$ for a well-tuned parameter $\lambda$. Our method then recovers the optimal regret with a computational complexity of $O\big(d_{\mathrm{eff}}(\lambda)^{4/(1-\gamma)}\big)$. The latter is $o(n^2)$ (for well-tuned $\lambda$) as soon as $\gamma < \sqrt 2 - 1$. Furthermore, if the sequence of input vectors $x_t$ is given


Gaussian kernel, $d_{\mathrm{eff}}(\lambda) \le \big(\log\frac{n}{\lambda}\big)^d$:
– PKAWV: regret $(\log n)^{d+1}$, per-round complexity $(\log n)^{2d}$
– Sketched-KONS [12] ($c > 0$): regret $c\,(\log n)^{d+1}$, per-round complexity $n^2/c^2$
– Pros-N-KONS [13]: regret $(\log n)^{2d+1}$, per-round complexity $(\log n)^{2d}$

General kernel, $d_{\mathrm{eff}}(\lambda) \le \big(\frac{n}{\lambda}\big)^\gamma$ with $\gamma < \sqrt 2 - 1$:
– PKAWV: regret $n^{\frac{\gamma}{1+\gamma}}\log n$, per-round complexity $n^{\frac{4\gamma}{1-\gamma^2}}$
– Sketched-KONS [12] ($c > 0$): regret $c\,n^{\frac{\gamma}{1+\gamma}}\log n$, per-round complexity $n^2/c^2$
– Pros-N-KONS [13]: regret $n^{\frac{4\gamma}{(1+\gamma)^2}}\log n$, per-round complexity $n^{\frac{4\gamma(1-\gamma)}{(1+\gamma)^2}}$

Table 1: Order in $n$ of the best possible regret rates achievable by the algorithms and the corresponding per-round time complexity. Up to $\log n$ factors, the rates obtained by PKAWV are optimal.

beforehand to the algorithm, the per-round complexity needed to reach the optimal regret is improved to $O(d_{\mathrm{eff}}(\lambda)^4)$ and our algorithm can achieve it for all $\gamma \in (0,1)$.

Finally, we perform in Section 4 several experiments based on real and simulated data to compare the performance (in regret and in time) of our methods with competitors.

Notations. We recall here basic notations that we will use throughout the paper. Given a vector $v \in \mathbb R^d$, we write $v = (v^{(1)},\dots,v^{(d)})$. We denote by $\mathbb N_0 = \mathbb N \cup \{0\}$ the set of non-negative integers and, for $p \in \mathbb N_0^d$, $|p| = p^{(1)} + \dots + p^{(d)}$. By a slight abuse of notation, we denote by $\|\cdot\|$ both the Euclidean norm and the norm of the Hilbert space $\mathcal H$. We write $v^\top w$ for the dot product between $v, w \in \mathbb R^D$. The conjugate transpose (adjoint) of a linear operator $Z$ on $\mathcal H$ will be denoted $Z^*$. The notation $\lesssim$ will refer to rough inequalities up to logarithmic multiplicative factors. Finally, we denote $a \vee b = \max(a,b)$ and $a \wedge b = \min(a,b)$ for $a, b \in \mathbb R$.

2 Background

Kernels. Let $k : \mathcal X \times \mathcal X \to \mathbb R$ be a positive definite kernel [1] that we assume to be bounded (i.e., $\sup_{x \in \mathcal X} k(x,x) \le \kappa^2$ for some $\kappa > 0$). The function $k$ is characterized by the existence of a feature map $\phi : \mathcal X \to \mathbb R^D$, with $D \in \mathbb N \cup \{\infty\}$ (when $D = \infty$ we consider $\mathbb R^D$ as the space of square-summable sequences), such that $k(x,x') = \phi(x)^\top \phi(x')$. Moreover, the reproducing kernel Hilbert space (RKHS) associated to $k$ is characterized by $\mathcal H = \{f \mid f(x) = w^\top\phi(x),\ w \in \mathbb R^D,\ x \in \mathcal X\}$, with inner product $\langle f, g\rangle_{\mathcal H} := v^\top w$ for $f, g \in \mathcal H$ defined by $f(x) = v^\top\phi(x)$, $g(x) = w^\top\phi(x)$ and $v, w \in \mathbb R^D$. For more details and different characterizations of $k$ and $\mathcal H$, see [1, 2]. It is worth noting that the knowledge of $\phi$ is not necessary when working with functions of the form $f = \sum_{i=1}^p \alpha_i\phi(x_i)$, with $\alpha_i \in \mathbb R$, $x_i \in \mathcal X$ and finite $p \in \mathbb N$: indeed $f(x) = \sum_{i=1}^p \alpha_i\phi(x_i)^\top\phi(x) = \sum_{i=1}^p \alpha_i k(x_i, x)$, and moreover $\|f\|_{\mathcal H}^2 = \alpha^\top K_{pp}\alpha$, with $K_{pp}$ the kernel matrix associated to the set of points $x_1,\dots,x_p$.
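These identities translate directly into code; the small numpy check below (Gaussian kernel and random anchor points are illustrative assumptions) evaluates $f(x) = \sum_i \alpha_i k(x_i,x)$ and $\|f\|_{\mathcal H}^2 = \alpha^\top K_{pp}\alpha$ without ever forming $\phi$.

```python
import numpy as np

def gaussian_k(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
Xp = rng.normal(size=(5, 2))      # anchor points x_1, ..., x_p
alpha = rng.normal(size=5)        # coefficients alpha_i

def f(x):
    # f(x) = sum_i alpha_i k(x_i, x)
    return float(alpha @ gaussian_k(Xp, x[None, :]).ravel())

K_pp = gaussian_k(Xp, Xp)
norm_sq = float(alpha @ K_pp @ alpha)   # ||f||_H^2 = alpha^T K_pp alpha
print(f(np.zeros(2)), norm_sq)
```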

Kernel-AWV. We consider here a straightforward generalization to kernels (denoted Kernel-AWV) of the nonlinear Ridge forecaster (AWV) introduced and analyzed by [5, 6] on the space of linear functions on $\mathcal X = \mathbb R^d$. At iteration $t \ge 1$, Kernel-AWV predicts $\hat y_t = \hat f_t(x_t)$, where

$$\hat f_t \in \operatorname*{arg\,min}_{f \in \mathcal H} \Big\{ \sum_{s=1}^{t-1} \big(y_s - f(x_s)\big)^2 + \lambda \|f\|^2 + f(x_t)^2 \Big\}. \qquad (3)$$

A variant of this algorithm, more commonly used in the context of data sampled independently from a distribution, is known as kernel Ridge regression. It corresponds to solving the problem above without the last penalization term $f(x_t)^2$.

Optimal regret for Kernel-AWV. In the next proposition we state a preliminary result which proves that Kernel-AWV achieves a regret depending on the eigenvalues of the kernel matrix.
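Since the problem in (3) is kernel ridge regression on $(x_1,y_1),\dots,(x_{t-1},y_{t-1})$ together with the artificial pair $(x_t, 0)$ that encodes the extra penalty $f(x_t)^2$, a naive implementation is a few lines per round. The numpy sketch below (Gaussian kernel assumed, $O(t^3)$ per round, for illustration only; the efficient versions are the point of this paper) makes that explicit.

```python
import numpy as np

def kernel_awv_predictions(X, y, lam=1.0, sigma=1.0):
    """Naive Kernel-AWV: at round t, ridge regression on the past pairs plus
    (x_t, 0), which encodes the extra penalization f(x_t)^2 in (3)."""
    n = len(y)
    preds = np.zeros(n)
    for t in range(n):
        Xt = X[:t + 1]
        sq = np.sum(Xt ** 2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Xt @ Xt.T, 0.0)
        K = np.exp(-d2 / (2 * sigma ** 2))
        y_tilde = np.concatenate([y[:t], [0.0]])          # last target set to 0
        alpha = np.linalg.solve(K + lam * np.eye(t + 1), y_tilde)
        preds[t] = K[t] @ alpha                            # hat y_t = f_t(x_t)
    return preds
```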

Proposition 1. Let $\lambda, B > 0$. For any RKHS $\mathcal H$, for all $n \ge 1$, for all inputs $x_1,\dots,x_n \in \mathcal X$ and all $y_1,\dots,y_n \in [-B,B]$, the regret of Kernel-AWV is upper-bounded for all $f \in \mathcal H$ as
$$R_n(f) \le \lambda\|f\|^2 + B^2\sum_{k=1}^n \log\Big(1 + \frac{\lambda_k(K_{nn})}{\lambda}\Big),$$
where $\lambda_k(K_{nn})$ denotes the $k$-th largest eigenvalue of $K_{nn}$.

The proof is a direct consequence of the known regret bound of AWV in the finite-dimensional linear regression setting (see Theorem 11.8 of [17] or Theorem 2 of [18]). For completeness, we reproduce the analysis for infinite-dimensional spaces (RKHS) in Appendix C.1. In online linear regression in dimension $d$, the above result implies the optimal rate of convergence $dB^2\log(n) + O(1)$ (see [18] and [6]). As shown by the following proposition, Proposition 1 yields optimal regret (up to log factors) of the form (2) for online kernel regression.

Proposition 2. For all $n \ge 1$, $\lambda > 0$ and all input sequences $x_1,\dots,x_n \in \mathcal X$,
$$\sum_{k=1}^n \log\Big(1 + \frac{\lambda_k(K_{nn})}{\lambda}\Big) \le \log\Big(e + \frac{en\kappa^2}{\lambda}\Big)\, d_{\mathrm{eff}}(\lambda).$$

Combined with Proposition 1, this entails that Kernel-AWV satisfies (up to the logarithmic factor) the optimal regret bound (2). As discussed in the introduction, such an upper-bound on the regret is not new and was already proved by [11] or by [12] for other algorithms. An advantage of Kernel-AWV is that it does not require any clipping, and thus no prior knowledge of $B > 0$, to obtain Proposition 1. Furthermore, we slightly improve the constants in the above proposition. We note that the regret bound for Kernel-AWV is optimal up to log terms when $\lambda$ is chosen to minimize the r.h.s. of Eq. (2), since it meets known minimax lower bounds for the setting where $(x_i, y_i)_{i=1}^n$ are sampled independently from a distribution. For more details on minimax lower bounds see [19], in particular Eq. (9), the related discussion and references therein, noting that their $\lambda$ corresponds to our $\lambda/n$ and our $d_{\mathrm{eff}}(\lambda)$ corresponds to their $\gamma(\lambda/n)$.

It is worth pointing out that in the worst case $d_{\mathrm{eff}}(\lambda) \le \kappa^2 n/\lambda$ for any bounded kernel. In particular, optimizing the bound yields $\lambda = O(\sqrt{n\log n})$ and a regret bound of order $O(\sqrt{n\log n})$. In the special case of the Gaussian kernel (which we consider in Section 3.1), the latter can be improved to $d_{\mathrm{eff}}(\lambda) \lesssim \log(n/\lambda)^d$ (see [20]), which entails $R_n(f) \le O\big((\log n)^{d+1}\big)$ for a well-tuned value of $\lambda$.

3 Online Kernel Regression with projections

In the previous section we have seen that Kernel-AWV achieves optimal regret. Yet, it has computational requirements of $O(n^3)$ in time and $O(n^2)$ in space for $n$ steps of the algorithm, making it unfeasible in the context of large-scale datasets, i.e., $n \gtrsim 10^5$. In this paper, we consider and analyze a simple variation of Kernel-AWV denoted PKAWV. At time $t \ge 1$, for a regularization parameter $\lambda > 0$ and a linear subspace $\tilde{\mathcal H}_t$ of $\mathcal H$, the algorithm predicts $\hat y_t = \hat f_t(x_t)$, where

$$\hat f_t = \operatorname*{arg\,min}_{f \in \tilde{\mathcal H}_t} \Big\{ \sum_{s=1}^{t-1} \big(y_s - f(x_s)\big)^2 + \lambda \|f\|^2 + f(x_t)^2 \Big\}. \qquad (4)$$

In the next subsections, we describe relevant approximations $\tilde{\mathcal H}_t$ (typically the span of a small number of basis functions) of $\mathcal H$ that trade off good approximation against low computational cost. Appendix H details how (4) can be efficiently implemented in these cases.
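When $\tilde{\mathcal H}_t$ is the span of a fixed, explicit feature map $z : \mathcal X \to \mathbb R^m$, problem (4) becomes an $m$-dimensional ridge regression and can be maintained with $O(m^2)$ memory. The sketch below is a simplified illustration under the assumption that the basis behaves like a feature map for the kernel (so that $\|f\|^2$ is approximated by the squared norm of the weights); it is not the paper's exact Appendix H implementation.

```python
import numpy as np

class ProjectedAWV:
    """PKAWV with a fixed explicit feature map z: X -> R^m (e.g. the Taylor
    features of Section 3.1).  Problem (4) is solved as ridge regression on the
    weights; O(m^3) per round with a solve (O(m^2) is possible with rank-one
    updates of the inverse)."""

    def __init__(self, feature_map, dim, lam=1.0):
        self.z = feature_map
        self.G = lam * np.eye(dim)   # lambda I + sum_s z(x_s) z(x_s)^T
        self.b = np.zeros(dim)       # sum_s y_s z(x_s)

    def predict(self, x):
        self._zt = self.z(x)
        # AWV-style prediction: include the current point's rank-one term.
        w = np.linalg.solve(self.G + np.outer(self._zt, self._zt), self.b)
        return float(w @ self._zt)

    def update(self, y):
        self.G += np.outer(self._zt, self._zt)
        self.b += y * self._zt
```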

The result below bounds the regret of PKAWV for any function $f \in \mathcal H$ and holds for any bounded kernel and any explicit subspace $\tilde{\mathcal H}$ associated with a projection $P$. The cost of the approximation of $\mathcal H$ by $\tilde{\mathcal H}$ is measured by the key quantity $\mu := \big\|(I-P)C_n^{1/2}\big\|^2$, where $C_n$ is the covariance operator.

Theorem 3. Let $\tilde{\mathcal H}$ be a linear subspace of $\mathcal H$ and $P$ the Euclidean projection onto $\tilde{\mathcal H}$. When PKAWV is run with $\lambda > 0$ and fixed subspaces $\tilde{\mathcal H}_t = \tilde{\mathcal H}$, then for all $f \in \mathcal H$
$$R_n(f) \le \lambda\|f\|^2 + B^2\sum_{j=1}^n \log\Big(1 + \frac{\lambda_j(K_{nn})}{\lambda}\Big) + \frac{(\mu+\lambda)\,n\mu B^2}{\lambda^2}, \qquad (5)$$
for any sequence $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal X \times [-B,B]$, where $\mu := \big\|(I-P)C_n^{1/2}\big\|^2$ and $C_n := \sum_{t=1}^n \phi(x_t)\otimes\phi(x_t)$.

The proof of Thm. 3 is deferred to Appendix D.1 and is the consequence of a more general Thm. 9.

3.1 Learning with Taylor expansions and the Gaussian kernel for very large data sets

In this section we focus on non-parametric regression with the widely used Gaussian kernel, defined by $k(x,x') = \exp\big(-\|x-x'\|^2/(2\sigma^2)\big)$ for $x, x' \in \mathcal X$ and $\sigma > 0$, and the associated RKHS $\mathcal H$.


Using the results of the previous section with a fixed linear subspace $\tilde{\mathcal H}$ which is the span of a basis of $O(\mathrm{polylog}(n/\lambda))$ functions, we prove that PKAWV achieves optimal regret. This leads to a computational complexity that is only $O(n\,\mathrm{polylog}(n/\lambda))$ for optimal regret. We need a basis that (1) approximates the Gaussian kernel very well and, at the same time, (2) has an easy-to-compute projection. We consider the following basis of functions: for $k \in \mathbb N_0^d$,
$$g_k(x) = \prod_{i=1}^d \psi_{k_i}\big(x^{(i)}\big), \qquad \text{where} \quad \psi_t(x) = \frac{x^t}{\sigma^t\sqrt{t!}}\, e^{-\frac{x^2}{2\sigma^2}}. \qquad (6)$$

For one-dimensional data this corresponds to the Taylor expansion of the Gaussian kernel. Our theorem below states that PKAWV (see (4)), using for all iterations $t \ge 1$
$$\tilde{\mathcal H}_t = \operatorname{Span}(G_M) \qquad \text{with} \qquad G_M = \{g_k \mid |k| \le M,\ k \in \mathbb N_0^d\},$$
where $|k| := k_1 + \dots + k_d$ for $k \in \mathbb N_0^d$, gets optimal regret while enjoying low complexity. The size of the basis, controlled by $M$, trades off a good approximation of the Gaussian kernel (to incur low regret) against a larger computational cost (see the sketch below). Theorem 4 optimizes $M$ so that the approximation term of Theorem 3 (due to the kernel approximation) is of the same order as the optimal regret.

Theorem 4. Let $\lambda > 0$, $n \in \mathbb N$ and let $R, B > 0$. Assume that $\|x_t\| \le R$ and $|y_t| \le B$. When $M = \frac{8R^2}{\sigma^2} \vee 2\log\frac{n}{\lambda\wedge 1}$, then running PKAWV using $G_M$ as the set of functions achieves a regret bounded by
$$\forall f \in \mathcal H, \qquad R_n(f) \le \lambda\|f\|^2 + \frac{3B^2}{2}\sum_{j=1}^n \log\Big(1 + \frac{\lambda_j(K_{nn})}{\lambda}\Big).$$
Moreover, its per-iteration computational cost is $O\Big(\big(3 + \tfrac{1}{d}\log\tfrac{n}{\lambda\wedge 1}\big)^{2d}\Big)$ in space and time.

Therefore PKAWV achieves a regret bound that is only deteriorated by a multiplicative factor of $3/2$ with respect to the bound obtained by Kernel-AWV (see Prop. 1). From Prop. 2 this also yields (up to log factors) the optimal bound (2).

In particular, it is known [20] that for the Gaussian kernel
$$d_{\mathrm{eff}}(\lambda) \le 3\Big(6 + \frac{41}{d}\frac{R^2}{\sigma^2} + \frac{3}{d}\log\frac{n}{\lambda}\Big)^d = O\Big(\log^d\frac{n}{\lambda}\Big).$$

This upper-bound is matched, even in the i.i.d. setting, for non-trivial distributions. In this case, we have $|G_M| \lesssim d_{\mathrm{eff}}(\lambda)$. The per-round space and time complexities are thus $O\big(d_{\mathrm{eff}}(\lambda)^2\big)$. Though our method is quite simple (since it uses a fixed explicit embedding), it is able to recover results (in terms of computational time and of bounds in the adversarial setting) that are similar to those obtained in the more restrictive i.i.d. setting via much more sophisticated methods, such as learning with (1) Nyström with importance sampling via leverage scores [21], (2) reweighted random features [22, 23], or (3) volume sampling [24]. By choosing $\lambda = (B/\|f\|)^2$ to minimize the r.h.s. of the regret bound of the theorem, we get
$$R_n(f) \lesssim \Big(\log\frac{n\|f\|_{\mathcal H}^2}{B^2}\Big)^{d+1} B^2.$$

Note that the optimal $\lambda$ does not depend on $n$ and can be optimized in practice through standard online calibration methods, such as using an expert advice algorithm [17] on a finite grid of $\lambda$ (a sketch follows). Similarly, though we use a fixed number of features $M$ in the experiments, the latter could be increased slowly over time thanks to online calibration techniques.
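Such a calibration can be sketched as an exponentially weighted average forecaster [17] run over a finite grid of values of $\lambda$, each grid point being an independent copy of the online regressor. The learning rate, the clipping range and the predict/update interface below are illustrative assumptions, not the paper's tuning.

```python
import numpy as np

class LambdaAggregation:
    """Exponentially weighted average over experts, one per value of lambda.
    Each expert must expose predict(x) and update(y)."""

    def __init__(self, experts, eta=0.1, clip=1.0):
        self.experts = experts
        self.eta = eta                      # learning rate (would need tuning)
        self.clip = clip                    # labels assumed in [-clip, clip]
        self.logw = np.zeros(len(experts))  # log-weights of the experts

    def predict(self, x):
        self._p = np.array([np.clip(e.predict(x), -self.clip, self.clip)
                            for e in self.experts])
        w = np.exp(self.logw - self.logw.max())
        return float(w @ self._p / w.sum())

    def update(self, y):
        self.logw -= self.eta * (self._p - y) ** 2   # square-loss updates
        for e in self.experts:
            e.update(y)
```

For instance, the experts can be instances of the ProjectedAWV sketch above with $\lambda \in \{10^{-2}, 10^{-1}, 1, 10, 10^2\}$.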

3.2 Nyström projection

The previous subsection considered a deterministic basis of functions (independent of the data) to approximate a specific RKHS. Here, we analyse Nyström projections [21], which are data dependent and work for any RKHS. The method consists in sequentially updating a dictionary $\mathcal I_t \subset \{x_1,\dots,x_t\}$ and using
$$\tilde{\mathcal H}_t = \operatorname{Span}\big\{\phi(x),\ x \in \mathcal I_t\big\}. \qquad (7)$$

If the points included in $\mathcal I_t$ are well chosen, the latter may approximate well the solution of (3), which belongs to the linear span of $\{\phi(x_1),\dots,\phi(x_t)\}$. The inputs $x_t$ might be included into the dictionary independently and uniformly at random. Here, we build the dictionary by following the KORS algorithm of [13], which is based on approximate leverage scores. At time $t \ge 1$, it evaluates the importance of including $x_t$ to obtain an accurate projection $P_t$ by computing its leverage score. Then, it decides whether to add it by drawing a Bernoulli random variable. Points are never dropped from the dictionary, so that $\mathcal I_1 \subset \mathcal I_2 \subset \cdots \subset \mathcal I_n$. With their notations, choosing $\varepsilon = 1/2$ and remarking that $\|\Phi_t^\top(I-P_t)\Phi_t\| = \|(I-P_t)C_t^{1/2}\|^2$, their Proposition 1 can be rewritten as follows.
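The sketch below illustrates the flavour of such a dictionary update: each new point is kept with probability proportional to a crude ridge-leverage-score estimate computed on the current dictionary. It is a simplification for exposition; the actual KORS algorithm of [13] maintains approximate leverage scores with more careful bookkeeping and guarantees.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom_dictionary(X, kernel=gaussian_kernel, mu=1.0, beta=12.0, seed=0):
    """Simplified KORS-style dictionary: add x_t w.p. ~ its estimated leverage."""
    rng = np.random.default_rng(seed)
    dictionary = [X[0]]
    for x in X[1:]:
        Z = np.array(dictionary)
        K_II = kernel(Z, Z)
        k_t = kernel(Z, x[None, :]).ravel()
        # Residual of phi(x_t) on the dictionary span, regularized by mu.
        residual = kernel(x[None, :], x[None, :])[0, 0] \
                   - k_t @ np.linalg.solve(K_II + mu * np.eye(len(Z)), k_t)
        p = min(1.0, beta * max(residual, 0.0) / mu)
        if rng.random() < p:
            dictionary.append(x)
    return np.array(dictionary)
```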


[Figure 1: Comparison of the theoretical regret rate $R_n = O(n^b)$ according to the size of the dictionary $m = O(n^a)$ considered by PKAWV, PKAWV (beforehand features), Sketched-KONS [12] and Pros-N-KONS [13], for optimized parameters, when $d_{\mathrm{eff}}(\lambda) \le (n/\lambda)^\gamma$ with $\gamma = 0.25, \sqrt 2 - 1, 0.75$ (from left to right).]

Proposition 5. [13, Prop. 1] Let $\delta > 0$, $n \ge 1$, $\mu > 0$. Then, the sequence of dictionaries $\mathcal I_1 \subset \mathcal I_2 \subset \cdots \subset \mathcal I_n$ learned by KORS with parameters $\mu$ and $\beta = 12\log(n/\delta)$ satisfies, with probability $1-\delta$,
$$\forall t \ge 1, \qquad \big\|(I-P_t)C_t^{1/2}\big\|^2 \le \mu \qquad \text{and} \qquad |\mathcal I_t| \le 9\,d_{\mathrm{eff}}(\mu)\log(2n/\delta)^2.$$
Furthermore, the algorithm runs in $O\big(d_{\mathrm{eff}}(\mu)^2\log^4(n)\big)$ space and $O\big(d_{\mathrm{eff}}(\mu)^2\big)$ time per iteration.

Using this approximation result together with Thm. 9 (which is a more general version of Thm. 3), we can bound the regret of PKAWV with KORS. The proof is postponed to Appendix E.1.

Theorem 6. Let $n \ge 1$, $\delta > 0$ and $\lambda \ge \mu > 0$. Assume that the dictionaries $(\mathcal I_t)_{t\ge1}$ are built according to Proposition 5. Then, with probability at least $1-\delta$, PKAWV with the subspaces $\tilde{\mathcal H}_t$ defined in (7) satisfies, for all $f \in \mathcal H$, the regret upper-bound
$$R_n(f) \le \lambda\|f\|^2 + B^2 d_{\mathrm{eff}}(\lambda)\log\Big(e + \frac{en\kappa^2}{\lambda}\Big) + \frac{2B^2(|\mathcal I_n|+1)\,n\mu}{\lambda},$$
and the algorithm runs in $O(d_{\mathrm{eff}}(\mu)^2)$ space and $O(d_{\mathrm{eff}}(\mu)^2)$ time per iteration.

The last term of the regret upper-bound above corresponds to the approximation cost of using the approximation (7) in PKAWV. This cost is controlled by the parameter $\mu > 0$, which trades off having a small approximation error (small $\mu$) against a small dictionary of size $|\mathcal I_n| \approx d_{\mathrm{eff}}(\mu)$ (large $\mu$), and thus a small computational complexity.

For the Gaussian kernel, using that $d_{\mathrm{eff}}(\lambda) \le O\big(\log(n/\lambda)^d\big)$, the above theorem yields, for the choice $\lambda = 1$ and $\mu = n^{-2}$, a regret bound of order $R_n \le O\big((\log n)^{d+1}\big)$ with a per-round time and space complexity of order $O(|\mathcal I_n|^2) = O\big((\log n)^{2d+4}\big)$. We recover a result similar to the one obtained in Section 3.1.

Explicit rates under the capacity condition. Assuming the capacity condition $d_{\mathrm{eff}}(\lambda_0) \le (n/\lambda_0)^\gamma$ for $0 \le \gamma \le 1$ and all $\lambda_0 > 0$, which is a classical assumption on kernels [21], the following corollary provides explicit rates for the regret according to the size of the dictionary $m \approx |\mathcal I_n|$.

Corollary 7. Let $n \ge 1$ and $m \ge 1$. Assume that $d_{\mathrm{eff}}(\lambda_0) \le (n/\lambda_0)^\gamma$ for all $\lambda_0 > 0$. Then, under the assumptions of Theorem 6, PKAWV with $\mu = nm^{-1/\gamma}$ has a dictionary of size $|\mathcal I_n| \lesssim m$ and a regret upper-bounded with high probability as
$$R_n \lesssim \begin{cases} n^{\frac{\gamma}{1+\gamma}} & \text{if } m \ge n^{\frac{2\gamma}{1-\gamma^2}}, \text{ for } \lambda = n^{\frac{\gamma}{1+\gamma}},\\[3pt] n\, m^{\frac12 - \frac{1}{2\gamma}} & \text{otherwise, for } \lambda = n\, m^{\frac12 - \frac{1}{2\gamma}}.\end{cases}$$
The per-round space and time complexity of the algorithm is $O(m^2)$.

The rate of order $n^{\frac{\gamma}{1+\gamma}}$ is optimal in this case (it corresponds to optimizing (2) in $\lambda$). If the dictionary is large enough, $m \ge n^{2\gamma/(1-\gamma^2)}$, the approximation term is negligible and the algorithm recovers the optimal rate. This is possible with a small dictionary $m = o(n)$ whenever $2\gamma/(1-\gamma^2) < 1$, which corresponds to $\gamma < \sqrt 2 - 1$. The rates obtained in Corollary 7 can be compared to the ones obtained by Sketched-KONS of [12] and Pros-N-KONS of [13], which also provide a similar trade-off between the dictionary size $m$ and the regret bound. The forms of the regret bounds in $m$, $\mu$, $\lambda$ of the algorithms can be summarized as follows:
$$R_n \lesssim \begin{cases} \lambda + d_{\mathrm{eff}}(\lambda) + \frac{nm\mu}{\lambda} & \text{for PKAWV with KORS,}\\[3pt] \lambda + \frac{n}{m}\, d_{\mathrm{eff}}(\lambda) & \text{for Sketched-KONS,}\\[3pt] m\big(\lambda + d_{\mathrm{eff}}(\lambda)\big) + \frac{n\mu}{\lambda} & \text{for Pros-N-KONS.}\end{cases} \qquad (8)$$
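As a quick sanity check of the threshold on $m$ in Corollary 7 (our own algebra, using the PKAWV line of (8) with the choices $\mu = nm^{-1/\gamma}$ and $\lambda = n^{\gamma/(1+\gamma)}$): the approximation term is
$$\frac{nm\mu}{\lambda} = \frac{n^2 m^{1-1/\gamma}}{\lambda}, \qquad m = n^{\frac{2\gamma}{1-\gamma^2}} \;\Longrightarrow\; m^{1-1/\gamma} = n^{-\frac{2}{1+\gamma}}, \qquad \frac{n^{2-\frac{2}{1+\gamma}}}{n^{\frac{\gamma}{1+\gamma}}} = n^{\frac{\gamma}{1+\gamma}},$$
so for $m \gtrsim n^{2\gamma/(1-\gamma^2)}$ the approximation term is of the same order as $\lambda + d_{\mathrm{eff}}(\lambda)$ and does not degrade the optimal rate.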


Figure 2: Average classification error and time on: (top) cod-rna ($n = 2.7\times10^5$, $d = 8$); (bottom) SUSY ($n = 6\times10^6$, $d = 22$).

When $d_{\mathrm{eff}}(\lambda) \le (n/\lambda)^\gamma$, optimizing these bounds in $\lambda$, PKAWV performs better than Sketched-KONS as soon as $\gamma \le 1/2$, and the latter cannot obtain the optimal rate $\lambda + d_{\mathrm{eff}}(\lambda) = n^{\frac{\gamma}{1+\gamma}}$ if $m = o(n)$. Furthermore, because of the multiplicative factor $m$, Pros-N-KONS cannot reach the optimal rate either, even for $m = n$. Figure 1 plots the rate in $n$ of the regret of these algorithms as the size $m$ of the dictionary increases. We can see that for $\gamma = 1/4$, PKAWV is the only algorithm that achieves the optimal rate $n^{\gamma/(1+\gamma)}$ with $m = o(n)$ features. The rate of Pros-N-KONS cannot beat $4\gamma/(1+\gamma)^2$ and stops improving even when the size of the dictionary increases. This is because Pros-N-KONS is restarted whenever a point is added to the dictionary, which is too costly for large dictionaries. It is worth pointing out that these rates are for well-tuned values of $\lambda$. However, such an optimization can be performed at small cost using an expert advice algorithm on a finite grid of $\lambda$.

Beforehand known features. We may assume that the sequence of feature vectors $x_t$ is given in advance to the learner while only the outputs $y_t$ are sequentially revealed (see [18] or [4] for details). In this case, the complete dictionary $\mathcal I_n \subset \{x_1,\dots,x_n\}$ may be computed beforehand and PKAWV can be used with the fixed subspace $\tilde{\mathcal H} = \operatorname{Span}(\phi(x),\ x \in \mathcal I_n)$. In this case, the regret upper-bound can be improved to $R_n \lesssim \lambda + d_{\mathrm{eff}}(\lambda) + \frac{n\mu}{\lambda}$ by removing a factor $m$ in the last term (see (8)).

Corollary 8. Under the notation and assumptions of Corollary 7, PKAWV used with dictionary $\mathcal I_n$ and parameter $\mu = nm^{-1/\gamma}$ achieves with high probability
$$R_n \lesssim \begin{cases} n^{\frac{\gamma}{1+\gamma}} & \text{if } m \ge n^{\frac{2\gamma}{1+\gamma}}, \text{ for } \lambda = n^{\frac{\gamma}{1+\gamma}},\\[3pt] n\, m^{-\frac{1}{2\gamma}} & \text{otherwise, for } \lambda = n\, m^{-\frac{1}{2\gamma}}.\end{cases}$$
Furthermore, w.h.p. the dictionary is of size $|\mathcal I_n| \lesssim m$, leading to a per-round space and time complexity of $O(m^2)$.

The suboptimal rate due to a small dictionary is improved by a factor $\sqrt m$ compared to the "sequentially revealed features" setting. Furthermore, since $2\gamma/(1+\gamma) < 1$ for all $\gamma \in (0,1)$, the algorithm is able to recover the optimal rate $n^{\gamma/(1+\gamma)}$ for all $\gamma \in (0,1)$ with a dictionary of sub-linear size $m \ll n$. We leave for future work the question of whether there is really a gap between these two settings or whether this gap comes from a suboptimality of our analysis.

4 Experiments

We empirically test PKAWV against several state-of-the-art algorithms for online kernel regression. In particular, we test our algorithms (1) in an adversarial setting [see Appendix G] and (2) on large-scale datasets. The following algorithms have been tested:

• Kernel-AWV for adversarial setting or Kernel Ridge Regression for i.i.d. real data settings;

• Pros-N-KONS [13];

• Fourier Online Gradient Descent (FOGD, [16]);


• PKAWV (or Projected-KRR for real data settings) with Taylor expansions ($M \in \{2,3,4\}$);

• PKAWV (or Projected-KRR for real data settings) with Nyström projections.

The algorithms above have been implemented in Python with numpy (the code for our algorithm is in Appendix H.2).

For most algorithms we used the hyperparameters from the respective papers. For all algorithms and all experiments, we set $\sigma = 1$ (except for SUSY where $\sigma = 4$, to match the accuracy results from [25]) and $\lambda = 1$. When using KORS, we set $\mu = 1$, $\beta = 1$ and $\varepsilon = 0.5$ as in [13]. The number of random features in FOGD is fixed to $1000$ and the learning rate $\eta$ is $1/\sqrt n$. All experiments have been run on a single desktop computer (Intel Core i7-6700) with a timeout of 5 minutes per algorithm. The results of the algorithms are only recorded until this time.

Large scale datasets. The algorithms are evaluated on four datasets from the UCI machine learning repository: casp (regression) and ijcnn1, cod-rna, SUSY (classification) [see Appendix G for casp and ijcnn1], ranging from $4\times10^4$ to $6\times10^6$ datapoints. For all datasets, we scaled $x$ in $[-1,1]^d$ and $y$ in $[-1,1]$. In Figs. 2 and 4 we show the average loss (square loss for regression and classification error for classification) and the computational cost of the considered algorithms.

In all the experiments, PKAWV with $M = 2$ approximates reasonably well the performance of the kernel forecaster and is usually very fast. We remark that using PKAWV with $M = 2$ on the first million examples of SUSY, we achieve in 10 minutes on a single desktop the same average classification error obtained with specific large-scale methods for i.i.d. data [25], although Kernel-AWV is using a number of features reduced by a factor 100 with respect to the one used for FALKON in the same paper. Indeed, they used $r = 10^4$ Nyström centers, while with $M = 2$ we used $r = 190$ features, validating empirically the effectiveness of the chosen features for the Gaussian kernel. This shows the effectiveness of the proposed approach for large-scale machine learning problems with moderate dimension $d$.

References

[1] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, 68(3):337–404, 1950.

[2] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics.

Springer Science & Business Media, 2011.

[3] Dean P. Foster. Prediction in the worst case. The Annals of Statistics, 19(2):1084–1090, 1991.

[4] Peter L. Bartlett, Wouter M. Koolen, Alan Malek, Eiji Takimoto, and Manfred K. Warmuth. Minimax fixed-design linear regression. JMLR: Workshop and Conference Proceedings, 40:1–14, 2015. Proceedings of COLT 2015.

[5] Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

[6] Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.

[7] V. Vovk. Metric entropy in competitive on-line prediction. arXiv, 2006.

[8] Elad Hazan and Nimrod Megiddo. Online learning with prior knowledge. In International Conference on Computational Learning Theory, pages 499–513. Springer, 2007.

[9] A. Rakhlin, K. Sridharan, and A.B. Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 2013. To appear.

[10] V. Vovk. On-line regression competitive with reproducing kernel Hilbert spaces. arXiv, 2005.

[11] Fedor Zhdanov and Yuri Kalnishkan. An identity for kernel ridge regression. In International Conference on Algorithmic Learning Theory, pages 405–419. Springer, 2010.

[12] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017.

[13] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Efficient second-order online kernel learning with adaptive embedding. In Neural Information Processing Systems, 2017.

[14] AJ Smola, B Schölkopf, and P Langley. Sparse greedy matrix approximation for machine learning. In 17th International Conference on Machine Learning, Stanford, 2000, pages 911–911, 2000.

[15] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.

[16] Jing Lu, Steven CH Hoi, Jialei Wang, Peilin Zhao, and Zhi-Yong Liu. Large scale online kernel learning. The Journal of Machine Learning Research, 17(1):1613–1655, 2016.

(10)

[17] Nicolò Cesa-Bianchi and Gábor Lugosi.Prediction, Learning, and Games. Cambridge University Press, 2006.

[18] Pierre Gaillard, Sébastien Gerchinovitz, Malo Huard, and Gilles Stoltz. Uniform regret bounds over $\mathbb R^d$ for the sequential linear regression problem with the square loss. arXiv preprint arXiv:1805.11386, 2018.

[19] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299–3340, 2015.

[20] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Weed. Massively scalable Sinkhorn distances via the Nyström method. arXiv preprint arXiv:1812.05189, 2018.

[21] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.

[22] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.

[23] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.

[24] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Correcting the bias in least squares regression with volume-rescaled sampling. arXiv preprint arXiv:1810.02453, 2018.

[25] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3888–3898, 2017.

[26] Ingo Steinwart, Don Hush, and Clint Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Transactions on Information Theory, 52(10):4635–4643, 2006.

[27] Andrew Cotter, Joseph Keshet, and Nathan Srebro. Explicit approximations of the Gaussian kernel. arXiv preprint arXiv:1109.4603, 2011.

[28] Frank WJ Olver, Daniel W Lozier, Ronald F Boisvert, and Charles W Clark. NIST handbook of mathematical functions. Cambridge University Press, 2010.


Supplementary material

The supplementary material is organized as follows:

• Appendix A starts with notations and useful identities that are used in the rest of the proofs

• Appendices B, C, D, E, and F contain the proofs, mostly in order of appearance:

– Appendix B: statement and proof of our main theorem, on which most of our results are based.

– Appendix C: proofs of Section 2 (Propositions 1 and 2).

– Appendix D: proofs of Section 3.1 (Theorems 3 and 4).

– Appendix E: proofs of Section 3.2 (Theorem 6 and Corollaries 7 and 8).

– Appendix F: proofs of additional lemmas.

• Appendix G provides additional experimental results (adversarial simulated data and large-scale real datasets).

• Appendix H describes efficient implementations of our algorithms together with the Python code used for the experiments.

A Notations and relevant equations

In this section, we give notations and useful identities which will be used in the following proofs. We recall that at time $t \ge 1$, the forecaster is given an input $x_t \in \mathcal X \subset \mathbb R^d$, chooses a prediction function $\hat f_t \in \tilde{\mathcal H}_t \subset \mathcal H$, forecasts $\hat y_t = \hat f_t(x_t)$ and observes $y_t \in [-B, B]$. Moreover, $\mathcal H$ is the RKHS associated to the kernel $k : (x, x') \in \mathcal X \times \mathcal X \mapsto \phi(x)^\top\phi(x')$ for some feature map $\phi : \mathcal X \to \mathbb R^D$. We also define the following notations for all $t \ge 1$:

– $Y_t = (y_1,\dots,y_t)^\top \in \mathbb R^t$ and $\hat Y_t = (\hat y_1,\dots,\hat y_t)^\top \in \mathbb R^t$;

– $P_t : \mathcal H \to \tilde{\mathcal H}_t$ is the Euclidean projection onto $\tilde{\mathcal H}_t$;

– $C_t := \sum_{i=1}^t \phi(x_i)\otimes\phi(x_i)$ is the covariance operator at time $t \ge 1$;

– $A_t := C_t + \lambda I$ is the regularized covariance operator;

– $S_t : \mathcal H \to \mathbb R^t$ is the operator such that $[S_tf]_i = f(x_i) = \langle f, \phi(x_i)\rangle$ for any $f \in \mathcal H$;

– $L_t := f \in \mathcal H \mapsto \|Y_t - S_tf\|^2 + \lambda\|f\|^2$ is the regularized cumulative loss.

The prediction function of PKAWV at time $t \ge 1$ is defined (see (4)) as
$$\hat f_t = \operatorname*{arg\,min}_{f \in \tilde{\mathcal H}_t}\Big\{\sum_{s=1}^{t-1}\big(y_s - f(x_s)\big)^2 + \lambda\|f\|^2 + f(x_t)^2\Big\}.$$
A standard calculation shows the equality
$$\hat f_t = P_t\big(P_tC_tP_t + \lambda I\big)^{-1}P_tS_{t-1}^*Y_{t-1}. \qquad (9)$$
We also define the best functions in the subspaces $\tilde{\mathcal H}_t$ and $\tilde{\mathcal H}_{t+1}$ at time $t \ge 1$,
$$\hat g_{t+1} = \operatorname*{arg\,min}_{f \in \tilde{\mathcal H}_t}\{L_t(f)\} = P_t\big(P_tC_tP_t + \lambda I\big)^{-1}P_tS_t^*Y_t, \qquad (10)$$
$$\tilde g_{t+1} = \operatorname*{arg\,min}_{f \in \tilde{\mathcal H}_{t+1}}\{L_t(f)\} = P_{t+1}\big(P_{t+1}C_tP_{t+1} + \lambda I\big)^{-1}P_{t+1}S_t^*Y_t, \qquad (11)$$
and the best function in the whole space $\mathcal H$,
$$\hat h_{t+1} = \operatorname*{arg\,min}_{f \in \mathcal H}\{L_t(f)\} = A_t^{-1}S_t^*Y_t. \qquad (12)$$

B Main theorem (statement and proof)

In this appendix, we provide a general upper-bound on the regret of PKAWV that is valid for any sequence of projections $P_1,\dots,P_n$ associated with the sequence $\tilde{\mathcal H}_1,\dots,\tilde{\mathcal H}_n$. Many of our results are corollaries of the following theorem for specific sequences of projections.


Theorem 9. Let $\tilde{\mathcal H}_1,\dots,\tilde{\mathcal H}_n$ be a sequence of linear subspaces of $\mathcal H$ associated with projections $P_1,\dots,P_n : \mathcal H \to \mathcal H$. PKAWV with regularization parameter $\lambda > 0$ satisfies the following upper-bound on the regret: for all $f \in \mathcal H$,
$$R_n(f) \le \lambda\|f\|^2 + \sum_{t=1}^n \Big(y_t^2\big\langle\tilde A_t^{-1}P_t\phi(x_t),\, P_t\phi(x_t)\big\rangle + \frac{(\mu_t+\lambda)\,\mu_t\, t\, B^2}{\lambda^2}\Big),$$
for any sequence $(x_1,y_1),\dots,(x_n,y_n) \in \mathcal X \times [-B,B]$, where $\tilde A_t := P_tC_tP_t + \lambda I$, $\mu_t := \big\|(P_{t+1}-P_t)C_t^{1/2}\big\|^2$ and $P_{n+1} := I$.

Proof. Let $f \in \mathcal H$. By definition of $\hat h_{n+1}$ (see (12)), we have $L_n(\hat h_{n+1}) \le L_n(f)$, which implies by definition of $L_n$ that
$$\|Y_n - S_n\hat h_{n+1}\|^2 - \|Y_n - S_nf\|^2 \le \lambda\|f\|^2 - \lambda\|\hat h_{n+1}\|^2. \qquad (13)$$
Now, the regret can be upper-bounded as
$$
\begin{aligned}
R_n(f) &\overset{(1)}{:=} \sum_{t=1}^n (y_t - \hat y_t)^2 - \sum_{t=1}^n \big(y_t - f(x_t)\big)^2 \qquad\qquad (14)\\
&= \|Y_n - \hat Y_n\|^2 - \|Y_n - S_nf\|^2\\
&\overset{(13)}{\le} \|Y_n - \hat Y_n\|^2 - \|Y_n - S_n\hat h_{n+1}\|^2 + \lambda\|f\|^2 - \lambda\|\hat h_{n+1}\|^2\\
&\le \lambda\|f\|^2 + \underbrace{\|Y_n - \hat Y_n\|^2 - \|Y_n - S_n\hat g_{n+1}\|^2 - \lambda\|\hat g_{n+1}\|^2}_{Z_1} \qquad\qquad (15)\\
&\qquad + \underbrace{\|Y_n - S_n\hat g_{n+1}\|^2 + \lambda\|\hat g_{n+1}\|^2 - \|Y_n - S_n\hat h_{n+1}\|^2 - \lambda\|\hat h_{n+1}\|^2}_{\Omega(n+1)}.
\end{aligned}
$$

The first term $Z_1$ mainly corresponds to the estimation error of the algorithm: the regret incurred with respect to the best function in the approximation space $\tilde{\mathcal H}_n$. It also includes an approximation error due to the fact that the algorithm does not use $\tilde{\mathcal H}_n$ but the sequence of approximations $\tilde{\mathcal H}_1,\dots,\tilde{\mathcal H}_n$. The second term $\Omega(n+1)$ corresponds to the approximation error of $\mathcal H$ by $\tilde{\mathcal H}_n$. Our analysis will focus on upper-bounding both of these terms separately.

Part 1. Upper-bound of the estimation error $Z_1$. Using a telescoping argument together with the convention $L_0(\hat g_1) = 0$, we have
$$\|Y_n - S_n\hat g_{n+1}\|^2 + \lambda\|\hat g_{n+1}\|^2 = L_n(\hat g_{n+1}) = \sum_{t=1}^n L_t(\hat g_{t+1}) - L_{t-1}(\hat g_t).$$
Substituted into the definition of $Z_1$ (see (15)), the latter can be rewritten as
$$Z_1 = \sum_{t=1}^n (y_t - \hat y_t)^2 + L_{t-1}(\hat g_t) - L_t(\hat g_{t+1}) = \sum_{t=1}^n \underbrace{(y_t - \hat y_t)^2 + L_{t-1}(\tilde g_t) - L_t(\hat g_{t+1})}_{Z(t)} + \underbrace{L_{t-1}(\hat g_t) - L_{t-1}(\tilde g_t)}_{\Omega(t)}, \qquad (16)$$
where $\tilde g_t = P_t(P_tC_{t-1}P_t + \lambda I)^{-1}P_tS_{t-1}^*Y_{t-1}$ is obtained by replacing $P_{t-1}$ with $P_t$ in the definition of $\hat g_t$. Note that, with the convention $P_{n+1} = I$, the second term $\Omega(t)$ matches the definition of $\Omega(n+1)$ in (15) since $\tilde g_{n+1} = A_n^{-1}S_n^*Y_n = \hat h_{n+1}$. In the rest of this first part we focus on upper-bounding the terms $Z(t)$. The approximation terms $\Omega(t)$ will be bounded in the next part.

Now, we remark that, by expanding the squared norm,
$$L_t(f) = \|Y_t\|^2 - 2Y_t^\top S_tf + \|S_tf\|^2 + \lambda\|f\|^2 = \|Y_t\|^2 - 2Y_t^\top S_tf + \langle f, C_tf\rangle + \lambda\|f\|^2 = \|Y_t\|^2 - 2Y_t^\top S_tf + \langle f, A_tf\rangle, \qquad (17)$$
where for the second equality we used
$$\|S_tf\|^2 = \sum_{s=1}^t f(x_s)^2 = \sum_{s=1}^t \langle f, \phi(x_s)\rangle^2 = \sum_{s=1}^t \langle f, \phi(x_s)\otimes\phi(x_s)f\rangle = \langle f, C_tf\rangle.$$


Substituting $\hat g_{t+1}$ into (17) we get
$$L_t(\hat g_{t+1}) = \|Y_t\|^2 - 2Y_t^\top S_t\hat g_{t+1} + \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle. \qquad (18)$$
But, since $\hat g_{t+1} \in \tilde{\mathcal H}_t$, we have $\hat g_{t+1} = P_t\hat g_{t+1}$, which yields
$$Y_t^\top S_t\hat g_{t+1} = Y_t^\top S_tP_t\hat g_{t+1} = Y_t^\top S_tP_t\tilde A_t^{-1}\tilde A_tP_t\hat g_{t+1}.$$
Then, using that $\tilde A_tP_t = (P_tC_tP_t + \lambda I)P_t = P_tA_tP_t$, we get
$$Y_t^\top S_t\hat g_{t+1} = \underbrace{Y_t^\top S_tP_t\tilde A_t^{-1}P_t}_{\hat g_{t+1}^*}\,A_t\hat g_{t+1} = \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle.$$
Thus, combining with (18), we get
$$L_t(\hat g_{t+1}) = \|Y_t\|^2 - \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle.$$
Similarly, substituting $\tilde g_t$ into (17) and using $\tilde g_t \in \tilde{\mathcal H}_t$, we can show
$$L_{t-1}(\tilde g_t) = \|Y_{t-1}\|^2 - \langle\tilde g_t, A_{t-1}\tilde g_t\rangle.$$
Combining the last two equations implies
$$L_{t-1}(\tilde g_t) - L_t(\hat g_{t+1}) = -y_t^2 + \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle - \langle\tilde g_t, A_{t-1}\tilde g_t\rangle. \qquad (19)$$
Furthermore, using the definition of $\hat g_{t+1}$, we have
$$P_tA_t\hat g_{t+1} = P_tA_tP_t\tilde A_t^{-1}P_tS_t^*Y_t = \tilde A_tP_t\tilde A_t^{-1}P_tS_t^*Y_t = P_tS_t^*Y_t,$$
the last step using that $P_t$ commutes with $\tilde A_t$. The same calculation with $\tilde g_t$ yields
$$P_tA_{t-1}\tilde g_t = P_t(C_{t-1}+\lambda I)P_t\big(P_tC_{t-1}P_t+\lambda I\big)^{-1}P_tS_{t-1}^*Y_{t-1} = P_t\big(P_tC_{t-1}P_t+\lambda I\big)\big(P_tC_{t-1}P_t+\lambda I\big)^{-1}P_tS_{t-1}^*Y_{t-1} = P_tS_{t-1}^*Y_{t-1}. \qquad (20)$$
Together with the previous equality, this entails
$$P_tA_t\hat g_{t+1} - P_tA_{t-1}\tilde g_t = P_t\big(S_t^*Y_t - S_{t-1}^*Y_{t-1}\big) = y_tP_t\phi(x_t). \qquad (21)$$
Then, because $\hat f_t \in \tilde{\mathcal H}_t$, we have $\hat y_t = \hat f_t(x_t) = \langle\hat f_t, \phi(x_t)\rangle = \langle\hat f_t, P_t\phi(x_t)\rangle$. This yields
$$
\begin{aligned}
(y_t - \hat y_t)^2 &= y_t^2 - 2y_t\hat y_t + \hat y_t^2\\
&= y_t^2 - 2\big\langle\hat f_t, y_tP_t\phi(x_t)\big\rangle + \big\langle\hat f_t, \phi(x_t)\otimes\phi(x_t)\hat f_t\big\rangle\\
&\overset{(21)}{\le} y_t^2 - 2\big\langle\hat f_t, P_tA_t\hat g_{t+1} - P_tA_{t-1}\tilde g_t\big\rangle + \big\langle\hat f_t, \phi(x_t)\otimes\phi(x_t)\hat f_t\big\rangle\\
&= y_t^2 - 2\big\langle\hat f_t, A_t\hat g_{t+1} - A_{t-1}\tilde g_t\big\rangle + \big\langle\hat f_t, (A_t - A_{t-1})\hat f_t\big\rangle, \qquad (22)
\end{aligned}
$$
where the last equality uses $\hat f_t \in \tilde{\mathcal H}_t$ and that $A_t - A_{t-1} = \phi(x_t)\otimes\phi(x_t)$.

Putting equations (19) and (22) together, we get
$$
\begin{aligned}
Z(t) &\overset{(16)}{=} (y_t - \hat y_t)^2 + L_{t-1}(\tilde g_t) - L_t(\hat g_{t+1})\\
&\overset{(19)+(22)}{\le} \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle - 2\big\langle\hat f_t, A_t\hat g_{t+1}\big\rangle + \big\langle\hat f_t, A_t\hat f_t\big\rangle - \Big(\langle\tilde g_t, A_{t-1}\tilde g_t\rangle - 2\big\langle\hat f_t, A_{t-1}\tilde g_t\big\rangle + \big\langle\hat f_t, A_{t-1}\hat f_t\big\rangle\Big)\\
&= \big\langle\hat g_{t+1} - \hat f_t,\, A_t(\hat g_{t+1} - \hat f_t)\big\rangle - \underbrace{\big\langle\hat f_t - \tilde g_t,\, A_{t-1}(\hat f_t - \tilde g_t)\big\rangle}_{\ge 0}\\
&\le \big\langle\hat g_{t+1} - \hat f_t,\, \tilde A_t(\hat g_{t+1} - \hat f_t)\big\rangle\\
&\overset{(9)+(10)}{=} \big\langle P_t\tilde A_t^{-1}P_t(S_t^*Y_t - S_{t-1}^*Y_{t-1}),\, \tilde A_tP_t\tilde A_t^{-1}P_t(S_t^*Y_t - S_{t-1}^*Y_{t-1})\big\rangle\\
&= y_t^2\big\langle P_t\tilde A_t^{-1}P_t\phi(x_t),\, \tilde A_tP_t\tilde A_t^{-1}P_t\phi(x_t)\big\rangle\\
&= y_t^2\big\langle\tilde A_t^{-1}P_t\phi(x_t),\, P_t\phi(x_t)\big\rangle,
\end{aligned}
$$


where the last equality holds because $P_t\tilde A_t = \tilde A_tP_t$ from the definition of $\tilde A_t := \tilde C_t + \lambda I$ with $\tilde C_t := P_tC_tP_t$. Therefore, plugging back into (16), we have
$$Z_1 \le \sum_{t=1}^n y_t^2\big\langle\tilde A_t^{-1}P_t\phi(x_t),\, P_t\phi(x_t)\big\rangle + \Omega(t), \qquad (23)$$
where we recall that $\Omega(t) := L_{t-1}(\hat g_t) - L_{t-1}(\tilde g_t)$.

Part 2. Upper-bound of the approximation terms $\Omega(t)$. We recall that we use the convention $P_{n+1} = I$, which does not change the algorithm. Let $t \ge 1$; expanding the square losses, we get
$$
\begin{aligned}
\Omega(t+1) &= \sum_{s=1}^t \big(\hat g_{t+1}(x_s) - y_s\big)^2 - \big(\tilde g_{t+1}(x_s) - y_s\big)^2 + \lambda\|\hat g_{t+1}\|^2 - \lambda\|\tilde g_{t+1}\|^2\\
&= \sum_{s=1}^t \Big(y_s^2 - 2\langle\hat g_{t+1}, y_s\phi(x_s)\rangle + \langle\hat g_{t+1}, \phi(x_s)\otimes\phi(x_s)\hat g_{t+1}\rangle - y_s^2 + 2\langle\tilde g_{t+1}, y_s\phi(x_s)\rangle - \langle\tilde g_{t+1}, \phi(x_s)\otimes\phi(x_s)\tilde g_{t+1}\rangle\Big) + \lambda\|\hat g_{t+1}\|^2 - \lambda\|\tilde g_{t+1}\|^2\\
&= 2\langle\tilde g_{t+1} - \hat g_{t+1}, S_t^*Y_t\rangle + \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle - \langle\tilde g_{t+1}, A_t\tilde g_{t+1}\rangle.
\end{aligned}
$$
Since both $\tilde g_{t+1}$ and $\hat g_{t+1}$ belong to $\tilde{\mathcal H}_{t+1}$, we have
$$\Omega(t+1) = 2\langle\tilde g_{t+1} - \hat g_{t+1}, P_{t+1}S_t^*Y_t\rangle + \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle - \langle\tilde g_{t+1}, A_t\tilde g_{t+1}\rangle,$$
which, using that $P_{t+1}S_t^*Y_t = P_{t+1}A_t\tilde g_{t+1}$ by Equality (20), yields
$$
\begin{aligned}
\Omega(t+1) &= 2\langle\tilde g_{t+1} - \hat g_{t+1}, P_{t+1}A_t\tilde g_{t+1}\rangle + \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle - \langle\tilde g_{t+1}, A_t\tilde g_{t+1}\rangle\\
&= 2\langle\tilde g_{t+1} - \hat g_{t+1}, A_t\tilde g_{t+1}\rangle + \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle - \langle\tilde g_{t+1}, A_t\tilde g_{t+1}\rangle\\
&= -2\langle\hat g_{t+1}, A_t\tilde g_{t+1}\rangle + \langle\hat g_{t+1}, A_t\hat g_{t+1}\rangle + \langle\tilde g_{t+1}, A_t\tilde g_{t+1}\rangle\\
&= \langle\tilde g_{t+1} - \hat g_{t+1},\, A_t(\tilde g_{t+1} - \hat g_{t+1})\rangle.
\end{aligned}
$$

Let us denote $B_t = P_{t+1}A_tP_{t+1}$. Then, remarking that $\hat g_{t+1} = P_t\tilde A_t^{-1}P_tA_t\tilde g_{t+1}$ and that $(P_{t+1} - P_t\tilde A_t^{-1}P_tA_t)P_t = 0$, we have
$$
\begin{aligned}
\Omega(t+1) &= \big\langle(P_{t+1} - P_t\tilde A_t^{-1}P_tA_t)\tilde g_{t+1},\, A_t(P_{t+1} - P_t\tilde A_t^{-1}P_tA_t)\tilde g_{t+1}\big\rangle\\
&= \big\langle(P_{t+1} - P_t\tilde A_t^{-1}P_tB_t)\tilde g_{t+1},\, B_t(P_{t+1} - P_t\tilde A_t^{-1}P_tB_t)\tilde g_{t+1}\big\rangle\\
&= \big\|B_t^{1/2}(P_{t+1} - P_t\tilde A_t^{-1}P_tB_t)\tilde g_{t+1}\big\|^2\\
&= \big\|B_t^{1/2}(P_{t+1} - P_t\tilde A_t^{-1}P_tB_t)(P_{t+1} - P_t)\tilde g_{t+1}\big\|^2\\
&\le \big\|B_t^{1/2}(P_{t+1} - P_t\tilde A_t^{-1}P_tB_t)\big\|^2\,\big\|(P_{t+1} - P_t)\tilde g_{t+1}\big\|^2. \qquad (24)
\end{aligned}
$$
We now upper-bound the two terms of the right-hand side. For the first one, we use that
$$\big(P_{t+1} - B_t^{1/2}P_t\tilde A_t^{-1}P_tB_t^{1/2}\big)^2 = P_{t+1} - 2B_t^{1/2}P_t\tilde A_t^{-1}P_tB_t^{1/2} + B_t^{1/2}P_t\tilde A_t^{-1}\underbrace{P_tB_tP_t\tilde A_t^{-1}P_t}_{P_t}B_t^{1/2} = P_{t+1} - B_t^{1/2}P_t\tilde A_t^{-1}P_tB_t^{1/2}, \qquad (25)$$
so that $P_{t+1} - B_t^{1/2}P_t\tilde A_t^{-1}P_tB_t^{1/2}$ is an orthogonal projection and its operator norm belongs to $\{0,1\}$.
