Orlicz Random Fourier Features

(1)

HAL Id: hal-02418576

https://hal.archives-ouvertes.fr/hal-02418576v2

Preprint submitted on 19 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

To cite this version:

Linda Chamakh, Emmanuel Gobet, Zoltán Szabó. Orlicz Random Fourier Features. 2020. �hal-

02418576v2�

(2)

Orlicz Random Fourier Features

Linda Chamakh ^†‡ [email protected] Emmanuel Gobet ^† [email protected] Zolt´ an Szab´ o ^† [email protected]

† CMAP, CNRS, ´ Ecole Polytechnique, Institut Polytechnique de Paris 91128 Palaiseau, France

‡ Global Markets Quantitative Research – BNP Paribas

Editor: Jean-Philippe Vert

Abstract

Kernel techniques are among the most widely-applied and influential tools in machine learning with applications at virtually all areas of the field. To combine this expressive power with computational efficiency numerous randomized schemes have been proposed in the literature, among which probably random Fourier features (RFF) are the simplest and most popular. While RFFs were originally designed for the approximation of kernel values, recently they have been adapted to kernel derivatives, and hence to the solution of large-scale tasks involving function derivatives. Unfortunately, the understanding of the RFF scheme for the approximation of higher-order kernel derivatives is quite limited due to the challenging polynomial growing nature of the underlying function class in the empirical process. To tackle this difficulty, we establish a finite-sample deviation bound for a general class of polynomial-growth functions under α-exponential Orlicz condition on the distribution of the sample. Instantiating this result for RFFs, our finite-sample uniform guarantee implies a.s. convergence with tight rate for arbitrary kernel with α-exponential Orlicz spectrum and any order of derivative.

Keywords: random Fourier features, kernel derivative, polynomial-growth functions, α-exponential Orlicz norm, unbounded empirical processes

1. Introduction

Kernel machines (Taylor and Cristianini, 2004; Steinwart and Christmann, 2008; Paulsen and Raghupathi, 2016) form one of the most fundamental tools in machine learning and statistics with a wide range of successful applications. The impressive modelling power and flexibility of kernel techniques in capturing complex nonlinear relations originates from the richness of the underlying H k function class called reproducing kernel Hilbert space (RKHS, Aronszajn 1950) associated to a k : X × X → R kernel. Kernels extend the classical notion of inner product on X = R ^d by assuming the existence of a φ : X → H feature map to a Hilbert space H such that k(x, x ⁰ ) = hφ(x), φ(x ⁰ )i _H for all x, x ⁰ ∈ X . This simple equality (also called the kernel trick) forms the basis of kernel techniques and enables one to compute inner products implicitly without direct access to the feature of the points.

© 2020 Linda Chamakh, Emmanuel Gobet and Zolt´ an Szab´ o.

(3)

In applications one is often given {x _n } ^N _n=1 samples and is facing with an optimization problem expressed in terms of function values and derivatives ¹

f min ∈ H

k

l {∂ ^p f (x n )} _n∈[N]

p∈D

n

, kf k ² _H

k

!

, (1)

where [N ] = {1, . . . , N }, D n ⊂ N ^d , N := {0, 1, . . .}, l : R ¹⁺

P

n∈[N]

#D

n

→ R is a loss function, #D n is the cardinality of the set D n , ∂ ^p f (x n ) := ^∂

^p¹⁺^...+^pd

^f ^(x

ⁿ

⁾

∂

x^p¹1

···∂

^pd_xd

and the RKHS H k is characterized by f (x) = hf, k(·, x)i _H

k

(∀x ∈ X , ∀f ∈ H k ) and k(·, x) ∈ H k (∀x ∈ X ). ² The first property of RKHSs is called the reproducing property, the second one describes basic elements of H k ; combining the two properties makes the canonical feature map and feature space explicit: k(x, x ⁰ ) = hφ(x), φ(x ⁰ )i _H

k

where φ(x) = k(·, x) ∈ H k .

For example by taking the quadratic loss, Tikhonov regularization, only function values (D _n = {0}, ∀n ∈ [N ]) and λ > 0, (1) reduces to kernel ridge regression

f∈ min H

k

1 N

X

n∈[N]

[f (x _n ) − y _n ] ² + λ kfk ² _H

k

.

Alternatively, one can get back Hermite learning with gradient data (Zhou, 2008; Shi et al., 2010) by additionally including first-order derivatives

f∈ min H

k

1 N

X

n∈[N]

[f (x n ) − y n ] ² +

f ⁰ (x n ) − y ⁰ _n

2 2

+ λ kf k ² _H

k

, λ > 0

where f ⁰ (x) = [∂ ^e

¹

f (x); . . . ; ∂ ^e

^d

f(x)] ∈ R ^d is the derivative of f , e _j ∈ R ^d is the j ^th canonical basis vector, k·k ₂ is the Euclidean norm and D _n = n

0, {e _j } ^d _j=1 o

(n ∈ [N ]). Further examples with function derivatives are semi-supervised learning with gradient information (Zhou, 2008), nonlinear variable selection (Rosasco et al., 2010, 2013), learning of piecewise- smooth functions (Lauer et al., 2012), multi-task gradient learning (Ying et al., 2012), structure optimization in parameter-varying ARX (autoregressive with exogenous input) processes (Duijkers et al., 2014), or density estimation with infinite-dimensional exponential families (Sriperumbudur et al., 2017).

An appealing property of RKHSs is that their geometry makes the optimization prob- lem (1) defined over function spaces computationally tractable. Indeed, assuming that l is increasing in its last argument, the ∂ ^p f (x) =

f, ∂ ^p,0 k(·, x)

H

k

derivative-reproducing prop- erty of kernels and the representer theorem (Zhou, 2008) guarantee that the solution of (1) has a finite-dimensional parameterization f (·) = P

n∈[N]

P

p∈D

_n

a _n,p ∂ ^p,0 k(·, x _n ) (a _n,p ∈ R ) and it is sufficient to solve

min a l











 X

m∈[N ]

X

q∈D

m

a m,q ∂ ^p,q k(x n , x m )





 _n∈[N]

p∈D

n

, X

n,m∈[N ] p∈D

n

,q∈D

m

a n,p a m,q ∂ ^p,q k(x n , x m )





 (2)

1. To have derivatives, in the sequel we assume that X = R

^d

.

2. We use the k(·, x) shorthand to denote the function y ∈ X 7→ k(y, x) ∈ R while keeping x ∈ X fixed.

(4)

determined by the ∂ ^p,q k(x, y) := ^∂

Pdi=1(pi+qi)

k(x,y)

∂

^px¹1

···∂

_xd^pd

∂

y^q1¹

···∂

_yd^qd

kernel derivatives; a = (a _n,p ) _{n∈[N],p∈D}

_n

∈ R

P

n∈[N]

#D

n

.

Though kernel methods show impressive modelling power at numerous areas, due to the implicit computation of feature similarities, this flexibility comes with a computational price. Several techniques have been developed in the literature to mitigate this computa- tional challenge such as incomplete Cholesky factorization (Bach and Jordan, 2002), sub- sampling schemes (Williams and Seeger, 2001; Drineas and Mahoney, 2005; Rudi et al., 2017), sketching (Alaoui and Mahoney, 2015; Yang et al., 2017), random Fourier features (RFF, Rahimi and Recht 2007, 2008), their quasi-Monte Carlo (Yang et al., 2014), memory- efficient (Le et al., 2013; Dai et al., 2014; Zhang et al., 2019), orthogonal (Yu et al., 2016) or structured (Bojarski et al., 2017) variants.

In this paper we study the RFF technique which is probably the conceptually simplest and most influential approach. ³ By the Bochner theorem (Rudin, 1990) a continuous, bounded, shift-invariant kernel k : R ^d × R ^d → R can be written as the Fourier transform of a (finite) measure Λ, called the spectral measure

k(x, y) = Z

R

^d

cos

ω ^> (x − y)

dΛ(ω). (3)

The RFF method uses this representation of k to provide an explicit low-dimensional feature map approximation for the kernel values and f

k(x, x ⁰ ) ≈

λ(x), λ(x ⁰ )

R

^2M

, f ˆ w (x) = hw, λ(x)i

R

^2M

, (4) where the integral representation (3) with respect to the measure Λ is replaced by an av- erage over random points; hence the random Fourier feature naming. As a result, one can estimate w by leveraging fast linear primal solvers. The idea has been successfully used in various contexts including differential privacy preserving (Chaudhuri et al., 2011), fast function-to-function regression (Oliva et al., 2015), learning message operators in expecta- tion propagation (Jitkrittum et al., 2015), causal discovery (Lopez-Paz et al., 2015; Strobl et al., 2019), independence testing (Zhang et al., 2017), prediction and filtering in dynam- ical systems (Downey et al., 2017), bandit optimization (Li et al., 2018), or estimation of Gaussian mixture models (Keriven et al., 2018).

Similarly to (4), one can consider RFF-based approximation of kernel derivatives when solving optimization tasks involving function derivatives [see (1) and (2)]. This is the strategy followed for example by Strathmann et al. (2015) to fit distributions belonging to the infinite-dimensional exponential family, which boils down to an optimization problem with third-order kernel derivatives (Sriperumbudur et al., 2017, Theorem 5).

The focus of this work is to study the approximation quality of the RFF-based kernel- derivative approximation

\ ∂ ^p,q k − ∂ ^p,q k

_S := sup

x,y∈S

∂ ^p,q k(x, y) − ∂ \ ^p,q k(x, y) ,

3. Rahimi and Recht (2007) won the 10-year test-of-time award at NIPS-2017 as a recognition of the

influence of RFFs.

(5)

where S ⊂ R ^d is a compact set. Despite the large number of successful RFF applications, quite little is understood theoretically on its approximation quality. Below we provide a brief summary with particular focus on optimal guarantees and results related to kernel derivatives.

• Kernel values (p = q = 0): The uniform finite-sample bounds (Rahimi and Recht, 2007; Sutherland and Schneider, 2015) have recently been improved (Sriperumbudur and Szab´ o, 2015) exponentially in terms of the diameter of the compact set S _M (|S _M |) arriving to ⁴

k − b k

S

M

= O _a.s.

√

log |S

_M

|∨ √ log M

√ M

from k − b k

S

M

= O _p _|S

M

| √ log M

√ M

, where ∨ denotes the maximum. The result shows that the diameter of the set S _M can grow at a

|S _M | = e ^o(M) rate while still getting a consistent estimate; this rate is optimal as shown in the characteristic function literature (Cs¨ org¨ o and Totik, 1983).

• Kernel ridge regression: RFFs have been settled in kernel ridge regression by Rudi and Rosasco (2017) via showing that using M = o(N ) = O √

N log N

random Fourier features is sufficient to get O

1/ √ N

generalization error. Under additional γ-capacity (γ ∈ [0, 1]) and r-range space conditions (r ≥ ¹ ₂ ), the same authors showed that even faster, minimax optimal O

N ⁻

^2r+γ^2r

rates are achievable with M = o(N ) = O

N

1+γ(2r−1) 2r+γ

log N

RFFs.

The result improves the originally proved guarantee (Rahimi and Recht, 2008) holding under the pessimistic M = O(N ) setting. The sufficiency of similar sub-linear number of RFFs with minimax guarantees was also established in the kernel least squares setting when applying mini-batched stochastic gradient descent (Carratino et al., 2018). Recently the analysis of Rudi and Rosasco (2017) has been further sharpened (in terms of the number of required RFFs, Li et al. 2019) by leveraging the notion of effective degrees of freedom.

• Classification with 0-1 loss: In the classification setting with the 0-1 loss and RKHSs, Gilbert et al. (2018) proved that M = o(N ) = ˜ O

N

^2+c²

optimized RFF features—

optimized in the sense of Bach (2017)—are sufficient to achieve a learning rate of ˜ O

N ⁻

^2+c^c

provided that the spectrum of the integral operator associated to the kernel decay polyno- mially at the rate of λ _i = O (i ^−c ) with c > 1. ⁴ The same authors showed that the learning rate can be improved to ˜ O N ⁻¹

with M = ˜ O ln ^d (N )

RFF-s in case of sub-exponential spectrum, where d denotes the dimension of the inputs in the classification.

• Kernel PCA: Sriperumbudur and Sterge (2018) have proved that the statistical perfor- mance of kernel principal component analysis (KPCA) can be matched by M = O(N ^2/3 ) (polynomial decay) or M = O( √

N ) (exponential decay) RFFs, depending on the eigenvalue decay of the covariance operator associated to the kernel. Ullah et al. (2018) derived a sim- ilar bound for a streaming KPCA algorithm under exponential spectrum decay condition.

4. The classical O(·) notation up to logarithmic factors is denoted by ˜ O(·); the extension of O(·) in almost

sure and convergence in probability sense are O

p

(·) and O

a.s.

(·).

(6)

• Kernel derivatives: If the support of the spectral measure associated to k is either bounded or it satisfies the Bernstein condition

Z

R

^d

Q d

j=1 ω ^p _j

^j

^+q

^j

n

(σ p,q ) ⁿ dΛ(ω) ≤ n!

2 K ⁿ⁻² , n = 2, 3, . . . (5) with some constant K ≥ 1 and σ _p,q =

r R

R

^d

Q d

j=1 ω _j ^p

^j

^+q

^j

2 dΛ(ω), then

∂ \ ^p,q k − ∂ ^p,q k S

M

= O _a.s.

p log |S _M |∨ √ log M

√ M

!

rate is achievable as shown by Sriperumbudur and Szab´ o (2015) and Szab´ o and Sripe- rumbudur (2019), respectively. As a practical example, one can consider for instance the previously mentioned infinite-dimensional exponential (IE) family fitting problem where the estimation boils to solving a linear equation system with coefficients made of third-order kernel derivatives (|p + q| = 3). The IE family is defined by a kernel for which a common choice is the Gaussian; this implies a Gaussian spectrum Λ. Unfortunately, in this case the bounded support condition is not met. Similarly, the Bernstein condition (5) only holds for

|p + q| ≤ 2 with the Gaussian kernel, as it can be verified (Szab´ o and Sriperumbudur, 2019, Remark ’Higher-order derivatives’ in Section 3) by making use of the analytical expressions for the absolute moments of the Gaussian spectrum. These limitations (summarized in Table 1) of the popular random Fourier features technique motivate our work and the study of widely-applied kernels with unbounded spectral support for the RFF approximation of high-order kernel derivatives. A consequence of our new estimates in Theorem 1 is that the a.s. rates previously obtained under stringent conditions (on p, q or Λ) are now available for any p, q and any spectral measure Λ with α-exponential moments (as defined in (6), α > 0). Because Bernstein condition implies exponential moments, our result includes the one given by Szab´ o and Sriperumbudur (2019).

Particularly, assuming additional smoothness on the bounded shift-invariant kernel, its derivative satisfies a representation similar to (3):

∂ ^p,q k(x, y) = Z

R

^d

" _d Y

j=1

ω _j ^p

^j

(−ω _j ) ^q

^j

# c ₍ ^P

d

i=1

|p

_i

+q

i

|)

ω ^> (x − y)

| {z }

=:f

x−y

(ω)

dΛ(ω),

where c n is the n ^th derivative of the cos(·) function. The primary difficulty is to handle the polynomial growing nature of the

F = {ω 7→ f x−y (ω) : x, y ∈ S}

function class which controls the error

\ ∂ ^p,q k−∂ ^p,q k

S . We tackle this challenge by impos- ing the finiteness of the α-exponential Orlicz norm of the spectral measure (Λ) associated to the kernel, in other words

∃α > 0, c > 0 such that E ω∼Λ

e ^ckωk

^α²

< +∞. (6)

(7)

Assumption on the spectral measure Conditions Convergence rate on p, q for

∂ ^p,q k − ∂ \ ^p,q k S

M

2nd moment exists p = q = 0 O _a.s.

√

log |S

_M

|∨ √ log M

√ M

Ref: Sriperumbudur and Szab´ o (2015, Th. 1)

bounded support any p, q O _a.s.

√

log |S

M

|∨ √ log M

√ M

Ref: Sriperumbudur and Szab´ o (2015, Th. 4)

Bernstein condition small p, q O _a.s.

√

log |S

_M

|∨ √ log M

√ M

Ref: Szab´ o and Sriperumbudur (2019)

Orlicz condition any p, q O _a.s.

√

log |S

_M

|∨ √ log M

√ M

Ref: now

Table 1: Summary of RFF guarantees on kernel values and derivatives. Last line: it includes any measure Λ with a finite α-exponential moment (for some α, c > 0, E ω∼Λ e ^ckωk

^α²

<

+∞), like the Gaussian, the inverse multiquadratic, or the sech kernel, see Corollary 4. For further examples see Table 2.

Kernels with α-exponential Orlicz spectrum include the popular Gaussian, the inverse multi- quadric, or the sech kernel; for further examples see Table 2 and Remark 5(ii). We establish the consistency and prove finite-sample uniform guarantees of the resulting Orlicz RFF scheme for the approximation of kernel derivatives at any order, as it is briefly illustrated in the last line of Table 1.

To allow this level of generality, we prove a new finite-sample deviation bound for the

empirical process related to a general class of functions f with polynomial growth of the

sample X _m . The distribution of the latter is assumed to have finite α-exponential Orlicz

norm and consequently, the random variables f (X _m ) belong to a γ-exponential Orlicz space

with index γ smaller than 1. For deriving such deviation bounds, we have been inspired by

the work of Adamczak (2008) which elegantly combines the Klein and Rio (2005) inequality

for truncated variables, the Hoffman-Jorgensen inequality to deal with sum of residual of

truncated variables, and a Talagrand (1989) inequality in γ -exponential Orlicz norms for

sum of centered random variables. However, our work significantly differs from that of

Adamczak (2008). First, our aims are different: Adamczak (2008) focuses on getting large

deviation bounds while we are looking for all-scale deviation bounds, which leads to a

different analysis (in the application of Klein-Rio inequalities for instance). Second, we are

concerned by getting upper bounds with quite explicit control. In particular, this requires

a careful treatment of Orlicz-type estimates since the function Ψ γ (x) = e ^x

^γ

− 1 defining

the Orlicz space is not convex for γ < 1 (see Figure 1), as opposed to the usual case; see

the results in Section 4. We also derive sharp estimates from the Dudley entropy integral

bound (Theorem 9), which enables us to get a tight dependency w.r.t. the diameter of the

parameter space. Furthermore, we clarify the use of the Talagrand inequality (Theorem 7);

(8)

in Adamczak (2008, Theorem 5) it is seemingly invoked for supremum over functions while it is related to sum over centered random variables. With this novel finite-sample deviation bound, the analysis of Orlicz RFFs readily follows, using optimized inequalities.

The paper is structured as follows. Our problem is formulated in Section 2. The main result on the approximation quality of kernel derivatives with random Fourier features is presented in Section 3. Properties of the Orlicz norm are summarized in Section 4. Proofs are provided in Section 5. The appendix contains additional technical details (Section A), the definition of special functions (Section B) and external statements used in the proofs (Section C).

2. Problem Formulation

In this section we formally define our problem after introducing a few notations.

Notations: Let the set of natural, real and complex numbers, positive integers, positive reals, non-negative reals and non-positive integers be denoted by N = {0, 1, . . .}, R, C, Z ⁺ = {1, 2, . . .}, R ⁺ = (0, ∞), R ^≥0 = [0, ∞) and Z ^≤0 = {0, −1, −2, . . .}, respectively. The positive value of x ∈ R is denoted by (x) ₊ = x ∨0. The Gamma function is Γ(t) = R ∞

0 x ^t−1 e ^−x dx for t ∈ C \ Z ^≤0 . For x ∈ R the secant function is sech(x) = _cosh(x) ¹ . Let aS + b = {as + b : s ∈ S}

where S ⊂ R and a, b ∈ R . For an N ∈ Z ⁺ , [N ] = {1, . . . , N }. Let the n ^th derivative of the cos(·) function (n ∈ N ) be c _n = cos ⁽ⁿ⁾ . For multi-indices p, q ∈ N ^d and ω ∈ R ^d let |p| = P _d

j=1 p _j , ω ^p = Q _d

j=1 ω _j ^p

^j

, ∂ ^p f (x) = ^∂

^|p|

^f(x)

∂

^px¹1

···∂

_xd^pd

, ∂ ^p,q g(x, y) = ^∂

^|p|+|q|

^g(x,y)

∂

^px¹1

···∂

_xd^pd

∂

y^q¹1

···∂

_yd^qd

. The diameter of a compact set T ⊂ R ^d is denoted by |T | = sup _x,y∈T kx − yk ₂ < ∞. Let S ⊂ R ^d be a Borel set. Let S _∆ = S − S = {a − b : a ∈ S, b ∈ S}. The set of Borel probability mea- sures on S is written as M ⁺ ₁ (S). Let the Banach space of real-valued, r-power µ-integrable functions on S (1 ≤ r < ∞) be L ^r (S, µ), with kf k _L

r

(S,µ) = R

S |f(x)| ^r dµ(x)

¹_r

. We use the shorthand µf = R

S f (x)dµ(x) where µ ∈ M ⁺ ₁ (S) and f ∈ L ¹ (S, µ). The product measure of µ ₁ , . . . , µ _M ∈ M ⁺ ₁ (S) is ⊗ ^M _m=1 µ _m ; specifically when all the components coin- cide we use the shorthand µ ^M = ⊗ _m∈[M] µ. The empirical measure is P M = _M ¹ P M

m=1 δ _X

_m

with δ _X being the Dirac measure concentrated on X and X ₁ , . . . , X _M ∼ ⊗ ^M _m=1 µ _m . Let (r _n ) n∈ N be a positive sequence. The boundedness of ^X _r

ⁿ

n

almost surely is denoted by X _n = O _a.s. (r _n ). Let n ∈ R ⁺ . We say that an f : R ^d → R function is of polynomial growth of order n (shortly f ∈ F _P(n) ) if sup _x∈ _R

d

|f (x)|

1+kxk

ⁿ₂

< ∞; F _P = ∪ _n∈ _R

+

F _P(n) . Let us assume that Ψ : R ^≥0 → R ^≥0 is a continuous, strictly increasing mapping, Ψ(0) = 0 and lim x→∞ Ψ(x) = ∞. The set of R ^d -valued random variables having finite Ψ-Orlicz norm is defined as L Ψ =

n

X : kXk _Ψ := inf n

c > 0 : E Ψ _kXk

2

c

≤ 1 o

< +∞ o

. Throughout the paper we will be particularly interested in (see Figure 1)

Ψ α : x ∈ R ^≥0 7→ e ^x

^α

− 1 ∈ R ^≥0 (α > 0),

in other words in random variables having finite α-exponential Orlicz norm. The fact X ∈ L Ψ

α

is equivalent to the existence of an s > 0 constant such that E

h e ^skXk

^α²

i

< ∞.

Random variables X ∈ L _Ψ

₂

and X ∈ L _Ψ

₁

are called sub-Gaussian and sub-exponential,

respectively. For f ∈ F _P and random variable X having α-exponential moment (X ∈ L Ψ

α

)

E f (X) < ∞. Special functions are defined in Table 4.

(9)

0 0.5 1 1.5 0

2 4 6

=0.1

=0.3

=1.5

Figure 1: Ψ α for different α values.

We proceed by formally defining our task. Let k : R ^d × R ^d → R be a continuous, bounded and shift-invariant kernel. Then, by the Bochner theorem (Rudin, 1990) one can assume w.l.o.g. the existence of a Λ ∈ M ⁺ ₁ R ^d

spectral measure such that k(x, y) =

Z

R

^d

cos

ω ^> (x − y) dΛ(ω)

= Z

R

^d

cos

ω ^> x

cos

ω ^> y

+ sin

ω ^> x

sin

ω ^> y

dΛ(ω).

Let p, q ∈ N ^d and assume that R

R

^d

|ω ^p+q | dΛ(ω) < ∞. In this case ∂ ^p,q k(x, y) exists, and by the dominated convergence theorem one arrives at

∂ ^p,q k(x, y) = Z

R

^d

∂ ^p cos

ω ^> x

∂ ^q cos

ω ^> y

+ ∂ ^p sin

ω ^> x

∂ ^q sin

ω ^> y

dΛ(ω).

The integral can be estimated by Monte-Carlo technique replacing Λ with Λ _M = _M ¹ P M

m=1 δ _ω

_m

, (ω m ) m∈[M]

i.i.d.

∼ Λ:

\ ∂ ^p,q k(x, y) = 1 M

M

X

m=1

h

∂ ^p cos ω _m ^> x

∂ ^q cos ω _m ^> y

+ ∂ ^p sin ω ^> _m x

∂ ^q sin

ω _m ^> y i

= hλ _p (x), λ q (y)i

R

^2M

, (7)

where λ p (x) = ^√ ¹

M

h

∂ ^p cos ω _m ^> x

m∈[M] ; ∂ ^p sin ω _m ^> x

m∈[M ]

i

∈ R ^2M ; this is the RFF feature approximation λ _p in (4). For p = q = 0, the construction reduces to the traditional RFF technique (Rahimi and Recht, 2007).

This form implies that our target quantity can be written as

∂ \ ^p,q k − ∂ ^p,q k

S = sup

z∈S

_∆

|(Λ _M − Λ)(f z )|, f z (ω) = ω ^p (−ω) ^q c |p+q|

ω ^> z

, (8)

= sup

f∈F

|(Λ _M − Λ)(f )|, F = {f _z : z ∈ S ∆ } , (9)

thus the problem boils down to the study of supremum of empirical processes with F ⊂

F _P(n) where n = |p + q| + 1. In the next section we detail our main result about the

fluctuation of such processes.

(10)

3. Main Result

In this section we present our main result on the supremum of empirical processes of poly- nomial growth, and specialize it to the approximation quality of RFFs for kernel derivatives.

The proofs are given in Section 5.

We investigate the concentration of the sup _f _∈F | _M ¹ P _M

m=1 f(X _m )| quantity under the following assumptions:

1. Compact parameterization: F = {f _t : t ∈ T } where f t : R ^d → R is parameterized by a compact set T ⊂ R ^d

0

.

2. Lipschitz condition: There exists n ∈ R ⁺ and function L : R ^d → R ^≥0 , L ∈ F _P(n) such that

(a) |f _t

₀

(x)| ≤ L(x) for some t 0 ∈ T ,

(b) |f _t

₁

(x) − f _t

₂

(x)| ≤ L(x)ρ (kt ₁ − t ₂ k ₂ ) for all x ∈ R ^d , t ₁ ∈ T, t ₂ ∈ T ,

(c) with ρ : [0, |T|] → R ^≥0 continuous strictly increasing mapping with ρ(0) = 0 such that I _ρ (|T |) := ρ (|T |) R 1

0 r log

1 + _ρ

−1

(uρ(|T ^2|T ^| |))

du < ∞.

3. Independence, finite α-exponential Orlicz norm:

(a) (X _m ) _m∈[M] are independent R ^d -valued random variables; shortly, (X _m ) _m∈[M] ∼

⊗ _m∈[M _] µ _m with µ _m ∈ M ⁺ ₁ R ^d . (b) ∃α ∈ R ⁺ such that kX _m k _Ψ

α

< ∞ for all m ∈ [M].

4. Centering: E [f (X m )] = 0 for all f ∈ F and m ∈ [M ].

Under these conditions, our main result is as follows.

Theorem 1 (Concentration of processes with polynomial growth) Assume that F and (X _m ) _m∈[M _] satisfy Assumptions 1-4 and γ := ^α _n ≤ 1. Let log stand for the natural logarithm, β _γ := Γ

1 + _γ ¹ −γ

, P := ⊗ _m∈[M] µ _m , and kLk _L

2

(X

1:M

) := q

1 M

P

m∈[M ] L ² (X _m ).

Let Ψ ^(l) γ be the convexification ⁵ of Ψ γ , A γ := ^Ψ

(l) γ

⁻¹

(1)

Ψ

⁻¹γ

(1) , B γ := Ψ ^(l) γ

−1

(1), C γ and C D be the constants defined in (45) and (46), and K _γ := 2

1 γ

−1

C _γ

16B _γ + 2

1 γ

−1

1 + A _γ

+ 8A _γ

. Then for any ε > 0 satisfying

ε ≥ 6B, B := 2C _D √

d ⁰ E

h kLk _L

2

(X

1:M

)

i

√

M I _ρ (|T |), (10)

we have

P



sup

t∈T

1 M

X

m∈[M ]

f t (X m ) ≥ ε



 ≤ 2e

−





M ε

3Kγ

k

^maxm∈[M] supt∈T|ft(Xm)|

k

_Ψ

γ





γ

+ e ⁻

M ε2

72σ2+84c ε

, (11)

5. The function Ψ

γ

is not convex for γ < 1. We convexify Ψ

γ

and use the Section 4(v) based integral

control property holding for convex Ψ-s; for details on Ψ

^(l)γ

see Section A.1.

(11)

where σ ² := sup

t∈T

1 M

X

m∈[M ]

E

f _t ² (X m ) ,

c := max

m∈[M] sup

t∈T

kf _t (X _m )k _Ψ

γ



 1 β _γ log



 6Γ

1 + _γ ¹

max _m∈[M] sup _t∈T kf _t (X _m )k _Ψ

γ

γε









1 γ

∨ 8 E

m∈[M] max sup

t∈T

|f _t (X _m )|

∈ [0, +∞).

Remark 2

(i) Two-sided bound: For P

inf t∈T 1 M

P

m∈[M] f t (X m ) ≤ −ε

the same one-sided devi- ation bound can be obtained by replacing f _t with −f _t . As a result one can estimate P

sup _t∈T

1 M

P

m∈[M] f t (X m ) ≥ ε

by twice the bound above.

(ii) Assumption (3): Assumption (3a) with Assumption (4) is weaker than being i.i.d.:

for example E [f t (X m )] = 0 holds for X m = N 0, σ _m ²

and f t (x) = c t x ³ , but X m -s can differ in their variance.

(iii) Assumption α/n ≤ 1: This condition holds without loss of generality. Indeed, in case of α/n > 1, one can get a modified (α ⁰ , n ⁰ ) pair satisfying α ⁰ /n ⁰ ≤ 1 by either increasing n to the value n ⁰ = α using that F _P(n) ⊂ F _P(n

⁰

₎ , or by decreasing α to the value α ⁰ = n using that kX _m k _Ψ

α

< ∞ implies kX _m k _Ψ

α0

< ∞ for any α ⁰ ∈ (0, α).

(iv) Proof-related remarks:

1. Compactness of T : This compactness with the Lipschitz property enables one to control the covering number of F.

2. Truncated functions: The Lipschitz property of F implies that of the truncated functions: for ∀x ∈ R ^d , s and t ∈ T

|T _c f t (x) − T _c f s (x)| ≤ |f _t (x) − f s (x)| ≤ L(x)ρ (kt − sk ₂ ) , (12) where T _c f (x) := f(x) 1 |f(x)|≤c + c 1 f(x)>c − c 1 f (x)<−c is f soft-thresholded at level c.

3. F ⊂ F _P(n) : This property is inherited (Section 5.4) from L ∈ F _P(n) by the Lipschitz conditions (2a)-(2b).

4. Finiteness of the terms in Theorem 1:

max _m∈[M] sup _t∈T |f _t (X _m )|

Ψ

^α

n

and E

max _m∈[M] sup _t∈T |f _t (X _m )|

are finite (see Section 5.4) in Theorem 1 by the Lip- schitz assumption (2a)-(2b), kX _m k _Ψ

α

< ∞ (Assumption (3b)) and L ∈ F _P(n) (As- sumption (2)).

(v) RFF specialization: Assuming that the α-exponential Orlicz condition holds for the spectral measure Λ associated to k (∃α ∈ R ⁺ such that kωk _Ψ

α

< ∞, ω ∼ Λ), ⁶ one can see (Section 5.1) that RFFs are covered by choosing

d ⁰ = d, f t (x) ← f z (ω) − Λf z , t ← z, T ← S _∆ , X m ← ω m ,

6. This requirement implies that R

R^d

|ω

^p+q

| dΛ(ω) < ∞ and thus the existence of ∂

^p,q

k for any p, q ∈ N

^d

.

(12)

ρ(u) = u ^β , β = 1

1 + (log|S _∆ |) ₊ ∈ (0, 1], n ← |p + q|+β.

While any value of β ∈ (0, 1] would meet the assumptions, allowing β to depend on the diameter of S _∆ enables us to get optimal convergence rates w.r.t. the diameter (see Corollary 4).

The terms driving the guarantee for RFF can be bounded (Section 5.6) as follows: there is a constant C _RFF ∈ R ⁺ , depending only on Λ, |p + q|, but not on |S _∆ | and M , such that

B ≤ C RFF

p 1 + (log|S _∆ |) ₊

√

M ,

σ ² ≤ C RFF ,

m∈[M max ] sup

z∈S

_∆

kg _z (ω m )k _Ψ

γ

≤ C RFF ,

m∈[M] max sup

z∈S

∆

|g _z (ω _m )|

Ψ

γ

≤ C _RFF [log(1 + M )] ^n/α ,

E

"

m∈[M] max sup

z∈S

∆

|g _z (ω _m )|

#

≤ C _RFF [log(1 + M )] ^n/α .

(13)

Using these bounds, our finite-sample uniform guarantee on Orlicz RFFs is as follows.

Corollary 3 (Orlicz RFFs for kernel derivative approximation) Let k : R ^d × R ^d → R be a continuous, bounded, shift-invariant kernel with spectral measure Λ. Suppose that Λ satisfies the α-exponential Orlicz assumption (∃α ∈ R ⁺ such that kωk _Ψ

α

< ∞, ω ∼ Λ) and let S ⊂ R ^d be a compact set. Let β = _1+(log|S ¹

∆

|)

+

∈ (0, 1], let p, q ∈ N ^d , n := |p +q|+β, and assume that γ := ^α _n ≤ 1. Let ∂ \ ^p,q k be the RFF estimate of ∂ ^p,q k using (ω m ) _m∈[M _] ^i.i.d. ∼ Λ samples as given in Eq. (7). Then, there exists a constant C ˜ ∈ R ⁺ (depending only on Λ,

|p + q|, but not on S and M) such that for any ≥ ^C ^˜

√ 1+(log|S

_∆

|)

₊

√

M ,

Λ ^M

∂ \ ^p,q k − ∂ ^p,q k S

≥

≤ 2e ⁻

(M ε)γ C˜log(1+M)

+ e

−

^{M ε}²

C˜

1+ε

[

^log

(

^C/ε^˜

)

^∨log(1+M⁾

]

^1/γ

. (14) Corollary 4 (Almost sure convergence for kernel derivative approximation) Let p, q ∈ N ^d and k : R ^d × R ^d → R be a continuous, bounded, shift-invariant kernel with spectral mea- sure Λ which satisfies the α-exponential Orlicz assumption for some α > 0. Then, for any sequence of compact sets (S _M ) ^∞ _M ₌₂ such that (log |S _M |) ₊ = o(M), we have

∂ \ ^p,q k − ∂ ^p,q k S

M

= O _a.s.

p (log |S _M |) ₊ ∨ log M

√ M

!

. (15)

(13)

Remark 5

(i) Spectral measure (Λ) examples: Our result assumes the α-exponential Orlicz prop- erty of the spectral measure Λ associated to k. In Table 2 we provide various examples for Λ (with the relevant case of unbounded support) satisfying this requirement; their relation is summarized in Figure 2. While for the RFF approximation it is not neces- sary, in many of these examples the corresponding kernel value can also be computed, see Table 3.

(ii) α-exponential Orlicz assumption for tensor product kernels: Using the α- exponential Orlicz spectral measures of Table 2 on R , one can immediately construct Orlicz spectral measures on R ^d . Indeed, assume that (i) k is a product kernel, i.e.

k(x, y) = Q

i∈[d] k _i (x _i , y _i ), Λ = ⊗ _i∈[d] Λ _i , and (ii) Λ _i , the spectral measure associated to k i , satisfies the α i -exponential Orlicz assumption (α i ∈ R ⁺ ). Then ω ∼ Λ is α- exponential Orlicz with α = min _i∈[d] α i ; see Section A.2.

(iii) α-exponential Orlicz vs. Bernstein assumption: Our result complements Szab´ o and Sriperumbudur (2019)’s work, where the authors showed that for d = 1 and spectral densities f _λ (ω) ∝ e ^−ω

^2`

the Bernstein condition (and hence fast rates) holds for |p+q| ≤ 2` = α. Indeed, we proved under the more general α-exponential Orlicz assumption the same a.s. convergence rates for any arbitrary order (see Corollary 4) kernel derivatives.

Spectrum Spectral density: f Λ (ω) Parameters α

Gaussian ^√ ¹

2πσ e ⁻

^ω

2

2σ2

σ > 0 2

Laplace ^σ ₂ e ^−σ|ω| σ > 0 1

generalized Gaussian ^α

2βΓ (

_α¹

) e ⁻

|ω|

β α

α > 0, β > 0 α

variance Gamma ^σ

2b

|ω|

^b−¹²

K

_b−1 2

(σ|ω|)

√ πΓ(b)(2σ)

^b−¹²

σ > 0, b > ¹ ₂ 1

Weibull (S) _2λ ^s _|ω|

λ

s−1

e ⁻

_|ω|

λ

s

s > 0, λ > 0 s exponentiated exponential (S) _2λ ^α

1 − e ⁻

^|ω|^λ

α−1

e ⁻

^|ω|^λ

λ > 0, α > 0 1 exponentiated Weibull (S) ^αs _2λ

_|ω|

λ

s−1 1 − e ⁻

_|ω|

λ

s

α−1

× s > 0, λ > 0, α > 0 s

×e ⁻

_|ω|

λ

s

Nakagami (S) _Γ(m)Ω ^m

^mm

|ω| ^2m−1 e ⁻

^mω

2

Ω

m ≥ ¹ ₂ , Ω > 0 2

chi-squared (S) ¹

2

^s²⁺¹

Γ (

^s₂

) |ω|

²^s

⁻¹ e ⁻

^|ω|²

s ∈ Z ⁺ 1

Erlang (S) ^λ

^s

^|ω| _2(s−1)!

^s−1

^e

^−λ|ω|

s ∈ Z ⁺ , λ > 0 1

(14)

Spectrum Spectral density: f _Λ (ω) Parameters α Gamma (S) _2Γ(s)θ ¹

s

|ω| ^s−1 e ⁻

^|ω|^θ

s > 0, θ > 0 1 generalized Gamma (S) ^p/a

^D

2Γ

D

p

|ω| ^D−1 e ⁻

_|ω|

a

p

a > 0, D > 0, p > 0 p Rayleigh (S) _2σ ^|ω|

2

e ⁻

^ω

2

2σ2

σ > 0 2

Maxwell-Boltzmann (S) ^√ ¹

2π ω

²

e

⁻

ω2 2a2

a

³

a > 0 2

chi (S) ¹

2

^s²

Γ (

₂^s

) |ω| ^s−1 e ⁻

^ω

2

s > 0 2

exponential-logarithmic (S) − _{2 log(p)} ¹ _1−(1−p)e ^β(1−p)e

^−β|ω|_−β|ω|

p ∈ (0, 1), β > 0 1 Weibull-logarithmic (S) − _{2 log(p)} ¹ ^{αβ(1−p)|ω|}

^α−1

^e

^−β|ω|

α

1−(1−p)e

^−β|ω|^α

p ∈ (0, 1) , β > 0, α > 0 α Gamma/Gompertz (S) ^bse

^b|ω|

^β

^s

2 ( ^β−1+e

^b|ω|

)

^s+1

b > 0, β > 0, s > 0 bs hyperbolic secant ¹ ₂ sech ^π ₂ ω

1 logistic ^e

⁻

ω s

s h

1+e

⁻^ω^s

i

2

s > 0 1

normal-inverse Gaussian ^αδK

¹

( ^α ^√ ^δ

²

^+ω

²

)

π √

δ

²

+ω

²

e ^δα α > 0, δ ∈ R 1

hyperbolic _2δK ¹

1

(δα) e ^−α

√ δ

²

+ω

²

α > 0, δ ∈ R 1 generalized hyperbolic ^√ ^(α/δ)

^λ

2πK

λ

(δγ) K

_λ−1

2

( ^α ^√ ^δ

²

^+ω

²

)

√

δ2+ω2 α

¹₂−λ

α > 0, λ ∈ R , δ ∈ R 1 Table 2: Kernel spectrum examples in one dimension (d = 1) obeying the α-exponential Orlicz assumption. ’(S)’ stands for symmetrized. The symmetrization guarantees that the kernel associated to Λ is real-valued. Last column: Orlicz exponent. For the variance Gamma distribution the Orlicz exponent follows from the known K u (z) ∼ p

π/(2z)e ^−z asymptotics (Barndorff-Nielsen et al., 2001, page 297) where K _u is the modified Bessel function of the second kind, as defined in Table 4. Notice that the ’normal-inverse Gaussian

δ=σ

²

α, α→∞

− −−−−−−−− → Gaussian’ limit (see Figure 2) changed the Orlicz exponent from 1 to 2.

4. Properties of the Orlicz Norm

In this section, for self-containedness we summarize the properties of k·k _Ψ which hold inde- pendently of the convexity/non-convexity of Ψ (unless explicitly required).

Let X, X ⁰ ∈ R ^d be random variables, and assume that Ψ : R ^≥0 → R ^≥0 (and similarly Φ

below) is continuous, strictly increasing, Ψ(0) = 0 and lim x→∞ Ψ(x) = ∞.

(15)

exponentiated exponential (S)

Weibull

-logarithmic (S) Gamma/Gompertz (S)

exponentiated Weibull (S)

exponential

-logarithmic (S) chi-squared (S) Erlang (S)

Weibull (S) generalized

Gamma (S) Gamma (S) Laplace

Rayleigh (S) chi (S) variance Gamma generalized

Gaussian

hyperbolic secant Maxwell-Boltzmann (S) normal-inverse

Gaussian Gaussian

logistic generalized

hyperbolic hyperbolic Nakagami (S)

α = 1

β = 1 s = 1

α=1 s = 1

s = 2 s = 1

p = D p = 1

p = 2, a = 2

s ∈ Z

⁺

s ∈ Z

⁺

/2

s = 2 (σ = 1)

s = 3 (a = 1)

b = 1 α = 1

α = 2 δ=σ

²

α,

α→∞

λ = −

¹₂

δ = 0

λ = 1

m =

¹₂

m =

^s₂

Figure 2: Relation of the spectral density examples of Table 2. ’(S)’ stands for symmetrized.

(16)

Kernel name Kernel value: k(x, y) Spectrum

Gaussian e ⁻

^σ2(x−y)2²

Gaussian

inverse quadric _σ

2

+(x−y) ^σ

² ²

Laplace

–

√ π

Γ(1/α) Ψ _1,1

1 α , ² _α

; ¹ ₂ ; 1

, ^{−[β(x−y)]} ₄

²

generalized Gaussian ^a inverse multiquadric h

σ

²

σ

²

+(x−y)

²

i b

variance Gamma

– P

n∈2 N

(−1)

ⁿ2

(x−y)

ⁿ

λ

ⁿ

n! Γ 1 + ⁿ _s

Weibull (S) ^b – [1−2i(x−y)]

⁻^s²

+[1+2i(x−y)]

⁻^s²

2 chi-squared (S) ^b

–

h

1−

^i(x−y)_λ

i

−s

+ h

1+

^i(x−y)_λ

i

−s

2 Erlang (S) ^b

– [1−θi(x−y)]

^−s

+[1+θi(x−y)]

^−s

2 Gamma (S) ^b

– 1 − σ(x − y)e ⁻

^σ2(x−y)2²

p _π

2 erfi

_σ(x−y)

√ 2

Rayleigh (S) ^b

– F _1,1

s

2 ; ¹ ₂ ; ^−(x−y) ₂

²

chi (S) ^b

– P

n∈2 N

(−1)

ⁿ2

(x−y)

ⁿ

Γ (

ⁿ_α

⁺¹ )

− log(p)n!β

^αⁿ

Li

ⁿ

α

+1 (1 − p) Weibull-logarithmic (S) ^b,c – ¹ ₂ [c _Λ (x − y) + c _Λ (y − x)], with c _Λ (t) = Gamma/Gompertz (S) ^b

= β ^s _sb−ti ^sb F _2,1 s + 1; − ^ti _b + s; − ^ti _b + s + 1; 1 − β

sech sech(x − y) hyperbolic secant

– sinh(πs(x−y)) ^πs(x−y) logistic

– e ^δ

h α− √

α

²

+(x−y)

²

i

normal-inverse Gaussian

– ^αK

¹

δ √

α

²

+(x−y)

²

√

α

²

+(x−y)

²

K

1

(δα) hyperbolic

–

√ α

α

²

+(x−y)

²

λ K

λ

δ

√

α

²

+(x−y)

²

K

λ

(δα) generalized hyperbolic

a. The analytical computation of the characteristic function (and hence the kernel value) was carried out for α > 1 (Pog´ any and Nadarajah, 2010).

b. In case of symmetrization (S): k(x, y) =

¹₂

[c

Λ

(x − y) + c

Λ

(y − x)] where c

Λ

(t) = E

^ω∼Λ

[e

^itω

] is the char- acteristic function of the spectral measure (on R

^≥0

) before symmetrization; i = √

−1.

c. The characteristic function was obtained by Ciumara and Preda (2009).

Table 3: Kernel examples for the spectral densities given in Table 2. The special functions

Ψ 1,1 , erfi, F 1,1 , Li, F 2,1 and K λ are defined in Table 4.

(17)

(i) Normalization: If X ∈ L Ψ then E h

Ψ _kX _k

2

kXk

_Ψ

i

≤ 1.

(ii) Constant: For a λ ∈ R constant kλk _Ψ = |λ|/Ψ ⁻¹ (1).

(iii) Monotonicity in Ψ: Ψ ≤ Φ implies kXk _Ψ ≤ kXk _Φ .

(iv) Monotonicity in the argument: If d = 1 and 0 ≤ X ≤ X ⁰ a.s., then kXk _Ψ ≤ kX ⁰ k _Ψ . (v) Finite k·k _Ψ implies integrability: If Ψ is convex and X ∈ L _Ψ , then E [kXk ₂ ] ≤

kXk _Ψ Ψ ⁻¹ (1).

(vi) Generalized triangle inequality: Let X, X ⁰ ∈ L _Ψ

_α

and α ∈ R ⁺ . Then X + X ⁰ ∈ L _Ψ

_α

and

X + X ⁰

Ψ

α

≤ 2 (

_α¹

⁻¹ )

₊

kXk _Ψ

α

+ X ⁰

Ψ

α

. (vii) Deviation inequality from k·k _Ψ : If X ∈ L Ψ then P (kXk ₂ ≥ c) ≤ _Ψ(c/kXk ²

Ψ

)+1 for any c ≥ 0.

(viii) Maximal inequality for k·k _Ψ

α

and α ∈ R ⁺ : for any sequence (X m ) ^M _m=1 of random variables in L Ψ

α

, we have

m∈[M] max kX _m k ₂ Ψ

α

≤ max

m∈[M ] kX _m k _Ψ

α

log(1 + M ) log(3/2)

1/α

. The proofs of these properties are available in Section A.3.

5. Proofs

After introducing a few additional notations, we provide the proofs of our results and re- marks presented in Sections 3 and 4. External statements used in the proofs are summarized in Section C.

Notations: For γ ∈ (0, 1] and x ∈ R ^≥0 , let I _γ (x) = R x

0 e ^−t

^γ

dt be the incomplete Gamma function. Let (Z, m) be a semi-metric space and ∈ R ⁺ . The set S ⊆ Z is said to be an -net of Z if for any z ∈ Z there exists s ∈ S such that m(s, z) ≤ . The -covering number of Z is defined as the size of the smallest -net, i.e., N (, m, Z) = inf

n

` ≥ 1 : ∃s ₁ , . . . , s _l ∈ Z such that Z ⊆ ∪ ^` _j=1 B m (s j , ) o

, where B m (s, ) = {z ∈ Z : m(z, s) ≤ } is the closed ball with center s ∈ Z and radius .

5.1 Proof of Remark 2(v)

In view of (8)-(9), we need to check Assumptions 1-4 with the parameterized function class g z (ω) := f z (ω) − Λf z = ω ^p (−ω) ^q c _|p+q|

ω ^> z

− Λf z , (z ∈ S _∆ ).

Thanks to the α-exponential Orlicz condition on Λ and the i.i.d. property of (ω m ) m∈[M ] in (7), Assumption 3 is trivially fulfilled. Assumption 4 holds by the definition of g _z (·) and because the distribution of ω _m is Λ. Assumption 1 is satisfied since S _∆ is a compact set of R ^d . Therefore, it remains to prove Assumption 2, with the existence of n ∈ R ⁺ and L ∈ F _P(n) . First, notice that

|f _z (ω)| ≤ Y

i∈[d]

|ω _i | ^p

ⁱ

^+q

ⁱ

≤ kωk ^|p+q| ₂ . (16)

(18)

• Order: (16) implies that

|g _z (ω)| ≤ |f _z (ω)| + Λ|f _z | ≤ kωk ^|p+q| ₂ + Λ h

k · k ^|p+q| ₂ i

=: L ₁ (ω). (17)

• Lipschitz condition: Let [z 1 , z 2 ] = {az ₁ + (1 − a)z 2 : a ∈ [0, 1]} denote the segment connecting z ₁ , z ₂ ∈ R ^d . By using the mean value theorem

|g _z

₁

(ω) − g _z

₂

(ω)| ≤ max

z∈[z

₁

,z

2

]

∂g z (ω)

∂z 2

kz ₁ − z ₂ k ₂ , (18)

∂g

z

(ω)

∂z = ^∂f _∂z

^z

^(ω) − Λ ^∂f _∂z

^z

^(ω) with ^∂f _∂z

^z

^(ω) = ω ^p (−ω) ^q c _|p+q|+1 ω ^> z

ω, and by using similar computations as before, one gets

∂g _z (ω)

∂z 2

≤ kωk ^|p+q|+1 ₂ + Λ h

k · k ^|p+q|+1 ₂ i

=: L ₂ (ω). (19)

As a result, to fulfill Assumption 2, we can take L(ω) = max(L 1 (ω), L 2 (ω)) and ρ(u) = u.

For such L, we have n = |p + q| + 1.

Refined L and ρ: We now derive refined L and ρ, by interpolating different bounds. From (17), we can obtain the crude estimate |g _z

₁

(ω) − g z

2

(ω)| ≤ 2L 1 (ω), which combined with (18)-(19) gives

|g _z

₁

(ω) − g _z

₂

(ω)| ≤ (2L ₁ (ω)) ^1−β h

kz ₁ − z ₂ k ₂ L ₂ (ω) i β

(20) for any β ∈ (0, 1]. Here we have used that if 0 ≤ x ≤ min(x ₁ , x ₂ ) then x ≤ x ^1−β ₁ x ^β ₂ . It follows that one can take

ρ(u) = u ^β , n = |p + q| + β, L(ω) = max

L ₁ (ω), (2L ₁ (ω)) ^1−β L ^β ₂ (ω)

∈ F _P(n) . (21) For β = 1, we retrieve the former choice of L and ρ. Furthermore, we have

I _ρ (|T |) =|T | ^β Z ₁

0 s log

1 + 2|T | (u|T | ^β ) ^1/β

du =|T | ^β Z ₁

0 s log

1 + 2

u ^1/β

du < +∞. (22) Notice that the advantage of having the additional degree-of-freedom β is two-fold, and it is striking when β → 0 (compared to β = 1). Firstly, it gives a smaller n, which has a (slight) positive impact on the control of statistical fluctuations; secondly, the dependence of I _ρ (|T |) in the diameter |T | is smaller through the growth exponent.

To conclude, we have proved that Orlicz RFFs fulfill the assumptions of Theorem 1. Later in Section 5.6, we will establish that I _ρ (|T|) satisfies a (tight) bound w.r.t. p

1 + (log|T |) ₊ . 5.2 Proof that Polynomial Growth Preserves the Exponential Orlicz Property We show that kf (X)k _Ψ

γ

< ∞ for kXk _Ψ

α

< ∞, f ∈ F _P(n) , n ∈ R ⁺ , γ = ^α _n . Indeed, by the definition of f ∈ F _P(n) , there exists C ∈ R ⁺ such that |f (x)| ≤ C(1 + kxk ⁿ ₂ ) for all x ∈ R ^d . Hence for any γ > 0

|f (x)| ^γ ≤ C ^γ (1 + kxk ⁿ ₂ ) ^γ

(∗)

≤ 2 ^(γ−1)

⁺

C ^γ (1 + kxk ^nγ ₂ ) , (23)

(19)

where in (∗) we used that

(a + b) ^γ ≤ 2 ^(γ−1)

⁺

(a ^γ + b ^γ ) , a, b ≥ 0, γ > 0. (24) Since X ∈ L Ψ

α

there is some s ∈ R ⁺ for which E

h e ^skXk

^α²

i

< ∞. Combining this property with (23) and recalling that nγ = α yields

e ^s

⁰

^|f ^(x)|

^γ

≤ e ^s

⁰

²

^(γ−¹⁾⁺

^C

^γ

( ^1+kxk

^α2

) ⇒ E h

e ^s

⁰

^|f(X)|

^γ

i

≤ e ^s

⁰

²

^(γ−¹⁾⁺

^C

^γ

E h

e ^skXk

^α²

i

< ∞ with s ⁰ = ^s

2

^(γ−¹⁾⁺

C

^γ

; this shows that f (X) ∈ L _Ψ

_γ

. 5.3 Proof of Theorem 1

By introducing the R _c f(x) := f (x) − T _c f (x) notation of residuals obtained at level c ∈ R ⁺ (the value of c will be specified later), we bound the target quantity by using the sub- additivity of supremum

sup

t∈T

1 M

X

m∈[M]

f t (X m )

| {z }

T

c

f

t

(X

m

)+R

c

f

t

(X

m

)

= sup

t∈T

1 M

X

m∈[M]

(T _c f t (X m ) − E [T _c f t (X m )] + E [T _c f t (X m )] + R _c f t (X m ))

≤ sup

t∈T

1 M

X

m∈[M]

T _c f _t (X _m ) − E [T _c f _t (X _m )]

| {z }

Z

^T^c

+ sup

t∈T E



 1 M

X

m∈[M]

T _c f _t (X _m )





| {z }

E

^T^c

+ sup

t∈T

1 M

X

m∈[M ]

R _c f _t (X _m )

| {z }

Z

^R^c

.

This means that using c for which E ^T

^c

≤ ₃ ^ε ,

P



sup

t∈T

1 M

X

m∈[M ]

f _t (X _m ) ≥ ε



 ≤ P Z ^R

^c

≥ ε/3 + P

Z ^T

^c

≥ ε/3

. (25)

The structure of our proof is as follows.

1. Unbounded part (Z ^R

^c

): Based on the Talagrand and the Hoffman-Jorgensen inequali- ties, for large enough c (referred to as c HJ ) we will derive an exponential control over P Z ^R

^c

≥ ε/3

expressed with

max _m∈[M _] sup _t∈T |f _t (X _m )|

Ψ

γ

which is finite by Sec- tion 5.4.

2. Bounded part (Z ^T

^c

): We handle this term using the Klein-Rio inequality and the Dudley entropy integral bound. In addition, this part will give rise to the constraint (10) on ε.

3. Truncation (E ^T

^c

): As E [f t (X m )] = 0, T _c f t ≈ f t and E [T _c f t (X m )] ≈ 0 for large c (called

c _min ). The E ^T

^c

≤ ^ε ₃ requirement can be controlled via the integral form of the expectation

of non-negative random variables and the incomplete Gamma function.

(20)

The bounding of the Z ^R

^c

, Z ^T

^c

and E ^T

^c

quantities is detailed in the following sections.

Plugging the (26) and (28) results of the computations into (25) gives the final bound (11).

The ε constraint comes from (33), provided that c ≥ c _min ∨ c _HJ . The constants c _min and c _HJ are defined in (35) and (38), respectively.

5.3.1 Bounding Z ^R

^c

P Z ^R

^c

≥ ε/3

is bounded as

P Z ^R

^c

≥ ε/3

≤ P



sup

t∈T

X

m∈[M ]

|R _c f t (X m )| ≥ M ε/3





(a)

≤ P



 X

m∈[M]

sup

t∈T

|R _c f _t (X _m )| ≥ M ε/3





(b)

≤ 2e

−





M ε/3

k

^Pm∈[M] supt∈T|Rcft(Xm)|

k

_Ψγ





γ

(c)

≤ 2e

−





^{M ε}

3Kγ

k

^maxm∈[M] supt∈T|ft(Xm)|

k

_Ψ

γ





γ

, (26)

where in (a) we used the the sub-additivity of the supremum, in (b) the deviation inequality Section (4)(vii) was applied, (c) holds by Section 5.5 for c ≥ c HJ (the value of c HJ is defined in Section 5.5).

5.3.2 Bounding Z ^T

^c

Below we will invoke the Klein-Rio inequality and control the expectation E h

Z ^T

^c

i

.

• Klein-Rio inequality: Let g _m,t : x ∈ R ^d 7→ T _c f _t (x) − E [T _c f _t (X _m )] and let us define the function classes

T _c F ^[M ^] := {g _t := (g 1,t , . . . , g M,t ) : t ∈ T }, T _c F := {T _c f t : t ∈ T }.

– g m,t ∈ [−2c, 2c] are measurable and bounded functions.

– Centering: by construction E [g m,t (X m )] = 0 (∀m ∈ [M]).

– Countability: Since t 7→ f _t is continuous, the sup _t∈T can be restricted to rational numbers (T ∩ Q ^d ), one can take T ← T ∩ Q ^d , and assume that T _c F ^[M ^] is countable.

Orlicz Random Fourier Features

HAL Id: hal-02418576

https://hal.archives-ouvertes.fr/hal-02418576v2

Preprint submitted on 19 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

To cite this version:

Linda Chamakh, Emmanuel Gobet, Zoltán Szabó. Orlicz Random Fourier Features. 2020. �hal-

02418576v2�

Orlicz Random Fourier Features

Linda Chamakh †‡ [email protected] Emmanuel Gobet † [email protected] Zolt´ an Szab´ o † [email protected]

† CMAP, CNRS, ´ Ecole Polytechnique, Institut Polytechnique de Paris 91128 Palaiseau, France

‡ Global Markets Quantitative Research – BNP Paribas

Editor: Jean-Philippe Vert

Abstract

Keywords: random Fourier features, kernel derivative, polynomial-growth functions, α-exponential Orlicz norm, unbounded empirical processes

1. Introduction

© 2020 Linda Chamakh, Emmanuel Gobet and Zolt´ an Szab´ o.

In applications one is often given {x n } N n=1 samples and is facing with an optimization problem expressed in terms of function values and derivatives 1

f min ∈ H

l {∂ p f (x n )} n∈[N]

p∈D

, kf k 2 H

!

, (1)

where [N ] = {1, . . . , N }, D n ⊂ N d , N := {0, 1, . . .}, l : R 1+

P

#D

→ R is a loss function, #D n is the cardinality of the set D n , ∂ p f (x n ) := ∂

f (x

)

∂

···∂

and the RKHS H k is characterized by f (x) = hf, k(·, x)i H

where φ(x) = k(·, x) ∈ H k .

For example by taking the quadratic loss, Tikhonov regularization, only function values (D n = {0}, ∀n ∈ [N ]) and λ > 0, (1) reduces to kernel ridge regression

f∈ min H

1 N

X

n∈[N]

[f (x n ) − y n ] 2 + λ kfk 2 H

.

Alternatively, one can get back Hermite learning with gradient data (Zhou, 2008; Shi et al., 2010) by additionally including first-order derivatives

f∈ min H

1 N

X

n∈[N]

[f (x n ) − y n ] 2 +

f 0 (x n ) − y 0 n

2 2

+ λ kf k 2 H

, λ > 0

where f 0 (x) = [∂ e

f (x); . . . ; ∂ e

f(x)] ∈ R d is the derivative of f , e j ∈ R d is the j th canonical basis vector, k·k 2 is the Euclidean norm and D n = n

0, {e j } d j=1 o

An appealing property of RKHSs is that their geometry makes the optimization prob- lem (1) defined over function spaces computationally tractable. Indeed, assuming that l is increasing in its last argument, the ∂ p f (x) =

f, ∂ p,0 k(·, x)

H

derivative-reproducing prop- erty of kernels and the representer theorem (Zhou, 2008) guarantee that the solution of (1) has a finite-dimensional parameterization f (·) = P

n∈[N]

P

p∈D

a n,p ∂ p,0 k(·, x n ) (a n,p ∈ R ) and it is sufficient to solve

min a l













 X

m∈[N ]

X

q∈D

a m,q ∂ p,q k(x n , x m )





 n∈[N]

p∈D

Linda Chamakh ^†‡ [email protected] Emmanuel Gobet ^† [email protected] Zolt´ an Szab´ o ^† [email protected]

In applications one is often given {x _n } ^N _n=1 samples and is facing with an optimization problem expressed in terms of function values and derivatives ¹

l {∂ ^p f (x n )} _n∈[N]

, kf k ² _H

where [N ] = {1, . . . , N }, D n ⊂ N ^d , N := {0, 1, . . .}, l : R ¹⁺

→ R is a loss function, #D n is the cardinality of the set D n , ∂ ^p f (x n ) := ^∂

^f ^(x

⁾

and the RKHS H k is characterized by f (x) = hf, k(·, x)i _H

For example by taking the quadratic loss, Tikhonov regularization, only function values (D _n = {0}, ∀n ∈ [N ]) and λ > 0, (1) reduces to kernel ridge regression

[f (x _n ) − y _n ] ² + λ kfk ² _H

[f (x n ) − y n ] ² +

f ⁰ (x n ) − y ⁰ _n

+ λ kf k ² _H

where f ⁰ (x) = [∂ ^e

f (x); . . . ; ∂ ^e

f(x)] ∈ R ^d is the derivative of f , e _j ∈ R ^d is the j ^th canonical basis vector, k·k ₂ is the Euclidean norm and D _n = n

0, {e _j } ^d _j=1 o

An appealing property of RKHSs is that their geometry makes the optimization prob- lem (1) defined over function spaces computationally tractable. Indeed, assuming that l is increasing in its last argument, the ∂ ^p f (x) =

f, ∂ ^p,0 k(·, x)

a _n,p ∂ ^p,0 k(·, x _n ) (a _n,p ∈ R ) and it is sufficient to solve

a m,q ∂ ^p,q k(x n , x m )

 _n∈[N]

a n,p a m,q ∂ ^p,q k(x n , x m )

determined by the ∂ ^p,q k(x, y) := ^∂

kernel derivatives; a = (a _n,p ) _{n∈[N],p∈D}

ω ^> (x − y)

k(x, x ⁰ ) ≈

λ(x), λ(x ⁰ )

\ ∂ ^p,q k − ∂ ^p,q k

_S := sup

∂ ^p,q k(x, y) − ∂ \ ^p,q k(x, y) ,

where S ⊂ R ^d is a compact set. Despite the large number of successful RFF applications, quite little is understood theoretically on its approximation quality. Below we provide a brief summary with particular focus on optimal guarantees and results related to kernel derivatives.

• Kernel values (p = q = 0): The uniform finite-sample bounds (Rahimi and Recht, 2007; Sutherland and Schneider, 2015) have recently been improved (Sriperumbudur and Szab´ o, 2015) exponentially in terms of the diameter of the compact set S _M (|S _M |) arriving to ⁴

= O _a.s.

= O _p _|S

, where ∨ denotes the maximum. The result shows that the diameter of the set S _M can grow at a

|S _M | = e ^o(M) rate while still getting a consistent estimate; this rate is optimal as shown in the characteristic function literature (Cs¨ org¨ o and Totik, 1983).