
HAL Id: hal-00594399

https://hal.archives-ouvertes.fr/hal-00594399v4

Submitted on 14 Jan 2019


Adaptive and optimal online linear regression on $\ell^1$-balls

Sébastien Gerchinovitz, Jia Yuan Yu

To cite this version:

Sébastien Gerchinovitz, Jia Yuan Yu. Adaptive and optimal online linear regression on $\ell^1$-balls. Theoretical Computer Science, Elsevier, 2014, 519, pp. 4-28. doi:10.1016/j.tcs.2013.09.024. hal-00594399v4


Adaptive and optimal online linear regression on $\ell^1$-balls

Sébastien Gerchinovitz^{a,1,*}, Jia Yuan Yu^{b}

^{a} École Normale Supérieure, 45 rue d'Ulm, 75005 Paris, France

^{b} IBM Research, Damastown Technology Campus, Dublin 15, Ireland

Abstract

We consider the problem of online linear regression on individual sequences. The goal in this paper is for the forecaster to output sequential predictions which are, after $T$ time rounds, almost as good as the ones output by the best linear predictor in a given $\ell^1$-ball in $\mathbb{R}^d$. We consider both the cases where the dimension $d$ is small and large relative to the time horizon $T$. We first present regret bounds with optimal dependencies on $d$, $T$, and on the sizes $U$, $X$ and $Y$ of the $\ell^1$-ball, the input data and the observations. The minimax regret is shown to exhibit a regime transition around the point $d = \sqrt{T}UX/(2Y)$. Furthermore, we present efficient algorithms that are adaptive, i.e., that do not require the knowledge of $U$, $X$, $Y$, and $T$, but still achieve nearly optimal regret bounds.

Keywords: Online learning, Linear regression, Adaptive algorithms, Minimax regret

1. Introduction

In this paper, we consider the problem of online linear regression against arbitrary sequences of input data and observations, with the objective of being competitive with respect to the best linear predictor in an $\ell^1$-ball of arbitrary radius. This extends the task of convex aggregation. We consider both low- and high-dimensional input data. Indeed, in a large number of contemporary problems, the available data can be high-dimensional: the dimension of each data point is larger than the number of data points. Examples include analysis of DNA sequences, collaborative filtering, astronomical data analysis, and cross-country growth regression. In such high-dimensional problems, performing linear regression on an $\ell^1$-ball of small diameter may be helpful if the best linear predictor is sparse. Our goal is, in both low and high dimensions, to provide online linear regression algorithms along with regret bounds on $\ell^1$-balls that characterize their robustness to worst-case scenarios.

1.1. Setting

We consider the online version of linear regression, which unfolds as follows. First, the environment chooses a sequence of observations $(y_t)_{t \ge 1}$ in $\mathbb{R}$ and a sequence of input vectors $(x_t)_{t \ge 1}$ in $\mathbb{R}^d$, both initially hidden from the forecaster. At each time instant $t \in \mathbb{N} = \{1, 2, \dots\}$, the environment reveals the data $x_t \in \mathbb{R}^d$; the forecaster then gives a prediction $\hat{y}_t \in \mathbb{R}$; the environment in turn reveals the observation $y_t \in \mathbb{R}$; and finally, the forecaster incurs the square loss $(y_t - \hat{y}_t)^2$. The dimension $d$ can be either small or large relative to the number $T$ of time steps: we consider both cases.

In the sequel, $u \cdot v$ denotes the standard inner product between $u, v \in \mathbb{R}^d$, and we set $\|u\|_\infty \triangleq \max_{1 \le j \le d} |u_j|$ and $\|u\|_1 \triangleq \sum_{j=1}^d |u_j|$. The $\ell^1$-ball of radius $U > 0$ is the following bounded subset of $\mathbb{R}^d$:

$$B_1(U) \triangleq \bigl\{u \in \mathbb{R}^d : \|u\|_1 \le U\bigr\}.$$

∗ Corresponding author

Email addresses: sebastien.gerchinovitz@ens.fr (Sébastien Gerchinovitz), jiayuanyu@ie.ibm.com (Jia Yuan Yu)

1 This research was carried out within the INRIA project CLASSIC hosted by École Normale Supérieure and CNRS.


Given a fixed radius $U > 0$ and a time horizon $T \ge 1$, the goal of the forecaster is to predict almost as well as the best linear forecaster in the reference set $\bigl\{x \in \mathbb{R}^d \mapsto u \cdot x \in \mathbb{R} : u \in B_1(U)\bigr\}$, i.e., to minimize the regret on $B_1(U)$ defined by

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \min_{u \in B_1(U)} \left\{\sum_{t=1}^T (y_t - u \cdot x_t)^2\right\}.$$

We shall present algorithms along with bounds on their regret that hold uniformly over all sequences^2 $(x_t, y_t)_{1 \le t \le T}$ such that $\|x_t\|_\infty \le X$ and $|y_t| \le Y$ for all $t = 1, \dots, T$, where $X, Y > 0$. These regret bounds depend on four important quantities: $U$, $X$, $Y$, and $T$, which may be known or unknown to the forecaster.
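As a concrete illustration, the regret above is a directly computable quantity once the forecaster's predictions and the data are fixed. The sketch below (function name ours, with a one-dimensional grid search standing in for the exact minimization over $B_1(U)$) is only meant to make explicit the quantity being bounded in this paper:

```python
def regret_on_l1_ball(predictions, xs, ys, U, grid=101):
    """Cumulative square loss of the forecaster minus that of the best
    linear predictor u in B_1(U).  For simplicity d = 1 here, so
    B_1(U) = [-U, U] and a grid search approximates the minimum."""
    forecaster_loss = sum((y - p) ** 2 for p, y in zip(predictions, ys))
    candidates = (-U + 2 * U * i / (grid - 1) for i in range(grid))
    best_loss = min(
        sum((y - u * x) ** 2 for x, y in zip(xs, ys)) for u in candidates
    )
    return forecaster_loss - best_loss
```

For instance, with inputs $x_t = 1$, observations $y_t = 0.5$ and the null prediction $\hat{y}_t = 0$, the regret after two rounds is $0.5$, since the grid contains the optimum $u = 0.5$.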

1.2. Contributions and related works

In the next paragraphs we detail the main contributions of this paper in view of related works in online linear regression.

Our first contribution (Section 2) consists of a minimax analysis of online linear regression on $\ell^1$-balls in the arbitrary sequence setting. We first provide a refined regret bound expressed in terms of $Y$, $d$, and a quantity $\kappa = \sqrt{T}UX/(2dY)$. This quantity $\kappa$ is used to distinguish two regimes: we show a distinctive regime transition^3 at $\kappa = 1$, i.e., $d = \sqrt{T}UX/(2Y)$. Namely, for $\kappa < 1$, the regret is of the order of $dY^2\kappa$ (proportional to $\sqrt{T}$), whereas it is of the order of $dY^2 \ln \kappa$ (proportional to $\ln T$) for $\kappa > 1$.

The derivation of this regret bound partially relies on a Maurey-type argument used under various forms with i.i.d. data, e.g., in [1, 2, 3, 4] (see also [5]). We adapt it in a straightforward way to the deterministic setting. Therefore, this is yet another technique that can be applied to both the stochastic and individual sequence settings.

Unsurprisingly, the refined regret bound mentioned above matches the optimal risk bounds for stochastic settings^4 [6, 2] (see also [7]). Hence, linear regression is just as hard in the stochastic setting as in the arbitrary sequence setting. Using the standard online-to-batch conversion, we make the latter statement more precise by establishing a lower bound for all $\kappa$ at least of the order of $\sqrt{\ln d}/d$. This lower bound extends those of [8, 9], which only hold for small $\kappa$ of the order of $1/d$.

The algorithm achieving our minimax regret bound is both computationally inefficient and non-adaptive (i.e., it requires prior knowledge of the quantities U , X, Y , and T that may be unknown in practice).

Those two issues were first overcome by [10] via an automatic tuning termed self-confident (since the forecaster somehow trusts himself in tuning its parameters). They indeed proved that the self-confident $p$-norm algorithm with $p = 2\ln d$ and tuned with $U$ has a cumulative loss $\hat{L}_T = \sum_{t=1}^T (y_t - \hat{y}_t)^2$ bounded by

$$\hat{L}_T \le L_T + 8UX\sqrt{(e \ln d)\, L_T} + (32 e \ln d)\, U^2 X^2 \le 8UXY\sqrt{eT \ln d} + (32 e \ln d)\, U^2 X^2,$$

where $L_T \triangleq \min_{\{u \in \mathbb{R}^d : \|u\|_1 \le U\}} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le TY^2$. This algorithm is efficient, and our lower bound in terms of $\kappa$ shows that it is optimal up to logarithmic factors in the regime $\kappa \le 1$, without prior knowledge of $X$, $Y$, and $T$.

Our second contribution (Section 3) is to show that similar adaptivity and efficiency properties can be obtained via exponential weighting. We consider a variant of the EG± algorithm [9]. The latter has a manageable computational complexity, and our lower bound shows that it is nearly optimal in the regime $\kappa \le 1$. However, the EG± algorithm requires prior knowledge of $U$, $X$, $Y$, and $T$. To overcome this adaptivity issue, we study a modification of the EG± algorithm that relies on the variance-based automatic tuning of [11]. The resulting algorithm, called the adaptive EG± algorithm, can be applied to general convex and differentiable loss functions. When applied to the square loss, it yields an algorithm of the same computational complexity as the EG± algorithm that also achieves a nearly optimal regret, but without needing to know $X$, $Y$, and $T$ beforehand.

2 Actually our results hold whether $(x_t, y_t)_{t \ge 1}$ is generated by an oblivious environment or a non-oblivious opponent, since we consider deterministic forecasters.

3 In high dimensions (i.e., when $d \ge \omega T$, for some absolute constant $\omega > 0$), we do not observe this transition (cf. Figure 1).

4 For example, $(x_t, y_t)_{1 \le t \le T}$ may be i.i.d., or $x_t$ can be deterministic and $y_t = f(x_t) + \varepsilon_t$ for an unknown function $f$ and an i.i.d. sequence $(\varepsilon_t)_{1 \le t \le T}$ of Gaussian noise.

Our third contribution (Section 3.3) is a generic technique called loss Lipschitzification. It transforms the loss functions $u \mapsto (y_t - u \cdot x_t)^2$ (or $u \mapsto |y_t - u \cdot x_t|^\alpha$ if the predictions are scored with the $\alpha$-loss for a real number $\alpha \ge 2$) into Lipschitz continuous functions. We illustrate this technique by applying the generic adaptive EG± algorithm to the modified loss functions. When the predictions are scored with the square loss, this yields an algorithm (the LEG algorithm) whose main regret term slightly improves on that derived for the adaptive EG± algorithm without Lipschitzification. The benefits of this technique are clearer for loss functions with higher curvature: if $\alpha > 2$, then the resulting regret bound roughly grows as $U$ instead of a naive $U^{\alpha/2}$.

Finally, in Section 4, we provide a simple way to achieve minimax regret uniformly over all $\ell^1$-balls $B_1(U)$ for $U > 0$. This method aggregates instances of an algorithm that requires prior knowledge of $U$. For the sake of simplicity, we assume that $X$, $Y$, and $T$ are known, but explain in the discussions how to extend the method to a fully adaptive algorithm that requires the knowledge neither of $U$, $X$, $Y$, nor $T$.

This paper is organized as follows. In Section 2, we establish our refined upper and lower bounds in terms of the intrinsic quantity $\kappa$. In Section 3, we present an efficient and adaptive algorithm (the adaptive EG± algorithm, with or without loss Lipschitzification) that achieves the optimal regret on $B_1(U)$ when $U$ is known. In Section 4, we use an aggregating strategy to achieve an optimal regret uniformly over all $\ell^1$-balls $B_1(U)$, for $U > 0$, when $X$, $Y$, and $T$ are known. Finally, in Section 5, we discuss as an extension a fully automatic algorithm that requires no prior knowledge of $U$, $X$, $Y$, or $T$. Some proofs and additional tools are postponed to the appendix.

2. Optimal rates

In this section, we first present a refined upper bound on the minimax regret on $B_1(U)$ for an arbitrary $U > 0$. In Corollary 1, we express this upper bound in terms of an intrinsic quantity $\kappa \triangleq \sqrt{T}UX/(2dY)$. The optimality of the latter bound is shown in Section 2.2.

We consider the following definition to avoid any ambiguity. We call online forecaster any sequence $F = (f_t)_{t \ge 1}$ of functions such that $f_t : \mathbb{R}^d \times (\mathbb{R}^d \times \mathbb{R})^{t-1} \to \mathbb{R}$ maps at time $t$ the new input $x_t$ and the past data $(x_s, y_s)_{1 \le s \le t-1}$ to a prediction $f_t\bigl(x_t; (x_s, y_s)_{1 \le s \le t-1}\bigr)$. Depending on the context, the latter prediction may be simply denoted by $f_t(x_t)$ or by $\hat{y}_t$.

2.1. Upper bound

Theorem 1 (Upper bound). Let $d, T \in \mathbb{N}$, and $U, X, Y > 0$. The minimax regret on $B_1(U)$ for bounded base predictions and observations satisfies

$$\inf_F \sup_{\|x_t\|_\infty \le X,\ |y_t| \le Y} \left\{\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2\right\} \le \begin{cases} 3UXY\sqrt{2T\ln(2d)} & \text{if } U < \dfrac{Y}{X}\sqrt{\dfrac{\ln(1+2d)}{T\ln 2}}, \\[2mm] 26\, UXY\sqrt{T\ln\Bigl(1 + \dfrac{2dY}{\sqrt{T}UX}\Bigr)} & \text{if } \dfrac{Y}{X}\sqrt{\dfrac{\ln(1+2d)}{T\ln 2}} \le U \le \dfrac{2dY}{\sqrt{T}X}, \\[2mm] 32\, dY^2 \ln\Bigl(1 + \dfrac{\sqrt{T}UX}{dY}\Bigr) + dY^2 & \text{if } U > \dfrac{2dY}{\sqrt{T}X}, \end{cases}$$

where the infimum is taken over all forecasters $F$ and where the supremum extends over all sequences $(x_t, y_t)_{1 \le t \le T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \dots, |y_T| \le Y$ and $\|x_1\|_\infty, \dots, \|x_T\|_\infty \le X$.


Theorem 1 improves the bound of [9, Theorem 5.11] for the EG± algorithm. First, our bound depends logarithmically, as opposed to linearly, on $U$ for $U > 2dY/(\sqrt{T}X)$. Secondly, it is smaller by a factor ranging from $1$ to $\sqrt{\ln d}$ when

$$\frac{Y}{X}\sqrt{\frac{\ln(1+2d)}{T\ln 2}} \le U \le \frac{2dY}{\sqrt{T}X}. \qquad (1)$$

Hence, Theorem 1 provides a partial answer to a question^5 raised in [9] about the gap of $\sqrt{\ln(2d)}$ between the upper and lower bounds.

Before proving the theorem (see below), we state the following immediate corollary. It expresses the upper bound of Theorem 1 in terms of an intrinsic quantity $\kappa \triangleq \sqrt{T}UX/(2dY)$ that relates $\sqrt{T}UX/(2Y)$ to the ambient dimension $d$.

Corollary 1 (Upper bound in terms of an intrinsic quantity). Let $d, T \in \mathbb{N}$, and $U, X, Y > 0$. The upper bound of Theorem 1 expressed in terms of $d$, $Y$, and the intrinsic quantity $\kappa \triangleq \sqrt{T}UX/(2dY)$ reads:

$$\inf_F \sup_{\|x_t\|_\infty \le X,\ |y_t| \le Y} \left\{\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2\right\} \le \begin{cases} 6\, dY^2 \kappa\sqrt{2\ln(2d)} & \text{if } \kappa < \dfrac{\sqrt{\ln(1+2d)}}{2d\sqrt{\ln 2}}, \\[2mm] 52\, dY^2 \kappa\sqrt{\ln(1 + 1/\kappa)} & \text{if } \dfrac{\sqrt{\ln(1+2d)}}{2d\sqrt{\ln 2}} \le \kappa \le 1, \\[2mm] 32\, dY^2 \bigl(\ln(1 + 2\kappa) + 1\bigr) & \text{if } \kappa > 1. \end{cases}$$

The parametrization by $(d, Y, \kappa)$ helps to unify the different upper bounds of Theorem 1: on both regimes $\kappa \le 1$ and $\kappa > 1$, the regret bound scales as $dY^2$; the only difference lies in the dependence on $\kappa$ (linear versus logarithmic).

The upper bound of Corollary 1 is shown in Figure 1. Observe that, in low dimension (Figure 1(b)), a clear transition from a regret of the order of $\sqrt{T}$ to one of $\ln T$ occurs at $\kappa = 1$. This transition is absent in high dimensions: for $d \ge \omega T$, where $\omega \triangleq \bigl(32(\ln 3 + 1)\bigr)^{-1}$, the regret bound $32\, dY^2\bigl(\ln(1 + 2\kappa) + 1\bigr)$ is worse than the trivial bound $TY^2$ when $\kappa \ge 1$.

Figure 1: The regret bound of Corollary 1 over $B_1(U)$ as a function of $\kappa = \sqrt{T}UX/(2dY)$, (a) in high dimension $d \ge \omega T$ and (b) in low dimension $d < \omega T$. The constant $c$ is chosen to ensure continuity at $\kappa = 1$, and $\omega \triangleq \bigl(32(\ln 3 + 1)\bigr)^{-1}$. We define $\kappa_{\min} = \sqrt{\ln(1+2d)}/\bigl(2d\sqrt{\ln 2}\bigr)$ and $\kappa_{\max} = \bigl(e^{(T/d-1)/c} - 1\bigr)/2$.

5 The authors of [9] asked: "For large d there is a significant gap between the upper and lower bounds. We would like to know if it possible to improve the upper bounds by eliminating the ln d factors."


We now prove Theorem 1. The main part of the proof relies on a Maurey-type argument. Although this argument was used in the stochastic setting [1, 2, 3, 4], we adapt it to the deterministic setting. This is yet another technique that can be applied to both the stochastic and individual sequence settings.

Proof (of Theorem 1): First note from Lemma 5 in Appendix B that the minimax regret on $B_1(U)$ is upper bounded^6 by

$$\min\left\{3UXY\sqrt{2T\ln(2d)},\ 32\, dY^2 \ln\Bigl(1 + \frac{\sqrt{T}UX}{dY}\Bigr) + dY^2\right\}. \qquad (2)$$

Therefore, the first case $U < \frac{Y}{X}\sqrt{\frac{\ln(1+2d)}{T\ln 2}}$ and the third case $U > \frac{2dY}{\sqrt{T}X}$ are straightforward. We thus assume in the sequel that $\frac{Y}{X}\sqrt{\frac{\ln(1+2d)}{T\ln 2}} \le U \le \frac{2dY}{\sqrt{T}X}$.

We use a Maurey-type argument to refine the regret bound (2). This technique was used under various forms in the stochastic setting, e.g., in [1, 2, 3, 4]. It consists of discretizing $B_1(U)$ and looking at a random point in this discretization to study its approximation properties. We also use clipping to get a regret bound growing as $U$ instead of a naive $U^2$.

More precisely, we first use the fact that to be competitive against $B_1(U)$, it is sufficient to be competitive against its finite subset

$$\tilde{B}_{U,m} \triangleq \left\{\Bigl(\frac{k_1 U}{m}, \dots, \frac{k_d U}{m}\Bigr) : (k_1, \dots, k_d) \in \mathbb{Z}^d,\ \sum_{j=1}^d |k_j| \le m\right\} \subset B_1(U),$$

where $m \triangleq \lfloor \alpha \rfloor$ with $\alpha \triangleq \dfrac{UX}{Y}\sqrt{T(\ln 2)\Big/\ln\Bigl(1 + \dfrac{2dY}{\sqrt{T}UX}\Bigr)}$.
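For intuition, the finite grid $\tilde{B}_{U,m}$ can be enumerated explicitly in small dimension. The sketch below (illustrative only, since brute-force enumeration is exponential in $d$) builds it and checks its cardinality against the $\bigl(e(2d+m)/m\bigr)^m$ bound used later in the proof via Lemma 8:

```python
import math
from itertools import product

def discretized_ball(U, d, m):
    """Points (k_1 U/m, ..., k_d U/m) with k in Z^d and sum_j |k_j| <= m.
    A brute-force enumeration for small d and m."""
    return [
        tuple(kj * U / m for kj in k)
        for k in product(range(-m, m + 1), repeat=d)
        if sum(abs(kj) for kj in k) <= m
    ]

pts = discretized_ball(1.0, 2, 2)                    # d = 2, m = 2
assert len(pts) == 13                                # 1 + 4 + 8 points with |k|_1 <= 2
assert len(pts) <= (math.e * (2 * 2 + 2) / 2) ** 2   # (e(2d+m)/m)^m
```

Every enumerated point indeed lies in $B_1(U)$, since $\sum_j |k_j| U/m \le U$.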

By Lemma 7 in Appendix C, and since $m > 0$ (see below), we indeed have

$$\inf_{u \in \tilde{B}_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{T U^2 X^2}{m} \le \inf_{u \in B_1(U)} \sum_{t=1}^T (y_t - u \cdot x_t)^2 + \frac{2}{\sqrt{\ln 2}}\, UXY \sqrt{T \ln\Bigl(1 + \frac{2dY}{\sqrt{T}UX}\Bigr)}, \qquad (3)$$

where (3) follows from $m \triangleq \lfloor \alpha \rfloor \ge \alpha/2$, since $\alpha \ge 1$ (in particular, $m > 0$ as stated above).

To see why $\alpha \ge 1$, note that it suffices to show that $x\sqrt{\ln(1+x)} \le 2d\sqrt{\ln 2}$, where we set $x \triangleq 2dY/(\sqrt{T}UX)$. But from the assumption $U \ge (Y/X)\sqrt{\ln(1+2d)/(T\ln 2)}$, we have $x \le 2d\sqrt{\ln(2)/\ln(1+2d)} \triangleq y$, so that, by monotonicity, $x\sqrt{\ln(1+x)} \le y\sqrt{\ln(1+y)} \le y\sqrt{\ln(1+2d)} = 2d\sqrt{\ln 2}$.

Therefore it only remains to exhibit an algorithm which is competitive against $\tilde{B}_{U,m}$ at an aggregation price of the same order as the last term in (3). This is the case for the standard exponentially weighted average forecaster applied to the clipped predictions

$$[u \cdot x_t]_Y \triangleq \min\bigl\{Y,\ \max\{-Y,\ u \cdot x_t\}\bigr\}, \quad u \in \tilde{B}_{U,m},$$

6 As proved in Lemma 5, the regret bound (2) is achieved either by the EG± algorithm, the algorithm SeqSEW$_\tau^{B,\eta}$ of [12] (we could also get a slightly worse bound with the sequential ridge regression forecaster [13, 14]), or the trivial null forecaster.


and tuned with the inverse temperature parameter $\eta = 1/(8Y^2)$. More formally, this algorithm predicts at each time $t = 1, \dots, T$ as

$$\hat{y}_t \triangleq \sum_{u \in \tilde{B}_{U,m}} p_t(u)\, [u \cdot x_t]_Y,$$

where $p_1(u) \triangleq 1/|\tilde{B}_{U,m}|$ (denoting by $|\tilde{B}_{U,m}|$ the cardinality of the set $\tilde{B}_{U,m}$), and where the weights $p_t(u)$ are defined for all $t = 2, \dots, T$ and $u \in \tilde{B}_{U,m}$ by

$$p_t(u) \triangleq \frac{\exp\Bigl(-\eta \sum_{s=1}^{t-1} \bigl(y_s - [u \cdot x_s]_Y\bigr)^2\Bigr)}{\sum_{v \in \tilde{B}_{U,m}} \exp\Bigl(-\eta \sum_{s=1}^{t-1} \bigl(y_s - [v \cdot x_s]_Y\bigr)^2\Bigr)}.$$

By Lemma 6 in Appendix B, the above forecaster tuned with $\eta = 1/(8Y^2)$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{u \in \tilde{B}_{U,m}} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le 8Y^2 \ln\bigl|\tilde{B}_{U,m}\bigr|$$
$$\le 8Y^2 \ln\left(\Bigl(\frac{e(2d+m)}{m}\Bigr)^m\right) \qquad (4)$$
$$= 8Y^2\, m\bigl(1 + \ln(1 + 2d/m)\bigr)$$
$$\le 8Y^2\, \alpha\bigl(1 + \ln(1 + 2d/\alpha)\bigr) \qquad (5)$$
$$= 8Y^2 \alpha + 8Y^2 \alpha \ln\left(1 + \frac{2dY}{\sqrt{T}UX}\sqrt{\frac{\ln\bigl(1 + 2dY/(\sqrt{T}UX)\bigr)}{\ln 2}}\right)$$
$$\le 8Y^2 \alpha + 16Y^2 \alpha \ln\Bigl(1 + \frac{2dY}{\sqrt{T}UX}\Bigr) \qquad (6)$$
$$\le \left(\frac{8}{\sqrt{\ln 2}} + 16\sqrt{\ln 2}\right) UXY \sqrt{T \ln\Bigl(1 + \frac{2dY}{\sqrt{T}UX}\Bigr)}. \qquad (7)$$

To get (4) we used Lemma 8 in Appendix C. Inequality (5) follows from $m \le \alpha$ and the fact that $x \mapsto x\bigl(1 + \ln(1 + A/x)\bigr)$ is nondecreasing on $\mathbb{R}_+$ for all $A > 0$. Inequality (6) follows from the assumption $U \le 2dY/(\sqrt{T}X)$ and the elementary inequality $\ln\bigl(1 + x\sqrt{\ln(1+x)/\ln 2}\bigr) \le 2\ln(1+x)$, which holds for all $x \ge 1$ and was used, e.g., at the end of [3, Theorem 2-a)]. Finally, elementary manipulations combined with the assumption $2dY/(\sqrt{T}UX) \ge 1$ lead to (7).

Putting Eqs. (3) and (7) together, the previous algorithm has a regret on $B_1(U)$ which is bounded from above by

$$\left(\frac{10}{\sqrt{\ln 2}} + 16\sqrt{\ln 2}\right) UXY \sqrt{T \ln\Bigl(1 + \frac{2dY}{\sqrt{T}UX}\Bigr)},$$

which concludes the proof since $10/\sqrt{\ln 2} + 16\sqrt{\ln 2} \le 26$.
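The forecaster used in this proof is easy to simulate on a finite expert set. The sketch below (helper names ours; a generic finite set of linear experts stands in for $\tilde{B}_{U,m}$) implements the exponentially weighted average forecaster with clipped predictions and $\eta = 1/(8Y^2)$:

```python
import math

def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

def clip(z, Y):
    # [z]_Y = min(Y, max(-Y, z))
    return min(Y, max(-Y, z))

def ewa_clipped(experts, xs, ys, Y):
    """Exponentially weighted average forecaster over a finite set of
    linear experts, with predictions clipped to [-Y, Y] and inverse
    temperature eta = 1/(8 Y^2), as in the proof of Theorem 1."""
    eta = 1.0 / (8.0 * Y * Y)
    cum = [0.0] * len(experts)   # cumulative clipped square losses
    preds = []
    for x, y in zip(xs, ys):
        w = [math.exp(-eta * c) for c in cum]
        total = sum(w)
        preds.append(
            sum(wi / total * clip(dot(u, x), Y) for wi, u in zip(w, experts))
        )
        for i, u in enumerate(experts):
            cum[i] += (y - clip(dot(u, x), Y)) ** 2
    return preds
```

On a toy run with experts $u \in \{-1, 0, 1\}$ in dimension one, inputs $x_t = 1$ and observations $y_t = 0.5$, the first prediction is the uniform mixture $0$, and the predictions then drift toward $0.5$ as the weight of the expert $u = -1$ decays.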

2.2. Lower bound

Corollary 1 gives an upper bound on the regret in terms of the quantities $d$, $Y$, and $\kappa \triangleq \sqrt{T}UX/(2dY)$. We now show that for all $d \in \mathbb{N}$, $Y > 0$, and $\kappa \ge \sqrt{\ln(1+2d)}/\bigl(2d\sqrt{\ln 2}\bigr)$, the upper bound cannot be improved^7 up to logarithmic factors.

7 For $T$ sufficiently large, we may overlook the case $\kappa < \sqrt{\ln(1+2d)}/\bigl(2d\sqrt{\ln 2}\bigr)$, i.e., $\sqrt{T} < \bigl(Y/(UX)\bigr)\sqrt{\ln(1+2d)/\ln 2}$. Observe that in this case, the minimax regret is already of the order of $Y^2\ln(1+d)$ (cf. Figure 1).


Theorem 2 (Lower bound). For all $d \in \mathbb{N}$, $Y > 0$, and $\kappa \ge \sqrt{\ln(1+2d)}/\bigl(2d\sqrt{\ln 2}\bigr)$, there exist $T \ge 1$, $U > 0$, and $X > 0$ such that $\sqrt{T}UX/(2dY) = \kappa$ and

$$\inf_F \sup_{\|x_t\|_\infty \le X,\ |y_t| \le Y} \left\{\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2\right\} \ge \begin{cases} \dfrac{c_1}{\ln(2 + 16d^2)}\, dY^2 \kappa \sqrt{\ln(1 + 1/\kappa)} & \text{if } \dfrac{\sqrt{\ln(1+2d)}}{2d\sqrt{\ln 2}} \le \kappa \le 1, \\[2mm] \dfrac{c_2}{\ln(2 + 16d^2)}\, dY^2 & \text{if } \kappa > 1, \end{cases}$$

where $c_1, c_2 > 0$ are absolute constants. The infimum is taken over all forecasters $F$ and the supremum is taken over all sequences $(x_t, y_t)_{1 \le t \le T} \in (\mathbb{R}^d \times \mathbb{R})^T$ such that $|y_1|, \dots, |y_T| \le Y$ and $\|x_1\|_\infty, \dots, \|x_T\|_\infty \le X$.

The above lower bound extends those of [8, 9], which hold for small $\kappa$ of the order of $1/d$. The proof is postponed to Appendix A.1. We perform a reduction to the stochastic batch setting, via the standard online-to-batch conversion, and employ a version of a lower bound of [2].

Note that in the proof of Theorem 2, we are free to choose the values of two parameters among $T$, $U$, and $X$, provided that $\sqrt{T}UX/(2dY) = \kappa$. This liberty is possible since the problem is now parametrized by $d$, $Y$, and $\kappa$ only (as shown in Corollary 1, these three parameters are sufficient to express the regret bound of Theorem 1, and they actually help to unify the upper bounds of the two regimes). A more ambitious lower bound would consist in proving that the upper bound of Theorem 1 cannot be substantially improved for any fixed value of $(d, Y, T, U, X)$. This question is left for future work.

3. Adaptation to unknown X, Y and T via exponential weights

Although the proof of Theorem 1 already gives an algorithm that achieves the minimax regret, the latter takes as inputs $U$, $X$, $Y$, and $T$, and it is inefficient in high dimensions. In this section, we present a new method that achieves the minimax regret both efficiently and without prior knowledge of $X$, $Y$, and $T$, provided that $U$ is known. Adaptation to an unknown $U$ is considered in Section 4. Our method consists of modifying an underlying efficient linear regression algorithm such as the EG± algorithm [9] or the sequential ridge regression forecaster [14, 13]. Next, we show that automatically tuned variants of the EG± algorithm nearly achieve the minimax regret in the regime $d \ge \sqrt{T}UX/(2Y)$. A similar modification could be applied to the ridge regression forecaster, with a total computational complexity of the same order as that of the standard ridge algorithm, to achieve a nearly optimal regret bound of order $dY^2 \ln\bigl(1 + \sqrt{T}UX/(dY)\bigr)^2$ in the regime $d < \sqrt{T}UX/(2Y)$. The latter analysis is more technical and hence is omitted.

3.1. An adaptive EG± algorithm for general convex and differentiable loss functions

The second algorithm of the proof of Theorem 1 is computationally inefficient because it aggregates approximately $d^{\sqrt{T}}$ experts. In contrast, the EG± algorithm has a manageable computational complexity that is linear in $d$ at each time $t$. Next we introduce a version of the EG± algorithm, called the adaptive EG± algorithm, that does not require prior knowledge of $X$, $Y$ and $T$ (as opposed to the original EG± algorithm of [9]). This version relies on the automatic tuning of [11]. We first present a generic version suited for general convex and differentiable loss functions. The application to the square loss and to other $\alpha$-losses will be dealt with in Sections 3.2 and 3.3.

The generic setting with arbitrary convex and differentiable loss functions corresponds to the online convex optimization setting [15, 16] and unfolds as follows: at each time $t \ge 1$, the forecaster chooses a linear combination $\hat{u}_t \in \mathbb{R}^d$, then the environment chooses and reveals a convex and differentiable loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$, and the forecaster incurs the loss $\ell_t(\hat{u}_t)$. In online linear regression under the square loss, the loss functions are given by $\ell_t(u) = (y_t - u \cdot x_t)^2$.


Parameter: radius $U > 0$.

Initialization: $p_1 = \bigl(p_{1,1}^+, p_{1,1}^-, \dots, p_{d,1}^+, p_{d,1}^-\bigr) \triangleq \bigl(1/(2d), \dots, 1/(2d)\bigr) \in \mathbb{R}^{2d}$.

At each time round $t \ge 1$,

1. Output the linear combination $\hat{u}_t \triangleq U \sum_{j=1}^d \bigl(p_{j,t}^+ - p_{j,t}^-\bigr) e_j \in B_1(U)$;

2. Receive the loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$ and update the parameter $\eta_{t+1}$ according to (8);

3. Update the weight vector $p_{t+1} = \bigl(p_{1,t+1}^+, p_{1,t+1}^-, \dots, p_{d,t+1}^+, p_{d,t+1}^-\bigr) \in \mathcal{X}_{2d}$, defined for all $j = 1, \dots, d$ and $\gamma \in \{+,-\}$ by^a

$$p_{j,t+1}^\gamma \triangleq \frac{\exp\Bigl(-\eta_{t+1} \sum_{s=1}^t \gamma U \nabla_j \ell_s(\hat{u}_s)\Bigr)}{\sum_{1 \le k \le d}\ \sum_{\mu \in \{+,-\}} \exp\Bigl(-\eta_{t+1} \sum_{s=1}^t \mu U \nabla_k \ell_s(\hat{u}_s)\Bigr)}.$$

a For all $\gamma \in \{+,-\}$, by a slight abuse of notation, $\gamma U$ denotes $U$ or $-U$ if $\gamma = +$ or $\gamma = -$ respectively.

Figure 2: The adaptive EG± algorithm for general convex and differentiable loss functions (see Proposition 1).

The adaptive EG± algorithm for general convex and differentiable loss functions is defined in Figure 2. We denote by $(e_j)_{1 \le j \le d}$ the canonical basis of $\mathbb{R}^d$, by $\nabla \ell_t(u)$ the gradient of $\ell_t$ at $u \in \mathbb{R}^d$, and by $\nabla_j \ell_t(u)$ the $j$-th component of this gradient. The adaptive EG± algorithm uses as a blackbox the exponentially weighted majority forecaster of [11] on $2d$ experts, namely the vertices $\pm U e_j$ of $B_1(U)$, as in [9]. It adapts to the unknown gradient amplitudes $\|\nabla \ell_t\|_\infty$ by the particular choice of $\eta_t$ due to [11] and defined for all $t \ge 2$ by

$$\eta_t = \min\left\{\frac{1}{\hat{E}_{t-1}},\ C\sqrt{\frac{\ln(2d)}{V_{t-1}}}\right\}, \qquad (8)$$

where $C \triangleq \sqrt{2\bigl(\sqrt{2}-1\bigr)/(e-2)}$ and where we set, for all $t = 1, \dots, T$,

$$z_{j,s}^+ \triangleq U \nabla_j \ell_s(\hat{u}_s) \quad \text{and} \quad z_{j,s}^- \triangleq -U \nabla_j \ell_s(\hat{u}_s), \qquad j = 1, \dots, d, \quad s = 1, \dots, t,$$

$$\hat{E}_t \triangleq \inf_{k \in \mathbb{Z}} \Bigl\{2^k : 2^k \ge \max_{1 \le s \le t}\ \max_{\substack{1 \le j,k \le d \\ \gamma,\mu \in \{+,-\}}} \bigl(z_{j,s}^\gamma - z_{k,s}^\mu\bigr)\Bigr\},$$

$$V_t \triangleq \sum_{s=1}^t\ \sum_{\substack{1 \le j \le d \\ \gamma \in \{+,-\}}} p_{j,s}^\gamma \Bigl(z_{j,s}^\gamma - \sum_{\substack{1 \le k \le d \\ \mu \in \{+,-\}}} p_{k,s}^\mu\, z_{k,s}^\mu\Bigr)^2.$$

Note that $\hat{E}_{t-1}$ approximates the range of the $z_{j,s}^\gamma$ up to time $t-1$, while $V_{t-1}$ is the corresponding cumulative variance of the forecaster.
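A direct transcription of Figure 2 together with the tuning rule (8) can be sketched as follows. This is a sketch under assumptions: the handling of the degenerate first rounds (where $\hat{E}_t = 0$ or $V_t = 0$ and (8) is not yet defined) is an arbitrary choice of ours, and all names are hypothetical:

```python
import math

def adaptive_eg(U, d, gradients):
    """Adaptive EG+- (Figure 2): maintains 2d weights (p_j^+, p_j^-),
    plays u_t = U * sum_j (p_j^+ - p_j^-) e_j, and re-tunes eta via (8).
    `gradients` is a list of functions u -> grad ell_t(u)."""
    C = math.sqrt(2.0 * (math.sqrt(2.0) - 1.0) / (math.e - 2.0))
    p = [1.0 / (2 * d)] * (2 * d)   # index 2j -> (j, +), 2j+1 -> (j, -)
    Z = [0.0] * (2 * d)             # cumulative sums of gamma*U*grad_j
    E_hat, V = 0.0, 0.0
    iterates = []
    for grad in gradients:
        u = [U * (p[2 * j] - p[2 * j + 1]) for j in range(d)]
        iterates.append(u)
        g = grad(u)
        z = [s * U * g[j] for j in range(d) for s in (1.0, -1.0)]
        mean = sum(pi * zi for pi, zi in zip(p, z))
        V += sum(pi * (zi - mean) ** 2 for pi, zi in zip(p, z))
        rng = max(z) - min(z)       # range of this round's loss vector
        if rng > 0.0:               # smallest power of 2 above the range
            E_hat = max(E_hat, 2.0 ** math.ceil(math.log2(rng)))
        for i in range(2 * d):
            Z[i] += z[i]
        candidates = []             # rule (8): min(1/E_hat, C*sqrt(ln(2d)/V))
        if E_hat > 0.0:
            candidates.append(1.0 / E_hat)
        if V > 0.0:
            candidates.append(C * math.sqrt(math.log(2 * d) / V))
        eta = min(candidates) if candidates else 1.0  # arbitrary default
        z_min = min(Z)              # shift for numerical stability
        w = [math.exp(-eta * (Zi - z_min)) for Zi in Z]
        total = sum(w)
        p = [wi / total for wi in w]
    return iterates
```

On the square losses $\ell_t(u) = (0.5 - u)^2$ in dimension one ($x_t = 1$, $y_t = 0.5$, $U = 1$), the iterates start at $0$ and approach the minimizer $0.5$.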


Proposition 1 (The adaptive EG± algorithm for general convex and differentiable loss functions). Let $U > 0$. Then, the adaptive EG± algorithm on $B_1(U)$ defined in Figure 2 satisfies, for all $T \ge 1$ and all sequences of convex and differentiable^8 loss functions $\ell_1, \dots, \ell_T : \mathbb{R}^d \to \mathbb{R}$,

$$\sum_{t=1}^T \ell_t(\hat{u}_t) - \min_{\|u\|_1 \le U} \sum_{t=1}^T \ell_t(u) \le 4U \sqrt{\left(\sum_{t=1}^T \|\nabla \ell_t(\hat{u}_t)\|_\infty^2\right) \ln(2d)} + U\bigl(8\ln(2d) + 12\bigr) \max_{1 \le t \le T} \|\nabla \ell_t(\hat{u}_t)\|_\infty.$$

In particular, the regret is bounded by $4U \max_{1 \le t \le T} \|\nabla \ell_t(\hat{u}_t)\|_\infty \bigl(\sqrt{T \ln(2d)} + 2\ln(2d) + 3\bigr)$.

Proof: The proof follows straightforwardly from a linearization argument and from a regret bound of [11] applied to appropriately chosen loss vectors. Indeed, first note that by convexity and differentiability of $\ell_t : \mathbb{R}^d \to \mathbb{R}$ for all $t = 1, \dots, T$, we get that

$$\sum_{t=1}^T \ell_t(\hat{u}_t) - \min_{\|u\|_1 \le U} \sum_{t=1}^T \ell_t(u) = \max_{\|u\|_1 \le U} \sum_{t=1}^T \bigl(\ell_t(\hat{u}_t) - \ell_t(u)\bigr) \le \max_{\|u\|_1 \le U} \sum_{t=1}^T \nabla \ell_t(\hat{u}_t) \cdot (\hat{u}_t - u)$$
$$= \max_{\substack{1 \le j \le d \\ \gamma \in \{+,-\}}} \sum_{t=1}^T \nabla \ell_t(\hat{u}_t) \cdot (\hat{u}_t - \gamma U e_j) \qquad (9)$$
$$= \sum_{t=1}^T \sum_{\substack{1 \le j \le d \\ \gamma \in \{+,-\}}} p_{j,t}^\gamma\, \gamma U \nabla_j \ell_t(\hat{u}_t) - \min_{\substack{1 \le j \le d \\ \gamma \in \{+,-\}}} \sum_{t=1}^T \gamma U \nabla_j \ell_t(\hat{u}_t), \qquad (10)$$

where (9) follows by linearity of $u \mapsto \sum_{t=1}^T \nabla \ell_t(\hat{u}_t) \cdot (\hat{u}_t - u)$ on the polytope $B_1(U)$, and where (10) follows from the particular choice of $\hat{u}_t$ in Figure 2.

To conclude the proof, note that our choices of the weight vectors $p_t \in \mathcal{X}_{2d}$ in Figure 2 and of the time-varying parameter $\eta_t$ in (8) correspond to the exponentially weighted average forecaster of [11, Section 4.2] when it is applied to the loss vectors $\bigl(U \nabla_j \ell_t(\hat{u}_t),\ -U \nabla_j \ell_t(\hat{u}_t)\bigr)_{1 \le j \le d} \in \mathbb{R}^{2d}$, $t = 1, \dots, T$. Since at time $t$ the coordinates of the last loss vector lie in an interval of length $E_t \le 2U \|\nabla \ell_t(\hat{u}_t)\|_\infty$, we get from [11, Corollary 1] that

$$\sum_{t=1}^T \sum_{\substack{1 \le j \le d \\ \gamma \in \{+,-\}}} p_{j,t}^\gamma\, \gamma U \nabla_j \ell_t(\hat{u}_t) - \min_{\substack{1 \le j \le d \\ \gamma \in \{+,-\}}} \sum_{t=1}^T \gamma U \nabla_j \ell_t(\hat{u}_t) \le 4U \sqrt{\left(\sum_{t=1}^T \|\nabla \ell_t(\hat{u}_t)\|_\infty^2\right) \ln(2d)} + U\bigl(8\ln(2d) + 12\bigr) \max_{1 \le t \le T} \|\nabla \ell_t(\hat{u}_t)\|_\infty.$$

Substituting the last upper bound in (10) concludes the proof.

3.2. Application to the square loss

In the particular case of the square loss $\ell_t(u) = (y_t - u \cdot x_t)^2$, the gradients are given by $\nabla \ell_t(u) = -2(y_t - u \cdot x_t)\, x_t$ for all $u \in \mathbb{R}^d$. Applying Proposition 1, we get the following regret bound for the adaptive EG± algorithm.

8 Gradients can be replaced with subgradients if the loss functions $\ell_t : \mathbb{R}^d \to \mathbb{R}$ are convex but not differentiable.


Corollary 2 (The adaptive EG± algorithm under the square loss). Let $U > 0$. Consider the online linear regression setting defined in the introduction. Then, the adaptive EG± algorithm (see Figure 2) tuned with $U$ and applied to the loss functions $\ell_t : u \mapsto (y_t - u \cdot x_t)^2$ satisfies, for all individual sequences $(x_1, y_1), \dots, (x_T, y_T) \in \mathbb{R}^d \times \mathbb{R}$,

$$\sum_{t=1}^T (y_t - \hat{u}_t \cdot x_t)^2 - \min_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le 8UX \sqrt{\left(\min_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2\right) \ln(2d)} + \bigl(137\ln(2d) + 24\bigr)\bigl(UXY + U^2 X^2\bigr) \le 8UXY\sqrt{T\ln(2d)} + \bigl(137\ln(2d) + 24\bigr)\bigl(UXY + U^2 X^2\bigr),$$

where the quantities $X \triangleq \max_{1 \le t \le T} \|x_t\|_\infty$ and $Y \triangleq \max_{1 \le t \le T} |y_t|$ are unknown to the forecaster.

Using the terminology of [17, 11], the first bound of Corollary 2 is an improvement for small losses: it yields a small regret when the optimal cumulative loss $\min_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2$ is small. As for the second regret bound, it indicates that the adaptive EG± algorithm achieves approximately the regret bound of Theorem 1 in the regime $\kappa \le 1$, i.e., $d \ge \sqrt{T}UX/(2Y)$. In this regime, our algorithm thus has a manageable computational complexity (linear in $d$ at each time $t$) and it is adaptive in $X$, $Y$, and $T$.
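In code, the square-loss gradient that feeds the algorithm is one line. The snippet below (names ours) also checks numerically the inequality $\|\nabla \ell_t(\hat{u}_t)\|_\infty^2 \le 4X^2 \ell_t(\hat{u}_t)$ used in the proof that follows:

```python
def sq_loss_grad(u, x, y):
    """Gradient of u -> (y - u.x)^2, namely -2 (y - u.x) x."""
    residual = y - sum(ui * xi for ui, xi in zip(u, x))
    return [-2.0 * residual * xi for xi in x]

# numerical check of ||grad||_inf^2 <= 4 X^2 * loss on one example
u, x, y = [0.3, -0.2], [1.0, -0.5], 0.7
X = max(abs(xi) for xi in x)
loss = (y - sum(ui * xi for ui, xi in zip(u, x))) ** 2
g = sq_loss_grad(u, x, y)
assert max(abs(gi) for gi in g) ** 2 <= 4.0 * X ** 2 * loss + 1e-12
```

Here $u \cdot x = 0.4$, so the residual is $0.3$ and the gradient is $(-0.6, 0.3)$; the inequality holds with equality since the largest input coordinate has $|x_j| = X$.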

In particular, the above regret bound is similar^9 to that of the original EG± algorithm [9, Theorem 5.11], but it is obtained without prior knowledge of $X$, $Y$, and $T$. Note also that this bound is similar to that of the self-confident $p$-norm algorithm of [10] with $p = 2\ln d$ (see Section 1.2). The fact that we were able to get similar adaptivity and efficiency properties via exponential weighting corroborates the similarity that was already observed in a non-adaptive context between the original EG± algorithm and the $p$-norm algorithm (in the limit $p \to +\infty$ with an appropriate initial weight vector, or for $p$ of the order of $\ln d$ with a zero initial weight vector, cf. [18]).

Proof (of Corollary 2): We apply Proposition 1 with the square loss $\ell_t(u) = (y_t - u \cdot x_t)^2$. It yields

$$\sum_{t=1}^T \ell_t(\hat{u}_t) - \min_{\|u\|_1 \le U} \sum_{t=1}^T \ell_t(u) \le 4U \sqrt{\left(\sum_{t=1}^T \|\nabla \ell_t(\hat{u}_t)\|_\infty^2\right) \ln(2d)} + U\bigl(8\ln(2d) + 12\bigr) \max_{1 \le t \le T} \|\nabla \ell_t(\hat{u}_t)\|_\infty. \qquad (11)$$

Using the equality $\nabla \ell_t(u) = -2(y_t - u \cdot x_t)\, x_t$ for all $u \in \mathbb{R}^d$, we get that, on the one hand, by the upper bound $\|x_t\|_\infty \le X$,

$$\|\nabla \ell_t(\hat{u}_t)\|_\infty^2 \le 4X^2 \ell_t(\hat{u}_t), \qquad (12)$$

and, on the other hand, $\max_{1 \le t \le T} \|\nabla \ell_t(\hat{u}_t)\|_\infty \le 2(Y + UX)X$ (indeed, by Hölder's inequality, $|\hat{u}_t \cdot x_t| \le \|\hat{u}_t\|_1 \|x_t\|_\infty \le UX$). Substituting the last two inequalities in (11), and setting $\hat{L}_T \triangleq \sum_{t=1}^T \ell_t(\hat{u}_t)$ as well as $L_T \triangleq \min_{\|u\|_1 \le U} \sum_{t=1}^T \ell_t(u)$, we get that

$$\hat{L}_T \le L_T + 8UX\sqrt{\hat{L}_T \ln(2d)} + \underbrace{\bigl(16\ln(2d) + 24\bigr)\bigl(UXY + U^2 X^2\bigr)}_{\triangleq\, C}.$$

9 By Theorem 5.11 of [9], the original EG± algorithm satisfies the regret bound $2UX\sqrt{2B\ln(2d)} + 2U^2X^2\ln(2d)$, where $B$ is an upper bound on $\min_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2$ (in particular, $B \le TY^2$). Note that our main regret term is larger by a multiplicative factor of $2\sqrt{2}$. However, contrary to [9], our algorithm does not require the prior knowledge of $X$ and $B$, or, alternatively, of $X$, $Y$, and $T$.


Solving for $\hat{L}_T$ via Lemma 4 in Appendix B, we get that

$$\hat{L}_T \le L_T + C + 8UX\sqrt{\ln(2d)}\,\sqrt{L_T + C} + \Bigl(8UX\sqrt{\ln(2d)}\Bigr)^2 \le L_T + 8UX\sqrt{L_T \ln(2d)} + 8UX\sqrt{C \ln(2d)} + 64U^2X^2\ln(2d) + C.$$

Using that

$$UX\sqrt{C\ln(2d)} = UX\ln(2d)\sqrt{\bigl(16 + 24/\ln(2d)\bigr)\bigl(UXY + U^2X^2\bigr)} \le \sqrt{U^2X^2 + UXY}\,\ln(2d)\sqrt{\bigl(16 + 24/\ln(2)\bigr)\bigl(UXY + U^2X^2\bigr)} = \sqrt{16 + 24/\ln(2)}\,\bigl(UXY + U^2X^2\bigr)\ln(2d)$$

and performing some simple upper bounds concludes the proof of the first regret bound. The second one follows immediately by noting that $\min_{\|u\|_1 \le U} \sum_{t=1}^T (y_t - u \cdot x_t)^2 \le \sum_{t=1}^T y_t^2 \le TY^2$ (since $0 \in B_1(U)$).

3.3. A refinement via Lipschitzification of the loss function

In Corollary 2 we used the adaptive EG± algorithm in conjunction with the square loss functions $\ell_t : u \mapsto (y_t - u \cdot x_t)^2$. In this section we use yet another instance of the adaptive EG± algorithm, applied to a modification $\tilde{\ell}_t : \mathbb{R}^d \to \mathbb{R}$ of the square loss (or the $\alpha$-loss, see below) which is Lipschitz continuous with respect to $\|\cdot\|_1$. This leads to slightly refined regret bounds; see Theorem 3 below and Corollaries 3 and 4 thereafter.

We first present the Lipschitzification technique; its use with the adaptive EG± algorithm is addressed a few paragraphs below. Since our analysis is generic enough to handle both the square loss and other loss functions with higher curvature, we consider below a slightly more general setting than online linear regression stricto sensu. Namely, we fix a real number $\alpha \ge 2$ and assume that the predictions $\hat{y}_t$ of the forecaster and the base linear predictions $u \cdot x_t$ are scored with the $\alpha$-loss, i.e., with the loss functions $x \mapsto |y_t - x|^\alpha$ for all $t \ge 1$. The particular case of the square loss ($\alpha = 2$) is considered in Corollary 3 below, while loss functions with higher curvature ($\alpha > 2$) are addressed in Corollary 4.

The Lipschitzification proceeds as follows. At each time $t \ge 1$, we set

$$B_t \triangleq \Big(2^{\big\lceil \log_2\big(\max_{1\le s\le t-1}|y_s|^\alpha\big)\big\rceil}\Big)^{1/\alpha} \,,$$

where $\lceil x\rceil \triangleq \min\{k \in \mathbb{Z} : k \ge x\}$ for all $x \in \mathbb{R}$. Note that $\max_{1\le s\le t-1}|y_s| \le B_t \le 2^{1/\alpha}\max_{1\le s\le t-1}|y_s|$.
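As a concrete illustration, the threshold update above can be sketched in a few lines of Python. This is a hypothetical helper (the name `threshold` is ours, not the paper's); it returns $0$ when no observation has been seen yet, matching the initialization $B_1 = 0$ of the algorithm below.

```python
import math

def threshold(past_y, alpha):
    """Doubling-trick threshold from past observations y_1, ..., y_{t-1}.

    Implements B_t = (2^ceil(log2(max_s |y_s|^alpha)))^(1/alpha),
    with B_t = 0 when no observation has been seen yet.
    """
    m = max((abs(y) for y in past_y), default=0.0)
    if m == 0.0:
        return 0.0
    return (2.0 ** math.ceil(math.log2(m ** alpha))) ** (1.0 / alpha)

# The stated sandwich max_s |y_s| <= B_t <= 2^(1/alpha) max_s |y_s| holds:
B = threshold([0.3, -1.7, 0.9], alpha=2)  # -> 2.0, and 1.7 <= 2.0 <= 1.7 * sqrt(2)
```

Since the threshold only grows by doubling (in the $\alpha$-th power), it changes at most logarithmically many times over the horizon.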

The modified (or Lipschitzified) loss function $\tilde\ell_t \colon \mathbb{R}^d \to \mathbb{R}$ is constructed as follows:

• if $|y_t| > B_t$, then $\tilde\ell_t(u) \triangleq 0$ for all $u \in \mathbb{R}^d$;

• if $|y_t| \le B_t$, then $\tilde\ell_t$ is the convex function that coincides with the loss function $u \mapsto |y_t - u\cdot x_t|^\alpha$ when $|u\cdot x_t| \le B_t$ and is linear elsewhere. An example of such a function is shown in Figure 3 in the case where $\alpha = 2$. It can be formally defined as

$$\tilde\ell_t(u) \triangleq \begin{cases} \big|y_t - u\cdot x_t\big|^\alpha & \text{if } |u\cdot x_t| \le B_t \,, \\ \big|y_t - B_t\big|^\alpha + \alpha\big|y_t - B_t\big|^{\alpha-1}\big(u\cdot x_t - B_t\big) & \text{if } u\cdot x_t > B_t \,, \\ \big|y_t + B_t\big|^\alpha - \alpha\big|y_t + B_t\big|^{\alpha-1}\big(u\cdot x_t + B_t\big) & \text{if } u\cdot x_t < -B_t \,. \end{cases}$$

Observe that in both cases $|y_t| > B_t$ and $|y_t| \le B_t$, the function $\tilde\ell_t$ is continuously differentiable. By construction it is also Lipschitz continuous with respect to $\|\cdot\|_1$ with an easy-to-control Lipschitz constant (see Appendix A.2). Another key property that we can glean from Figure 3 is that, when $|y_t| \le B_t$, the modified loss function $\tilde\ell_t \colon \mathbb{R}^d \to \mathbb{R}$ lies in between the $\alpha$-loss function $u \mapsto |y_t - u\cdot x_t|^\alpha$ and its clipped version:

$$\forall u \in \mathbb{R}^d \,, \qquad \big|y_t - [u\cdot x_t]_{B_t}\big|^\alpha \le \tilde\ell_t(u) \le \big|y_t - u\cdot x_t\big|^\alpha \,, \tag{13}$$

where the clipping operator $[\,\cdot\,]_B$ is defined by $[x]_B \triangleq \min\big\{B, \max\{-B, x\}\big\}$ for all $x \in \mathbb{R}$ and all $B \ge 0$.

[Figure 3 here: the three losses plotted as functions of $u \cdot x_t$, with kinks at $\pm B_t$.]

Figure 3: Example with the square loss ($\alpha = 2$) when $|y_t| \le B_t$. The square loss $(y_t - u\cdot x_t)^2$, its clipped version $\big(y_t - [u\cdot x_t]_{B_t}\big)^2$, and its Lipschitzified version $\tilde\ell_t(u)$ are plotted as functions of $u \cdot x_t$.
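For the square loss, the construction above can be written out directly as a function of $z = u \cdot x_t$. The sketch below is illustrative (the function names are ours); it also checks the sandwich property (13) numerically.

```python
def lipschitzified_loss(z, y, B, alpha=2.0):
    """Lipschitzified alpha-loss as a function of z = u . x_t.

    When |y| <= B it equals |y - z|^alpha for |z| <= B and is extended
    linearly (with matching slope) outside [-B, B]; when |y| > B it is 0.
    """
    if abs(y) > B:
        return 0.0
    if abs(z) <= B:
        return abs(y - z) ** alpha
    if z > B:
        return abs(y - B) ** alpha + alpha * abs(y - B) ** (alpha - 1) * (z - B)
    return abs(y + B) ** alpha - alpha * abs(y + B) ** (alpha - 1) * (z + B)

def clip(z, B):
    """The clipping operator [z]_B."""
    return min(B, max(-B, z))

# Sandwich property (13): clipped loss <= Lipschitzified loss <= alpha-loss.
y, B = 1.0, 2.0
for z in [-5.0, -2.0, 0.0, 1.5, 2.0, 4.0]:
    lo = abs(y - clip(z, B)) ** 2
    mid = lipschitzified_loss(z, y, B)
    hi = abs(y - z) ** 2
    assert lo <= mid + 1e-12 and mid <= hi + 1e-12
```

The linear extension keeps the function convex and continuously differentiable while capping the slope, which is what bounds the gradients independently of $u$.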

Next we illustrate the Lipschitzification technique introduced above: we apply the adaptive EG± algorithm to the Lipschitzified loss functions $\tilde\ell_t$. The resulting algorithm is called the Lipschitzifying Exponentiated Gradient (LEG) algorithm and is formally defined in Figure 4. Recall that $(e_j)_{1\le j\le d}$ denotes the canonical basis of $\mathbb{R}^d$ and that $\nabla_j$ denotes the $j$-th component of the gradient.

We point out that this technique is not specific to the pair of dual norms $(\|\cdot\|_1, \|\cdot\|_\infty)$ and to the EG± algorithm; it could be used with other pairs $(\|\cdot\|_q, \|\cdot\|_p)$ (with $1/p + 1/q = 1$) and other gradient-based algorithms, such as the $p$-norm algorithm [18, 10] and its regularized variants (SMIDAS and COMID) [19, 20].

The next theorem bounds the cumulative $\alpha$-loss of the LEG algorithm. The proof is postponed to Appendix A.2; it follows from the bound on the adaptive EG± algorithm for general convex and differentiable loss functions that we derived in Proposition 1 (Section 3.1). See Corollaries 3 and 4 below for regret bounds in the particular cases of the square loss ($\alpha = 2$) or of losses with higher curvature ($\alpha > 2$).

Theorem 3. Assume that the predictions are scored with the $\alpha$-loss $x \mapsto |y_t - x|^\alpha$, where $\alpha \ge 2$ is a real number. Let $U > 0$. Then, the LEG algorithm defined in Figure 4 and tuned with $U$ satisfies, for all $T \ge 1$ and all individual sequences $(x_1, y_1), \dots, (x_T, y_T) \in \mathbb{R}^d \times \mathbb{R}$,

$$\sum_{t=1}^T |y_t - \widehat y_t|^\alpha \le \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde\ell_t(u) + a_\alpha\, U X Y^{\alpha/2-1} \sqrt{\bigg(\inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde\ell_t(u)\bigg) \ln(2d)} + \big(a'_\alpha \ln(2d) + 12\, b_\alpha\big)\, U X Y^{\alpha-1} + a''_\alpha \ln(2d)\, U^2 X^2 Y^{\alpha-2} + a'''_\alpha\, Y^\alpha \,,$$

where the Lipschitzified loss functions $\tilde\ell_t$ are defined above, where the quantities $X \triangleq \max_{1\le t\le T}\|x_t\|_\infty$ and $Y \triangleq \max_{1\le t\le T}|y_t|$ are unknown to the forecaster, and where, setting $a_\alpha \triangleq 4\alpha\big(1 + 2^{1/\alpha}\big)^{\alpha/2-1}$ and


Parameter: radius $U > 0$.

Initialization: $B_1 \triangleq 0$, $p_1 = \big(p^+_{1,1}, p^-_{1,1}, \dots, p^+_{d,1}, p^-_{d,1}\big) \triangleq \big(1/(2d), \dots, 1/(2d)\big) \in \mathbb{R}^{2d}$.

At each time round $t \ge 1$,

1. Compute the linear combination $\widehat u_t \triangleq U \sum_{j=1}^d \big(p^+_{j,t} - p^-_{j,t}\big)\, e_j \in B_1(U)$;

2. Get $x_t \in \mathbb{R}^d$ and output the clipped prediction $\widehat y_t \triangleq \big[\widehat u_t \cdot x_t\big]_{B_t}$;

3. Get $y_t \in \mathbb{R}$ and define the modified loss function $\tilde\ell_t \colon \mathbb{R}^d \to \mathbb{R}$ as above;

4. Update the parameter $\eta_{t+1}$ according to (8);

5. Update the weight vector $p_{t+1} = \big(p^+_{1,t+1}, p^-_{1,t+1}, \dots, p^+_{d,t+1}, p^-_{d,t+1}\big) \in \mathcal{X}_{2d}$, defined for all $j = 1, \dots, d$ and $\gamma \in \{+,-\}$ by^a

$$p^\gamma_{j,t+1} \triangleq \frac{\exp\Big(-\eta_{t+1}\sum_{s=1}^t \gamma U\, \nabla_j \tilde\ell_s(\widehat u_s)\Big)}{\displaystyle\sum_{1\le k\le d}\ \sum_{\mu\in\{+,-\}} \exp\Big(-\eta_{t+1}\sum_{s=1}^t \mu U\, \nabla_k \tilde\ell_s(\widehat u_s)\Big)} \;;$$

6. Update the threshold $B_{t+1} \triangleq \Big(2^{\big\lceil\log_2\big(\max_{1\le s\le t}|y_s|^\alpha\big)\big\rceil}\Big)^{1/\alpha}$.

^a For all $\gamma \in \{+,-\}$, by a slight abuse of notation, $\gamma U$ denotes $U$ or $-U$ if $\gamma = +$ or $\gamma = -$, respectively.

Figure 4: The Lipschitzifying Exponentiated Gradient (LEG) algorithm.
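To make the steps of Figure 4 concrete, here is a minimal Python sketch of one round, specialized to the square loss ($\alpha = 2$). It is an illustrative simplification rather than the paper's implementation: the adaptive tuning (8) of $\eta_{t+1}$ is not reproduced here and is replaced by a fixed $\eta$, for which the cumulative-gradient weights of step 5 reduce to a multiplicative update.

```python
import math

def clip(z, B):
    """The clipping operator [z]_B."""
    return min(B, max(-B, z))

def grad_lip_sq(u_hat, x, y, B):
    """Gradient of the Lipschitzified square loss at u_hat.

    By the piecewise definition above, it equals -2 (y - [u.x]_B) x
    when |y| <= B, and vanishes when |y| > B.
    """
    if abs(y) > B:
        return [0.0] * len(x)
    z = sum(u * xi for u, xi in zip(u_hat, x))
    return [-2.0 * (y - clip(z, B)) * xi for xi in x]

def leg_round(p, U, x, y, B, eta):
    """One LEG-style round; p lists (p_1^+, p_1^-, ..., p_d^+, p_d^-)."""
    d = len(p) // 2
    # Step 1: u_hat = U * sum_j (p_j^+ - p_j^-) e_j, an element of B_1(U).
    u_hat = [U * (p[2 * j] - p[2 * j + 1]) for j in range(d)]
    # Step 2: clipped prediction [u_hat . x]_B.
    y_hat = clip(sum(u * xi for u, xi in zip(u_hat, x)), B)
    # Steps 3-5: exponentiated-gradient update on the Lipschitzified loss.
    g = grad_lip_sq(u_hat, x, y, B)
    w = []
    for j in range(d):
        w.append(p[2 * j] * math.exp(-eta * U * g[j]))      # sign gamma = +
        w.append(p[2 * j + 1] * math.exp(eta * U * g[j]))   # sign gamma = -
    s = sum(w)
    return y_hat, [wi / s for wi in w]
```

Step 6 (the threshold update) would then recompute $B$ from the observed $|y_s|$ before the next round.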

$b_\alpha \triangleq \alpha\big(1 + 2^{1/\alpha}\big)^{\alpha-1}$, the constants $a'_\alpha, a''_\alpha, a'''_\alpha > 0$ are defined by

$$\begin{cases} a'_\alpha \triangleq a_\alpha\sqrt{b_\alpha\big(4 + 6/\ln 2\big)} + 2\big(1 + 2^{-1/\alpha}\big)^{\alpha/2}\big/\sqrt{\ln 2} + 8\, b_\alpha \\[4pt] a''_\alpha \triangleq a_\alpha\sqrt{b_\alpha\big(4 + 6/\ln 2\big)} + a_\alpha \\[4pt] a'''_\alpha \triangleq 4\big(1 + 2^{-1/\alpha}\big)^\alpha \,. \end{cases}$$

Corollary 3 (Application to the square loss). Consider the online linear regression setting under the square loss (i.e., $\alpha = 2$). Let $U > 0$. Then, the LEG algorithm defined in Figure 4 and tuned with $U$ satisfies, for all $T \ge 1$ and all individual sequences $(x_1, y_1), \dots, (x_T, y_T) \in \mathbb{R}^d \times \mathbb{R}$,

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 \le \inf_{\|u\|_1 \le U} \sum_{t=1}^T \tilde\ell_t(u) + 8UX\sqrt{\bigg(\inf_{\|u\|_1 \le U}\sum_{t=1}^T \tilde\ell_t(u)\bigg)\ln(2d)} + \big(134\ln(2d) + 58\big)\big(UXY + U^2X^2\big) + 12\,Y^2 \,,$$

where the Lipschitzified loss functions $\tilde\ell_t$ are defined above and where the quantities $X \triangleq \max_{1\le t\le T}\|x_t\|_\infty$ and $Y \triangleq \max_{1\le t\le T}|y_t|$ are unknown to the forecaster.

Note that, in the case of the square loss, the first two terms of the bound of Corollary 3 slightly improve on those obtained without Lipschitzification (cf. Corollary 2), since we always have

$$\inf_{\|u\|_1 \le U}\sum_{t=1}^T \tilde\ell_t(u) \le \inf_{\|u\|_1 \le U}\sum_{t=1}^T (y_t - u\cdot x_t)^2 \,,$$

where we used the key property $\tilde\ell_t(u) \le (y_t - u\cdot x_t)^2$, which holds for all $u \in \mathbb{R}^d$ and all $t = 1, \dots, T$ (by (13) if $|y_t| \le B_t$, and obviously otherwise). In particular, the LEG algorithm is adaptive in $X$, $Y$, and $T$; it achieves approximately, and efficiently, the regret bound of Theorem 1 in the regime $\kappa \le 1$, i.e., $d \ge \sqrt{T}\, UX/(2Y)$.

In the case of $\alpha$-losses with a higher curvature than that of the square loss ($\alpha > 2$), the improvement is more substantial, as indicated after the following corollary.

Corollary 4 (Application to $\alpha$-losses with $\alpha > 2$). Assume that the predictions are scored with the $\alpha$-loss $x \mapsto |y_t - x|^\alpha$, where $\alpha > 2$. Then, the regret of the LEG algorithm on $B_1(U)$ is at most of the order of

$$UXY^{\alpha-1}\sqrt{T\ln(2d)} + \big(UXY^{\alpha-1} + U^2X^2Y^{\alpha-2}\big)\ln(2d) + Y^\alpha \,,$$

where $X \triangleq \max_{1\le t\le T}\|x_t\|_\infty$ and $Y \triangleq \max_{1\le t\le T}|y_t|$ are unknown to the forecaster. The above regret bound improves on the bound we would have obtained via a similar analysis for the adaptive EG± algorithm applied to the original losses $\ell_t(u) = |y_t - u\cdot x_t|^\alpha$ (without Lipschitzification), namely, a bound of the order of

$$UX(Y + UX)^{\alpha/2-1}\, Y^{\alpha/2}\sqrt{T\ln(2d)} + \big(UX(Y + UX)^{\alpha-1} + U^2X^2(Y + UX)^{\alpha-2}\big)\ln(2d) \,.$$

The main difference between the two regret bounds above lies in the dependence on $U$: our main regret term scales as $UXY^{\alpha-1}$, while the one obtained without Lipschitzification scales as $UX(Y + UX)^{\alpha/2-1}Y^{\alpha/2}$. The first term grows linearly in $U$ while the second one grows as $U^{\alpha/2}$, hence a clear improvement for $\alpha > 2$. The last property stems from the fact that, thanks to Lipschitzification, the gradients $\nabla\tilde\ell_t$ remain bounded as $U \to +\infty$ (cf. (A.29) in Appendix A.2).

Remark 1 (Another benefit of Lipschitzification). Another benefit of Lipschitzification is that all online convex optimization regret bounds expressed in terms of the maximal dual norm of the gradients (i.e., $\max_{1\le t\le T}\big\|\nabla\tilde\ell_t\big\|_\infty$ in our case) can be used fruitfully with the Lipschitzified loss functions $\tilde\ell_t$. For instance, in the case of the square loss, using the very last bound of Proposition 1, we get that

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 - \inf_{\|u\|_1 \le U}\sum_{t=1}^T (y_t - u\cdot x_t)^2 \le c_1\, UXY\Big(\sqrt{T\ln(2d)} + 8\ln(2d)\Big) + c_2\, Y^2 \,,$$

where $c_1 \triangleq 8\big(\sqrt 2 + 1\big)$ and $c_2 \triangleq 4\big(1 + 1/\sqrt 2\big)^2$. The bound is no longer an improvement for small losses (as compared to Corollary 2), but it does not require solving any quadratic inequality. The corresponding simple proof is postponed to the end of Appendix A.2.

4. Adaptation to unknown U

In the previous section, the forecaster is given a radius $U > 0$ and asked to ensure a low worst-case regret on the $\ell^1$-ball $B_1(U)$. In this section, $U$ is no longer given: the forecaster is asked to be competitive against all balls $B_1(U)$, for $U > 0$. Namely, its worst-case regret on each $B_1(U)$ should be almost as good as if $U$ were known beforehand. For simplicity, we assume that $X$, $Y$, and $T$ are known; we explain in Section 5 how to adapt simultaneously to all parameters. Note that from now on, we consider again the main framework of this paper, i.e., online linear regression under the square loss (cf. Section 1.1).

We define

$$R \triangleq \big\lceil \log_2(2T/c)\big\rceil_+ \qquad\text{and}\qquad U_r \triangleq \frac{Y}{X}\,\frac{2^r}{\sqrt{T\ln(2d)}} \,, \qquad \text{for } r = 0, \dots, R \,, \tag{14}$$
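For concreteness, the grid (14) can be computed as follows. This is an illustrative sketch (the function names are ours); `ceil_plus` implements $\lceil\cdot\rceil_+$, and `c` stands for whatever absolute constant the analysis fixes.

```python
import math

def ceil_plus(x):
    """ceil_+ : the smallest nonnegative integer k with k >= x."""
    return max(0, math.ceil(x))

def radius_grid(X, Y, T, d, c):
    """Exponential grid (14) of candidate radii U_0, ..., U_R."""
    R = ceil_plus(math.log2(2 * T / c))
    return [(Y / X) * 2 ** r / math.sqrt(T * math.log(2 * d))
            for r in range(R + 1)]
```

Consecutive radii differ by a factor of $2$, so only about $\log_2 T$ sub-algorithms need to be run in parallel.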


Parameters: $X, Y, \eta > 0$, $T \ge 1$, and $c > 0$ (a constant).

Initialization: $R = \lceil\log_2(2T/c)\rceil_+$, $w_1 = \big(\tfrac{1}{R+1}, \dots, \tfrac{1}{R+1}\big) \in \mathbb{R}^{R+1}$.

For time steps $t = 1, \dots, T$:

1. For experts $r = 0, \dots, R$: run the sub-algorithm $A(U_r)$ on the ball $B_1(U_r)$ and obtain the prediction $\widehat y_t^{(r)}$.

2. Output the prediction $\widehat y_t = \bigg[\displaystyle\sum_{r=0}^R \frac{w_t^{(r)}}{\sum_{r'=0}^R w_t^{(r')}}\, \widehat y_t^{(r)}\bigg]_Y$.

3. Update $w_{t+1}^{(r)} = w_t^{(r)} \exp\Big(-\eta\,\big(y_t - \big[\widehat y_t^{(r)}\big]_Y\big)^2\Big)$ for $r = 0, \dots, R$.

Figure 5: The Scaling algorithm.

where $c > 0$ is a known absolute constant and $\lceil x\rceil_+ \triangleq \min\{k \in \mathbb{N} : k \ge x\}$ for all $x \in \mathbb{R}$.

The Scaling algorithm of Figure 5 works as follows. We have access to a sub-algorithm A(U) which we run simultaneously for all U = U r , r = 0, . . . , R. Each instance of the sub-algorithm A(U r ) performs online linear regression on the ` 1 -ball B 1 (U r ). We employ an exponentially weighted forecaster to aggregate these R + 1 sub-algorithms to perform online linear regression simultaneously on the balls B 1 (U 0 ), . . . , B 1 (U R ).
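The aggregation step of Figure 5 can be sketched as follows. This is an illustrative sketch, not the paper's code: the sub-algorithms' predictions are taken as given, the weighted average is clipped to $[-Y, Y]$, and each weight is multiplied by the exponential of minus $\eta$ times the square loss of the clipped expert prediction, as in steps 2 and 3.

```python
import math

def scaling_round(w, expert_preds, y, Y, eta):
    """One aggregation round of a Scaling-style forecaster.

    w holds the current weights of the R+1 sub-algorithms and
    expert_preds their predictions for this round.
    """
    clip = lambda z: min(Y, max(-Y, z))
    s = sum(w)
    # Step 2: clipped weighted average of the experts' predictions.
    y_hat = clip(sum(wi * p for wi, p in zip(w, expert_preds)) / s)
    # Step 3: exponential-weights update with the clipped square losses.
    new_w = [wi * math.exp(-eta * (y - clip(p)) ** 2)
             for wi, p in zip(w, expert_preds)]
    return y_hat, new_w
```

Theorem 4 below tunes $\eta = 1/(8Y^2)$, the value under which the square loss on $[-Y, Y]$ is exp-concave.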

The following regret bound follows by exp-concavity of the square loss.

Theorem 4. Suppose that $X, Y > 0$ are known. Let $c, c' > 0$ be two absolute constants. Suppose that, for all $U > 0$, we have access to a sub-algorithm $A(U)$ with regret against $B_1(U)$ of at most

$$cUXY\sqrt{T\ln(2d)} + c'Y^2 \qquad\text{for } T \ge T_0 \,, \tag{15}$$

uniformly over all sequences $(x_t)$ and $(y_t)$ bounded by $X$ and $Y$. Then, for a known $T \ge T_0$, the Scaling algorithm with $\eta = 1/(8Y^2)$ satisfies

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 \le \inf_{u\in\mathbb{R}^d}\bigg\{\sum_{t=1}^T (y_t - u\cdot x_t)^2 + 2c\,\|u\|_1\, XY\sqrt{T\ln(2d)}\bigg\} + 8Y^2\ln\big(\lceil\log_2(2T/c)\rceil_+ + 1\big) + (c + c')Y^2 \,. \tag{16}$$

In particular, for every $U > 0$,

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 \le \inf_{u\in B_1(U)}\bigg\{\sum_{t=1}^T (y_t - u\cdot x_t)^2\bigg\} + 2cUXY\sqrt{T\ln(2d)} + 8Y^2\ln\big(\lceil\log_2(2T/c)\rceil_+ + 1\big) + (c + c')Y^2 \,.$$

Remark 2. By Remark 1, the LEG algorithm satisfies assumption (15) with $T_0 = \ln(2d)$, $c \triangleq 9c_1 = 72\big(\sqrt 2 + 1\big)$, and $c' \triangleq c_2 = 4\big(1 + 1/\sqrt 2\big)^2$.

Proof: Since the Scaling algorithm is an exponentially weighted average forecaster (with clipping) applied to the $R+1$ experts $A(U_r) = \big(\widehat y_t^{(r)}\big)_{t \ge 1}$, $r = 0, \dots, R$, we have, by Lemma 6 in Appendix B,

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 \le \min_{r=0,\dots,R} \sum_{t=1}^T \Big(y_t - \big[\widehat y_t^{(r)}\big]_Y\Big)^2 + 8Y^2\ln(R+1) \le \min_{r=0,\dots,R} \bigg\{\inf_{u\in B_1(U_r)}\bigg\{\sum_{t=1}^T (y_t - u\cdot x_t)^2\bigg\} + cU_rXY\sqrt{T\ln(2d)}\bigg\} + z \,, \tag{17}$$

where the last inequality follows by assumption (15), and where we set $z \triangleq 8Y^2\ln(R+1) + c'Y^2$. Let $u_T \in \arg\min_{u\in\mathbb{R}^d}\big\{\sum_{t=1}^T (y_t - u\cdot x_t)^2 + 2c\,\|u\|_1\, XY\sqrt{T\ln(2d)}\big\}$. Next, we proceed by considering three cases: $U_0 < \|u_T\|_1 < U_R$, $\|u_T\|_1 \le U_0$, and $\|u_T\|_1 \ge U_R$.

Case 1: $U_0 < \|u_T\|_1 < U_R$. Let $r^* \triangleq \min\big\{r = 0, \dots, R : U_r \ge \|u_T\|_1\big\}$. Note that $r^* \ge 1$ since $\|u_T\|_1 > U_0$. By (17) we have

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 \le \inf_{u\in B_1(U_{r^*})}\bigg\{\sum_{t=1}^T (y_t - u\cdot x_t)^2\bigg\} + cU_{r^*}XY\sqrt{T\ln(2d)} + z \le \sum_{t=1}^T (y_t - u_T\cdot x_t)^2 + 2c\,\|u_T\|_1\, XY\sqrt{T\ln(2d)} + z \,,$$

where the last inequality follows from $u_T \in B_1(U_{r^*})$ and from the fact that $U_{r^*} \le 2\|u_T\|_1$ (since, by definition of $r^*$, $\|u_T\|_1 > U_{r^*-1} = U_{r^*}/2$). Finally, we obtain (16) by definition of $u_T$ and $z \triangleq 8Y^2\ln(R+1) + c'Y^2$.

Case 2: $\|u_T\|_1 \le U_0$. By (17) we have

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 \le \bigg\{\sum_{t=1}^T (y_t - u_T\cdot x_t)^2 + cU_0XY\sqrt{T\ln(2d)}\bigg\} + z \,, \tag{18}$$

which yields (16) by the equality $cU_0XY\sqrt{T\ln(2d)} = cY^2$ (by definition of $U_0$), by adding the nonnegative quantity $2c\,\|u_T\|_1\, XY\sqrt{T\ln(2d)}$, and by definition of $u_T$ and $z$.

Case 3: $\|u_T\|_1 \ge U_R$. By construction, we have $\widehat y_t \in [-Y, Y]$, and by assumption, we have $y_t \in [-Y, Y]$, so that

$$\sum_{t=1}^T (y_t - \widehat y_t)^2 \le 4Y^2T \le \sum_{t=1}^T (y_t - u_T\cdot x_t)^2 + 2cU_RXY\sqrt{T\ln(2d)} \le \sum_{t=1}^T (y_t - u_T\cdot x_t)^2 + 2c\,\|u_T\|_1\, XY\sqrt{T\ln(2d)} \,,$$

where the second inequality follows by $2cU_RXY\sqrt{T\ln(2d)} = 2cY^2\, 2^R \ge 4Y^2T$ (since $2^R \ge 2T/c$ by definition of $R$), and the last inequality uses the assumption $\|u_T\|_1 \ge U_R$. We finally get (16) by definition of $u_T$.

This concludes the proof of the first claim (16). The second claim follows by bounding $\|u\|_1 \le U$.


5. Extension to a fully adaptive algorithm

The Scaling algorithm of Section 4 uses prior knowledge of $Y$, $Y/X$, and $T$. In order to obtain a fully automatic algorithm, we need to adapt efficiently to these quantities. Adaptation to $Y$ is possible via a technique already used for the LEG algorithm, i.e., by updating the clipping range $B_t$ based on the past observations $|y_s|$, $s \le t-1$.

In parallel to adapting to $Y$, adaptation to $Y/X$ can be carried out as follows. We replace the exponential sequence $\{U_0, \dots, U_R\}$ by another exponential sequence $\{U'_0, \dots, U'_{R'}\}$:

$$U'_r \triangleq \frac{1}{T^k}\,\frac{2^r}{\sqrt{T\ln(2d)}} \,, \qquad r = 0, \dots, R' \,, \tag{19}$$

where $R' \triangleq R + \log_2 T^{2k} = \lceil\log_2(2T/c)\rceil_+ + \log_2 T^{2k}$, and where $k \ge 1$ is a fixed constant. On the one hand, for $T \ge T_0 \triangleq \max\big\{(X/Y)^{1/k}, (Y/X)^{1/k}\big\}$, we have (cf. (14) and (19)) $[U_0, U_R] \subset [U'_0, U'_{R'}]$. Therefore, the analysis of Theorem 4 applied to the grid $\{U'_0, \dots, U'_{R'}\}$ yields¹⁰ a regret bound of the order of $UXY\sqrt{T\ln d} + Y^2\ln(R'+1)$. On the other hand, clipping the predictions to $[-Y, Y]$ ensures the crude regret bound $4Y^2T_0$ for small $T < T_0$. Hence, the overall regret for all $T \ge 1$ is of the order of

$$UXY\sqrt{T\ln d} + Y^2\ln(k\ln T) + Y^2\max\big\{(X/Y)^{1/k}, (Y/X)^{1/k}\big\} \,.$$

Adaptation to an unknown time horizon $T$ can be carried out via a standard doubling trick on $T$. However, to avoid restarting the algorithm repeatedly, we can use a time-varying exponential sequence $\{U'_{-R'(t)}(t), \dots, U'_{R'(t)}(t)\}$, where $R'(t)$ grows at the rate of $k\ln(t)$. This gives¹¹ us an algorithm that is fully automatic in the parameters $U$, $X$, $Y$, and $T$. In this case, we can show that the regret is of the order of

$$UXY\sqrt{T\ln d} + Y^2 k\ln(T) + Y^2\max\Big\{\big(\sqrt T X/Y\big)^{1/k}, \big(Y/(\sqrt T X)\big)^{1/k}\Big\} \,,$$

where the last two terms are negligible when $T \to +\infty$ (since $k \ge 1$).

Acknowledgments

The authors would like to thank Gilles Stoltz for his valuable comments and suggestions, as well as two anonymous reviewers for their insightful feedback. This work was supported in part by the French National Research Agency (ANR, project EXPLO-RA, ANR-08-COSI-004) and the PASCAL2 Network of Excellence under EC grant no. 216886. J. Y. Yu was partly supported by a fellowship from Le Fonds québécois de la recherche sur la nature et les technologies.

An extended abstract of the present paper appeared in the Proceedings of the 22nd International Con- ference on Algorithmic Learning Theory (ALT’11).

Appendix A. Proofs

Appendix A.1. Proof of Theorem 2

To prove Theorem 2, we perform a reduction to the stochastic batch setting (via the standard online-to-batch trick) and employ a version of the lower bound proved in [2] for convex aggregation.

¹⁰ The proof remains the same, with $8Y^2\ln(R+1)$ replaced by $8Y^2\ln(R'+1)$.

¹¹ Each time the exponential sequence $(U'_r)$ expands, the weights assigned to the existing points $U'_r$ are appropriately reassigned to the whole new sequence.
